1.0 Thesis
Cloud-based generative NLP is structurally ill-suited to scale. Sending unstructured data over TLS across a WAN imposes an unavoidable latency floor (80-150 ms minimum round-trip). At the same time, token costs billed by a centralized API provider grow linearly with DAU (Daily Active Users).
Vectra Compute breaks from this architecture. We build bespoke client-side compilers that map transformer-based models directly onto native browser APIs (WebGPU). Using Activation-aware Weight Quantization (AWQ) at INT4 precision, we shrink the model weights enough that inference runs entirely on the user's local silicon.
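To make the INT4 storage format concrete, the following is a minimal sketch of symmetric group-wise 4-bit quantization with eight values packed per 32-bit word. It is not AWQ itself: the activation-aware scale search that gives AWQ its name is omitted, and GROUP_SIZE, QuantizedTensor, quantizeInt4, and dequantizeInt4 are hypothetical names chosen for illustration.

```typescript
// Sketch only: plain symmetric group-wise INT4 quantization.
// AWQ's activation-aware scale selection is NOT implemented here.
const GROUP_SIZE = 8; // hypothetical; deliberately small for illustration

interface QuantizedTensor {
  packed: Uint32Array;  // eight 4-bit values per element
  scales: Float32Array; // one scale per group of GROUP_SIZE weights
  length: number;       // number of original weights
}

function quantizeInt4(weights: Float32Array): QuantizedTensor {
  const groups = Math.ceil(weights.length / GROUP_SIZE);
  const scales = new Float32Array(groups);
  const packed = new Uint32Array(Math.ceil(weights.length / 8));
  for (let g = 0; g < groups; g++) {
    const start = g * GROUP_SIZE;
    const end = Math.min(start + GROUP_SIZE, weights.length);
    let maxAbs = 0;
    for (let i = start; i < end; i++) {
      maxAbs = Math.max(maxAbs, Math.abs(weights[i]));
    }
    // Map the group's range onto the integers [-7, 7].
    const scale = Math.fround(maxAbs > 0 ? maxAbs / 7 : 1);
    scales[g] = scale;
    for (let i = start; i < end; i++) {
      const q = Math.max(-7, Math.min(7, Math.round(weights[i] / scale)));
      const nibble = (q + 8) & 0xf; // offset-binary, stored as 1..15
      packed[i >>> 3] |= nibble << ((i & 7) * 4);
    }
  }
  return { packed, scales, length: weights.length };
}

function dequantizeInt4(t: QuantizedTensor): Float32Array {
  const out = new Float32Array(t.length);
  for (let i = 0; i < t.length; i++) {
    const nibble = (t.packed[i >>> 3] >>> ((i & 7) * 4)) & 0xf;
    out[i] = (nibble - 8) * t.scales[Math.floor(i / GROUP_SIZE)];
  }
  return out;
}
```

Each reconstructed weight differs from the original by at most half of its group's scale. A production pipeline would likely use larger groups and store the scales at reduced precision.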
2.0 Hardware Execution Topology
[VRAM / SHARED MEMORY]                  [COMPUTE SHADER PIPELINE]
┌─────────────────────────┐             ┌──────────────────────────────────┐
│ OPFS Persistent Storage │             │ WGSL Workgroup (Size: 16x16)     │
│ ├─ Weights (INT4)       ├──── I/O ───►│ ├─ Tile Caching (var<workgroup>) │
│ └─ KV Cache (Paged)     │   (0.0ms)   │ └─ Subgroup MatMul (SIMT)        │
└─────────────────────────┘             └──────────────────────────────────┘
                                                         │
[HOST APPLICATION]                                       ▼
┌─────────────────────────┐             ┌──────────────────────────────────┐
│ DOM / Extension Context │◄── IPC ─────┤ GPUBuffer.mapAsync() readback    │
└─────────────────────────┘   (2-4ms)   └──────────────────────────────────┘
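The "KV Cache (Paged)" entry in the topology can be illustrated with a small bookkeeping sketch. It models only the page table that maps each sequence's tokens to fixed-size physical pages; the key/value tensors themselves would live in a GPU buffer. The class name, its methods, and the page sizes are hypothetical and not taken from the Vectra pipeline.

```typescript
// Hypothetical paged KV cache bookkeeping: maps (sequence, token index)
// to (physical page, slot within page). No tensor data is stored here.
class PagedKVCache {
  private freePages: number[] = [];
  private tables = new Map<number, number[]>(); // seqId -> physical page ids
  private lengths = new Map<number, number>();  // seqId -> tokens stored
  private pageSize: number;

  constructor(numPages: number, pageSize: number) {
    this.pageSize = pageSize;
    // Push in reverse so that page 0 is handed out first.
    for (let p = numPages - 1; p >= 0; p--) this.freePages.push(p);
  }

  // Reserve the slot for the next token of `seqId`, allocating a new
  // page only when the sequence's current page is full.
  append(seqId: number): { page: number; slot: number } {
    const len = this.lengths.get(seqId) ?? 0;
    const table = this.tables.get(seqId) ?? [];
    if (len % this.pageSize === 0) {
      const page = this.freePages.pop();
      if (page === undefined) throw new Error("KV cache out of pages");
      table.push(page);
      this.tables.set(seqId, table);
    }
    this.lengths.set(seqId, len + 1);
    return { page: table[table.length - 1], slot: len % this.pageSize };
  }

  // Return all pages of a finished sequence to the free list.
  release(seqId: number): void {
    for (const p of this.tables.get(seqId) ?? []) this.freePages.push(p);
    this.tables.delete(seqId);
    this.lengths.delete(seqId);
  }

  get freePageCount(): number {
    return this.freePages.length;
  }
}
```

Because pages are allocated on demand, a short conversation occupies only the pages it has actually filled rather than a buffer sized for the maximum context length.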
3.0 Primitive Memory Alignment (WGSL)
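// FlashAttention-style tiled kernel skeleton, sized for WebGPU default limits
// Block sizes tuned for 32-wide subgroups (SM 6.0+ on D3D12)
const BLOCK_SIZE_M: u32 = 64u;
const BLOCK_SIZE_N: u32 = 64u;
const BLOCK_SIZE_K: u32 = 16u; // assumed K-tile depth
// WGSL permits a runtime-sized array only as the last member of a struct,
// so the packed weights and their scales are bound separately.
@group(0) @binding(0) var<storage, read> w_q: array<vec4<u32>>; // 16-byte elements, 32 INT4 values each
@group(0) @binding(1) var<storage, read> scales: array<f32>;    // dequantization scales
// 64 x 16 f32 = 4 KiB, within the 16 KiB default workgroup storage limit
var<workgroup> tile_a: array<array<f32, BLOCK_SIZE_K>, BLOCK_SIZE_M>;
@compute @workgroup_size(16, 16)
fn main(@builtin(global_invocation_id) global_id: vec3<u32>) {
    // Pre-fetch into highly restricted shared memory
    workgroupBarrier();
    // Compute payload executing locally...
}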
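On the host side, tile sizes like these determine both the dispatch grid and whether the shader fits within WebGPU's default limits. The sketch below assumes a K-tile depth of 16 (the listing leaves BLOCK_SIZE_K open) and uses the default limit values from the WebGPU specification; a real device may report higher values through device.limits. TileConfig and planDispatch are hypothetical names.

```typescript
// Default WebGPU limits; a device may expose larger values.
const DEFAULT_MAX_WORKGROUP_STORAGE_BYTES = 16384;
const DEFAULT_MAX_INVOCATIONS_PER_WORKGROUP = 256;

interface TileConfig {
  blockM: number; // rows of the output tile per workgroup
  blockN: number; // columns of the output tile per workgroup
  blockK: number; // assumed depth of the shared-memory tile
  wgX: number;    // @workgroup_size x
  wgY: number;    // @workgroup_size y
}

// Compute the dispatch grid for an M x N output and check the
// f32 tile against the default workgroup limits.
function planDispatch(m: number, n: number, cfg: TileConfig) {
  const tileBytes = cfg.blockM * cfg.blockK * 4; // f32 = 4 bytes
  const invocations = cfg.wgX * cfg.wgY;
  return {
    workgroupsX: Math.ceil(n / cfg.blockN),
    workgroupsY: Math.ceil(m / cfg.blockM),
    tileBytes,
    fitsDefaultLimits:
      tileBytes <= DEFAULT_MAX_WORKGROUP_STORAGE_BYTES &&
      invocations <= DEFAULT_MAX_INVOCATIONS_PER_WORKGROUP,
  };
}
```

With 64x64 output tiles, a 16x16 workgroup, and a depth of 16, the tile occupies 4 KiB and the workgroup uses exactly 256 invocations, so both default limits are met. The two workgroup counts would then be passed to dispatchWorkgroups on the compute pass encoder.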
4.0 Benchmarks & Telemetry
- Overhead (Cost per 1M Tokens):        $0.00 (Local Compute)
- Time-To-First-Token (Hot OPFS):       24.60 ms
- Tokens/Sec (Integrated Intel/AMD):    18-25 t/s
- Tokens/Sec (Apple M-Series):          40-65 t/s
- VRAM Footprint (7B Model, INT4):      ~3.9 GB
- Outbound Network Dependency:          Strictly 0 (Airgapped)
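The VRAM figure can be sanity-checked with back-of-the-envelope arithmetic. The sketch below estimates only the quantized weights and their per-group scales. The group size of 128 and the 2-byte scale are assumptions, not values stated here, and int4WeightBytes is a hypothetical helper.

```typescript
// Estimate storage for INT4 weights plus one scale per group.
// groupSize and scaleBytes are assumed values, not figures from the benchmarks.
function int4WeightBytes(params: number, groupSize: number, scaleBytes: number): number {
  const weightBytes = params / 2; // 4 bits per weight
  const scaleTotal = Math.ceil(params / groupSize) * scaleBytes;
  return weightBytes + scaleTotal;
}

const estimate = int4WeightBytes(7e9, 128, 2);
console.log((estimate / 1e9).toFixed(2) + " GB"); // prints "3.61 GB"
```

Under these assumptions the weights account for roughly 3.6 GB. The remainder of the ~3.9 GB footprint would presumably cover the KV cache, activations, and any tensors kept at higher precision, though the benchmark does not break this down.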
5.0 Deployment Constraints
Vectra Compute operates strictly as an infrastructure vendor. We do not provide public wrappers or SDKs. Local LLM execution requires a bespoke pipeline mapped directly to the client's internal DOM structure and background service workers.
Platform limitation: execution requires a Chromium-based browser (v113+) that exposes navigator.gpu natively. Where WebGPU is unavailable, a fallback path uses multithreaded WASM SIMD, with an expected latency penalty of about 40% relative to the WebGPU path.
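The choice between the WebGPU path and the WASM fallback comes down to feature detection. A minimal sketch follows, with Backend and pickBackend as hypothetical names. It takes a navigator-like object as a parameter so that the logic can be exercised outside a browser.

```typescript
type Backend = "webgpu" | "wasm-simd";

// Select the execution path from a navigator-like object.
// Only the presence of `gpu` is checked here; see the note below.
function pickBackend(nav: { gpu?: unknown }): Backend {
  return nav.gpu !== undefined ? "webgpu" : "wasm-simd";
}
```

In a page or service worker this would be called with the global navigator object. The presence of navigator.gpu does not guarantee usable hardware: navigator.gpu.requestAdapter() can still resolve to null, so a production check would likely await that call and fall back to WASM when no adapter is returned.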