Vectra Compute // Spec-019

1.0 Thesis

The physics of cloud-based generative NLP is structurally flawed at scale. Transmitting unstructured data via TLS over a WAN imposes an inescapable latency floor (80-150 ms round-trip at minimum). Simultaneously, the token costs billed by a centralized API provider scale linearly with DAU (Daily Active Users).
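The linear cost claim can be made concrete with a small model. This is a minimal sketch: the interface, the function name, and every number in it are hypothetical and chosen only for illustration; none of them come from this spec or from any provider's price list.

```typescript
// Hypothetical usage profile; all figures are illustrative assumptions.
interface UsageProfile {
  dailyActiveUsers: number;
  tokensPerUserPerDay: number;
  usdPerMillionTokens: number;
}

// With per-user usage held fixed, daily API spend is directly proportional to DAU.
function dailyApiCostUsd(p: UsageProfile): number {
  const totalTokens = p.dailyActiveUsers * p.tokensPerUserPerDay;
  return (totalTokens / 1_000_000) * p.usdPerMillionTokens;
}

const base: UsageProfile = {
  dailyActiveUsers: 10_000,
  tokensPerUserPerDay: 5_000,
  usdPerMillionTokens: 2,
};

const costAt10k = dailyApiCostUsd(base);
const costAt100k = dailyApiCostUsd({ ...base, dailyActiveUsers: 100_000 });
console.log(costAt10k, costAt100k); // 10x the users gives 10x the spend
```

Under these assumed inputs the only term that changes is DAU, so spend grows by exactly the same factor as the user base.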

Vectra Compute engineers a hard fork from this architecture. We build bespoke client-side compilers that map transformer-based models directly onto native browser APIs (WebGPU). By applying Activation-aware Weight Quantization (AWQ) at INT4, we confine inference entirely to local silicon.
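To show what INT4 storage means in practice, the sketch below packs eight 4-bit codes into one 32-bit word and dequantizes them with a scale and zero point. It illustrates only the storage format; it is not AWQ's activation-aware scale search, and the function names, the low-nibble-first ordering, and the zero-point convention are assumptions rather than Vectra's actual layout.

```typescript
// Pack eight 4-bit codes (each 0..15) into one u32, lowest nibble first.
// The nibble order is an assumption for this sketch.
function packInt4(codes: number[]): number {
  if (codes.length !== 8) throw new Error("expected exactly 8 codes");
  let word = 0;
  for (let i = 0; i < 8; i++) {
    word |= (codes[i] & 0xf) << (4 * i);
  }
  return word >>> 0; // reinterpret as unsigned 32-bit
}

// Recover approximate weights: w ≈ (code - zeroPoint) * scale.
function dequantInt4(word: number, scale: number, zeroPoint: number): number[] {
  const out: number[] = [];
  for (let i = 0; i < 8; i++) {
    const code = (word >>> (4 * i)) & 0xf;
    out.push((code - zeroPoint) * scale);
  }
  return out;
}

const packed = packInt4([0, 1, 2, 3, 4, 5, 6, 15]);
const weights = dequantInt4(packed, 0.5, 8);
console.log(packed.toString(16), weights);
```

Eight weights occupy 4 bytes plus a shared scale, which is where the roughly 4x saving over FP16 comes from.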

2.0 Hardware Execution Topology

[VRAM / SHARED MEMORY]                  [COMPUTE SHADER PIPELINE]
┌─────────────────────────┐             ┌──────────────────────────────────┐
│ OPFS Persistent Storage │             │ WGSL Workgroup (Size: 16x16)     │
│ ├─ Weights (INT4)       ├──── I/O ───►│ ├─ Tile Caching (var<workgroup>) │
│ └─ KV Cache (Paged)     │   (0.0ms)   │ └─ Subgroup MatMul (SIMT)        │
└─────────────────────────┘             └──────────────────────────────────┘
                                                         │
[HOST APPLICATION]                                       ▼
┌─────────────────────────┐             ┌──────────────────────────────────┐
│ DOM / Extension Context │◄── IPC ─────┤ GPUBuffer.mapAsync() readback    │
└─────────────────────────┘   (2-4ms)   └──────────────────────────────────┘
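The "KV Cache (Paged)" entry in the topology can be illustrated with a small page table. This is a minimal sketch of the general paging idea, not Vectra's allocator: the class name, the page size of 16 tokens, and the free-list policy are all assumptions.

```typescript
// Hypothetical page size; the spec does not state one.
const TOKENS_PER_PAGE = 16;

class PagedKvCache {
  private freePages: number[];
  private pageTable: number[] = []; // logical page index -> physical page id
  private tokenCount = 0;           // tokens stored so far

  constructor(totalPages: number) {
    // Stack of free physical pages; page 0 is handed out first.
    this.freePages = Array.from({ length: totalPages }, (_, i) => totalPages - 1 - i);
  }

  // Reserve a slot for the next token; a new page is taken only on a page boundary.
  append(): { page: number; offset: number } {
    if (this.tokenCount % TOKENS_PER_PAGE === 0) {
      const page = this.freePages.pop();
      if (page === undefined) throw new Error("KV cache out of pages");
      this.pageTable.push(page);
    }
    const slot = this.locate(this.tokenCount);
    this.tokenCount++;
    return slot;
  }

  // Translate a logical token position into a physical (page, offset) pair.
  locate(tokenIndex: number): { page: number; offset: number } {
    const logicalPage = Math.floor(tokenIndex / TOKENS_PER_PAGE);
    if (logicalPage >= this.pageTable.length) throw new Error("token not allocated");
    return { page: this.pageTable[logicalPage], offset: tokenIndex % TOKENS_PER_PAGE };
  }

  // Return every page to the free list, e.g. when a conversation ends.
  reset(): void {
    while (this.pageTable.length > 0) this.freePages.push(this.pageTable.pop()!);
    this.tokenCount = 0;
  }
}
```

Because pages are claimed one at a time, memory grows with the actual sequence length instead of being reserved up front for the maximum context.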

3.0 Primitive Memory Alignment (WGSL)

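// FlashAttention-style tiling, sized to fit WebGPU default limits
// Block size optimized for 32-thread subgroups (SM 6.0+)
const BLOCK_SIZE_M: u32 = 64u;
const BLOCK_SIZE_N: u32 = 64u;
// Assumption: the K-tile depth is not stated elsewhere in this spec. 16 keeps
// tile_a at 64 * 16 * 4 B = 4 KiB, inside the 16 KiB default workgroup storage limit.
const BLOCK_SIZE_K: u32 = 16u;

// WGSL permits a runtime-sized array only as the last member of a struct,
// so the packed weights and their scales use separate bindings.
@group(0) @binding(0) var<storage, read> w_q: array<vec4<u32>>; // 16-byte aligned, 32 INT4 values per element
@group(0) @binding(1) var<storage, read> scales: array<f32>;
var<workgroup> tile_a: array<array<f32, BLOCK_SIZE_K>, BLOCK_SIZE_M>;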

@compute @workgroup_size(16, 16)
fn main(@builtin(global_invocation_id) global_id: vec3<u32>) {
    // Pre-fetch into highly restricted shared memory
    workgroupBarrier();
    // Compute payload executing locally...
}
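Whether a given tile shape fits in workgroup memory is simple arithmetic that can be checked on the host before the shader is compiled. The sketch below uses the default limits from the WebGPU specification (16384 bytes of workgroup storage, 256 invocations per workgroup); a real device may report higher values through GPUAdapter.limits. The function names are hypothetical, and the tile depth BLOCK_SIZE_K is treated as a free parameter here.

```typescript
// Default limits from the WebGPU specification; actual devices may expose more.
const DEFAULT_MAX_WORKGROUP_STORAGE_BYTES = 16384;
const DEFAULT_MAX_INVOCATIONS_PER_WORKGROUP = 256;

// Bytes needed for a rows x cols tile of f32 (4 bytes per element by default).
function tileBytes(rows: number, cols: number, bytesPerElement = 4): number {
  return rows * cols * bytesPerElement;
}

// True when both the shared-memory tile and the workgroup size fit default limits.
function fitsDefaultLimits(rows: number, cols: number, wgX: number, wgY: number): boolean {
  return (
    tileBytes(rows, cols) <= DEFAULT_MAX_WORKGROUP_STORAGE_BYTES &&
    wgX * wgY <= DEFAULT_MAX_INVOCATIONS_PER_WORKGROUP
  );
}

// BLOCK_SIZE_M = 64 with a few candidate K depths, 16x16 workgroup.
for (const k of [16, 64, 128]) {
  console.log(`K=${k}: ${tileBytes(64, k)} bytes, fits=${fitsDefaultLimits(64, k, 16, 16)}`);
}
```

With 64 rows, a depth of 64 fills the default budget exactly and leaves no room for a second tile, so smaller depths are the practical choice unless higher limits are requested.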

4.0 Benchmarks & Telemetry

Metric                               Value
-----------------------------------  ----------------------
Overhead (Cost per 1M Tokens)        $0.00 (Local Compute)
Time-To-First-Token (Hot OPFS)       24.60 ms
Tokens/Sec (Integrated Intel/AMD)    18 - 25 t/s
Tokens/Sec (Apple M-Series)          40 - 65 t/s
VRAM Footprint (7B Model, INT4)      ~3.9 GB
Outbound Network Dependency          Strictly 0 (Airgapped)
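The VRAM figure can be sanity-checked with back-of-envelope arithmetic. This is a rough sketch under stated assumptions: a quantization group size of 128 and 2-byte scales are common choices for INT4 schemes but are not given in this spec, and the function name is hypothetical.

```typescript
// Estimate storage for quantized weights plus their per-group scales.
function quantizedWeightBytes(
  params: number,
  bitsPerWeight: number,
  groupSize: number,     // assumption: weights per shared scale
  bytesPerScale: number, // assumption: 2 bytes (FP16) per scale
): number {
  const packedBytes = (params * bitsPerWeight) / 8;
  const scaleBytes = (params / groupSize) * bytesPerScale;
  return packedBytes + scaleBytes;
}

const weightBytes = quantizedWeightBytes(7e9, 4, 128, 2);
console.log((weightBytes / 1e9).toFixed(2) + " GB"); // weights and scales only
```

Under these assumptions the weights and scales come to about 3.61 GB. The gap to the reported ~3.9 GB would then be taken up by the KV cache, activations, and working buffers, though this spec does not break the total down.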

5.0 Deployment Constraints

Vectra Compute operates strictly as an infrastructure vendor. We do not provide public wrappers or SDKs. Local LLM execution requires a bespoke pipeline mapped directly to the client's internal DOM structure and background service workers.

Hardware limitation: Execution is locked to Chromium-based environments (v113+) that expose native navigator.gpu. Legacy fallbacks use multithreaded WASM SIMD, with an expected 40% latency penalty.
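The fallback decision described above can be sketched as a small capability check. This is a hedged illustration, not Vectra's loader: the type and function names are hypothetical, and the check only tests for the navigator.gpu entry point. A production check would also await navigator.gpu.requestAdapter() and treat a null adapter as unsupported.

```typescript
type Backend = "webgpu" | "wasm-simd";

// Minimal shape of the object being probed, so the check is easy to exercise
// outside a browser. In a page, pass the global `navigator`.
interface NavigatorLike {
  gpu?: unknown;
}

// Prefer WebGPU when the entry point exists; otherwise use the WASM SIMD path.
function chooseBackend(nav: NavigatorLike): Backend {
  return nav.gpu !== undefined && nav.gpu !== null ? "webgpu" : "wasm-simd";
}

console.log(chooseBackend({ gpu: {} })); // Chromium 113+ style environment
console.log(chooseBackend({}));          // legacy environment without WebGPU
```

Keeping the probe separate from pipeline setup lets the host decide early which compiled artifact to fetch.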