About vib3
A weight-indexed inference engine that makes frontier MoE models accessible on consumer hardware.
The Problem
Mixture-of-Experts models are the architecture behind the most capable LLMs—GPT-4, Mixtral, DeepSeek, Kimi K2.5. But they have a fundamental deployment problem: even though only a fraction of experts activate per token, the entire parameter set must be resident in memory. A model with 384 experts might activate 8 per token, yet you still need 400+ GB of VRAM to hold it all.
This means frontier models are locked behind multi-GPU clusters and cloud APIs. Consumer hardware with 8–24 GB of VRAM cannot participate.
The Thesis
vib3 starts from a simple observation: if only 2% of weights are “hot” at any given time, why keep the other 98% in the most expensive memory tier? The answer is a three-tier storage hierarchy:
- T1 — GPU VRAM: Hot expert pages, pinned for immediate compute.
- T2 — Host RAM: Warm pages, ready for fast promotion to GPU.
- T3 — NVMe SSD: Cold pages, streamed on demand via io_uring.
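The tiering above can be sketched as a small state machine. This is an illustrative sketch only, with hypothetical type names and a made-up promotion policy (promote after two hits), not vib3's actual buffer-manager API:

```rust
// Illustrative sketch of the three-tier page hierarchy. Type names and
// the promotion policy are hypothetical, not vib3's actual API.

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Tier {
    GpuVram, // T1: hot pages, pinned for immediate compute
    HostRam, // T2: warm pages, ready for fast promotion
    Nvme,    // T3: cold pages, streamed on demand
}

struct WeightPage {
    id: u64,
    tier: Tier,
    hits: u32, // access count driving promotion decisions
}

impl WeightPage {
    /// Move one tier closer to the GPU.
    fn promote(&mut self) {
        self.tier = match self.tier {
            Tier::Nvme => Tier::HostRam,
            Tier::HostRam | Tier::GpuVram => Tier::GpuVram,
        };
    }

    /// Record an access; promote once a (made-up) hit threshold is crossed.
    fn touch(&mut self) {
        self.hits += 1;
        if self.hits >= 2 {
            self.promote();
            self.hits = 0;
        }
    }
}

fn main() {
    let mut page = WeightPage { id: 7, tier: Tier::Nvme, hits: 0 };
    page.touch();
    page.touch(); // second hit crosses the threshold: NVMe -> host RAM
    assert_eq!(page.tier, Tier::HostRam);
    println!("page {} now in {:?}", page.id, page.tier);
}
```

The key property the real engine needs is the same one databases need: promotion and eviction decisions are made per page, never per model.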
This is the same architecture that databases have used for decades: you don't load the entire table into memory, you page in what the query needs. vib3 applies this to inference.
Virtual Expert Assembly
The deeper insight is that expert boundaries are arbitrary. A router selects “Expert 3” as a monolithic unit, but the actual useful computation might span rows from multiple experts. vib3 embeds each weight page into an HNSW vector index, keyed by a learned representation of that page. At inference time, instead of routing to a fixed expert, the engine can retrieve the most relevant weight pages directly from the hidden state.
This is inference as retrieval over indexed weight space. The model becomes a database of indexed weight pages, and each forward pass is a query against that database.
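To make the retrieval step concrete, here is a brute-force stand-in for the HNSW lookup: score every page embedding against the hidden state and keep the top-k. The function names and the cosine metric are assumptions for illustration; the real index answers the same query approximately, without scanning every page:

```rust
// Brute-force stand-in for the HNSW lookup: score every page embedding
// against the hidden state and take the top-k. Names are illustrative.

fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (na * nb)
}

/// Indices of the k weight pages whose embeddings best match the
/// current hidden state.
fn retrieve_pages(hidden: &[f32], pages: &[Vec<f32>], k: usize) -> Vec<usize> {
    let mut scored: Vec<(usize, f32)> = pages
        .iter()
        .enumerate()
        .map(|(i, e)| (i, cosine(hidden, e)))
        .collect();
    // Sort by similarity, best first.
    scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    scored.into_iter().take(k).map(|(i, _)| i).collect()
}

fn main() {
    let pages = vec![
        vec![1.0, 0.0], // page 0: x-aligned
        vec![0.0, 1.0], // page 1: y-aligned
        vec![0.7, 0.7], // page 2: diagonal
    ];
    let hidden = vec![0.9, 0.1];
    let top = retrieve_pages(&hidden, &pages, 2);
    assert_eq!(top, vec![0, 2]); // x-aligned page matches best, diagonal second
    println!("top pages: {:?}", top);
}
```

Note that nothing in this query mentions experts: the unit of retrieval is the page, which is what lets useful computation span rows from multiple experts.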
Implementation
vib3 is written in Rust for performance-critical paths (buffer management, io_uring, memory mapping) with optional CUDA kernels for GPU compute. Key design decisions:
- Zero-copy where possible: on-disk structures use `#[repr(C)]` and `bytemuck::Pod` for direct memory mapping.
- CPU fallback always works: every CUDA code path has a CPU implementation. Build with `--no-default-features` for pure CPU.
- Single binary, single file: one `vib3` binary, one `.vib3` file per model. No Python dependencies, no framework overhead.
- OpenAI-compatible API: drop-in replacement for any OpenAI client via an Axum HTTP server with SSE streaming.
Status
vib3 is under active development. The core engine (buffer manager, compute kernels, KV cache, vector index, conversion pipeline) is implemented with 200+ passing tests. The immediate goals are empirical validation on Mixtral 8x7B and scaling to Kimi K2.5's 384-expert architecture.
The project is open source and contributions are welcome.