Introducing vib3
Mixture-of-Experts models are the dominant architecture for frontier LLMs. The most capable open models, from Mixtral to DeepSeek-V2 to Kimi K2.5, all use MoE. But they share a deployment problem: even though only a fraction of experts activates per token, you need enough memory to hold all of them.
vib3 takes a different approach. Instead of loading the full model, it treats inference as retrieval over indexed weight space.
How it works
Every expert weight matrix is split into 2MB pages. Each page gets a signature: a compact embedding computed from its learned representations. These signatures are indexed in an HNSW vector index stored inside the .vib3 file.
At inference time, the engine has three options for routing:
- Standard routing: Use the model's router as normal, then page in the selected experts from NVMe/RAM/VRAM.
- Predictive prefetching: A co-activation graph tracks which experts fire together. Before each layer, likely-needed pages are prewarmed into GPU memory.
- Virtual expert assembly: Query the HNSW index directly with the hidden state to retrieve relevant weight pages, bypassing fixed expert boundaries entirely.
The storage hierarchy
Weight pages live in three tiers:
- T1 (GPU VRAM): Hot pages, pinned for immediate compute
- T2 (Host RAM): Warm pages, ready for fast GPU transfer
- T3 (NVMe): Cold pages, streamed via io_uring
The buffer manager promotes and evicts pages based on access patterns, keeping the working set in VRAM while the full model lives on NVMe. After warmup, page fault rates drop below 2%.
Current status
The engine is implemented in ~22K lines of Rust with 200+ passing tests. The conversion pipeline handles safetensors-to-.vib3 with INT4 quantization, and the HNSW vector index is integrated end-to-end.
Next up: empirical validation on Mixtral 8x7B, then scaling to Kimi K2.5 with its 384-expert architecture.
The project is open source. Check the docs to get started.