Run trillion-parameter models on one GPU.
vib3 streams expert weights from NVMe to GPU on demand. No multi-GPU cluster. No cloud API. Just your machine.
$ vib3 pull mixtral
$ vib3 run mixtral
~8 GB   VRAM for Mixtral 8x7B (INT4 quantized, paged from NVMe)
< 2%    Page fault rate after warmup, with predictive prefetch
384     Experts in Kimi K2.5 (weight-indexed retrieval, single GPU)
Inference as retrieval.
MoE models activate only a small fraction of their weights per token (on the order of a few percent for trillion-parameter models like Kimi K2.5), yet conventional runtimes keep 100% resident in memory. vib3 fixes this. Weight pages live on NVMe, get indexed into an HNSW vector index, and stream to GPU only when needed.
Three-tier paging
GPU VRAM, host RAM, NVMe. Hot pages stay pinned. Cold pages stream via io_uring. The same architecture databases have used for decades.
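The tiering idea can be sketched in a few lines. This is a hypothetical toy, not vib3's buffer manager: the `Tier` enum and the LRU promotion/demotion below are illustrative stand-ins for the real pinned-page and io_uring machinery.

```rust
use std::collections::{HashMap, VecDeque};

// Toy three-tier page tracker: pages start on NVMe, are promoted to
// VRAM when touched, and the coldest VRAM page is demoted to host RAM
// when the VRAM budget is exceeded.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum Tier { Vram, Ram, Nvme }

struct BufferManager {
    tier: HashMap<u64, Tier>, // PageId -> current tier
    vram_lru: VecDeque<u64>,  // least-recently-used order within VRAM
    vram_capacity: usize,
    faults: u64,              // accesses that had to stream from RAM or NVMe
}

impl BufferManager {
    fn new(vram_capacity: usize) -> Self {
        Self { tier: HashMap::new(), vram_lru: VecDeque::new(), vram_capacity, faults: 0 }
    }

    // Returns where the page was BEFORE this access, so callers can
    // see whether the touch was a hit or a fault.
    fn access(&mut self, page: u64) -> Tier {
        let was = *self.tier.get(&page).unwrap_or(&Tier::Nvme);
        if was != Tier::Vram {
            self.faults += 1;
        }
        self.vram_lru.retain(|&p| p != page);
        self.vram_lru.push_back(page);
        self.tier.insert(page, Tier::Vram);
        if self.vram_lru.len() > self.vram_capacity {
            if let Some(cold) = self.vram_lru.pop_front() {
                self.tier.insert(cold, Tier::Ram); // demote, keep warm in host RAM
            }
        }
        was
    }
}
```

The fault counter is the quantity the prefetcher drives down: every access that returns anything other than `Tier::Vram` is latency on the critical path.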
Virtual expert assembly
HNSW index maps hidden states to weight pages directly. Bypass the router. Retrieve weight fragments at sub-expert granularity.
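The lookup reduces to approximate nearest-neighbor search over page embeddings. As a minimal sketch, the brute-force scan below stands in for the HNSW graph; the page IDs and centroids are made up for illustration, and a real HNSW answers the same query in roughly logarithmic rather than linear time.

```rust
// Map a hidden state to the k weight pages whose stored centroids
// score highest under dot-product similarity. A linear-scan stand-in
// for the HNSW index described above.
fn top_k_pages(hidden: &[f32], index: &[(u64, Vec<f32>)], k: usize) -> Vec<u64> {
    let mut scored: Vec<(f32, u64)> = index
        .iter()
        .map(|(id, centroid)| {
            let dot: f32 = centroid.iter().zip(hidden).map(|(a, b)| a * b).sum();
            (dot, *id)
        })
        .collect();
    // Highest similarity first.
    scored.sort_by(|a, b| b.0.partial_cmp(&a.0).unwrap());
    scored.into_iter().take(k).map(|(_, id)| id).collect()
}
```

Because the query key is the hidden state itself rather than the router's logits, retrieval can return fragments finer than a whole expert.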
Predictive prefetch
Co-activation graph predicts next-layer experts. Pages prewarm before they're needed. Fault rate drops below 2%.
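One simple way to realize a co-activation graph is to count how often expert pairs fire in adjacent layers and prewarm the most frequent successors. The sketch below assumes that shape; the real predictor, thresholds, and data structures are vib3's own.

```rust
use std::collections::HashMap;

// Toy co-activation graph: (expert in layer L, expert in layer L+1)
// -> number of times that pair fired together. Prefetch returns the
// most likely next-layer experts so their pages can stream in early.
#[derive(Default)]
struct CoActivation {
    counts: HashMap<(u32, u32), u32>,
}

impl CoActivation {
    fn observe(&mut self, cur: u32, next: u32) {
        *self.counts.entry((cur, next)).or_insert(0) += 1;
    }

    // Experts to prewarm for the next layer, most frequent first.
    fn prefetch(&self, cur: u32, k: usize) -> Vec<u32> {
        let mut cand: Vec<(u32, u32)> = self
            .counts
            .iter()
            .filter(|((c, _), _)| *c == cur)
            .map(|((_, n), hits)| (*hits, *n))
            .collect();
        cand.sort_by(|a, b| b.cmp(a));
        cand.into_iter().take(k).map(|(_, n)| n).collect()
    }
}
```

Every successful prediction turns what would have been a page fault on the critical path into a background transfer, which is how the fault rate lands below 2% after warmup.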
Single file format
One .vib3 file. Weights, page catalog, vector index, config. Pull and run. No Python. No framework.
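A single-file container like this typically starts with a fixed header recording the byte offset of each section, so one mmap plus pointer math reaches everything. The layout below is a hypothetical sketch: the field names, ordering, and magic value are illustrative, not the actual .vib3 format.

```rust
// Hypothetical header for a single-file model container: four section
// offsets behind a magic number. #[repr(C)] pins the field layout so
// the struct can be read straight out of mapped bytes without copying.
#[repr(C)]
#[derive(Clone, Copy)]
struct FileHeader {
    magic: [u8; 4],   // e.g. b"VIB3" (illustrative)
    version: u32,
    weights_off: u64, // quantized weight pages
    catalog_off: u64, // page catalog: PageId -> offset, size, tier hint
    index_off: u64,   // serialized vector index
    config_off: u64,  // model config
}

fn parse_header(bytes: &[u8]) -> Option<FileHeader> {
    if bytes.len() < std::mem::size_of::<FileHeader>() || &bytes[..4] != b"VIB3" {
        return None;
    }
    // Zero-copy-style read: reinterpret the leading bytes as the header.
    // Sound only because the struct is #[repr(C)] and all-plain-integers.
    Some(unsafe { std::ptr::read_unaligned(bytes.as_ptr() as *const FileHeader) })
}
```

With this shape, "pull and run" needs no loader framework: validate the magic, read four offsets, and hand each section to its subsystem.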
Rust. Zero-copy. Single binary.
22K lines of Rust. 200+ tests. Optional CUDA. #[repr(C)] structs, mmap, io_uring. CPU fallback always works.
No Python runtime. No torch. No transformers library. Just cargo install vib3 and go.
src/
  core/     Types, config, PageId, Tier, DType
  storage/  .vib3 format, buffer manager, io_uring
  index/    HNSW vector index, co-activation graph
  compute/  CUDA FFI, matmul, attention kernels
  runtime/  Engine, query planner, KV cache
  api/      Axum HTTP server, OpenAI-compatible