Documentation
Everything you need to install, configure, and run vib3.
Quick Start
Install vib3, pull a model, and start an interactive session in three commands.
```
$ cargo install vib3
$ vib3 pull mixtral
$ vib3 run mixtral
```
CLI Reference
vib3 ships as a single binary with the following subcommands.
| Command | Description |
|---|---|
| `vib3 pull <model>` | Download a model from HuggingFace and convert it to `.vib3` format. |
| `vib3 run <model>` | Download (if needed) and run a model interactively. |
| `vib3 serve <model>` | Start an OpenAI-compatible API server. |
| `vib3 convert --model <path> --output <file>` | Convert safetensors to `.vib3` format with optional INT4 quantization and HNSW index building. |
| `vib3 list` | List locally available models. |
| `vib3 info <model>` | Show model info and hardware compatibility. |
| `vib3 hw` | Detect and display hardware (GPU, RAM, NVMe drives). |
Architecture
vib3 is written in Rust with optional CUDA support. The core thesis is inference as retrieval over indexed weight space: an embedded HNSW vector index maps each hidden state directly to the most relevant weight pages.
```
src/
  core/     # Types, config, errors (PageId, Tier, DType)
  storage/  # .vib3 format, buffer manager, io_uring NVMe reads
  index/    # HNSW vector index, co-activation graph, domain classifier
  compute/  # CUDA FFI, kernel launchers (matmul, attention)
  runtime/  # Engine orchestrator, query planner, sampler, KV cache
  api/      # Axum HTTP server (OpenAI-compatible, SSE streaming)
```
How It Works
1. Model conversion
vib3 pull downloads safetensors from HuggingFace, quantizes expert weights to INT4, computes page signatures, builds an HNSW vector index, and packs everything into a single .vib3 file.
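The INT4 quantization step can be sketched as symmetric per-page quantization with two 4-bit values packed per byte. This is an illustrative stand-in, not vib3's actual format; the function names, per-page scale, and packing layout here are assumptions.

```rust
// Sketch: symmetric INT4 quantization of a weight page (illustrative only).
fn quantize_int4(weights: &[f32]) -> (Vec<u8>, f32) {
    // Per-page scale maps the largest magnitude onto the INT4 range [-7, 7].
    let max_abs = weights.iter().fold(0.0f32, |m, w| m.max(w.abs()));
    let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 7.0 };
    // Pack two 4-bit values (biased by +7 to be non-negative) per byte.
    let q = |w: f32| ((w / scale).round().clamp(-7.0, 7.0) as i8 + 7) as u8;
    let mut packed = Vec::with_capacity((weights.len() + 1) / 2);
    for pair in weights.chunks(2) {
        let lo = q(pair[0]);
        let hi = pair.get(1).map(|w| q(*w)).unwrap_or(0);
        packed.push(lo | (hi << 4));
    }
    (packed, scale)
}

fn dequantize_int4(packed: &[u8], scale: f32, len: usize) -> Vec<f32> {
    let mut out = Vec::with_capacity(len);
    for &byte in packed {
        for nib in [byte & 0x0F, byte >> 4] {
            if out.len() < len {
                out.push((nib as i8 - 7) as f32 * scale);
            }
        }
    }
    out
}

fn main() {
    let w = [0.5f32, -1.0, 0.25, 0.7];
    let (packed, scale) = quantize_int4(&w);
    println!("{:?}", dequantize_int4(&packed, scale, w.len()));
}
```

With a single scale per page, reconstruction error is bounded by half a quantization step (`scale / 2`); real formats typically use finer-grained group scales to tighten that bound.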
2. Three-tier buffer manager
Weight pages live in three tiers: T1 (GPU VRAM), T2 (host RAM), T3 (NVMe). The buffer manager promotes pages on demand using io_uring for async NVMe reads, keeping hot experts pinned in VRAM.
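The tiering logic can be pictured as a page table that promotes a page one tier toward the GPU each time it is touched. A minimal sketch, with hypothetical types (vib3's real buffer manager also handles eviction, pinning, and async io_uring reads):

```rust
use std::collections::HashMap;

// Illustrative three-tier residency table; not vib3's actual types.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum Tier { Gpu, Ram, Nvme }

struct BufferManager {
    residency: HashMap<u64, Tier>, // PageId -> current tier
    gpu_capacity: usize,
    gpu_resident: usize,
}

impl BufferManager {
    // On access, promote the page one tier toward the GPU if there is room.
    fn touch(&mut self, page: u64) -> Tier {
        let tier = *self.residency.get(&page).unwrap_or(&Tier::Nvme);
        match tier {
            Tier::Nvme => {
                // T3 -> T2: an async NVMe read would fill a host-RAM frame here.
                self.residency.insert(page, Tier::Ram);
                Tier::Ram
            }
            Tier::Ram if self.gpu_resident < self.gpu_capacity => {
                // T2 -> T1: copy to VRAM while capacity remains.
                self.gpu_resident += 1;
                self.residency.insert(page, Tier::Gpu);
                Tier::Gpu
            }
            t => t, // already GPU-resident, or VRAM is full
        }
    }
}

fn main() {
    let mut bm = BufferManager { residency: HashMap::new(), gpu_capacity: 1, gpu_resident: 0 };
    println!("{:?}", bm.touch(42)); // first touch promotes NVMe -> RAM
    println!("{:?}", bm.touch(42)); // second touch promotes RAM -> GPU
}
```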
3. Predictive prefetching
A co-activation graph tracks which experts fire together. Before each layer, the query planner predicts upcoming expert activations and prewarms their pages, driving page fault rates below 2%.
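A co-activation graph of this kind can be sketched as pairwise activation counts with a top-k companion lookup. The structure below is a simplified assumption, not vib3's implementation, which would feed these predictions into the buffer manager's prewarm path:

```rust
use std::collections::HashMap;

// Sketch: count how often expert pairs activate in the same step,
// then predict the likeliest companions of a given expert.
#[derive(Default)]
struct CoActivationGraph {
    counts: HashMap<(u32, u32), u32>, // unordered expert pair -> co-fire count
}

impl CoActivationGraph {
    // Record one step's set of active experts.
    fn record(&mut self, active: &[u32]) {
        for (i, &a) in active.iter().enumerate() {
            for &b in &active[i + 1..] {
                let key = if a < b { (a, b) } else { (b, a) };
                *self.counts.entry(key).or_insert(0) += 1;
            }
        }
    }

    // Experts most often co-activated with `expert`, strongest first.
    fn predict(&self, expert: u32, top_k: usize) -> Vec<u32> {
        let mut scored: Vec<(u32, u32)> = self.counts.iter()
            .filter_map(|(&(a, b), &n)| {
                if a == expert { Some((b, n)) }
                else if b == expert { Some((a, n)) }
                else { None }
            })
            .collect();
        scored.sort_by(|x, y| y.1.cmp(&x.1));
        scored.into_iter().take(top_k).map(|(e, _)| e).collect()
    }
}

fn main() {
    let mut g = CoActivationGraph::default();
    g.record(&[0, 3]);
    g.record(&[0, 3]);
    g.record(&[0, 5]);
    println!("{:?}", g.predict(0, 2)); // [3, 5]
}
```

Before layer N runs, the planner would call `predict` for each expert active at layer N-1 and ask the buffer manager to prewarm the resulting pages.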
4. Virtual expert assembly
The HNSW vector index maps hidden states to weight pages directly, bypassing the router entirely. This enables sub-expert granularity retrieval—assembling “virtual experts” from arbitrary page combinations.
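The retrieval step amounts to a nearest-neighbor query from hidden state to page signatures. As a sketch, the brute-force cosine scan below stands in for the HNSW index (which answers the same query approximately in sub-linear time); the signature vectors, dimensions, and function names are made up for illustration:

```rust
// Cosine similarity between a hidden state and a page signature.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

// IDs of the top_k pages whose signatures best match `hidden`.
// A real HNSW index replaces this linear scan.
fn retrieve_pages(hidden: &[f32], signatures: &[(u64, Vec<f32>)], top_k: usize) -> Vec<u64> {
    let mut scored: Vec<(u64, f32)> = signatures.iter()
        .map(|(id, sig)| (*id, cosine(hidden, sig)))
        .collect();
    scored.sort_by(|x, y| y.1.partial_cmp(&x.1).unwrap());
    scored.into_iter().take(top_k).map(|(id, _)| id).collect()
}

fn main() {
    let signatures = vec![
        (1u64, vec![1.0, 0.0]),
        (2u64, vec![0.0, 1.0]),
        (3u64, vec![0.7, 0.7]),
    ];
    // The "virtual expert" is assembled from whichever pages score highest,
    // regardless of which router-defined expert they originally belonged to.
    println!("{:?}", retrieve_pages(&[0.9, 0.1], &signatures, 2)); // [1, 3]
}
```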
Supported Models
| Model | Experts | Status |
|---|---|---|
| Mixtral 8x7B | 8 | Validation target |
| Kimi K2.5 | 384 | Primary target |
Building from Source
```
$ git clone https://github.com/vib3dev/vib3
$ cd vib3
$ cargo build --release

# CPU-only (no CUDA)
$ cargo build --release --no-default-features

# Run tests
$ cargo test --no-default-features
```