Run trillion-parameter models on one GPU.

vib3 streams expert weights from NVMe to GPU on demand. No multi-GPU cluster. No cloud API. Just your machine.


$ vib3 pull mixtral

$ vib3 run mixtral

~8 GB VRAM for Mixtral 8x7B (INT4 quantized, paged from NVMe)

< 2% page fault rate (after warmup with predictive prefetch)

384 experts in Kimi K2.5 (weight-indexed retrieval, single GPU)

Inference as retrieval.

MoE models activate about 2% of their weights per token but conventional runtimes keep 100% of them resident in memory. vib3 fixes this. Weight pages live on NVMe, get indexed into an HNSW vector index, and stream to the GPU only when needed.

Three-tier paging

GPU VRAM, host RAM, NVMe. Hot pages stay pinned. Cold pages stream via io_uring. The same architecture databases have used for decades.
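A minimal sketch of what a three-tier page table could look like. `Tier` and `PageTable` are illustrative names, not vib3's actual API; promotion policy here is simply one tier toward the GPU per access.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Tier {
    Vram, // hot: pinned on GPU
    Ram,  // warm: cached in host memory
    Nvme, // cold: streamed on demand
}

struct PageTable {
    location: HashMap<u64, Tier>, // PageId -> current tier
}

impl PageTable {
    fn new() -> Self {
        Self { location: HashMap::new() }
    }

    // Promote a page one tier toward the GPU on access (NVMe -> RAM -> VRAM).
    fn touch(&mut self, page: u64) -> Tier {
        let next = match self.location.get(&page).copied().unwrap_or(Tier::Nvme) {
            Tier::Nvme => Tier::Ram,
            Tier::Ram | Tier::Vram => Tier::Vram,
        };
        self.location.insert(page, next);
        next
    }
}

fn main() {
    let mut table = PageTable::new();
    assert_eq!(table.touch(42), Tier::Ram);  // first access: NVMe -> RAM
    assert_eq!(table.touch(42), Tier::Vram); // second access: RAM -> VRAM
}
```

A real buffer manager would also track capacity per tier and evict cold pages downward; this sketch only shows the upward path.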

Virtual expert assembly

HNSW index maps hidden states directly to weight pages. Bypass the router. Retrieve weight fragments at sub-expert granularity.
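The lookup contract can be shown with a brute-force nearest-centroid search standing in for HNSW (the real index is approximate, not exhaustive). `PageIndex` and its fields are hypothetical names.

```rust
struct PageIndex {
    centroids: Vec<(u64, Vec<f32>)>, // (PageId, centroid for that weight page)
}

// Squared Euclidean distance between two vectors of equal length.
fn dist(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum()
}

impl PageIndex {
    // Return the page whose centroid is closest to the hidden state,
    // bypassing the MoE router entirely.
    fn retrieve(&self, hidden: &[f32]) -> Option<u64> {
        self.centroids
            .iter()
            .min_by(|(_, a), (_, b)| {
                dist(hidden, a).partial_cmp(&dist(hidden, b)).unwrap()
            })
            .map(|(id, _)| *id)
    }
}

fn main() {
    let index = PageIndex {
        centroids: vec![(0, vec![0.0, 0.0]), (1, vec![1.0, 1.0])],
    };
    // A hidden state near (1, 1) retrieves page 1.
    assert_eq!(index.retrieve(&[0.9, 0.8]), Some(1));
}
```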

Predictive prefetch

Co-activation graph predicts next-layer experts. Pages prewarm before they're needed. Fault rate drops below 2%.
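One plausible shape for such a graph, sketched under assumptions: edge weights count how often expert j in layer L+1 fired after expert i in layer L, and prefetch takes the top-k heaviest edges. `CoActivationGraph` is an illustrative name, not vib3's actual type.

```rust
use std::collections::HashMap;

struct CoActivationGraph {
    counts: HashMap<(u32, u32), u32>, // (expert_i, expert_j) -> co-fire count
}

impl CoActivationGraph {
    fn new() -> Self {
        Self { counts: HashMap::new() }
    }

    // Observe that `to` (next layer) fired after `from` (current layer).
    fn record(&mut self, from: u32, to: u32) {
        *self.counts.entry((from, to)).or_insert(0) += 1;
    }

    // Predict the k most likely next-layer experts for `from`, so their
    // weight pages can be prewarmed before the layer executes.
    fn predict(&self, from: u32, k: usize) -> Vec<u32> {
        let mut edges: Vec<_> = self
            .counts
            .iter()
            .filter(|((i, _), _)| *i == from)
            .map(|((_, j), c)| (*j, *c))
            .collect();
        edges.sort_by(|a, b| b.1.cmp(&a.1));
        edges.into_iter().take(k).map(|(j, _)| j).collect()
    }
}

fn main() {
    let mut g = CoActivationGraph::new();
    g.record(3, 7);
    g.record(3, 7);
    g.record(3, 5);
    assert_eq!(g.predict(3, 1), vec![7]); // expert 7 co-fired most often
}
```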

Single file format

One .vib3 file. Weights, page catalog, vector index, config. Pull and run. No Python. No framework.
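A single-file format like this typically opens with a fixed header of section offsets. The actual .vib3 layout is not documented here; `Vib3Header` and its fields are a hypothetical sketch of the idea.

```rust
#[repr(C)]
#[derive(Debug)]
struct Vib3Header {
    magic: [u8; 4],   // b"VIB3"
    version: u32,
    weights_off: u64, // byte offset of the weight pages
    catalog_off: u64, // byte offset of the page catalog
    index_off: u64,   // byte offset of the vector index
    config_off: u64,  // byte offset of the model config
}

fn main() {
    let header = Vib3Header {
        magic: *b"VIB3",
        version: 1,
        weights_off: 4096,
        catalog_off: 1 << 30,
        index_off: 2 << 30,
        config_off: 3 << 30,
    };
    assert_eq!(&header.magic, b"VIB3");
    println!("{:?}", header);
}
```

With every section addressable by offset, "pull and run" reduces to one download and one mmap, with no unpacking step.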

Rust. Zero-copy. Single binary.

22K lines of Rust. 200+ tests. Optional CUDA. #[repr(C)] structs, mmap, io_uring. CPU fallback always works.
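The zero-copy pattern named above, in miniature: a #[repr(C)] struct read straight out of a byte buffer (as an mmap'd file would be), with no deserialization step. `PageHeader` is an illustrative name; a real reader would also validate alignment and bounds against the file.

```rust
#[repr(C)]
#[derive(Clone, Copy, Debug, PartialEq)]
struct PageHeader {
    page_id: u64,
    len: u32,
    dtype: u32, // e.g. 0 = f16, 1 = int4
}

// Reinterpret the front of a byte buffer as a PageHeader.
fn read_header(buf: &[u8]) -> PageHeader {
    assert!(buf.len() >= std::mem::size_of::<PageHeader>());
    // Sound for this plain-old-data struct: fixed C layout, read_unaligned
    // tolerates any alignment, and the length was checked above.
    unsafe { std::ptr::read_unaligned(buf.as_ptr() as *const PageHeader) }
}

fn main() {
    let original = PageHeader { page_id: 7, len: 4096, dtype: 1 };
    // Stand-in for bytes read from an mmap'd .vib3 file.
    let bytes: [u8; 16] = unsafe { std::mem::transmute(original) };
    assert_eq!(read_header(&bytes), original);
}
```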

No Python runtime. No torch. No transformers library. Just cargo install vib3 and go.

src/
  core/       Types, config, PageId, Tier, DType
  storage/    .vib3 format, buffer manager, io_uring
  index/      HNSW vector index, co-activation graph
  compute/    CUDA FFI, matmul, attention kernels
  runtime/    Engine, query planner, KV cache
  api/        Axum HTTP server, OpenAI-compatible

Open source. Ship it.