Documentation

Everything you need to install, configure, and run vib3.

Quick Start

Install vib3, pull a model, and start an interactive session in three commands.

$ cargo install vib3
$ vib3 pull mixtral
$ vib3 run mixtral

CLI Reference

vib3 ships as a single binary with the following subcommands.

vib3 pull <model>

Download a model from Hugging Face and convert it to the .vib3 format.

vib3 run <model>

Download (if needed) and run a model interactively.

vib3 serve <model>

Start an OpenAI-compatible API server.
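Because the server is OpenAI-compatible (with SSE streaming, per the api/ module), clients can send the standard chat-completions request shape. A minimal example body, POSTed to the usual /v1/chat/completions route (the route and port are assumptions based on the OpenAI convention, not taken from vib3 itself):

```json
{
  "model": "mixtral",
  "messages": [{ "role": "user", "content": "Hello" }],
  "stream": true
}
```

With "stream": true, responses arrive as server-sent events rather than a single JSON object.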

vib3 convert --model <path> --output <file>

Convert safetensors to .vib3 format with optional INT4 quantization and HNSW index building.

vib3 list

List locally available models.

vib3 info <model>

Show model info and hardware compatibility.

vib3 hw

Detect and display hardware (GPU, RAM, NVMe drives).

Architecture

vib3 is written in Rust with optional CUDA support. The core thesis is inference as retrieval over indexed weight space—an embedded HNSW vector index retrieves the most relevant weight pages directly from the hidden state.

src/
  core/           # Types, config, errors (PageId, Tier, DType)
  storage/        # .vib3 format, buffer manager, io_uring NVMe reads
  index/          # HNSW vector index, co-activation graph, domain classifier
  compute/        # CUDA FFI, kernel launchers (matmul, attention)
  runtime/        # Engine orchestrator, query planner, sampler, KV cache
  api/            # Axum HTTP server (OpenAI-compatible, SSE streaming)

How It Works

1. Model conversion

vib3 pull downloads safetensors from Hugging Face, quantizes expert weights to INT4, computes page signatures, builds an HNSW vector index, and packs everything into a single .vib3 file.
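The page-signature step can be sketched as follows. The signature scheme here (mean-pooling a page's weight rows into one vector) is an assumption for illustration; the actual scheme is defined by the converter, and quantization is omitted:

```rust
/// One fixed-size page of expert weights (hypothetical layout;
/// real pages are INT4-quantized and packed into the .vib3 file).
struct WeightPage {
    rows: Vec<Vec<f32>>,
}

/// Compute a signature for a page by mean-pooling its rows.
/// The resulting vectors are what the HNSW index is built over.
fn page_signature(page: &WeightPage) -> Vec<f32> {
    let dim = page.rows[0].len();
    let mut sig = vec![0.0f32; dim];
    for row in &page.rows {
        for (s, v) in sig.iter_mut().zip(row) {
            *s += v;
        }
    }
    let n = page.rows.len() as f32;
    for s in &mut sig {
        *s /= n;
    }
    sig
}

fn main() {
    let page = WeightPage {
        rows: vec![vec![1.0, 2.0], vec![3.0, 4.0]],
    };
    // Mean of the two rows: [2.0, 3.0]
    println!("{:?}", page_signature(&page));
}
```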

2. Three-tier buffer manager

Weight pages live in three tiers: T1 (GPU VRAM), T2 (host RAM), T3 (NVMe). The buffer manager promotes pages on demand using io_uring for async NVMe reads, keeping hot experts pinned in VRAM.
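A minimal sketch of on-demand promotion: each access moves a page one tier closer to the GPU. The type and method names are hypothetical; the real buffer manager also evicts pages under memory pressure, pins hot experts in VRAM, and issues io_uring reads, none of which is shown:

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Tier {
    T1Vram,
    T2Ram,
    T3Nvme,
}

/// Toy buffer manager: tracks which tier each page currently lives in.
struct BufferManager {
    tiers: HashMap<u64, Tier>, // page id -> current tier
}

impl BufferManager {
    /// Access a page, promoting it one tier toward VRAM.
    fn access(&mut self, page: u64) -> Tier {
        let tier = self.tiers.entry(page).or_insert(Tier::T3Nvme);
        *tier = match *tier {
            Tier::T3Nvme => Tier::T2Ram,
            Tier::T2Ram | Tier::T1Vram => Tier::T1Vram,
        };
        *tier
    }
}

fn main() {
    let mut bm = BufferManager { tiers: HashMap::new() };
    bm.access(42); // NVMe -> RAM
    let t = bm.access(42); // RAM -> VRAM
    println!("{:?}", t);
}
```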

3. Predictive prefetching

A co-activation graph tracks which experts fire together. Before each layer, the query planner predicts upcoming expert activations and prewarms their pages, driving page fault rates below 2%.
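The co-activation graph amounts to pair counts plus a top-k lookup. A sketch under assumed names (the real planner works over pages, not just expert ids, and prewarms asynchronously):

```rust
use std::collections::HashMap;

/// Count how often expert pairs fire in the same step, then predict
/// likely partners for an active expert so their pages can be prewarmed.
#[derive(Default)]
struct CoActivation {
    counts: HashMap<(u32, u32), u32>, // unordered pair -> count
}

impl CoActivation {
    /// Record one step's set of active experts.
    fn record(&mut self, active: &[u32]) {
        for (i, &a) in active.iter().enumerate() {
            for &b in &active[i + 1..] {
                let key = if a < b { (a, b) } else { (b, a) };
                *self.counts.entry(key).or_insert(0) += 1;
            }
        }
    }

    /// Experts most often co-activated with `expert`, best first.
    fn predict(&self, expert: u32, k: usize) -> Vec<u32> {
        let mut scored: Vec<(u32, u32)> = self
            .counts
            .iter()
            .filter_map(|(&(a, b), &c)| {
                if a == expert { Some((b, c)) }
                else if b == expert { Some((a, c)) }
                else { None }
            })
            .collect();
        scored.sort_by(|x, y| y.1.cmp(&x.1));
        scored.into_iter().take(k).map(|(e, _)| e).collect()
    }
}

fn main() {
    let mut g = CoActivation::default();
    g.record(&[0, 3]);
    g.record(&[0, 3]);
    g.record(&[0, 7]);
    // Expert 3 co-fires with 0 most often, so prewarm its pages first.
    println!("{:?}", g.predict(0, 1));
}
```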

4. Virtual expert assembly

The HNSW vector index maps hidden states to weight pages directly, bypassing the router entirely. This enables sub-expert granularity retrieval—assembling “virtual experts” from arbitrary page combinations.
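Conceptually, retrieval picks the pages whose signatures best match the hidden state. In this sketch a brute-force dot-product scan stands in for the real HNSW index (which gives the same answer approximately, in logarithmic rather than linear time):

```rust
/// Return the ids of the k weight pages whose signatures score highest
/// against the hidden state. The selected pages together form a
/// "virtual expert" for this token.
fn top_pages(hidden: &[f32], signatures: &[(u64, Vec<f32>)], k: usize) -> Vec<u64> {
    let mut scored: Vec<(u64, f32)> = signatures
        .iter()
        .map(|(id, sig)| {
            let score: f32 = hidden.iter().zip(sig).map(|(h, s)| h * s).sum();
            (*id, score)
        })
        .collect();
    scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    scored.into_iter().take(k).map(|(id, _)| id).collect()
}

fn main() {
    let sigs = vec![
        (1u64, vec![1.0, 0.0]),
        (2, vec![0.0, 1.0]),
        (3, vec![0.7, 0.7]),
    ];
    // Pages 1 and 3 score highest against this hidden state.
    println!("{:?}", top_pages(&[1.0, 0.2], &sigs, 2));
}
```

Because any page combination can be selected, the granularity is finer than whole experts, which is what makes the assembled experts "virtual".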

Supported Models

Model          Experts   Status
Mixtral 8x7B   8         Validation target
Kimi K2.5      384       Primary target

Building from Source

$ git clone https://github.com/vib3dev/vib3
$ cd vib3
$ cargo build --release

# CPU-only (no CUDA)
$ cargo build --release --no-default-features

# Run tests
$ cargo test --no-default-features

Next steps