KV3D Engine

Run more concurrent LLM sessions on the same hardware.

KV3D is an inference server for open-weight models that converts shared prompt prefixes into reusable KV-cache snapshots, then stores per-session state as compact int8 deltas — so you get more sessions per GPU without touching the model weights.

curl -fsSL https://install.kv3d.dev | bash
kv3d serve --model ./qwen2.5-7b-instruct.Q4_K_M.gguf

The problem

Every LLM session carries a full KV cache in GPU memory. When hundreds of sessions share the same system prompt or RAG scaffold, you're paying for the same prefix over and over. Memory fills up. Sessions get queued. Cost per request climbs.
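
To put rough numbers on that duplication, the sketch below estimates the KV-cache footprint of a shared prefix. The model dimensions (28 layers, 4 KV heads, head size 128, fp16 values), the 2048-token prefix, and the 100 sessions are illustrative assumptions, not measurements from KV3D; substitute the values for your own model and workload.

```python
# Rough KV-cache sizing. Every dimension here is an assumption picked to
# resemble a 7B model with grouped-query attention; replace with real values.
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_elem: int) -> int:
    # One key vector and one value vector per layer and per KV head.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


per_token = kv_bytes_per_token(n_layers=28, n_kv_heads=4, head_dim=128,
                               bytes_per_elem=2)  # 2 bytes = fp16
prefix_tokens = 2048   # assumed length of the shared system prompt
sessions = 100         # assumed number of concurrent sessions

prefix_mib = per_token * prefix_tokens / 2**20
duplicated_mib = prefix_mib * sessions

print(f"{per_token} bytes of KV cache per token")
print(f"{prefix_mib:.0f} MiB per session for the shared prefix")
print(f"{duplicated_mib:.0f} MiB if every session keeps its own copy")
```

Under these assumptions the prefix costs 112 MiB per session, so 100 sessions hold 11,200 MiB of identical data that a single shared snapshot could replace.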

How KV3D fixes it

Session A ──┐
Session B ──┼──► [shared prefix KV snapshot] + [per-session Δ (int8, 4× smaller)]
Session C ──┘

Detect — canonicalize and hash the prompt prefix on every request
Reuse — serve the shared prefix KV snapshot from the hot/warm cache
Compress — encode the per-session residual as a quantized int8 delta
Tier — spill cold state to host RAM or SSD; restore on resume
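
The Compress step can be pictured with a minimal sketch like the one below. It assumes symmetric int8 quantization with one scale per block, applied to the difference between a session's values and a reference block. The README does not specify the actual codec, so the function names (`encode_delta`, `decode_delta`) and the scaling scheme are hypothetical.

```python
from array import array


def encode_delta(session: list[float], reference: list[float]) -> tuple[float, array]:
    """Quantize (session - reference) to int8 using one symmetric scale."""
    residual = [s - r for s, r in zip(session, reference)]
    max_abs = max((abs(x) for x in residual), default=0.0)
    # Map the largest residual magnitude to 127; avoid dividing by zero.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = array("b", (max(-127, min(127, round(x / scale))) for x in residual))
    return scale, quantized


def decode_delta(scale: float, quantized: array, reference: list[float]) -> list[float]:
    """Rebuild approximate session values from the reference and the int8 delta."""
    return [r + scale * q for r, q in zip(reference, quantized)]


# Toy values standing in for one block of KV data (hypothetical numbers).
reference = [0.10, -0.25, 0.40, 0.00]
session = [0.12, -0.20, 0.35, 0.01]

scale, quantized = encode_delta(session, reference)
restored = decode_delta(scale, quantized, reference)
max_err = max(abs(a - b) for a, b in zip(session, restored))

print(f"bytes per element: {quantized.itemsize}")
print(f"max reconstruction error: {max_err:.6f} (bound: {scale / 2:.6f})")
```

Each stored element takes one byte, and rounding keeps the reconstruction error within half a quantization step. Quantizing the residual rather than the raw values helps when sessions stay close to the reference, because a small value range gives a small step size.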

Features

ID	Feature	Status
F1	OpenAI-compatible HTTP API	✅ MVP
F2	llama.cpp execution backend	✅ MVP
F3	Exact-prefix family detection	✅ MVP
F4	Shared prefix KV snapshot	✅ MVP
F5	Compressed session deltas	✅ MVP
F7	GPU hot / RAM warm cache tiers	✅ MVP
F8	Auto fallback / safe mode	✅ MVP
F6	Collaborative block codec	🔜 P1
F9	Workload analytics dashboard	🔜 P1

Quick start

Local (curl install)

curl -fsSL https://install.kv3d.dev | bash

kv3d doctor                                           # verify setup
kv3d serve --model ./qwen2.5-7b-instruct.Q4_K_M.gguf  # start server

Docker

docker run --rm -p 8080:8080 \
  -v $PWD/models:/models \
  ghcr.io/0xbadcaffe/kv3d:latest \
  kv3d serve --model /models/qwen2.5-7b-instruct.Q4_K_M.gguf

Build from source

git clone --recurse-submodules https://github.com/0xbadcaffe/kv3d.git
cd kv3d

cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)

build/src/kv3d doctor
build/src/kv3d serve --model ./models/qwen.gguf

API

Drop-in replacement for the OpenAI chat completions endpoint.

Health check

curl http://localhost:8080/health
# {"status":"ok","service":"kv3d-engine"}

Chat completion

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5-7b-instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user",   "content": "What is the capital of France?"}
    ]
  }'
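
The same request can be sent from Python using only the standard library. This is a sketch that assumes the server returns the usual OpenAI response shape (`choices[0].message.content`); the helper names `build_chat_request` and `chat` are illustrative and not part of KV3D.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, system: str,
                       user: str) -> urllib.request.Request:
    """Build a POST request for the chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url: str, model: str, system: str, user: str) -> str:
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(base_url, model, system, user)
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    # Assumes an OpenAI-style response body.
    return body["choices"][0]["message"]["content"]


# Example call against a locally running server:
# print(chat("http://localhost:8080", "qwen2.5-7b-instruct",
#            "You are a helpful assistant.", "What is the capital of France?"))
```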

Streaming

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5-7b-instruct",
    "stream": true,
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user",   "content": "Count to five."}
    ]
  }'
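
With `"stream": true`, OpenAI-compatible servers normally reply with server-sent events: one `data: {...}` line per chunk, ending with `data: [DONE]`. The sketch below assumes KV3D follows that convention and shows how a client might reassemble the text; `iter_stream_text` is a hypothetical helper and the sample lines are made up for illustration.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield the text fragments carried by OpenAI-style streaming chunks."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and non-data fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


# Hand-written sample stream (not captured from a real server).
sample = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    "",
    'data: {"choices":[{"delta":{"content":"One, "}}]}',
    'data: {"choices":[{"delta":{"content":"two."}}]}',
    "data: [DONE]",
]

print("".join(iter_stream_text(sample)))
```

In a real client, `lines` would be the decoded lines of the HTTP response body, read as they arrive.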

See the prefix cache working — two requests, same system prompt, different user messages:

# First request → cache MISS (computes and stores prefix snapshot)
curl -s http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" \
  -d '{"model":"m","messages":[{"role":"system","content":"You are a coding assistant."},{"role":"user","content":"Write a sort function."}]}'

# Second request → cache HIT (reuses the prefix snapshot)
curl -s http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" \
  -d '{"model":"m","messages":[{"role":"system","content":"You are a coding assistant."},{"role":"user","content":"Write a binary search."}]}'

# Confirm: on a freshly started server, 1 hit out of 2 requests gives a hit_rate of 0.5
curl -s http://localhost:8080/metrics | grep hit_rate
# kv3d_prefix_hit_rate 0.5

Metrics (Prometheus)

curl http://localhost:8080/metrics
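
To check the hit rate from a script rather than with `grep`, the Prometheus text format can be parsed in a few lines. The sketch below is a simplified parser that ignores timestamps and does not handle label values containing spaces. Only the `kv3d_prefix_hit_rate` line comes from the example above; the `# HELP` and `# TYPE` lines in the sample are assumed.

```python
def parse_metrics(text: str) -> dict[str, float]:
    """Parse 'name value' samples from Prometheus text exposition output."""
    metrics: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        parts = line.split()
        if len(parts) < 2:
            continue  # not a sample line
        metrics[parts[0]] = float(parts[1])
    return metrics


# Sample payload; the comment lines are assumptions about the output.
sample = """\
# HELP kv3d_prefix_hit_rate Fraction of requests served from a prefix snapshot
# TYPE kv3d_prefix_hit_rate gauge
kv3d_prefix_hit_rate 0.5
"""

metrics = parse_metrics(sample)
print(metrics["kv3d_prefix_hit_rate"])
```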

Architecture

┌─────────────────────────────────────────────────────┐
│                   Client / API Layer                │
│          /v1/chat/completions · /metrics            │
└────────────────────────┬────────────────────────────┘
                         │
┌────────────────────────▼────────────────────────────┐
│               Session & Scheduler Layer             │
│     prefix detection · batching · state machine     │
└───────────┬─────────────────────────┬───────────────┘
            │                         │
┌───────────▼──────────┐  ┌──────────▼───────────────┐
│   Prefix Family Layer │  │    KV Compression Layer  │
│  FNV-1a hash · index  │  │  snapshot · int8 delta   │
└───────────┬──────────┘  └──────────┬───────────────┘
            │                         │
┌───────────▼─────────────────────────▼───────────────┐
│                Storage Tier Manager                 │
│     GPU hot cache · RAM warm cache · SSD cold       │
└────────────────────────┬────────────────────────────┘
                         │
┌────────────────────────▼────────────────────────────┐
│              Execution Backend (llama.cpp)          │
│              GGUF · GQA · CPU + CUDA               │
└─────────────────────────────────────────────────────┘
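
The Prefix Family Layer in the diagram names FNV-1a hashing. As a rough illustration, the sketch below hashes the leading system messages of a request with 64-bit FNV-1a. The hash width, the canonicalization rules, and the `prefix_key` helper are assumptions; the README does not document how KV3D serializes or canonicalizes a prefix.

```python
FNV64_OFFSET = 0xCBF29CE484222325  # standard 64-bit FNV offset basis
FNV64_PRIME = 0x100000001B3        # standard 64-bit FNV prime


def fnv1a_64(data: bytes) -> int:
    """64-bit FNV-1a: XOR each byte in, then multiply by the prime."""
    h = FNV64_OFFSET
    for byte in data:
        h ^= byte
        h = (h * FNV64_PRIME) & 0xFFFFFFFFFFFFFFFF
    return h


def canonicalize(text: str) -> str:
    # Hypothetical rules: unify line endings and trim surrounding whitespace.
    return text.replace("\r\n", "\n").strip()


def prefix_key(messages: list[dict]) -> int:
    """Hash the leading system messages, treated here as the shared prefix."""
    parts = []
    for message in messages:
        if message["role"] != "system":
            break
        parts.append(message["role"] + "\x1f" + canonicalize(message["content"]))
    return fnv1a_64("\x1e".join(parts).encode("utf-8"))


a = prefix_key([
    {"role": "system", "content": "You are a coding assistant."},
    {"role": "user", "content": "Write a sort function."},
])
b = prefix_key([
    {"role": "system", "content": "You are a coding assistant.\r\n"},
    {"role": "user", "content": "Write a binary search."},
])
print(a == b)  # same system prompt after canonicalization, so same family
```

FNV-1a is fast but not collision-resistant, so a store built this way would still need to compare the canonical prefix itself before reusing a snapshot.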

Benchmarks

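No benchmark numbers have been published yet; they will follow as the engine matures. The table lists each claim KV3D intends to demonstrate, how it will be measured, and the stage it is targeted for: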

Claim	Measurement	Target stage
More sessions per GPU	Max stable concurrent sessions	MVP
Lower memory per session	Avg GPU + RAM per active session	MVP
Fast resume	p95 warm-cache restore latency	MVP
Negligible quality loss	Perplexity / logit drift vs baseline	MVP
Lower serving cost	Cost per 10k requests	Post-MVP

Run the included benchmark driver:

# 1000 requests, 80% sharing a system prompt
REQUESTS=1000 SHARED_RATIO=0.8 ./scripts/bench.sh

Configuration

~/.config/kv3d/config.json (created by kv3d doctor):

{
  "server": {
    "host": "127.0.0.1",
    "port": 8080,
    "threads": 4,
    "api_key": ""
  },
  "cache": {
    "gpu_hot_mb": 2048,
    "ram_warm_mb": 8192,
    "ssd_cold_path": ""
  },
  "model": {
    "path": "./models/qwen2.5-7b-instruct.Q4_K_M.gguf",
    "id": "qwen2.5-7b-instruct"
  }
}
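
A deployment script might read this file to sanity-check the cache budget before starting the server. The sketch below assumes the schema shown above and treats an empty `ssd_cold_path` as "SSD tier disabled"; that interpretation and the `load_cache_budget` helper are assumptions, not documented behavior.

```python
import json


def load_cache_budget(config_text: str) -> dict:
    """Extract and validate the cache tier sizes from a KV3D-style config."""
    cache = json.loads(config_text)["cache"]
    for key in ("gpu_hot_mb", "ram_warm_mb"):
        value = cache[key]
        if isinstance(value, bool) or not isinstance(value, int) or value < 0:
            raise ValueError(f"{key} must be a non-negative integer, got {value!r}")
    return {
        "gpu_hot_mb": cache["gpu_hot_mb"],
        "ram_warm_mb": cache["ram_warm_mb"],
        "total_mb": cache["gpu_hot_mb"] + cache["ram_warm_mb"],
        # Assumption: an empty path means no SSD cold tier is configured.
        "ssd_cold_enabled": bool(cache.get("ssd_cold_path")),
    }


# Inline copy of the cache section from the example config.
CONFIG_TEXT = """
{
  "cache": {
    "gpu_hot_mb": 2048,
    "ram_warm_mb": 8192,
    "ssd_cold_path": ""
  }
}
"""

budget = load_cache_budget(CONFIG_TEXT)
print(budget)
```

In practice the text would come from `~/.config/kv3d/config.json` rather than an inline string.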

Development

# Debug build with sanitizers
cmake -S . -B build -DCMAKE_BUILD_TYPE=Debug -DKV3D_ENABLE_SANITIZERS=ON
cmake --build build -j$(nproc)

# Run all tests
ctest --test-dir build --output-on-failure

# Run unit tests only
build/tests/unit/kv3d_unit_tests

# Run integration tests only
build/tests/integration/kv3d_integration_tests

# Run benchmark (1000 requests, 80% sharing a system prompt)
build/tests/load/kv3d_bench --requests 1000 --shared-ratio 0.8

Repository layout

kv3d/
├── include/kv3d/       # public headers
│   ├── core/           # hashing, canonicalization, guardrails
│   ├── kv/             # KV block types, prefix store, delta codec
│   ├── api/            # OpenAI types, server interface
│   ├── sched/          # session manager
│   ├── storage/        # cache tier interfaces
│   └── metrics/        # metrics collector
├── src/                # implementations (mirrors include/)
├── tests/
│   ├── unit/           # fast, in-process tests
│   ├── integration/    # end-to-end session tests
│   └── load/           # benchmark driver
├── scripts/            # install.sh · bench.sh · package.sh
├── cmake/              # build helpers
└── third_party/        # llama.cpp (submodule)

Roadmap

Phase	Focus	Gate
P0	Baseline runner	Stable end-to-end inference
P1	Exact-prefix reuse	Hit-rate and correctness validated
P2	Delta codec	Memory savings proven
P3	Benchmark & dashboard	ROI story ready
P4	Collaborative block codec	Measured gain beyond simple deltas

License

Apache 2.0
