Storage-tier-aware GGUF runtime plus a KoboldCpp-compatible server profile.
Hypura has two shipped product surfaces:
- `hypura serve`: the native Hypura runtime and HTTP server
- `hypura koboldcpp`: a KoboldCpp-compatible supervisor/worker profile with vendored Kobold Lite, savedata bridges, OpenAI-compatible endpoints, and probe-gated multimodal surfaces
- Run models that do not fit cleanly in GPU memory by placing tensors across GPU, RAM, pinned host memory, and NVMe.
- Serve a KoboldCpp-compatible stack without standing up a separate proxy layer.
- Keep TurboQuant and Triality metadata inside GGUF workflows while preserving a plain `hypura` CLI surface.
- Build Windows CUDA releases against an explicit toolkit version such as CUDA 12.8 without silently drifting to the newest installed Visual Studio CUDA integration.
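The storage-tier goal above can be pictured as a greedy waterfall: try the fastest tier that still has room, and spill to the next one down. The sketch below is only an illustration of that idea, not Hypura's actual scheduler; the tier capacities and tensor sizes are invented example values.

```python
# Greedy tier placement sketch: fastest tier first, spill downward.
# Capacities and tensor sizes are hypothetical byte counts.
TIERS = ["gpu", "pinned", "ram", "nvme"]  # fastest -> slowest

def place(tensors, capacity):
    """tensors: {name: size}; capacity: {tier: size}. Returns {name: tier}."""
    free = dict(capacity)
    placement = {}
    # Place the largest tensors first so they claim the fastest tiers.
    for name, size in sorted(tensors.items(), key=lambda kv: -kv[1]):
        for tier in TIERS:
            if free.get(tier, 0) >= size:
                free[tier] -= size
                placement[name] = tier
                break
        else:
            raise MemoryError(f"no tier can hold {name} ({size} bytes)")
    return placement

layout = place(
    {"blk.0.attn": 6, "blk.0.ffn": 10, "blk.1.ffn": 10, "tok_embd": 4},
    {"gpu": 12, "pinned": 4, "ram": 8, "nvme": 64},
)
print(layout)
```

A real runtime also weighs access frequency and transfer bandwidth, but the spill order is the core of tier-aware placement.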
- Tier-aware tensor placement across GPU, RAM, pinned host memory, and NVMe
- `inspect`, `bench`, and `optimize` workflows for real model analysis and layout work
- TurboQuant and Triality-aware runtime metadata handling
- Vendored `llama.cpp` main sync with `tq4_1s` GGML CPU support, staged CUDA dequant support, Triality ABI hardening, and fail-closed metadata handling
- Apple Silicon Metal path and Windows CUDA path in the same workspace
- `hypura koboldcpp <model.gguf>` with KoboldCpp-style defaults such as port `5001`
- Vendored Kobold Lite surface
- Kobold extra/admin routes, state save/load, preload story, `.jsondb` bridge, and `.kcpps` launcher config bridge
- OpenAI-compatible `/v1/completions`, `/v1/chat/completions`, and `/v1/embeddings`
- Built-in websearch route and supervisor-managed feature probing
- Supervisor/worker split so compat reloads and feature state changes do not mutate the native `serve` path
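"Fail-closed metadata handling" means a model load aborts when required metadata is missing or malformed, instead of guessing a default. A minimal sketch of that policy follows; the required-key list is illustrative, not Hypura's actual GGUF schema.

```python
# Fail-closed metadata validation sketch: a missing or mistyped required
# key aborts the load instead of being silently defaulted.
# The key list here is illustrative, not Hypura's real schema.
REQUIRED = {"general.architecture": str, "quant.scheme": str}

def validate(metadata):
    for key, typ in REQUIRED.items():
        if key not in metadata:
            raise ValueError(f"fail closed: missing metadata key {key!r}")
        if not isinstance(metadata[key], typ):
            raise ValueError(f"fail closed: wrong type for {key!r}")
    return True

validate({"general.architecture": "llama", "quant.scheme": "tq4_1s"})  # passes
```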
- Desktop-owned first-run asset bootstrap manifest
- Probe-gated embeddings plus STT/TTS packaged path
- Structured "unavailable" responses, rather than optimistic success flags, when optional multimodal backends are not ready
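Concretely, when a probe has not confirmed an optional backend, the route answers with a structured body that says so, never a bare success flag. The field names below are illustrative, not Hypura's actual wire format.

```python
import json

# Probe-gated response sketch: an unready backend yields a structured
# "unavailable" payload instead of an optimistic success flag.
# Field names are illustrative, not Hypura's real wire format.
def embeddings_response(probe_ok, vectors=None):
    if not probe_ok:
        return {"available": False, "reason": "embeddings backend not ready"}
    return {"available": True, "data": vectors or []}

print(json.dumps(embeddings_response(False)))
```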
The pinned KoboldCpp baseline is v1.111.2. Current manifest status lives in `docs/compat/koboldcpp-v1.111.2-parity-manifest.json`.
Shipped compatibility areas:
- Kobold generation routes and admin/state endpoints
- Vendored Kobold Lite
- OpenAI chat, completions, and embeddings
- Savedata and launcher config bridges
- Probe-gated multimodal proxy routes
- Windows packaged asset bootstrap for embeddings plus audio
Known limits that still belong in honest release notes:
- Packaged Stable Diffusion payloads are still optional rather than part of the default packaged-ready set
- Multimodal feature flags depend on actual local assets or helper availability
- Ollama parity still needs a full audit pass even though compatibility surfaces exist
Current benchmark scores are computed from the JSON corpus in `benchmarks/results/` and summarized with mean +/- SD, error-bar charts, and multi-group comparison tables in `benchmarks/CHARTS.md`.
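The mean +/- SD figures follow the usual sample formulas, with the convention that a single observation reports an SD of 0.000. A small sketch of that summary (sample values here are made up):

```python
import math

# Mean +/- sample standard deviation, with n=1 reported as 0.000
# (one observation has no spread estimate, matching the tables' convention).
def summarize(samples):
    n = len(samples)
    mean = sum(samples) / n
    sd = 0.0 if n < 2 else math.sqrt(
        sum((x - mean) ** 2 for x in samples) / (n - 1)
    )
    return f"{mean:.3f} +/- {sd:.3f}"

print(summarize([29.851]))      # n=1 -> "29.851 +/- 0.000": one observation
print(summarize([10.0, 12.0]))  # -> "11.000 +/- 1.414"
```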
Current measured hardware corpus:
AMD Ryzen 5 4500 6-Core Processor / NVIDIA GeForce RTX 3060 / 31.9 GB RAM
Best observed Hypura score per model in the current corpus:
| Model | Score group | Benchmark score (tok/s) | Samples | Notes |
|---|---|---|---|---|
| Gemma-4-E4B-Uncensored-HauhauCS-Aggressive-Q4_K_M | hypura four-tier + auto | 51.835 +/- 2.293 | 2 | Repeated Windows CUDA runs; paired mmproj projector was inspect-validated separately |
| Huihui-Qwen3.6-35B-A3B-abliterated.Q4_K_M | hypura four-tier + auto | 0.041 +/- 0.034 | 2 | Sparse MoE mmap path fell back to CPU-only on this machine; baseline remained faster in this corpus |
| Shadows-MoE-Q6 | hypura four-tier + off | 1.158 +/- 0.111 | 2 | Includes repeated runs and a baseline comparator in benchmarks/results/ |
| supergemma4-Q8_0 | hypura legacy-3tier + off | 29.851 +/- 0.000 | 1 | Single-run exploratory datapoint; GPU-resident and not yet a stable replicated estimate |
Multi-group summary for the same corpus:
| Model | baseline | legacy-3tier + off | four-tier + off | four-tier + auto |
|---|---|---|---|---|
| Gemma-4-E4B-Uncensored-HauhauCS-Aggressive-Q4_K_M | 41.681 +/- 4.357 | 50.390 +/- 3.980 | 29.407 +/- 34.425 | 51.835 +/- 2.293 |
| Huihui-Qwen3.6-35B-A3B-abliterated.Q4_K_M | 0.059 +/- 0.050 | 0.020 +/- 0.001 | 0.038 +/- 0.014 | 0.041 +/- 0.034 |
| Shadows-MoE-Q6 | 1.121 +/- 0.023 | 1.086 +/- 0.029 | 1.158 +/- 0.111 | 0.984 +/- 0.334 |
| supergemma4-Q8_0 | N/A | 29.851 +/- 0.000 | 0.173 +/- 0.000 | 0.167 +/- 0.000 |
Read these numbers with the run count in mind:
- `Gemma-4-E4B-Uncensored-HauhauCS-Aggressive-Q4_K_M` has `n=2`; `four-tier + auto` is currently the strongest replicated group, while `four-tier + off` shows very high variance.
- `Huihui-Qwen3.6-35B-A3B-abliterated.Q4_K_M` has `n=2`, but all groups are very slow on this hardware because Hypura's sparse MoE mmap path fell back to CPU-only (`ngl=0`) once the 19.7 GB model exceeded the RTX 3060 GPU budget.
- `Shadows-MoE-Q6` has `n=2` for every reported group, so SD reflects actual repetition.
- `supergemma4-Q8_0` currently has `n=1`, so `+/- 0.000` means "only one observation", not "perfectly stable".
- The `supergemma4-Q8_0` run is a full GPU-resident Windows CUDA datapoint, not an NVMe spill benchmark.
```powershell
hypura serve .\model.gguf
hypura run .\model.gguf --prompt "Hello"
hypura koboldcpp .\model.gguf
```

Useful follow-up routes:

- Native server default: `http://127.0.0.1:8080`
- KoboldCpp profile default: `http://127.0.0.1:5001`
- Kobold Lite: `http://127.0.0.1:5001/kobold-lite`
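The `/v1/chat/completions` route on the KoboldCpp profile accepts the standard OpenAI chat schema, so a client only needs to build the usual JSON body. A minimal payload-building sketch follows; the model name and `max_tokens` value are placeholders, and actually sending the request is left to curl or any HTTP client once the server is running.

```python
import json

# Build a standard OpenAI-style chat completion request against the
# KoboldCpp profile's default port. "local-model" and max_tokens are
# placeholder values, not Hypura defaults.
def chat_request(prompt, base="http://127.0.0.1:5001"):
    url = f"{base}/v1/chat/completions"
    body = json.dumps({
        "model": "local-model",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    })
    return url, body

url, body = chat_request("Hello")
print(url)
print(body)
```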
```powershell
.\scripts\stop-cargo.ps1
cargo build --release
```

For a Windows CUDA build pinned to an explicit toolkit:

```powershell
.\scripts\stop-cargo.ps1
$env:CUDA_PATH = "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.8"
$env:HYPURA_CUDA = "1"
$env:HYPURA_CUDA_ARCHITECTURES = "86"
cargo build --release
```

If you need a fully isolated CUDA build tree:

```powershell
$env:CARGO_TARGET_DIR = ".\target-cuda128"
cargo build --bin hypura --message-format short
```

Repo layout:

- `src/compute/` - runtime, inference, and storage-tier execution
- `src/scheduler/` - placement and estimation logic
- `src/server/` - native HTTP surface plus compat supervisor/worker layers
- `hypura-sys/` - vendored `llama.cpp` FFI build
- `vendor/llama.cpp/` - upstream runtime dependency
- `docs/compat/` - pinned compatibility manifests and packaged asset manifests
- `hypura-desktop/` - packaged desktop bootstrap shell
- Use `RELEASING.md` for version alignment, stable branch flow, tagging, and GitHub CLI release steps.
- On Windows, stop concurrent `cargo` and `rustc` processes before builds to avoid stale file locks.
- After `llama.cpp` or FFI changes, prefer cleaning `hypura-sys` outputs rather than wiping the entire workspace by default.