Context
Lattice has individual binaries (chat_metal, bench_decode_ab, quantize_q4, backfill_qwen3) but no unified user-facing CLI. To use lattice today, a consumer needs to know which binary does what and how to wire up the right features and env vars for each one.
Comparable tools have a single entry point:
ollama run qwen3.5:0.8b "hello"
llama-cli -m model.gguf -p "hello"
mlx_lm.generate --model Qwen/... --prompt "hello"
Goal
A single lattice binary with subcommands covering the common user flows.
Proposed surface (v1)
# Model management
lattice pull qwen3.5-0.8b # download from HF to ~/.lattice/models
lattice list # list local models with sizes
lattice rm qwen3.5-0.8b # delete local model
# Inference
lattice chat qwen3.5-0.8b # interactive REPL
lattice chat qwen3.5-0.8b -p "hello" # one-shot
lattice complete qwen3.5-0.8b -p "Once" # raw completion (no chat template)
lattice embed bge-small-en -t "hello" # embedding output
# Quantization
lattice quantize qwen3.5-0.8b --to q4 # safetensors output
lattice quantize qwen3.5-0.8b --to q4-quarot # QuaRot variant
# Server (depends on #92)
lattice serve --model qwen3.5-0.8b --port 8080
# Inspection
lattice info qwen3.5-0.8b # architecture, params, dtype
lattice bench qwen3.5-0.8b # decode tok/s, latency p50/p99
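The `bench` subcommand promises decode tok/s plus p50/p99 latency. Below is a minimal sketch of how those numbers could be derived. It assumes the benchmark records one wall-clock duration per generated token and uses the nearest-rank percentile definition. `BenchSummary`, `percentile`, and `summarize` are hypothetical names, not existing lattice APIs.

```rust
use std::time::Duration;

/// Hypothetical result shape for `lattice bench`.
#[derive(Debug, PartialEq)]
struct BenchSummary {
    tokens_per_sec: f64,
    p50: Duration,
    p99: Duration,
}

/// Nearest-rank percentile over an already sorted, non-empty slice.
fn percentile(sorted: &[Duration], p: f64) -> Duration {
    let n = sorted.len();
    // Nearest-rank: rank = ceil(p * n / 100), clamped to [1, n], then 0-based index.
    let rank = (p * n as f64 / 100.0).ceil() as usize;
    sorted[rank.clamp(1, n) - 1]
}

/// Summarize per-token decode latencies (one sample per generated token).
/// Returns None when there are no samples.
fn summarize(per_token: &[Duration]) -> Option<BenchSummary> {
    if per_token.is_empty() {
        return None;
    }
    let mut sorted = per_token.to_vec();
    sorted.sort();
    let total: Duration = per_token.iter().sum();
    Some(BenchSummary {
        // Throughput = tokens generated / total decode time in seconds.
        tokens_per_sec: per_token.len() as f64 / total.as_secs_f64(),
        p50: percentile(&sorted, 50.0),
        p99: percentile(&sorted, 99.0),
    })
}

fn main() {
    // Fake timings for illustration: 1 ms, 2 ms, ..., 100 ms.
    let samples: Vec<Duration> = (1..=100).map(Duration::from_millis).collect();
    let s = summarize(&samples).unwrap();
    println!("{:.2} tok/s, p50 {:?}, p99 {:?}", s.tokens_per_sec, s.p50, s.p99);
}
```

Nearest-rank always reports a latency that was actually observed rather than an interpolated value. With fewer than 100 samples, p99 collapses to the maximum, so a real `bench` run would want enough decode steps for p99 to be meaningful.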
Implementation notes
- Use clap derive macros (already in workspace deps)
- Default model dir: ~/.lattice/models (env override: LATTICE_MODEL_CACHE)
- Auto-pull on first use if the model is not local (with --no-pull to disable)
- Per-subcommand --help with examples
- Single binary in crates/inference/src/bin/lattice.rs (or a new crates/cli crate)
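The model-directory and auto-pull notes can be expressed as two small functions. This is a sketch under stated assumptions: a Unix-like `HOME` variable, one subdirectory per model name under the cache dir, and an empty `LATTICE_MODEL_CACHE` treated as unset. `model_cache_dir`, `ModelAction`, and `plan_model` are hypothetical names, and the actual download is left out.

```rust
use std::path::{Path, PathBuf};

/// Resolve the model cache dir: a non-empty LATTICE_MODEL_CACHE wins,
/// otherwise $HOME/.lattice/models. Env values are passed in for testability.
fn model_cache_dir(override_dir: Option<String>, home: Option<String>) -> Option<PathBuf> {
    match override_dir.filter(|s| !s.is_empty()) {
        Some(dir) => Some(PathBuf::from(dir)),
        None => home.map(|h| Path::new(&h).join(".lattice").join("models")),
    }
}

/// What the CLI should do before loading a model (hypothetical).
#[derive(Debug, PartialEq)]
enum ModelAction {
    /// Model directory already exists locally.
    UseLocal(PathBuf),
    /// Model is missing; download it into this directory.
    Pull(PathBuf),
}

/// Decide between using a local model, pulling it, or failing under --no-pull.
/// Assumes each model lives in `<cache>/<model-name>/`.
fn plan_model(cache: &Path, model: &str, no_pull: bool) -> Result<ModelAction, String> {
    let dir = cache.join(model);
    if dir.is_dir() {
        Ok(ModelAction::UseLocal(dir))
    } else if no_pull {
        Err(format!("model '{model}' not found in {} (--no-pull is set)", cache.display()))
    } else {
        Ok(ModelAction::Pull(dir))
    }
}

fn main() {
    let cache = model_cache_dir(
        std::env::var("LATTICE_MODEL_CACHE").ok(),
        std::env::var("HOME").ok(),
    )
    .expect("neither LATTICE_MODEL_CACHE nor HOME is set");
    match plan_model(&cache, "qwen3.5-0.8b", false) {
        Ok(action) => println!("{action:?}"),
        Err(msg) => eprintln!("{msg}"),
    }
}
```

Keeping the env lookups outside `model_cache_dir` lets the override rule be tested without mutating the process environment.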
Acceptance
- lattice --help shows all subcommands
- lattice chat qwen3.5-0.8b -p "hi" works end-to-end on a fresh machine (downloads, loads, generates)
- Documented in README as the primary entry point
- Existing binaries (chat_metal, bench_decode_ab) become aliases or get deprecated
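If the existing binaries become aliases, one common approach is argv[0] dispatch: install chat_metal and bench_decode_ab as symlinks to lattice and rewrite the arguments based on the invoked name. The sketch below assumes chat_metal maps to `lattice chat` and bench_decode_ab to `lattice bench`, and that legacy flags pass through unchanged, which may not hold for the current binaries. `legacy_subcommand` and `rewrite_legacy_argv` are hypothetical names.

```rust
use std::path::Path;

/// Hypothetical mapping from legacy binary names to new subcommands.
fn legacy_subcommand(bin_name: &str) -> Option<&'static str> {
    match bin_name {
        "chat_metal" => Some("chat"),
        "bench_decode_ab" => Some("bench"),
        _ => None,
    }
}

/// If invoked under a legacy name (e.g. via symlink), rewrite argv into the
/// `lattice <subcommand> ...` form and return a deprecation warning.
fn rewrite_legacy_argv(args: &[String]) -> (Vec<String>, Option<String>) {
    // Use the file stem of argv[0] so "/usr/local/bin/chat_metal" matches too.
    let bin_name = args
        .first()
        .and_then(|a| Path::new(a).file_stem())
        .and_then(|s| s.to_str())
        .unwrap_or("");
    match legacy_subcommand(bin_name) {
        Some(sub) => {
            let mut out = vec!["lattice".to_string(), sub.to_string()];
            // Pass the remaining arguments through unchanged (an assumption).
            out.extend(args.iter().skip(1).cloned());
            let warn = format!("warning: `{bin_name}` is deprecated; use `lattice {sub}` instead");
            (out, Some(warn))
        }
        None => (args.to_vec(), None),
    }
}

fn main() {
    let args: Vec<String> = std::env::args().collect();
    let (argv, warning) = rewrite_legacy_argv(&args);
    if let Some(w) = warning {
        eprintln!("{w}");
    }
    // A real binary would hand `argv` to the clap parser here.
    println!("{argv:?}");
}
```

Printing the warning on stderr keeps stdout clean for scripts that still call the old names during the deprecation window.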
Priority
P1: this gates real user adoption. Today, lattice is usable only by Rust developers who already know the crate API.
Related
- lattice serve is the CLI front-end for it