Context
Loading Qwen3.5-0.8B weights takes ~2-5s (memory map + Metal buffer upload). For batch workloads (embedding pipelines, RAG, evaluation) this is acceptable. For interactive use it's painful — every CLI invocation pays the load cost.
Ollama solves this with a daemon that holds models in memory and serves multiple requests. Lattice needs the equivalent.
Goal
A long-running lattice daemon process that:
- Loads models on demand (first request → load, subsequent → reuse)
- Supports multiple models in memory simultaneously (LRU eviction when GPU/RAM constrained)
- Streams tokens via stdout, HTTP, or Unix socket
- Survives multiple client invocations without re-loading
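The load-on-demand and LRU eviction behaviour above could be sketched roughly as follows. This is a minimal illustration, not the planned implementation: `ModelCache`, `LoadedModel`, the byte budget, and the logical clock are all hypothetical names and choices, and a real cache would need to know a model's size before loading it rather than after.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a loaded model; the real type would own weights and Metal buffers.
#[derive(Debug)]
pub struct LoadedModel {
    pub name: String,
    pub bytes: u64,
}

/// Keeps models resident up to a byte budget and evicts the least recently used.
pub struct ModelCache {
    budget_bytes: u64,
    used_bytes: u64,
    tick: u64, // logical clock; a larger value means more recently used
    entries: HashMap<String, (LoadedModel, u64)>,
}

impl ModelCache {
    pub fn new(budget_bytes: u64) -> Self {
        Self { budget_bytes, used_bytes: 0, tick: 0, entries: HashMap::new() }
    }

    /// Returns the cached model, calling `load` only on a miss.
    pub fn get_or_load<F>(&mut self, name: &str, load: F) -> &LoadedModel
    where
        F: FnOnce() -> LoadedModel,
    {
        self.tick += 1;
        let now = self.tick;
        if !self.entries.contains_key(name) {
            let model = load();
            // Evict least recently used entries until the new model fits.
            // A model larger than the whole budget is still admitted once the cache is empty.
            while self.used_bytes + model.bytes > self.budget_bytes && !self.entries.is_empty() {
                self.evict_lru();
            }
            self.used_bytes += model.bytes;
            self.entries.insert(name.to_string(), (model, now));
        }
        let entry = self.entries.get_mut(name).expect("entry was just ensured");
        entry.1 = now; // mark as most recently used
        &entry.0
    }

    fn evict_lru(&mut self) {
        let oldest = self
            .entries
            .iter()
            .min_by_key(|(_, (_, last_used))| *last_used)
            .map(|(key, _)| key.clone());
        if let Some(key) = oldest {
            if let Some((model, _)) = self.entries.remove(&key) {
                self.used_bytes -= model.bytes;
            }
        }
    }

    pub fn is_loaded(&self, name: &str) -> bool {
        self.entries.contains_key(name)
    }
}
```

The linear scan in `evict_lru` is fine for a handful of resident models; a dedicated LRU structure would only matter at much larger counts.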
Proposed design
Process model
- lattice serve [--port 8080] [--socket /tmp/lattice.sock] starts the daemon
- Daemon runs in the foreground (Ocean's preference: no systemd entanglement); the user can background it with & or nohup
- Lock file at ~/.lattice/serve.pid prevents double-start
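One way the double-start guard might work is an atomically created PID file. The sketch below assumes a hypothetical `PidLock` type and uses only the standard library; it does not detect stale files left behind by a crash, which a real daemon would need to handle.

```rust
use std::fs::{self, OpenOptions};
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Holds the lock for the daemon's lifetime and removes the file on drop.
pub struct PidLock {
    path: PathBuf,
}

impl PidLock {
    /// Fails with `ErrorKind::AlreadyExists` if the lock file is already present.
    pub fn acquire(path: &Path) -> io::Result<PidLock> {
        if let Some(dir) = path.parent() {
            fs::create_dir_all(dir)?;
        }
        // `create_new` is atomic: it errors instead of opening an existing file.
        let mut file = OpenOptions::new().write(true).create_new(true).open(path)?;
        writeln!(file, "{}", std::process::id())?;
        Ok(PidLock { path: path.to_path_buf() })
    }
}

impl Drop for PidLock {
    fn drop(&mut self) {
        // Best effort; a crash leaves a stale file that this sketch does not detect.
        let _ = fs::remove_file(&self.path);
    }
}
```

Keeping the `PidLock` value alive in the daemon's main function ties the file's lifetime to the process on a normal exit.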
Storage
- Arc<RwLock<HashMap<ModelId, LoadedModel>>>
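A shared table of loaded models along the lines of Arc<RwLock<HashMap<ModelId, LoadedModel>>> might be accessed as sketched below. The issue does not define `ModelId` or `LoadedModel`, so both definitions here are placeholders, and the sketch deliberately varies the type by storing `Arc<LoadedModel>` values so a caller can keep using a model after releasing the lock.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

// Hypothetical definitions; the issue names these types but does not define them.
pub type ModelId = String;

pub struct LoadedModel {
    pub id: ModelId,
}

/// Variation on the proposed type: values are `Arc`s so they outlive the lock guard.
pub type Registry = Arc<RwLock<HashMap<ModelId, Arc<LoadedModel>>>>;

/// Returns the resident model, or loads it; `load` runs at most once per missing id.
pub fn get_or_load<F>(registry: &Registry, id: &str, load: F) -> Arc<LoadedModel>
where
    F: FnOnce() -> LoadedModel,
{
    // Fast path: shared read lock, so concurrent lookups do not block each other.
    if let Some(model) = registry.read().unwrap().get(id) {
        return Arc::clone(model);
    }
    // Slow path: exclusive write lock. `entry` re-checks the key, because another
    // thread may have loaded the model between the two lock acquisitions.
    let mut map = registry.write().unwrap();
    let model = map.entry(id.to_string()).or_insert_with(|| Arc::new(load()));
    Arc::clone(model)
}
```

Loading while holding the write lock keeps the sketch short but blocks every reader for the duration of the load; a per-model loading state would avoid that.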
Protocol
v1: simple line-protocol over Unix socket (newline-delimited JSON requests + SSE-style token stream responses). Easy to debug with nc.
v2: HTTP + OpenAI API (issue #92).
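The v1 framing could look something like the sketch below: one JSON request per line in, one SSE-style `data:` event per token out. The field name `token`, the `[DONE]` terminator, and the functions `serve_one` and `request` are assumptions made for illustration; the issue does not specify a schema. The request line is treated as opaque here, whereas a real daemon would parse it with a JSON library. Unix-only, standard library only.

```rust
use std::io::{self, BufRead, BufReader, Write};
use std::os::unix::net::UnixStream;

/// Escapes a string for embedding in a JSON string literal (minimal subset).
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Server side: reads a single request line, then streams one event per token.
/// Returns the request line it received (without the trailing newline).
pub fn serve_one<I>(stream: UnixStream, tokens: I) -> io::Result<String>
where
    I: IntoIterator<Item = String>,
{
    let mut reader = BufReader::new(stream.try_clone()?);
    let mut request = String::new();
    reader.read_line(&mut request)?; // newline-delimited: one request per line
    let mut writer = stream;
    for token in tokens {
        // SSE-style framing: a `data:` line followed by a blank line.
        write!(writer, "data: {{\"token\":\"{}\"}}\n\n", json_escape(&token))?;
        writer.flush()?;
    }
    write!(writer, "data: [DONE]\n\n")?; // assumed end-of-stream marker
    Ok(request.trim_end().to_string())
}

/// Client side: sends one request line and collects event payloads until the terminator.
pub fn request(stream: UnixStream, json_line: &str) -> io::Result<Vec<String>> {
    let mut writer = stream.try_clone()?;
    writeln!(writer, "{}", json_line)?;
    let mut events = Vec::new();
    for line in BufReader::new(stream).lines() {
        let line = line?;
        if let Some(payload) = line.strip_prefix("data: ") {
            if payload == "[DONE]" {
                break;
            }
            events.push(payload.to_string());
        }
    }
    Ok(events)
}
```

Because every frame is plain text, the same exchange can be reproduced by hand with nc against the socket.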
Concurrency
- Single inference at a time per model (Metal isn't trivially shareable across threads)
- Multiple model serves can run concurrently if they fit
- Request queue with backpressure (429 if queue full)
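The queueing rules above map fairly directly onto a bounded channel per model with a single consumer. The following is a sketch under that assumption; `Request`, `Admission`, `ModelQueue`, and `run_worker` are hypothetical names, and the capacity should be at least 1 because a zero-capacity `sync_channel` only accepts a send while a receiver is already waiting.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};

/// Hypothetical request type; a real one would also carry a reply channel.
pub struct Request {
    pub prompt: String,
}

/// Outcome of trying to enqueue a request.
#[derive(Debug, PartialEq)]
pub enum Admission {
    Accepted,        // queued for the model's single worker
    TooManyRequests, // queue full; maps to a 429 response
    Unavailable,     // worker gone, e.g. during shutdown
}

pub struct ModelQueue {
    tx: SyncSender<Request>,
}

impl ModelQueue {
    /// Creates a queue holding at most `capacity` waiting requests.
    pub fn new(capacity: usize) -> (ModelQueue, Receiver<Request>) {
        let (tx, rx) = sync_channel(capacity);
        (ModelQueue { tx }, rx)
    }

    /// Never blocks: a full queue is reported to the caller instead of waiting.
    pub fn submit(&self, req: Request) -> Admission {
        match self.tx.try_send(req) {
            Ok(()) => Admission::Accepted,
            Err(TrySendError::Full(_)) => Admission::TooManyRequests,
            Err(TrySendError::Disconnected(_)) => Admission::Unavailable,
        }
    }
}

/// One worker per model drains its queue sequentially, so at most one
/// inference runs per model at any time. Returns once all senders are dropped.
pub fn run_worker<F: FnMut(Request)>(rx: Receiver<Request>, mut infer: F) {
    for req in rx {
        infer(req);
    }
}
```

Running one `run_worker` thread per loaded model gives concurrency across models while keeping each model's inference serialized.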
Acceptance
- lattice serve + lattice chat ... --daemon round-trip works
- Sub-second response start for cached models (vs current ~2-5s cold load)
- Survives 100 sequential requests without leaking
- Handles SIGTERM gracefully (finish in-flight request, then exit)
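The graceful-shutdown criterion could be met with a shared flag that the serving loop checks only between requests. The sketch below assumes a hypothetical `serve_until_shutdown` function and leaves out the signal wiring to stay dependency-free; in practice the flag would be set from a SIGTERM handler, for example through the signal-hook crate.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::sync::Arc;
use std::time::Duration;

/// Processes jobs until `shutdown` is set, then returns the number of jobs completed.
/// A job that has started always runs to completion; jobs still queued are left unprocessed.
pub fn serve_until_shutdown<J, F>(
    rx: Receiver<J>,
    shutdown: Arc<AtomicBool>,
    mut handle: F,
) -> usize
where
    F: FnMut(J),
{
    let mut completed = 0;
    // The flag is checked only between jobs, never in the middle of one.
    while !shutdown.load(Ordering::SeqCst) {
        match rx.recv_timeout(Duration::from_millis(50)) {
            Ok(job) => {
                handle(job);
                completed += 1;
            }
            Err(RecvTimeoutError::Timeout) => continue, // idle; re-check the flag
            Err(RecvTimeoutError::Disconnected) => break, // all senders dropped
        }
    }
    completed
}
```

The 50 ms poll interval bounds how long an idle daemon takes to notice the flag; it is an arbitrary value chosen for the sketch.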
Priority
P1 — blocks daily usage. Tied to #91 (CLI) and #93 (OpenAI API).
Related