feat(serve): long-running daemon for model serving (avoid cold-start)

## Context

Loading Qwen3.5-0.8B weights takes ~2-5s (memory map + Metal buffer upload). For batch workloads (embedding pipelines, RAG, evaluation) this is acceptable. For interactive use it's painful — every CLI invocation pays the load cost.

Ollama solves this with a daemon that holds models in memory and serves multiple requests. Lattice needs the equivalent.

## Goal

A long-running `lattice` daemon process that:
1. Loads models on demand (first request → load, subsequent → reuse)
2. Supports multiple models in memory simultaneously (LRU eviction when GPU/RAM constrained)
3. Streams tokens via stdout, HTTP, or Unix socket
4. Survives multiple client invocations without re-loading

## Proposed design

### Process model
- `lattice serve [--port 8080] [--socket /tmp/lattice.sock]` starts the daemon
- Daemon runs in foreground (Ocean's preference — no systemd entanglement); user can `&` or `nohup` it
- Lock file at `~/.lattice/serve.pid` prevents double-start
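
A minimal sketch of the double-start guard, assuming the lock is a plain PID file created atomically. `PidLock` and `acquire` are hypothetical names, not existing lattice APIs, and stale files left behind by a crash are not handled here.

```rust
use std::fs::{self, OpenOptions};
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Guard that owns the PID file and removes it when dropped.
/// (Hypothetical helper for illustration.)
pub struct PidLock {
    path: PathBuf,
}

impl PidLock {
    /// Atomically create the PID file and write our process id into it.
    /// Returns an `AlreadyExists` error if another daemon holds the lock.
    pub fn acquire(path: &Path) -> io::Result<PidLock> {
        if let Some(parent) = path.parent() {
            fs::create_dir_all(parent)?;
        }
        let mut file = OpenOptions::new()
            .write(true)
            .create_new(true) // fails if the file already exists
            .open(path)?;
        writeln!(file, "{}", std::process::id())?;
        Ok(PidLock {
            path: path.to_path_buf(),
        })
    }
}

impl Drop for PidLock {
    fn drop(&mut self) {
        // Best effort: ignore errors during shutdown.
        let _ = fs::remove_file(&self.path);
    }
}
```

`lattice serve` would call `PidLock::acquire` on the `serve.pid` path at startup and keep the guard alive for the lifetime of the process. A real implementation would probably also check whether the PID recorded in an existing file is still running before refusing to start.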

### Storage
- Loaded models held in an `Arc<RwLock<HashMap<ModelId, LoadedModel>>>`
- LRU eviction when total VRAM > configured limit (default: 80% of detected Metal allocator size)
- Per-model KV cache pool (issue #93)
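
The eviction policy could look roughly like the sketch below. It assumes each model reports its size in bytes and uses a logical clock for recency. `ModelCache`, `Entry`, and the `bytes` field are illustrative names, and `LoadedModel` here is only a stand-in for the real handle.

```rust
use std::collections::HashMap;

pub type ModelId = String;

/// Stand-in for the real loaded-model handle; only its size matters here.
pub struct LoadedModel {
    pub bytes: u64,
}

struct Entry {
    model: LoadedModel,
    last_used: u64, // logical clock value of the most recent access
}

pub struct ModelCache {
    entries: HashMap<ModelId, Entry>,
    limit_bytes: u64,
    used_bytes: u64,
    clock: u64,
}

impl ModelCache {
    pub fn new(limit_bytes: u64) -> Self {
        ModelCache { entries: HashMap::new(), limit_bytes, used_bytes: 0, clock: 0 }
    }

    pub fn used_bytes(&self) -> u64 {
        self.used_bytes
    }

    /// Look up a model and mark it as most recently used.
    pub fn get(&mut self, id: &str) -> Option<&LoadedModel> {
        self.clock += 1;
        let entry = self.entries.get_mut(id)?;
        entry.last_used = self.clock;
        Some(&entry.model)
    }

    /// Insert a model, then evict least-recently-used entries until the
    /// total fits the limit. Returns the ids that were evicted.
    pub fn insert(&mut self, id: ModelId, model: LoadedModel) -> Vec<ModelId> {
        self.clock += 1;
        self.used_bytes += model.bytes;
        let entry = Entry { model, last_used: self.clock };
        if let Some(old) = self.entries.insert(id.clone(), entry) {
            self.used_bytes -= old.model.bytes;
        }
        let mut evicted = Vec::new();
        while self.used_bytes > self.limit_bytes {
            // Pick the stalest entry, never the one just inserted.
            let victim = self
                .entries
                .iter()
                .filter(|(k, _)| **k != id)
                .min_by_key(|(_, e)| e.last_used)
                .map(|(k, _)| k.clone());
            match victim {
                Some(k) => {
                    let e = self.entries.remove(&k).expect("victim exists");
                    self.used_bytes -= e.model.bytes;
                    evicted.push(k);
                }
                None => break, // only the new model remains; nothing to evict
            }
        }
        evicted
    }
}
```

Because `get` updates recency, it needs `&mut self`, so wrapping this in the `Arc<RwLock<...>>` would take the write lock on every lookup. If that turns out to matter, storing `last_used` in an `AtomicU64` would let lookups use the read lock instead.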

### Protocol
- v1: simple line protocol over a Unix socket (newline-delimited JSON requests, SSE-style token stream responses). Easy to debug with `nc -U`.
- v2: HTTP + OpenAI API (issue #92).
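
To make the v1 framing concrete, here is a rough sketch of both directions. Everything specific in it is an assumption for illustration: the `model` and `prompt` field names, the `data: ` line prefix, and the `[DONE]` end-of-stream sentinel are not defined anywhere in this issue.

```rust
use std::io::{self, BufRead};

/// Escape a string for embedding inside a JSON string literal.
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Build one newline-terminated JSON request (field names are assumed).
pub fn encode_request(model: &str, prompt: &str) -> String {
    format!(
        "{{\"model\":\"{}\",\"prompt\":\"{}\"}}\n",
        json_escape(model),
        json_escape(prompt)
    )
}

#[derive(Debug, PartialEq)]
pub enum StreamEvent {
    Token(String),
    Done,
}

/// Parse one SSE-style response line of the form `data: <payload>`.
/// Returns `None` for blank separator lines and anything unrecognized.
pub fn parse_event(line: &str) -> Option<StreamEvent> {
    let trimmed = line.trim_end_matches(|c: char| c == '\r' || c == '\n');
    let payload = trimmed.strip_prefix("data: ")?;
    if payload == "[DONE]" {
        Some(StreamEvent::Done)
    } else {
        Some(StreamEvent::Token(payload.to_string()))
    }
}

/// Collect tokens from a response stream until the sentinel or EOF.
pub fn read_stream<R: BufRead>(reader: R) -> io::Result<Vec<String>> {
    let mut tokens = Vec::new();
    for line in reader.lines() {
        match parse_event(&line?) {
            Some(StreamEvent::Token(t)) => tokens.push(t),
            Some(StreamEvent::Done) => break,
            None => {} // skip separators
        }
    }
    Ok(tokens)
}
```

A client would write the output of `encode_request` to a `std::os::unix::net::UnixStream` and pass a `BufReader` over the same stream to `read_stream`. Raw token payloads break the framing as soon as a token contains a newline, so the real protocol would likely need each payload to be JSON-encoded as well.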

### Concurrency
- Single inference at a time per model (Metal isn't trivially shareable across threads)
- Inference on different models can run concurrently if the models fit in memory together
- Request queue with backpressure (`429` if queue full)
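
One way to get both properties (one inference at a time per model, rejection when the queue is full) is a bounded channel per model drained by a single worker. The sketch below uses only the standard library. `ModelQueue`, `Request`, and `SubmitError` are hypothetical names, and mapping `QueueFull` to a `429` is left to the protocol layer.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};

/// Minimal request type for illustration.
pub struct Request {
    pub prompt: String,
}

#[derive(Debug, PartialEq)]
pub enum SubmitError {
    /// The bounded queue is at capacity; the caller should report backpressure.
    QueueFull,
    /// The worker for this model has exited.
    WorkerGone,
}

/// Producer side of one model's request queue.
pub struct ModelQueue {
    tx: SyncSender<Request>,
}

impl ModelQueue {
    /// Create a queue holding at most `capacity` pending requests.
    pub fn new(capacity: usize) -> (ModelQueue, Receiver<Request>) {
        let (tx, rx) = sync_channel(capacity);
        (ModelQueue { tx }, rx)
    }

    /// Enqueue without blocking; fail fast when the queue is full.
    pub fn submit(&self, req: Request) -> Result<(), SubmitError> {
        self.tx.try_send(req).map_err(|e| match e {
            TrySendError::Full(_) => SubmitError::QueueFull,
            TrySendError::Disconnected(_) => SubmitError::WorkerGone,
        })
    }
}

/// Drain the queue one request at a time, so each model runs a single
/// inference at once. Returns when every sender has been dropped.
pub fn run_worker(rx: Receiver<Request>, mut infer: impl FnMut(Request)) {
    for req in rx {
        infer(req);
    }
}
```

Each loaded model would get its own `ModelQueue` and a dedicated thread running `run_worker`, which lets different models make progress in parallel. Because the worker stops only after all senders are dropped and the queue is empty, dropping the queues on SIGTERM also gives a natural point to finish in-flight work before exiting.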

## Acceptance

- `lattice serve` + `lattice chat ... --daemon` round-trip works
- Sub-second response start for cached models (vs current ~2-5s cold load)
- Survives 100 sequential requests without leaking
- Handles SIGTERM gracefully (finish in-flight request, then exit)

## Priority

P1 — blocks daily usage. Tied to #91 (CLI) and #93 (OpenAI API).

## Related

- #90 — CLI entry point (`lattice serve`)
- #92 — OpenAI-compatible HTTP API on top
- #84 — cross-framework bench; daemon makes lattice fair to compare against ollama
