feat(serve): long-running daemon for model serving (avoid cold-start) #92

@ohdearquant

Context

Loading Qwen3.5-0.8B weights takes ~2-5s (memory map + Metal buffer upload). For batch workloads (embedding pipelines, RAG, evaluation) this is acceptable. For interactive use it's painful — every CLI invocation pays the load cost.

Ollama solves this with a daemon that holds models in memory and serves multiple requests. Lattice needs the equivalent.

Goal

A long-running lattice daemon process that:

  1. Loads models on demand (first request → load, subsequent → reuse)
  2. Supports multiple models in memory simultaneously (LRU eviction when GPU/RAM constrained)
  3. Streams tokens via stdout, HTTP, or Unix socket
  4. Survives multiple client invocations without re-loading
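
Points 1 and 2 amount to a load-on-demand cache with LRU eviction. The following is a minimal sketch of that idea, not Lattice's implementation: `ModelCache`, the `loader` callable, and the byte budget are all hypothetical names chosen for illustration, and a real daemon would measure GPU/RAM pressure rather than rely on a fixed budget.

```python
from collections import OrderedDict
from typing import Callable, List, Tuple


class ModelCache:
    """Keeps loaded models in memory and evicts the least recently used one."""

    def __init__(self, budget_bytes: int, loader: Callable[[str], Tuple[object, int]]):
        self.budget_bytes = budget_bytes
        self.loader = loader  # hypothetical: name -> (model, size_in_bytes)
        self.used_bytes = 0
        self._models: "OrderedDict[str, Tuple[object, int]]" = OrderedDict()

    def get(self, name: str) -> object:
        if name in self._models:
            # Warm path: mark as most recently used and reuse the loaded model.
            self._models.move_to_end(name)
            return self._models[name][0]
        # Cold path: pay the load cost once.
        model, size = self.loader(name)
        # Evict least recently used models until the new one fits the budget.
        while self._models and self.used_bytes + size > self.budget_bytes:
            _, (_, evicted_size) = self._models.popitem(last=False)
            self.used_bytes -= evicted_size
        self._models[name] = (model, size)
        self.used_bytes += size
        return model

    def loaded(self) -> List[str]:
        """Model names ordered from least to most recently used."""
        return list(self._models)
```

In this sketch a model larger than the whole budget is still loaded after everything else has been evicted; whether the daemon should instead reject such a request is a design decision the issue leaves open.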

Proposed design

Process model

  • lattice serve [--port 8080] [--socket /tmp/lattice.sock] starts the daemon
  • Daemon runs in the foreground (Ocean's preference: no systemd entanglement); users can background it with & or nohup
  • Lock file at ~/.lattice/serve.pid prevents double-start
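
One way the PID lock file could prevent a double start is sketched below. It assumes a POSIX system and that the file holds only the daemon's PID as text; `acquire_pid_lock` and `_owner_is_alive` are hypothetical helpers, not existing Lattice functions.

```python
import os
from pathlib import Path


def acquire_pid_lock(path: Path) -> bool:
    """Return True if this process now owns the lock, False if a live daemon holds it."""
    path.parent.mkdir(parents=True, exist_ok=True)
    try:
        # O_EXCL makes creation atomic: it fails if the file already exists.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except FileExistsError:
        if _owner_is_alive(path):
            return False
        # Stale lock left behind by a daemon that crashed: remove and retry.
        path.unlink(missing_ok=True)
        return acquire_pid_lock(path)
    with os.fdopen(fd, "w") as f:
        f.write(str(os.getpid()))
    return True


def _owner_is_alive(path: Path) -> bool:
    """Check whether the PID recorded in the lock file belongs to a running process."""
    try:
        pid = int(path.read_text().strip())
        os.kill(pid, 0)  # signal 0 delivers nothing; it only checks the PID exists
    except (ValueError, FileNotFoundError, ProcessLookupError):
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True
```

This sketch has known gaps: a second process that reads the file between its creation and the PID write will treat it as stale, and a recycled PID can make a dead daemon look alive. An advisory lock (for example `fcntl.flock`) held for the daemon's lifetime would avoid both, at the cost of being less easy to inspect by hand.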

Storage

Protocol

v1: simple line protocol over a Unix socket (newline-delimited JSON requests + SSE-style token stream responses). Easy to debug with nc.
v2: HTTP + OpenAI API (issue #93).
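
To make the v1 framing concrete, here is a rough sketch of how requests and token streams might be encoded. The field names (`model`, `prompt`, `token`) and the `[DONE]` end-of-stream sentinel are assumptions for illustration; the issue does not specify a schema.

```python
import json
from typing import Iterable, Iterator


def encode_request(model: str, prompt: str) -> bytes:
    """One request is one JSON object on one line."""
    return (json.dumps({"model": model, "prompt": prompt}) + "\n").encode("utf-8")


def encode_stream(tokens: Iterable[str]) -> Iterator[bytes]:
    """Yield SSE-style frames: a 'data: ' line followed by a blank line."""
    for token in tokens:
        payload = json.dumps({"token": token})  # JSON escapes embedded newlines
        yield ("data: " + payload + "\n\n").encode("utf-8")
    yield b"data: [DONE]\n\n"  # assumed end-of-stream sentinel


def decode_stream(raw: bytes) -> Iterator[str]:
    """Client side: recover tokens from a fully buffered stream."""
    for frame in raw.decode("utf-8").split("\n\n"):
        if not frame.startswith("data: "):
            continue
        payload = frame[len("data: "):]
        if payload == "[DONE]":
            return
        yield json.loads(payload)["token"]
```

Because every frame is plain text, a request built this way could be piped into the socket with nc and the response read by eye, which is the debugging property the v1 design is after.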

Concurrency

  • Single inference at a time per model (Metal isn't trivially shareable across threads)
  • Multiple models can serve requests concurrently if they all fit in memory
  • Request queue with backpressure (429 if queue full)
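
The concurrency rules above could be approximated with one worker thread and one bounded queue per model. This is a sketch under assumptions: `ModelWorker`, the `infer` callable, and the 202 "accepted" status are hypothetical; only the 429 on a full queue comes from the proposal.

```python
import queue
import threading
from typing import Callable, List

QUEUE_FULL = 429  # from the proposal: returned when the backlog is full
ACCEPTED = 202    # assumed status for "queued"; not specified in the issue


class ModelWorker:
    """One worker thread per model, so at most one inference runs at a time."""

    def __init__(self, infer: Callable[[str], str], max_pending: int):
        self._infer = infer  # hypothetical inference callable
        self._pending: "queue.Queue[str]" = queue.Queue(maxsize=max_pending)
        self.results: List[str] = []
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self) -> None:
        self._thread.start()

    def submit(self, prompt: str) -> int:
        """Enqueue without blocking; reject immediately when the queue is full."""
        try:
            self._pending.put_nowait(prompt)
        except queue.Full:
            return QUEUE_FULL
        return ACCEPTED

    def _run(self) -> None:
        while True:
            prompt = self._pending.get()  # blocks until work arrives
            self.results.append(self._infer(prompt))
            self._pending.task_done()

    def drain(self) -> None:
        """Block until every queued request has been processed."""
        self._pending.join()
```

Running one `ModelWorker` per loaded model gives per-model serialization while still letting different models make progress in parallel.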

Acceptance

  • lattice serve + lattice chat ... --daemon round-trip works
  • Sub-second response start for cached models (vs current ~2-5s cold load)
  • Survives 100 sequential requests without leaking memory
  • Handles SIGTERM gracefully (finish in-flight request, then exit)

Priority

P1 — blocks daily usage. Tied to #91 (CLI) and #93 (OpenAI API).

Related
