Skip to content

Persist per-call LLM token usage to DB (audit table for tokens/cost/TPM) #218

@dratner

Description

@dratner

Problem

We don't persist individual LLM call data. The metrics middleware (pkg/agent/middleware/metrics/internal.go:12) accumulates token counts in-memory keyed by story, and only the aggregated totals get written to stories.tokens_used / stories.cost_usd on story completion (pkg/architect/request.go:645-670).

Two gaps fall out of that:

  1. Retries destroy history. A story that's retried overwrites its aggregate, so we can't see how much work a failed attempt cost or how often retries happen.
  2. No time series. We can't compute TPM/cost-over-time, peak burst rates for rate-limit sizing, or per-agent / per-model cost breakdowns from the DB. Today the only way to get this is by grep'ing logs/maestro.log for LLM call to model 'X': ... request tokens: N, response tokens: M ... lines, which works for one session but isn't queryable, doesn't survive log rotation, and can't be joined to stories/sessions.

PM and architect calls also aren't always associated with a story (bootstrap, spec review, escalations), so per-story aggregation can't represent them at all.

Proposal

Add an llm_calls audit table written from the metrics middleware on every call. One row per LLM invocation.

Suggested columns (everything in this list is already in-memory at log time per pkg/agent/middleware/metrics/middleware.go:84-100):

  • id PK
  • session_id (added by persistence layer, like other tables)
  • created_at
  • model (e.g. claude-opus-4-5, claude-sonnet-4-5, gpt-5.2)
  • agent_id
  • story_id (NULLABLE — PM/architect calls often have none)
  • state (agent state at call time)
  • request_tokens
  • response_tokens
  • total_tokens
  • cost_usd
  • latency_ms
  • success (0/1)
  • temperature

Indexes on (session_id, created_at), (model, created_at), (agent_id), (story_id).

The existing per-story aggregate columns (stories.tokens_used, stories.cost_usd) can stay as a denormalized rollup, or be derived from llm_calls — separate decision.

Why now

Came up while estimating tokens-per-minute for rate-limiter sizing. Had to parse maestro.log because the DB couldn't answer the question. Numbers from one ~2.5h rc3 session (maestro-work/rc3/.maestro/logs/maestro.log, 1,532 LLM calls):

Model Calls Total tokens Active min Avg TPM (active) Peak TPM
claude-opus-4-5 41 254K 20 12,705 46,703
claude-sonnet-4-5 593 3.8M 44 86,622 251,048
gpt-5.2 125 3.1M 41 74,631 727,096

That gpt-5.2 727k-tokens-in-one-minute peak is exactly the kind of thing we'd want to drill into (single huge call? sustained burst?) and we can't, because the data isn't queryable.

Out of scope

  • Backfill from logs.
  • Dashboard / WebUI changes (separate follow-up once the table exists).
  • Changes to the cost calculation itself.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions