Problem
We don't persist individual LLM call data. The metrics middleware (pkg/agent/middleware/metrics/internal.go:12) accumulates token counts in-memory keyed by story, and only the aggregated totals get written to stories.tokens_used / stories.cost_usd on story completion (pkg/architect/request.go:645-670).
Two gaps fall out of that:
- Retries destroy history. A story that's retried overwrites its aggregate, so we can't see how much work a failed attempt cost or how often retries happen.
- No time series. We can't compute TPM/cost-over-time, peak burst rates for rate-limit sizing, or per-agent / per-model cost breakdowns from the DB. Today the only way to get this is by grep'ing
logs/maestro.log for LLM call to model 'X': ... request tokens: N, response tokens: M ... lines, which works for one session but isn't queryable, doesn't survive log rotation, and can't be joined to stories/sessions.
PM and architect calls also aren't always associated with a story (bootstrap, spec review, escalations), so per-story aggregation can't represent them at all.
Proposal
Add an llm_calls audit table written from the metrics middleware on every call. One row per LLM invocation.
Suggested columns (everything in this list is already in-memory at log time per pkg/agent/middleware/metrics/middleware.go:84-100):
id PK
session_id (added by persistence layer, like other tables)
created_at
model (e.g. claude-opus-4-5, claude-sonnet-4-5, gpt-5.2)
agent_id
story_id (NULLABLE — PM/architect calls often have none)
state (agent state at call time)
request_tokens
response_tokens
total_tokens
cost_usd
latency_ms
success (0/1)
temperature
Indexes on (session_id, created_at), (model, created_at), (agent_id), (story_id).
The existing per-story aggregate columns (stories.tokens_used, stories.cost_usd) can stay as a denormalized rollup, or be derived from llm_calls — separate decision.
Why now
Came up while estimating tokens-per-minute for rate-limiter sizing. Had to parse maestro.log because the DB couldn't answer the question. Numbers from one ~2.5h rc3 session (maestro-work/rc3/.maestro/logs/maestro.log, 1,532 LLM calls):
| Model |
Calls |
Total tokens |
Active min |
Avg TPM (active) |
Peak TPM |
| claude-opus-4-5 |
41 |
254K |
20 |
12,705 |
46,703 |
| claude-sonnet-4-5 |
593 |
3.8M |
44 |
86,622 |
251,048 |
| gpt-5.2 |
125 |
3.1M |
41 |
74,631 |
727,096 |
That gpt-5.2 727k-tokens-in-one-minute peak is exactly the kind of thing we'd want to drill into (single huge call? sustained burst?) and we can't, because the data isn't queryable.
Out of scope
- Backfill from logs.
- Dashboard / WebUI changes (separate follow-up once the table exists).
- Changes to the cost calculation itself.
Problem
We don't persist individual LLM call data. The metrics middleware (
pkg/agent/middleware/metrics/internal.go:12) accumulates token counts in-memory keyed by story, and only the aggregated totals get written tostories.tokens_used/stories.cost_usdon story completion (pkg/architect/request.go:645-670).Two gaps fall out of that:
logs/maestro.logforLLM call to model 'X': ... request tokens: N, response tokens: M ...lines, which works for one session but isn't queryable, doesn't survive log rotation, and can't be joined to stories/sessions.PM and architect calls also aren't always associated with a story (bootstrap, spec review, escalations), so per-story aggregation can't represent them at all.
Proposal
Add an
llm_callsaudit table written from the metrics middleware on every call. One row per LLM invocation.Suggested columns (everything in this list is already in-memory at log time per
pkg/agent/middleware/metrics/middleware.go:84-100):idPKsession_id(added by persistence layer, like other tables)created_atmodel(e.g.claude-opus-4-5,claude-sonnet-4-5,gpt-5.2)agent_idstory_id(NULLABLE — PM/architect calls often have none)state(agent state at call time)request_tokensresponse_tokenstotal_tokenscost_usdlatency_mssuccess(0/1)temperatureIndexes on
(session_id, created_at),(model, created_at),(agent_id),(story_id).The existing per-story aggregate columns (
stories.tokens_used,stories.cost_usd) can stay as a denormalized rollup, or be derived fromllm_calls— separate decision.Why now
Came up while estimating tokens-per-minute for rate-limiter sizing. Had to parse
maestro.logbecause the DB couldn't answer the question. Numbers from one ~2.5h rc3 session (maestro-work/rc3/.maestro/logs/maestro.log, 1,532 LLM calls):That gpt-5.2 727k-tokens-in-one-minute peak is exactly the kind of thing we'd want to drill into (single huge call? sustained burst?) and we can't, because the data isn't queryable.
Out of scope