feat(budget): per-session token cap with soft warn + hard block #767

@rohitg00

Problem

Sessions have no per-session token cap. A pathological session (long-running CC instance, runaway tool loop, agent stuck in a 50-turn correction cycle) can rack up arbitrary spend on agentmemory's background compress / summarize / consolidate calls. The existing AGENTMEMORY_LLM_TIMEOUT_MS only caps per-call duration. No safety net at the session level.

Cost-aware model selection (#613) covers per-token cost. This issue covers per-session-total cost.

Proposed shape

Per-session running budget with hard cap + soft warning threshold.

iii composition:

  • New KV scope mem:session-budget keyed by sessionId: { tokenCap, tokensUsed, costEstimate, warnEmittedAt?, exhaustedAt? }
  • iii function mem::session::budget::init({ sessionId, tokenCap? }) writes initial state on session-start
  • iii function mem::session::budget::record({ sessionId, inputTokens, outputTokens, model }) increments after each LLM call
  • Provider wrapper (in ResilientProvider) reads active sessionId via AsyncLocalStorage, increments budget after each call, blocks future calls if cap exceeded
  • Cron trigger reaps budgets for sessions where endedAt + retentionDays passed
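A minimal sketch of the state shape and the two functions listed above. It assumes an in-memory `Map` as a stand-in for the `mem:session-budget` KV scope; `initBudget` and `recordUsage` are hypothetical names for `mem::session::budget::init` and `mem::session::budget::record`, and the real iii state API may look different.

```typescript
// Shape of one entry in the proposed mem:session-budget scope.
interface SessionBudget {
  tokenCap: number;
  tokensUsed: number;
  costEstimate: number;
  warnEmittedAt?: number;
  exhaustedAt?: number;
}

const DEFAULT_TOKEN_CAP = 100000; // default from the proposal

// Assumption: a Map stands in for the KV scope, keyed by sessionId.
const budgets = new Map<string, SessionBudget>();

// Stand-in for mem::session::budget::init, run on session start.
function initBudget(args: { sessionId: string; tokenCap?: number }): SessionBudget {
  const state: SessionBudget = {
    tokenCap: args.tokenCap ?? DEFAULT_TOKEN_CAP,
    tokensUsed: 0,
    costEstimate: 0, // left at 0 here; the proposal normalizes cost at display time
  };
  budgets.set(args.sessionId, state);
  return state;
}

// Stand-in for mem::session::budget::record, run after each LLM call.
function recordUsage(args: {
  sessionId: string;
  inputTokens: number;
  outputTokens: number;
  model: string;
}): SessionBudget {
  // Assumption: a session without a budget gets one lazily with the default cap.
  const state = budgets.get(args.sessionId) ?? initBudget({ sessionId: args.sessionId });
  state.tokensUsed += args.inputTokens + args.outputTokens;
  return state;
}
```

The `Map` mutation here is not atomic across processes; the proposal relies on the iii state update op for that.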

sessionId resolution via AsyncLocalStorage

Provider doesn't directly know sessionId today. New sessionContext AsyncLocalStorage scopes every iii function call. mem::observe, mem::compress, mem::summarize, mem::consolidate-pipeline enter the ALS scope with their sessionId at the top. Provider wrapper reads from ALS — falls back to "unknown" sessionId for system-triggered calls (cron sweeps).
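The ALS threading could be sketched as follows, using Node's `AsyncLocalStorage` from `node:async_hooks`. The store shape and the `compress` / `providerCall` / `currentSessionId` functions are hypothetical stand-ins for illustration, not the existing agentmemory code.

```typescript
import { AsyncLocalStorage } from "node:async_hooks";

// Assumption: the store carries only the sessionId.
const sessionContext = new AsyncLocalStorage<{ sessionId: string }>();

// Fallback named in the proposal for calls with no active session.
const UNKNOWN_SESSION = "unknown";

function currentSessionId(): string {
  return sessionContext.getStore()?.sessionId ?? UNKNOWN_SESSION;
}

// Stand-in for the provider wrapper: it never receives sessionId as an argument.
async function providerCall(): Promise<string> {
  return currentSessionId();
}

// Stand-in for an iii function such as mem::compress entering the scope at the top.
async function compress(sessionId: string): Promise<string> {
  return sessionContext.run({ sessionId }, async () => {
    await Promise.resolve(); // simulate async work before the provider call
    return providerCall();
  });
}
```

The store set by `run` follows the async call chain, so the provider sees the right sessionId without any signature changes in between.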

Defaults

  • tokenCap default: 100k tokens per session. Configurable globally via AGENTMEMORY_SESSION_TOKEN_CAP. Per-session override via mem::session::start payload.
  • Soft warning at 80% — emits event::mem::budget::soft-warned for downstream subscribers (viewer alert).
  • Hard cap blocks further LLM calls — emits event::mem::budget::exhausted. Subsequent compress/summarize calls return synthetic-only output (no LLM).
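One way to express the cap resolution and the two thresholds, as a sketch. `resolveTokenCap` and `budgetStatus` are hypothetical helpers, and treating "exactly at the cap" as exhausted (`>=`) is an assumption the issue does not settle.

```typescript
type BudgetStatus = "ok" | "soft-warned" | "exhausted";

const DEFAULT_TOKEN_CAP = 100000; // default from the proposal
const SOFT_WARN_RATIO = 0.8; // soft warning at 80%

// Resolution order: per-session override, then AGENTMEMORY_SESSION_TOKEN_CAP, then default.
function resolveTokenCap(perSession: number | undefined, envValue: string | undefined): number {
  if (perSession !== undefined) return perSession;
  if (envValue !== undefined) {
    const parsed = Number(envValue);
    // Assumption: empty, non-numeric, or non-positive env values fall back to the default.
    if (Number.isFinite(parsed) && parsed > 0) return parsed;
  }
  return DEFAULT_TOKEN_CAP;
}

function budgetStatus(tokensUsed: number, tokenCap: number): BudgetStatus {
  if (tokensUsed >= tokenCap) return "exhausted";
  if (tokensUsed >= tokenCap * SOFT_WARN_RATIO) return "soft-warned";
  return "ok";
}
```

In use, the caller would pass `process.env.AGENTMEMORY_SESSION_TOKEN_CAP` as `envValue` and emit the corresponding event the first time the status changes.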

Edge cases

  • Concurrent calls for same session — atomic increment via iii state update op. Already supported.
  • Failed calls — increment with 0 input/output tokens in finally block. Don't double-count partial calls.
  • Counter never incremented — provider returns synthetically without LLM call → no increment. Correct behavior.
  • Per-model cost normalization — record raw tokens, normalize to USD at display time using configurable rate table (defaults from cost-aware model selection table in README).
  • Forked session inherits used count? — fresh per fork. Each fork gets its own budget.
  • Budget across server restart — KV-persisted, recovers on next iii-state boot.
  • System-triggered calls (cron-fired consolidation) — no active session. Tracked under a global system-budget sentinel scope with separate cap.
  • Budget exhausted mid-summarize — abort BEFORE next chunk. Current in-flight call completes. Save partial state with truncated: true flag on the session summary.
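The mid-summarize case above could be sketched like this. `summarizeWithBudget` and the `summarizeChunk` callback are hypothetical; the callback stands in for one LLM call and reports the tokens it consumed.

```typescript
interface SummaryResult {
  parts: string[];
  truncated: boolean;
}

async function summarizeWithBudget(
  chunks: string[],
  tokenCap: number,
  summarizeChunk: (chunk: string) => Promise<{ text: string; tokens: number }>,
): Promise<SummaryResult> {
  let tokensUsed = 0;
  const parts: string[] = [];
  for (const chunk of chunks) {
    // Check BEFORE starting the next chunk; an in-flight call is never cancelled.
    if (tokensUsed >= tokenCap) {
      return { parts, truncated: true };
    }
    const { text, tokens } = await summarizeChunk(chunk);
    tokensUsed += tokens;
    parts.push(text);
  }
  // All chunks were processed, even if the last one pushed usage past the cap.
  return { parts, truncated: false };
}
```

Because the check sits at the top of the loop, a call that crosses the cap still completes and its output is kept; only the remaining chunks are skipped.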

Acceptance

  • New KV scope + 2 functions + 1 cron trigger
  • AsyncLocalStorage threads sessionId through iii function calls
  • ResilientProvider increments budget post-call (finally block)
  • Hard-cap blocks future LLM calls + emits exhausted event
  • Soft-warn at 80% + emits warn event
  • agentmemory status shows sessions: N active, M near-cap, K exhausted
  • OTEL metric agentmemory.session.tokens_used histogram
  • Tests: cap enforcement, soft-warn threshold, fork-fresh-budget, concurrent increment, system-sentinel scope
  • AGENTMEMORY_SESSION_TOKEN_CAP=N global override + per-session override on start
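The provider-side items (post-call increment in a finally block, hard-cap block) might look roughly like this. `callWithBudget` is a hypothetical wrapper; ResilientProvider's real interface is not shown in this issue, and the plain object stands in for the KV-backed budget.

```typescript
interface Usage {
  inputTokens: number;
  outputTokens: number;
}

interface Budget {
  tokenCap: number;
  tokensUsed: number;
}

async function callWithBudget<T>(
  budget: Budget,
  llmCall: () => Promise<{ result: T; usage: Usage }>,
  syntheticFallback: () => T,
): Promise<T> {
  // Hard cap: skip the LLM entirely, so nothing is recorded for this call.
  if (budget.tokensUsed >= budget.tokenCap) return syntheticFallback();

  // Failed calls keep this zero usage, so they add nothing to the budget.
  let usage: Usage = { inputTokens: 0, outputTokens: 0 };
  try {
    const response = await llmCall();
    usage = response.usage;
    return response.result;
  } finally {
    // Runs on success and on failure, exactly once per attempted call.
    budget.tokensUsed += usage.inputTokens + usage.outputTokens;
  }
}
```

A call that starts under the cap is allowed to finish even if it overshoots; the block applies from the next call onward.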

Why it matters

Cost safety net. Pathological sessions can't burn down a user's monthly LLM budget. Soft warning gives early visibility. Audit log + OTEL metric make it queryable.
