feat(evaluators): add built-in budget evaluator for per-agent cost tracking #144
Open
amabito wants to merge 7 commits into agentcontrol:main
added 7 commits
March 21, 2026 09:30
…acking
Closes agentcontrol#130
Add BudgetEvaluator -- a deterministic evaluator that tracks cumulative LLM token and cost usage per agent, per channel, per user, with configurable time windows (daily/weekly/monthly/cumulative).
Components:
- BudgetStore protocol + InMemoryBudgetStore (dict + threading.Lock)
- BudgetSnapshot frozen dataclass for atomic state reads
- BudgetEvaluator with scope key building, period key derivation, token extraction, and optional model pricing estimation
- BudgetLimitRule config with scope, per, window, limit_usd, limit_tokens
- 48 tests covering store, config, evaluator, registration
Design:
- In-memory only (no PostgreSQL, no new dependencies)
- Store is "dumb" (accumulate + check), evaluator is "smart" (resolve scope, derive period, extract tokens, check limits)
- record_and_check() is atomic (single lock acquisition)
- Evaluator instances are cached per config (thread-safe by design)
- matched=True only when limit exceeded, confidence=1.0 always
- Utilization ratio in metadata, not confidence
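The period key derivation mentioned above can be pictured with a minimal sketch. The function name `derive_period_key` and the key formats (`2026-03-21`, `2026-W12`, `2026-03`, `all`) are assumptions chosen for illustration; the commit does not state the formats the evaluator actually emits.

```python
from datetime import datetime, timezone


def derive_period_key(window: str, now: datetime) -> str:
    """Map a window name to a bucket label for the given instant.

    Hypothetical formats: the real evaluator may label periods differently.
    """
    if window == "daily":
        return now.strftime("%Y-%m-%d")
    if window == "weekly":
        iso = now.isocalendar()  # (ISO year, ISO week, ISO weekday)
        return f"{iso[0]}-W{iso[1]:02d}"
    if window == "monthly":
        return now.strftime("%Y-%m")
    if window == "cumulative":
        return "all"  # a single bucket that never rolls over
    raise ValueError(f"unknown window: {window!r}")


now = datetime(2026, 3, 21, 9, 30, tzinfo=timezone.utc)
keys = {w: derive_period_key(w, now) for w in ("daily", "weekly", "monthly", "cumulative")}
```

Because the key changes when the period rolls over, a store keyed on (scope key, period key) starts a fresh bucket without any explicit reset step.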
…arial tests
3-body review findings:
Security:
- Sanitize pipe/equals in scope key metadata values (injection prevention)
- Add max_buckets=100K to InMemoryBudgetStore (OOM prevention, fail-closed)
- Block dunder attribute access in _extract_by_path
- Add math.isfinite guard on extracted cost values
- Skip per-user rules when per field missing from metadata (was collapsing per-user budgets into global bucket)
Correctness:
- Changed exceeded check from > to >= (utilization=100% now triggers exceeded)
- Removed unused BudgetSnapshot import from evaluator.py
Tests (6 adversarial):
- Exact limit boundary (USD and tokens)
- Scope key injection via pipe character
- max_buckets OOM prevention
- per-field missing skips rule
- dunder path rejection
54 budget tests, 284 total evaluator tests passing.
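As a rough sketch of how a store with these properties might look: the class and method names come from the commit messages, but the method signature, the snapshot fields, and the exact fail-closed return value are assumptions made for illustration.

```python
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class BudgetSnapshot:
    # Field names are assumed; the commits only say the dataclass is frozen.
    spent_usd: float
    spent_tokens: int
    exceeded: bool


class InMemoryBudgetStore:
    def __init__(self, max_buckets: int = 100_000) -> None:
        self._lock = threading.Lock()
        self._buckets: dict = {}  # (scope_key, period_key) -> [usd, tokens]
        self._max_buckets = max_buckets

    def record_and_check(self, scope_key, period_key, cost_usd, tokens,
                         limit_usd=None, limit_tokens=None) -> BudgetSnapshot:
        key = (scope_key, period_key)
        with self._lock:  # single lock acquisition: accumulate and check together
            bucket = self._buckets.get(key)
            if bucket is None:
                if len(self._buckets) >= self._max_buckets:
                    # Fail closed: refuse to create a new bucket, report exceeded.
                    return BudgetSnapshot(0.0, 0, True)
                bucket = self._buckets[key] = [0.0, 0]
            bucket[0] += cost_usd
            bucket[1] += tokens
            # >= so that exactly reaching the limit counts as exceeded.
            exceeded = (
                (limit_usd is not None and bucket[0] >= limit_usd)
                or (limit_tokens is not None and bucket[1] >= limit_tokens)
            )
            return BudgetSnapshot(bucket[0], bucket[1], exceeded)


store = InMemoryBudgetStore(max_buckets=1)
first = store.record_and_check("agent=a", "2026-03-21", 0.5, 100, limit_usd=1.0)
second = store.record_and_check("agent=a", "2026-03-21", 0.5, 100, limit_usd=1.0)
overflow = store.record_and_check("agent=b", "2026-03-21", 0.1, 10, limit_usd=1.0)
```

In this sketch the second call lands exactly on the 1.0 USD limit and is reported as exceeded, and the third call is rejected because the store already holds its maximum of one bucket.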
… dunder guard
R2 findings:
- _sanitize_scope_value: percent-encode |/= instead of replacing with _
(was causing key collisions between "a|b" and "a_b")
- max_buckets fail-closed: spent_usd/spent_tokens are now 0.0/0 because nothing is recorded
(previously the current-call-only values were reported, misleading callers)
- _extract_by_path: narrowed guard from startswith("_") to startswith("__")
(single-underscore dict keys are legitimate data fields)
- Fixed tautological test assertion in test_scope_key_injection_pipe
- Added 3 tests: no-collision, single-underscore access, NaN/Inf cost
57 budget tests, 287 total evaluator tests passing.
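The collision fix can be illustrated with a small sketch. The helper names and the `key=value|key=value` layout are assumptions; escaping `%` itself is also an addition of this sketch (so that a literal `%7C` in the input cannot collide with an encoded pipe) and is not stated in the commit.

```python
def sanitize_scope_value(value: str) -> str:
    """Percent-encode the characters that act as delimiters in a scope key."""
    # "%" is encoded first so already-encoded-looking input stays distinct
    # (an assumption of this sketch, not confirmed by the commit message).
    return value.replace("%", "%25").replace("|", "%7C").replace("=", "%3D")


def build_scope_key(parts: dict) -> str:
    """Join sanitized key=value pairs with "|" in a stable order (hypothetical layout)."""
    return "|".join(f"{k}={sanitize_scope_value(v)}" for k, v in sorted(parts.items()))


encoded_pipe = sanitize_scope_value("a|b")
plain_underscore = sanitize_scope_value("a_b")
injected = build_scope_key({"agent": "x", "user": "u|agent=y"})
```

Replacing delimiters with `_` maps both `a|b` and `a_b` to the same string, whereas encoding keeps them distinct and still prevents a metadata value from introducing extra `key=value` segments.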
R4 finding: negative pricing rates in config caused _estimate_cost to return negative cost_usd, which subtracted from spent_usd and disabled USD limit enforcement entirely.
Fix: max(0.0, cost) in _estimate_cost return.
Test: negative pricing rates produce spent_usd >= 0.
58 budget tests, 288 total evaluator tests passing.
R5 finding: Inf pricing rates produced inf cost, permanently locking buckets in exceeded state. max(0.0, inf) = inf.
Fix: isfinite + negative check on _estimate_cost return value.
Tests: Inf pricing rate test, strengthened negative pricing assertion.
59 budget tests passing.
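A minimal sketch of the guard described in the R4 and R5 findings follows. The signature, the per-1K-token rate convention, and the choice to return 0.0 for an invalid estimate are all assumptions; the commits only say that the return value gets a finiteness and negativity check.

```python
import math


def estimate_cost(input_tokens: int, output_tokens: int,
                  input_rate_per_1k: float, output_rate_per_1k: float) -> float:
    """Estimate USD cost from token counts and per-1K-token rates (hypothetical units)."""
    cost = (input_tokens / 1000) * input_rate_per_1k \
        + (output_tokens / 1000) * output_rate_per_1k
    # max(0.0, cost) alone is not enough: max(0.0, inf) is still inf.
    if not math.isfinite(cost) or cost < 0:
        return 0.0  # assumed fallback; the real code may handle this differently
    return cost


normal = estimate_cost(1000, 500, 0.01, 0.03)
negative_rate = estimate_cost(1000, 500, -0.01, 0.03)
inf_rate = estimate_cost(1000, 500, float("inf"), 0.03)
clamp_only = max(0.0, float("inf"))
```

The `clamp_only` value shows why the R4 fix was insufficient on its own: clamping at zero leaves an infinite cost untouched, and adding it to a bucket would keep that bucket over any finite limit.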
…dation
R8 finding: float("nan") passed the `v <= 0` validator (IEEE 754:
nan <= 0 is False). NaN limit_usd silently disabled budget enforcement
because all NaN comparisons return False.
Fix: added math.isfinite(v) guard to validate_limit_usd.
Tests: NaN and Inf limit_usd rejection.
61 budget tests, 291 total evaluator tests passing.
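The NaN behavior behind the R8 finding can be reproduced in a few lines. This is a standalone sketch of the validator, not the config class itself; treating `None` as "no USD limit" is an assumption.

```python
import math


def validate_limit_usd(v):
    """Reject non-positive and non-finite limits; None is assumed to mean 'no USD limit'."""
    if v is None:
        return None
    if not math.isfinite(v) or v <= 0:
        raise ValueError("limit_usd must be a finite number greater than 0")
    return v


# Every ordered comparison with NaN is False, so a bare `v <= 0` check lets NaN through.
nan_passes_naive_check = not (float("nan") <= 0)
# For the same reason, spending never compares as >= a NaN limit.
nan_limit_never_exceeded = not (1_000_000.0 >= float("nan"))
```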
…pe+period
R10 finding: when multiple limit rules share the same (scope_key, period_key), each rule called record_and_check() independently, causing the same tokens and cost to be counted N times in the store.
Fix: track recorded (scope_key, period_key) pairs per evaluate() call. First rule records; subsequent rules for the same pair use get_snapshot().
Tests: 2 new tests for same-scope double-count prevention.
63 budget tests, 293 total evaluator tests passing.
Review loop: R9 CLEAN, R10 fix, R11 CLEAN -- 3 consecutive clean achieved.
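The double-count fix can be sketched with a stripped-down store and evaluation loop. `TinyStore`, the rule tuples, and the token-only accounting are simplifications invented for this example, and locking is omitted for brevity; only the idea of remembering recorded (scope_key, period_key) pairs per call comes from the commit.

```python
class TinyStore:
    """Token-only stand-in for the budget store (hypothetical, no locking)."""

    def __init__(self):
        self.spent = {}

    def record_and_check(self, key, tokens):
        self.spent[key] = self.spent.get(key, 0) + tokens
        return self.spent[key]

    def get_snapshot(self, key):
        return self.spent.get(key, 0)  # read-only: does not accumulate


def evaluate(store, rules, tokens):
    """Return one exceeded flag per rule, recording usage once per bucket."""
    recorded = set()  # (scope_key, period_key) pairs already recorded in this call
    results = []
    for scope_key, period_key, limit_tokens in rules:
        key = (scope_key, period_key)
        if key in recorded:
            spent = store.get_snapshot(key)
        else:
            spent = store.record_and_check(key, tokens)
            recorded.add(key)
        results.append(spent >= limit_tokens)
    return results


store = TinyStore()
rules = [("agent=a", "2026-03", 100), ("agent=a", "2026-03", 500)]
flags = evaluate(store, rules, tokens=150)
```

Both rules see the same 150 tokens: the first rule records them and trips its 100-token limit, while the second rule only reads the bucket and stays under 500.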
Summary
Adds a built-in budget evaluator that tracks cumulative LLM token and cost usage per agent, channel, and user, with configurable time windows (daily/weekly/monthly).
Scope
User-facing/API changes:
Internal changes:
Out of scope:
Risk and Rollout
Testing
Checklist