SafeRL-Lab · chauncygu · Jun 5, 2026 · Jun 5, 2026
diff --git a/README.md b/README.md
@@ -39,7 +39,8 @@ Other install methods: [pip install](#alternative-install-with-pip) | [uv instal
 
 ## 🔥🔥🔥 News (Pacific Time)
 
-- June 5, 2026 (latest, **v3.05.82**): **Adaptive Markdown streaming — live output stays correct on every device** by auto-selecting a per-device tier (`live` in-place redraw on capable terminals incl. modern SSH emulators, append-only `commit` for SSH/Apple Terminal/pipes/CJK text so frames never duplicate, `plain` fallback); also ships a visual `/context` usage grid and a 1M context window for `deepseek-v4-flash`. Details: [docs/guides/features.md](docs/guides/features.md) · [docs/news.md](docs/news.md).
+- June 5, 2026 (latest, **v3.05.82**): **User-controllable token/cost budgets** — `/budget $5` / `/budget 200k` / `/budget daily $20` cap spend per session or per day, enforced before each model call; when a cap is hit, the session auto-saves and you're shown how to `/resume` later or raise the cap and continue (warns at ≥80% and ≥95% of the cap; `--budget` sets a session cap at startup). Details: [docs/guides/features.md](docs/guides/features.md) · [docs/news.md](docs/news.md).
+- June 5, 2026: **Adaptive Markdown streaming — live output stays correct on every device** by auto-selecting a per-device tier (`live` in-place redraw on capable terminals incl. modern SSH emulators, append-only `commit` for SSH/Apple Terminal/pipes/CJK text so frames never duplicate, `plain` fallback); also ships a visual `/context` usage grid and a 1M context window for `deepseek-v4-flash`. Details: [docs/guides/features.md](docs/guides/features.md) · [docs/news.md](docs/news.md).
 - June 4, 2026 (**v3.05.81**): **Claude-Code-style quiet output** hides per-tool execution and shows one summary line per turn (on by default), with a live spinner timer + token estimate and a `✻ Worked for…` footer; `/verbose` overrides, toggle with `/quiet`. Details: [docs/guides/features.md](docs/guides/features.md) · [docs/news.md](docs/news.md).
 - June 4, 2026: **Context-window override** — `/config context_window=<N>` sets the context length that drives the prompt `%`, `/context`, the compaction trigger, and the output cap consistently (distinct from `max_tokens`; read live, no restart). Details: [docs/guides/reference.md](docs/guides/reference.md) · [docs/news.md](docs/news.md).
 - June 4, 2026: **Rich Live streaming** keeps long responses live via a bounded tail window — redrawing only the most recent screenful and committing the full output when done, fixing duplicate/stale frames (builds on PR #133). Details: [docs/guides/features.md](docs/guides/features.md) · [docs/news.md](docs/news.md).

diff --git a/agent.py b/agent.py
@@ -57,6 +57,19 @@ class PermissionRequest:
     description: str
     granted: bool = False
 
+@dataclass
+class QuotaPause:
+    """Yielded when a configured budget is reached, instead of making a billable
+    call. The REPL auto-saves the session and tells the user how to resume or
+    raise the budget. ``usage`` is the snapshot from quota.get_usage(); the
+    key/scope/unit/limit identify which cap broke so the hint targets it."""
+    reason: str
+    usage: dict = field(default_factory=dict)
+    key: str | None = None
+    scope: str | None = None
+    unit: str | None = None
+    limit: float | None = None
+
 
 # ── Agent loop ─────────────────────────────────────────────────────────────
 
@@ -149,12 +162,34 @@ def run(
                       removed=_before_len - len(state.messages))
 
         # ── Quota check — before spending tokens ──────────────────────────
+        # Project this request's INPUT so a single large (tool-heavy) call can't
+        # blow past the cap, then clamp the OUTPUT cap to the remaining headroom
+        # so the response can't either — keeping the overshoot near zero.
+        _proj_tokens, _proj_cost = 0, 0.0
+        _call_config = config
+        if any(config.get(k) for k in ("session_token_budget", "session_cost_budget",
+                                       "daily_token_budget", "daily_cost_budget")):
+            try:
+                from compaction import estimate_tokens as _est_tok
+                from providers import calc_cost as _calc_cost
+                _proj_tokens = (_est_tok(state.messages)
+                                + _est_tok([{"role": "system", "content": system_prompt}]))
+                _proj_cost = _calc_cost(config["model"], _proj_tokens, 0)
+            except Exception:  # estimation is best-effort; fall back to no projection
+                _proj_tokens, _proj_cost = 0, 0.0
         try:
-            _quota.check_quota(session_id, config)
+            _quota.check_quota(session_id, config,
+                               projected_tokens=_proj_tokens, projected_cost=_proj_cost)
         except _quota.QuotaExceeded as qe:
             _log.warn("quota_exceeded", session_id=session_id, reason=qe.reason)
-            yield TextChunk(f"\n[Quota exceeded — {qe.reason}]\n")
+            yield QuotaPause(qe.reason, _quota.get_usage(session_id),
+                             key=qe.key, scope=qe.scope, unit=qe.unit, limit=qe.limit)
             break
+        _room = _quota.output_room(session_id, config, _proj_tokens, _proj_cost)
+        if _room is not None:
+            _cur_cap = config.get("max_tokens") or 4096
+            if _room < _cur_cap:
+                _call_config = {**config, "max_tokens": max(256, int(_room))}
 
         # NIM-only: when build.nvidia.com rate-limits a model, cycle to
         # the next free-tier model before consuming a regular retry. Capped
@@ -177,7 +212,7 @@ def run(
                     system=system_prompt,
                     messages=state.messages,
                     tool_schemas=get_tool_schemas(),
-                    config=config,
+                    config=_call_config,
                 ):
                     if isinstance(event, (TextChunk, ThinkingChunk)):
                         yield event

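The output clamp in the hunk above can be read in isolation as a small pure function. The following is a minimal sketch, assuming the same 4096-token default and 256-token floor the diff uses; `clamp_output_cap` is a hypothetical name for illustration and is not part of `agent.py`, and `room` stands in for whatever `_quota.output_room(...)` returns.

```python
from typing import Optional


def clamp_output_cap(config: dict, room: Optional[float],
                     floor: int = 256, default_cap: int = 4096) -> dict:
    """Return a per-call config whose max_tokens fits the remaining budget.

    Hypothetical helper mirroring the inline logic in the diff:
    - room is None   -> no budget applies, config is returned unchanged
    - room >= cap    -> enough headroom, config is returned unchanged
    - room <  cap    -> a copy is returned with max_tokens lowered to room,
                        but never below `floor`
    """
    if room is None:
        return config
    current_cap = config.get("max_tokens") or default_cap
    if room < current_cap:
        # Copy rather than mutate, so the user's configured cap is untouched.
        return {**config, "max_tokens": max(floor, int(room))}
    return config


cfg = {"model": "example-model", "max_tokens": 4096}
print(clamp_output_cap(cfg, 1000)["max_tokens"])  # 1000
print(clamp_output_cap(cfg, 100)["max_tokens"])   # 256 (floor applies)
print(clamp_output_cap(cfg, None) is cfg)         # True
```

Note that the 256-token floor means a call can still exceed the budget by a small amount when less than 256 tokens of headroom remain, which matches the "near zero" wording in the diff comment rather than a strict guarantee.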
diff --git a/cheetahclaws.py b/cheetahclaws.py
@@ -23,6 +23,7 @@
   /history    Print conversation history
   /context    Show context window usage
   /cost       Show API cost this session
+  /budget     View or set token/cost budgets (session + daily)
   /status     Show current session status (model, mode, tokens, cost)
   /verbose    Toggle verbose mode
   /quiet      Toggle compact tool display (hide execution, show per-turn summary)
@@ -239,7 +240,7 @@ def __getattr__(self, name):
 
 # ── Core commands ──────────────────────────────────────────────────────────
 from commands.core import (
-    cmd_help, cmd_clear, cmd_context, cmd_cost, cmd_compact,
+    cmd_help, cmd_clear, cmd_context, cmd_cost, cmd_budget, cmd_compact,
     cmd_init, cmd_export, cmd_copy, cmd_status, cmd_doctor,
     cmd_proactive, cmd_image, cmd_circuit, cmd_web, run_setup_wizard,
 )
@@ -452,6 +453,7 @@ def _proactive_watcher_loop(config):
     "search":      cmd_search,
     "context":     cmd_context,
     "cost":        cmd_cost,
+    "budget":      cmd_budget,
     "verbose":     cmd_verbose,
     "quiet":       cmd_quiet,
     "thinking":    cmd_thinking,
@@ -615,6 +617,7 @@ def handle_slash(line: str, state, config) -> Union[bool, tuple]:
     "search":      ("Search past sessions",               []),
     "context":     ("Visualize context-window usage by category", []),
     "cost":        ("Show cost estimate",                 []),
+    "budget":      ("View or set token/cost budgets (session + daily)", ["session", "daily", "clear"]),
     "verbose":     ("Toggle verbose output",              []),
     "quiet":       ("Toggle compact tool display",        []),
     "thinking":    ("Toggle extended thinking",           []),
@@ -895,7 +898,7 @@ def _headless_run_query(prompt: str, is_background: bool = False) -> None:
 def repl(config: dict, initial_prompt: str = None):
     from cc_config import HISTORY_FILE
     from context import build_system_prompt
-    from agent import AgentState, run, TextChunk, ThinkingChunk, ToolStart, ToolEnd, TurnDone, PermissionRequest
+    from agent import AgentState, run, TextChunk, ThinkingChunk, ToolStart, ToolEnd, TurnDone, PermissionRequest, QuotaPause
 
     if HAS_PROMPT_TOOLKIT:
         # Inject live providers so ui.input's completer enumerates the same
@@ -1101,6 +1104,7 @@ def run_query(user_input: str, is_background: bool = False):
             turn_start = time.monotonic()
             turn_in_tokens = 0
             turn_out_tokens = 0
+            quota_paused = False    # set when a budget is reached mid-turn
             streamed_chars = 0
 
             # Rebuild system prompt each turn (picks up cwd changes, etc.)
@@ -1251,6 +1255,38 @@ def run_query(user_input: str, is_background: bool = False):
                                 f"\n  [tokens: +{event.input_tokens} in / "
                                 f"+{event.output_tokens} out]", "dim"
                             ))
+
+                    elif isinstance(event, QuotaPause):
+                        # A configured budget was reached BEFORE making the next
+                        # (billable) call. Auto-save so nothing is lost, then tell
+                        # the user how to resume or raise the budget and continue.
+                        _stop_tool_spinner()
+                        spinner_shown = False
+                        flush_response()
+                        quota_paused = True
+                        print()
+                        print(clr(f"  ⛔ Budget reached — {event.reason}", "yellow", "bold"))
+                        # save_latest() prints the saved paths itself — don't echo.
+                        try:
+                            from commands.session import save_latest
+                            save_latest("", state, config)
+                        except Exception:
+                            pass  # best-effort: a failed save must not hide the budget notice
+                        # Suggest raising the cap that actually broke, in its own
+                        # unit/scope — a token cap can't be lifted with a $ amount.
+                        try:
+                            import quota as _q
+                            _pre = "daily " if event.scope == "daily" else ""
+                            _amt = _q.fmt_amount((event.limit or 0) * 2, event.unit or "tok")
+                            _raise_cmd = f"/budget {_pre}{_amt}" if event.limit else "/budget 40k"
+                        except Exception:
+                            _raise_cmd = "/budget 40k"
+                        print(clr("  To continue:", "bold"))
+                        print("    • raise it:   " + clr(_raise_cmd, "cyan")
+                              + "  (or " + clr("/budget clear", "cyan") + "), then resend your message")
+                        print("    • later:      restart and run " + clr("/resume", "cyan")
+                              + " to pick up where you left off")
+                        print("    • view usage: " + clr("/budget", "cyan"))
             except KeyboardInterrupt:
                 _stop_tool_spinner()
                 flush_response()
@@ -1285,6 +1321,15 @@ def run_query(user_input: str, is_background: bool = False):
             if quiet:
                 print_turn_stats(time.monotonic() - turn_start,
                                  turn_in_tokens, turn_out_tokens)
+            # Budget proximity warnings (≥80% / ≥95%) — heads-up before the hard
+            # stop arrives. Skipped when this turn already hit the cap.
+            if not quota_paused:
+                try:
+                    import quota as _quota
+                    for _level, _msg in _quota.warnings(config.get("_session_id", "default"), config):
+                        (err if _level == "crit" else warn)(f"  ⚠ Budget: {_msg} — /budget to view")
+                except Exception:
+                    pass
             print(clr("╰──────────────────────────────────────────────", "dim"))
             print()
 
@@ -1912,6 +1957,10 @@ def main():
                         help="Show each tool call instead of a per-turn summary")
     parser.add_argument("--thinking", action="store_true",
                         help="Enable extended thinking")
+    parser.add_argument("--budget", metavar="AMOUNT",
+                        help="Session budget cap, e.g. --budget $5 (cost) or "
+                             "--budget 200k (tokens). Auto-saves and prompts to "
+                             "resume / raise when reached.")
     parser.add_argument("--version", action="store_true", help="Print version")
     parser.add_argument("--setup", action="store_true", help="Run interactive setup wizard")
     parser.add_argument("--web", action="store_true",
@@ -1994,6 +2043,15 @@ def main():
         config["quiet"] = False
     if args.thinking:
         config["thinking"] = True
+    if getattr(args, "budget", None):
+        import quota as _quota
+        try:
+            _kind, _val = _quota.parse_budget(args.budget)
+            config[_quota.BUDGET_KEYS[(_kind, "session")]] = _val
+            _shown = _quota.fmt_amount(_val, "usd" if _kind == "cost" else "tok")
+            print(clr(f"  Session {'cost' if _kind == 'cost' else 'token'} budget: {_shown}", "dim"))
+        except ValueError as _e:
+            warn(f"--budget: {_e} (e.g. --budget $5 or --budget 200k); ignoring.")
 
     # ── Setup wizard: --setup flag or first-run auto-trigger ─────────────
     from cc_config import CONFIG_FILE

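The "raise it" hint in the `QuotaPause` branch above doubles the breached limit and formats it in that cap's own unit and scope. A minimal sketch of that rule follows. Both function names are hypothetical, and `fmt_amount_sketch` is only an assumed stand-in for `quota.fmt_amount`, whose real output format is not shown in this diff.

```python
from typing import Optional


def fmt_amount_sketch(value: float, unit: str) -> str:
    """Assumed formatting: '$40' for USD, '400k' / '3m' for token counts."""
    if unit == "usd":
        return f"${value:g}"
    if value >= 1_000_000:
        return f"{value / 1_000_000:g}m"
    if value >= 1_000:
        return f"{value / 1_000:g}k"
    return str(int(value))


def suggest_raise_command(scope: Optional[str], unit: Optional[str],
                          limit: Optional[float]) -> str:
    """Build the /budget command that doubles the cap that was hit."""
    if not limit:
        # No limit information on the event: same fallback as the diff.
        return "/budget 40k"
    prefix = "daily " if scope == "daily" else ""
    return f"/budget {prefix}{fmt_amount_sketch(limit * 2, unit or 'tok')}"


print(suggest_raise_command("session", "tok", 200_000))  # /budget 400k
print(suggest_raise_command("daily", "usd", 20.0))       # /budget daily $40
print(suggest_raise_command(None, None, None))           # /budget 40k
```

Keeping the suggestion in the breached unit matters because, as the diff comment notes, a token cap cannot be lifted by setting a dollar amount.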
diff --git a/commands/core.py b/commands/core.py
@@ -221,6 +221,86 @@ def cmd_cost(_args: str, state, config) -> bool:
     return True
 
 
+def _budget_bar(pct: float | None, width: int = 16) -> str:
+    filled = int(round((pct or 0) / 100 * width))
+    filled = max(0, min(width, filled))
+    return "█" * filled + "░" * (width - filled)
+
+
+def cmd_budget(args: str, state, config) -> bool:
+    """View or set token / cost budgets (session + daily).
+
+    /budget                 show usage vs every budget (bars + %)
+    /budget $5              session cost cap (the $ means USD)
+    /budget 200k            session token cap (supports 200k / 1.5m / 200000)
+    /budget daily $20       daily cost cap   ·   /budget daily 2m  daily tokens
+    /budget clear           remove all caps (unlimited)
+    """
+    import quota as _quota
+    from cc_config import save_config
+
+    arg = args.strip()
+    sid = config.get("_session_id", "default")
+
+    # ── view ────────────────────────────────────────────────────────────────
+    if not arg:
+        rows = _quota.usage_vs_limits(sid, config)
+        print(clr("  Budgets (tokens / cost)", "bold"))
+        any_set = False
+        for r in rows:
+            used = _quota.fmt_amount(r["used"], r["unit"])
+            if r["limit"] is None:
+                print(f"  {r['label']:<15} {used:>9}  " + clr("unlimited", "dim"))
+                continue
+            any_set = True
+            lim = _quota.fmt_amount(r["limit"], r["unit"])
+            pct = r["pct"] or 0
+            color = "red" if pct >= 95 else ("yellow" if pct >= 80 else "green")
+            print(f"  {r['label']:<15} {used:>9} / {lim:<9} "
+                  f"{clr(_budget_bar(pct), color)} {pct:4.0f}%")
+        print()
+        if any_set:
+            info("  Change: /budget $5 · /budget 200k · /budget daily $20 · /budget clear")
+        else:
+            info("  No budgets set (unlimited). Set one: /budget $5 · /budget 200k · /budget daily $20")
+        return True
+
+    # ── clear ─────────────────────────────────────────────────────────────────
+    if arg.lower() in ("clear", "off", "none", "reset", "unlimited"):
+        for key in _quota.BUDGET_KEYS.values():
+            config[key] = None
+        save_config(config)
+        ok("All budgets cleared (unlimited).")
+        return True
+
+    # ── set ───────────────────────────────────────────────────────────────────
+    parts = arg.split()
+    scope = "session"
+    if parts[0].lower() in ("session", "daily"):
+        scope, rest = parts[0].lower(), " ".join(parts[1:])
+    else:
+        rest = arg
+    if not rest.strip():
+        err("Usage: /budget [session|daily] <amount>  —  e.g. /budget $5  ·  /budget daily 2m")
+        return True
+    try:
+        kind, value = _quota.parse_budget(rest)
+    except ValueError as e:
+        err(f"{e}. Examples: /budget $5 (cost) · /budget 200k (tokens) · /budget daily $20")
+        return True
+    config[_quota.BUDGET_KEYS[(kind, scope)]] = value
+    # One budget per scope: a new cap replaces the other unit for that scope, so
+    # e.g. setting a $ cap clears a leftover token cap that would still block.
+    config[_quota.BUDGET_KEYS[("tokens" if kind == "cost" else "cost", scope)]] = None
+    save_config(config)
+    shown = _quota.fmt_amount(value, "usd" if kind == "cost" else "tok")
+    ok(f"{scope.capitalize()} budget set to {shown} "
+       f"({'cost' if kind == 'cost' else 'tokens'}).")
+    info(f"Replaces any previous {scope} cap. Checked before each model call; "
+         "auto-saves and shows how to resume when reached.")
+    return True
+
+
 def cmd_compact(args: str, state, config) -> bool:
     """Manually compact conversation history."""
     from compaction import manual_compact

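`cmd_budget` relies on `quota.parse_budget`, which is not part of this diff. Based only on the amounts the docstring lists (`$5`, `200k`, `1.5m`, `200000`) and on the `(kind, value)` tuple and `ValueError` that the caller expects, a parser with that contract might look like the sketch below. The name `parse_budget_sketch`, the regular expressions, and the error messages are assumptions, not the project's implementation.

```python
import re
from typing import Tuple

# Multipliers for the token suffixes mentioned in the /budget docstring.
_SUFFIX = {"": 1, "k": 1_000, "m": 1_000_000}


def parse_budget_sketch(text: str) -> Tuple[str, float]:
    """Parse '$5' -> ('cost', 5.0) and '200k' -> ('tokens', 200000).

    Raises ValueError for anything else, matching what cmd_budget catches.
    """
    s = text.strip().lower().replace(",", "")

    # A leading '$' selects a USD cost cap.
    cost = re.fullmatch(r"\$(\d+(?:\.\d+)?)", s)
    if cost:
        value = float(cost.group(1))
        if value <= 0:
            raise ValueError("budget must be greater than zero")
        return "cost", value

    # Otherwise expect a token count with an optional k/m suffix.
    tokens = re.fullmatch(r"(\d+(?:\.\d+)?)([km]?)", s)
    if not tokens:
        raise ValueError(f"could not parse budget amount {text!r}")
    count = round(float(tokens.group(1)) * _SUFFIX[tokens.group(2)])
    if count <= 0:
        raise ValueError("budget must be greater than zero")
    return "tokens", count


print(parse_budget_sketch("$5"))    # ('cost', 5.0)
print(parse_budget_sketch("200k"))  # ('tokens', 200000)
print(parse_budget_sketch("1.5m"))  # ('tokens', 1500000)
```

Using full-string regular expressions instead of a bare `float()` call rejects inputs such as `$inf` or `nan`, which `float()` would otherwise accept.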
diff --git a/docs/guides/features.md b/docs/guides/features.md
@@ -55,5 +55,6 @@ and indexed in the [README Documentation section](../../README.md#documentation)
 | Cloud sync | `/cloudsave` syncs sessions to private GitHub Gists; auto-sync on exit; load from cloud by Gist ID. No new dependencies (stdlib `urllib`). |
 | Extended Thinking | Toggle on/off for Claude models; native `<think>` block streaming for local Ollama reasoning models (deepseek-r1, qwen3, gemma4) |
 | Cost tracking | Token usage + estimated USD cost |
+| Token / cost budgets | `/budget` sets and views spend caps — per-session or per-day, in tokens or USD (`/budget $5`, `/budget 200k`, `/budget daily $20`, `/budget clear`; or `--budget $5` at startup). **One budget per scope**: a new cap replaces the other unit for that scope (so switching tokens↔USD just works, no stale cap left blocking). Enforced before each model call, and **tight** — it projects the next request's input and clamps its output cap, so a single tool-heavy turn can't overshoot the budget. Warns at ≥80%/95%. When a cap is hit the session is **auto-saved** and you're shown how to `/resume` later or raise the **same** cap (the hint matches the breached unit) and continue — nothing is lost. Backed by `quota.py`; the daemon ships conservative defaults (200k tok / $2 per session) in `serve` mode. |
 | Non-interactive mode | `--print` flag for scripting / CI |
 | **Web UI** | `--web` opens the browser. Multi-user accounts (bcrypt + JWT), SQLite-persisted history, session CRUD + markdown export, light/dark/system theme, `/health` + `/metrics`, auto-picks a free port if 8080 is busy. `pip install 'cheetahclaws[web]'`. See [web-ui.md](web-ui.md). |