Epic: Context rotation for long-running agent swarms #791
Description
Summary
Add context rotation to the il spin harness so that the swarm orchestrator (team-lead) can operate indefinitely without hitting context window limits. The harness monitors token usage by tailing the Claude Code session transcript, and when usage approaches the limit, asks the agent to checkpoint when ready. The agent decides when it's a good time, signals back, and the harness rotates the session — launching a fresh one with the same team identity and system prompt.
Motivation
Long-running swarm orchestrators managing complex epics can exhaust their context window mid-task. Today this means the agent degrades (context compression loses detail) or stops entirely. Because iloom already persists plans, progress, and artifacts on the issue, and agents communicate via inbox files, all the durable state needed for resumption is already externalized. We just need the harness to manage the session lifecycle.
Design
Core Mechanism
- Transcript monitoring: The il spin harness already knows the Claude Code session ID. It tails the session transcript JSONL (~/.claude/projects/*/sessions/<id>/*.jsonl), which includes usage data (input_tokens, output_tokens) on each message.
- Threshold detection: When cumulative token usage crosses a configurable threshold (e.g., 80% of model context window), the harness sends a checkpoint request.
- Checkpoint request (standing instruction): The harness writes a message to the team-lead's inbox: "When you're at a good stopping point and not finished, flush any pending state and signal that you're ready for rotation." This is a standing instruction: the agent decides when it's appropriate, not the harness.
- Agent signals ready: When the orchestrator reaches a safe idle point, it flushes any in-memory state to recap/metadata/issue, then sends a "ready for rotation" signal back via inbox. The agent must not signal ready when:
  - There are active workers in-flight
  - A rebase or merge operation is in progress or pending
  - A child branch merge has been started but not completed
  - The epic is done (no rotation needed; just finish)
- Harness rotates: On receiving the signal, the harness:
  - Shuts down the current Claude Code session
  - Rotates the session ID in the loom's metadata file
  - Calls the internals of the il spin command to start a new Claude session with the new session ID but the same system prompt and same team identity flags
  - Prompts the agent to continue (via the initial prompt if supported, or via an inbox message: "Continue where you left off")
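The monitoring and threshold steps above can be sketched roughly as follows. This is a minimal sketch, not the harness implementation: the function names are hypothetical, and the assumed transcript line shape (a `message.usage` object per JSONL line) is an assumption that should be checked against a real transcript. It sums usage across messages as the description above states; if each message's usage already reflects the full prompt, the most recent value would be the better measure.

```typescript
// Assumed transcript line shape (not confirmed by the source):
// { "message": { "usage": { "input_tokens": n, "output_tokens": n } } }
interface Usage {
  input_tokens?: number;
  output_tokens?: number;
}

// Sum input and output tokens over every transcript line that carries usage data.
function cumulativeTokens(jsonl: string): number {
  let total = 0;
  for (const line of jsonl.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "") continue;
    let usage: Usage | undefined;
    try {
      usage = JSON.parse(trimmed)?.message?.usage;
    } catch {
      continue; // while tailing, the last line may be only partially written
    }
    if (!usage) continue;
    total += (usage.input_tokens ?? 0) + (usage.output_tokens ?? 0);
  }
  return total;
}

// True once usage reaches the configured fraction of the model's context window.
function shouldRequestCheckpoint(
  totalTokens: number,
  contextWindow: number,
  threshold: number = 0.8,
): boolean {
  return totalTokens >= contextWindow * threshold;
}
```

Skipping unparseable lines keeps the tailer tolerant of reading a line mid-write; the next poll would pick it up once complete.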
Harness-to-Agent Messaging
The harness writes directly to the team-lead's inbox JSON file at ~/.claude/teams/<team-name>/inboxes/team-lead.json. The harness uses "from": "iloom" to distinguish its messages from worker messages.
Note on team manifest registration: The team config.json has a members array where team participants are registered. Initially, try writing to the inbox without registering iloom in the manifest — the inbox poller reads all unread messages regardless. If delivery fails because Claude Code tries to resolve the sender against the manifest, fall back to registering iloom as a member.
Checkpoint request message:
{
  "from": "iloom",
  "text": "{\"type\":\"checkpoint_request\",\"reason\":\"context_rotation\",\"timestamp\":\"...\"}",
  "timestamp": "2026-...",
  "read": false
}
Expected response (agent writes to harness inbox or a known location):
{
  "type": "ready_for_rotation"
}
The harness must use proper file locking when writing (the inbox uses proper-lockfile with a .lock suffix). The orchestrator prompt must document that iloom is a known sender and that checkpoint_request messages should be handled as described above.
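To make the message format concrete, here is a small sketch of how the harness might build and append a checkpoint request. The helper names are hypothetical, and it assumes the inbox file holds a JSON array of message objects, which the source does not state. File locking is deliberately left out: in practice the read-modify-write would need to happen while holding the inbox lock described above.

```typescript
interface InboxMessage {
  from: string;
  text: string; // JSON-encoded payload, matching the example message above
  timestamp: string;
  read: boolean;
}

// Build the checkpoint request entry the harness would write.
function buildCheckpointRequest(now: Date): InboxMessage {
  const timestamp = now.toISOString();
  return {
    from: "iloom",
    text: JSON.stringify({
      type: "checkpoint_request",
      reason: "context_rotation",
      timestamp,
    }),
    timestamp,
    read: false,
  };
}

// Return new inbox file contents with the message appended.
// Assumes (unverified) that the inbox file is a JSON array of messages.
function appendToInbox(inboxJson: string, message: InboxMessage): string {
  const messages: InboxMessage[] =
    inboxJson.trim() === "" ? [] : JSON.parse(inboxJson);
  return JSON.stringify([...messages, message], null, 2);
}
```

Keeping these as pure string-in, string-out helpers means the locking and file I/O can wrap them without changing the message logic.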
Rotation Flow
Harness detects context threshold reached
│
├─ Writes to team-lead.json inbox (from: "iloom"):
│ { "type": "checkpoint_request", "reason": "context_rotation" }
│
│ ... agent continues working normally ...
│ ... agent reaches safe idle point ...
│
├─ Orchestrator (when safe):
│ ├─ No active workers in-flight
│ ├─ No rebase/merge in progress or pending
│ ├─ Not done with the epic
│ ├─ Flushes any pending state to recap/metadata
│ └─ Signals back: { "type": "ready_for_rotation" }
│
├─ Harness receives signal:
│ ├─ Shuts down current session
│ ├─ Rotates session ID in iloom-metadata.json
│ └─ Calls il spin internals to launch new session:
│ --agent-id <same-id>
│ --agent-name team-lead
│ --team-name <same-team>
│ + same system prompt
│ + continuation prompt (or inbox message: "Continue")
│
└─ New session:
├─ Fresh 200k context window
├─ Same team identity → same inbox
├─ Reads issue for plan/progress state
├─ Reads inbox for pending worker messages
└─ Continues orchestrating
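Viewed from the harness side, the flow above amounts to a three-state loop. The sketch below is one possible way to model it; the state and event names are illustrative and do not come from the iloom codebase.

```typescript
type HarnessState = "monitoring" | "checkpoint_requested" | "rotating";
type HarnessEvent =
  | "threshold_reached"
  | "ready_for_rotation"
  | "new_session_started";

// Advance the harness state; events that do not apply to the current state are ignored.
function nextState(state: HarnessState, event: HarnessEvent): HarnessState {
  switch (state) {
    case "monitoring":
      // Threshold crossed: the checkpoint request is written to the inbox.
      return event === "threshold_reached" ? "checkpoint_requested" : state;
    case "checkpoint_requested":
      // Standing instruction is out; wait until the agent says it is safe.
      return event === "ready_for_rotation" ? "rotating" : state;
    case "rotating":
      // Old session shut down, new session launched with the same identity.
      return event === "new_session_started" ? "monitoring" : state;
  }
}
```

Ignoring out-of-order events means a repeated threshold crossing while a request is already outstanding does not send a second checkpoint request.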
Team Identity Preservation via Hidden CLI Flags
Claude Code has hidden flags that establish team context for a session. The required flags are:
- --agent-id <id>: Teammate agent ID
- --agent-name <name>: Teammate display name
- --team-name <name>: Team name for swarm coordination
All three are required together — omitting any one causes the process to exit with an error. These call setDynamicTeamContext, which makes isTeammate() return true and enables the InboxPoller (1s polling interval in TUI mode).
Optional flags:
- --agent-color <color>: UI display color
- --parent-session-id <id>: Links this session to a parent for analytics/telemetry correlation. Not required for team functionality. Could be useful for tracing rotated sessions back to the original.
- --agent-type <type>: Custom agent type
- --plan-mode-required: Require plan mode before implementation
- --teammate-mode <mode>: Spawn backend: "tmux", "in-process", or "auto"
The env var CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1 must also be set.
Inbox location: ~/.claude/teams/<team-name>/inboxes/<agent-name>.json — by reusing the same --agent-name and --team-name, the new session inherits the existing inbox. No renaming, no -2 suffix. Workers continue writing to team-lead.json as normal — they have no idea the session rotated.
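As an illustration of how a rotated session could be given the same identity, the sketch below assembles the three required flags, the optional parent link, and the inbox path. The flag names, env var, and path layout are taken from the text above; the function and type names are hypothetical, and how il spin actually spawns Claude Code is not shown.

```typescript
interface TeamIdentity {
  agentId: string;
  agentName: string;
  teamName: string;
  parentSessionId?: string; // optional: trace a rotated session to its predecessor
}

// Environment variable that must be set alongside the flags.
const teamEnv = { CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS: "1" };

// Build the CLI arguments; all three identity flags must be present together.
function buildTeamFlags(identity: TeamIdentity): string[] {
  const { agentId, agentName, teamName, parentSessionId } = identity;
  if (!agentId || !agentName || !teamName) {
    throw new Error("agentId, agentName and teamName are all required");
  }
  const flags = [
    "--agent-id", agentId,
    "--agent-name", agentName,
    "--team-name", teamName,
  ];
  if (parentSessionId) {
    flags.push("--parent-session-id", parentSessionId);
  }
  return flags;
}

// Inbox path derived from team and agent name, so it is stable across rotations.
function inboxPath(homeDir: string, teamName: string, agentName: string): string {
  return `${homeDir}/.claude/teams/${teamName}/inboxes/${agentName}.json`;
}
```

Because the inbox path depends only on the team and agent names, passing the same values on every rotation keeps workers writing to the same file.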
What Acts as the Checkpoint
iloom's existing architecture already externalizes most state. The checkpoint step ensures anything still in-flight gets flushed:
| State | Where it lives | Needs checkpoint flush? |
|---|---|---|
| Implementation plan | Issue (written by il plan) | No (already persisted) |
| Child issue status | Issue tracker | No (already persisted) |
| Worker completion reports | team-lead.json inbox | No (already persisted) |
| Per-child loom state | iloom-metadata.json in each worktree | No (already persisted) |
| Dependency graph | Epic loom metadata | No (already persisted) |
| In-progress orchestration decisions | Agent's context window only | Yes (must flush to recap/metadata) |
| Pending worker assignments | Agent's context window only | Yes (must flush) |
Because the agent chooses when to rotate (at a safe idle point with no active workers and no pending git operations), most of these in-flight items will already be resolved. The flush is typically minimal.
Key Design Decisions
- Harness monitors, agent decides timing: The harness detects the threshold and sends the request, but the agent chooses when to checkpoint. This avoids interrupting mid-coordination when in-flight state is maximal.
- Safe rotation criteria in orchestrator prompt: The orchestrator prompt must explicitly instruct the agent not to signal ready during unsafe states (active workers, pending rebases/merges, incomplete branch operations). This is a prompt-level contract, not just a suggestion.
- Minimal harness identity: The harness writes to the inbox as "from": "iloom" without registering in the team manifest initially. If Claude Code requires manifest registration to deliver messages, fall back to adding an entry. The orchestrator prompt must document iloom as a trusted sender with specific message types (checkpoint_request).
- Reuse il spin internals: The rotation doesn't need a separate launch mechanism. Rotate the session ID in metadata, then call the same code path il spin uses to start a Claude session. Same system prompt, same team flags.
- Same identity, fresh context: By reusing the hidden team flags with the same agent name and team name, the rotated session is invisible to workers. No protocol changes needed.
- Issue as external memory: Plans, progress updates, and summaries are already written to the issue. A resumed agent reads the issue and has full context without needing conversation history.
- Continuation via prompt or inbox: Prefer passing a continuation prompt when launching the new session. If that's not supported for some reason, fall back to writing an inbox message: "Continue where you left off."
- Configurable threshold: Different models have different context windows. The threshold should be configurable and default to something conservative (e.g., 80%).
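The safe rotation criteria are a prompt-level contract that the agent evaluates itself, but restating them as a predicate can help when drafting the prompt or writing tests around it. The sketch below is only that restatement; the type and field names are hypothetical.

```typescript
interface OrchestratorState {
  activeWorkers: number;
  // A rebase or merge is in progress or pending, including a child branch
  // merge that has been started but not completed.
  gitOperationPending: boolean;
  epicDone: boolean;
}

// List every reason the agent must not signal ready; an empty list means safe.
function unsafeReasons(state: OrchestratorState): string[] {
  const reasons: string[] = [];
  if (state.activeWorkers > 0) reasons.push("active workers in-flight");
  if (state.gitOperationPending) reasons.push("rebase or merge in progress or pending");
  if (state.epicDone) reasons.push("epic is done; finish instead of rotating");
  return reasons;
}

function canSignalReady(state: OrchestratorState): boolean {
  return unsafeReasons(state).length === 0;
}
```

Returning the reasons rather than a bare boolean makes it easier to see which condition is blocking rotation.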
Scope
- Primary target: swarm orchestrator (team-lead) — this is the long-running session most likely to hit context limits
- Workers may benefit too, but they typically complete before hitting limits
- Should work with any model (context window sizes vary)
Open Questions
- What's the right default threshold? 80% of context window? Or should we wait for compression to kick in and rotate after N compressions?
- Should we track rotation count in metadata for observability/telemetry?
- What if the agent never signals ready (e.g., it's stuck in an infinite loop with workers)? Should the harness have a hard ceiling / force-rotation fallback?
- Does the orchestrator prompt need a dedicated section for handling checkpoint_request messages, or can it be folded into existing orchestrator instructions?
- Should the continuation prompt include a summary of recent activity, or is "read the issue and inbox" sufficient?
- How does the harness read the agent's ready_for_rotation response? Options: (a) harness has its own inbox file that the agent writes to, (b) harness tails the transcript for a SendMessage tool call with the signal, (c) agent writes to a known file path. Need to determine the simplest reliable approach.
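For comparison purposes, option (a) from the last question could look roughly like the sketch below: the harness scans its own inbox for an unread entry whose text decodes to a ready_for_rotation payload. This is only one of the candidate approaches, and it assumes the harness inbox uses the same entry shape as the team-lead inbox, which is not established above. The names are hypothetical.

```typescript
interface InboxEntry {
  from: string;
  text: string; // expected to be a JSON-encoded payload, but may be free text
  read: boolean;
}

// True if any unread entry carries a ready_for_rotation payload.
function hasReadySignal(entries: InboxEntry[]): boolean {
  return entries.some((entry) => {
    if (entry.read) return false;
    try {
      return JSON.parse(entry.text)?.type === "ready_for_rotation";
    } catch {
      return false; // plain-text messages are not rotation signals
    }
  });
}
```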
Fallback plan: Inbox sender resolution
If writing to the inbox with "from": "iloom" without manifest registration doesn't work (message not delivered, or Claude Code errors when trying to resolve the sender), investigate by reverse-engineering the Claude Code source (cli.js):
- Trace how formatTeammateMessages (or n3z()) resolves the from field: does it look up the sender in config.json members, or just pass the string through?
- Check if the InboxPoller (ijq()) filters or validates messages against registered members before queuing them for delivery
- Check if teammate_id in the rendered XML (<teammate-message teammate_id="..." color="...">) needs to match a known agent for the LLM to process it correctly
- If registration is required, determine the minimal member entry needed: which fields are validated by Zod schema, which are optional
- Determine if agentType is validated against an enum or is freeform (existing values seen: "team-lead", "iloom-swarm-worker", "general-purpose")
References
- EPIC: Explore Implementing support for Claude's new tasks system in a way that allows for user to provide feedback to the running session from outside #492 — Deep investigation of Claude Code messaging internals, hidden CLI flags, inbox system, and task injection