Epic: Context rotation for long-running agent swarms #791

@acreeger

Summary

Add context rotation to the il spin harness so that the swarm orchestrator (team-lead) can operate indefinitely without hitting context window limits. The harness monitors token usage by tailing the Claude Code session transcript, and when usage approaches the limit, asks the agent to checkpoint when ready. The agent decides when it's a good time, signals back, and the harness rotates the session — launching a fresh one with the same team identity and system prompt.

Motivation

Long-running swarm orchestrators managing complex epics can exhaust their context window mid-task. Today this means the agent degrades (context compression loses detail) or stops entirely. Because iloom already persists plans, progress, and artifacts on the issue, and agents communicate via inbox files, all the durable state needed for resumption is already externalized. We just need the harness to manage the session lifecycle.

Design

Core Mechanism

  1. Transcript monitoring: The il spin harness already knows the Claude Code session ID. It tails the session transcript JSONL (~/.claude/projects/*/sessions/<id>/*.jsonl), which includes usage data (input_tokens, output_tokens) on each message.

  2. Threshold detection: When the session's token usage crosses a configurable threshold (e.g., 80% of the model's context window), the harness sends a checkpoint request.

  3. Checkpoint request (standing instruction): The harness writes a message to the team-lead's inbox: "When you're at a good stopping point and not finished, flush any pending state and signal that you're ready for rotation." This is a standing instruction — the agent decides when it's appropriate, not the harness.

  4. Agent signals ready: When the orchestrator reaches a safe idle point, it flushes any in-memory state to recap/metadata/issue, then sends a "ready for rotation" signal back via inbox. The agent must not signal ready when:

    • There are active workers in-flight
    • A rebase or merge operation is in progress or pending
    • A child branch merge has been started but not completed
    • The epic is done (no rotation needed — just finish)
  5. Harness rotates: On receiving the signal, the harness:

    • Shuts down the current Claude Code session
    • Rotates the session ID in the loom's metadata file
    • Calls the internals of the il spin command to start a new Claude session with the new session ID but the same system prompt and same team identity flags
    • Prompts the agent to continue (via the initial prompt if supported, or via an inbox message: "Continue where you left off")
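
Steps 1 and 2 above could be sketched roughly as follows. This is a minimal illustration, not the harness implementation. The names (usageFromLine, estimateContextTokens, shouldRequestCheckpoint) are hypothetical, and the nesting of usage under message plus the cache_* fields are assumptions about the transcript format. The sketch treats the most recent usage record as the current context size rather than summing across messages, on the assumption that each request re-sends the conversation so far.

```typescript
// Assumed shape of usage data on transcript entries. Only input_tokens and
// output_tokens are named in the design; the cache_* fields are assumptions.
interface Usage {
  input_tokens?: number;
  output_tokens?: number;
  cache_read_input_tokens?: number;
  cache_creation_input_tokens?: number;
}

// Parse one JSONL line; return its usage record, or null if the line is
// malformed or carries no usage data (assumed to live at entry.message.usage).
function usageFromLine(line: string): Usage | null {
  try {
    const entry = JSON.parse(line) as { message?: { usage?: Usage } };
    return entry.message?.usage ?? null;
  } catch {
    return null; // partially written lines are expected while tailing
  }
}

// Estimate context occupancy from the most recent line that has usage data.
function estimateContextTokens(lines: string[]): number {
  for (const line of [...lines].reverse()) {
    const u = usageFromLine(line);
    if (u) {
      return (
        (u.input_tokens ?? 0) +
        (u.cache_read_input_tokens ?? 0) +
        (u.cache_creation_input_tokens ?? 0) +
        (u.output_tokens ?? 0)
      );
    }
  }
  return 0;
}

// True once usage reaches the configured fraction of the context window.
function shouldRequestCheckpoint(
  usedTokens: number,
  contextWindow: number,
  thresholdRatio = 0.8,
): boolean {
  return usedTokens >= contextWindow * thresholdRatio;
}
```

Keeping the parsing tolerant of malformed lines matters because the harness reads a file that another process is still appending to.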

Harness-to-Agent Messaging

The harness writes directly to the team-lead's inbox JSON file at ~/.claude/teams/<team-name>/inboxes/team-lead.json. The harness uses "from": "iloom" to distinguish its messages from worker messages.

Note on team manifest registration: The team config.json has a members array where team participants are registered. Initially, try writing to the inbox without registering iloom in the manifest — the inbox poller reads all unread messages regardless. If delivery fails because Claude Code tries to resolve the sender against the manifest, fall back to registering iloom as a member.

Checkpoint request message:

{
  "from": "iloom",
  "text": "{\"type\":\"checkpoint_request\",\"reason\":\"context_rotation\",\"timestamp\":\"...\"}",
  "timestamp": "2026-...",
  "read": false
}

Expected response (agent writes to harness inbox or a known location):

{
  "type": "ready_for_rotation"
}

The harness must use proper file locking when writing (the inbox uses proper-lockfile with a .lock suffix). The orchestrator prompt must document that iloom is a known sender and that checkpoint_request messages should be handled as described above.
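
As a rough sketch of the message handling described above: the helpers below build the checkpoint request, append it to the inbox contents, and recognize the agent's reply. All function names are hypothetical, and the assumption that the inbox file holds a JSON array of messages is not confirmed by this issue. File I/O and locking are deliberately left out; in the harness, the read-modify-write around appendToInbox would need to happen while holding the inbox lock.

```typescript
// Envelope fields as shown in the checkpoint request example.
interface InboxMessage {
  from: string;
  text: string; // JSON-encoded payload
  timestamp: string;
  read: boolean;
}

// Build the harness's checkpoint request with a matching inner timestamp.
function buildCheckpointRequest(now: Date = new Date()): InboxMessage {
  const timestamp = now.toISOString();
  return {
    from: "iloom",
    text: JSON.stringify({
      type: "checkpoint_request",
      reason: "context_rotation",
      timestamp,
    }),
    timestamp,
    read: false,
  };
}

// Return new inbox file contents with the message appended.
// Assumption: the inbox file is a JSON array of messages.
// The caller is expected to hold the file lock while reading and writing.
function appendToInbox(inboxJson: string, message: InboxMessage): string {
  const parsed: unknown = inboxJson.trim() === "" ? [] : JSON.parse(inboxJson);
  if (!Array.isArray(parsed)) {
    throw new Error("inbox is not a JSON array");
  }
  return JSON.stringify([...parsed, message], null, 2);
}

// Recognize the agent's reply payload; anything unparseable is not a signal.
function isReadyForRotation(raw: string): boolean {
  try {
    const payload = JSON.parse(raw) as { type?: unknown } | null;
    return payload?.type === "ready_for_rotation";
  } catch {
    return false;
  }
}
```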

Rotation Flow

Harness detects context threshold reached
  │
  ├─ Writes to team-lead.json inbox (from: "iloom"):
  │   { "type": "checkpoint_request", "reason": "context_rotation" }
  │
  │   ... agent continues working normally ...
  │   ... agent reaches safe idle point ...
  │
  ├─ Orchestrator (when safe):
  │   ├─ No active workers in-flight
  │   ├─ No rebase/merge in progress or pending
  │   ├─ Not done with the epic
  │   ├─ Flushes any pending state to recap/metadata
  │   └─ Signals back: { "type": "ready_for_rotation" }
  │
  ├─ Harness receives signal:
  │   ├─ Shuts down current session
  │   ├─ Rotates session ID in iloom-metadata.json
  │   └─ Calls il spin internals to launch new session:
  │       --agent-id <same-id>
  │       --agent-name team-lead
  │       --team-name <same-team>
  │       + same system prompt
  │       + continuation prompt (or inbox message: "Continue")
  │
  └─ New session:
      ├─ Fresh context window (e.g., 200k tokens)
      ├─ Same team identity → same inbox
      ├─ Reads issue for plan/progress state
      ├─ Reads inbox for pending worker messages
      └─ Continues orchestrating
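
One way to read the flow above is as a small state machine on the harness side. The sketch below is illustrative only; the state, event, and action names are hypothetical and not taken from the il spin code. It shows one property worth preserving: once a checkpoint request is outstanding, further threshold events do not send another one, since the request is a standing instruction.

```typescript
type HarnessState = "monitoring" | "checkpoint_requested" | "rotating";

type HarnessEvent =
  | { kind: "threshold_reached" }
  | { kind: "ready_for_rotation" }
  | { kind: "session_started" };

type HarnessAction = "send_checkpoint_request" | "rotate_session" | "none";

interface Transition {
  next: HarnessState;
  action: HarnessAction;
}

// Pure transition function: given the current state and an event, return the
// next state and the side effect the harness should perform.
function step(state: HarnessState, event: HarnessEvent): Transition {
  switch (state) {
    case "monitoring":
      if (event.kind === "threshold_reached") {
        return { next: "checkpoint_requested", action: "send_checkpoint_request" };
      }
      break;
    case "checkpoint_requested":
      // The agent decides the timing; the harness only reacts to its signal.
      if (event.kind === "ready_for_rotation") {
        return { next: "rotating", action: "rotate_session" };
      }
      break;
    case "rotating":
      if (event.kind === "session_started") {
        return { next: "monitoring", action: "none" };
      }
      break;
  }
  // Any other combination is ignored and leaves the state unchanged.
  return { next: state, action: "none" };
}
```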

Team Identity Preservation via Hidden CLI Flags

Claude Code has hidden flags that establish team context for a session. The required flags are:

  • --agent-id <id> — Teammate agent ID
  • --agent-name <name> — Teammate display name
  • --team-name <name> — Team name for swarm coordination

All three are required together — omitting any one causes the process to exit with an error. These call setDynamicTeamContext, which makes isTeammate() return true and enables the InboxPoller (1s polling interval in TUI mode).

Optional flags:

  • --agent-color <color> — UI display color
  • --parent-session-id <id> — Links this session to a parent for analytics/telemetry correlation. Not required for team functionality. Could be useful for tracing rotated sessions back to the original.
  • --agent-type <type> — Custom agent type
  • --plan-mode-required — Require plan mode before implementation
  • --teammate-mode <mode> — Spawn backend: "tmux", "in-process", or "auto"

The env var CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1 must also be set.

Inbox location: ~/.claude/teams/<team-name>/inboxes/<agent-name>.json — by reusing the same --agent-name and --team-name, the new session inherits the existing inbox. No renaming, no -2 suffix. Workers continue writing to team-lead.json as normal — they have no idea the session rotated.
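
To make the identity reuse concrete, here is a small sketch of assembling the launch arguments and inbox path for a rotated session. The helper names (buildTeamLaunch, inboxPath) and the TeamIdentity type are hypothetical; the flags, environment variable, and path layout are the ones listed above. Passing the previous session ID as --parent-session-id is only the tracing idea mentioned above, not a requirement.

```typescript
interface TeamIdentity {
  agentId: string;
  agentName: string;
  teamName: string;
  parentSessionId?: string; // optional, for tracing rotated sessions
}

// Build the CLI arguments and environment for a session with team context.
// All three identity flags must be present together.
function buildTeamLaunch(identity: TeamIdentity): {
  args: string[];
  env: Record<string, string>;
} {
  const { agentId, agentName, teamName, parentSessionId } = identity;
  if (!agentId || !agentName || !teamName) {
    throw new Error("agentId, agentName, and teamName are all required");
  }
  const args = [
    "--agent-id", agentId,
    "--agent-name", agentName,
    "--team-name", teamName,
  ];
  if (parentSessionId) {
    args.push("--parent-session-id", parentSessionId);
  }
  return { args, env: { CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS: "1" } };
}

// Inbox path derived from team and agent name; the same names give the same file.
function inboxPath(homeDir: string, teamName: string, agentName: string): string {
  return `${homeDir}/.claude/teams/${teamName}/inboxes/${agentName}.json`;
}
```

Because the inbox path depends only on the team name and agent name, calling these helpers with the same identity before and after rotation yields the same inbox file.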

What Acts as the Checkpoint

iloom's existing architecture already externalizes most state. The checkpoint step ensures anything still in-flight gets flushed:

| State | Where it lives | Needs checkpoint flush? |
| --- | --- | --- |
| Implementation plan | Issue (written by il plan) | No (already persisted) |
| Child issue status | Issue tracker | No (already persisted) |
| Worker completion reports | team-lead.json inbox | No (already persisted) |
| Per-child loom state | iloom-metadata.json in each worktree | No (already persisted) |
| Dependency graph | Epic loom metadata | No (already persisted) |
| In-progress orchestration decisions | Agent's context window only | Yes (must flush to recap/metadata) |
| Pending worker assignments | Agent's context window only | Yes (must flush) |

Because the agent chooses when to rotate (at a safe idle point with no active workers and no pending git operations), most of these in-flight items will already be resolved. The flush is typically minimal.

Key Design Decisions

  • Harness monitors, agent decides timing: The harness detects the threshold and sends the request, but the agent chooses when to checkpoint. This avoids interrupting mid-coordination when in-flight state is maximal.
  • Safe rotation criteria in orchestrator prompt: The orchestrator prompt must explicitly instruct the agent not to signal ready during unsafe states (active workers, pending rebases/merges, incomplete branch operations). This is a prompt-level contract, not just a suggestion.
  • Minimal harness identity: The harness writes to the inbox as "from": "iloom" without registering in the team manifest initially. If Claude Code requires manifest registration to deliver messages, fall back to adding an entry. The orchestrator prompt must document iloom as a trusted sender with specific message types (checkpoint_request).
  • Reuse il spin internals: The rotation doesn't need a separate launch mechanism. Rotate the session ID in metadata, then call the same code path il spin uses to start a Claude session. Same system prompt, same team flags.
  • Same identity, fresh context: By reusing the hidden team flags with the same agent name and team name, the rotated session is invisible to workers. No protocol changes needed.
  • Issue as external memory: Plans, progress updates, and summaries are already written to the issue. A resumed agent reads the issue and has full context without needing conversation history.
  • Continuation via prompt or inbox: Prefer passing a continuation prompt when launching the new session. If the launch path does not support an initial prompt, fall back to writing an inbox message: "Continue where you left off."
  • Configurable threshold: Different models have different context windows. The threshold should be configurable and default to something conservative (e.g., 80%).
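
The safe rotation criteria are a prompt-level contract, so nothing in the harness has to evaluate them. Still, writing them down as a predicate can help keep the prompt wording and any harness-side sanity checks consistent. The sketch below is purely illustrative; OrchestratorSnapshot and both function names are hypothetical, and the harness may not have access to all of these fields.

```typescript
// Hypothetical snapshot of the orchestrator's situation.
interface OrchestratorSnapshot {
  activeWorkers: number;
  gitOperationPending: boolean; // rebase or merge in progress or pending
  childMergeIncomplete: boolean; // child branch merge started, not completed
  epicDone: boolean;
}

// List every reason the agent should not signal ready right now.
function rotationBlockers(s: OrchestratorSnapshot): string[] {
  const blockers: string[] = [];
  if (s.activeWorkers > 0) blockers.push("active workers in flight");
  if (s.gitOperationPending) blockers.push("rebase or merge in progress or pending");
  if (s.childMergeIncomplete) blockers.push("child branch merge started but not completed");
  if (s.epicDone) blockers.push("epic is done; finish instead of rotating");
  return blockers;
}

// Safe to signal ready only when no blocker applies.
function isSafeToSignalReady(s: OrchestratorSnapshot): boolean {
  return rotationBlockers(s).length === 0;
}
```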

Scope

  • Primary target: swarm orchestrator (team-lead) — this is the long-running session most likely to hit context limits
  • Workers may benefit too, but they typically complete before hitting limits
  • Should work with any model (context window sizes vary)

Open Questions

  • What's the right default threshold? 80% of context window? Or should we wait for compression to kick in and rotate after N compressions?
  • Should we track rotation count in metadata for observability/telemetry?
  • What if the agent never signals ready (e.g., it's stuck in an infinite loop with workers)? Should the harness have a hard ceiling / force-rotation fallback?
  • Does the orchestrator prompt need a dedicated section for handling checkpoint_request messages, or can it be folded into existing orchestrator instructions?
  • Should the continuation prompt include a summary of recent activity, or is "read the issue and inbox" sufficient?
  • How does the harness read the agent's ready_for_rotation response? Options: (a) harness has its own inbox file that the agent writes to, (b) harness tails the transcript for a SendMessage tool call with the signal, (c) agent writes to a known file path. Need to determine the simplest reliable approach.

Fallback plan: Inbox sender resolution

If writing to the inbox with "from": "iloom" without manifest registration doesn't work (message not delivered, or Claude Code errors when trying to resolve the sender), investigate by reverse-engineering the Claude Code source (cli.js):

  1. Trace how formatTeammateMessages (or n3z()) resolves the from field — does it look up the sender in config.json members, or just pass the string through?
  2. Check if the InboxPoller (ijq()) filters or validates messages against registered members before queuing them for delivery
  3. Check if teammate_id in the rendered XML (<teammate-message teammate_id="..." color="...">) needs to match a known agent for the LLM to process it correctly
  4. If registration is required, determine the minimal member entry needed — which fields are validated by Zod schema, which are optional
  5. Determine if agentType is validated against an enum or is freeform (existing values seen: "team-lead", "iloom-swarm-worker", "general-purpose")
