Epic: Context rotation for long-running agent swarms #791

## Summary

Add context rotation to the `il spin` harness so that the swarm orchestrator (team-lead) can operate indefinitely without hitting context window limits. The harness monitors token usage by tailing the Claude Code session transcript, and when usage approaches the limit, asks the agent to checkpoint when ready. The agent decides when it's a good time, signals back, and the harness rotates the session — launching a fresh one with the same team identity and system prompt.

## Motivation

Long-running swarm orchestrators managing complex epics can exhaust their context window mid-task. Today this means the agent degrades (context compression loses detail) or stops entirely. Because iloom already persists plans, progress, and artifacts on the issue, and agents communicate via inbox files, all the durable state needed for resumption is already externalized. We just need the harness to manage the session lifecycle.

## Design

### Core Mechanism

1. **Transcript monitoring**: The `il spin` harness already knows the Claude Code session ID. It tails the session transcript JSONL (`~/.claude/projects/*/sessions/<id>/*.jsonl`), which includes `usage` data (`input_tokens`, `output_tokens`) on each message.

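2. **Threshold detection**: When context usage crosses a configurable threshold (e.g., 80% of the model's context window), the harness sends a checkpoint request. Each `usage` record reports tokens for a single request, so context occupancy should be estimated from the most recent record rather than by summing `input_tokens` across messages, which would count the same history repeatedly.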

3. **Checkpoint request (standing instruction)**: The harness writes a message to the team-lead's inbox: "When you're at a good stopping point and not finished, flush any pending state and signal that you're ready for rotation." This is a standing instruction — the agent decides when it's appropriate, not the harness.

4. **Agent signals ready**: When the orchestrator reaches a safe idle point, it flushes any in-memory state to recap/metadata/issue, then sends a "ready for rotation" signal back via inbox. The agent must **not** signal ready when:
   - There are active workers in-flight
   - A rebase or merge operation is in progress or pending
   - A child branch merge has been started but not completed
   - The epic is done (no rotation needed — just finish)

5. **Harness rotates**: On receiving the signal, the harness:
   - Shuts down the current Claude Code session
   - Rotates the session ID in the loom's metadata file
   - Calls the internals of the `il spin` command to start a new Claude session with the new session ID but the **same system prompt** and **same team identity flags**
   - Prompts the agent to continue (via the initial prompt if supported, or via an inbox message: "Continue where you left off")
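
The monitoring and threshold steps above can be sketched as a pair of pure functions. This is a minimal illustration, not the harness implementation: it assumes the `usage` object sits at `message.usage` (or top-level `usage`) on a transcript line, that the optional cache fields follow the Anthropic Messages API naming, and that the most recent record is a reasonable estimate of context occupancy. The function names and the `contextWindow` parameter are hypothetical.

```typescript
// Shape of a `usage` record. The two cache fields are optional and are an
// assumption based on the Anthropic Messages API; they may be absent.
interface Usage {
  input_tokens: number;
  output_tokens: number;
  cache_creation_input_tokens?: number;
  cache_read_input_tokens?: number;
}

// Extract the usage record from one JSONL line, or null if there is none.
// ASSUMPTION: usage lives at `message.usage` or at top-level `usage`.
function parseUsage(line: string): Usage | null {
  let entry: unknown;
  try {
    entry = JSON.parse(line);
  } catch {
    return null; // e.g. a partially written line seen while tailing
  }
  if (typeof entry !== "object" || entry === null) return null;
  const record = entry as { usage?: Usage; message?: { usage?: Usage } };
  const usage = record.message?.usage ?? record.usage;
  if (!usage) return null;
  if (typeof usage.input_tokens !== "number") return null;
  if (typeof usage.output_tokens !== "number") return null;
  return usage;
}

// Approximate tokens occupying the context after the turn that produced `u`.
function contextTokens(u: Usage): number {
  return (
    u.input_tokens +
    (u.cache_creation_input_tokens ?? 0) +
    (u.cache_read_input_tokens ?? 0) +
    u.output_tokens
  );
}

// True when the latest usage record is at or above the threshold fraction
// of the context window. Lines without usage data are skipped.
function shouldRequestCheckpoint(
  lines: string[],
  contextWindow: number,
  threshold = 0.8,
): boolean {
  let latest: Usage | null = null;
  for (const line of lines) {
    const u = parseUsage(line);
    if (u !== null) latest = u;
  }
  if (latest === null) return false;
  return contextTokens(latest) >= threshold * contextWindow;
}
```

In a real harness the `lines` array would come from tailing the transcript file; keeping the decision pure makes it easy to test against recorded transcripts.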

### Harness-to-Agent Messaging

The harness writes directly to the team-lead's inbox JSON file at `~/.claude/teams/<team-name>/inboxes/team-lead.json`. The harness uses `"from": "iloom"` to distinguish its messages from worker messages.

**Note on team manifest registration:** The team `config.json` has a `members` array where team participants are registered. Initially, try writing to the inbox **without** registering `iloom` in the manifest — the inbox poller reads all unread messages regardless. If delivery fails because Claude Code tries to resolve the sender against the manifest, fall back to registering `iloom` as a member.

**Checkpoint request message:**
```json
{
  "from": "iloom",
  "text": "{\"type\":\"checkpoint_request\",\"reason\":\"context_rotation\",\"timestamp\":\"...\"}",
  "timestamp": "2026-...",
  "read": false
}
```

**Expected response (the agent writes to a harness inbox or another known location; the transport is still undecided, see Open Questions):**
```json
{
  "type": "ready_for_rotation"
}
```

The harness must use proper file locking when writing (the inbox uses `proper-lockfile` with a `.lock` suffix). The orchestrator prompt must document that `iloom` is a known sender and that `checkpoint_request` messages should be handled as described above.
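
As a rough sketch of how the harness might construct and append the checkpoint request, the functions below build the message shown above and add it to an inbox's contents. Two things are assumptions here: that the inbox file holds a JSON array of message objects, and the helper names themselves. Locking is deliberately left out; the caller would hold the `proper-lockfile` lock around the read, append, and write.

```typescript
// One inbox entry, matching the fields in the example message above.
interface InboxMessage {
  from: string;
  text: string; // JSON-encoded payload, as in the example
  timestamp: string;
  read: boolean;
}

// Build the checkpoint request the harness sends as "iloom".
function buildCheckpointRequest(now: Date): InboxMessage {
  const timestamp = now.toISOString();
  return {
    from: "iloom",
    text: JSON.stringify({
      type: "checkpoint_request",
      reason: "context_rotation",
      timestamp,
    }),
    timestamp,
    read: false,
  };
}

// Return the new inbox file contents with `message` appended.
// ASSUMPTION: the inbox file is a JSON array; an empty file is treated as [].
// The caller is expected to hold the inbox lock while reading and writing.
function appendToInbox(inboxJson: string, message: InboxMessage): string {
  const parsed: unknown = inboxJson.trim() === "" ? [] : JSON.parse(inboxJson);
  if (!Array.isArray(parsed)) {
    throw new Error("inbox is not a JSON array");
  }
  return JSON.stringify([...parsed, message], null, 2);
}
```

Working on strings rather than file handles keeps the lock scope explicit: acquire the lock, read the file, call `appendToInbox`, write the result, release the lock.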

### Rotation Flow

```
Harness detects context threshold reached
  │
  ├─ Writes to team-lead.json inbox (from: "iloom"):
  │   { "type": "checkpoint_request", "reason": "context_rotation" }
  │
  │   ... agent continues working normally ...
  │   ... agent reaches safe idle point ...
  │
  ├─ Orchestrator (when safe):
  │   ├─ No active workers in-flight
  │   ├─ No rebase/merge in progress or pending
  │   ├─ Not done with the epic
  │   ├─ Flushes any pending state to recap/metadata
  │   └─ Signals back: { "type": "ready_for_rotation" }
  │
  ├─ Harness receives signal:
  │   ├─ Shuts down current session
  │   ├─ Rotates session ID in iloom-metadata.json
  │   └─ Calls il spin internals to launch new session:
  │       --agent-id <same-id>
  │       --agent-name team-lead
  │       --team-name <same-team>
  │       + same system prompt
  │       + continuation prompt (or inbox message: "Continue")
  │
  └─ New session:
      ├─ Fresh context window (e.g., 200k tokens)
      ├─ Same team identity → same inbox
      ├─ Reads issue for plan/progress state
      ├─ Reads inbox for pending worker messages
      └─ Continues orchestrating
```
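
The flow above can also be read as a small state machine on the harness side. The sketch below is one possible encoding, with hypothetical state, event, and action names. It treats the checkpoint request as a standing instruction, so a repeated threshold event does not resend it, and it ignores a ready signal that arrives when no request is outstanding.

```typescript
type HarnessState = "monitoring" | "checkpoint_requested" | "rotating";

type HarnessEvent =
  | { kind: "threshold_reached" }
  | { kind: "ready_for_rotation" }
  | { kind: "session_started" };

type HarnessAction = "send_checkpoint_request" | "rotate_session" | "none";

interface StepResult {
  state: HarnessState;
  action: HarnessAction;
}

// Pure transition function: given the current state and an event, return
// the next state and the side effect the harness should perform.
function step(state: HarnessState, event: HarnessEvent): StepResult {
  switch (state) {
    case "monitoring":
      if (event.kind === "threshold_reached") {
        return { state: "checkpoint_requested", action: "send_checkpoint_request" };
      }
      break;
    case "checkpoint_requested":
      // The agent decides the timing; the harness only reacts to its signal.
      if (event.kind === "ready_for_rotation") {
        return { state: "rotating", action: "rotate_session" };
      }
      break;
    case "rotating":
      if (event.kind === "session_started") {
        return { state: "monitoring", action: "none" };
      }
      break;
  }
  // Any other combination is ignored and leaves the state unchanged.
  return { state, action: "none" };
}
```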

### Team Identity Preservation via Hidden CLI Flags

Claude Code has hidden flags that establish team context for a session. The **required** flags are:

- `--agent-id <id>` — Teammate agent ID
- `--agent-name <name>` — Teammate display name
- `--team-name <name>` — Team name for swarm coordination

All three are **required together** — omitting any one causes the process to exit with an error. These call `setDynamicTeamContext`, which makes `isTeammate()` return true and enables the **InboxPoller** (1s polling interval in TUI mode).

**Optional** flags:

- `--agent-color <color>` — UI display color
- `--parent-session-id <id>` — Links this session to a parent for analytics/telemetry correlation. Not required for team functionality. Could be useful for tracing rotated sessions back to the original.
- `--agent-type <type>` — Custom agent type
- `--plan-mode-required` — Require plan mode before implementation
- `--teammate-mode <mode>` — Spawn backend: "tmux", "in-process", or "auto"

The env var `CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1` must also be set.

**Inbox location:** `~/.claude/teams/<team-name>/inboxes/<agent-name>.json` — by reusing the same `--agent-name` and `--team-name`, the new session inherits the existing inbox. No renaming, no `-2` suffix. Workers continue writing to `team-lead.json` as normal — they have no idea the session rotated.
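
To make the identity-preservation step concrete, here is a minimal sketch of assembling the launch arguments and environment for the rotated session. It uses only the flags and env var listed above; the `TeamIdentity` type and helper names are hypothetical, and how the new session ID and system prompt are passed is left to the existing `il spin` code path.

```typescript
interface TeamIdentity {
  agentId: string;
  agentName: string;
  teamName: string;
  parentSessionId?: string; // optional, for tracing rotated sessions
}

// Build the team flags. All three required values must be present, since
// omitting any one causes the process to exit with an error.
function buildTeamArgs(id: TeamIdentity): string[] {
  if (!id.agentId || !id.agentName || !id.teamName) {
    throw new Error("agentId, agentName and teamName are all required");
  }
  const args = [
    "--agent-id", id.agentId,
    "--agent-name", id.agentName,
    "--team-name", id.teamName,
  ];
  if (id.parentSessionId) {
    args.push("--parent-session-id", id.parentSessionId);
  }
  return args;
}

// Copy the environment and enable the experimental agent-teams feature.
function teamEnv(
  base: Record<string, string | undefined>,
): Record<string, string | undefined> {
  return { ...base, CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS: "1" };
}

// Inbox path derived from the identity; unchanged across rotations as long
// as agentName and teamName are reused.
function inboxPath(homeDir: string, id: TeamIdentity): string {
  return `${homeDir}/.claude/teams/${id.teamName}/inboxes/${id.agentName}.json`;
}
```

Because `inboxPath` depends only on the team and agent names, calling it before and after a rotation yields the same file, which is why workers need no protocol change.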

### What Acts as the Checkpoint

iloom's existing architecture already externalizes most state. The checkpoint step ensures anything still in-flight gets flushed:

| State | Where it lives | Needs checkpoint flush? |
|-------|---------------|------------------------|
| Implementation plan | Issue (written by `il plan`) | No — already persisted |
| Child issue status | Issue tracker | No — already persisted |
| Worker completion reports | `team-lead.json` inbox | No — already persisted |
| Per-child loom state | `iloom-metadata.json` in each worktree | No — already persisted |
| Dependency graph | Epic loom metadata | No — already persisted |
| **In-progress orchestration decisions** | Agent's context window only | **Yes — must flush to recap/metadata** |
| **Pending worker assignments** | Agent's context window only | **Yes — must flush** |

Because the agent chooses when to rotate (at a safe idle point with no active workers and no pending git operations), most of these in-flight items will already be resolved. The flush is typically minimal.
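
The "safe idle point" criteria can be stated precisely as a predicate. In this design the judgment lives in the orchestrator prompt rather than in harness code, so the sketch below is only an illustration of the contract; the `OrchestratorSnapshot` shape and function names are hypothetical.

```typescript
// A hypothetical summary of the orchestrator's state at a given moment.
interface OrchestratorSnapshot {
  activeWorkers: number;
  rebaseOrMergePending: boolean; // in progress or pending
  childMergeIncomplete: boolean; // started but not completed
  epicDone: boolean;
}

// List every reason the agent must not signal ready for rotation.
function rotationBlockers(s: OrchestratorSnapshot): string[] {
  const blockers: string[] = [];
  if (s.activeWorkers > 0) {
    blockers.push("active workers in-flight");
  }
  if (s.rebaseOrMergePending) {
    blockers.push("rebase or merge in progress or pending");
  }
  if (s.childMergeIncomplete) {
    blockers.push("child branch merge started but not completed");
  }
  if (s.epicDone) {
    blockers.push("epic is done; finish instead of rotating");
  }
  return blockers;
}

// The agent may signal ready only when nothing blocks rotation.
function maySignalReady(s: OrchestratorSnapshot): boolean {
  return rotationBlockers(s).length === 0;
}
```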

### Key Design Decisions

- **Harness monitors, agent decides timing**: The harness detects the threshold and sends the request, but the agent chooses when to checkpoint. This avoids interrupting mid-coordination when in-flight state is maximal.
- **Safe rotation criteria in orchestrator prompt**: The orchestrator prompt must explicitly instruct the agent not to signal ready during unsafe states (active workers, pending rebases/merges, incomplete branch operations). This is a prompt-level contract, not just a suggestion.
- **Minimal harness identity**: The harness writes to the inbox as `"from": "iloom"` without registering in the team manifest initially. If Claude Code requires manifest registration to deliver messages, fall back to adding an entry. The orchestrator prompt must document `iloom` as a trusted sender with specific message types (`checkpoint_request`).
- **Reuse il spin internals**: The rotation doesn't need a separate launch mechanism. Rotate the session ID in metadata, then call the same code path `il spin` uses to start a Claude session. Same system prompt, same team flags.
- **Same identity, fresh context**: By reusing the hidden team flags with the same agent name and team name, the rotated session is invisible to workers. No protocol changes needed.
- **Issue as external memory**: Plans, progress updates, and summaries are already written to the issue. A resumed agent reads the issue and has full context without needing conversation history.
- **Continuation via prompt or inbox**: Prefer passing a continuation prompt when launching the new session. If that's not supported for some reason, fall back to writing an inbox message: "Continue where you left off."
- **Configurable threshold**: Different models have different context windows. The threshold should be configurable and default to something conservative (e.g., 80%).

### Scope

- **Primary target: swarm orchestrator (team-lead)** — this is the long-running session most likely to hit context limits
- Workers may benefit too, but they typically complete before hitting limits
- Should work with any model (context window sizes vary)

## Open Questions

- What's the right default threshold? 80% of context window? Or should we wait for compression to kick in and rotate after N compressions?
- Should we track rotation count in metadata for observability/telemetry?
- What if the agent never signals ready (e.g., it's stuck in an infinite loop with workers)? Should the harness have a hard ceiling / force-rotation fallback?
- Does the orchestrator prompt need a dedicated section for handling `checkpoint_request` messages, or can it be folded into existing orchestrator instructions?
- Should the continuation prompt include a summary of recent activity, or is "read the issue and inbox" sufficient?
- How does the harness read the agent's `ready_for_rotation` response? Options: (a) harness has its own inbox file that the agent writes to, (b) harness tails the transcript for a SendMessage tool call with the signal, (c) agent writes to a known file path. Need to determine the simplest reliable approach.
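
Whichever transport is chosen for the response, the harness-side decision while waiting could look roughly like the sketch below. It is not a proposal to settle the questions above: the hard ceiling is optional (pass `null` to disable it), all names are hypothetical, and a forced rotation would bypass the agent's safety criteria, so it would need its own safeguards.

```typescript
// True if `raw` is a JSON object whose `type` is "ready_for_rotation".
function isReadySignal(raw: string): boolean {
  try {
    const parsed: unknown = JSON.parse(raw);
    return (
      typeof parsed === "object" &&
      parsed !== null &&
      (parsed as { type?: unknown }).type === "ready_for_rotation"
    );
  } catch {
    return false; // malformed or partially written signal
  }
}

type WaitDecision = "rotate" | "force_rotate" | "keep_waiting";

// Decide what to do after a checkpoint request has been sent.
// `signal` is the raw response if one has been seen, else null.
// `hardCeilingMs` is the optional force-rotation timeout, or null for none.
function decideWhileWaiting(
  signal: string | null,
  requestedAtMs: number,
  nowMs: number,
  hardCeilingMs: number | null,
): WaitDecision {
  if (signal !== null && isReadySignal(signal)) return "rotate";
  if (hardCeilingMs !== null && nowMs - requestedAtMs >= hardCeilingMs) {
    return "force_rotate";
  }
  return "keep_waiting";
}
```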

### Fallback plan: Inbox sender resolution

If writing to the inbox with `"from": "iloom"` without manifest registration doesn't work (message not delivered, or Claude Code errors when trying to resolve the sender), investigate by reverse-engineering the Claude Code source (`cli.js`):

1. Trace how `formatTeammateMessages` (or `n3z()`) resolves the `from` field — does it look up the sender in `config.json` members, or just pass the string through?
2. Check if the InboxPoller (`ijq()`) filters or validates messages against registered members before queuing them for delivery
3. Check if `teammate_id` in the rendered XML (`<teammate-message teammate_id="..." color="...">`) needs to match a known agent for the LLM to process it correctly
4. If registration is required, determine the minimal member entry needed — which fields are validated by Zod schema, which are optional
5. Determine if `agentType` is validated against an enum or is freeform (existing values seen: `"team-lead"`, `"iloom-swarm-worker"`, `"general-purpose"`)

## References

- #492 — Deep investigation of Claude Code messaging internals, hidden CLI flags, inbox system, and task injection


