[Gastown] Polecat stuck in zombie hook state after clone failure causes cascading sling failures #411

@jrf0110

Parent Issue

#204 (Gastown Cloud — Phase 2)

Bug Description

When a polecat agent fails to start in the container (e.g., git clone fails because the repo's default branch doesn't match), the polecat is left in a zombie hook state: status=idle but current_hook_bead_id is still set. Subsequent slingBead calls pick this agent (it looks idle) but fail with a hook conflict.

Root Cause Chain

  1. Rig created with wrong default branch: User creates a rig with defaultBranch: 'main' but the repo actually uses master. This can happen because the CreateRigDialog defaults to 'main' and the user doesn't change it.

  2. Clone failure in container: startAgentInContainer sends the start request to the container, which runs git clone --no-checkout --branch main <url>. The clone fails with: fatal: Remote branch main not found in upstream origin.

  3. Agent left in zombie state: The polecat was already created and hooked to the bead in slingBead() before the container start was attempted. When the container rejects the start, schedulePendingWork logs the failure and retries on the next alarm, but the agent remains hooked to the bead with status=idle.

  4. Cascading failures: The next slingBead call finds this zombie polecat (it reports status=idle), tries to hook a new bead to it, and fails with "Agent is already hooked to bead <old-bead-id>".
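The chain above can be reproduced with a small in-memory model. This is only a sketch: the Agent type, pickAgent, and hookBead below are hypothetical stand-ins written for illustration, not the actual Rig code, and the selection logic assumes only what the issue describes (idle agents are preferred and hook state is ignored).

```typescript
// Hypothetical in-memory model; names and shapes are assumptions for illustration.
type Agent = {
  id: string;
  role: string;
  status: "idle" | "working";
  currentHookBeadId: string | null;
};

// Refuses to hook an agent that already holds a bead, as the real hookBead reportedly does.
function hookBead(agent: Agent, beadId: string): void {
  if (agent.currentHookBeadId !== null) {
    throw new Error(`Agent is already hooked to bead ${agent.currentHookBeadId}`);
  }
  agent.currentHookBeadId = beadId;
}

// Mirrors the current selection: prefer idle agents, never look at the hook.
function pickAgent(agents: Agent[], role: string): Agent | undefined {
  return agents
    .filter((a) => a.role === role)
    .sort((a, b) => Number(a.status !== "idle") - Number(b.status !== "idle"))[0];
}

const agents: Agent[] = [
  { id: "maple", role: "polecat", status: "idle", currentHookBeadId: null },
];

// First sling: the hook succeeds, then the container start fails and nothing is rolled back,
// so the agent stays idle while still holding bead-1.
hookBead(pickAgent(agents, "polecat")!, "bead-1");

// Second sling: the same agent still looks idle, is picked again, and the hook conflicts.
let conflict = "";
try {
  hookBead(pickAgent(agents, "polecat")!, "bead-2");
} catch (err) {
  conflict = (err as Error).message;
}
```

After the second attempt, conflict holds the same message shape as the hook conflict reported in the logs.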

Logs

[Rig.do] startAgentInContainer: error response: {"error":"git clone --no-checkout --branch main https://github.com/jrf0110/8track.git ... failed: fatal: Remote branch main not found"}
[Rig.do] schedulePendingWork: FAILED to start agent in container (attempt 1/5)
...
[Rig.do] getOrCreateAgent: found existing agent id=... name=Maple role=polecat status=idle current_hook=d97de1ce-...
[Rig.do] getOrCreateAgent: returning existing agent (idle=true, singleton=false)
[Rig.do] hookBead: CONFLICT - agent ... already hooked to d97de1ce-...

Fix Required

Two issues need to be addressed, plus one preventive improvement:

1. getOrCreateAgent should skip agents with existing hooks when looking for idle polecats

The query currently sorts by status=idle first, but doesn't filter out agents that still have current_hook_bead_id set. An agent with status=idle AND a non-null hook is in an inconsistent state — it should either be cleaned up or skipped.

-- Current (broken)
WHERE role = ? ORDER BY CASE WHEN status = 'idle' THEN 0 ELSE 1 END

-- Fixed
WHERE role = ? AND (status != 'idle' OR current_hook_bead_id IS NULL)
ORDER BY CASE WHEN status = 'idle' THEN 0 ELSE 1 END

Alternatively, getOrCreateAgent could unhook the zombie agent before returning it.
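The fixed WHERE clause can be expressed as a plain predicate, which makes the intended behavior easy to unit test. This is a sketch under assumptions: AgentRow and selectAgent are hypothetical names, and the column names are taken from the query and log lines in this issue rather than from the real schema.

```typescript
// Hypothetical row shape; column names follow the SQL and logs quoted in this issue.
type AgentRow = {
  id: string;
  role: string;
  status: string;
  current_hook_bead_id: string | null;
};

// True for rows the fixed WHERE clause would keep:
// role matches AND (status != 'idle' OR current_hook_bead_id IS NULL).
function isSelectable(row: AgentRow, role: string): boolean {
  return row.role === role && (row.status !== "idle" || row.current_hook_bead_id === null);
}

// Applies the filter, then the same ordering as the query (idle rows first).
function selectAgent(rows: AgentRow[], role: string): AgentRow | undefined {
  const rank = (r: AgentRow): number => (r.status === "idle" ? 0 : 1);
  return rows
    .filter((r) => isSelectable(r, role))
    .sort((a, b) => rank(a) - rank(b))[0];
}

const rows: AgentRow[] = [
  { id: "zombie", role: "polecat", status: "idle", current_hook_bead_id: "d97de1ce" },
  { id: "busy", role: "polecat", status: "working", current_hook_bead_id: "bead-2" },
  { id: "clean", role: "polecat", status: "idle", current_hook_bead_id: null },
];

const picked = selectAgent(rows, "polecat");
const onlyZombie = selectAgent([rows[0]], "polecat");
```

With the zombie filtered out, picked is the clean idle agent, and onlyZombie is undefined, which is the case where the caller would presumably create a fresh polecat.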

2. Failed container starts should unhook the agent

In schedulePendingWork, when startAgentInContainer fails, the error is logged but the agent is not unhooked. The bead stays in_progress with the agent still hooked, creating the zombie state. On failure, the agent should be unhooked and the bead reverted to open.
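A minimal sketch of the rollback, assuming the container start returns a result object rather than throwing. StartResult, PolecatAgent, Bead, and startOrRollback are hypothetical names; the real call is presumably asynchronous and is modelled synchronously here for brevity.

```typescript
// Hypothetical shapes; field names follow the issue text, not the real schema.
type StartResult = { ok: true } | { ok: false; error: string };
type PolecatAgent = {
  id: string;
  status: "idle" | "working";
  currentHookBeadId: string | null;
};
type Bead = { id: string; status: "open" | "in_progress" };

// Attempts the start; on failure, releases the hook and reopens the bead
// so that neither side is left pointing at the other.
function startOrRollback(
  agent: PolecatAgent,
  bead: Bead,
  start: () => StartResult,
): StartResult {
  const result = start();
  if (result.ok) {
    agent.status = "working";
    return result;
  }
  agent.currentHookBeadId = null;
  agent.status = "idle";
  bead.status = "open";
  return result;
}

const agent: PolecatAgent = { id: "maple", status: "idle", currentHookBeadId: "bead-1" };
const bead: Bead = { id: "bead-1", status: "in_progress" };

// Simulate the clone failure reported by the container.
const outcome = startOrRollback(agent, bead, () => ({
  ok: false,
  error: "fatal: Remote branch main not found",
}));
```

Whether the rollback should run on the first failure or only after the final retry (the logs show "attempt 1/5") is a design choice this sketch does not settle.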

3. (Preventive) Auto-detect default branch

The CreateRigDialog defaults to main, but many repos use master. Consider using the GitHub/GitLab API to fetch the actual default branch when a repo is selected from integrations, or running git ls-remote --symref <url> HEAD during rig creation to detect it.
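For the git ls-remote route, the symref line in the command's output names the branch that HEAD points to. The parser below is a sketch: parseDefaultBranch is a hypothetical helper, running the command itself is left out, and the sample output uses a placeholder hash.

```typescript
// Parses the stdout of `git ls-remote --symref <url> HEAD`.
// With --symref, git prints a line of the form "ref: refs/heads/<branch>\tHEAD"
// before the usual "<sha>\tHEAD" line. Returns null if no such line is present.
function parseDefaultBranch(lsRemoteOutput: string): string | null {
  for (const line of lsRemoteOutput.split("\n")) {
    const match = /^ref: refs\/heads\/(\S+)\s+HEAD$/.exec(line.trim());
    if (match) {
      return match[1];
    }
  }
  return null;
}

// Sample output for a repo whose default branch is master (placeholder hash).
const sampleOutput =
  "ref: refs/heads/master\tHEAD\n" +
  "0000000000000000000000000000000000000000\tHEAD\n";

const detected = parseDefaultBranch(sampleOutput);
const missing = parseDefaultBranch("");
```

If detection returns null (for example, an empty repository), falling back to the value entered in the dialog keeps the current behavior.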

Metadata

Labels: bug, kilo-auto-fix, kilo-triaged
