Description
Parent Issue
#204 (Gastown Cloud — Phase 2)
Bug Description
When a polecat agent fails to start in the container (e.g., git clone fails because the repo's default branch doesn't match), the polecat is left in a zombie hook state: status=idle but current_hook_bead_id is still set. Subsequent slingBead calls pick this agent (it looks idle) but fail with a hook conflict.
Root Cause Chain
1. Rig created with wrong default branch: User creates a rig with defaultBranch: 'main' but the repo actually uses master. This can happen because the CreateRigDialog defaults to 'main' and the user doesn't change it.
2. Clone failure in container: startAgentInContainer sends the start request to the container, which calls git clone --no-checkout --branch main <url> → fatal: Remote branch main not found in upstream origin.
3. Agent left in zombie state: The polecat was already created and hooked to the bead in slingBead() before the container start was attempted. When the container rejects the start, schedulePendingWork logs the failure and retries on the next alarm, but the agent remains hooked to the bead with status=idle.
4. Cascading failures: The next slingBead call finds this zombie polecat (idle), tries to hook a new bead, and gets Agent is already hooked to bead <old-bead-id>.
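The chain above can be reproduced with a small in-memory model. This is only a sketch: the field names status and current_hook_bead_id come from this issue, while the Agent type, the simplified hookBead, and slingBeadWithFailedStart are hypothetical stand-ins for the real Rig code.

```typescript
// Hypothetical in-memory model; only `status` and `current_hook_bead_id`
// are taken from the issue, everything else is assumed for illustration.
type Agent = {
  id: string;
  role: "polecat";
  status: "idle" | "working";
  current_hook_bead_id: string | null;
};

// Simplified stand-in for hookBead: refuses to hook an already-hooked agent.
function hookBead(agent: Agent, beadId: string): void {
  if (agent.current_hook_bead_id !== null) {
    throw new Error(`Agent is already hooked to bead ${agent.current_hook_bead_id}`);
  }
  agent.current_hook_bead_id = beadId;
}

// Models slingBead when the container start fails: the hook is set first,
// and nothing clears it once the start is rejected.
function slingBeadWithFailedStart(agent: Agent, beadId: string): void {
  hookBead(agent, beadId);
  // The container start fails here; status stays "idle" and the hook stays set.
}

const maple: Agent = { id: "a1", role: "polecat", status: "idle", current_hook_bead_id: null };
slingBeadWithFailedStart(maple, "bead-1");

let conflict = "";
try {
  // The next sling picks the same agent because it still looks idle.
  hookBead(maple, "bead-2");
} catch (err) {
  conflict = (err as Error).message;
}
console.log(conflict);
```

The agent ends up with status idle and a non-null hook, which is the zombie state described in the bug description.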
Logs
[Rig.do] startAgentInContainer: error response: {"error":"git clone --no-checkout --branch main https://github.com/jrf0110/8track.git ... failed: fatal: Remote branch main not found"}
[Rig.do] schedulePendingWork: FAILED to start agent in container (attempt 1/5)
...
[Rig.do] getOrCreateAgent: found existing agent id=... name=Maple role=polecat status=idle current_hook=d97de1ce-...
[Rig.do] getOrCreateAgent: returning existing agent (idle=true, singleton=false)
[Rig.do] hookBead: CONFLICT - agent ... already hooked to d97de1ce-...
Fix Required
Two issues need to be addressed, along with one preventive improvement:
1. getOrCreateAgent should skip agents with existing hooks when looking for idle polecats
The query currently sorts by status=idle first, but doesn't filter out agents that still have current_hook_bead_id set. An agent with status=idle AND a non-null hook is in an inconsistent state — it should either be cleaned up or skipped.
-- Current (broken)
WHERE role = ? ORDER BY CASE WHEN status = 'idle' THEN 0 ELSE 1 END
-- Fixed
WHERE role = ? AND (status != 'idle' OR current_hook_bead_id IS NULL)
ORDER BY CASE WHEN status = 'idle' THEN 0 ELSE 1 END
Or: getOrCreateAgent should unhook the zombie agent before returning it.
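For reference, the fixed predicate and ordering can be expressed over in-memory rows. This is a sketch for reasoning about which agents the query would return, not the actual getOrCreateAgent code; AgentRow, isSelectable, and pickCandidates are hypothetical names.

```typescript
// Hypothetical row shape; column names follow the query in this issue.
type AgentRow = {
  id: string;
  role: string;
  status: string;
  current_hook_bead_id: string | null;
};

// Mirrors: WHERE role = ? AND (status != 'idle' OR current_hook_bead_id IS NULL)
function isSelectable(agent: AgentRow, role: string): boolean {
  return agent.role === role && (agent.status !== "idle" || agent.current_hook_bead_id === null);
}

// Mirrors: ORDER BY CASE WHEN status = 'idle' THEN 0 ELSE 1 END
function pickCandidates(agents: AgentRow[], role: string): AgentRow[] {
  return agents
    .filter((a) => isSelectable(a, role))
    .sort((a, b) => Number(a.status !== "idle") - Number(b.status !== "idle"));
}

const rows: AgentRow[] = [
  { id: "zombie", role: "polecat", status: "idle", current_hook_bead_id: "bead-1" },
  { id: "busy", role: "polecat", status: "working", current_hook_bead_id: "bead-2" },
  { id: "free", role: "polecat", status: "idle", current_hook_bead_id: null },
  { id: "other", role: "mayor", status: "idle", current_hook_bead_id: null },
];

const ordered = pickCandidates(rows, "polecat").map((a) => a.id);
console.log(ordered);
```

The zombie row (idle with a hook) is excluded, and the truly idle agent sorts ahead of the working one.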
2. Failed container starts should unhook the agent
In schedulePendingWork, when startAgentInContainer fails, the error is logged but the agent is not unhooked. The bead stays in_progress with the agent still hooked, creating the zombie state. On failure, the agent should be unhooked and the bead reverted to open.
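A minimal sketch of the cleanup on failure, assuming in-memory objects rather than the real storage layer. HookedAgent, Bead, releaseHook, and startOrRelease are hypothetical names, and the start callback stands in for the startAgentInContainer call, whose real signature is not shown in this issue. Whether the release should run on every failed attempt or only after the last retry is a design decision the issue leaves open.

```typescript
// Hypothetical shapes; only `status`, `current_hook_bead_id`, and the bead
// states `open` / `in_progress` come from the issue.
type HookedAgent = { id: string; status: "idle" | "working"; current_hook_bead_id: string | null };
type Bead = { id: string; status: "open" | "in_progress" };

// Undo the hook so neither side is left pointing at a failed start.
function releaseHook(agent: HookedAgent, bead: Bead): void {
  agent.current_hook_bead_id = null;
  agent.status = "idle";
  bead.status = "open";
}

// `start` is an assumed stand-in for the container start request.
async function startOrRelease(
  agent: HookedAgent,
  bead: Bead,
  start: () => Promise<void>,
): Promise<boolean> {
  try {
    await start();
    agent.status = "working";
    return true;
  } catch (err) {
    console.error("FAILED to start agent in container:", err);
    releaseHook(agent, bead);
    return false;
  }
}

// Synchronous demonstration of the cleanup step on its own.
const stuckAgent: HookedAgent = { id: "a1", status: "idle", current_hook_bead_id: "bead-1" };
const stuckBead: Bead = { id: "bead-1", status: "in_progress" };
releaseHook(stuckAgent, stuckBead);
console.log(stuckAgent.current_hook_bead_id, stuckBead.status);
```

After the release, the agent is idle with no hook and the bead is open again, so a later slingBead can pick either of them up cleanly.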
3. (Preventive) Auto-detect default branch
The CreateRigDialog defaults to main, but many repos use master. Consider using the GitHub/GitLab API to fetch the actual default branch when a repo is selected from integrations, or running git ls-remote --symref <url> HEAD during rig creation to detect it.
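If the git ls-remote route is taken, the default branch can be read from the symref line of the output. The sketch below only parses that output; how the command is executed during rig creation is left out, and parseDefaultBranch is a hypothetical helper. The sample SHA is a placeholder, not a real commit.

```typescript
// Parses the output of `git ls-remote --symref <url> HEAD`, which starts with
// a line of the form "ref: refs/heads/<branch>\tHEAD".
function parseDefaultBranch(lsRemoteOutput: string): string | null {
  for (const line of lsRemoteOutput.split("\n")) {
    const match = /^ref: refs\/heads\/(\S+)\s+HEAD$/.exec(line.trim());
    if (match) {
      return match[1] ?? null;
    }
  }
  // No symref line: the remote may be empty or may not advertise HEAD.
  return null;
}

// Placeholder output for a repo whose default branch is master.
const sampleOutput =
  "ref: refs/heads/master\tHEAD\n" +
  "0123456789abcdef0123456789abcdef01234567\tHEAD\n";

const detected = parseDefaultBranch(sampleOutput);
console.log(detected);
```

Returning null instead of falling back to 'main' lets the caller decide whether to prompt the user or reject the rig, rather than silently repeating the mismatch described above.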