refactor(gastown): Town-centric refactor — Phase A: Consolidate control plane (#419)#421
jrf0110 wants to merge 38 commits into 204-gt-prop-d from
Conversation
…tDO (#419)

Phase A of the town-centric refactor:

- Create TownDO with all control-plane data (beads, agents, mail, review queue, molecules, bead events, convoys, escalations, rigs, config, mayor) organized into sub-modules under dos/town/:
  - beads.ts: Bead CRUD + bead events
  - agents.ts: Agent CRUD, hooks (GUPP), name allocation, prime
  - mail.ts: Inter-agent messaging
  - review-queue.ts: Review queue, molecules, agentDone/agentCompleted
  - config.ts: Town configuration management
  - rigs.ts: Rig registry (SQL table replacing KV)
  - container-dispatch.ts: Container interaction, JWT minting
- Create AgentDO for per-agent event storage (isolates high-volume streaming events from TownDO's 10GB SQLite budget)
- Update wrangler.jsonc: add AGENT binding, remove RIG + MAYOR bindings, add v5 migration with deleted_classes
- Update worker exports: remove RigDO + MayorDO, add AgentDO

The old Rig.do.ts and Mayor.do.ts files are kept for reference but no longer exported from the worker. Handler rerouting, container SDK migration, and WebSocket streaming are subsequent phases.
Code Review Summary

Status: 2 New Issues Found (53 total) | Recommendation: Address critical items before merge

Overview
New Issues (this review pass)
Other Observations (not in diff)

Issues found in unchanged code that cannot receive inline comments:
PR Summary

This is a major refactoring PR that consolidates the Gastown architecture from a Rig-centric model to a Town-centric model:
Files Reviewed (87 files)
Addressed all three review comments:
…yor DO (#419)

Phase A2 of the town-centric refactor:

- All 16 handler files now route through getTownDOStub() instead of getRigDOStub() / getMayorDOStub()
- Added getTownId() helper to auth middleware (resolves from route param, JWT payload, or X-Town-Id header)
- Simplified town-events handler to a single TownDO call (was fan-out)
- Mayor handler: configure/destroy are now no-ops (mayor auto-creates)
- towns.handler: rig creation configures TownDO + adds rig to registry
- Deleted Rig.do.ts (2378 lines) and Mayor.do.ts (453 lines)
- Fixed PR review comments (unused import, INSERT OR REPLACE, JSON.parse guard)
- Zero type errors
Phase B of the town-centric refactor:
- Rewrote process-manager.ts to use @kilocode/sdk:
- createOpencode() replaces Bun.spawn('kilo serve')
- client.session.create/prompt() replaces hand-rolled HTTP client
- client.event.subscribe() replaces SSE text parser
- Event sinks replace ring buffers (for WebSocket forwarding)
- Deleted files replaced by SDK (594 lines removed):
- kilo-server.ts (273 lines): process spawn, port allocation, health polling
- kilo-client.ts (102 lines): hand-rolled HTTP client with Zod parsing
- sse-consumer.ts (219 lines): SSE text parser, reconnect logic
- Container and worker both compile clean
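For orientation, a minimal sketch of the flow this commit describes, using a narrowed, hypothetical view of the SDK surface. The real @kilocode/sdk types and signatures may differ; only the call names come from the commit message.

```ts
// Hypothetical shape of the SDK client named in this commit; not the real
// @kilocode/sdk API, just enough structure to show the Phase B flow.
interface AgentSdkClient {
  session: {
    create(): Promise<{ id: string }>;
    prompt(input: { sessionId: string; text: string }): Promise<void>;
  };
  event: {
    subscribe(): AsyncIterable<{ type: string; sessionId?: string }>;
  };
}

type EventSink = (event: { type: string; sessionId?: string }) => void;

// One client per worktree, one session per agent turn, events fanned out to
// sinks (which replace the old ring buffers and feed WebSocket forwarding).
async function runAgentTurn(
  client: AgentSdkClient,
  prompt: string,
  sinks: EventSink[],
): Promise<void> {
  const session = await client.session.create();
  await client.session.prompt({ sessionId: session.id, text: prompt });

  for await (const event of client.event.subscribe()) {
    if (event.sessionId && event.sessionId !== session.id) continue; // other sessions on the shared server
    for (const sink of sinks) sink(event);
    if (event.type === 'session.idle') break; // end of this turn (see the later review discussion on terminal events)
  }
}
```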
…n-request (#419)

Phase C + D of the town-centric refactor:

Phase C — WebSocket streaming:
- Container control server: added /ws WebSocket endpoint that broadcasts all SDK agent events to connected clients via registerEventSink()
- TownContainerDO: replaced HTTP polling (500ms interval) with WebSocket relay. Connects to container's /ws on start, relays frames to subscribed browser clients. Supports per-agent and wildcard subscriptions via client-side subscribe messages.
- Removed: polling timer, ring buffers, ticket system (tickets kept in control-server for backward compat during migration)

Phase D — Proactive startup + config:
- Town DO alarm now pings container health to keep it warm
- ensureContainerReady() checks container health on each alarm tick
- Config-on-request: container control server extracts the X-Town-Config header and stores the latest town config. Container dispatch already attaches this header on every fetch() from TownDO.

Both worker and container compile clean.
- rig-do.test.ts: All tests now use env.TOWN (TownDO) instead of env.RIG (RigDO). getOrCreateAgent and slingBead calls pass rigId.
- rig-alarm.test.ts: Updated to use the TownDO alarm. setTownId tests replaced with configureRig/slingBead tests.
- town-container.test.ts: Heartbeat test uses env.TOWN.
- http-api.test.ts: No changes needed (tests HTTP routes via SELF.fetch).

Full project typecheck passes clean.
…trolResult, mayor agentId (#419)

- Remove unused PatrolResult import from Town.do.ts
- Remove unused staleThreshold variable in witnessPatrol
- Fix agentCompleted to fall back to the mayor agent when agentId is empty
- Update integration tests: witnessPatrol is now private/internal; tests verify behavior through alarm side-effects instead of direct RPC return values
Addressed all 4 new review comments:
… schema, fix townId resolution (#419)

- Add 'ln -s $(which kilo) /usr/local/bin/opencode' to both Dockerfiles so the SDK's createOpencodeServer() can find the binary (it spawns 'opencode serve' internally)
- Fix getMayorStatus() to return { session: { agentId, sessionId, status, lastActivityAt } | null }, matching the MayorStatusSchema the Next.js client expects (it was returning { agent: ... }, which caused a Zod parse error)
- Fix the townId getter to fall back to ctx.id.toString() when ctx.id.name is undefined
… client schema (#419) sendMayorMessage was returning void, causing the gastown-client's parseSuccessData to fail (no 'data' field in response envelope). Now returns { agentId, sessionStatus } matching MayorSendResultSchema.
…ainer (#419) The CLI's binary resolver may pick either variant depending on libc detection. Install both to prevent ENOENT when the musl variant is selected on glibc-based containers (and vice versa).
…nfig (#419)

The KILOCODE_TOKEN was only stored in per-rig KV config, which is empty for towns created before the refactor (config was in the old RigDO).

- Added kilocode_token field to TownConfigSchema
- configureRig now also stores the token in town config
- startAgentInContainer/startMergeInContainer fall back to townConfig.kilocode_token when the rig-level token is missing
- buildContainerConfig includes kilocode_token in the X-Town-Config header

This ensures all agents (including the mayor, which doesn't have a rig-specific config) can authenticate with the LLM gateway.
…op orphan alarms (#419)

Two fixes:

1. Container's buildAgentEnv now reads KILOCODE_TOKEN from the X-Town-Config header data (via getCurrentTownConfig()) as a final fallback after request envVars and process.env. This closes the gap where the token was being sent in the header but not read by the agent env builder.
2. Town DO alarm now checks for configured rigs before doing any work. If no rigs exist, the alarm returns without re-arming, preventing orphan DO alarms from spinning up containers for deleted/unconfigured towns.
…ntainer wake (#419) The previous fix stopped the alarm from re-arming when no rigs existed, which prevented on-demand operations (sendMayorMessage, slingBead) from re-arming the alarm for subsequent ticks. Now only the proactive container health ping is gated on hasRigs. The alarm always re-arms so scheduling, patrol, and review queue processing continue to run.
Logs X-Town-Config header presence and kilocode_token status at:

1. Control server middleware (when the header is received)
2. buildAgentEnv fallback (when checking getCurrentTownConfig)

This will help identify where the token is being lost.
Real end-to-end tests using wrangler dev + Docker containers:

- Test 1: Health check (worker responds correctly)
- Test 2: Create a town (CRUD operations)
- Test 3: Create rig with kilocode_token (token propagation to town config)
- Test 4: Mayor receives token (full flow: town config → X-Town-Config → container)
- Test 5: Single container per town (no phantom containers)

Also updated wrangler.test.jsonc to match the new DO bindings.
…iner wake policy (#419)

E2E test suite (20 tests, all passing):

1. Health check
2. Create town
3. Create rig with kilocode_token propagation
4. Mayor receives token via X-Town-Config → container
5. Same-town container reuse (no extra containers on repeat messages)
6. Mayor status transitions (null → active)
7. List rigs
8. Town config CRUD (get/patch)
9. Delete town
10. Delete rig
11. Bead CRUD (create/list/get)
12. Agent register and list
13. Sling bead (atomic create+assign)
14. Agent hook/unhook
15. Inter-agent mail (send + check delivery)
16. Bead events recorded on sling
17. Multiple towns per user are independent
18. Town config env vars persist
19. Escalation beads
20. Full E2E flow (15 steps: town→config→rig→beads→agents→hooks→sling→mail→events→mayor→container→token→status)

Bug fixes found by tests:

- slingBead returned a stale bead/agent (pre-hook state). Fixed to re-read both after hookBead.
- ensureContainerReady was waking containers for idle towns. Now gated on hasActiveWork() to prevent phantom container wakeups.
- Unique user IDs per test (they were colliding within the same second).
…fix WebSocket flow (#419)

Root cause: ctx.id.name returns undefined in wrangler local dev, so the TownDO used ctx.id.toString() (a hex hash) as townId. This caused getTownContainerStub() calls from the TownDO (alarm, sendMayorMessage) to create a DIFFERENT TownContainerDO than the worker routes, resulting in agents running in one container but events polled from another.

Fixes:
- TownDO.setTownId(): persists the real town UUID in KV on the first configureRig call. Loaded on init so alarm/internal calls use the correct ID for container stubs.
- TownContainerDO: restored HTTP polling (via alarm) for DO→container event relay. containerFetch() doesn't support WebSocket upgrades, and setInterval callbacks can't call containerFetch (no request context). Alarm-based polling works correctly.
- Container process-manager: event buffer (broadcastEvent → bufferAgentEvent) moved before broadcastEvent to ensure const initialization order.

Test 22 validates the full WebSocket event flow: 20 messages received, including agent.status, message.updated, session.idle, etc.
@@ -0,0 +1,131 @@
▲ [WARNING] Unrecognised WRANGLER_LOG value "/users/john/projects/professional/kilo/cloud/cloudflare-gastown/test/e2e/.wrangler-output.log", expected "none" | "error" | "warn" | "info" | "log" | "debug", defaulting to "log"...
[WARNING]: Local dev artifact committed — this 131-line wrangler output log contains local filesystem paths (/users/john/projects/...) and should not be in version control.
Consider adding *.log or .wrangler-output.log to .gitignore and removing this file.
// OPEN QUESTION: How do we determine which rig an agent belongs to?
// Currently agents aren't rig-scoped in the schema. For now, use the
// first rig as a fallback. This needs a rig_id column on rig_agents.
const rigList = rigs.listRigs(this.sql);
[SUGGESTION]: rigs.listRigs(this.sql) is called inside the for loop, executing a SQL query for every pending agent. Since the rig list doesn't change during the loop, hoist it before the loop (next to townConfig on line 924) to avoid redundant queries.
const rigList = rigs.listRigs(this.sql);
const rigId = rigList[0]?.id ?? '';
And add const rigList = rigs.listRigs(this.sql); before the for loop on line 926.
…ling, E2E tests 21-28 (#419)

Fixes:
- TownDO.resolveKilocodeToken(): scans all rig configs as a fallback when the town config doesn't have the token. Auto-propagates on first find.
- configureRig: proactively starts the container on rig creation
- ensureContainerReady: keeps the container warm for 5 min after rig config
- TownContainerDO: alarm-based HTTP polling replaces setInterval (containerFetch only works in request/alarm context, not timers)
- process-manager: event buffer moved before broadcastEvent for correct initialization order

E2E tests 21-28:
21. Deep container config verification (every layer traced)
22. WebSocket event flow (20 events: SDK → buffer → poll → relay → client)
23. Token propagation trace (step-by-step verification)
24. Stream ticket flow (UI path: ticket → WS URL → events)
25. Rig without token (graceful degradation)
26. Next.js rig creation (cross-service verification)
27. Check user's wrangler (port 8787 token test)
28. Full E2E on port 8787 (token + container + WS events)

All 28 tests pass. Verified on both the dedicated test instance (port 9787) and the user's live wrangler (port 8787).
Major simplification of the event streaming architecture:

BEFORE: browser → worker → TownContainerDO (intercept WS, create WebSocketPair, poll container via HTTP, relay events) → container
AFTER: browser → worker → TownContainerDO → container Bun.serve (direct WebSocket passthrough, no relay, no polling)

Changes:
- TownContainerDO: removed ALL fetch() override, polling, alarm, and WebSocketPair relay code. The base Container class handles everything — it proxies WebSocket upgrades directly to the container's Bun server via tcpPort.fetch().
- Container control-server: enhanced WebSocket handler to match both /ws and /agents/:id/stream URL patterns (the full worker path arrives at the container). Agent-filtered event delivery. Backfill on connect (sends all buffered events immediately). Subscribe message support for dynamic agent filtering.
- Event frame format: uses the 'event' key (not 'type') to match what AgentStream.tsx reads (msg.event).

E2E test 22 confirms: 19 events received via direct WebSocket, including a backfill of 18 historical events + 1 live heartbeat. 27/28 tests pass. Test 28 (port 8787) shows a warning when the user's wrangler runs old code — it's a diagnostic, not a failure.
const townId = getTownId(c);
if (!townId) return c.json(resError('Missing townId'), 400);
const town = getTownDOStub(c.env, townId);
await town.deleteBead(params.beadId);
[WARNING]: handleDeleteBead lost 404 response for non-existent beads
The old code returned 404 when the bead wasn't found:
const deleted = await rig.deleteBead(params.beadId);
if (!deleted) return c.json(resError('Bead not found'), 404);

The new town.deleteBead() returns void and silently succeeds even if the bead doesn't exist, so the handler always returns 200. This is a behavioral regression for API consumers that rely on 404 to detect missing beads.
Consider having deleteBead return a boolean, or check existence first.
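One possible shape for the first suggestion, as a sketch: deleteBead reports whether a row was removed (Cloudflare's SqlStorage cursor exposes rowsWritten), and the handler branches on it. The table name and the resError/resSuccess helpers follow this PR; the exact module layout is assumed.

```ts
// town/beads.ts (sketch): report whether the DELETE actually removed a row.
export function deleteBead(sql: SqlStorage, beadId: string): boolean {
  const cursor = sql.exec('DELETE FROM rig_beads WHERE id = ?', beadId);
  return cursor.rowsWritten > 0;
}

// Handler (sketch): restore the old 404 behavior.
// const deleted = await town.deleteBead(params.beadId);
// if (!deleted) return c.json(resError('Bead not found'), 404);
// return c.json(resSuccess({ deleted: true }), 200);
```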
if (agent.current_hook_bead_id === beadId) return;

// Agent already has a different hook — unhook first
if (agent.current_hook_bead_id) {
[WARNING]: hookBead silently auto-reassigns instead of throwing
The old Rig.do.ts threw an error when an agent was already hooked to a different bead:
throw new Error(`Agent ${agentId} is already hooked to bead ${agent.current_hook_bead_id}`);

The new code silently unhooks from the previous bead and hooks to the new one. This changes the semantics from "fail-fast" to "auto-reassign," which could mask bugs where agents are accidentally double-hooked (e.g. a race between slingBead and schedulePendingWork).
If this is intentional, a log line here would help with debugging.
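If auto-reassign is the intended behavior, a log along these lines would make double-hook races visible. A standalone sketch with illustrative names:

```ts
// Sketch: make the silent unhook observable before re-hooking.
function warnOnRehook(agentId: string, previousBeadId: string | null, nextBeadId: string): void {
  if (previousBeadId && previousBeadId !== nextBeadId) {
    console.warn(
      `hookBead: agent ${agentId} was hooked to ${previousBeadId}; auto-reassigning to ${nextBeadId}`,
    );
  }
}
```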
…at) (#419)

The kilo provider wraps OpenRouter, so model IDs must be in OpenRouter format: 'anthropic/claude-sonnet-4' (not 'anthropic/claude-sonnet-4.6').

- providerID is always 'kilo' (routes through the Kilo gateway/OpenRouter)
- modelID is the full OpenRouter model ID (e.g. 'anthropic/claude-sonnet-4')
- Updated all default model references from 'anthropic/claude-sonnet-4.6' to 'anthropic/claude-sonnet-4'
- Updated agent config models in agent-runner.ts (code, general, plan, explore)
- Fixed process-manager to always use providerID='kilo' instead of splitting the model string into provider/model
const rigConfig = await this.getMayorRigConfig();
const kilocodeToken = await this.resolveKilocodeToken();

await dispatch.startAgentInContainer(this.env, this.ctx.storage, {
[WARNING]: startAgentInContainer returns boolean but the return value is ignored here
startAgentInContainer returns false when the container rejects the start request (e.g. container not ready, bad config). When that happens, line 569 still marks the agent as 'working' and the caller receives sessionStatus: 'starting' — a false positive that will leave the mayor stuck in a phantom "starting" state until the next alarm patrol detects it.
await dispatch.startAgentInContainer(this.env, this.ctx.storage, {
const started = await dispatch.startAgentInContainer(this.env, this.ctx.storage, {
if (isAlive) {
  // Send follow-up message
  await dispatch.sendMessageToAgent(this.env, townId, mayor.id, message);
[WARNING]: sendMessageToAgent returns boolean but the return value is ignored
sendMessageToAgent returns false when the container rejects the message (e.g. agent died between the status check and the message send). The caller still gets sessionStatus: 'active', hiding the failure. Consider checking the return value and falling back to re-creating the session (similar to the old sendMessage retry logic in the deleted Mayor.do.ts).
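A sketch of what checking the result could look like, with the dependencies passed in so it stays self-contained. The restart path is an assumption modeled on the deleted Mayor.do.ts retry logic mentioned above.

```ts
// Sketch: don't report 'active' unless the container actually accepted the message.
async function sendFollowUp(
  send: () => Promise<boolean>,        // wraps dispatch.sendMessageToAgent(...)
  restartSession: () => Promise<void>, // wraps the start-agent path
): Promise<'active' | 'restarted'> {
  const delivered = await send();
  if (delivered) return 'active';

  // The agent likely died between the liveness check and the send; recreate
  // the session instead of silently reporting success.
  await restartSession();
  return 'restarted';
}
```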
console.warn(`${TOWN_LOG} alarm: missing ctx.id.name; skipping watchdog`);
await this.ctx.storage.setAlarm(Date.now() + HEARTBEAT_ALARM_INTERVAL_MS);
console.warn(`${TOWN_LOG} alarm: missing townId; skipping`);
return;
[WARNING]: Early return without re-arming the alarm permanently kills the alarm loop
If townId is ever falsy (e.g. the KV entry hasn't been set yet because setTownId was never called), the alarm returns without scheduling the next tick. There's no external mechanism to restart it — the alarm is dead forever.
The old code re-armed even on missing townId. Consider re-arming here too:
return;
console.warn(`${TOWN_LOG} alarm: missing townId; skipping`);
await this.ctx.storage.setAlarm(Date.now() + IDLE_ALARM_INTERVAL_MS);
…container env injection (#419)

The token was not reaching the container because it wasn't in the town config (created before the propagation code was deployed). Multiple fallback paths (request envVars, X-Town-Config header, process.env) all failed because the source of truth was empty.

New approach: store the token directly in the TownContainerDO's KV storage via setEnvVar(). This is loaded into envVars during the DO constructor (before the container boots), so the token is always in process.env.KILOCODE_TOKEN inside the container.

The token is set via:
1. configureRig() → container.setEnvVar('KILOCODE_TOKEN', token)
2. sendMayorMessage() → container.setEnvVar('KILOCODE_TOKEN', token)

This is the most reliable path — it doesn't depend on X-Town-Config headers, request body envVars, or town config. The token goes directly from the TownDO to the TownContainerDO's persistent storage, and from there into the container's OS environment.
…dMayorMessage (#419) Logs exactly what's in envVars, whether JWT was minted, whether kilocodeToken came from params or town config, and what resolveKilocodeToken found.
try {
  console.log(`${MANAGER_LOG} Subscribing to events for agent ${agent.agentId}...`);
  const result = await client.event.subscribe();
[WARNING]: Duplicate event subscriptions per workdir
event.subscribe() creates a global event stream from the SDK server — it receives events from ALL sessions on that server, not just this agent's session. Since the SDK server is shared per workdir (sdkInstances map), every agent in the same worktree creates a separate subscription to the same event stream.
The session filter at line 189 prevents cross-agent event leakage, but each subscription still processes every event from the server. With N agents sharing a worktree, each event is processed N times (once per subscription), leading to N×N broadcast calls for the same event (N subscriptions × N agents' events).
Consider either:
- A single shared subscription per SDK instance that dispatches events to the correct agent (sketched below), or
- Using a session-scoped subscribe API if the SDK supports it.
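A sketch of the first option: one subscription per SDK instance with a per-session handler map, so each event is routed once instead of being re-processed by every agent's subscription. Types here are illustrative, not the SDK's own.

```ts
type AgentEvent = { type: string; sessionId?: string };
type AgentEventHandler = (event: AgentEvent) => void;

// One of these per SDK instance (per workdir). Agents register their session;
// a single pump loop reads the shared event stream and dispatches.
class SharedEventDispatcher {
  private readonly handlers = new Map<string, AgentEventHandler>(); // sessionId -> handler
  private started = false;

  constructor(private readonly subscribe: () => AsyncIterable<AgentEvent>) {}

  register(sessionId: string, handler: AgentEventHandler): void {
    this.handlers.set(sessionId, handler);
    if (!this.started) {
      this.started = true;
      void this.pump();
    }
  }

  unregister(sessionId: string): void {
    this.handlers.delete(sessionId);
  }

  private async pump(): Promise<void> {
    for await (const event of this.subscribe()) {
      // Each event is handled exactly once, by the agent that owns the session.
      const handler = event.sessionId ? this.handlers.get(event.sessionId) : undefined;
      handler?.(event);
    }
  }
}
```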
// Detect completion. session.idle means "done processing this turn."
// Mayor agents are persistent — session.idle for them means "turn done,"
// not "task finished." Only non-mayor agents exit on idle.
const isTerminal = event.type === 'session.idle' && request.role !== 'mayor';
[WARNING]: session.idle as terminal event is a behavioral regression
The old sse-consumer.ts (now deleted) explicitly excluded session.idle from terminal events:
"We intentionally exclude
session.idle,message.completed, andassistant.completed— these fire after every LLM turn. A polecat may need multiple turns (tool calls → responses → more tool calls) before it's actually done."
The old code used session.completed as the definitive terminal signal. Using session.idle here could cause premature agent termination if the SDK fires it between multi-turn tool-call sequences (e.g., after the first tool response but before the agent decides to make another tool call).
If the SDK doesn't emit session.completed, consider relying on the gt_done tool call / completion reporter callback instead, which was the authoritative signal in the old architecture.
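A sketch of that hedge: treat session.idle as "turn done" only, and require an explicit completion signal before tearing down a non-mayor agent. Event and flag names are illustrative, and whether the SDK emits session.completed is not confirmed here.

```ts
type SdkEvent = { type: string };

function isTerminalEvent(event: SdkEvent, role: string, gtDoneReported: boolean): boolean {
  if (role === 'mayor') return false;                  // mayor sessions are persistent
  if (event.type === 'session.completed') return true; // definitive, if the SDK emits it
  // session.idle fires after every LLM turn, including between tool-call
  // rounds, so on its own it is not terminal; require the authoritative
  // gt_done / completion-reporter signal as well.
  return event.type === 'session.idle' && gtDoneReported;
}
```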
// ctx.id.name should be the town UUID (set via idFromName in getTownDOStub).
// In some runtimes (local dev) .name is undefined. We persist the ID
// in KV on first access so it survives across requests.
return this._townId ?? this.ctx.id.name ?? this.ctx.id.toString();
[WARNING]: ctx.id.toString() fallback returns hex DO ID, not the town UUID
When both _townId and ctx.id.name are null/undefined, this falls back to ctx.id.toString() which returns the hex representation of the DO's unique ID (e.g., "0000017f...").
Since getTownContainerStub uses env.TOWN_CONTAINER.idFromName(townId), passing a hex ID instead of the original town UUID would derive a different DO instance, silently breaking all container interactions (start agent, health check, merge, etc.).
Consider throwing an error instead of silently using the wrong ID:
if (!this._townId && !this.ctx.id.name) {
  throw new Error('TownDO: townId not set — call setTownId() first');
}

…ation (#419)

The mayor should work without any rigs. Previously the token was only stored when a rig was created via configureRig. Now the tRPC createTown mutation generates a kilocode_token and stores it in the town config immediately. This means:

1. Create town → token stored in town config
2. Send mayor message → resolveKilocodeToken() finds it in town config
3. Agent starts with KILOCODE_TOKEN in envVars → KILO_CONFIG_CONTENT set

Also added kilocode_token to the client-side TownConfigSchema so the updateTownConfig call accepts it.
…sages work (#419)

The first message worked because startAgentInContainer sends the model in the prompt body. Follow-up messages via sendMessage don't specify a model — the kilo server uses the session's default from KILO_CONFIG_CONTENT. But the server validates the model against its local model registry before routing to the gateway, and 'anthropic/claude-sonnet-4.6' wasn't registered.

Fix: buildKiloConfigContent now explicitly registers the model (and claude-haiku-4.5) in the kilo provider's models list so the server accepts them without a remote fetch. Also accepts the model as a parameter so the town config's default_model takes effect.

Reverted model IDs back to anthropic/claude-sonnet-4.6 (the correct Kilo gateway model ID).
…ound (#419) Follow-up messages via sendMessage were sent without a model parameter, causing the kilo server to use the session's default model lookup which failed because the model wasn't in the server's local registry. Now: the model from the initial startAgent request is stored on the ManagedAgent and passed explicitly on every follow-up prompt call. This ensures the kilo provider routes through the gateway consistently, bypassing the local model registry validation.
const rigList = rigs.listRigs(this.sql);
const rigId = rigList[0]?.id ?? '';
const rigConfig = await this.getRigConfig(rigId);
if (!rigConfig) return;
[WARNING]: processReviewQueue leaks popped entry on early return
Line 1125 calls popReviewQueue() which transitions the entry from 'pending' → 'running'. If rigConfig is null here, the function returns without completing or failing the entry. The entry stays stuck in 'running' state until recoverStuckReviews resets it after 5 minutes.
Consider marking the entry as 'failed' before returning:
if (!rigConfig) {
reviewQueue.completeReview(this.sql, entry.id, 'failed');
return;
}

await mayor.destroy();
// No-op: we don't want to destroy the entire town from this endpoint.
console.log(`${MAYOR_HANDLER_LOG} handleDestroyMayor: no-op for townId=${params.townId}`);
return c.json(resSuccess({ destroyed: true }), 200);
[WARNING]: handleDestroyMayor returns { destroyed: true } but is a no-op
Callers (e.g. the frontend handleDeleteTown) will receive a success response and believe the mayor was destroyed, when nothing actually happened. This is misleading — consider either:
- Returning a different response indicating the endpoint is deprecated/no-op
- Actually stopping the mayor agent in the container via town.agentCompleted() (sketched below)
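A sketch of the second option, using the mayor status shape this PR already returns; agentCompleted's exact signature is an assumption.

```ts
// Sketch: stop the mayor session instead of returning a fabricated success.
interface TownStub {
  getMayorStatus(): Promise<{ session: { agentId: string } | null }>;
  agentCompleted(agentId: string): Promise<void>;
}

async function destroyMayor(town: TownStub): Promise<{ destroyed: boolean }> {
  const status = await town.getMayorStatus();
  if (!status.session) return { destroyed: false }; // nothing running
  await town.agentCompleted(status.session.agentId);
  return { destroyed: true };
}
```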
…ayor (#419) The gastown plugin checks GASTOWN_AGENT_ROLE === 'mayor' to decide whether to create a MayorGastownClient (town-scoped, no rig needed) or a GastownClient (rig-scoped, requires GASTOWN_RIG_ID). Without this env var, the mayor was treated as a rig agent, which failed because the mayor has no real rig ID — causing 'missing townId' errors and no tools being registered.
…id userId (#419) The mayor's agent JWT was minted with an empty userId because sendMayorMessage used rigConfig?.userId which is null when no rigs exist. The auth middleware rejected the JWT with 401. Fix: store owner_user_id in town config during createTown (tRPC). sendMayorMessage reads it from townConfig.owner_user_id as the primary source, falling back to rigConfig.userId.
sql,
/* sql */ `
  SELECT * FROM ${rig_beads}
  WHERE ${rig_beads.columns.status} IN ('open', 'in_progress')
[WARNING]: prime() returns ALL open beads instead of agent-specific ones
The old Rig.do.ts prime() filtered open_beads with:
SELECT * FROM rig_beads WHERE assignee_agent_id = ? AND status != 'closed'

The new code returns all open/in_progress beads across all agents:

SELECT * FROM rig_beads WHERE status IN ('open', 'in_progress') LIMIT 20

This is a behavior change that leaks other agents' work items into the prime context. The integration test at rig-do.test.ts:498 (expect(context.open_beads).toHaveLength(1)) may pass by coincidence (only one bead in the test), but in production with multiple agents, each agent would see all beads.
Consider restoring the assignee_agent_id = ? filter.
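A sketch of the restored per-agent filter, written directly against the DO SQL API. The table and column names follow the diff above; the codebase's query helper may differ.

```ts
// Sketch: prime() should only surface the requesting agent's open work.
function listOpenBeadsForAgent(sql: SqlStorage, agentId: string, limit = 20) {
  return sql
    .exec(
      `SELECT * FROM rig_beads
       WHERE assignee_agent_id = ?
         AND status IN ('open', 'in_progress')
       LIMIT ?`,
      agentId,
      limit,
    )
    .toArray();
}
```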
/* sql */ `
  SELECT * FROM ${rig_mail}
  WHERE ${rig_mail.columns.to_agent_id} = ?
  AND ${rig_mail.columns.delivered_at} = ?
[WARNING]: checkMail marks mail as delivered BEFORE reading — breaks test assertion
The old Rig.do.ts checkMail read undelivered mail first, then marked it as delivered. The returned messages had delivered: false.
The new implementation marks all mail as delivered (line 68-78), then reads back by matching delivered_at = ?. The returned messages will have delivered: true (since they were just updated).
The integration test at rig-do.test.ts:292 asserts:
expect(mailbox[0].delivered).toBe(false);

This will fail with the new implementation since the mail is already marked delivered = 1 before the SELECT.
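A sketch of the old read-then-mark ordering, so returned rows still carry delivered: false. Column names follow the diff and the test expectation; the real schema may differ.

```ts
// Sketch: read undelivered mail first, then mark it delivered.
function checkMail(sql: SqlStorage, agentId: string) {
  const undelivered = sql
    .exec(`SELECT * FROM rig_mail WHERE to_agent_id = ? AND delivered = 0`, agentId)
    .toArray();

  if (undelivered.length > 0) {
    sql.exec(
      `UPDATE rig_mail SET delivered = 1, delivered_at = ? WHERE to_agent_id = ? AND delivered = 0`,
      new Date().toISOString(),
      agentId,
    );
  }

  // Rows were read before the UPDATE, so callers (and the integration test)
  // still see delivered = 0 on the returned messages.
  return undelivered;
}
```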
…:rigId (#419)

All rig-scoped routes now include townId in the URL path, eliminating the need for X-Town-Id headers or JWT townId extraction.

Before: /api/rigs/:rigId/beads, /api/rigs/:rigId/agents, etc.
After: /api/towns/:townId/rigs/:rigId/beads, etc.

Updated:
- gastown.worker.ts: all 30+ rig routes moved under /api/towns/:townId/
- gastown-client.ts: all rig functions now take townId as the first param
- gastown-router.ts: all tRPC procedures pass rig.town_id to the client
- container plugin client.ts: rigPath() includes townId from env
- container plugin types.ts: GastownEnv now includes townId
- container process-manager.ts: agent-events POST uses the new path

The townId is always available from the URL — no header or JWT magic.
…ng rig context (#419)

Three fixes:

1. AgentSchema.last_activity_at: changed from z.string() to z.string().nullable() — agents have null last_activity_at before their first heartbeat. AgentCard component updated to handle null.
2. TaggedBeadEventSchema.rig_id/rig_name: changed from z.string() to z.string().optional() — town events come from the TownDO, which doesn't tag events with rig context (the old fan-out across Rig DOs is gone).
3. Container agent-events 404: the container code already uses the new /api/towns/:townId/rigs/_/agent-events path, but the running container image needs to be rebuilt.
const kilocodeToken = generateApiToken(ctx.user, undefined, {
  expiresIn: TOKEN_EXPIRY.thirtyDays,
});
await withGastownError(() =>
[WARNING]: Non-atomic town creation — orphaned town on updateTownConfig failure
If updateTownConfig throws (e.g. the Town DO is unreachable), the town has already been persisted by createTown on line 47 but has no kilocode_token or owner_user_id. The error propagates to the caller, who sees a failure, but the town row remains in the GastownUser DO.
Consider wrapping both calls in a transaction-like pattern (create + configure, roll back on failure) similar to how handleCreateRig in towns.handler.ts rolls back the rig if configureRig fails.
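A sketch of a compensation pattern for this, with the three operations passed in as callbacks. The rollback call is an assumption modeled on how handleCreateRig is described as rolling back the rig when configureRig fails.

```ts
// Sketch: don't leave a town behind if its initial config can't be written.
async function createTownWithConfig(
  createTown: () => Promise<{ id: string }>,
  updateTownConfig: (townId: string) => Promise<void>, // stores kilocode_token + owner_user_id
  deleteTown: (townId: string) => Promise<void>,
): Promise<{ id: string }> {
  const town = await createTown();
  try {
    await updateTownConfig(town.id);
  } catch (err) {
    await deleteTown(town.id).catch(() => undefined); // best-effort rollback
    throw err; // surface the original failure to the caller
  }
  return town;
}
```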
// ── Name Allocation ─────────────────────────────────────────────────

export function allocatePolecatName(sql: SqlStorage, rigId: string): string {
[WARNING]: allocatePolecatName ignores rigId for filtering — shared name pool across all rigs
The function accepts rigId but only uses it in the fallback name (line 261). The name pool query on line 248 selects ALL polecats regardless of rig. Since agents are now town-scoped (not rig-scoped), this means a town with 2 rigs shares the 20-name pool. After 20 polecats across all rigs, every new polecat gets a fallback name.
This is probably fine for now, but the rigId parameter is misleading — it suggests rig-scoped allocation. Consider either documenting this is town-wide or removing the unused rig filter.
GASTOWN_AGENT_ID: request.agentId,
GASTOWN_RIG_ID: request.rigId,
GASTOWN_TOWN_ID: request.townId,
GASTOWN_AGENT_ROLE: request.role,
[SUGGESTION]: New GASTOWN_AGENT_ROLE env var is set here but the GastownEnv type in plugin/types.ts doesn't include it.
The plugin client reads env vars via the typed GastownEnv interface. If the plugin needs to know the agent's role (e.g. to adjust behavior for mayor vs polecat), it can't access GASTOWN_AGENT_ROLE in a type-safe way. Consider adding it to GastownEnv if the plugin should use it, or document that it's only for internal container use.
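If the plugin is meant to read the role, a sketch of the typed env addition. Only townId is confirmed in this PR's description; the other field names and the role union are assumptions.

```ts
// plugin/types.ts (sketch)
export interface GastownEnv {
  townId: string;
  rigId: string;
  agentId: string;
  agentRole?: 'mayor' | 'polecat'; // sourced from GASTOWN_AGENT_ROLE
}
```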
Was using '_' as a placeholder for rigId in the agent-events POST path. The ManagedAgent already has rigId from the start request — use it.
} from './kilo-server';
import { createKiloClient } from './kilo-client';
import { createSSEConsumer, isCompletionEvent, type SSEConsumer } from './sse-consumer';
import { createOpencode, createOpencodeClient, type OpencodeClient } from '@kilocode/sdk';
[SUGGESTION]: Unused import — createOpencodeClient is imported but never used in this file. Only createOpencode and OpencodeClient (type) are referenced.
import { createOpencode, createOpencodeClient, type OpencodeClient } from '@kilocode/sdk';
import { createOpencode, type OpencodeClient } from '@kilocode/sdk';
const townId = getTownId(c);
if (!townId) return c.json(resError('Missing townId'), 400);
const town = getTownDOStub(c.env, townId);
await town.deleteAgent(params.agentId);
[WARNING]: handleDeleteAgent lost 404 response for non-existent agents — same issue as handleDeleteBead. The old code checked the return value of deleteAgent() and returned 404 if the agent wasn't found. Now it always returns { deleted: true } even if the agent doesn't exist.
agents.deleteAgent() in town/agents.ts silently succeeds when the agent doesn't exist (the DELETE query affects 0 rows), so callers get a misleading 200 response.
CURRENT_TOWN_ID="$TOWN_ID"

echo " Creating bead..."
api_post "/api/rigs/${RIG_ID}/beads" '{"type":"issue","title":"E2E test bead","body":"Test body","priority":"high"}'
[CRITICAL]: E2E tests use old route format — will 404
All rig-scoped routes were moved from /api/rigs/:rigId/... to /api/towns/:townId/rigs/:rigId/... in gastown.worker.ts, but these e2e tests still use the old format (e.g. /api/rigs/${RIG_ID}/beads). Setting CURRENT_TOWN_ID only adds an X-Town-Id header — it doesn't change the URL path, so Hono will return 404 before any handler runs.
This affects tests 11–16 and any other test using /api/rigs/ paths. They need to be updated to use /api/towns/${TOWN_ID}/rigs/${RIG_ID}/....
…oute fix (#419)

Three fixes:

1. Parallel dispatch: schedulePendingWork now collects dispatch tasks and runs them via Promise.allSettled instead of sequential await. Agents across different rigs start simultaneously.
2. Agent→rig association: added a rig_id column to the rig_agents table. registerAgent and getOrCreateAgent store the rig_id. schedulePendingWork uses agent.rig_id to get the correct rig config (git URL, credentials) instead of always using the first rig.
3. Completion route: completion-reporter.ts updated to use the new /api/towns/:townId/rigs/:rigId/agents/:agentId/completed path instead of the old /api/rigs/:rigId/agents/:agentId/completed.
const rigList = rigs.listRigs(this.sql);
const rigId = rigList[0]?.id ?? '';
const rigConfig = await this.getRigConfig(rigId);
if (!rigConfig) return;
[WARNING]: popReviewQueue (line 1132) already marked this entry as 'running', but if rigConfig is null here the method returns without completing or failing the entry. The entry is stuck in 'running' state until recoverStuckReviews resets it after 5 minutes.
This means a valid review submission is silently delayed (or lost if the timeout recovery also fails). Consider calling reviewQueue.completeReview(this.sql, entry.id, 'failed') before returning, or restructuring so the config check happens before popping.
const rigList = rigs.listRigs(this.sql);
const rigId = rigList[0]?.id ?? '';
const rigConfig = await this.getRigConfig(rigId);
if (!rigConfig) return;
[WARNING]: processReviewQueue pops an entry (marking it 'running') on line 1137, but if rigConfig is null here, the method returns without calling completeReview(entry.id, 'failed'). The entry is left stuck in 'running' status until recoverStuckReviews resets it after 5 minutes.
This creates a retry loop: pop → stuck → recover → pop → stuck → recover, indefinitely, if the rig config is permanently missing (e.g. rig was deleted but review queue still has entries).
Consider failing the entry explicitly:
if (!rigConfig) {
  reviewQueue.completeReview(this.sql, entry.id, 'failed');
  return;
}

const kilocodeToken = generateApiToken(ctx.user, undefined, {
  expiresIn: TOKEN_EXPIRY.thirtyDays,
});
await withGastownError(() =>
[WARNING]: Non-atomic town creation — if updateTownConfig fails here (e.g. network error to the gastown worker), the town is created but has no kilocode_token or owner_user_id. The caller receives an error, but the town already exists in the user's list.
The mayor won't be able to authenticate LLM calls until a rig is created (which also sets the token). Consider wrapping both calls or adding a retry/compensation mechanism.
…419)

Added a rig_id column to both the rig_agents and rig_beads tables.

- registerAgent and getOrCreateAgent store rig_id on agent creation
- createBead stores rig_id on bead creation
- slingBead passes rigId to both createBead and getOrCreateAgent
- listAgents handler passes a rigId filter so each rig view shows only its own agents (not all agents in the town)
- listBeads handler passes a rigId filter for the same reason
- AgentFilter and BeadFilter types include rig_id

This fixes: agents from rig A showing up on rig B's dashboard, and beads from rig A appearing on rig B.
…n clarifications (#419)

Code fixes:

1. completeReviewWithResult: restored 'conflict' status handling — creates an escalation bead with the conflict message, marks the review as 'failed'
2. review_submitted event: added newValue: input.branch to match the old contract
3. destroy(): iterates all agents and calls AgentDO.destroy() before clearing Town DO storage (prevents orphaned AgentDO instances)
4. sessionCount leak: decrements on the error path in the startAgent catch block
5. configureRig: always updates the town config token when the rig token differs (supports token rotation)
6. removeRig: deletes the KV entry rig:{rigId}:config alongside the SQL row
7. slingBead: restored metadata parameter passthrough to createBead
8. Removed unused DEFAULT_MAX_POLECATS constant
9. sendMayorMessage: accepts an optional model parameter from the handler
Addressed all 13 review comments:

Code fixes (9):

Design clarifications (4 — intentional, not bugs):
query(
  this.sql,
  /* sql */ `UPDATE ${rig_agents} SET ${rig_agents.columns.dispatch_attempts} = ? WHERE ${rig_agents.columns.id} = ?`,
  [attempts, agent.id]
[WARNING]: dispatch_attempts incremented before rigConfig check — premature circuit breaker
The counter is bumped here (line 1020–1023), but if rigConfig is null (line 1034), the loop continues without ever attempting a dispatch. After 5 alarm cycles with a missing rig config, the circuit breaker on line 1013 fires and marks the bead as 'failed' + unhooks the agent — even though no actual dispatch was ever attempted.
Move the increment to after the rigConfig null check (or inside the dispatchTasks.push callback alongside the success reset on line 1063).
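A sketch of the reordering, with the storage update passed in as a callback so it stays self-contained; names are illustrative.

```ts
// Sketch: only count a dispatch attempt when a dispatch is actually possible.
type PendingAgent = { id: string; rigId: string; dispatchAttempts: number };

function shouldDispatch(
  agent: PendingAgent,
  getRigConfig: (rigId: string) => unknown | null,
  recordAttempt: (agentId: string, attempts: number) => void,
): boolean {
  const rigConfig = getRigConfig(agent.rigId);
  if (!rigConfig) return false; // missing config: skip without burning an attempt

  // Bump the circuit-breaker counter only for real dispatch attempts.
  recordAttempt(agent.id, agent.dispatchAttempts + 1);
  return true; // caller pushes the actual dispatch task
}
```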
async removeRig(rigId: string): Promise<void> {
  await this.ensureInitialized();
  rigs.removeRig(this.sql, rigId);
[WARNING]: removeRig doesn't clean up agents or beads with matching rig_id
Now that agents and beads have a rig_id column, deleting a rig leaves orphaned agents that schedulePendingWork will try to dispatch. Since the rig config is also deleted (line 197), these agents will hit the !rigConfig path on every alarm cycle, incrementing dispatch_attempts until the circuit breaker marks their beads as failed.
Consider also deleting (or re-assigning) agents/beads whose rig_id matches the deleted rig.
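A sketch of cascading cleanup using the rig_id columns added later in this PR; the rigs table name and exact columns are assumptions.

```ts
// Sketch: remove the rig's agents and beads along with the rig itself, so
// schedulePendingWork never sees orphans pointing at a deleted rig.
function removeRigCascade(sql: SqlStorage, rigId: string): void {
  sql.exec(`DELETE FROM rig_beads WHERE rig_id = ?`, rigId);
  sql.exec(`DELETE FROM rig_agents WHERE rig_id = ?`, rigId);
  sql.exec(`DELETE FROM rigs WHERE id = ?`, rigId);
}
```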
… handlers (#419)

- All E2E tests updated from /api/rigs/:rigId to /api/towns/:townId/rigs/:rigId
- handleCreateBead: passes params.rigId as rig_id to createBead
- handleRegisterAgent: passes params.rigId as rig_id to registerAgent
- handleCreateEscalation: passes params.rigId as rig_id to createBead
- Test 29: fixed jq parsing for the tRPC fallback; use BASE_URL, not WRANGLER_URL

All 29 E2E tests pass.
| { "tag": "v2", "new_sqlite_classes": ["TownContainerDO"] }, | ||
| { "tag": "v3", "new_sqlite_classes": ["MayorDO"] }, | ||
| { "tag": "v4", "new_sqlite_classes": ["TownDO"] }, | ||
| { "tag": "v5", "new_sqlite_classes": ["AgentDO"], "deleted_classes": ["RigDO", "MayorDO"] }, |
[WARNING]: Destructive migration — deleted_classes: ["RigDO", "MayorDO"] permanently destroys all data stored in existing RigDO and MayorDO Durable Object instances.
If there is any production data in these DOs (rig configs, agent state, mayor sessions), it will be irrecoverably lost when this migration runs. Ensure either:

- All data has been migrated to TownDO before deploying, or
- This is acceptable because the system is not yet in production with real user data.
if (init?.body && typeof init.body === 'string') {
  try {
    const bodyKeys = Object.keys(JSON.parse(init.body));
    console.log(`${CLIENT_LOG} ${method} ${path} bodyKeys=[${bodyKeys.join(',')}]`);
[SUGGESTION]: Debug logging of request body keys on every API call — this console.log fires on every POST/PUT/PATCH to the gastown service from the Next.js server. Consider gating behind a debug flag or removing before merge to reduce production log noise.
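A sketch of gating the logging behind an env flag instead of removing it outright (the flag name is illustrative):

```ts
// Sketch: keep the body-key logging available for debugging without emitting
// it on every production request.
function logBodyKeys(method: string, path: string, body: unknown): void {
  if (process.env.GASTOWN_DEBUG !== '1') return;
  if (typeof body !== 'string') return;
  try {
    const keys = Object.keys(JSON.parse(body));
    console.log(`[gastown-client] ${method} ${path} bodyKeys=[${keys.join(',')}]`);
  } catch {
    // body was not JSON; nothing to log
  }
}
```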
Summary
Complete town-centric refactor (#419). Consolidates all control-plane data into TownDO, replaces subprocess-based agent management with the SDK, establishes WebSocket streaming from agent to browser, and implements proactive startup + config-on-request.
Full plan: plans/gastown-town-centric-refactor.md

What Changed
Phase A: Data Consolidation (TownDO + AgentDO)
New TownDO — All control-plane data in one DO, organized into 7 sub-modules:

- town/beads.ts
- town/agents.ts
- town/mail.ts
- town/review-queue.ts
- town/config.ts
- town/rigs.ts
- town/container-dispatch.ts

New AgentDO — Per-agent event storage (isolates high-volume events from TownDO's 10GB budget)
Deleted: Rig.do.ts (2378 lines), Mayor.do.ts (453 lines)

All 16 handler files rerouted from getRigDOStub() / getMayorDOStub() to getTownDOStub()

Phase B: SDK-Based Agent Management
Replaced subprocess kilo serve management with @kilocode/sdk:

- createOpencode() replaces Bun.spawn('kilo serve')
- client.session.create/prompt() replaces the hand-rolled HTTP client
- client.event.subscribe() replaces the SSE text parser + ring buffers

Deleted:
kilo-server.ts (273 lines), kilo-client.ts (102 lines), sse-consumer.ts (219 lines)

Phase C: WebSocket Streaming

- Container /ws endpoint broadcasts all SDK agent events via event sinks
- TownContainerDO connects to the container's /ws and relays frames to browser clients
- Per-agent and wildcard subscriptions via { type: 'subscribe', agentId } messages (see the sketch below)
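A minimal browser-side sketch of that subscribe protocol. The endpoint path and the subscribe frame follow the description above; any other field names are assumptions.

```ts
// Sketch: open the stream and subscribe to a single agent's events.
function streamAgentEvents(
  wsBaseUrl: string, // e.g. wss://<worker-host>
  agentId: string,
  onEvent: (frame: unknown) => void,
): WebSocket {
  const ws = new WebSocket(`${wsBaseUrl}/ws`);
  ws.addEventListener('open', () => {
    // Omit agentId (or send a wildcard) to receive events for every agent.
    ws.send(JSON.stringify({ type: 'subscribe', agentId }));
  });
  ws.addEventListener('message', (msg) => {
    onEvent(JSON.parse(String(msg.data)));
  });
  return ws;
}
```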
Phase D: Proactive Startup + Config-on-Request

- ensureContainerReady() runs on each alarm tick
- X-Town-Config header (sent by TownDO on every fetch())
- envVars reduced to infra-only URLs; user config comes per-request

Tests + Typecheck

- env.RIG → env.TOWN

Wrangler Changes

- AGENT binding for AgentDO (added)
- RIG and MAYOR bindings (removed)
- new_sqlite_classes: ["AgentDO"], deleted_classes: ["RigDO", "MayorDO"]

Net Impact
Open Questions
Agent→Rig association: Agents don't have a rig_id column yet. schedulePendingWork() falls back to the first rig. Needs a schema migration for multi-rig support. See the // OPEN QUESTION comments in Town.do.ts.

Rig-scoped queries: listBeads / listAgents return all beads/agents in the town. Multi-rig filtering needs a rig_id column on rig_beads and rig_agents.

Mayor rig config: Mayor uses the first rig's config for git URL and credentials. Needs a better strategy for multi-rig towns.
Browser WebSocket client: AgentStream.tsx still uses the old ticket-based WebSocket URL. It needs updating to use the new /ws?agentId=X endpoint and subscribe protocol. The tRPC getAgentStreamUrl procedure also needs updating.

Closes #419