
refactor(gastown): Town-centric refactor — Phase A: Consolidate control plane (#419) #421

Open

jrf0110 wants to merge 38 commits into 204-gt-prop-d from 419-refactor

Conversation


jrf0110 commented Feb 20, 2026

Summary

Complete town-centric refactor (#419). Consolidates all control-plane data into TownDO, replaces subprocess-based agent management with the SDK, establishes WebSocket streaming from agent to browser, and implements proactive startup + config-on-request.

Full plan: plans/gastown-town-centric-refactor.md

What Changed

Phase A: Data Consolidation (TownDO + AgentDO)

New TownDO — All control-plane data in one DO, organized into 7 sub-modules:

| Module | Lines | Responsibility |
| --- | --- | --- |
| town/beads.ts | 228 | Bead CRUD, bead events |
| town/agents.ts | 280 | Agent CRUD, GUPP hooks, name allocation, prime/checkpoint |
| town/mail.ts | 90 | Inter-agent messaging |
| town/review-queue.ts | 300 | Review queue, molecules, agentDone/agentCompleted |
| town/config.ts | 82 | Config management, model resolution, container config builder |
| town/rigs.ts | 78 | Rig registry (SQL table, was KV) |
| town/container-dispatch.ts | 280 | Container interaction, JWT minting, system prompts |

New AgentDO — Per-agent event storage (isolates high-volume events from TownDO's 10GB budget)

Deleted: Rig.do.ts (2378 lines), Mayor.do.ts (453 lines)

All 16 handler files rerouted from getRigDOStub()/getMayorDOStub() to getTownDOStub()

Phase B: SDK-Based Agent Management

Replaced subprocess kilo serve management with @kilocode/sdk:

  • createOpencode() replaces Bun.spawn('kilo serve')
  • client.session.create/prompt() replaces hand-rolled HTTP client
  • client.event.subscribe() replaces SSE text parser + ring buffers

Deleted: kilo-server.ts (273 lines), kilo-client.ts (102 lines), sse-consumer.ts (219 lines)
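
Roughly, the new process-manager flow looks like the sketch below. The createOpencode / session.create / session.prompt / event.subscribe names are the SDK calls listed above; the argument shapes and the runAgent wrapper are assumptions, not code from this PR.

import { createOpencode } from '@kilocode/sdk';

async function runAgent(workdir: string, prompt: string) {
  // One SDK server per worktree replaces Bun.spawn('kilo serve') + port allocation.
  const { client } = await createOpencode({ cwd: workdir });

  // Session lifecycle replaces the hand-rolled HTTP client.
  const session = await client.session.create();
  await client.session.prompt({ sessionId: session.id, text: prompt });

  // Event subscription replaces the SSE text parser and ring buffers; events
  // get forwarded to registered sinks (the WebSocket broadcast in Phase C).
  const events = await client.event.subscribe();
  return { client, session, events };
}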

Phase C: WebSocket Streaming

  • Container /ws endpoint broadcasts all SDK agent events via event sinks
  • TownContainerDO connects to container's /ws and relays frames to browser clients
  • Supports per-agent subscriptions via client-side { type: 'subscribe', agentId } messages
  • Replaced: HTTP polling (500ms), ring buffers, ticket system
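
A sketch of the subscribe protocol on the container side. The { type: 'subscribe', agentId } message shape and the 'event' frame key are from this PR; the Bun ServerWebSocket bookkeeping below is illustrative, not the actual control-server code.

import type { ServerWebSocket } from 'bun';

type SubState = { agentId?: string }; // undefined = wildcard (all agents)
const subscribers = new Map<ServerWebSocket<SubState>, SubState>();

function handleClientMessage(ws: ServerWebSocket<SubState>, raw: string) {
  const msg = JSON.parse(raw) as { type?: string; agentId?: string };
  // A client can narrow its subscription to a single agent at any time.
  if (msg.type === 'subscribe') subscribers.set(ws, { agentId: msg.agentId });
}

function broadcastAgentEvent(agentId: string, event: unknown) {
  const frame = JSON.stringify({ agentId, event });
  for (const [ws, sub] of subscribers) {
    if (!sub.agentId || sub.agentId === agentId) ws.send(frame);
  }
}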

Phase D: Proactive Startup + Config-on-Request

  • Town DO alarm pings container health to keep it warm
  • ensureContainerReady() runs on each alarm tick
  • Control server extracts X-Town-Config header (sent by TownDO on every fetch())
  • Container envVars reduced to infra-only URLs; user config comes per-request
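
The config-on-request path, roughly: the header name and getCurrentTownConfig() are from this PR, while the Hono-style middleware below is only an illustration of where the extraction happens in the control server.

import { Hono } from 'hono';

const app = new Hono();
let currentTownConfig: Record<string, unknown> = {};

export function getCurrentTownConfig(): Record<string, unknown> {
  return currentTownConfig;
}

// Every request from TownDO carries the latest town config; the container
// keeps the newest copy and falls back to it when building agent env vars.
app.use('*', async (c, next) => {
  const raw = c.req.header('X-Town-Config');
  if (raw) {
    try {
      currentTownConfig = JSON.parse(raw);
    } catch {
      // keep the previous config if the header is malformed
    }
  }
  await next();
});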

Tests + Typecheck

  • All 4 integration test files updated: env.RIG → env.TOWN
  • Full project typecheck passes clean (worker + container + Next.js app)

Wrangler Changes

  • Added AGENT binding for AgentDO
  • Removed RIG and MAYOR bindings
  • Migration v5: new_sqlite_classes: ["AgentDO"], deleted_classes: ["RigDO", "MayorDO"]

Net Impact

  • Deleted: 4,244 lines across 5 files (Rig.do.ts, Mayor.do.ts, kilo-server.ts, kilo-client.ts, sse-consumer.ts)
  • Added: ~2,200 lines across 10 new files (Town DO sub-modules, AgentDO, updated process-manager)
  • Modified: 20+ handler, worker, container, and test files
  • DO count: 6 → 4 (TownDO, AgentDO, TownContainerDO, GastownUserDO + AgentIdentityDO stub)

Open Questions

  1. Agent→Rig association: Agents don't have a rig_id column yet. schedulePendingWork() falls back to the first rig. Needs a schema migration for multi-rig support. See // OPEN QUESTION comments in Town.do.ts.

  2. Rig-scoped queries: listBeads/listAgents return all beads/agents in the town. Multi-rig filtering needs a rig_id column on rig_beads and rig_agents.

  3. Mayor rig config: Mayor uses the first rig's config for git URL and credentials. Needs a better strategy for multi-rig towns.

  4. Browser WebSocket client: AgentStream.tsx still uses the old ticket-based WebSocket URL. It needs updating to use the new /ws?agentId=X endpoint and subscribe protocol. The tRPC getAgentStreamUrl procedure also needs updating.
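
For open questions 1 and 2, the eventual schema change could look roughly like this hypothetical sketch (SqlStorage is Cloudflare's DO SQL API; the actual migration code is not part of Phase A):

// Hypothetical migration helpers for TownDO.
function migrateToRigScopedTables(sql: SqlStorage): void {
  // In practice this would be guarded by a schema-version check.
  sql.exec(`ALTER TABLE rig_agents ADD COLUMN rig_id TEXT`);
  sql.exec(`ALTER TABLE rig_beads ADD COLUMN rig_id TEXT`);
}

function listBeadsForRig(sql: SqlStorage, rigId: string) {
  return sql.exec(`SELECT * FROM rig_beads WHERE rig_id = ?`, rigId).toArray();
}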

Closes #419

…tDO (#419)

Phase A of the town-centric refactor:

- Create TownDO with all control-plane data (beads, agents, mail,
  review queue, molecules, bead events, convoys, escalations, rigs,
  config, mayor) organized into sub-modules under dos/town/:
  - beads.ts: Bead CRUD + bead events
  - agents.ts: Agent CRUD, hooks (GUPP), name allocation, prime
  - mail.ts: Inter-agent messaging
  - review-queue.ts: Review queue, molecules, agentDone/agentCompleted
  - config.ts: Town configuration management
  - rigs.ts: Rig registry (SQL table replacing KV)
  - container-dispatch.ts: Container interaction, JWT minting

- Create AgentDO for per-agent event storage (isolates high-volume
  streaming events from TownDO's 10GB SQLite budget)

- Update wrangler.jsonc: add AGENT binding, remove RIG + MAYOR bindings,
  add v5 migration with deleted_classes

- Update worker exports: remove RigDO + MayorDO, add AgentDO

The old Rig.do.ts and Mayor.do.ts files are kept for reference but
no longer exported from the worker. Handler rerouting, container SDK
migration, and WebSocket streaming are subsequent phases.

kiloconnect bot commented Feb 20, 2026

Code Review Summary

Status: 2 New Issues Found (53 total) | Recommendation: Address critical items before merge

Overview

| Severity | Count (new) | Count (total) |
| --- | --- | --- |
| CRITICAL | 0 | 3 |
| WARNING | 1 | 42 |
| SUGGESTION | 1 | 8 |
New Issues (this review pass)
| Severity | File | Line | Issue |
| --- | --- | --- | --- |
| WARNING | cloudflare-gastown/wrangler.jsonc | 41 | Destructive migration: deleted_classes: ["RigDO", "MayorDO"] permanently destroys all data in existing DO instances |
| SUGGESTION | src/lib/gastown/gastown-client.ts | 141 | Debug logging of request body keys on every API call — verbose for production |
Other Observations (not in diff)

Issues found in unchanged code that cannot receive inline comments:

| File | Line | Issue |
| --- | --- | --- |
| cloudflare-gastown/container/plugin/client.test.ts | 5-9 | TEST_ENV is missing the new required townId field added to GastownEnv — tests will fail with a TypeScript error |
| cloudflare-gastown/container/plugin/client.test.ts | 51 | Test URL assertion still uses the old /api/rigs/ format instead of the new /api/towns/:townId/rigs/ format |
PR Summary

This is a major refactoring PR that consolidates the Gastown architecture from a Rig-centric model to a Town-centric model:

  • Eliminates RigDO and MayorDO — all state now lives in TownDO
  • Adds AgentDO — per-agent event storage to stay within DO SQLite limits
  • Adds townId parameter to all rig-scoped API routes (/api/towns/:townId/rigs/:rigId/...)
  • Mayor is now a regular agent with role='mayor' instead of a separate DO
  • New WebSocket support in the container control server for real-time event streaming
  • Token propagation improvements: kilocode_token stored in town config and propagated via X-Town-Config header
  • Model parameter support — agents can now use configurable models via buildKiloConfigContent
  • last_activity_at made nullable — consistent with agents that haven't had activity yet
Files Reviewed (87 files)
  • cloudflare-gastown/container/Dockerfile - 0 issues
  • cloudflare-gastown/container/Dockerfile.dev - 0 issues
  • cloudflare-gastown/container/plugin/client.ts - 0 issues
  • cloudflare-gastown/container/plugin/types.ts - 0 issues
  • cloudflare-gastown/container/src/agent-runner.ts - 1 existing issue
  • cloudflare-gastown/container/src/completion-reporter.ts - 0 issues
  • cloudflare-gastown/container/src/control-server.ts - 1 existing issue
  • cloudflare-gastown/container/src/kilo-client.ts - 0 issues
  • cloudflare-gastown/container/src/kilo-server.ts - 0 issues
  • cloudflare-gastown/container/src/process-manager.ts - 4 existing issues
  • cloudflare-gastown/container/src/sse-consumer.ts - 0 issues
  • cloudflare-gastown/container/src/types.ts - 0 issues
  • cloudflare-gastown/src/db/tables/rig-agents.table.ts - 0 issues
  • cloudflare-gastown/src/db/tables/rig-beads.table.ts - 0 issues
  • cloudflare-gastown/src/dos/Agent.do.ts - 1 existing issue
  • cloudflare-gastown/src/dos/GastownUser.do.ts - 0 issues
  • cloudflare-gastown/src/dos/Mayor.do.ts - 0 issues (deleted)
  • cloudflare-gastown/src/dos/Rig.do.ts - 0 issues (deleted)
  • cloudflare-gastown/src/dos/Town.do.ts - 14 existing issues
  • cloudflare-gastown/src/dos/TownContainer.do.ts - 1 existing issue
  • cloudflare-gastown/src/dos/town/agents.ts - 3 existing issues
  • cloudflare-gastown/src/dos/town/beads.ts - 1 existing issue
  • cloudflare-gastown/src/dos/town/config.ts - 1 existing issue
  • cloudflare-gastown/src/dos/town/container-dispatch.ts - 2 existing issues
  • cloudflare-gastown/src/dos/town/mail.ts - 1 existing issue
  • cloudflare-gastown/src/dos/town/review-queue.ts - 2 existing issues
  • cloudflare-gastown/src/dos/town/rigs.ts - 2 existing issues
  • cloudflare-gastown/src/gastown.worker.ts - 0 issues
  • cloudflare-gastown/src/handlers/mayor-tools.handler.ts - 0 issues
  • cloudflare-gastown/src/handlers/mayor.handler.ts - 3 existing issues
  • cloudflare-gastown/src/handlers/rig-agent-events.handler.ts - 0 issues
  • cloudflare-gastown/src/handlers/rig-agents.handler.ts - 1 existing issue
  • cloudflare-gastown/src/handlers/rig-bead-events.handler.ts - 0 issues
  • cloudflare-gastown/src/handlers/rig-beads.handler.ts - 1 existing issue
  • cloudflare-gastown/src/handlers/rig-escalations.handler.ts - 0 issues
  • cloudflare-gastown/src/handlers/rig-mail.handler.ts - 0 issues
  • cloudflare-gastown/src/handlers/rig-molecules.handler.ts - 0 issues
  • cloudflare-gastown/src/handlers/rig-review-queue.handler.ts - 0 issues
  • cloudflare-gastown/src/handlers/town-events.handler.ts - 0 issues
  • cloudflare-gastown/src/handlers/towns.handler.ts - 0 issues
  • cloudflare-gastown/src/middleware/auth.middleware.ts - 0 issues
  • cloudflare-gastown/src/types.ts - 0 issues
  • cloudflare-gastown/src/ui/dashboard.ui.ts - 0 issues
  • cloudflare-gastown/test/e2e/*.sh - 1 existing issue
  • cloudflare-gastown/test/e2e/.wrangler-output.log - 1 existing issue
  • cloudflare-gastown/test/integration/rig-alarm.test.ts - 0 issues
  • cloudflare-gastown/test/integration/rig-do.test.ts - 0 issues
  • cloudflare-gastown/test/integration/town-container.test.ts - 0 issues
  • cloudflare-gastown/worker-configuration.d.ts - 0 issues (generated)
  • cloudflare-gastown/wrangler.jsonc - 1 new issue
  • cloudflare-gastown/wrangler.test.jsonc - 0 issues
  • plans/gastown-cloud-proposal-d.md - 0 issues
  • src/components/gastown/AgentCard.tsx - 0 issues
  • src/components/gastown/AgentStream.tsx - 0 issues
  • src/lib/gastown/gastown-client.ts - 1 new issue
  • src/routers/gastown-router.ts - 2 existing issues



jrf0110 commented Feb 20, 2026

Addressed all three review comments:

  1. Unused import (container-dispatch.ts:10) — Removed the dead buildRefinerySystemPrompt import. The refinery prompt is dynamically imported in Town.do.ts when dispatching a refinery agent.

  2. INSERT OR REPLACE (rigs.ts:54) — Switched to INSERT ... ON CONFLICT(id) DO UPDATE SET name=excluded.name, git_url=excluded.git_url, default_branch=excluded.default_branch. This preserves created_at and config on re-add.

  3. Unguarded JSON.parse (rigs.ts:29) — Wrapped in try/catch, falls back to {} on malformed JSON.
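
For reference, a minimal sketch of fixes 2 and 3. Table and column names mirror the rigs registry described above; the real module goes through the project's query helpers, so treat the shapes here as assumptions.

type RigInput = { id: string; name: string; gitUrl: string; defaultBranch: string };

function upsertRig(sql: SqlStorage, rig: RigInput): void {
  // ON CONFLICT preserves created_at and config on re-add, unlike INSERT OR REPLACE.
  sql.exec(
    `INSERT INTO rigs (id, name, git_url, default_branch, created_at)
     VALUES (?, ?, ?, ?, ?)
     ON CONFLICT(id) DO UPDATE SET
       name = excluded.name,
       git_url = excluded.git_url,
       default_branch = excluded.default_branch`,
    rig.id, rig.name, rig.gitUrl, rig.defaultBranch, Date.now()
  );
}

function parseRigConfig(raw: string | null): Record<string, unknown> {
  try {
    return JSON.parse(raw ?? '{}');
  } catch {
    return {}; // malformed JSON falls back to an empty config
  }
}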

…yor DO (#419)

Phase A2 of the town-centric refactor:

- All 16 handler files now route through getTownDOStub() instead of
  getRigDOStub() / getMayorDOStub()
- Added getTownId() helper to auth middleware (resolves from route param,
  JWT payload, or X-Town-Id header)
- Simplified town-events handler to single TownDO call (was fan-out)
- Mayor handler: configure/destroy are now no-ops (mayor auto-creates)
- towns.handler: rig creation configures TownDO + adds rig to registry
- Deleted Rig.do.ts (2378 lines) and Mayor.do.ts (453 lines)
- Fixed PR review comments (unused import, INSERT OR REPLACE, JSON.parse guard)
- Zero type errors
Phase B of the town-centric refactor:

- Rewrote process-manager.ts to use @kilocode/sdk:
  - createOpencode() replaces Bun.spawn('kilo serve')
  - client.session.create/prompt() replaces hand-rolled HTTP client
  - client.event.subscribe() replaces SSE text parser
  - Event sinks replace ring buffers (for WebSocket forwarding)

- Deleted files replaced by SDK (594 lines removed):
  - kilo-server.ts (273 lines): process spawn, port allocation, health polling
  - kilo-client.ts (102 lines): hand-rolled HTTP client with Zod parsing
  - sse-consumer.ts (219 lines): SSE text parser, reconnect logic

- Container and worker both compile clean
…n-request (#419)

Phase C + D of the town-centric refactor:

Phase C — WebSocket streaming:
- Container control server: added /ws WebSocket endpoint that broadcasts
  all SDK agent events to connected clients via registerEventSink()
- TownContainerDO: replaced HTTP polling (500ms interval) with WebSocket
  relay. Connects to container's /ws on start, relays frames to
  subscribed browser clients. Supports per-agent and wildcard subscriptions
  via client-side subscribe messages.
- Removed: polling timer, ring buffers, ticket system (tickets kept in
  control-server for backward compat during migration)

Phase D — Proactive startup + config:
- Town DO alarm now pings container health to keep it warm
- ensureContainerReady() checks container health on each alarm tick
- Config-on-request: container control server extracts X-Town-Config
  header and stores latest town config. Container dispatch already
  attaches this header on every fetch() from TownDO.

Both worker and container compile clean.
- rig-do.test.ts: All tests now use env.TOWN (TownDO) instead of
  env.RIG (RigDO). getOrCreateAgent and slingBead calls pass rigId.
- rig-alarm.test.ts: Updated to use TownDO alarm. setTownId tests
  replaced with configureRig/slingBead tests.
- town-container.test.ts: Heartbeat test uses env.TOWN.
- http-api.test.ts: No changes needed (tests HTTP routes via SELF.fetch).

Full project typecheck passes clean.
…trolResult, mayor agentId (#419)

- Remove unused PatrolResult import from Town.do.ts
- Remove unused staleThreshold variable in witnessPatrol
- Fix agentCompleted to fall back to mayor agent when agentId is empty
- Update integration tests: witnessPatrol is now private/internal,
  tests verify behavior through alarm side-effects instead of
  direct RPC return values

jrf0110 commented Feb 20, 2026

Addressed all 4 new review comments:

  1. CRITICAL: witnessPatrol private + tests expect PatrolResult — Updated tests to verify patrol behavior through alarm side-effects (check agent status after alarm) rather than calling witnessPatrol() as public RPC. The method is now an internal concern of the alarm handler.

  2. WARNING: staleThreshold unused — Removed the dead variable from witnessPatrol().

  3. WARNING: PatrolResult unused import — Removed from Town.do.ts imports.

  4. WARNING: agentCompleted with empty agentId — Added fallback in TownDO.agentCompleted() to look up the mayor agent by role='mayor' when agentId is empty, so the mayor's status correctly transitions to idle on completion.

… schema, fix townId resolution (#419)

- Add 'ln -s $(which kilo) /usr/local/bin/opencode' to both Dockerfiles
  so the SDK's createOpencodeServer() can find the binary (it spawns
  'opencode serve' internally)
- Fix getMayorStatus() to return { session: { agentId, sessionId, status,
  lastActivityAt } | null } matching the MayorStatusSchema the Next.js
  client expects (was returning { agent: ... } which caused Zod parse error)
- Fix townId getter to fall back to ctx.id.toString() when ctx.id.name
  is undefined
… client schema (#419)

sendMayorMessage was returning void, causing the gastown-client's
parseSuccessData to fail (no 'data' field in response envelope).
Now returns { agentId, sessionStatus } matching MayorSendResultSchema.
…ainer (#419)

The CLI's binary resolver may pick either variant depending on libc
detection. Install both to prevent ENOENT when the musl variant is
selected on glibc-based containers (and vice versa).
…nfig (#419)

The KILOCODE_TOKEN was only stored in per-rig KV config, which is empty
for towns created before the refactor (config was in the old RigDO).

- Added kilocode_token field to TownConfigSchema
- configureRig now also stores the token in town config
- startAgentInContainer/startMergeInContainer fall back to
  townConfig.kilocode_token when rig-level token is missing
- buildContainerConfig includes kilocode_token in X-Town-Config header

This ensures all agents (including the mayor, which doesn't have a
rig-specific config) can authenticate with the LLM gateway.
…op orphan alarms (#419)

Two fixes:

1. Container's buildAgentEnv now reads KILOCODE_TOKEN from the
   X-Town-Config header data (via getCurrentTownConfig()) as a
   final fallback after request envVars and process.env. This
   closes the gap where the token was being sent in the header
   but not read by the agent env builder.

2. Town DO alarm now checks for configured rigs before doing any
   work. If no rigs exist, the alarm returns without re-arming,
   preventing orphan DO alarms from spinning up containers for
   deleted/unconfigured towns.
…ntainer wake (#419)

The previous fix stopped the alarm from re-arming when no rigs existed,
which prevented on-demand operations (sendMayorMessage, slingBead) from
re-arming the alarm for subsequent ticks.

Now only the proactive container health ping is gated on hasRigs. The
alarm always re-arms so scheduling, patrol, and review queue processing
continue to run.
Logs X-Town-Config header presence and kilocode_token status at:
1. Control server middleware (when header is received)
2. buildAgentEnv fallback (when checking getCurrentTownConfig)

This will help identify where the token is being lost.
Real end-to-end tests using wrangler dev + Docker containers:
- Test 1: Health check (worker responds correctly)
- Test 2: Create a town (CRUD operations)
- Test 3: Create rig with kilocode_token (token propagation to town config)
- Test 4: Mayor receives token (full flow: town config → X-Town-Config → container)
- Test 5: Single container per town (no phantom containers)

Also updated wrangler.test.jsonc to match new DO bindings.
…iner wake policy (#419)

E2E test suite (20 tests, all passing):
 1. Health check
 2. Create town
 3. Create rig with kilocode_token propagation
 4. Mayor receives token via X-Town-Config → container
 5. Same-town container reuse (no extra containers on repeat messages)
 6. Mayor status transitions (null → active)
 7. List rigs
 8. Town config CRUD (get/patch)
 9. Delete town
10. Delete rig
11. Bead CRUD (create/list/get)
12. Agent register and list
13. Sling bead (atomic create+assign)
14. Agent hook/unhook
15. Inter-agent mail (send + check delivery)
16. Bead events recorded on sling
17. Multiple towns per user are independent
18. Town config env vars persist
19. Escalation beads
20. Full E2E flow (15 steps: town→config→rig→beads→agents→hooks→sling→mail→events→mayor→container→token→status)

Bug fixes found by tests:
- slingBead returned stale bead/agent (pre-hook state). Fixed to
  re-read both after hookBead.
- ensureContainerReady was waking containers for idle towns. Now
  gated on hasActiveWork() to prevent phantom container wakeups.
- Unique user IDs per test (was colliding within same second).
…fix WebSocket flow (#419)

Root cause: ctx.id.name returns undefined in wrangler local dev, so the
TownDO used ctx.id.toString() (a hex hash) as townId. This caused
getTownContainerStub() calls from the TownDO (alarm, sendMayorMessage)
to create a DIFFERENT TownContainerDO than the worker routes, resulting
in agents running in one container but events polled from another.

Fixes:
- TownDO.setTownId(): persists the real town UUID in KV on first
  configureRig call. Loaded on init so alarm/internal calls use the
  correct ID for container stubs.
- TownContainerDO: restored HTTP polling (via alarm) for DO→container
  event relay. containerFetch() doesn't support WebSocket upgrades,
  and setInterval callbacks can't call containerFetch (no request
  context). Alarm-based polling works correctly.
- Container process-manager: event buffer (broadcastEvent → bufferAgentEvent)
  moved before broadcastEvent to ensure const initialization order.

Test 22 validates full WebSocket event flow: 20 messages received
including agent.status, message.updated, session.idle, etc.
@@ -0,0 +1,131 @@
▲ [WARNING] Unrecognised WRANGLER_LOG value "/users/john/projects/professional/kilo/cloud/cloudflare-gastown/test/e2e/.wrangler-output.log", expected "none" | "error" | "warn" | "info" | "log" | "debug", defaulting to "log"...

[WARNING]: Local dev artifact committed — this 131-line wrangler output log contains local filesystem paths (/users/john/projects/...) and should not be in version control.

Consider adding *.log or .wrangler-output.log to .gitignore and removing this file.

// OPEN QUESTION: How do we determine which rig an agent belongs to?
// Currently agents aren't rig-scoped in the schema. For now, use the
// first rig as a fallback. This needs a rig_id column on rig_agents.
const rigList = rigs.listRigs(this.sql);

[SUGGESTION]: rigs.listRigs(this.sql) is called inside the for loop, executing a SQL query for every pending agent. Since the rig list doesn't change during the loop, hoist it before the loop (next to townConfig on line 924) to avoid redundant queries.

Suggested change
const rigList = rigs.listRigs(this.sql);
const rigId = rigList[0]?.id ?? '';

And add const rigList = rigs.listRigs(this.sql); before the for loop on line 926.

…ling, E2E tests 21-28 (#419)

Fixes:
- TownDO.resolveKilocodeToken(): scans all rig configs as fallback
  when town config doesn't have the token. Auto-propagates on first find.
- configureRig: proactively starts container on rig creation
- ensureContainerReady: keeps container warm for 5min after rig config
- TownContainerDO: alarm-based HTTP polling replaces setInterval
  (containerFetch only works in request/alarm context, not timers)
- process-manager: event buffer moved before broadcastEvent for
  correct initialization order

E2E tests 21-28:
21. Deep container config verification (every layer traced)
22. WebSocket event flow (20 events: SDK → buffer → poll → relay → client)
23. Token propagation trace (step-by-step verification)
24. Stream ticket flow (UI path: ticket → WS URL → events)
25. Rig without token (graceful degradation)
26. Next.js rig creation (cross-service verification)
27. Check user's wrangler (port 8787 token test)
28. Full E2E on port 8787 (token + container + WS events)

All 28 tests pass. Verified on both dedicated test instance (port 9787)
and user's live wrangler (port 8787).
Major simplification of the event streaming architecture:

BEFORE: browser → worker → TownContainerDO (intercept WS, create
WebSocketPair, poll container via HTTP, relay events) → container

AFTER: browser → worker → TownContainerDO → container Bun.serve
(direct WebSocket passthrough, no relay, no polling)

Changes:
- TownContainerDO: removed ALL fetch() override, polling, alarm,
  WebSocketPair relay code. The base Container class handles
  everything — it proxies WebSocket upgrades directly to the
  container's Bun server via tcpPort.fetch().

- Container control-server: enhanced WebSocket handler to match
  both /ws and /agents/:id/stream URL patterns (the full worker
  path arrives at the container). Agent-filtered event delivery.
  Backfill on connect (sends all buffered events immediately).
  Subscribe message support for dynamic agent filtering.

- Event frame format: uses 'event' key (not 'type') to match
  what AgentStream.tsx reads (msg.event).

E2E test 22 confirms: 19 events received via direct WebSocket
including backfill of 18 historical events + 1 live heartbeat.

27/28 tests pass. Test 28 (port 8787) shows a warning when the
user's wrangler runs old code — it's a diagnostic, not a failure.
const townId = getTownId(c);
if (!townId) return c.json(resError('Missing townId'), 400);
const town = getTownDOStub(c.env, townId);
await town.deleteBead(params.beadId);

[WARNING]: handleDeleteBead lost 404 response for non-existent beads

The old code returned 404 when the bead wasn't found:

const deleted = await rig.deleteBead(params.beadId);
if (!deleted) return c.json(resError('Bead not found'), 404);

The new town.deleteBead() returns void and silently succeeds even if the bead doesn't exist, so the handler always returns 200. This is a behavioral regression for API consumers that rely on 404 to detect missing beads.

Consider having deleteBead return a boolean, or check existence first.
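
A sketch of the boolean approach, using rowsWritten from Cloudflare's DO SQL cursor; the handler lines mirror the old code quoted above.

// town/beads.ts (sketch): report whether a row was actually removed.
export function deleteBead(sql: SqlStorage, beadId: string): boolean {
  const cursor = sql.exec(`DELETE FROM rig_beads WHERE id = ?`, beadId);
  return cursor.rowsWritten > 0;
}

// Handler (sketch): keep the old 404 contract.
// const deleted = await town.deleteBead(params.beadId);
// if (!deleted) return c.json(resError('Bead not found'), 404);
// return c.json(resSuccess({ deleted: true }), 200);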

if (agent.current_hook_bead_id === beadId) return;

// Agent already has a different hook — unhook first
if (agent.current_hook_bead_id) {

[WARNING]: hookBead silently auto-reassigns instead of throwing

The old Rig.do.ts threw an error when an agent was already hooked to a different bead:

throw new Error(`Agent ${agentId} is already hooked to bead ${agent.current_hook_bead_id}`);

The new code silently unhooks from the previous bead and hooks to the new one. This changes the semantics from "fail-fast" to "auto-reassign," which could mask bugs where agents are accidentally double-hooked (e.g. a race between slingBead and schedulePendingWork).

If this is intentional, a log line here would help with debugging.

…at) (#419)

The kilo provider wraps OpenRouter, so model IDs must be in OpenRouter
format: 'anthropic/claude-sonnet-4' (not 'anthropic/claude-sonnet-4.6').

- providerID is always 'kilo' (routes through Kilo gateway/OpenRouter)
- modelID is the full OpenRouter model ID (e.g. 'anthropic/claude-sonnet-4')
- Updated all default model references from 'anthropic/claude-sonnet-4.6'
  to 'anthropic/claude-sonnet-4'
- Updated agent config models in agent-runner.ts (code, general, plan, explore)
- Fixed process-manager to always use providerID='kilo' instead of
  splitting the model string into provider/model
const rigConfig = await this.getMayorRigConfig();
const kilocodeToken = await this.resolveKilocodeToken();

await dispatch.startAgentInContainer(this.env, this.ctx.storage, {

[WARNING]: startAgentInContainer returns boolean but the return value is ignored here

startAgentInContainer returns false when the container rejects the start request (e.g. container not ready, bad config). When that happens, line 569 still marks the agent as 'working' and the caller receives sessionStatus: 'starting' — a false positive that will leave the mayor stuck in a phantom "starting" state until the next alarm patrol detects it.

Suggested change
- await dispatch.startAgentInContainer(this.env, this.ctx.storage, {
+ const started = await dispatch.startAgentInContainer(this.env, this.ctx.storage, {


if (isAlive) {
// Send follow-up message
await dispatch.sendMessageToAgent(this.env, townId, mayor.id, message);

[WARNING]: sendMessageToAgent returns boolean but the return value is ignored

sendMessageToAgent returns false when the container rejects the message (e.g. agent died between the status check and the message send). The caller still gets sessionStatus: 'active', hiding the failure. Consider checking the return value and falling back to re-creating the session (similar to the old sendMessage retry logic in the deleted Mayor.do.ts).

console.warn(`${TOWN_LOG} alarm: missing ctx.id.name; skipping watchdog`);
await this.ctx.storage.setAlarm(Date.now() + HEARTBEAT_ALARM_INTERVAL_MS);
console.warn(`${TOWN_LOG} alarm: missing townId; skipping`);
return;

[WARNING]: Early return without re-arming the alarm permanently kills the alarm loop

If townId is ever falsy (e.g. the KV entry hasn't been set yet because setTownId was never called), the alarm returns without scheduling the next tick. There's no external mechanism to restart it — the alarm is dead forever.

The old code re-armed even on missing townId. Consider re-arming here too:

Suggested change
return;
console.warn(`${TOWN_LOG} alarm: missing townId; skipping`);
await this.ctx.storage.setAlarm(Date.now() + IDLE_ALARM_INTERVAL_MS);

…container env injection (#419)

The token was not reaching the container because it wasn't in the
town config (created before the propagation code was deployed).
Multiple fallback paths (request envVars, X-Town-Config header,
process.env) all failed because the source of truth was empty.

New approach: store the token directly on the TownContainerDO's KV
storage via setEnvVar(). This is loaded into envVars during the DO
constructor (before the container boots), so the token is always in
process.env.KILOCODE_TOKEN inside the container.

The token is set via:
1. configureRig() → container.setEnvVar('KILOCODE_TOKEN', token)
2. sendMayorMessage() → container.setEnvVar('KILOCODE_TOKEN', token)

This is the most reliable path — doesn't depend on X-Town-Config
headers, request body envVars, or town config. The token goes
directly from the TownDO to the TownContainerDO's persistent
storage, and from there into the container's OS environment.
…dMayorMessage (#419)

Logs exactly what's in envVars, whether JWT was minted, whether
kilocodeToken came from params or town config, and what
resolveKilocodeToken found.
)

The debug logs from the else branch aren't appearing because the mayor
is already running. This log fires unconditionally to show what
containerStatus.status is and whether isAlive is true.

try {
console.log(`${MANAGER_LOG} Subscribing to events for agent ${agent.agentId}...`);
const result = await client.event.subscribe();

[WARNING]: Duplicate event subscriptions per workdir

event.subscribe() creates a global event stream from the SDK server — it receives events from ALL sessions on that server, not just this agent's session. Since the SDK server is shared per workdir (sdkInstances map), every agent in the same worktree creates a separate subscription to the same event stream.

The session filter at line 189 prevents cross-agent event leakage, but each subscription still processes every event from the server. With N agents sharing a worktree, each event is processed N times (once per subscription), leading to N×N broadcast calls for the same event (N subscriptions × N agents' events).

Consider either:

  1. A single shared subscription per SDK instance that dispatches events to the correct agent, or
  2. Using a session-scoped subscribe API if the SDK supports it.
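
Option 1 could look roughly like this. event.subscribe() is the SDK call used in this PR; the async-iterable shape of its result, the sessionID field on events, and the dispatch map are all assumptions.

import type { OpencodeClient } from '@kilocode/sdk';

type AgentEvent = { type: string; sessionID?: string };

const handlersBySession = new Map<string, (event: AgentEvent) => void>();

// One subscription per SDK instance; events fan out by session instead of
// every agent opening its own global subscription.
async function startSharedSubscription(client: OpencodeClient): Promise<void> {
  const result = await client.event.subscribe();
  for await (const event of result.stream as AsyncIterable<AgentEvent>) {
    if (!event.sessionID) continue;
    handlersBySession.get(event.sessionID)?.(event);
  }
}

// Each agent registers a handler instead of subscribing:
// handlersBySession.set(agent.sessionId, (e) => broadcastEvent(agent.agentId, e));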

// Detect completion. session.idle means "done processing this turn."
// Mayor agents are persistent — session.idle for them means "turn done,"
// not "task finished." Only non-mayor agents exit on idle.
const isTerminal = event.type === 'session.idle' && request.role !== 'mayor';

[WARNING]: session.idle as terminal event is a behavioral regression

The old sse-consumer.ts (now deleted) explicitly excluded session.idle from terminal events:

"We intentionally exclude session.idle, message.completed, and assistant.completed — these fire after every LLM turn. A polecat may need multiple turns (tool calls → responses → more tool calls) before it's actually done."

The old code used session.completed as the definitive terminal signal. Using session.idle here could cause premature agent termination if the SDK fires it between multi-turn tool-call sequences (e.g., after the first tool response but before the agent decides to make another tool call).

If the SDK doesn't emit session.completed, consider relying on the gt_done tool call / completion reporter callback instead, which was the authoritative signal in the old architecture.

// ctx.id.name should be the town UUID (set via idFromName in getTownDOStub).
// In some runtimes (local dev) .name is undefined. We persist the ID
// in KV on first access so it survives across requests.
return this._townId ?? this.ctx.id.name ?? this.ctx.id.toString();

[WARNING]: ctx.id.toString() fallback returns hex DO ID, not the town UUID

When both _townId and ctx.id.name are null/undefined, this falls back to ctx.id.toString() which returns the hex representation of the DO's unique ID (e.g., "0000017f...").

Since getTownContainerStub uses env.TOWN_CONTAINER.idFromName(townId), passing a hex ID instead of the original town UUID would derive a different DO instance, silently breaking all container interactions (start agent, health check, merge, etc.).

Consider throwing an error instead of silently using the wrong ID:

if (!this._townId && !this.ctx.id.name) {
  throw new Error('TownDO: townId not set — call setTownId() first');
}

…ation (#419)

The mayor should work without any rigs. Previously the token was only
stored when a rig was created via configureRig. Now the tRPC createTown
mutation generates a kilocode_token and stores it in the town config
immediately.

This means:
1. Create town → token stored in town config
2. Send mayor message → resolveKilocodeToken() finds it in town config
3. Agent starts with KILOCODE_TOKEN in envVars → KILO_CONFIG_CONTENT set

Also added kilocode_token to the client-side TownConfigSchema so the
updateTownConfig call accepts it.
…sages work (#419)

The first message worked because startAgentInContainer sends the model
in the prompt body. Follow-up messages via sendMessage don't specify a
model — the kilo server uses the session's default from KILO_CONFIG_CONTENT.
But the server validates the model against its local model registry before
routing to the gateway, and 'anthropic/claude-sonnet-4.6' wasn't registered.

Fix: buildKiloConfigContent now explicitly registers the model (and
claude-haiku-4.5) in the kilo provider's models list so the server
accepts them without a remote fetch. Also accepts the model as a
parameter so the town config's default_model takes effect.

Reverted model IDs back to anthropic/claude-sonnet-4.6 (the correct
Kilo gateway model ID).
…ound (#419)

Follow-up messages via sendMessage were sent without a model parameter,
causing the kilo server to use the session's default model lookup which
failed because the model wasn't in the server's local registry.

Now: the model from the initial startAgent request is stored on the
ManagedAgent and passed explicitly on every follow-up prompt call.
This ensures the kilo provider routes through the gateway consistently,
bypassing the local model registry validation.
const rigList = rigs.listRigs(this.sql);
const rigId = rigList[0]?.id ?? '';
const rigConfig = await this.getRigConfig(rigId);
if (!rigConfig) return;

[WARNING]: processReviewQueue leaks popped entry on early return

Line 1125 calls popReviewQueue() which transitions the entry from 'pending' to 'running'. If rigConfig is null here, the function returns without completing or failing the entry. The entry stays stuck in 'running' state until recoverStuckReviews resets it after 5 minutes.

Consider marking the entry as 'failed' before returning:

if (!rigConfig) {
  reviewQueue.completeReview(this.sql, entry.id, 'failed');
  return;
}

await mayor.destroy();
// No-op: we don't want to destroy the entire town from this endpoint.
console.log(`${MAYOR_HANDLER_LOG} handleDestroyMayor: no-op for townId=${params.townId}`);
return c.json(resSuccess({ destroyed: true }), 200);

[WARNING]: handleDestroyMayor returns { destroyed: true } but is a no-op

Callers (e.g. the frontend handleDeleteTown) will receive a success response and believe the mayor was destroyed, when nothing actually happened. This is misleading — consider either:

  • Returning a different response indicating the endpoint is deprecated/no-op
  • Actually stopping the mayor agent in the container via town.agentCompleted()

…ayor (#419)

The gastown plugin checks GASTOWN_AGENT_ROLE === 'mayor' to decide
whether to create a MayorGastownClient (town-scoped, no rig needed)
or a GastownClient (rig-scoped, requires GASTOWN_RIG_ID).

Without this env var, the mayor was treated as a rig agent, which
failed because the mayor has no real rig ID — causing 'missing townId'
errors and no tools being registered.
…id userId (#419)

The mayor's agent JWT was minted with an empty userId because
sendMayorMessage used rigConfig?.userId which is null when no rigs
exist. The auth middleware rejected the JWT with 401.

Fix: store owner_user_id in town config during createTown (tRPC).
sendMayorMessage reads it from townConfig.owner_user_id as the
primary source, falling back to rigConfig.userId.
sql,
/* sql */ `
SELECT * FROM ${rig_beads}
WHERE ${rig_beads.columns.status} IN ('open', 'in_progress')

[WARNING]: prime() returns ALL open beads instead of agent-specific ones

The old Rig.do.ts prime() filtered open_beads with:

SELECT * FROM rig_beads WHERE assignee_agent_id = ? AND status != 'closed'

The new code returns all open/in_progress beads across all agents:

SELECT * FROM rig_beads WHERE status IN ('open', 'in_progress') LIMIT 20

This is a behavior change that leaks other agents' work items into the prime context. The integration test at rig-do.test.ts:498 (expect(context.open_beads).toHaveLength(1)) may pass by coincidence (only one bead in the test), but in production with multiple agents, each agent would see all beads.

Consider restoring the assignee_agent_id = ? filter.
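
A plain-SQL sketch of the query with the assignee filter restored, using the DO SqlStorage API directly; the real module goes through the query()/rig_beads table helpers, so this is illustrative only.

function openBeadsForAgent(sql: SqlStorage, agentId: string) {
  return sql
    .exec(
      `SELECT * FROM rig_beads
       WHERE assignee_agent_id = ?
         AND status IN ('open', 'in_progress')
       LIMIT 20`,
      agentId
    )
    .toArray();
}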

/* sql */ `
SELECT * FROM ${rig_mail}
WHERE ${rig_mail.columns.to_agent_id} = ?
AND ${rig_mail.columns.delivered_at} = ?

[WARNING]: checkMail marks mail as delivered BEFORE reading — breaks test assertion

The old Rig.do.ts checkMail read undelivered mail first, then marked it as delivered. The returned messages had delivered: false.

The new implementation marks all mail as delivered (line 68-78), then reads back by matching delivered_at = ?. The returned messages will have delivered: true (since they were just updated).

The integration test at rig-do.test.ts:292 asserts:

expect(mailbox[0].delivered).toBe(false);

This will fail with the new implementation since the mail is already marked delivered = 1 before the SELECT.
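
The old ordering can be restored by reading before marking. A plain-SQL sketch, assuming the delivered / delivered_at / to_agent_id columns shown above; the real module uses the query()/rig_mail helpers.

function checkMail(sql: SqlStorage, agentId: string) {
  // Read undelivered mail first so the returned rows still show delivered = 0,
  // matching the old Rig.do.ts behavior and the rig-do.test.ts assertion.
  const undelivered = sql
    .exec(`SELECT * FROM rig_mail WHERE to_agent_id = ? AND delivered = 0`, agentId)
    .toArray();

  // Then mark them delivered.
  sql.exec(
    `UPDATE rig_mail SET delivered = 1, delivered_at = ? WHERE to_agent_id = ? AND delivered = 0`,
    new Date().toISOString(),
    agentId
  );

  return undelivered;
}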

…:rigId (#419)

All rig-scoped routes now include townId in the URL path, eliminating
the need for X-Town-Id headers or JWT townId extraction.

Before: /api/rigs/:rigId/beads, /api/rigs/:rigId/agents, etc.
After:  /api/towns/:townId/rigs/:rigId/beads, etc.

Updated:
- gastown.worker.ts: all 30+ rig routes moved under /api/towns/:townId/
- gastown-client.ts: all rig functions now take townId as first param
- gastown-router.ts: all tRPC procedures pass rig.town_id to client
- container plugin client.ts: rigPath() includes townId from env
- container plugin types.ts: GastownEnv now includes townId
- container process-manager.ts: agent-events POST uses new path

The townId is always available from the URL — no header or JWT magic.
…ng rig context (#419)

Three fixes:

1. AgentSchema.last_activity_at: changed from z.string() to
   z.string().nullable() — agents have null last_activity_at before
   their first heartbeat. AgentCard component updated to handle null.

2. TaggedBeadEventSchema.rig_id/rig_name: changed from z.string() to
   z.string().optional() — town events come from the TownDO which
   doesn't tag events with rig context (the old fan-out across Rig
   DOs is gone).

3. Container agent-events 404: the container code already uses the
   new /api/towns/:townId/rigs/_/agent-events path, but the running
   container image needs to be rebuilt.
const kilocodeToken = generateApiToken(ctx.user, undefined, {
expiresIn: TOKEN_EXPIRY.thirtyDays,
});
await withGastownError(() =>

[WARNING]: Non-atomic town creation — orphaned town on updateTownConfig failure

If updateTownConfig throws (e.g. the Town DO is unreachable), the town has already been persisted by createTown on line 47 but has no kilocode_token or owner_user_id. The error propagates to the caller, who sees a failure, but the town row remains in the GastownUser DO.

Consider wrapping both calls in a transaction-like pattern (create + configure, roll back on failure) similar to how handleCreateRig in towns.handler.ts rolls back the rig if configureRig fails.
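
A compensating rollback could look like the sketch below, modeled on the handleCreateRig rollback mentioned above. createTown, updateTownConfig, and deleteTown are the client calls discussed in this PR; their exact signatures are assumed.

async function createTownWithConfig(userId: string, name: string, kilocodeToken: string) {
  const town = await createTown(userId, name);
  try {
    await updateTownConfig(town.id, {
      kilocode_token: kilocodeToken,
      owner_user_id: userId,
    });
  } catch (err) {
    // Compensate: don't leave an orphaned town with no token or owner.
    await deleteTown(town.id).catch(() => undefined);
    throw err;
  }
  return town;
}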


// ── Name Allocation ─────────────────────────────────────────────────

export function allocatePolecatName(sql: SqlStorage, rigId: string): string {

[WARNING]: allocatePolecatName ignores rigId for filtering — shared name pool across all rigs

The function accepts rigId but only uses it in the fallback name (line 261). The name pool query on line 248 selects ALL polecats regardless of rig. Since agents are now town-scoped (not rig-scoped), this means a town with 2 rigs shares the 20-name pool. After 20 polecats across all rigs, every new polecat gets a fallback name.

This is probably fine for now, but the rigId parameter is misleading — it suggests rig-scoped allocation. Consider either documenting this is town-wide or removing the unused rig filter.

GASTOWN_AGENT_ID: request.agentId,
GASTOWN_RIG_ID: request.rigId,
GASTOWN_TOWN_ID: request.townId,
GASTOWN_AGENT_ROLE: request.role,

[SUGGESTION]: New GASTOWN_AGENT_ROLE env var is set here but the GastownEnv type in plugin/types.ts doesn't include it.

The plugin client reads env vars via the typed GastownEnv interface. If the plugin needs to know the agent's role (e.g. to adjust behavior for mayor vs polecat), it can't access GASTOWN_AGENT_ROLE in a type-safe way. Consider adding it to GastownEnv if the plugin should use it, or document that it's only for internal container use.

Was using '_' as a placeholder for rigId in the agent-events POST path.
The ManagedAgent already has rigId from the start request — use it.
} from './kilo-server';
import { createKiloClient } from './kilo-client';
import { createSSEConsumer, isCompletionEvent, type SSEConsumer } from './sse-consumer';
import { createOpencode, createOpencodeClient, type OpencodeClient } from '@kilocode/sdk';

[SUGGESTION]: Unused import — createOpencodeClient is imported but never used in this file. Only createOpencode and OpencodeClient (type) are referenced.

Suggested change
- import { createOpencode, createOpencodeClient, type OpencodeClient } from '@kilocode/sdk';
+ import { createOpencode, type OpencodeClient } from '@kilocode/sdk';

const townId = getTownId(c);
if (!townId) return c.json(resError('Missing townId'), 400);
const town = getTownDOStub(c.env, townId);
await town.deleteAgent(params.agentId);

[WARNING]: handleDeleteAgent lost 404 response for non-existent agents — same issue as handleDeleteBead. The old code checked the return value of deleteAgent() and returned 404 if the agent wasn't found. Now it always returns { deleted: true } even if the agent doesn't exist.

agents.deleteAgent() in town/agents.ts silently succeeds when the agent doesn't exist (the DELETE query affects 0 rows), so callers get a misleading 200 response.

CURRENT_TOWN_ID="$TOWN_ID"

echo " Creating bead..."
api_post "/api/rigs/${RIG_ID}/beads" '{"type":"issue","title":"E2E test bead","body":"Test body","priority":"high"}'

[CRITICAL]: E2E tests use old route format — will 404

All rig-scoped routes were moved from /api/rigs/:rigId/... to /api/towns/:townId/rigs/:rigId/... in gastown.worker.ts, but these e2e tests still use the old format (e.g. /api/rigs/${RIG_ID}/beads). Setting CURRENT_TOWN_ID only adds an X-Town-Id header — it doesn't change the URL path, so Hono will return 404 before any handler runs.

This affects tests 11–16 and any other test using /api/rigs/ paths. They need to be updated to use /api/towns/${TOWN_ID}/rigs/${RIG_ID}/....

…oute fix (#419)

Three fixes:

1. Parallel dispatch: schedulePendingWork now collects dispatch tasks
   and runs them via Promise.allSettled instead of sequential await.
   Agents across different rigs start simultaneously.

2. Agent→rig association: added rig_id column to rig_agents table.
   registerAgent and getOrCreateAgent store the rig_id. schedulePendingWork
   uses agent.rig_id to get the correct rig config (git URL, credentials)
   instead of always using the first rig.

3. Completion route: completion-reporter.ts updated to use the new
   /api/towns/:townId/rigs/:rigId/agents/:agentId/completed path
   instead of the old /api/rigs/:rigId/agents/:agentId/completed.
const rigList = rigs.listRigs(this.sql);
const rigId = rigList[0]?.id ?? '';
const rigConfig = await this.getRigConfig(rigId);
if (!rigConfig) return;

[WARNING]: popReviewQueue (line 1132) already marked this entry as 'running', but if rigConfig is null here the method returns without completing or failing the entry. The entry is stuck in 'running' state until recoverStuckReviews resets it after 5 minutes.

This means a valid review submission is silently delayed (or lost if the timeout recovery also fails). Consider calling reviewQueue.completeReview(this.sql, entry.id, 'failed') before returning, or restructuring so the config check happens before popping.

const rigList = rigs.listRigs(this.sql);
const rigId = rigList[0]?.id ?? '';
const rigConfig = await this.getRigConfig(rigId);
if (!rigConfig) return;

[WARNING]: processReviewQueue pops an entry (marking it 'running') on line 1137, but if rigConfig is null here, the method returns without calling completeReview(entry.id, 'failed'). The entry is left stuck in 'running' status until recoverStuckReviews resets it after 5 minutes.

This creates a retry loop: pop → stuck → recover → pop → stuck → recover, indefinitely, if the rig config is permanently missing (e.g. rig was deleted but review queue still has entries).

Consider failing the entry explicitly:

if (!rigConfig) {
  reviewQueue.completeReview(this.sql, entry.id, 'failed');
  return;
}

const kilocodeToken = generateApiToken(ctx.user, undefined, {
expiresIn: TOKEN_EXPIRY.thirtyDays,
});
await withGastownError(() =>

[WARNING]: Non-atomic town creation — if updateTownConfig fails here (e.g. network error to the gastown worker), the town is created but has no kilocode_token or owner_user_id. The caller receives an error, but the town already exists in the user's list.

The mayor won't be able to authenticate LLM calls until a rig is created (which also sets the token). Consider wrapping both calls or adding a retry/compensation mechanism.

…419)

Added rig_id column to both rig_agents and rig_beads tables.

- registerAgent and getOrCreateAgent store rig_id on agent creation
- createBead stores rig_id on bead creation
- slingBead passes rigId to both createBead and getOrCreateAgent
- listAgents handler passes rigId filter so each rig view shows
  only its own agents (not all agents in the town)
- listBeads handler passes rigId filter for the same reason
- AgentFilter and BeadFilter types include rig_id

This fixes: agents from rig A showing up on rig B's dashboard,
and beads from rig A appearing on rig B.
…n clarifications (#419)

Code fixes:
1. completeReviewWithResult: restored 'conflict' status handling — creates
   escalation bead with conflict message, marks review as 'failed'
2. review_submitted event: added newValue: input.branch to match old contract
3. destroy(): iterates all agents and calls AgentDO.destroy() before clearing
   Town DO storage (prevents orphaned AgentDO instances)
4. sessionCount leak: decrements on error path in startAgent catch block
5. configureRig: always updates town config token when rig token differs
   (supports token rotation)
6. removeRig: deletes KV entry rig:{rigId}:config alongside SQL row
7. slingBead: restored metadata parameter passthrough to createBead
8. Removed unused DEFAULT_MAX_POLECATS constant
9. sendMayorMessage: accepts optional model parameter from handler

jrf0110 commented Feb 22, 2026

Addressed all 13 review comments:

Code fixes (9):

  1. CRITICAL: completeReviewWithResult conflict handling — Restored 'conflict' status. On conflict: marks review as 'failed' and creates an escalation bead with the conflict message.
  2. review_submitted event newValue — Added newValue: input.branch to match the old contract.
  3. destroy() AgentDO cleanup — Iterates all agents and calls agentDO.destroy() before clearing Town DO storage.
  4. sessionCount leak on error — Decrements in the catch block of startAgent.
  5. configureRig stale token — Now updates town config whenever the rig token differs (!== rigConfig.kilocodeToken), supporting token rotation.
  6. removeRig KV leak — Deletes rig:${rigId}:config from KV when removing a rig.
  7. slingBead metadata — Restored metadata parameter passthrough to createBead.
  8. DEFAULT_MAX_POLECATS unused — Removed.
  9. model parameter in mayor message: sendMayorMessage now accepts optional model from the handler.

Design clarifications (4 — intentional, not bugs):
10. WebSocket ticket validation removed — Intentional. The TownContainerDO no longer intercepts WebSocket upgrades — they pass through to the container's Bun server via super.fetch(). The container is internal and not exposed externally. Cloudflare Access protects the worker layer.
11. Global process.env mutation — Acceptable. Each container runs agents for a single town. Env vars are set before the SDK server starts and inherited by the subprocess. Different KILOCODE_TOKEN per rig won't happen since the token is town-scoped.
12. Secrets in X-Town-Config header — Acceptable. This is DO-to-container internal communication over the Container DO's TCP port. No external network exposure. The header is not logged by the control server (only key presence is logged, not values).
13. X-Town-Config unvalidated — Acceptable for the same reason. The container is a trusted internal service. The header is set by the TownDO, not by external clients.

query(
this.sql,
/* sql */ `UPDATE ${rig_agents} SET ${rig_agents.columns.dispatch_attempts} = ? WHERE ${rig_agents.columns.id} = ?`,
[attempts, agent.id]

[WARNING]: dispatch_attempts incremented before rigConfig check — premature circuit breaker

The counter is bumped here (line 1020–1023), but if rigConfig is null (line 1034), the loop continues without ever attempting a dispatch. After 5 alarm cycles with a missing rig config, the circuit breaker on line 1013 fires and marks the bead as 'failed' + unhooks the agent — even though no actual dispatch was ever attempted.

Move the increment to after the rigConfig null check (or inside the dispatchTasks.push callback alongside the success reset on line 1063).
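
One way to restructure it, as a sketch: names (query, rig_agents, dispatch, dispatchTasks) follow the loop described above, and the surrounding code is assumed.

for (const agent of pendingAgents) {
  const rigConfig = await this.getRigConfig(agent.rig_id ?? rigId);
  if (!rigConfig) continue; // skip without charging the circuit breaker

  // Only count attempts where a dispatch is actually queued.
  const attempts = (agent.dispatch_attempts ?? 0) + 1;
  query(
    this.sql,
    /* sql */ `UPDATE ${rig_agents} SET ${rig_agents.columns.dispatch_attempts} = ? WHERE ${rig_agents.columns.id} = ?`,
    [attempts, agent.id]
  );

  dispatchTasks.push(() =>
    dispatch.startAgentInContainer(this.env, this.ctx.storage, {
      /* ...per-agent dispatch params... */
    })
  );
}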


async removeRig(rigId: string): Promise<void> {
await this.ensureInitialized();
rigs.removeRig(this.sql, rigId);

[WARNING]: removeRig doesn't clean up agents or beads with matching rig_id

Now that agents and beads have a rig_id column, deleting a rig leaves orphaned agents that schedulePendingWork will try to dispatch. Since the rig config is also deleted (line 197), these agents will hit the !rigConfig path on every alarm cycle, incrementing dispatch_attempts until the circuit breaker marks their beads as failed.

Consider also deleting (or re-assigning) agents/beads whose rig_id matches the deleted rig.
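
A sketch of the extra cleanup, written against the DO's SqlStorage and KV storage directly; the real method would go through the existing table helpers and the rig:{rigId}:config key mentioned earlier in this thread.

async function removeRigFully(sql: SqlStorage, storage: DurableObjectStorage, rigId: string) {
  // Drop rig-scoped work so schedulePendingWork never dispatches against a
  // deleted rig's (now missing) config.
  sql.exec(`DELETE FROM rig_beads WHERE rig_id = ?`, rigId);
  sql.exec(`DELETE FROM rig_agents WHERE rig_id = ?`, rigId);

  // Registry row plus the per-rig config KV entry.
  sql.exec(`DELETE FROM rigs WHERE id = ?`, rigId);
  await storage.delete(`rig:${rigId}:config`);
}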

… handlers (#419)

- All E2E tests updated from /api/rigs/:rigId to /api/towns/:townId/rigs/:rigId
- handleCreateBead: passes params.rigId as rig_id to createBead
- handleRegisterAgent: passes params.rigId as rig_id to registerAgent
- handleCreateEscalation: passes params.rigId as rig_id to createBead
- Test 29: fixed jq parsing for tRPC fallback, use BASE_URL not WRANGLER_URL

All 29 E2E tests pass.
{ "tag": "v2", "new_sqlite_classes": ["TownContainerDO"] },
{ "tag": "v3", "new_sqlite_classes": ["MayorDO"] },
{ "tag": "v4", "new_sqlite_classes": ["TownDO"] },
{ "tag": "v5", "new_sqlite_classes": ["AgentDO"], "deleted_classes": ["RigDO", "MayorDO"] },

[WARNING]: Destructive migration — deleted_classes: ["RigDO", "MayorDO"] permanently destroys all data stored in existing RigDO and MayorDO Durable Object instances.

If there is any production data in these DOs (rig configs, agent state, mayor sessions), it will be irrecoverably lost when this migration runs. Ensure either:

  1. All data has been migrated to TownDO before deploying, or
  2. This is acceptable because the system is not yet in production with real user data.

if (init?.body && typeof init.body === 'string') {
try {
const bodyKeys = Object.keys(JSON.parse(init.body));
console.log(`${CLIENT_LOG} ${method} ${path} bodyKeys=[${bodyKeys.join(',')}]`);

[SUGGESTION]: Debug logging of request body keys on every API call — this console.log fires on every POST/PUT/PATCH to the gastown service from the Next.js server. Consider gating behind a debug flag or removing before merge to reduce production log noise.
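
A minimal way to gate it, reusing the quoted snippet; GASTOWN_DEBUG is a hypothetical flag name, not something defined in this PR.

const debugLogging = process.env.GASTOWN_DEBUG === '1';

if (debugLogging && init?.body && typeof init.body === 'string') {
  try {
    const bodyKeys = Object.keys(JSON.parse(init.body));
    console.log(`${CLIENT_LOG} ${method} ${path} bodyKeys=[${bodyKeys.join(',')}]`);
  } catch {
    // ignore non-JSON bodies
  }
}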
