fix(codex): reliable activity signals — session-flag hooks, trust bypass, no_signal watchdog#170
Conversation
…ee at launch
Codex (0.136+) never loads hook config from AO's per-session worktrees:
project-local .codex/ layers only load from trusted directories, and for
linked git worktrees codex sources hook declarations from the matching
folder in the root checkout — so the workspace-local .codex/hooks.json AO
wrote was dead config and codex sessions never reported activity.
Deliver the hooks on the launch/resume command instead:
- -c 'hooks.<Event>=[...]' session-flag config for SessionStart,
UserPromptSubmit, PermissionRequest, and Stop; the session-flags layer
is not trust-gated and aggregates with the user's own hooks. The
existing --dangerously-bypass-hook-trust flag lets them run without a
persisted trust hash.
- -c 'projects={"<worktree>"={trust_level="trusted"}}' (inline-table
form; the dotted projects."<path>".trust_level key is corrupted by
codex's naive -c dot-split) so spawns into never-trusted repos don't
hang invisibly on the interactive directory-trust prompt. Both the
literal and symlink-resolved worktree paths are trusted.
- -c notice.hide_rate_limit_model_nudge=true so the "switch to a cheaper
model?" dialog can't hang a headless pane and swallow the spawn prompt.
GetAgentHooks no longer writes workspace files (worktrees stay clean); it
only strips entries older AO versions left in .codex/hooks.json,
preserving user hooks. UninstallHooks/AreHooksInstalled now operate on
those legacy files only.
Verified with a real spawn into a fresh untrusted repo: activity
transitions idle -> active -> idle hands-free, no .codex dir in the
worktree, no hook delivery failures.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
A codex upgrade broke activity tracking silently: sessions showed a confident "idle" forever while the agent worked. This bundle makes hook delivery verifiable end to end and makes any future breakage loud instead of invisible. Watchdog (no_signal status): - sessions.first_signal_at (migration 0010) records the FIRST hook callback per spawn/restore — raw signal receipt, independent of the derived activity state. lifecycle.ApplyActivitySignal stamps it (and writes through same-state repeats until stamped, e.g. Codex SessionStart reporting idle on an idle-seeded row); MarkSpawned clears it so every relaunch re-proves its hook pipeline. - deriveStatus downgrades a live session with no receipt to the new no_signal display status after a 90s grace, instead of idle. Terminated/PR-derived statuses still win. The sessions CDC update trigger now also fires on first-signal receipt so the dashboard transition is pushed live. - frontend maps no_signal -> needs_you (a human should look at the pane). Hook callback hardening (re-landed from the closed redesign PR #156): - the session manager pins each spawned session's PATH with the daemon executable's directory first, so the bare `ao` in hook commands resolves to the daemon that installed them, with a spawn-time warning when the pin cannot apply. - `ao hooks` failures append to $AO_DATA_DIR/hooks.log (size-capped); `ao doctor` gains a hooks-log check that warns on failures from the last 24h, and an ao-binary identity check. Codex launch-surface canary: - `ao doctor` gains codex-launch-flags: it runs probes exported by the codex adapter (built from the same flag builders as the real spawn argv) against the installed binary, warning when codex rejects the hook-trust bypass flag or AO's -c session-flag overrides. - codex hook callback timeout drops 30s -> 5s so a hung daemon cannot stall the agent's turn. Docs: the agent PRD callback section now describes the implemented flow (derive state, POST /sessions/{id}/activity, hooks.log) instead of the unbuilt SQLite/metadata merge, and notes that hook-derived metadata persistence (codex resume) is still not implemented. Frontend note: main's renderer test suite has 7 pre-existing failing files and a vite-config typecheck error unrelated to this change; workspace.test.ts (the only frontend file touched) passes 26/26. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Greptile SummaryThis PR redesigns codex activity hook delivery from workspace-local
Confidence Score: 5/5Safe to merge. The core hook-delivery rearchitecture is well-tested end-to-end with real spawn verification, the DB migration is reversible and backfills correctly, and the no_signal derivation is purely additive to existing status logic. All changed logic paths have direct unit tests (hook flag builders, TOML escaping, trust flag symlink handling, no_signal grace period, first_signal_at stamping, PATH pin, doctor probes). The migration adds a nullable column with a correct backfill and a reversible down-migration. The two flagged observations are a known CDC-entry limitation for the no_signal transition and a cosmetic TOML hex-casing nit, neither of which affects correctness. backend/internal/service/session/status.go: the no_signal grace-period derivation is read-side only, so CDC-driven dashboards won’t see the entry transition without a separate push mechanism. Important Files Changed
Reviews (3): Last reviewed commit: "fix(status): only derive no_signal for h..." | Re-trigger Greptile |
| flags := os.O_APPEND | os.O_CREATE | os.O_WRONLY | ||
| if info, err := os.Stat(path); err == nil && info.Size() > maxHooksLogBytes { | ||
| flags = os.O_TRUNC | os.O_CREATE | os.O_WRONLY | ||
| } |
There was a problem hiding this comment.
TOCTOU race between the size check and the open: two concurrent hook processes can both observe
size > maxHooksLogBytes, both open with O_TRUNC, and the second truncation discards the line written by the first. Since this is a best-effort diagnostic log the impact is limited, but the race can be closed by only ever appending and rotating on a separate, deliberate path — or by accepting that truncation may drop one entry and documenting it explicitly.
| flags := os.O_APPEND | os.O_CREATE | os.O_WRONLY | |
| if info, err := os.Stat(path); err == nil && info.Size() > maxHooksLogBytes { | |
| flags = os.O_TRUNC | os.O_CREATE | os.O_WRONLY | |
| } | |
| flags := os.O_APPEND | os.O_CREATE | os.O_WRONLY | |
| // NOTE: size check and open are not atomic; two concurrent writers may | |
| // both truncate. Acceptable for a best-effort diagnostic log. | |
| if info, err := os.Stat(path); err == nil && info.Size() > maxHooksLogBytes { | |
| flags = os.O_TRUNC | os.O_CREATE | os.O_WRONLY | |
| } |
| Message: fmt.Sprintf("codex rejected AO's launch flags (`codex %s`: %v) — codex sessions may spawn without activity hooks; a codex CLI update likely changed its flag/config surface", strings.Join(probe, " "), err), | ||
| } | ||
| } | ||
| if strings.Contains(string(out), "unknown configuration field") { |
There was a problem hiding this comment.
The sentinel
"unknown configuration field" is matched case-sensitively against combined stdout+stderr. If a future codex CLI capitalizes the first word or rephrases the message, the canary silently passes. Wrapping with strings.ToLower makes the detection robust to message casing changes.
| if strings.Contains(string(out), "unknown configuration field") { | |
| if strings.Contains(strings.ToLower(string(out)), "unknown configuration field") { |
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
…ding port The branch ported store_test.go wholesale from the closed redesign branch, whose copy predates #165 — silently dropping the session-worktrees round-trip test #165 added. Restore main's file and re-apply only this branch's addition (TestSessionFirstSignalRoundTrip). No other ported file lost main-side content (audited per-file against main; the remaining deletions are this branch's intended refactors). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
|
Good catch on Also audited every other ported file against main for the same failure mode — the remaining deletions are this branch's intended refactors (PATH-pin constructor changes, codex adapter rewrite, doc updates); no other main-side content was lost. #168's 🤖 Generated with Claude Code |
…eline Review finding: the no_signal downgrade had no harness-capability gate, but first_signal_at can only ever be stamped by an `ao hooks` callback. Ten spawnable harnesses (amp, aider, crush, grok, kimi, devin, auggie, continue, vibe, pi) install no hooks at all, so every live session of theirs would have flipped from idle to a permanent no_signal -> needs_you after the 90s grace. The session service now takes a SignalCapable predicate; daemon wiring injects activitydispatch.SupportsHarness (the deriver registry is the source of truth for "this harness can signal"). Left nil, the service never claims no_signal. A new dispatch test pins that every deriver token is a known harness name. Also from the same review: - lifecycle/manager.go and the 0010 migration claimed Codex's SessionStart reports idle as the first signal; both codex and claude-code derivers deliberately return no signal for session-start, so the comments now cite a real case (a lost "active" POST followed by a Stop hook landing idle). - docs/agent/README.md documents the gate and the restore caveat: a restored session the user never prompts has nothing to signal, so it shows no_signal after the grace until a receipt-only session-start signal exists. - 0010 migration uses DROP TRIGGER IF EXISTS per house style. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Conflict: backend/internal/adapters/agent/codex/hooks.go. Resolved by taking main's no-write codex adapter wholesale: #170 delivers codex hooks as -c session flags on the launch command, so GetAgentHooks no longer writes .codex/hooks.json into the worktree and this branch's EnsureWorkspaceGitignore call for codex has nothing left to ignore. The self-ignoring-gitignore fix still applies to adapters that do write worktree hook files (e.g. opencode). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Why
Codex sessions spawned by AO showed a confident "idle" forever while the agent was visibly working, and spawns into never-trusted repos hung invisibly. Root causes, verified against codex-cli 0.136.0 and its source:
.codex/layers only load from trusted directories, and for linked git worktrees codex sources hook declarations from the matching folder in the root checkout — so the workspace.codex/hooks.jsonAO wrote was dead config.--dangerously-bypass-hook-trustdoes not bypass the interactive "Do you trust this directory?" prompt, which blocked headless panes.This re-lands work stranded on the closed redesign PR #156 (cherry-picked codex fix) plus a watchdog so the next silent breakage of this class is visible.
What
Commit 1 — codex hook delivery + trust (verified by real spawn: idle → active → idle hands-free):
-c 'hooks.<Event>=[...]'session-flag config (not trust-gated; aggregates with user hooks) alongside the existing--dangerously-bypass-hook-trust.-c 'projects={"<worktree>"={trust_level="trusted"}}'(inline-table form — the dottedprojects."<path>".trust_levelkey is corrupted by codex's-cdot-splitting) suppresses the directory-trust prompt, scoped to the single invocation.-c notice.hide_rate_limit_model_nudge=truekeeps the rate-limit dialog from hanging headless panes.GetAgentHookswrites nothing into worktrees anymore (they stay git-clean); it only strips legacy AO entries.Commit 2 — watchdog + hardening:
sessions.first_signal_at(migration 0010): raw receipt of the first hook callback per spawn/restore. A live session with no receipt past a 90s grace derives the newno_signaldisplay status instead ofidle; frontend maps it to needs-you. CDC trigger updated so the transition pushes live.aoin hook commands can't resolve to a foreign CLI;ao hooksfailures land inhooks.log.ao doctorgains:hooks-log(warns on deliveries failed in the last 24h),ao-binary(PATH identity), andcodex-launch-flags— a canary that probes the installed codex with the exact flags AO injects (exported by the adapter from the same builders as the spawn argv), so a codex CLI change that breaks injection becomes a doctor warning.Verification
go build ./...,go test ./...(50 packages) green; changed packages lint clean (golangci-lint v2.12.2)..codex/and stayed git-clean;ao doctorshows the new checks passing against codex-cli 0.136.0.workspace.test.tspasses 26/26 (the only frontend file touched). Note: main's renderer suite has 7 pre-existing failing test files plus a vite-config typecheck error from the feat(frontend): scaffold for frontend with complimentary backend changes #168 scaffold, unaffected by this PR.Notes
codex/hooks.gotoo; whichever lands second resolves by keeping this PR's no-write codex adapter and dropping itsEnsureWorkspaceGitignorecall there (the conformance test passes vacuously for adapters that write nothing).agentSessionIdforcodex resume) remains unimplemented and is now documented as such — follow-up candidate.🤖 Generated with Claude Code