Skip to content

fix(codex): reliable activity signals — session-flag hooks, trust bypass, no_signal watchdog#170

Merged
ashish921998 merged 4 commits into
mainfrom
fix/codex-activity-signals
Jun 11, 2026
Merged

fix(codex): reliable activity signals — session-flag hooks, trust bypass, no_signal watchdog#170
ashish921998 merged 4 commits into
mainfrom
fix/codex-activity-signals

Conversation

@ashish921998

Copy link
Copy Markdown
Collaborator

Why

Codex sessions spawned by AO showed a confident "idle" forever while the agent was visibly working, and spawns into never-trusted repos hung invisibly. Root causes, verified against codex-cli 0.136.0 and its source:

  1. Codex never loads hook config from AO's per-session worktrees: project-local .codex/ layers only load from trusted directories, and for linked git worktrees codex sources hook declarations from the matching folder in the root checkout — so the workspace .codex/hooks.json AO wrote was dead config.
  2. --dangerously-bypass-hook-trust does not bypass the interactive "Do you trust this directory?" prompt, which blocked headless panes.
  3. Near the account rate limit, codex's "switch to a cheaper model?" dialog blocks the first turn and swallows the auto-submitted spawn prompt.

This re-lands work stranded on the closed redesign PR #156 (cherry-picked codex fix) plus a watchdog so the next silent breakage of this class is visible.

What

Commit 1 — codex hook delivery + trust (verified by real spawn: idle → active → idle hands-free):

  • Hooks ride the launch/resume argv as -c 'hooks.<Event>=[...]' session-flag config (not trust-gated; aggregates with user hooks) alongside the existing --dangerously-bypass-hook-trust.
  • -c 'projects={"<worktree>"={trust_level="trusted"}}' (inline-table form — the dotted projects."<path>".trust_level key is corrupted by codex's -c dot-splitting) suppresses the directory-trust prompt, scoped to the single invocation.
  • -c notice.hide_rate_limit_model_nudge=true keeps the rate-limit dialog from hanging headless panes.
  • GetAgentHooks writes nothing into worktrees anymore (they stay git-clean); it only strips legacy AO entries.

Commit 2 — watchdog + hardening:

  • New sessions.first_signal_at (migration 0010): raw receipt of the first hook callback per spawn/restore. A live session with no receipt past a 90s grace derives the new no_signal display status instead of idle; frontend maps it to needs-you. CDC trigger updated so the transition pushes live.
  • Session PATH pinned to the daemon binary so the bare ao in hook commands can't resolve to a foreign CLI; ao hooks failures land in hooks.log.
  • ao doctor gains: hooks-log (warns on deliveries failed in the last 24h), ao-binary (PATH identity), and codex-launch-flags — a canary that probes the installed codex with the exact flags AO injects (exported by the adapter from the same builders as the spawn argv), so a codex CLI change that breaks injection becomes a doctor warning.
  • Codex hook timeout 30s → 5s; agent PRD docs updated to the implemented callback flow.

Verification

  • go build ./..., go test ./... (50 packages) green; changed packages lint clean (golangci-lint v2.12.2).
  • Live: real AO spawn of a codex worker into a fresh, never-trusted repo — activity transitioned idle → active → idle with zero human interaction; worktree contained no .codex/ and stayed git-clean; ao doctor shows the new checks passing against codex-cli 0.136.0.
  • Frontend: workspace.test.ts passes 26/26 (the only frontend file touched). Note: main's renderer suite has 7 pre-existing failing test files plus a vite-config typecheck error from the feat(frontend): scaffold for frontend with complimentary backend changes #168 scaffold, unaffected by this PR.

Notes

  • fix(sessions): stop AO hook files from making every worktree permanently dirty #169 (dirty-worktree teardown) touches codex/hooks.go too; whichever lands second resolves by keeping this PR's no-write codex adapter and dropping its EnsureWorkspaceGitignore call there (the conformance test passes vacuously for adapters that write nothing).
  • Hook-derived metadata persistence (agentSessionId for codex resume) remains unimplemented and is now documented as such — follow-up candidate.

🤖 Generated with Claude Code

ashish921998 and others added 2 commits June 11, 2026 09:32
…ee at launch

Codex (0.136+) never loads hook config from AO's per-session worktrees:
project-local .codex/ layers only load from trusted directories, and for
linked git worktrees codex sources hook declarations from the matching
folder in the root checkout — so the workspace-local .codex/hooks.json AO
wrote was dead config and codex sessions never reported activity.

Deliver the hooks on the launch/resume command instead:
- -c 'hooks.<Event>=[...]' session-flag config for SessionStart,
  UserPromptSubmit, PermissionRequest, and Stop; the session-flags layer
  is not trust-gated and aggregates with the user's own hooks. The
  existing --dangerously-bypass-hook-trust flag lets them run without a
  persisted trust hash.
- -c 'projects={"<worktree>"={trust_level="trusted"}}' (inline-table
  form; the dotted projects."<path>".trust_level key is corrupted by
  codex's naive -c dot-split) so spawns into never-trusted repos don't
  hang invisibly on the interactive directory-trust prompt. Both the
  literal and symlink-resolved worktree paths are trusted.
- -c notice.hide_rate_limit_model_nudge=true so the "switch to a cheaper
  model?" dialog can't hang a headless pane and swallow the spawn prompt.

GetAgentHooks no longer writes workspace files (worktrees stay clean); it
only strips entries older AO versions left in .codex/hooks.json,
preserving user hooks. UninstallHooks/AreHooksInstalled now operate on
those legacy files only.

Verified with a real spawn into a fresh untrusted repo: activity
transitions idle -> active -> idle hands-free, no .codex dir in the
worktree, no hook delivery failures.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
A codex upgrade broke activity tracking silently: sessions showed a
confident "idle" forever while the agent worked. This bundle makes hook
delivery verifiable end to end and makes any future breakage loud
instead of invisible.

Watchdog (no_signal status):
- sessions.first_signal_at (migration 0010) records the FIRST hook
  callback per spawn/restore — raw signal receipt, independent of the
  derived activity state. lifecycle.ApplyActivitySignal stamps it (and
  writes through same-state repeats until stamped, e.g. Codex
  SessionStart reporting idle on an idle-seeded row); MarkSpawned clears
  it so every relaunch re-proves its hook pipeline.
- deriveStatus downgrades a live session with no receipt to the new
  no_signal display status after a 90s grace, instead of idle.
  Terminated/PR-derived statuses still win. The sessions CDC update
  trigger now also fires on first-signal receipt so the dashboard
  transition is pushed live.
- frontend maps no_signal -> needs_you (a human should look at the pane).

Hook callback hardening (re-landed from the closed redesign PR #156):
- the session manager pins each spawned session's PATH with the daemon
  executable's directory first, so the bare `ao` in hook commands
  resolves to the daemon that installed them, with a spawn-time warning
  when the pin cannot apply.
- `ao hooks` failures append to $AO_DATA_DIR/hooks.log (size-capped);
  `ao doctor` gains a hooks-log check that warns on failures from the
  last 24h, and an ao-binary identity check.

Codex launch-surface canary:
- `ao doctor` gains codex-launch-flags: it runs probes exported by the
  codex adapter (built from the same flag builders as the real spawn
  argv) against the installed binary, warning when codex rejects the
  hook-trust bypass flag or AO's -c session-flag overrides.
- codex hook callback timeout drops 30s -> 5s so a hung daemon cannot
  stall the agent's turn.

Docs: the agent PRD callback section now describes the implemented flow
(derive state, POST /sessions/{id}/activity, hooks.log) instead of the
unbuilt SQLite/metadata merge, and notes that hook-derived metadata
persistence (codex resume) is still not implemented.

Frontend note: main's renderer test suite has 7 pre-existing failing
files and a vite-config typecheck error unrelated to this change;
workspace.test.ts (the only frontend file touched) passes 26/26.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@greptile-apps

greptile-apps Bot commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR redesigns codex activity hook delivery from workspace-local .codex/hooks.json files to launch-command session-flag config (-c 'hooks.<Event>=[...]'), fixing the root cause where project-local .codex/ layers are trust-gated and dead for AO's per-session worktrees. It also adds a no_signal watchdog status (new first_signal_at DB column + CDC trigger update), pins spawned session PATH to the daemon binary so bare ao in hook commands resolves correctly, and adds ao doctor checks for hooks-log failures, AO binary identity, and codex launch-flag compatibility.

  • Hook delivery rearchitected: hooks now ride the codex --dangerously-bypass-hook-trust launch argv instead of .codex/hooks.json; GetAgentHooks only strips legacy file entries from reused worktrees.
  • no_signal watchdog: a new first_signal_at column tracks the first callback per spawn/restore; live sessions silent past a 90 s grace derive StatusNoSignal instead of a confident StatusIdle; the CDC trigger is updated to push the first-signal receipt live.
  • PATH pin + doctor canary: spawned sessions get the daemon binary's directory prepended to PATH; ao doctor gains hooks-log, ao-binary, and codex-launch-flags checks built from the same flag builders as the real spawn argv.

Confidence Score: 5/5

Safe to merge. The core hook-delivery rearchitecture is well-tested end-to-end with real spawn verification, the DB migration is reversible and backfills correctly, and the no_signal derivation is purely additive to existing status logic.

All changed logic paths have direct unit tests (hook flag builders, TOML escaping, trust flag symlink handling, no_signal grace period, first_signal_at stamping, PATH pin, doctor probes). The migration adds a nullable column with a correct backfill and a reversible down-migration. The two flagged observations are a known CDC-entry limitation for the no_signal transition and a cosmetic TOML hex-casing nit, neither of which affects correctness.

backend/internal/service/session/status.go: the no_signal grace-period derivation is read-side only, so CDC-driven dashboards won’t see the entry transition without a separate push mechanism.

Important Files Changed

Filename Overview
backend/internal/adapters/agent/codex/hooks.go Core rearchitecture: hooks moved from filesystem to launch-argv -c session flags; legacy cleanup logic retained for backwards-compat. TOML escaping and symlink path handling are carefully implemented and tested.
backend/internal/adapters/agent/codex/codex.go Launch/restore commands updated with new flag builders; DoctorLaunchProbes exported for canary checks.
backend/internal/storage/sqlite/migrations/0010_add_first_signal_at.sql New nullable column + backfill + updated CDC trigger. Backfill strategy is correct; trigger updated to detect NULL to non-NULL transition for live push.
backend/internal/service/session/status.go New no_signal grace-period logic added; signalCapable gate ensures hook-less harnesses are never downgraded.
backend/internal/session_manager/manager.go PATH pin logic (hookPATH/runtimeEnv) cleanly added; workspace-failure rollback now deletes the seed row outright. Logger dependency wired in.
backend/internal/lifecycle/manager.go first_signal_at stamped on first arrival, cleared on MarkSpawned; same-state repeat still writes when it is the first signal.
backend/internal/cli/doctor.go Three new doctor checks (ao-binary, hooks-log, codex-launch-flags); sentinel string for unknown config field is case-sensitive (flagged in previous review).
backend/internal/cli/hooks.go appendHooksLog added with size-cap/truncate logic; TOCTOU race noted in previous review (acknowledged as best-effort).
frontend/src/renderer/types/workspace.ts no_signal added to SessionStatus union and workerDisplayStatus mapping; tests extended.
backend/internal/storage/sqlite/store/session_store.go nullTimeToTime/timeToNullTime helpers bridge NullTime to zero-value convention; all read/update/insert paths updated.

Reviews (3): Last reviewed commit: "fix(status): only derive no_signal for h..." | Re-trigger Greptile

Comment on lines +115 to +118
flags := os.O_APPEND | os.O_CREATE | os.O_WRONLY
if info, err := os.Stat(path); err == nil && info.Size() > maxHooksLogBytes {
flags = os.O_TRUNC | os.O_CREATE | os.O_WRONLY
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 TOCTOU race between the size check and the open: two concurrent hook processes can both observe size > maxHooksLogBytes, both open with O_TRUNC, and the second truncation discards the line written by the first. Since this is a best-effort diagnostic log the impact is limited, but the race can be closed by only ever appending and rotating on a separate, deliberate path — or by accepting that truncation may drop one entry and documenting it explicitly.

Suggested change
flags := os.O_APPEND | os.O_CREATE | os.O_WRONLY
if info, err := os.Stat(path); err == nil && info.Size() > maxHooksLogBytes {
flags = os.O_TRUNC | os.O_CREATE | os.O_WRONLY
}
flags := os.O_APPEND | os.O_CREATE | os.O_WRONLY
// NOTE: size check and open are not atomic; two concurrent writers may
// both truncate. Acceptable for a best-effort diagnostic log.
if info, err := os.Stat(path); err == nil && info.Size() > maxHooksLogBytes {
flags = os.O_TRUNC | os.O_CREATE | os.O_WRONLY
}

Message: fmt.Sprintf("codex rejected AO's launch flags (`codex %s`: %v) — codex sessions may spawn without activity hooks; a codex CLI update likely changed its flag/config surface", strings.Join(probe, " "), err),
}
}
if strings.Contains(string(out), "unknown configuration field") {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 The sentinel "unknown configuration field" is matched case-sensitively against combined stdout+stderr. If a future codex CLI capitalizes the first word or rephrases the message, the canary silently passes. Wrapping with strings.ToLower makes the detection robust to message casing changes.

Suggested change
if strings.Contains(string(out), "unknown configuration field") {
if strings.Contains(strings.ToLower(string(out)), "unknown configuration field") {

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

…ding port

The branch ported store_test.go wholesale from the closed redesign
branch, whose copy predates #165 — silently dropping the
session-worktrees round-trip test #165 added. Restore main's file and
re-apply only this branch's addition (TestSessionFirstSignalRoundTrip).
No other ported file lost main-side content (audited per-file against
main; the remaining deletions are this branch's intended refactors).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@ashish921998

Copy link
Copy Markdown
Collaborator Author

Good catch on TestSessionWorktreesRoundTrip — its removal was not intentional. The branch ported store_test.go wholesale from the closed redesign branch, whose copy predates #165, silently dropping the test #165 added there. Restored in bd7009b: the file is now main's version plus only this branch's TestSessionFirstSignalRoundTrip addition; both round-trip tests pass.

Also audited every other ported file against main for the same failure mode — the remaining deletions are this branch's intended refactors (PATH-pin constructor changes, codex adapter rewrite, doc updates); no other main-side content was lost. #168's TerminalHandleID hunks in domain/session.go / service/session/service.go are present.

🤖 Generated with Claude Code

…eline

Review finding: the no_signal downgrade had no harness-capability gate, but
first_signal_at can only ever be stamped by an `ao hooks` callback. Ten
spawnable harnesses (amp, aider, crush, grok, kimi, devin, auggie, continue,
vibe, pi) install no hooks at all, so every live session of theirs would have
flipped from idle to a permanent no_signal -> needs_you after the 90s grace.

The session service now takes a SignalCapable predicate; daemon wiring injects
activitydispatch.SupportsHarness (the deriver registry is the source of truth
for "this harness can signal"). Left nil, the service never claims no_signal.
A new dispatch test pins that every deriver token is a known harness name.

Also from the same review:
- lifecycle/manager.go and the 0010 migration claimed Codex's SessionStart
  reports idle as the first signal; both codex and claude-code derivers
  deliberately return no signal for session-start, so the comments now cite a
  real case (a lost "active" POST followed by a Stop hook landing idle).
- docs/agent/README.md documents the gate and the restore caveat: a restored
  session the user never prompts has nothing to signal, so it shows no_signal
  after the grace until a receipt-only session-start signal exists.
- 0010 migration uses DROP TRIGGER IF EXISTS per house style.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@ashish921998 ashish921998 merged commit 8d0c53e into main Jun 11, 2026
9 checks passed
ashish921998 added a commit that referenced this pull request Jun 11, 2026
Conflict: backend/internal/adapters/agent/codex/hooks.go. Resolved by taking
main's no-write codex adapter wholesale: #170 delivers codex hooks as -c
session flags on the launch command, so GetAgentHooks no longer writes
.codex/hooks.json into the worktree and this branch's EnsureWorkspaceGitignore
call for codex has nothing left to ignore. The self-ignoring-gitignore fix
still applies to adapters that do write worktree hook files (e.g. opencode).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant