Skip to content

fix: hide resume button for non-resumable Codex sessions (provider-aware)#56

Merged
lis186 merged 7 commits into
mainfrom
fix/codex-resume-provider-aware
Jun 7, 2026
Merged

fix: hide resume button for non-resumable Codex sessions (provider-aware)#56
lis186 merged 7 commits into
mainfrom
fix/codex-resume-provider-aware

Conversation

@lis186

@lis186 lis186 commented Jun 7, 2026

Copy link
Copy Markdown
Owner

繁中摘要

Dashboard 的 resume copy button 以前對所有 codex session 都產出 codex resume <sid>,但 Codex 只在成功完成至少一個 API turn(有 usage)後才寫 rollout 檔 — 只有 502 startup error 的 session 複製出來的指令會失敗:No saved session found with ID

這個 PR 把 resume 政策變成 provider 宣告式設定(UPSTREAM_PROFILES.resume,沿用 #43 B4 cache profile pattern),server 端算好每個 session 的 resumeCommand,client 變純 view:拿到 null 就不渲染按鈕(fail closed)。實作前先用三個隔離 worktree 平行實驗三種架構(declarative profile / server-computed / client-only),由 codex CLI 評審後採 hybrid 方案;最終程式碼經 codex gate 三輪複審至 clean。

Problem

Before:  every codex session card → copy button → codex resume <sid>
         502-only session (no rollout file) → ERROR: No saved session found
After:   codex session with a completed turn → codex resume <sid>  ✓
         codex session with only startup errors → no button         ✓
         claude sessions → claude --resume <sid> (unchanged)        ✓

Root cause: Codex only writes ~/.codex/sessions/.../rollout-*.jsonl after a successful API turn. ccxray treated every wire-observed session as resumable.

Architecture

server/providers.js           server/store.js               client (pure view)
┌────────────────────┐        ┌──────────────────────┐      ┌────────────────────┐
│ UPSTREAM_PROFILES  │        │ markSessionUsage()   │      │ sess.resumeCommand │
│ anthropic.resume:  │──read─→│  non-subagent+usage  │      │  (sticky)          │
│  {…, 'always'}     │        │  (monotonic)         │ SSE/ │                    │
│ openai.resume:     │        │ computeSessionResume │ API  │ null → no button   │
│  {…, 'has-usage'}  │        │  interprets condition│─────→│ value → copy exact │
└────────────────────┘        └──────────────────────┘      │  string, no rebuild│
        │                            ▲ ▲                    └────────────────────┘
        ▼ /api/settings              │ └ sse-broadcast.summarizeEntry (single funnel)
   providerProfiles                  └── restore loop pre-marks usage before
                                         serialization (order-independent)
  • Declarative policy: resume: { template, condition } per upstream profile — adding a future provider is a one-line data change, no client edits
  • Server owns the verdict: computeSessionResume interprets the profile; client renders sess.resumeCommand or nothing
  • Fail closed: no usage → null; unknown or empty-string provider → null; sentinel sessions (direct-api/codex-raw/unknown) → null. Only provider == null (legacy pre-tagging entries) falls back to anthropic
  • Restore correctness: the restore loop marks usage for all entries before any serialization, so an index where the usage turn appears after an earlier usage-less turn still reports the final monotonic value (mutation-tested: removing the pre-mark fails 2 tests)

Process

Three architecture candidates were implemented in parallel isolated worktrees, then judged by codex CLI against the committed diffs (correctness after restore, YAGNI, B4-pattern consistency, extension cost, test quality). Verdict: hybrid — B's server-computed backbone + A's declarative profile home. Post-implementation codex review gate ran 3 rounds (2 minor findings + 1 edge fixed) to clean.

Test plan

  • 799 tests pass (16 new: profile shape, settings exposure, store eligibility edge cases, SSE ordering, restore order-independence, render-level button gating)
  • Mutation test: commenting out restore pre-marking fails 2 tests
  • Real-browser smoke (isolated CCXRAY_HOME, port 5613, browser-harness/CDP): codex-with-usage shows button with correct command, 502-only codex session shows no button, claude session unchanged
  • Codex review gate: clean

🤖 Generated with Claude Code

Justin Lee and others added 7 commits June 8, 2026 00:29
…ettings

Codex only writes a rollout file after a successful API turn, so resume
eligibility is provider policy: anthropic resumes always, openai requires
usage. Declarative {template, condition} follows the B4 cache-profile shape.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…iles)

markSessionUsage flags a session once any non-subagent turn reports usage
(monotonic). computeSessionResume interprets UPSTREAM_PROFILES[provider]
.resume {template, condition} instead of hardcoding per-agent branches;
missing provider falls back to anthropic, sentinel sessions never resume.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
summarizeEntry (the single funnel for SSE broadcast and /api/entries) now
emits resumable + resumeCommand per entry. The restore loop marks usage
for all entries before any serialization, so eligibility is rebuilt from
the index alone and is order-independent after a restart.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Both hardcoded agent ternaries are gone. The client sticky-accumulates
e.resumeCommand onto the session and renders the copy button only when
non-null — a no-usage codex session (e.g. 502-only) gets no button instead
of a known-bad 'codex resume <sid>'. copySessionContinue now copies the
exact string it was handed rather than re-deriving the command. Fails
closed: no client-side fallback that could reintroduce the bug.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
vm-loads session-label.js + miller-columns.js with DOM stubs and asserts
renderSessionItem omits the copy button when sess.resumeCommand is null
and renders the exact server-computed command when present.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Codex review findings: (1) the anthropic fallback in computeSessionResume
was meant for legacy entries with no provider, but also mapped unknown
future providers to 'claude --resume' — now only a missing provider falls
back, unknown providers get no resume command. (2) the restore-loop
pre-marking had no dedicated coverage (summarizeEntry's own marking masked
its removal); added an index-order test that fails without it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Distinguish missing (== null, legacy anthropic fallback) from
present-but-empty '' which now gets no resume command like any other
unknown provider.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@lis186 lis186 merged commit bc00b36 into main Jun 7, 2026
2 checks passed
@lis186 lis186 deleted the fix/codex-resume-provider-aware branch June 7, 2026 17:23
lis186 added a commit that referenced this pull request Jun 7, 2026
…#57)

* fix: resume eligibility requires output_tokens > 0, not just usage presence

PR #56's has-usage heuristic shows a resume button for codex sessions
whose only turns billed input but produced zero output (hung WS turns,
cross-session retries). Those sessions have no rollout file on disk, so
`codex resume` fails — 4/19 real sessions were false positives against
~/.codex/sessions ground truth (issue #44, specimen 2/3).

status and stopReason are unreliable discriminators (101 and 'completed'
appear on failed turns); output_tokens > 0 separates all 19 sessions
correctly. Tighten markSessionUsage to that condition.

Refs #44

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test: legacy usage without output_tokens field fails closed

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Justin Lee <justinlee@91app.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant