Skip to content

feat(parser): migrate codex provider#775

Draft
mariusvniekerk wants to merge 1 commit into
provider-opencode-familyfrom
provider-codex
Draft

feat(parser): migrate codex provider#775
mariusvniekerk wants to merge 1 commit into
provider-opencode-familyfrom
provider-codex

Conversation

@mariusvniekerk

Copy link
Copy Markdown
Collaborator

Codex now has a concrete parser provider for dated and archived JSONL transcripts plus the session_index.jsonl title sidecar. The provider owns discovery, watch planning, changed-path classification, lookup, index-aware fingerprinting, full parse output, and append parsing with full-parse fallback signals.\n\nIndex-file events conservatively map to matching local session sources because the provider layer does not receive stored DB session names; skip and replacement decisions remain with the sync layer.

@roborev-ci

roborev-ci Bot commented Jun 19, 2026

Copy link
Copy Markdown

roborev: Combined Review (3388a3f)

Summary verdict: The PR has two medium-severity correctness issues; no security vulnerabilities were reported.

Medium

  • Location: internal/parser/codex_provider.go:125

    • Problem: ParseIncremental returns IncrementalNoNewData when the transcript size has not grown, but Codex fingerprints include session_index.jsonl mtime. A title-only index change can therefore be marked fresh without a full parse, leaving SessionName stale.
    • Fix: Detect metadata-only freshness changes, such as index mtime newer than the transcript mtime, and return IncrementalNeedsFullParse so Parse refreshes session metadata.
  • Location: internal/parser/codex_provider.go:380

    • Problem: Codex source keys are file paths, so provider-level dedupe by provider/key will not collapse the same session UUID appearing in both live and archived roots. Both sources parse to the same codex:<uuid> session and can race or overwrite with stale archived data.
    • Fix: Dedupe Codex sources by canonical session UUID and preserve the existing preference for the dated live path over archived flat paths.

Panel: ci_default_security | Synthesis: codex, 8s | Members: codex_default (codex/default, done, 4m27s), codex_security (codex/security, done, 1m17s) | Total: 5m52s

@roborev-ci

roborev-ci Bot commented Jun 20, 2026

Copy link
Copy Markdown

roborev: Review Unavailable (e4d0f82)

The review agent repeatedly failed to run (likely an agent or configuration error). roborev will try again on the next commit.

Last error: agent: claude-code failed stream: stream errors: You've hit your session limit · resets 5:50am (UTC): exit status 1

@roborev-ci

roborev-ci Bot commented Jun 21, 2026

Copy link
Copy Markdown

roborev: Combined Review (7bdb632)

Medium finding: Codex provider changed-path fallback is broader than legacy discovery.

  • Severity: Medium
  • Location: internal/parser/codex_provider.go:385
  • Problem: directPathSource accepts any .jsonl under a Codex root. Because Codex is now in shadow-compare mode, provider changed-path classification is added to real sync work, so a non-Codex JSONL file written under .codex/sessions can be parsed and persisted as an empty or bogus Codex session. Legacy classification only accepted Codex rollout-shaped paths.
  • Fix: Restrict the fallback to legacy Codex rollout-shaped paths, for example require isCodexSessionFilename(filepath.Base(path)) plus the same flat or year/month/day layout shape, while allowing UUID extraction to fail.

Panel: ci_default_security | Synthesis: codex, 6s | Members: codex_default (codex/default, done, 4m3s), codex_security (codex/security, done, 3m6s) | Total: 7m15s

@mariusvniekerk

mariusvniekerk commented Jun 21, 2026

Copy link
Copy Markdown
Collaborator Author

This change is part of the following stack:

Change managed by git-spice.

@roborev-ci

roborev-ci Bot commented Jun 21, 2026

Copy link
Copy Markdown

roborev: Combined Review (c9ca596)

High: Codex provider shadow changed-path results can alter real sync behavior before the provider is authoritative.

High

  • internal/parser/codex_provider.go:220 and internal/parser/codex_provider.go:345
    Codex changed-path classification canonicalizes duplicate sources to the dated/live path before returning them. Because shadow provider classifications are merged into real classifyPaths, an archived/stored transcript can be replaced by a stale live duplicate during session_index.jsonl or archived-file events, bypassing the legacy stored-path safeguard.
    Fix: Keep changed-path results tied to the changed or DB-stored source path, or suppress Codex provider changed-path output while still in shadow mode until it can preserve the same stored-path preference.

Medium

  • internal/parser/codex_provider.go:301
    The provider fingerprint/parse path never carries file inode/device into ParsedSession.File, while legacy processCodex does. On Unix this makes Codex shadow comparisons report session mismatches for otherwise identical parses.
    Fix: Populate inode/device in SourceFingerprint and copy them into the parsed session, or normalize those fields in the shadow comparison.

Panel: ci_default_security | Synthesis: codex, 7s | Members: codex_default (codex/default, done, 7m21s), codex_security (codex/security, done, 1m28s) | Total: 8m56s

@roborev-ci

roborev-ci Bot commented Jun 24, 2026

Copy link
Copy Markdown

roborev: Combined Review (4066dbd)

Summary verdict: Changes need fixes for Codex duplicate source handling; no additional security issues were reported.

High

  • internal/parser/codex_provider.go:374: Codex changed-path classification canonicalizes a concrete transcript event to the preferred live duplicate. If an archived duplicate is the file that changed while a stale live copy also exists, SyncPaths will parse the live file and overwrite the archived session with stale content.
    • Fix: For direct transcript change events, return a SourceRef pinned to req.Path instead of calling canonicalSource; keep canonicalization for discovery/raw-ID lookup where that preference is intended, and add a watcher test for changed archived duplicate plus stale live copy.

Medium

  • internal/parser/codex_provider.go:411: FindSource also canonicalizes an exact StoredFilePath/FingerprintKey. SyncSingleSession resolves the DB-stored path first, but this provider lookup can silently switch an archived stored path to a live duplicate, so explicit resync can refresh the wrong source.
    • Fix: Honor exact stored path hints, at least when RequireFreshSource is false, or pass a pinned Codex ProviderSource from SyncSingleSession.

Panel: ci_default_security | Synthesis: codex, 8s | Members: codex_default (codex/default, done, 5m30s), codex_security (codex/security, done, 2m11s) | Total: 7m49s

Codex sessions have an append-only JSONL transcript plus a session_index.jsonl title sidecar. Moving Codex behind a concrete provider keeps that composite source identity and incremental append capability explicit at the provider boundary.

The provider preserves dated and archived discovery, live-over-archived lookup, shallow index watch planning, index-event classification, index-aware mtimes, source hashing, full parse output, and append parsing with full-parse fallback signals.

fix(parser): preserve codex provider sidecar semantics

Codex index changes are part of source freshness, so the provider cannot treat unchanged transcript size as no new data when the index mtime drove the fingerprint. The provider also needs to keep legacy live-over-archived UUID behavior and classify removed transcript paths syntactically.

Index events now conservatively refresh sibling Codex sources because this provider layer has no DB state for title diffing; the sync engine can still apply its DB-aware filtering before provider dispatch is fully authoritative.

Validation: go test -tags "fts5" ./internal/parser -run TestCodexProvider -count=1; go vet ./...; git diff --check. go test -tags "fts5" ./internal/parser -count=1 currently fails on TestProviderMigrationModes because inherited lower provider branches such as claude still need their branch-local shadow opt-ins.

fix(parser): make codex provider sidecars authoritative

The Codex provider could not safely infer sidecar-only freshness from a single max mtime. Rather than advertise append-only parsing with incomplete sidecar state, keep provider-authoritative Codex parses on the full-parse path until the facade can model sidecar dirtiness explicitly.

Also route persisted path lookup and changed-path classification through the same UUID canonicalization as discovery so archived duplicates do not win over live dated transcripts.

Validation: go test -tags "fts5" ./internal/parser -run 'Test(CodexProvider|ProviderMigrationModes)' -count=1; go test -tags "fts5" ./internal/parser -count=1; go vet ./...; git diff --check

test(sync): compare codex shadow parity

Codex is shadow-compared on this branch, so add source-level migration coverage that compares provider observation with ParseCodexSession.

The fixture uses the real sessions/YYYY/MM/DD layout plus sibling session_index.jsonl, proving the provider preserves title sidecar behavior, parser output, and data-version planning.

Validation: go test -tags "fts5" ./internal/parser ./internal/sync -run 'TestObserveProviderSourceMatchesCodexLegacyParser|TestCodexProvider|TestParseCodex|TestProviderMigrationModes' -count=1; go test -tags "fts5" ./internal/parser ./internal/sync -count=1; go fmt ./...; go vet ./...; ./custom-gcl run --config .golangci.nilaway.yml ./internal/parser/... ./internal/sync/...; git diff --check

fix(parser): accept codex legacy-shaped sources

Provider-authoritative Codex sync still has to rediscover sessions that were stored by the legacy parser even when their rollout filename does not expose a UUID-shaped session id. Without that compatibility path, the later dispatch migration can drop or fail to reprocess valid Codex transcripts that ParseCodexSession can read from session metadata.

Keep the UUID-aware source contract as the preferred path and fall back to root-scoped JSONL sources only when Codex path metadata does not apply, so normal duplicate canonicalization remains unchanged while legacy-shaped fixtures stay reachable.

Validation: go test ./internal/parser -count=1; go fmt ./...; go vet ./...; go test -tags "fts5" ./internal/parser ./internal/sync -run 'TestCodexProvider|TestSyncEngineCodex|TestSyncSingleSessionHashCodex|TestSyncEngineSkipCache' -count=1; git diff --check

refactor(parser): fold codex into provider

Make the Codex provider own its source discovery, lookup, and parse
behavior instead of shimming the package-level free functions. Delete
DiscoverCodexSessions, FindCodexSourceFile, ParseCodexSession, and
ParseCodexSessionFrom: discovery and find-source bodies move onto the
codex source set (discoverSessionPaths, findSourceFile), and parse moves
onto the provider (parseSession, parseSessionFrom). Drop the Codex
AgentDef DiscoverFunc/FindSourceFunc hooks and make Codex
provider-authoritative; ShallowWatchRootsFunc and the exec-source helpers
(IsCodexExecSessionFile, ResolveCodexShallowWatchRoots, the one-time
codex_exec skip migration) stay since only the four parser entrypoints
must go.

A provider has no database handle, so the engine reproduces the DB-aware
and mtime-aware bookkeeping the legacy single-session JSONL path
performed, scoped to Codex to preserve behavior exactly:

  - shouldSkipProviderSourceByDB folds the session_index.jsonl sidecar
    into a DB-stored fingerprint skip, so an unchanged transcript is not
    reparsed when only the shared index mtime advanced and this session's
    title did not change, and a resync still skips after the in-memory
    skip cache is cleared.
  - The provider Parse force-replaces stored rows because Codex emits a
    full parse (it does not advertise incremental append); a late
    token_count line appended to an existing turn rewrites the stored
    message instead of being dropped by an append-only write.
  - Index events keep flowing through the engine's DB-aware
    classifyCodexIndexPath rather than the provider's broad index
    fan-out: the engine fans out only to sessions whose stored title
    changed and pins the chosen on-disk copy (SourceRefForPath) so the
    provider's live-over-archived canonicalization cannot resurrect a
    stale duplicate over the stored copy.
  - SyncAllSince re-expands a UUID's live and archived duplicates
    (AllSourcePathsForUUID) before the mtime cutoff filter, restoring the
    legacy discover-then-filter order so a changed archived copy newer
    than the cutoff is not lost behind an older live copy.

Route parse-diff, the token-use disk probe, and the SSH remote resolve
script through provider Discover/FindSource for provider-authoritative
agents that no longer carry a DiscoverFunc, so Codex sources stay
discoverable, resolvable on disk, and transferable (including the
session_index.jsonl sidecar).

Replace the deleted shadow-baseline test with provider-API coverage
(provider Discover/Parse through ObserveProviderSource) plus a guard that
the four legacy entrypoints stay gone, route the package and engine tests
through the provider methods, and remove codex_provider.go from the
pending shim scan list. This also fixes the previously known-failing
TestSyncPathsCodexIndexEventRefreshesStoredDuplicate, since the index
event now honors the stored archived copy.

test(sync): host shared shadow source helper at codex fold

The per-provider shadow/parse tests share writeProviderShadowSourceFile
to write source fixtures. The Codex fold is the lowest branch that calls
it, so the canonical definition lives here; later provider folds inherit
it instead of redeclaring their own copies.

test(sync): remove unused codex stat assignments

The pre-commit lint hook rejects two Codex appended-fixture tests because they assign os.Stat results back to info without using the value. The tests already assert the append and close operations that matter for setup.

Removing the unused assignments keeps staticcheck clean for the Codex provider migration branch.

fix(parser): pin codex duplicate sources

Codex discovery and raw-ID lookup should still prefer the live dated transcript, but exact filesystem events and DB-stored source hints are different: the caller has already selected a concrete source path. Canonicalizing those paths back to a stale live duplicate can overwrite an updated archived transcript.

Changed-path classification now returns the source pinned to the event path, and non-fresh stored path/fingerprint lookup returns the exact source so SyncSingleSession preserves the archived path already recorded in the database.

Validation: go test -tags "fts5" ./internal/parser -run 'TestCodexProvider(FindSourcePinsExactArchivedDuplicate|ChangedPathPinsArchivedDuplicate|SourceMethods|DiscoverDedupesLiveAndArchivedByUUID)' -count=1; go test -tags "fts5" ./internal/sync -run 'TestSync(PathsCodexArchivedDuplicateEventPinsChangedFile|SingleSessionCodexPreservesStoredArchivedDuplicate|PathsCodexIndexEventRefreshesStoredDuplicate|AllSinceCodexKeepsChangedArchivedDuplicate)' -count=1; go test -tags "fts5" ./internal/parser -run 'TestCodexProvider|TestParseCodex|TestDiscoverCodex' -count=1; go test -tags "fts5" ./internal/sync -run 'Test.*Codex.*' -count=1; go vet ./...; git diff --check

fix(sync): keep codex freshness skips out of cache

Codex provider DB-fresh skips are successful freshness decisions, not parse failures or intentional no-session skips. Recording them in the persistent skip cache can hide a later parser data-version bump because the cache check runs before the DB freshness check.\n\nKeep DB-fresh provider skips non-cacheable and make existing skip-cache entries fall through when a stored row at that path has a stale data version. The same bypass helper still preserves the existing stale-project self-healing behavior.\n\nValidation: go test -tags "fts5" ./internal/sync -run 'TestProcessFile(SkipCacheReparsesStaleCodex(Project|DataVersion)|CodexDBFreshSkipIsNotCached)|Test.*Codex.*' -count=1; go test -tags "fts5" ./internal/parser -run 'TestCodexProvider|TestParseCodex|TestDiscoverCodex' -count=1; go test -tags "fts5" ./internal/parser ./internal/sync -count=1; go vet ./...; git diff --check

fix(sync): surface codex provider discovery failures

Provider-backed parse-diff should not report a clean or incomplete diff when provider discovery failed. Returning that error keeps requested provider-authoritative agents honest and matches the expectation that parse-diff is a verification surface, not a best-effort sync.\n\nAlso pin coverage for stale Codex index entries whose transcripts no longer resolve, so the existing empty-candidate guard cannot regress into an invalid empty work item.\n\nValidation: go test -tags "fts5" ./internal/sync -run 'Test(ParseDiffProviderDiscoveryErrorFails|ClassifyCodexIndexPathSkipsMissingTranscript|ProcessFile(SkipCacheReparsesStaleCodex(Project|DataVersion)|CodexDBFreshSkipIsNotCached))' -count=1; go test -tags "fts5" ./internal/parser ./internal/sync -count=1; go vet ./...; git diff --check

fix(sync): drop duplicate shadowCallerProvider Discover in codex test
@roborev-ci

roborev-ci Bot commented Jun 25, 2026

Copy link
Copy Markdown

roborev: Combined Review (082eec1)

High-level verdict: the PR has one test-breaking issue and one medium-risk Codex sync regression to address.

High

  • internal/parser/provider_shim_scan_test.go:52
    pendingShimProviderFiles includes provider files such as claude_provider.go, cowork_provider.go, hermes_provider.go, opencode_provider.go, and vibe_provider.go that no longer reference legacy free-function entrypoints. TestProviderFilesDoNotReferenceLegacyEntrypoints expects pending files to still have offenders, so these entries will fail parser package tests.
    Fix: Remove already-folded provider files from pendingShimProviderFiles; only keep files that still reference legacy free-function entrypoints.

Medium

  • internal/sync/engine.go:2367
    Quick sync now dedupes Codex duplicates by newest raw file mtime after filtering with the provider fingerprint mtime, but Codex fingerprints include session_index.jsonl. A title/index touch can include both live and archived duplicates even when neither transcript changed, then choose a newer archived duplicate and overwrite the DB path/messages instead of using the DB-aware Codex index refresh path.
    Fix: Keep Codex cutoff filtering based on transcript raw mtime and let codexIndexNeedsRefreshSince handle index-only refreshes, or preserve the DB-tracked duplicate when candidates were included only because of the shared index mtime.

Panel: ci_default_security | Synthesis: codex, 10s | Members: codex_default (codex/default, done, 7m24s), codex_security (codex/security, done, 3m51s) | Total: 11m25s

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

1 participant