Skip to content

feat(parser): migrate claude provider#772

Draft
mariusvniekerk wants to merge 1 commit into
provider-hermesfrom
provider-claude
Draft

feat(parser): migrate claude provider#772
mariusvniekerk wants to merge 1 commit into
provider-hermesfrom
provider-claude

Conversation

@mariusvniekerk

Copy link
Copy Markdown
Collaborator

Claude now uses a concrete provider for regular project transcripts and nested subagent transcripts. The provider keeps recursive project discovery, symlinked project directories, standard and subagent lookup, changed-path classification, content hashing, project normalization, excluded-session reporting, and relationship inference.

The provider also exposes Claude's existing incremental append parser as an optional provider capability so linear JSONL growth can continue to avoid full reparses while full-parse fallback remains available for DAG or row-rewrite cases.

@roborev-ci

roborev-ci Bot commented Jun 19, 2026

Copy link
Copy Markdown

roborev: Combined Review (2a56766)

The PR has two medium correctness issues in the Claude provider; no high or critical security findings were reported.

Medium

  • internal/parser/claude_provider.go:125
    ParseIncremental reports IncrementalNoNewData when the current source size is smaller than the stored offset. A truncated or atomically replaced file is not an append-only no-op, so this can leave stale messages and file metadata instead of forcing a full parse.
    Fix: Return IncrementalNeedsFullParse when Fingerprint.Size < Offset, and reserve IncrementalNoNewData for the equal-size unchanged case.

  • internal/parser/claude_provider.go:330
    Changed-path classification rejects deleted or renamed Claude source paths because sourceRef requires IsRegularFile(path). Other JSONL providers classify missing remove/rename paths by shape so the watcher can still route the event; this provider silently drops those events despite advertising ClassifyChangedPath.
    Fix: Add missing-path classification for remove/rename events that validates the path under the root and Claude source shape without requiring the file to still exist, plus a regression test.


Panel: ci_default_security | Synthesis: codex, 8s | Members: codex_default (codex/default, done, 4m55s), codex_security (codex/security, done, 2m6s) | Total: 7m9s

@roborev-ci

roborev-ci Bot commented Jun 20, 2026

Copy link
Copy Markdown

roborev: Combined Review (fc3595c)

I've reviewed the Claude provider migration commit. This commit wires the existing Claude session parser into the Provider interface (factory, source discovery, fingerprinting, parse/incremental-parse), replacing the legacy adapter. The actual parsing, discovery, and lookup logic (ParseClaudeSessionWithExclusions, ParseClaudeSessionFrom, DiscoverClaudeProjects, FindClaudeSourceFile) lives in unchanged existing code.

I focused on the security-relevant surface — the path-handling and source-resolution paths, since FindSource/SourcesForChangedPath could in principle receive externally-influenced inputs (RawSessionID, stored paths, watcher paths):

  • RawSessionIDFindClaudeSourceFile: The only lookup that builds a filesystem path from a session-ID-shaped input is gated by IsValidSessionID, which restricts to [a-zA-Z0-9_-] (no ., /, or separators), so path traversal via the session ID is not reachable.
  • StoredFilePath/FingerprintKey/watcher PathsourceForPathsourceRef: Every candidate is run through claudeProjectHintFromPath, which uses filepath.Rel and explicitly rejects .. escapes, plus enforces a strict <project>/<session>.jsonl or <project>/.../subagents/.../agent-*.jsonl shape and an IsRegularFile check before producing a SourceRef. Paths that escape the configured root are dropped.
  • Symlink following (covered by TestClaudeProviderDiscoversSymlinkedProjectDirectory) and any TOCTOU between stat/hash/parse operate only on user-owned files under the configured roots — excluded by the project threat model (local same-user access, user-owned ~/.agentsview data).
  • The ref.Opaque.(claudeSource) unchecked type assertion at claude_provider.go:315 is reachable only with values built by sourceRef, which always sets Opaque: claudeSource{...}, so it cannot panic in practice and has no security impact.

No SQL, command execution, HTML rendering, network calls, secret handling, or auth-boundary logic is introduced or modified. Input validation on the externally-influenceable path is present and adequate, and the behavior mirrors the legacy adapter it replaces.

No issues found.


Panel: ci_default_security | Synthesis: claude-code | Members: codex_default (claude-code/default, failed, 1s), codex_security (claude-code/security, done, 1m12s) | Total: 1m13s

@roborev-ci

roborev-ci Bot commented Jun 21, 2026

Copy link
Copy Markdown

roborev: Combined Review (59b1626)

Medium confidence: one Medium issue should be fixed before merge; no Critical or High findings were reported.

Medium

  • internal/parser/claude_provider.go:298 - The new Claude provider fingerprint/parse path does not preserve file inode/device, while the legacy Claude sync path sets Session.File.Inode and Session.File.Device. With Claude in shadow-compare mode, Unix shadow comparisons can report mismatches for otherwise identical parses, and the provider result is not metadata-parity with legacy. Populate inode/device from the stat result and copy them onto each parsed session, matching processClaude.

Panel: ci_default_security | Synthesis: codex, 6s | Members: codex_default (codex/default, done, 8m37s), codex_security (codex/security, done, 2m12s) | Total: 10m55s

@mariusvniekerk

mariusvniekerk commented Jun 21, 2026

Copy link
Copy Markdown
Collaborator Author

This change is part of the following stack:

Change managed by git-spice.

@roborev-ci

roborev-ci Bot commented Jun 21, 2026

Copy link
Copy Markdown

roborev: Combined Review (ea93a29)

Medium-risk issue found: Claude provider shadow comparison can falsely report mismatches because file identity fields are not populated.

Medium

  • internal/parser/claude_provider.go:93
    The Claude provider copies the hash onto parsed sessions but never populates file inode/device, while the legacy Claude sync path does before shadow comparison. On Unix this makes otherwise matching Claude shadow parses report session mismatches, undermining the new shadow-compare mode.
    Fix: Populate inode/device in the provider fingerprint and copy them onto each parsed session, or enrich provider results in sync before comparison/writes.

Panel: ci_default_security | Synthesis: codex, 6s | Members: codex_default (codex/default, done, 5m10s), codex_security (codex/security, done, 2m17s) | Total: 7m33s

@roborev-ci

roborev-ci Bot commented Jun 24, 2026

Copy link
Copy Markdown

roborev: Combined Review (35a9f85)

Summary verdict: Two medium issues remain; no high or critical findings were reported.

Medium

  • cmd/agentsview/token_use.go:95: resolveRawSessionID still only probes FindSourceFunc, but Claude now has no FindSourceFunc. Unsynced on-disk Claude sessions are treated as unknown, so token-use / session usage skips on-demand sync and returns not found.

    • Fix: Add a provider-backed disk probe for migrated providers using ProviderFactoryByType(...).NewProvider(...).FindSource(...) in both canonical and raw lookup paths.
  • internal/sync/engine.go:4067: Claude provider sync computes a full file hash in Fingerprint before the DB freshness skip, so every unchanged Claude file is reread on full sync. Legacy Claude skipped by size/mtime before hashing.

    • Fix: Perform the stat/data-version freshness check before hashing, or split Claude fingerprinting so the hash is computed only when the source will actually be parsed/written.

Panel: ci_default_security | Synthesis: codex, 7s | Members: codex_default (codex/default, done, 5m17s), codex_security (codex/security, done, 2m11s) | Total: 7m35s

Claude has both regular project transcripts and nested subagent transcripts, plus an existing append-only incremental parser. Moving it behind a concrete provider keeps those source shapes and optional incremental capability explicit at the provider boundary.\n\nThe provider preserves recursive project discovery, symlinked project directories, standard and subagent raw-ID lookup, changed-path classification, content hashing, project-name normalization, excluded-session reporting, relationship inference, and incremental append parsing for linear JSONL growth.

fix(parser): preserve claude provider edge events

Claude provider sync must distinguish true append idleness from files that were truncated or replaced, and watcher classification must still identify deleted primary and subagent transcripts after the file is gone. Otherwise provider-path sync can retain stale messages or miss removals.

Return full-parse status for truncated incremental inputs, add missing-path classification for valid Claude source shapes, and make raw subagent lookup follow symlinked project directories like discovery does. This branch now opts Claude into shadow comparison.

Validation: go test -tags "fts5" ./internal/parser -run 'Test(ClaudeProvider|FindClaudeSourceFile|ProviderMigrationModes)' -count=1; go test -tags "fts5" ./internal/parser -count=1; go vet ./...; git diff --check

fix(sync): replace claude content after file rewrites

Claude incremental parsing is append-oriented, so any fallback caused by truncation or file replacement must replace persisted messages instead of flowing through the append-preserving write path. Otherwise stale higher ordinals or stale tool rows can survive a full parse fallback.

The provider now marks truncated incremental inputs as force-replace, and the legacy engine path carries forceReplace when file identity changes or the file shrinks before falling back to a full parse.

Validation: go test -tags "fts5" ./internal/parser ./internal/sync -run 'TestClaudeProviderParseIncremental|TestIncrementalSync_Claude(FileReplaced|TruncatedFileReplacesStoredMessages|SameSizeFileReplaceUsesFullParse|MidStreamSplitFallsBackToFullParse|AgentIDFallbackUpdatesStoredToolCall)' -count=1; go test -tags "fts5" ./internal/parser ./internal/sync -count=1; go fmt ./...; go vet ./...; ./custom-gcl run --config .golangci.nilaway.yml ./internal/parser/... ./internal/sync/...; git diff --check

fix(sync): replace claude same-size rewrites

A same-size rewrite can reach the full-parse fallback when the normal skip check did not skip the file, which means the content changed even though the byte count did not. That fallback must replace persisted rows, or stale higher ordinals and tool rows can survive the parse.

The regression rewrites a Claude file in place to the same byte length with fewer logical messages and verifies the stale assistant row is deleted.

Validation: go test -tags "fts5" ./internal/parser ./internal/sync -run 'TestObserveProviderSourceMatchesClaudeLegacyParser|TestClaudeProviderParseIncremental|TestIncrementalSync_Claude(FileReplaced|TruncatedFileReplacesStoredMessages|SameSizeFileReplaceUsesFullParse|SameSizeInPlaceRewriteClearsStaleRows|MidStreamSplitFallsBackToFullParse|AgentIDFallbackUpdatesStoredToolCall)' -count=1; go test -tags "fts5" ./internal/parser ./internal/sync -count=1; go fmt ./...; go vet ./...; ./custom-gcl run --config .golangci.nilaway.yml ./internal/parser/... ./internal/sync/...; git diff --check

test(sync): compare claude shadow parity

Claude is shadow-compared on this branch, so add source-level migration coverage that compares provider observation with ParseClaudeSessionWithExclusions.

The fixture exercises the project-directory source shape and verifies session, message, usage, exclusion, and data-version planning parity while preserving provider-computed file hashes.

Validation: go test -tags "fts5" ./internal/sync -run TestObserveProviderSourceMatchesClaudeLegacyParser -count=1

test(sync): cover claude provider usage exclusions

Roborev job 2721 caught that the Claude shadow parity fixture only compared a plain exchange, so it did not prove provider parity for per-message token usage or /usage-only session exclusions.

Add assistant message usage metadata to the normal fixture and a separate /usage-only source discovered by the provider, then assert non-empty token metadata and excluded IDs against the legacy parser.

Validation: go test -tags "fts5" ./internal/sync -run TestObserveProviderSourceMatchesClaudeLegacyParser -count=1; go fmt ./...; go vet ./...; git diff --check

refactor(parser): fold claude into provider

Move Claude source discovery, lookup, full parse, exclusion handling,
and append-only incremental parse ownership onto the concrete
claudeProvider and delete the package-level DiscoverClaudeProjects,
FindClaudeSourceFile, ParseClaudeSessionFrom, and
ParseClaudeSessionWithExclusions free functions. The discover and
find-source bodies stay as provider-neutral helpers
(ClaudeProjectSessionFiles, claudeFindSourceFile) and the parse bodies
become claudeParseWithExclusions and claudeParseSessionFrom; the public
ParseClaudeSession wrapper and the Cowork parser (which reuses the
Claude transcript format) call the shared helper, so no provider file
references a legacy Discover/Find/Parse entrypoint.

Make Claude provider-authoritative and drop its legacy sync dispatch:
the classifyOnePath Claude block, the processFile case arm, and the
processClaude method. Source classification, project resolution, and
exclusion handling are reproduced through the provider's changed-path
and parse paths. The provider's SourcesForChangedPath also reproduces
the legacy "classify despite a transient stat error" behavior so a
changed path under a momentarily unreadable parent is not dropped.

Wire the provider-authoritative engine path to preserve Claude's
DB-aware single-file semantics, which a stateless provider cannot do
alone:
- tryProviderIncrementalAppend drives the provider's ParseIncremental
  through the shared tryIncrementalJSONL bookkeeping (session lookup,
  data-version and inode/device identity guards, ordinal resume,
  cross-sync split detection, cumulative counters, and forceReplace
  fallback), so append-only syncs keep the stored file hash and append
  rows instead of recomputing and rewriting.
- providerSingleSessionFresh reproduces the shouldSkipFile gate so an
  unchanged, already-synced session is skipped instead of re-parsed
  every full sync and a single-session resync does not reapply a
  worktree project mapping to an unchanged file.
- stampProviderFileIdentity stamps inode/device on parsed results so
  the incremental path can later detect an atomic file replacement.
- processProviderFile honors a caller-supplied file.Project as the
  source ProjectHint when no explicit ProviderSource was given, so a
  SyncSingleSession does not revert a user's project override.

The engine's expandClaudeDuplicateCandidates and
dedupeClaudeDiscoveredFiles stay as provider-neutral engine-level dedup
plumbing; expansion now enumerates via ClaudeProjectSessionFiles. The
duplicate-candidate expansion and session-ID dedup/precedence behavior
is unchanged.

Because dropping the Claude DiscoverFunc would otherwise remove Claude
from surfaces that gate on DiscoverFunc != nil, parse-diff (engine and
CLI flag validation) and the SSH remote resolve script now also include
file-based agents that have left legacy-only mode through the provider
facade, restoring Claude (and the other already-folded agents) to those
surfaces.

Drop the Claude AgentDef DiscoverFunc/FindSourceFunc hooks, set its
provider migration mode to ProviderAuthoritative, remove
claude_provider.go from the pending shim scan list, replace the shadow
baseline test with provider-API coverage plus a guard asserting the
four legacy entrypoints stay gone, and re-vehicle the generic
shadow-mechanism caller tests onto the still-legacy Cowork agent since
Claude no longer has a legacy process arm to observe in shadow.

refactor(parser): fold ParseClaudeSession onto the Claude provider

Delete the ParseClaudeSession free function and route its only production
caller (the session upload handler) plus the test suite through the Claude
provider's new ParseUploadedTranscript method, exposed via the
ClaudeUploadParser interface. Uploads live outside any configured root, so
the method parses the staged transcript directly under the caller-supplied
project. That project stays authoritative rather than being overridden by
the transcript's recorded cwd, matching the prior upload behavior and
unlike the discovered-session Parse path.

Unexport ClassifyClaudeSystemMessage to classifyClaudeSystemMessage; it is
a Claude-internal classifier with no callers outside the package. Both
removals clear the last provider-specific legacy parse/classify entrypoints
this branch owned.

fix(sync): skip fresh claude before fingerprinting

The Claude provider migration preserved DB freshness skipping, but only after provider fingerprinting had already hashed the whole transcript. That lost the legacy cheap size/mtime/data-version gate for unchanged files.\n\nRun the single-session freshness check before provider fingerprinting, and pass the computed fingerprint into incremental parsing so truncation detection can distinguish appended files from zero-byte rewrites. Zero-byte truncation now forces a full replacement parse instead of reporting no new data.\n\nValidation: go test -tags "fts5" ./internal/parser -run 'TestClaudeProviderParseIncremental(Truncated|EmptyTruncation)NeedsFullParse' -count=1; go test -tags "fts5" ./internal/sync -run 'TestIncrementalSync_ClaudeAppend|TestProcessFileProviderAuthoritativeSkipsFreshClaudeBeforeFingerprint' -count=1; go test -tags "fts5" ./internal/parser ./internal/sync -count=1; go vet ./...; git diff --check
@roborev-ci

roborev-ci Bot commented Jun 25, 2026

Copy link
Copy Markdown

roborev: Combined Review (7f39c89)

Summary verdict: One medium issue remains; no critical or high findings were reported.

Medium

  • Location: internal/parser/types.go:99 / cmd/agentsview/token_use.go:91
  • Problem: Claude drops FindSourceFunc, but resolveRawSessionID still only probes on-disk sessions through def.FindSourceFunc. An unsynced Claude session ID that exists on disk now resolves as unknown, so session usage / token-use skips the on-demand SyncSingleSession and reports not found.
  • Fix: Make the resolver use provider FindSource for provider-authoritative file-based agents, or keep a Claude lookup hook until the CLI resolver is migrated.

Panel: ci_default_security | Synthesis: codex, 6s | Members: codex_default (codex/default, done, 4m25s), codex_security (codex/security, done, 3m3s) | Total: 7m34s

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

1 participant