Skip to content

feat(parser): migrate hermes provider#771

Draft
mariusvniekerk wants to merge 1 commit into
provider-vibefrom
provider-hermes
Draft

feat(parser): migrate hermes provider#771
mariusvniekerk wants to merge 1 commit into
provider-vibefrom
provider-hermes

Conversation

@mariusvniekerk

Copy link
Copy Markdown
Collaborator

Hermes now uses a concrete provider for both transcript-file roots and state.db archive roots. Transcript roots remain single-session sources, while state.db roots are modeled as multi-session force-replace sources.

The provider keeps legacy transcript discovery and lookup behavior and gives archive roots a composite fingerprint over state.db plus sibling transcripts, matching the files ParseHermesArchive can read when selecting the richest message stream.

@roborev-ci

roborev-ci Bot commented Jun 19, 2026

Copy link
Copy Markdown

roborev: Combined Review (e64fb73)

Summary verdict: Hermes provider migration has two medium-risk correctness issues; no high or critical findings were reported.

Medium

  • internal/parser/hermes_provider.go:402 - Archive fingerprints can miss JSON transcript changes. The fingerprint uses discoverHermesTranscriptFiles, which deduplicates session_*.json when a matching .jsonl exists, but ParseHermesArchive may read and prefer that JSON file first. Provider-based syncing can therefore skip stale Hermes archive data.

    • Fix: Build the archive fingerprint from every transcript file the parser may read, including both *.jsonl and session_*.json, sorted deterministically.
  • internal/parser/hermes_provider.go:170 - Live sync may miss Hermes state DB updates. WatchPlan watches only the configured root, but with the default Hermes root shape (.../.hermes/sessions), discovery resolves the source to sibling .../.hermes/state.db, outside the watch root.

    • Fix: When hermesStatePaths(root) resolves a state DB outside the configured root, add a watch for filepath.Dir(stateDB) or otherwise include the state DB path, and make SourcesForChangedPath accept that watch root.

Panel: ci_default_security | Synthesis: codex, 7s | Members: codex_default (codex/default, done, 4m26s), codex_security (codex/security, done, 1m48s) | Total: 6m21s

@roborev-ci

roborev-ci Bot commented Jun 20, 2026

Copy link
Copy Markdown

roborev: Combined Review (40a682f)

I've reviewed the hermes provider migration diff and traced the file-path handling against the codebase's validation helpers.

Summary

This commit adds internal/parser/hermes_provider.go, migrating the hermes agent from the legacy adapter to the new Provider interface (discover/watch/find/fingerprint/parse), plus its test file and a one-line factory registration in provider.go. It is a structural migration of a local session-file parser.

I focused on the only plausible attack surface here: whether attacker-influenced input (e.g. a session ID from a URL/DB) can drive file reads outside the configured roots.

  • FindSource passes req.RawSessionID to FindHermesSourceFile, which first calls IsValidSessionID (internal/parser/discovery.go:828). That allows only [A-Za-z0-9_-], so /, ., and .. are rejected — no traversal via the session ID.
  • Every resolved path is re-validated through sourceForPathsourceRefhermesPathInTranscriptDir, which requires the file to be a direct child of the transcript root (samePath(filepath.Dir(path), dir)) with a .jsonl or session_*.json name, plus IsRegularFile. A traversal path could not survive this even if an upstream helper produced one.
  • The state.db archive path is gated by samePath(path, stateDB) against hermesStatePaths(root), keeping it inside a configured root.
  • The non-ok type assertion ref.Opaque.(hermesSource) in pathFromSource is safe because ref always originates from sourceRef, which sets Opaque to a hermesSource value; it's a robustness nit, not a security issue, and not attacker-reachable.

No SQL/command construction, auth/authz changes, credential handling, or new cross-trust-boundary exposure is introduced. Per the project threat model, local session files are non-adversarial and local files are user-owned, and the path validation here is consistent with the rest of the parser package.

No issues found.


Panel: ci_default_security | Synthesis: claude-code | Members: codex_default (claude-code/default, failed, 1s), codex_security (claude-code/security, done, 56s) | Total: 57s

@roborev-ci

roborev-ci Bot commented Jun 21, 2026

Copy link
Copy Markdown

roborev: Combined Review (eb4ee67)

Review verdict: medium correctness issues remain in Hermes sync/provider parity.

Medium

  • internal/sync/engine.go:1738 - classifyHermesPath maps transcript events to state.db whenever the configured root is named sessions or has a sessions child, even if state.db does not exist. Transcript-only Hermes installs will enqueue a missing state.db job on every live transcript event, producing sync errors and failed stats alongside the valid transcript sync. Fix by only routing transcript events to state.db when the state DB actually exists; otherwise classify the changed transcript path directly.

  • internal/parser/hermes_provider.go:91 - The Hermes provider returns raw ParseHermesArchive results for state.db sources without stamping the archive fingerprint onto each result. The legacy path now rewrites all archive result File.Path, Size, and Mtime to the aggregate state.db source, so shadow compare will report systematic mismatches and provider-authoritative mode would persist different file metadata. Fix by mirroring stampHermesArchiveResults in the provider archive path, using the source path plus req.Fingerprint.Size, MTimeNS, and Hash.

  • internal/parser/hermes_provider.go:526 - Archive fingerprinting uses discoverHermesTranscriptFiles, which skips session_*.json when a matching .jsonl exists, but archive parsing checks session_<id>.json first and may select it. A change to the selected JSON transcript can leave the provider fingerprint unchanged. Fix by fingerprinting every direct transcript file that can affect archive parsing, or aligning the parser’s transcript selection with the fingerprint file list.


Panel: ci_default_security | Synthesis: codex, 11s | Members: codex_default (codex/default, done, 4m59s), codex_security (codex/security, done, 12s) | Total: 5m22s

@mariusvniekerk

mariusvniekerk commented Jun 21, 2026

Copy link
Copy Markdown
Collaborator Author

This change is part of the following stack:

Change managed by git-spice.

@roborev-ci

roborev-ci Bot commented Jun 21, 2026

Copy link
Copy Markdown

roborev: Combined Review (384821f)

Medium findings require fixes before merge.

Medium

  • internal/sync/engine.go:1766 - hermesSyncArchivePaths treats any configured sessions root as an archive even when the sibling state.db does not exist. Transcript-only Hermes installs will classify transcript writes as the missing state.db, causing spurious stat failures during live sync.
    Fix: Only map transcript events to state.db when the archive DB actually exists; still allow the state.db create/write event itself to classify as the archive source.

  • internal/sync/engine.go:1729 and internal/sync/engine.go:6461 - Hermes archive freshness ignores SQLite state.db-wal and state.db-shm sidecars. WAL-only commits can be missed by the watcher and skipped by periodic sync because the aggregate mtime/size only considers the main DB and transcripts.
    Fix: Classify state.db-wal/state.db-shm events back to state.db, and include SQLite sidecars in the archive effective info and provider fingerprint.


Panel: ci_default_security | Synthesis: codex, 8s | Members: codex_default (codex/default, done, 6m44s), codex_security (codex/security, done, 1m30s) | Total: 8m22s

@roborev-ci

roborev-ci Bot commented Jun 24, 2026

Copy link
Copy Markdown

roborev: Combined Review (31dd18e)

Medium confidence: one Medium issue found; no High or Critical findings.

Medium

  • internal/parser/hermes_provider.go:235 - FindSource returns the Hermes archive state.db for any syntactically valid raw session ID as soon as a state DB exists, without verifying that the requested session is actually in that archive. This can make FindSourceFile/SyncSingleSession resolve missing sessions, or sessions that exist in a later configured root, to the wrong archive and then parse/write unrelated sessions instead of returning not found.

    Fix: Before returning the archive source for a raw session ID, verify that the ID exists in state.db or in the archive transcript directory; otherwise continue searching remaining roots and return not found when no matching source exists.


Panel: ci_default_security | Synthesis: codex, 7s | Members: codex_default (codex/default, done, 5m14s), codex_security (codex/security, done, 3m39s) | Total: 9m0s

Hermes can represent a configured root as either individual transcript files or as a state.db archive that fans out into multiple sessions. Moving it behind a concrete provider makes that source choice explicit instead of leaving archive behavior inside the legacy adapter path.\n\nThe provider preserves transcript discovery and lookup while treating state.db as a multi-session, force-replace source. Its fingerprint covers the archive database plus sibling transcripts so transcript-quality changes can refresh the archive source that ParseHermesArchive reads.

fix(parser): preserve hermes archive event coverage

Hermes archive discovery can normalize a configured sessions directory or direct state.db path into a sibling archive source, but the watch plan and changed-path classifier still assumed the configured root was the only event root. That left state.db updates and removed primary files invisible to provider-path sync.

Normalize archive watch roots, map delete and rename-style events syntactically when primary files are gone, and cover archive-parent, sessions-directory, and direct-state roots. This lets Hermes enter shadow comparison as an actual migration branch.

Validation: go test -tags "fts5" ./internal/parser -run 'Test(HermesProvider|ProviderMigrationModes)' -count=1; go test -tags "fts5" ./internal/parser -count=1; go vet ./...; git diff --check

fix(parser): watch hermes archive roots syntactically

Hermes archive configs can point at the archive parent, its sessions directory, or the state.db file before the sibling archive components have been created. Watch planning needs to treat those shapes as archive roots from their paths, not from startup-time existence checks, otherwise late-created metadata or transcripts are invisible until a full sync.

The transcript watch root is now retained for archive-shaped roots even when sessions/ is not present yet, while ordinary transcript-only roots keep their recursive file watch.

Validation: go test -tags "fts5" ./internal/parser -run 'TestHermesProvider' -count=1; go test -tags "fts5" ./internal/parser ./internal/sync -count=1; go fmt ./...; go vet ./...; git diff --check

fix(parser): feed hermes archive roots to runtime watcher

Hermes provider watch planning now knows how to follow archive-shaped roots, but the actual serve-time watcher still reads registry watch resolvers. Without a matching Hermes resolver there, the default .hermes/sessions config can miss sibling state.db creation or updates in live sync.

Expose Hermes shallow archive-parent watch roots through the registry while keeping transcript roots recursive, and add shadow parity coverage so this branch remains a migration rather than an additive provider implementation.

Validation: go test -tags "fts5" ./cmd/agentsview ./internal/parser ./internal/sync -run 'TestCollectWatchRootsHermesSessionsWatchesStateDBParent|TestHermesProvider|TestParseHermes|TestProviderMigrationModes|TestObserveProviderSourceMatchesHermesLegacyParser' -count=1; go test -tags "fts5" ./cmd/agentsview ./internal/parser ./internal/sync -count=1; go fmt ./...; go vet ./...; ./custom-gcl run --config .golangci.nilaway.yml ./cmd/agentsview/... ./internal/parser/... ./internal/sync/...; git diff --check

fix(sync): classify hermes archive watcher events

Roborev jobs 2715 and 2716 caught that Hermes archive watch roots were subscribed but the legacy SyncPaths classifier still ignored sibling state.db events. That meant live sync could wait for a periodic full sync even though the watcher saw the change.

Map configured Hermes archive roots, state.db events, and direct archive transcript events back to the state.db source that processHermes already parses, while preserving transcript-only root classification for standalone Hermes session files.

Validation: go test -tags "fts5" ./internal/sync -run TestSyncPathsHermesStateDBEventRefreshesArchive -count=1; go test -tags "fts5" ./internal/parser ./internal/sync -run 'Test(HermesProvider|ObserveProviderSourceMatchesHermesLegacyParser|SyncPathsHermesStateDBEventRefreshesArchive)' -count=1; go fmt ./...; go vet ./...; git diff --check

fix(sync): include hermes transcripts in archive skips

Roborev job 2803 caught that Hermes transcript watcher events could still be suppressed by state.db-only skip metadata after being routed to the archive source. In mixed state-db/transcript archives, state.db can be unchanged while a sibling transcript is new or updated.

Use archive-effective size and mtime for state.db skip checks by folding direct transcript files from the sibling sessions directory into the snapshot, and add a regression where a transcript event refreshes an already-indexed archive.

Validation: go test -tags "fts5" ./internal/sync -run 'TestSyncPathsHermes(ArchiveTranscriptEventRefreshesArchive|StateDBEventRefreshesArchive)' -count=1; go test -tags "fts5" ./internal/parser ./internal/sync -run 'Test(HermesProvider|ObserveProviderSourceMatchesHermesLegacyParser|SyncPathsHermes)' -count=1; go fmt ./...; go vet ./...; git diff --check

fix(sync): use aggregate hermes archive fingerprints

Hermes archive freshness needs the state.db sync path to compare the same aggregate fingerprint it persists. Discovering through the public Hermes session lister reselected state.db and missed sibling transcripts, so state.db events could avoid real skip-cache parity.\n\nEnumerate direct transcript files for the archive snapshot and stamp archive parse results with the aggregate state.db fingerprint before writing. This keeps unchanged archive syncs comparable while still refreshing when sibling transcripts change.\n\nValidation: go test -tags "fts5" ./internal/parser ./internal/sync; go vet ./...; make nilaway

fix(sync): apply hermes archive fingerprints consistently

Hermes archive refresh paths need to compare and persist the same aggregate fingerprint for state.db plus sibling transcripts. Otherwise cached parse skips and single-session refreshes can fall back to raw state.db metadata and miss transcript-only archive changes.

Use the aggregate archive file info before generic skip-cache checks and share the archive parse-and-stamp helper between full archive processing and single-session refreshes. The regression coverage now persists the metadata, checks unchanged archive skips, and covers transcript discovery/removal behavior.

Validation: go test -tags "fts5" ./internal/sync -run 'TestHermesArchive|TestProcessFileHermes|TestProcessHermesArchive|TestSyncSingleHermesArchive' -count=1; go test -tags "fts5" ./internal/parser ./internal/sync; go vet ./...; make nilaway

refactor(parser): fold hermes into provider

Move Hermes source discovery, lookup, and parse ownership onto the
concrete hermesProvider and delete the package-level
DiscoverHermesSessions, FindHermesSourceFile, ParseHermesArchive, and
ParseHermesSession free functions. Discovery and find-source bodies now
live as provider-owned helpers (discoverHermesSessions,
findHermesSourceFile); parse, archive parse, the state-db reader, and the
transcript-archive fallback become hermesProvider methods (parseSession,
parseArchive, parseStateDB, parseTranscriptArchive).

Reproduce Hermes archive behavior on the provider. The provider's archive
Parse now stamps every state.db session with the state.db path plus the
aggregate (state.db + direct transcripts) size and mtime, replacing the
engine's stampHermesArchiveResults/hermesArchiveEffectiveInfo so a
transcript-only change still refreshes the archive's stored freshness. The
new provider helpers hermesArchiveEffectiveFileInfo and
hermesArchiveTranscriptFiles mirror the legacy engine aggregation (every
.jsonl and session_*.json directly under the sessions directory, no
dedup). The existing composite archive Fingerprint and archive watch/
classify source-set methods already carried the rest.

Make Hermes provider-authoritative and drop its legacy sync dispatch:
remove classifyHermesPath (and its hermesSyncArchivePaths,
hermesSyncDirExists, hermesSyncTranscriptPath helpers), the processFile
hermesArchiveEffectiveInfo stat hook and case arm, processHermes,
parseHermesArchive, stampHermesArchiveResults, hermesArchiveEffectiveInfo,
hermesArchiveTranscriptFiles, hermesArchiveSourcePaths, and the
syncSingleHermesArchive special-case plus its method. Single-session
resync of an archive now falls through to the generic provider path, which
reparses the whole archive (ForceReplace) the same way a full sync does.

Drop the Hermes AgentDef DiscoverFunc/FindSourceFunc hooks (the
provider-owned WatchRootsFunc/ShallowWatchRootsFunc stay), remove
hermes_provider.go from the pending shim scan list, replace the
shadow-baseline test with provider-API coverage plus a guard that the
legacy entrypoints stay gone, and route the package and engine archive
tests through provider methods and the provider-authoritative processFile/
SyncPaths paths.

Add internal/sync/provider_shadow_support_test.go defining the shared
writeProviderShadowSourceFile test helper that the remaining vibe shadow
test still references, which was orphaned by a predecessor commit.

test(sync): drop unused shadow source-file helper

The hermes fold left writeProviderShadowSourceFile in a dedicated test
support file, but every shadow test writes its fixtures inline, so the
helper has no callers and trips the unused linter. Remove the dead
scaffolding.
@roborev-ci

roborev-ci Bot commented Jun 25, 2026

Copy link
Copy Markdown

roborev: Combined Review (e0d9089)

Medium issues remain in Hermes archive single-session sync and archive fingerprint freshness.

Medium

  • internal/sync/engine.go:8847
    Hermes archive sessions now fall through the generic single-session resync path, so SyncSingleSessionContext("hermes:<id>") writes every parsed result from state.db instead of only the requested session. Because FindSource returns state.db for any syntactically valid Hermes ID, a nonexistent ID can appear successfully synced while importing unrelated archive sessions.
    Fix: Restore archive-specific filtering for Hermes single-session sync, or add a provider-targeted path that writes only the requested ID and errors when it is absent.

  • internal/parser/hermes_provider.go:535
    The archive fingerprint uses discoverHermesTranscriptFiles, which deduplicates session_*.json when a matching .jsonl exists. The parser and hermesArchiveEffectiveFileInfo still consider direct session_*.json files, so quick sync / fingerprint-based freshness can miss JSON snapshot changes or removals.
    Fix: Build the archive fingerprint from the same direct transcript set as hermesArchiveEffectiveFileInfo (hermesArchiveTranscriptFiles) so size, mtime, and hash cover every .jsonl and session_*.json.


Panel: ci_default_security | Synthesis: codex, 22s | Members: codex_default (codex/default, done, 6m45s), codex_security (codex/security, done, 4m22s) | Total: 11m29s

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

1 participant