Skip to content

feat(sync): local-first multi-machine artifact sync#731

Open
maphew wants to merge 5 commits into
kenn-io:mainfrom
maphew:docs/local-first-multi-machine-sync
Open

feat(sync): local-first multi-machine artifact sync#731
maphew wants to merge 5 commits into
kenn-io:mainfrom
maphew:docs/local-first-multi-machine-sync

Conversation

@maphew

@maphew maphew commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

Implements the local-first multi-machine sync design proposed in #692:
every machine keeps the complete archive and machines converge by
exchanging immutable, content-addressed artifacts over any dumb transport
instead of depending on an always-on PostgreSQL hub. SQLite stays a local,
rebuildable derivation — the live database file never crosses the wire.

Design rationale and the full set of alternatives considered (Automerge,
cr-sqlite, the SQLite session extension, whole-DB replication, raw-file
mirroring) live in docs/design/local-first-sync.md; user-facing setup is
in docs/artifact-sync.md.

What this adds

  • Artifact store and canonical format — each install maintains a
    write-once, content-addressed store under
    $AGENTSVIEW_DATA_DIR/artifacts/<origin>/: append-only checkpoints,
    session manifests, zstd-compressed NDJSON message segments, a metadata
    change feed, and an optional raw-source fallback. Serialization is a
    pinned forever-contract enforced by golden tests; readers ignore unknown
    fields and skip unknown future ops so mixed app versions keep syncing.
    HLC timestamps render without : so metadata filenames are valid on
    Windows.
  • Origin identity — each install persists a stable origin id (machine
    name plus a random suffix). Foreign sessions are stored as
    origin~nativeID with machine=origin, the same convention SSH
    remote-sync already uses, so every read path, the UI, and analytics
    render them without composite-PK surgery across backends. Server, CLI
    folder sync, peer import, and conflict lookup converge on one persisted
    origin via AdoptOrigin.
  • Export and import — export reads from SQLite, so non-file agents,
    uploads, imports, SSH-pulled, and orphan-preserved sessions all publish;
    it is debounced through the existing pg-watch sink loop. Import diffs
    checkpoints against artifact_sync_state, hash-verifies segments, and
    writes foreign sessions through the existing UpsertSession/message
    paths, inheriting FTS5 maintenance, tombstone rejection, and pin
    re-attachment. Undelivered segments are recorded as phantoms and retried,
    tolerating out-of-order delivery from dumb transports.
  • Metadata ledger — renames, trash/restore, stars, and pins append
    tiny HLC-stamped change events replayed deterministically with per-field
    last-writer-wins. Concurrent conflicting edits are never silently
    dropped: the losing value is logged to meta_conflicts and surfaced in
    the UI as a fork badge. Local edits record their own LWW register and
    applied-event marker on write, so a later peer event with a lower order
    key can no longer overwrite a newer local edit; replay advances the local
    HLC past observed remote events to keep later local edits causally ahead;
    and a single not-yet-durable target defers only its own event rather than
    aborting the rest of an origin's replay.
  • Transports — one sync verb, three interchangeable target shapes
    behind a shared Transport interface (export -> set-union exchange ->
    import):
    • Folderagentsview sync [--init|--watch] <dir>, safe for
      Syncthing, Dropbox, NFS, or rclone mounts because every file is
      immutable temp+rename and single-writer-per-prefix.
    • HTTP peeragentsview sync https://peer:8080 [--token <t>]
      exchanges directly over the embedded server's artifact API behind the
      existing Bearer-token middleware. A GET /{origin}/index route
      enumerates an origin's artifacts so metadata events (not referenced by
      the checkpoint) can be pulled; --token defaults to the local auth
      token for a fleet sharing one symmetric token.
    • Object storageagentsview sync s3://bucket/prefix against any
      S3-compatible store (AWS, MinIO, Backblaze B2). Requests are signed with
      AWS Signature Version 4 implemented from the standard library, so there
      is no AWS SDK dependency; credentials and addressing come from the
      standard AWS_* env vars plus AGENTSVIEW_S3_* overrides.
  • Garbage collection — superseded artifacts (everything not reachable
    from an origin's latest checkpoint) are reclaimed both on demand
    (agentsview sync gc [--dry-run] [--grace <d>] <dir>) and automatically
    after a folder sync, over the local store and the shared target together
    so set-union cannot re-propagate the deleted files. A grace window
    protects slow peers, origins without checkpoints are skipped (never read
    as a deletion), and --gc-grace/--no-gc tune or disable the automatic
    pass.
  • Watch and peers UI--watch keeps any target shape syncing on
    change plus a periodic floor through the pg-watch loop, and a peers page
    shows each origin's published vs. locally-present session counts,
    checkpoint sequence, last-published time, and total conflict count.

Scope, tradeoffs, and limitations

  • Trust model: a fully mutually trusted personal fleet. Folder
    transports have no per-writer identity and the HTTP API uses one shared
    token, so any peer can forge any origin's metadata. This is documented as
    exactly that; per-peer tokens and origin signatures are the follow-up
    before any sharing story.
  • No CRDT engine: session content is single-writer-per-machine and
    append-mostly (a grow-only set), and metadata is a small append-only LWW
    log, so a general CRDT library would add real cost without solving a
    problem this data has.
  • Storage: a third local copy of the corpus (source files plus DB plus
    compressed artifacts); zstd recovers 5-10x and GC reclaims superseded
    bulk artifacts behind a grace window.
  • Availability: two intermittently-on machines converge only while both
    can reach the same transport; a NAS, bucket, or always-on peer is the
    practical rendezvous by convention, not by privileged architecture.
  • S3 signing: the hand-rolled SigV4 is exercised both by in-memory mock
    tests and by a MinIO integration test (make test-minio, run in CI) that
    validates real S3 interop end to end; it has not been run against AWS itself.
    Remote GC on an object store or HTTP peer is that peer's own responsibility —
    auto-GC after a non-folder sync only collects the local store.
  • Postgres: unchanged as an optional aggregation hub, with the
    owner_marker push design from the merged fix(postgres): preserve source machine on pg push #701/fix(postgres): guard pg push against same-id cross-machine row collision #724; the session-sink
    seam is extracted (drainSessionBatches) so PG is one sink and the
    artifact exporter can become another. PG read mode returns no
    metadata-ledger conflicts (a parity stub in
    internal/postgres/metadata.go).

The change is additive: upgrading generates an origin id and behaves
exactly as today when sync is not configured. New tables arrive through the
existing idempotent migration path with no dataVersion bump and no resync.

Where to review

  • Canonical format and golden stability: internal/artifact/format_test.go
  • HLC, deterministic replay, and the LWW correctness fixes:
    internal/artifact/hlc.go, internal/artifact/replay.go,
    internal/db/metadata_replay.go
  • Import invariants (phantoms, tombstone-retry, watermark advance):
    internal/artifact/sync.go
  • Transport abstraction and the three implementations:
    internal/artifact/transport.go, transport_http.go, transport_s3.go
  • HTTP peer surface, the index route, and auth:
    internal/server/huma_routes_artifacts.go, internal/artifact/peer.go
  • Conflict surfacing: internal/server/metadata_events.go and the frontend
    SessionBreadcrumb/TrashPage components
  • End-to-end exchange and convergence:
    internal/artifact/twoinstance_test.go,
    internal/server/artifact_http_transport_test.go,
    internal/e2e/artifact_sync_test.go

Relates to #692.

Claude Opus 4.8 reasoning-medium on behalf of maphew

@maphew

maphew commented Jun 18, 2026

Copy link
Copy Markdown
Contributor Author

Feedback is welcome. Still in draft mode since work and testing to this point has been completely agent driven, a combination of GPT-5.5 and Claude 4.8. Next up is manually trying various distributed machine scenarios and seeing how well any of this works in practice. Assuming the idea eventually proves out, I'm happy to split into smaller manageable PR.

@roborev-ci

roborev-ci Bot commented Jun 18, 2026

Copy link
Copy Markdown

roborev: Combined Review (2e11257)

High-level verdict: two actionable findings remain; no additional security issues were reported.

High

  • internal/artifact/metadata.go:145 - Metadata event filenames are built from HLCTimestamp.String(), whose layout contains : characters. Windows is supported, and : is invalid in Windows filenames, so rename/star/pin/delete requests can fail while writing the metadata artifact after the DB mutation has already been applied.
    • Fix: Use a filesystem-safe HLC representation for artifact filenames, or encode/decode the HLC separately from the JSON event value, and add coverage for Windows-safe metadata paths.

Medium

  • internal/server/server.go:125 - The server creates a metadata recorder with cfg.ArtifactOriginID, but when that value is empty the recorder falls back to artifact.EnsureOrigin() in SQLite sync state, while CLI folder sync later uses config.EnsureArtifactOriginID() in config.toml. That can create two different origins for the same machine, causing pre-init metadata events to target oldOrigin~session while exported sessions use newOrigin~session, so peers cannot replay those metadata events.
    • Fix: Use one authoritative origin source everywhere: ensure and persist the config origin before enabling metadata recording/import, or make CLI/server/import/conflict lookup all share the same persisted origin.

Panel: ci_default_security | Synthesis: codex, 9s | Members: codex_default (codex/default, done, 9m1s), codex_security (codex/security, done, 7m54s) | Total: 17m4s

wesm pushed a commit that referenced this pull request Jun 19, 2026
## Summary
- Keep the candidate-window and boundary-session behavior in `internal/postgres/push.go` unchanged for this PR, and batch the PostgreSQL-side comparison reads used to decide whether a candidate session can be skipped.
- Implement new batched loaders in `internal/postgres/push_fingerprint.go` for message aggregates, message content hashes, role/time fingerprints, message flags, message system ordinals, token fingerprints, tool-call aggregates, tool-call fingerprints, and usage fingerprints, with chunking inside the helper when session counts exceed `ANY($1)` practicality.
- Use the preloaded message and tool-call aggregates on the hot no-op path, and retry any comparison-preload SQL failure in a fresh transaction without the batched preload instead of continuing inside an already-aborted transaction.
- Add targeted regression tests in `internal/postgres/push_test.go` and `internal/postgres/push_fingerprint_test.go` to cover the new batch-driven skip decision path and helper behavior with empty inputs.

## Scope
- Files changed are `internal/postgres/push.go`, `internal/postgres/push_fingerprint.go`, `internal/postgres/push_test.go`, and `internal/postgres/push_fingerprint_test.go`.
- No boundary/windowing semantics, no schema changes, and no changes to PR #731 or broader sync-work areas.

## Notes
- A focused PG comparison query-count assertion was not added because the existing harness does not expose a stable helper-call/query metric for this exact path without adding brittle test-only instrumentation.
- The review-driven follow-up keeps the existing non-batched fingerprint fallback, but now that fallback only runs from a clean transaction after preload failure instead of on the poisoned transaction that raised the preload error.

Fixes #331


Co-authored-by: Rod Boev <rodboev@users.noreply.github.com>
@maphew

maphew commented Jun 19, 2026

Copy link
Copy Markdown
Contributor Author

Thanks for the review. Both findings were valid and are addressed in 2804870 and f252c35.

High — Windows-invalid : in metadata filenames. Confirmed: filenames are OrderingKey(hash) = HLCTimestamp.String() + hash, and String() rendered the wall time with 15:04:05 colons. Since the DB mutation runs before the metadata write (e.g. humaStarSession), star/pin/rename/delete would mutate the DB and then fail the file write on Windows, diverging the DB and the ledger. Fix: dropped the : separators from the HLC wall layout (2006-01-02T150405.000000000Z). Keeping a single representation everywhere — content, filename, and the persisted orderKey cursor — preserves lexicographic ordering, parse round-trip, and the filename-equals-content integrity check with no separate encode/decode. Updated the canonical-format goldens, added an assertion that the filename has no Windows-invalid characters, and a parse-rejection case for the old colon form.

Note this changes the canonical on-disk HLC string (event.HLC no longer contains colons). That is fine here since the format is pre-release, but any artifacts written by an earlier draft build would be stale-format.

Medium — divergent origin sources. Confirmed: runServe never populated cfg.ArtifactOriginID before server.New, so an empty config origin made the recorder fall back to a random DB-state origin while CLI folder sync minted a machine-named origin in config.toml. Fix: ensure the config origin (authoritative) at serve startup before the recorder exists, and reconcile the DB sync-state to it via a new idempotent artifact.AdoptOrigin. The recorder, peer import, and folder sync now converge on one persisted origin. Added AdoptOrigin unit tests (persist, idempotent, overwrite-divergent, reject-invalid).

go vet, gofmt, and the artifact/server/config/e2e Go suites pass.

Claude Opus 4.8 reasoning-medium on behalf of maphew

@roborev-ci

roborev-ci Bot commented Jun 19, 2026

Copy link
Copy Markdown

roborev: Combined Review (f252c35)

High-risk issues remain in metadata replay/local state handling; no security-specific findings were reported.

High

  • internal/artifact/sync.go:523 - Local metadata events are never recorded in metadata_replay_state because the importer skips the local origin entirely and MetadataRecorder.Append only writes the artifact file. A later peer event for the same field is treated as the first winner and can overwrite a newer local rename/star/delete/pin without recording a conflict.
    Fix: Keep skipping local session manifests, but replay local meta/ events into the replay tables, or have local append also persist the local projection/applied-event state.

Medium

  • internal/artifact/replay.go:51 - Imported remote HLCs are never observed by the local HLC clock, so after replaying a peer event whose clock is ahead, the next local edit can get an older HLC and lose LWW ordering despite happening after the import.
    Fix: Parse accepted remote event HLCs during replay and advance the persisted metadata clock with HLCClock.Observe.

  • internal/artifact/replay.go:53 - A single ErrMetadataTargetUnavailable aborts replay for the rest of that origin, so one missing/excluded session or pin target can block unrelated later metadata events from applying.
    Fix: Defer only that event without marking it applied, then continue replaying later events whose targets are available.


Panel: ci_default_security | Synthesis: codex, 11s | Members: codex_default (codex/default, done, 7m18s), codex_security (codex/security, done, 4m50s) | Total: 12m19s

@maphew

maphew commented Jun 20, 2026

Copy link
Copy Markdown
Contributor Author

Thanks again. All three findings were valid and are addressed in 055f3b3.

High — local metadata events missing from the replay register (sync.go:523). Confirmed: import skips the local origin and MetadataRecorder.Append only wrote the artifact file, so locally-originated edits never landed in metadata_replay_state/metadata_applied_events. A later peer event for the same field saw no current winner and could overwrite a newer local rename/star/delete/pin without recording a conflict. Fix: Append now records the event through a new db.RecordLocalMetadataProjection, which runs the same per-field LWW conflict/register/applied bookkeeping as replay but skips re-applying the mutation (the handler already applied it). ApplyMetadataProjection was refactored to share that path via an applyMutation flag. Added a test where a local rename with a higher HLC survives a later-imported lower-HLC peer rename and the peer is recorded as the conflict loser.

Medium — remote HLCs not observed by the local clock (replay.go:51). Confirmed. Replay now tracks the newest accepted remote HLC per origin and calls HLCClock.Observe once at the end, so the next local edit is causally ahead of imported peers. Advancing past the drift bound is best-effort and never fails an otherwise successful import. Import is routed through the recorder (MetadataRecorder.Import) so appends and imports share one clock instance rather than racing two clocks on the same persisted key. Added a test asserting a post-import local edit gets an HLC strictly after a peer HLC that was ahead of the local wall clock.

Medium — one unavailable target aborted the rest of the origin (replay.go:53). Confirmed. ErrMetadataTargetUnavailable now skips only that event (left unapplied, watermark not advanced) and replay continues with later events, which retries the deferred event on a subsequent run once its target exists. Added a test where an event for a missing session no longer blocks a later event for an existing session, and the deferred event applies after the session arrives.

go vet, gofmt, and the artifact/db/server Go suites pass.

Claude Opus 4.8 reasoning-medium on behalf of maphew

@roborev-ci

roborev-ci Bot commented Jun 20, 2026

Copy link
Copy Markdown

roborev: Combined Review (14f32bc)

Artifact sync is not yet complete: two medium-severity data convergence gaps need fixes before merge.

Medium

  • internal/artifact/sync.go:414
    Artifact export pages through ListSessions, which applies the normal session-list visibility filter (message_count > 0). Zero-message sessions with usage events are never exported or synced, even though artifact manifests support usage events.
    Fix: Export from a raw DB enumeration for owned, non-deleted sessions instead of the UI list API, and add coverage for zero-message/usage-only sessions.

  • internal/server/huma_routes_starred.go:92
    Bulk star updates the DB but does not append metadata events. This path is used for localStorage star migration, so those stars remain local-only and never converge through artifact sync.
    Fix: Have bulk star return or determine the sessions actually starred, then append a MetadataOpStar event for each valid session.


Panel: ci_default_security | Synthesis: codex, 8s | Members: codex_default (codex/default, done, 10m33s), codex_security (codex/security, done, 5m28s) | Total: 16m9s

@maphew

maphew commented Jun 20, 2026

Copy link
Copy Markdown
Contributor Author

Thanks. Both convergence gaps were valid and are fixed in 1d8d24c and 8cac9ff.

Medium — usage-only sessions never exported (sync.go:414). Confirmed: export paged through ListSessions, whose sidebar filter (message_count > 0) hid owned sessions that carry only usage events, so they never reached a checkpoint. Fix: export now enumerates owned, non-deleted sessions from a raw query (db.ListOwnedSessionIDsForExport) instead of the list API. Soft-deleted and foreign sessions stay excluded — deletes still propagate via metadata tombstone events, never checkpoint absence. Added a test that a zero-message session with a usage event lands in the checkpoint and imports into a peer with its usage events intact, plus one asserting deleted/foreign sessions stay out.

Medium — bulk star emitted no metadata events (huma_routes_starred.go:92). Confirmed: bulk star wrote the DB but appended nothing, so localStorage-migrated stars stayed local-only. Fix: BulkStarSessions now returns the IDs actually starred (sessions that exist), across the SQLite, PostgreSQL, and DuckDB backends to preserve parity, and the handler appends a star metadata event for each, matching single-session star. Stale IDs are still skipped and produce no event. Added a server test asserting one star event artifact per existing session and none for a missing id.

go vet, gofmt, and the db/artifact/server/duckdb Go suites pass; the PG curation pgtest compiles under the pgtest tag.

Claude Opus 4.8 reasoning-medium on behalf of maphew

@roborev-ci

roborev-ci Bot commented Jun 20, 2026

Copy link
Copy Markdown

roborev: Combined Review (97c2128)

Artifact sync has several medium-risk correctness issues that should be fixed before merge.

Medium

  • cmd/agentsview/sync.go:118 - agentsview sync --host <host> <artifact-target> is accepted, but the early return after runRemoteSync skips artifact sync entirely, silently ignoring the target. Reject --host with any artifact target, or run runArtifactFolderSync before returning.

  • internal/artifact/transport_s3.go:355 - S3 uploads use unconditional PUT, so a stale or concurrent sync can overwrite an existing artifact object. This violates the write-once artifact invariant, especially for sequence-named checkpoints. Use conditional writes such as If-None-Match: *; on “already exists”, fetch and compare content, accepting identical duplicates and rejecting conflicts.

  • internal/artifact/sync.go:1415 - CopyUnion defers from.Close() inside the walk loop, keeping every copied source file open until the entire traversal finishes. Large artifact stores can hit the process file descriptor limit and fail mid-sync. Close each source file immediately after io.Copy, handling both copy and close errors.

  • internal/artifact/replay.go:86 - Metadata replay applies remote events before advancing the local HLC, then silently ignores Observe failures for future-skewed remote HLCs. A later local edit can get a lower order key, be recorded as the losing value, while the handler has already mutated the local DB to that losing value. Do not apply remote metadata unless the clock can be advanced, or make local event creation consult/advance past the replay-state winner before applying the local mutation.


Panel: ci_default_security | Synthesis: codex, 10s | Members: codex_default (codex/default, done, 10m13s), codex_security (codex/security, done, 6m59s) | Total: 17m22s

@maphew

maphew commented Jun 20, 2026

Copy link
Copy Markdown
Contributor Author

Thanks. Addressed in e77db3a, 4110dce, acbb789, and 6ea6fa8.

Medium — --host + artifact target silently ignored (sync.go:118). Confirmed: --host dispatches to SSH remote sync and returns before artifact sync runs. Fix: validateSyncConfig now rejects the combination outright (ordered after the --init/--watch checks so their specific messages still win). Added accept/reject tests.

Medium — unconditional S3 PUT violates write-once (transport_s3.go:355). Confirmed: unlike the folder and HTTP transports, S3 blind-overwrote. Fix: uploads now send If-None-Match: *; on the 412 precondition failure the object is fetched and compared — identical content is an accepted duplicate, divergent content is a conflict, never a silent merge. The mock honors the conditional and a test covers both outcomes, and the MinIO integration test confirms a real server accepts it.

Medium — CopyUnion deferred close (sync.go:1415). This one is actually a false positive: the defer from.Close() was inside the per-entry WalkDir callback closure, so it ran after each file, not at the end of the traversal — no descriptor accumulation. That said, I removed the ambiguity entirely by switching to os.ReadFile (no manual open/defer), which reads cleaner regardless.

Medium — remote events applied before the HLC advances (replay.go:86). Confirmed and the most important one. Fix: replay now calls Observe on each remote event's HLC before applying it and defers any event whose wall time is beyond the drift bound, leaving it unapplied for a later run to retry once local wall time catches up. A future-skewed peer can no longer drag local state to a value a subsequent local edit couldn't out-order. Added a deferral test; the two-instance harness now orders concurrent edits with within-drift offsets.

go vet, gofmt, and the artifact/db/server/duckdb/cmd Go suites pass; the MinIO integration test (make test-minio) passes against real MinIO.

Claude Opus 4.8 reasoning-medium on behalf of maphew

@roborev-ci

roborev-ci Bot commented Jun 20, 2026

Copy link
Copy Markdown

roborev: Combined Review (6ea6fa8)

Medium severity findings remain.

Medium

  • internal/duckdb/store.go:273
    DuckDB clears Cursor and Limit before applying the generic session filter, so the new sidebar endpoint can return the entire corpus instead of the requested first page. In starred-only mode it also filters only starred roots, missing groups where a child is starred but the root is not, unlike SQLite/Postgres sidebar semantics. Preserve pagination inputs and implement the same root-page plus descendant expansion query used by the other backends, including “any starred session in the tree” root eligibility.

  • internal/artifact/replay.go:49
    Metadata artifact replay accepts filenames/order keys whose HLC component cannot be parsed, because normalizeMetadataName only checks the hash suffix and replay silently ignores ParseHLCTimestamp failures. A peer that can write to the shared artifact transport could publish an order key like zzzz-<sha256>.json with matching body metadata, bypass drift checks, and win LWW metadata conflicts such as purge, soft delete, rename, star, and pin. Reject metadata artifacts whose HLC component cannot be parsed, ideally in validateMetadataArtifactEvent or normalizeMetadataName, while preserving drift deferral for valid future timestamps.


Panel: ci_default_security | Synthesis: codex, 9s | Members: codex_default (codex/default, done, 6m9s), codex_security (codex/security, done, 7m2s) | Total: 13m20s

@mariusvniekerk mariusvniekerk force-pushed the docs/local-first-multi-machine-sync branch from 6ea6fa8 to 18c0f18 Compare June 22, 2026 18:25
@roborev-ci

roborev-ci Bot commented Jun 22, 2026

Copy link
Copy Markdown

roborev: Combined Review (18c0f18)

Medium issues need attention before merge.

Medium

  • internal/postgres/push.go:1094 - resolvePushedSessionIdentity treats any existing row with a different machine as a native ID collision, ignoring owner_marker and legacy marker machine aliases. After a machine rename, a session already owned by the same marker can be pushed as new-machine~id, duplicating the existing PG session instead of updating it.

    • Fix: Resolve identity using owner_marker and legacy marker aliases, matching the ownership rules used later in pushSession; only prefix IDs for true cross-owner collisions.
  • internal/artifact/sync.go:561 - Folder import accepts unvalidated origin directory names. A crafted or mistaken origin such as local can be imported and rewritten with Machine = "local", causing foreign sessions to be treated as locally owned and re-exported under this machine’s origin.

    • Fix: Validate folder origin names with validateOriginID before importOrigin, and reject reserved or malformed origins consistently with HTTP/S3 index validation.

Panel: ci_default_security | Synthesis: codex, 7s | Members: codex_default (codex/default, done, 12m20s), codex_security (codex/security, done, 7m46s) | Total: 20m13s

@wesm

wesm commented Jun 24, 2026

Copy link
Copy Markdown
Member

I will rebase this

@wesm wesm force-pushed the docs/local-first-multi-machine-sync branch from 18c0f18 to b228d18 Compare June 24, 2026 20:00
@roborev-ci

roborev-ci Bot commented Jun 24, 2026

Copy link
Copy Markdown

roborev: Combined Review (b228d18)

High-level verdict: changes need fixes before merge due to artifact import identity/state corruption risks.

High

  • internal/artifact/sync.go:738
    Imported artifacts copy the source machine’s FilePath, file hash/mtime/size, incremental ordinals, and inode/device fields into the local SQLite session row. Existing local sync lookups use file_path without filtering by machine, so a peer artifact whose path exists locally can cause local sync to skip the real local file or append local messages into the imported origin~id session.
    Fix: Clear or namespace parser-local file metadata/control fields during artifact import, or make all sync file-path/incremental queries restrict to locally owned sessions.

Medium

  • internal/artifact/sync.go:755
    Artifact import prefixes parent_session_id and subagent tool-call session IDs, but leaves source_session_id unchanged. Imported sessions are stored as origin~nativeID, so an unprefixed source_session_id will dangle or point at an unrelated local session.
    Fix: Rewrite non-empty, non-global SourceSessionID the same way as ParentSessionID, and add an import test covering source/parent/subagent relationship rewriting.

  • internal/config/config.go:1588
    EnsureArtifactOriginID generates and writes a new origin without taking the config lock or re-reading under that lock. Concurrent first starts can generate different origin IDs, causing one process to use an origin that is overwritten by another and splitting local artifact identity.
    Fix: Wrap the ensure path in withConfigLock, re-read the config map while locked, honor any existing artifact_origin_id, and only generate/write when still absent.


Panel: ci_default_security | Synthesis: codex, 11s | Members: codex_default (codex/default, done, 12m13s), codex_security (codex/security, done, 7m49s) | Total: 20m13s

@wesm

wesm commented Jun 24, 2026

Copy link
Copy Markdown
Member

I'll continue to work a bit on this to see if I can get it into a state that I'm comfortable with

@wesm wesm force-pushed the docs/local-first-multi-machine-sync branch from b228d18 to 16e5a7b Compare June 25, 2026 00:31
@roborev-ci

roborev-ci Bot commented Jun 25, 2026

Copy link
Copy Markdown

roborev: Combined Review (16e5a7b)

High-level verdict: changes are not ready as-is due to one high-severity artifact sync bypass and two medium-severity reliability issues.

High

  • cmd/agentsview/sync.go:175
    When a writable daemon is running, agentsview sync /artifact-share returns immediately after daemon sync, so the artifact target and --init are silently ignored and no artifact exchange happens.
    Fix: Route artifact exchange through the daemon, or reject artifact sync while a writable daemon owns the DB; add a daemon-backed artifact sync test.

Medium

  • internal/artifact/transport_http.go:218
    HTTP artifact uploads do not set an Origin header, so POSTs to a default no-auth loopback peer are rejected by the server’s CORS middleware with 403.
    Fix: Set Origin to the peer origin for mutating peer requests, matching the daemon sync client behavior.

  • internal/config/config.go:1914
    EnsureArtifactOriginID reads and writes config.toml without withConfigLock or re-reading under the lock, so two first-run processes can generate different origins and fork the machine identity.
    Fix: Wrap the read/generate/write path in withConfigLock and reuse an existing artifact_origin_id found after acquiring the lock.


Panel: ci_default_security | Synthesis: codex, 9s | Members: codex_default (codex/default, done, 13m9s), codex_security (codex/security, done, 11m17s) | Total: 24m35s

@roborev-ci

roborev-ci Bot commented Jun 25, 2026

Copy link
Copy Markdown

roborev: Combined Review (8733aec)

Medium-risk issues remain.

Medium

  • internal/artifact/sync.go:803 - Artifact import restores sessions from m.Session, but db.Session.SessionName is json:"-" and is not restored from the manifest. The batch upsert writes session_name from that nil field, causing synced sessions to lose parser-provided titles and potentially clearing titles on existing imported rows. Add an explicit manifest SessionName *string, populate it on export, restore it in rewriteForImport, and add an artifact round-trip test.

  • cmd/agentsview/sync.go:291 - When agentsview sync targets an HTTP(S) artifact peer without --token, artifactPeerToken falls back to appCfg.AuthToken, which is then sent as Authorization: Bearer ... to the peer via internal/artifact/transport_http.go:253. A malicious or compromised peer could collect the local AgentsView auth token and use it for API access. Require an explicit peer token for HTTP(S) artifact targets, make local-token reuse explicit opt-in, or avoid sending Authorization when no peer-specific token is configured.


Panel: ci_default_security | Synthesis: codex, 9s | Members: codex_default (codex/default, done, 8m29s), codex_security (codex/security, done, 8m0s) | Total: 16m38s

@roborev-ci

roborev-ci Bot commented Jun 25, 2026

Copy link
Copy Markdown

roborev: Combined Review (7851273)

Medium-risk issues found in metadata replay/artifact publishing; security review found no additional issues.

Medium

  • internal/db/metadata_replay.go:184 - Local metadata events are recorded with applyMutation=false after handlers have already mutated the row. If a peer metadata import runs between the handler mutation and this projection write, the new local event can win the LWW register without reapplying its value, leaving metadata_replay_state saying the local edit won while the actual session/star/pin row still contains the peer value. Serialize the user mutation and local projection in one transaction/critical section, or have local projection recording idempotently reapply the local mutation when it wins.

  • internal/artifact/metadata.go:159 - Append marks the local metadata event applied/current in SQLite before the immutable event file is written. A crash or writeFileAtomic failure after that point leaves the local mutation committed and the replay state advanced, but no artifact exists for peers; for purge/delete-everywhere this can permanently lose the fleet-wide delete event after the session is gone. Use a durable outbox or transactional publish flow so the event artifact is written or queued for retry before the mutation is considered successful and before replay state is advanced irrecoverably.


Panel: ci_default_security | Synthesis: codex, 9s | Members: codex_default (codex/default, done, 18m9s), codex_security (codex/security, done, 6m21s) | Total: 24m39s

@wesm wesm marked this pull request as ready for review June 25, 2026 03:35
@roborev-ci

roborev-ci Bot commented Jun 25, 2026

Copy link
Copy Markdown

roborev: Combined Review (efb934f)

Summary verdict: Changes are not clean; 2 medium issues need attention before merge.

Medium

  • internal/db/orphaned.go:337
    Full resync only copies pg_push_marker_id from pg_sync_state and does not copy the new artifact metadata replay tables. After a temp-DB rebuild, local curation rows may be preserved, but metadata_replay_state, metadata_applied_events, and the artifact HLC/import state are lost. A later artifact import can replay older foreign rename/star/trash events as new while local-origin events are skipped, potentially overwriting preserved local metadata.
    Fix: Preserve artifact sync state and metadata replay tables during resync, or rebuild replay state from local metadata artifacts before importing foreign events.

  • internal/artifact/peer.go:288
    Peer/object artifact storage validates artifacts with current-version import decoders before storing them. Future-version checkpoints, segments, and metadata events are rejected by WriteArtifact/ReadArtifact, so an older HTTP peer or S3 client cannot act as a pass-through store during rolling upgrades and sync aborts instead of deferring unknown artifacts.
    Fix: Split storage validation from import validation: verify names, hashes, and basic origin/sequence envelopes where possible, store/serve opaque future-version artifacts, and leave version-specific rejection to the import path.


Panel: ci_default_security | Synthesis: codex, 9s | Members: codex_default (codex/default, done, 12m23s), codex_security (codex/security, done, 7m39s) | Total: 20m11s

@wesm wesm force-pushed the docs/local-first-multi-machine-sync branch from efb934f to 550372b Compare June 25, 2026 20:25
@roborev-ci

roborev-ci Bot commented Jun 25, 2026

Copy link
Copy Markdown

roborev: Combined Review (550372b)

Summary verdict: medium-risk sync consistency issues need fixes before merge.

Medium

  • internal/server/huma_routes_sessions.go:669
    Batch soft-deletes update SQLite but never append MetadataOpSoftDelete events, so sessions deleted via the multi-select/batch endpoint do not converge through artifact sync while single deletes do.
    Fix: Return or compute the IDs actually newly deleted and append one soft-delete metadata event per changed session.

  • internal/artifact/peer.go:510
    Peer/S3 artifact writes reject future-version checkpoints, segments, and metadata as invalid, which makes HTTP/S3 sync fail against a newer peer instead of storing the immutable artifact and letting import defer it.
    Fix: Treat errFutureArtifactVersion as acceptable for transport storage after minimal origin/name/hash validation, and leave deferral to the import path.

  • internal/db/metadata_replay.go:419
    Replayed purge events only tombstone the exact session ID, unlike the existing permanent-delete helpers that also exclude parser fallback aliases; delete-everywhere can therefore be resurrected under an alias on a later local sync.
    Fix: Resolve alias IDs before deleting and insert all of them into excluded_sessions in the same transaction.


Panel: ci_default_security | Synthesis: codex, 9s | Members: codex_default (codex/default, done, 12m28s), codex_security (codex/security, done, 5m7s) | Total: 17m44s

@roborev-ci

roborev-ci Bot commented Jun 25, 2026

Copy link
Copy Markdown

roborev: Combined Review (eaf6694)

Medium-risk issues remain in artifact sync and artifact upload handling.

Medium

  • internal/artifact/sync.go:1114 - Artifact export/import drops db.ToolCall.FilePath because segmentToolCall never carries it, so synced sessions lose edit/write file paths and disappear from Recent Edits or downstream PG/DuckDB mirrors.
    Fix: Add a file_path field to segmentToolCall, populate it from call.FilePath, restore it in dbMessage(), and add a round-trip test.

  • internal/artifact/sync.go:749 - Checkpoint and manifest validation accepts arbitrary referenced manifest/segment strings as long as they are non-empty. Those values are later used in filepath.Join, so malformed artifacts can abort imports or resolve .. path components outside the intended artifact-kind directory.
    Fix: Validate checkpoint manifest hashes and manifest segment/raw hashes with validateHashHex before accepting or importing them, and reject malformed artifacts early.

  • internal/server/huma_routes_artifacts.go:236 - humaPostArtifact writes request bodies to $AGENTSVIEW_DATA_DIR/artifacts via artifact.WriteArtifact without checking s.db.ReadOnly() or whether the server has a writable local archive. A client that can reach a read-only pg serve/DuckDB serve instance can post arbitrary SHA-matching content, creating persistent files and consuming disk where other mutating routes are blocked.
    Fix: Reject artifact POST in read-only mode before calling WriteArtifact, or only register the write route when the server has a writable local *db.DB plus sync/metadata support.


Panel: ci_default_security | Synthesis: codex, 12s | Members: codex_default (codex/default, done, 15m4s), codex_security (codex/security, done, 5m54s) | Total: 21m10s

Squashed follow-up changes:
- chore: clear golangci-lint modernize and staticcheck debt
- fix(postgres): resolve relationship ids to pushed-session identities
- fix(artifact): carry persisted signal state across manifest round-trip
- fix(postgres): respect session ownership when resolving pushed ids
- fix(postgres): repair stale subagent links on incremental push
- fix(artifact): import scanned sessions as unscanned for secrets
- fix(postgres): reuse legacy-prefixed rows, skip per-session conflicts
- fix(artifact): keep non-content state out of the manifest hash
- fix(artifact): guard os.Stat result in artifactFileExists
- fix(artifact): harden trusted-fleet sync
- fix(sync): harden trusted-fleet edge cases
- fix(postgres): reject foreign already-prefixed identities
- fix(sync): harden artifact sync edge cases
- fix(artifact): preserve titles and avoid token reuse
- Merge origin/main into docs/local-first-multi-machine-sync
- fix(metadata): suppress false conflict noise
- fix(sync): preserve metadata convergence
- fix(sync): make batch delete metadata retryable

Co-authored-by: Wes McKinney <wesmckinn+git@gmail.com>
Generated with Codex
Co-authored-by: Codex <codex@openai.com>
@wesm wesm force-pushed the docs/local-first-multi-machine-sync branch from eaf6694 to f944578 Compare June 25, 2026 21:56
@roborev-ci

roborev-ci Bot commented Jun 25, 2026

Copy link
Copy Markdown

roborev: Combined Review (f944578)

Reviewed artifact sync changes: one medium-severity correctness finding remains; no high or critical issues reported.

Medium

  • internal/server/huma_routes_starred.go:76 - humaUnstarSession appends an unstar metadata event even when the session ID does not exist or no star row was removed. With artifact sync, that can create a durable last-writer-wins tombstone for arbitrary IDs, so a mistyped or stale request can later suppress a valid peer star for that session. Fix by checking that the session exists, or by having UnstarSession report whether it removed an existing star, before appending metadata. Keep idempotent HTTP behavior if desired, but do not publish metadata for nonexistent/no-op targets.

Panel: ci_default_security | Synthesis: codex, 6s | Members: codex_default (codex/default, done, 7m50s), codex_security (codex/security, done, 5m59s) | Total: 13m55s

wesm and others added 2 commits June 25, 2026 18:50
Unstar requests are HTTP-idempotent, but artifact metadata is durable last-writer-wins state. Publishing an unstar event when no star row was removed lets stale or mistyped requests create tombstones for sessions the local database did not actually change.\n\nHave the store report whether unstar removed a row and only emit metadata for real removals, while preserving the no-content response for missing or already-unstarred targets across the supported backends.

Generated with Codex
Co-authored-by: Codex <codex@openai.com>
Unstar now avoids publishing metadata for true no-op requests, but a failed metadata write after removing the star made the retry path indistinguishable from a no-op. That could leave the local database unstarred while artifact sync never learned about the mutation.\n\nRestore the star when the unstar metadata append fails, so the failed request remains retryable and the next DELETE can remove the row and publish the artifact. The HTTP response still reports the original metadata failure.

Generated with Codex
Co-authored-by: Codex <codex@openai.com>
@roborev-ci

roborev-ci Bot commented Jun 26, 2026

Copy link
Copy Markdown

roborev: Combined Review (5235d49)

Medium confidence: the PR is close, but it has metadata durability and idempotency issues that can cause stale peer state to overwrite local curation.

Medium

  • internal/db/orphaned.go:338 - Full DB resync only copies pg_push_marker_id from pg_sync_state and does not preserve artifact sync state or the new metadata_* replay tables. After a resync, local LWW/applied metadata state is lost while local-origin artifacts are skipped during import, so older peer events can overwrite preserved local curation and HLC state can regress.

    • Fix: Copy durable artifact sync keys and the metadata_applied_events, metadata_replay_state, and metadata_conflicts tables during the resync metadata copy, with compatibility checks for older DBs.
  • internal/server/huma_routes_starred.go:57 - PUT /star appends a star metadata event even when the session was already starred, because StarSession returns true for both newly inserted and already-existing stars. A retry/no-op can create a later HLC event and incorrectly beat a concurrent peer unstar.

    • Fix: Distinguish “session exists” from “star row inserted” and only append metadata when the star state actually changed.
  • internal/server/huma_routes_pins.go:117 - DELETE /pin appends an unpin metadata event whenever the message exists, even if no pin row was removed. Retried or no-op unpins can therefore publish fresh LWW events and incorrectly override peer pins.

    • Fix: Have UnpinMessage report whether it removed a row, or check the pinned row before deletion, and append metadata only on an actual removal.

Panel: ci_default_security | Synthesis: codex, 11s | Members: codex_default (codex/default, done, 11m55s), codex_security (codex/security, done, 9m5s) | Total: 21m11s

wesm and others added 2 commits June 25, 2026 19:26
Metadata replay state is durable LWW bookkeeping, so it must not be committed for an event whose artifact was never published. Otherwise a failed local append can leave invisible state that wins later comparisons even though no peer can import the event.\n\nWrite the metadata artifact first and only then mark the local projection as applied. The unstar failure regression now checks both the visible star rollback and the hidden replay/applied tables.

Generated with Codex
Co-authored-by: Codex <codex@openai.com>
Windows absolute paths such as C:\Users\... were classified as host:port targets because net.SplitHostPort accepts the drive letter as a host. That made local artifact folder sync, GC, and auto-GC reject normal Windows temp directories in CI.\n\nRecognize drive-letter filesystem paths before the host:port check so Windows local folders follow the same path as POSIX folders while bare host:port values still stay out of folder sync.

Generated with Codex
Co-authored-by: Codex <codex@openai.com>
@roborev-ci

roborev-ci Bot commented Jun 26, 2026

Copy link
Copy Markdown

roborev: Combined Review (fce3069)

The PR has two medium-risk convergence/compatibility issues; no security findings with a plausible exploit path were reported.

Medium

  • internal/server/huma_routes_sessions.go:727 - Permanent delete removes the local trashed session before recording the purge metadata event. If appendMetadataEvent fails after DeleteSessionIfTrashed succeeds, retries return “session not found or not in trash,” so the delete-everywhere event cannot be regenerated and peers may keep or later reintroduce the session.

    • Fix: Make purge recording retryable with the local deletion state, for example by allowing purge event creation when the local exclusion already exists, or by recording durable purge intent before making the local delete irreversible.
  • internal/artifact/peer.go:541 - Manifest validation rejects future-version manifest artifacts by requiring the current manifest schema fields. Other artifact types intentionally accept future versions for relay/storage, but a newer peer’s manifest referenced by a future checkpoint will be rejected by HTTP/S3 peers instead of being preserved for forward compatibility.

    • Fix: Decode manifest version/origin first and, when v > formatVersion, validate only minimal invariants such as origin and hash before accepting the artifact; keep strict schema checks for current-version manifests.

Panel: ci_default_security | Synthesis: codex, 9s | Members: codex_default (codex/default, done, 10m49s), codex_security (codex/security, done, 5m18s) | Total: 16m16s

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

2 participants