feat(sync): local-first multi-machine artifact sync #731
Feedback is welcome. This is still in draft because all work and testing so far has been agent-driven, using a combination of GPT-5.5 and Claude 4.8. Next up is manually trying various distributed-machine scenarios to see how well this works in practice. Assuming the idea proves out, I'm happy to split it into smaller, more manageable PRs.
roborev: Combined Review (
|
## Summary

- Keep the candidate-window and boundary-session behavior in `internal/postgres/push.go` unchanged for this PR, and batch the PostgreSQL-side comparison reads used to decide whether a candidate session can be skipped.
- Implement new batched loaders in `internal/postgres/push_fingerprint.go` for message aggregates, message content hashes, role/time fingerprints, message flags, message system ordinals, token fingerprints, tool-call aggregates, tool-call fingerprints, and usage fingerprints, with chunking inside the helper when session counts exceed `ANY($1)` practicality.
- Use the preloaded message and tool-call aggregates on the hot no-op path, and retry any comparison-preload SQL failure in a fresh transaction without the batched preload instead of continuing inside an already-aborted transaction.
- Add targeted regression tests in `internal/postgres/push_test.go` and `internal/postgres/push_fingerprint_test.go` to cover the new batch-driven skip decision path and helper behavior with empty inputs.

## Scope

- Files changed are `internal/postgres/push.go`, `internal/postgres/push_fingerprint.go`, `internal/postgres/push_test.go`, and `internal/postgres/push_fingerprint_test.go`.
- No boundary/windowing semantics, no schema changes, and no changes to PR #731 or broader sync-work areas.

## Notes

- A focused PG comparison query-count assertion was not added because the existing harness does not expose a stable helper-call/query metric for this exact path without adding brittle test-only instrumentation.
- The review-driven follow-up keeps the existing non-batched fingerprint fallback, but now that fallback only runs from a clean transaction after preload failure instead of on the poisoned transaction that raised the preload error.

Fixes #331

Co-authored-by: Rod Boev <rodboev@users.noreply.github.com>
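The chunked, batched preload described in the summary above can be sketched roughly as follows. This is an illustration only: `chunkIDs`, `loadBatched`, the `query` callback, and the chunk size are hypothetical names and parameters, not the helpers in `internal/postgres/push_fingerprint.go`.

```go
package main

// chunkIDs splits ids into consecutive slices of at most size elements so
// each slice can be bound as one array parameter (for example `= ANY($1)`).
// The chunk size is a hypothetical tuning knob, not a value from the PR.
func chunkIDs(ids []string, size int) [][]string {
	if size <= 0 || len(ids) == 0 {
		return nil
	}
	chunks := make([][]string, 0, (len(ids)+size-1)/size)
	for start := 0; start < len(ids); start += size {
		end := start + size
		if end > len(ids) {
			end = len(ids)
		}
		chunks = append(chunks, ids[start:end])
	}
	return chunks
}

// loadBatched calls query once per chunk and merges the per-session results.
// query stands in for a single batched SQL read. Any error aborts the preload
// so the caller can retry without it from a fresh transaction.
func loadBatched(ids []string, size int, query func([]string) (map[string]int, error)) (map[string]int, error) {
	out := make(map[string]int, len(ids))
	for _, chunk := range chunkIDs(ids, size) {
		rows, err := query(chunk)
		if err != nil {
			return nil, err
		}
		for id, v := range rows {
			out[id] = v
		}
	}
	return out, nil
}
```

Returning the error instead of continuing matters because a failed statement aborts the surrounding PostgreSQL transaction; the fallback has to start from a new one.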
|
Thanks for the review. Both findings were valid and are addressed in 2804870 and f252c35. High — Windows-invalid Note this changes the canonical on-disk HLC string ( Medium — divergent origin sources. Confirmed:
Claude Opus 4.8 reasoning-medium on behalf of maphew |
roborev: Combined Review (
|
|
Thanks again. All three findings were valid and are addressed in 055f3b3. High — local metadata events missing from the replay register ( Medium — remote HLCs not observed by the local clock ( Medium — one unavailable target aborted the rest of the origin (
Claude Opus 4.8 reasoning-medium on behalf of maphew |
roborev: Combined Review (
|
|
Thanks. Both convergence gaps were valid and are fixed in 1d8d24c and 8cac9ff. Medium — usage-only sessions never exported ( Medium — bulk star emitted no metadata events (
Claude Opus 4.8 reasoning-medium on behalf of maphew |
roborev: Combined Review (
|
|
Thanks. Addressed in e77db3a, 4110dce, acbb789, and 6ea6fa8. Medium — Medium — unconditional S3 PUT violates write-once ( Medium — Medium — remote events applied before the HLC advances (
Claude Opus 4.8 reasoning-medium on behalf of maphew |
roborev: Combined Review (
|
6ea6fa8 to 18c0f18
roborev: Combined Review (
|
|
I will rebase this |
18c0f18 to b228d18
roborev: Combined Review (
|
|
I'll continue to work a bit on this to see if I can get it into a state that I'm comfortable with |
b228d18 to 16e5a7b
roborev: Combined Review (
|
roborev: Combined Review (
|
roborev: Combined Review (
|
roborev: Combined Review (
|
efb934f to 550372b
roborev: Combined Review (
|
roborev: Combined Review (
|
Squashed follow-up changes:

- chore: clear golangci-lint modernize and staticcheck debt
- fix(postgres): resolve relationship ids to pushed-session identities
- fix(artifact): carry persisted signal state across manifest round-trip
- fix(postgres): respect session ownership when resolving pushed ids
- fix(postgres): repair stale subagent links on incremental push
- fix(artifact): import scanned sessions as unscanned for secrets
- fix(postgres): reuse legacy-prefixed rows, skip per-session conflicts
- fix(artifact): keep non-content state out of the manifest hash
- fix(artifact): guard os.Stat result in artifactFileExists
- fix(artifact): harden trusted-fleet sync
- fix(sync): harden trusted-fleet edge cases
- fix(postgres): reject foreign already-prefixed identities
- fix(sync): harden artifact sync edge cases
- fix(artifact): preserve titles and avoid token reuse
- Merge origin/main into docs/local-first-multi-machine-sync
- fix(metadata): suppress false conflict noise
- fix(sync): preserve metadata convergence
- fix(sync): make batch delete metadata retryable

Co-authored-by: Wes McKinney <wesmckinn+git@gmail.com>

Generated with Codex

Co-authored-by: Codex <codex@openai.com>
eaf6694 to f944578
roborev: Combined Review (
|
Unstar requests are HTTP-idempotent, but artifact metadata is durable last-writer-wins state. Publishing an unstar event when no star row was removed lets stale or mistyped requests create tombstones for sessions the local database did not actually change.

Have the store report whether unstar removed a row and only emit metadata for real removals, while preserving the no-content response for missing or already-unstarred targets across the supported backends.

Generated with Codex

Co-authored-by: Codex <codex@openai.com>
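A minimal sketch of the behavior this commit message describes, using an in-memory stand-in for the store. `starStore`, `handleUnstar`, and the `emit` callback are hypothetical names for illustration, not the project's actual types.

```go
package main

// statusNoContent mirrors HTTP 204, returned whether or not a row was removed.
const statusNoContent = 204

// starStore is a hypothetical in-memory stand-in for the session store.
type starStore struct {
	starred map[string]bool
}

// Unstar removes the star and reports whether a row was actually removed.
func (s *starStore) Unstar(sessionID string) bool {
	if !s.starred[sessionID] {
		return false
	}
	delete(s.starred, sessionID)
	return true
}

// handleUnstar keeps the request idempotent at the HTTP level but only emits
// a metadata event when the store reports a real removal, so stale or
// mistyped requests cannot create tombstones.
func handleUnstar(s *starStore, sessionID string, emit func(sessionID string)) int {
	if s.Unstar(sessionID) {
		emit(sessionID)
	}
	return statusNoContent
}
```

Repeating the same DELETE, or sending one for an unknown session id, still returns 204 but leaves the metadata feed untouched.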
Unstar now avoids publishing metadata for true no-op requests, but a failed metadata write after removing the star made the retry path indistinguishable from a no-op. That could leave the local database unstarred while artifact sync never learned about the mutation.

Restore the star when the unstar metadata append fails, so the failed request remains retryable and the next DELETE can remove the row and publish the artifact. The HTTP response still reports the original metadata failure.

Generated with Codex

Co-authored-by: Codex <codex@openai.com>
roborev: Combined Review (
|
Metadata replay state is durable LWW bookkeeping, so it must not be committed for an event whose artifact was never published. Otherwise a failed local append can leave invisible state that wins later comparisons even though no peer can import the event.

Write the metadata artifact first and only then mark the local projection as applied. The unstar failure regression now checks both the visible star rollback and the hidden replay/applied tables.

Generated with Codex

Co-authored-by: Codex <codex@openai.com>
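The ordering from the two commit messages above (publish the artifact first, mark it applied only afterwards, and restore the star if publishing fails) can be sketched like this. The `ledger` type, its fields, and the `publish` callback are assumptions made for illustration; the real code works against database tables rather than maps.

```go
package main

import "errors"

// errPublish is a hypothetical failure used to exercise the rollback path.
var errPublish = errors.New("metadata artifact write failed")

// ledger is an in-memory stand-in for the star table and the applied-event
// bookkeeping.
type ledger struct {
	starred map[string]bool
	applied map[string]bool
}

// unstarWithMetadata removes the star, publishes the metadata artifact, and
// only then records the event as applied. If publishing fails, the star is
// restored and no applied marker is written, so a retry looks like a fresh
// request rather than a no-op.
func (l *ledger) unstarWithMetadata(sessionID, eventID string, publish func(eventID string) error) error {
	if !l.starred[sessionID] {
		return nil // true no-op: nothing to publish
	}
	delete(l.starred, sessionID)
	if err := publish(eventID); err != nil {
		l.starred[sessionID] = true // roll back the visible state
		return err
	}
	l.applied[eventID] = true // bookkeeping only after the artifact exists
	return nil
}
```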
Windows absolute paths such as `C:\Users\...` were classified as host:port targets because `net.SplitHostPort` accepts the drive letter as a host. That made local artifact folder sync, GC, and auto-GC reject normal Windows temp directories in CI.

Recognize drive-letter filesystem paths before the host:port check so Windows local folders follow the same path as POSIX folders while bare host:port values still stay out of folder sync.

Generated with Codex

Co-authored-by: Codex <codex@openai.com>
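A rough sketch of that ordering, assuming hypothetical helper names (`isDriveLetterPath`, `isHostPort`) rather than the functions actually used in the PR. It relies on `net.SplitHostPort` splitting at the last colon without validating the port, which is why `C:\Users\me` parses as host `C`.

```go
package main

import "net"

// isDriveLetterPath reports whether target looks like a Windows absolute
// path such as `C:\Users\me` or `C:/Users/me`.
func isDriveLetterPath(target string) bool {
	if len(target) < 3 || target[1] != ':' {
		return false
	}
	c := target[0]
	isLetter := (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
	return isLetter && (target[2] == '\\' || target[2] == '/')
}

// isHostPort reports whether target should be treated as a bare host:port.
// The drive-letter check runs first because net.SplitHostPort would
// otherwise accept the drive letter as a host.
func isHostPort(target string) bool {
	if isDriveLetterPath(target) {
		return false
	}
	host, port, err := net.SplitHostPort(target)
	return err == nil && host != "" && port != ""
}
```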
roborev: Combined Review (
|
Implements the local-first multi-machine sync design proposed in #692: every machine keeps the complete archive and machines converge by exchanging immutable, content-addressed artifacts over any dumb transport instead of depending on an always-on PostgreSQL hub. SQLite stays a local, rebuildable derivation: the live database file never crosses the wire. Design rationale and the full set of alternatives considered (Automerge, cr-sqlite, the SQLite session extension, whole-DB replication, raw-file mirroring) live in `docs/design/local-first-sync.md`; user-facing setup is in `docs/artifact-sync.md`.

## What this adds

- write-once, content-addressed store under `$AGENTSVIEW_DATA_DIR/artifacts/<origin>/`: append-only checkpoints, session manifests, zstd-compressed NDJSON message segments, a metadata change feed, and an optional raw-source fallback. Serialization is a pinned forever-contract enforced by golden tests; readers ignore unknown fields and skip unknown future ops so mixed app versions keep syncing. HLC timestamps render without `:` so metadata filenames are valid on Windows.
- name plus a random suffix). Foreign sessions are stored as `origin~nativeID` with `machine=origin`, the same convention SSH remote-sync already uses, so every read path, the UI, and analytics render them without composite-PK surgery across backends. Server, CLI folder sync, peer import, and conflict lookup converge on one persisted origin via `AdoptOrigin`.
- uploads, imports, SSH-pulled, and orphan-preserved sessions all publish; it is debounced through the existing pg-watch sink loop. Import diffs checkpoints against `artifact_sync_state`, hash-verifies segments, and writes foreign sessions through the existing `UpsertSession`/message paths, inheriting FTS5 maintenance, tombstone rejection, and pin re-attachment. Undelivered segments are recorded as phantoms and retried, tolerating out-of-order delivery from dumb transports.
- tiny HLC-stamped change events replayed deterministically with per-field last-writer-wins. Concurrent conflicting edits are never silently dropped: the losing value is logged to `meta_conflicts` and surfaced in the UI as a fork badge. Local edits record their own LWW register and applied-event marker on write, so a later peer event with a lower order key can no longer overwrite a newer local edit; replay advances the local HLC past observed remote events to keep later local edits causally ahead; and a single not-yet-durable target defers only its own event rather than aborting the rest of an origin's replay.
- `sync` verb, three interchangeable target shapes behind a shared `Transport` interface (export -> set-union exchange -> import):
  - `agentsview sync [--init|--watch] <dir>`, safe for Syncthing, Dropbox, NFS, or rclone mounts because every file is immutable temp+rename and single-writer-per-prefix.
  - `agentsview sync https://peer:8080 [--token <t>]` exchanges directly over the embedded server's artifact API behind the existing Bearer-token middleware. A `GET /{origin}/index` route enumerates an origin's artifacts so metadata events (not referenced by the checkpoint) can be pulled; `--token` defaults to the local auth token for a fleet sharing one symmetric token.
  - `agentsview sync s3://bucket/prefix` against any S3-compatible store (AWS, MinIO, Backblaze B2). Requests are signed with AWS Signature Version 4 implemented from the standard library, so there is no AWS SDK dependency; credentials and addressing come from the standard `AWS_*` env vars plus `AGENTSVIEW_S3_*` overrides.
- from an origin's latest checkpoint) are reclaimed both on demand (`agentsview sync gc [--dry-run] [--grace <d>] <dir>`) and automatically after a folder sync, over the local store and the shared target together so set-union cannot re-propagate the deleted files. A grace window protects slow peers, origins without checkpoints are skipped (never read as a deletion), and `--gc-grace`/`--no-gc` tune or disable the automatic pass.
- `--watch` keeps any target shape syncing on change plus a periodic floor through the pg-watch loop, and a peers page shows each origin's published vs. locally-present session counts, checkpoint sequence, last-published time, and total conflict count.
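
The metadata item above combines three ideas: colon-free HLC stamps, advancing the local clock past observed remote events, and per-field last-writer-wins. The following is a simplified sketch of those ideas under stated assumptions; the `HLC`, `Clock`, and `Register` types, the string layout, and the tie-break rule are illustrative and are not the format pinned by the project's golden tests. Conflict logging to `meta_conflicts` is omitted.

```go
package main

import "fmt"

// HLC is a hybrid logical clock stamp: wall-clock milliseconds, a logical
// counter, and the origin id as a final tie-break.
type HLC struct {
	WallMs  int64
	Counter uint32
	Origin  string
}

// String renders a fixed-width, colon-free form so it can be used in
// filenames on Windows. The layout is an assumption, and Origin is assumed
// to be filename-safe already.
func (h HLC) String() string {
	return fmt.Sprintf("%013d-%08x-%s", h.WallMs, h.Counter, h.Origin)
}

// Less orders stamps by wall time, then counter, then origin.
func (h HLC) Less(o HLC) bool {
	if h.WallMs != o.WallMs {
		return h.WallMs < o.WallMs
	}
	if h.Counter != o.Counter {
		return h.Counter < o.Counter
	}
	return h.Origin < o.Origin
}

// Clock issues stamps for one origin.
type Clock struct {
	last HLC
}

// Now returns a stamp greater than every stamp issued or observed so far.
func (c *Clock) Now(wallMs int64) HLC {
	if wallMs > c.last.WallMs {
		c.last = HLC{WallMs: wallMs, Origin: c.last.Origin}
	} else {
		c.last.Counter++
	}
	return c.last
}

// Observe advances the clock past a remote stamp so later local edits sort
// after it, while keeping the local origin id.
func (c *Clock) Observe(remote HLC) {
	if remote.WallMs > c.last.WallMs ||
		(remote.WallMs == c.last.WallMs && remote.Counter > c.last.Counter) {
		c.last.WallMs, c.last.Counter = remote.WallMs, remote.Counter
	}
}

// Register is a per-field last-writer-wins cell.
type Register struct {
	Value string
	Stamp HLC
}

// Apply accepts the write only if its stamp is strictly newer and reports
// whether it won. Replaying the same event twice is therefore a no-op.
func (r *Register) Apply(value string, stamp HLC) bool {
	if !r.Stamp.Less(stamp) {
		return false
	}
	r.Value, r.Stamp = value, stamp
	return true
}
```

Because local edits go through the same `Apply` path as replayed peer events, a late-arriving peer event with a lower stamp loses the comparison instead of overwriting the newer local value.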

## Scope, tradeoffs, and limitations

- transports have no per-writer identity and the HTTP API uses one shared token, so any peer can forge any origin's metadata. This is documented as exactly that; per-peer tokens and origin signatures are the follow-up before any sharing story.
- append-mostly (a grow-only set), and metadata is a small append-only LWW log, so a general CRDT library would add real cost without solving a problem this data has.
- compressed artifacts); zstd recovers 5-10x and GC reclaims superseded bulk artifacts behind a grace window.
- can reach the same transport; a NAS, bucket, or always-on peer is the practical rendezvous by convention, not by privileged architecture.
- tests and by a MinIO integration test (`make test-minio`, run in CI) that validates real S3 interop end to end; it has not been run against AWS itself. Remote GC on an object store or HTTP peer is that peer's own responsibility; auto-GC after a non-folder sync only collects the local store.
- `owner_marker` push design from the merged fix(postgres): preserve source machine on pg push #701 / fix(postgres): guard pg push against same-id cross-machine row collision #724; the session-sink seam is extracted (`drainSessionBatches`) so PG is one sink and the artifact exporter can become another. PG read mode returns no metadata-ledger conflicts (a parity stub in `internal/postgres/metadata.go`).

The change is additive: upgrading generates an origin id and behaves exactly as today when sync is not configured. New tables arrive through the existing idempotent migration path with no dataVersion bump and no resync.
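
The set-union exchange that the description relies on (artifacts form a grow-only set of immutable, content-addressed files) can be illustrated with a small in-memory sketch. `artifactStore`, `missingFrom`, `copyMissing`, and `exchange` are hypothetical names; the real transports operate on folders, HTTP peers, and S3 prefixes behind the `Transport` interface.

```go
package main

import "sort"

// artifactStore maps an immutable artifact name (assumed to be derived from
// its content hash) to its bytes.
type artifactStore map[string][]byte

// missingFrom lists, in sorted order, names present in src but not in dst.
func missingFrom(src, dst artifactStore) []string {
	var names []string
	for name := range src {
		if _, ok := dst[name]; !ok {
			names = append(names, name)
		}
	}
	sort.Strings(names)
	return names
}

// copyMissing copies only absent artifacts and never overwrites an existing
// name, which keeps the store write-once.
func copyMissing(src, dst artifactStore) int {
	names := missingFrom(src, dst)
	for _, name := range names {
		dst[name] = src[name]
	}
	return len(names)
}

// exchange makes both stores hold the union of their artifacts and reports
// how many files moved in each direction.
func exchange(a, b artifactStore) (toA, toB int) {
	toB = copyMissing(a, b)
	toA = copyMissing(b, a)
	return toA, toB
}
```

Running the exchange a second time moves nothing, which is why repeated or out-of-order syncs over a dumb transport converge without coordination.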

## Where to review

- `internal/artifact/format_test.go`
- `internal/artifact/hlc.go`, `internal/artifact/replay.go`, `internal/db/metadata_replay.go`
- `internal/artifact/sync.go`
- `internal/artifact/transport.go`, `transport_http.go`, `transport_s3.go`
- `internal/server/huma_routes_artifacts.go`, `internal/artifact/peer.go`
- `internal/server/metadata_events.go` and the frontend `SessionBreadcrumb`/`TrashPage` components
- `internal/artifact/twoinstance_test.go`, `internal/server/artifact_http_transport_test.go`, `internal/e2e/artifact_sync_test.go`

Relates to #692.
Claude Opus 4.8 reasoning-medium on behalf of maphew