Skip to content

test(cluster): multi-node deployment-tracking integration test (Slice B2)#221

Open
kriszyp wants to merge 8 commits into
mainfrom
feat/deployment-tracking-slice-b2
Open

test(cluster): multi-node deployment-tracking integration test (Slice B2)#221
kriszyp wants to merge 8 commits into
mainfrom
feat/deployment-tracking-slice-b2

Conversation

@kriszyp
Copy link
Copy Markdown
Member

@kriszyp kriszyp commented May 23, 2026

Companion to HarperFast/harper#760 — Slice B2 of the deployment-tracking redesign tracked in HarperFast/harper#641.

Summary

A 3-node cluster integration test verifying that the new payload-via-row design actually works end-to-end. After harper #760 strips req.payload before replicateOperation and switches peers to read from hdb_deployment.payload_blob, this test proves:

  1. deploy_component from node 0 succeeds, the component is loaded on all 3 nodes.
  2. The hdb_deployment row replicates and is queryable from any peer.
  3. The origin's row has peer_results populated with status=success entries for both peer nodes — confirming the origin successfully captured per-peer outcomes from replicateOperation's return value.

The OSS-side counterpart (harper #760) exercises the peer-side branch on a single node by seeding the row first. This test verifies the full multi-node round trip with real BLOB_CHUNK replication.

Notes

  • No code changes in harper-pro. The existing replicateOperation in replication/replicator.ts already does the generic operation forwarding the new design needs — no deploy_component-specific path required. The chunked-relay / direct-HTTPS approach from the closed harper-pro#146 is fully obsoleted by the row-replication channel.
  • Submodule bump is temporary. The core submodule is currently pointed at harper's feat/deployment-tracking-slice-b2 branch so the test can run against the OSS-side changes. Once harper #760 merges, re-target core to harper main HEAD before marking this PR ready for review.

Test plan

  • CI runs deployTrackingReplication.test.mjs (3-node cluster, ~60s)
  • Verify the test fails cleanly if either:
    • the strip-before-replicate behavior in harper #760 is reverted (peer would receive a Readable it can't process)
    • the peer-side branch in harper #760 is reverted (peer would try to extract from missing req.payload)

🤖 Generated by Claude

kriszyp and others added 2 commits May 22, 2026 21:54
3-node cluster test verifying Slice B2 of HarperFast/harper#641: payload
travels to peers via the replicated hdb_deployment.payload_blob row
attribute through Harper's existing BLOB_CHUNK channel, not via the
operations API body.

Asserts:
- deploy_component from node 0 succeeds, component loads on all 3 nodes
- hdb_deployment row replicates and is queryable from any node
- origin row's peer_results is populated with success entries for both
  peer nodes (proves origin captured per-peer outcomes from
  replicateOperation's return)

The OSS-side counterpart (HarperFast/harper#760) exercises the peer-side
branch on a single node by seeding the row first; this test verifies the
full multi-node round trip the new design depends on.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Temporary bump so the new deployTrackingReplication.test.mjs cluster test
runs against harper's Slice B2 changes (HarperFast/harper#760). Re-target
to harper main HEAD once #760 merges.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Comment thread core Outdated
@@ -1 +1 @@
Subproject commit 974cf40ec2fc1b4b96876c1a082dcdb9dba9baed
Subproject commit 2d597bbc13e09d08e1a7831a5415686b9ed23440
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The core submodule is pinned to a commit on feat/deployment-tracking-slice-b2 rather than a commit on harper/main. If this PR is merged as-is, harper-pro/main would depend on unreleased feature-branch code in core. The PR description already notes this must be retargeted to harper/main HEAD after harper#760 merges — just flagging it explicitly so it can't be accidentally merged early.

@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 23, 2026

1. core submodule pinned to feature branch — must retarget before merge

File: core:1
What: core still points to 2d597bb on feat/deployment-tracking-slice-b2 (the post-review bump in commit 21dc3d5 advanced the pointer on the same branch rather than retargeting to harper/main).
Why it matters: Merging as-is would pin harper-pro/main to unreleased feature-branch code.
Suggested fix: After harper#760 merges, update the submodule pointer to the corresponding harper/main HEAD commit (as already noted in the PR description).

The post-review test improvements (restart: false for write durability, replicate-dispatch assertion) look correct.

kriszyp added a commit to HarperFast/harper that referenced this pull request May 23, 2026
The deploy hangs after extraction when reading a Web ReadableStream from
a file-backed Blob inside the same Harper process on Bun — same code
passes on Node v22/v24 across Linux and Windows. The harper-pro 3-node
cluster test (HarperFast/harper-pro#221) covers the same code path
end-to-end with real replication, so this skip doesn't lose coverage.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@kriszyp kriszyp requested review from Ethan-Arrowood and heskew May 23, 2026 04:00
kriszyp and others added 5 commits May 23, 2026 06:58
The fullyConnectedReplication test uses restart:true to verify the new
component routes are loaded on peers, but for verifying B2's peer_results
tracking we need a clean deploy. Restarting HTTP workers mid-flow cycles
the worker that owns the recorder before recorder.finish() can flush the
final put with peer_results.

Removed the /Location/2 component-reachability check since it requires
restart:true; the harper #760 single-node test already verifies the peer
branch extracts from the row blob. This cluster test focuses on what's
unique: row replication + peer_results capture.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…sertion

- Remove unused `fetchWithRetry` import that broke lint after dropping the
  /Location/2 component-reachability check.
- Add explicit assertion that `deployResponse.replicated` is a non-empty
  array. If `replicateOperation` doesn't dispatch to peers (origin's
  `server.nodes` empty), the failure now surfaces here instead of as a
  silent empty `peer_results` later — clearer signal for triage.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Includes harper commits up to 6d7cd99c (fix to eliminate the put race
where peer_results was lost — now bundled into the terminal finish()
put).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The 1s sleep after deploy wasn't enough on slower CI shards (Node v22
shard 3) for the terminal-state finish() put to propagate via table
replication. Poll up to 15s with 250ms cadence — fast in the happy case,
patient enough for slower environments. peer_results assertion test
already passing after the race fix landed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@kriszyp kriszyp closed this May 23, 2026
@kriszyp kriszyp reopened this May 23, 2026
@kriszyp kriszyp marked this pull request as ready for review May 23, 2026 17:22
@kriszyp kriszyp requested a review from a team as a code owner May 23, 2026 17:22
@kriszyp
Copy link
Copy Markdown
Member Author

kriszyp commented May 23, 2026

Heads-up on CI: the `pull_request` event has stopped triggering test workflows on this PR since commit 0867433 (around 13:36 UTC). I've tried pushing new commits, closing+reopening the PR, and flipping draft→ready — none of these re-triggered the standard workflow set. Only `pull_request_target` workflows (Cherry-pick) fire.

The `workflow_dispatch` runs I triggered manually mostly worked, but Unit Tests fails there with a submodule clone auth issue (`fatal: could not read Username`) that doesn't happen in regular pull_request runs.

This looks like a GitHub Actions service-level issue, not anything in the PR. The `main` branch's workflows still fire normally on push events.

The supporting harper-side PR (HarperFast/harper#760) is fully green and ready for review — this PR's CI can be triggered by either:

  1. Waiting for the GH issue to resolve and re-pushing
  2. Re-syncing the core submodule once harper#760 merges (which would change the submodule pointer and likely retrigger workflows)

🤖 Generated by Claude

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant