
test: replication receive-side stress regressions (catch-up memory + blob save)#150

Open
kriszyp wants to merge 4 commits into main from stress/replication-receive-tests


@kriszyp kriszyp commented May 15, 2026

Summary

Two cluster integration tests targeting the receive-side failure modes that knocked a prod node off the cluster earlier this month. Both are guards against regression of fixes already on main / pending review.

| Test | Guards | Currently in main? |
| --- | --- | --- |
| receiveBacklogMemory.test.mjs | PR #147 (RECEIVE_EVENT_HIGH_WATER_MARK backpressure) | yes; should pass |
| blobSaveRejectionContainment.test.mjs | PR #149 (outstandingBlobsToFinish stores catch-handled promise) | no; passes only after #149 merges |

Why

In early May a Harper Pro 5.0.16 node OOM'd repeatedly during peer catch-up. Two distinct receive-side defects compounded:

  1. onWSMessage decoded every audit record in a WS message synchronously inside a tight do { ... } while (...) loop. A single message with thousands of records ballooned the heap past the 2 GB old-gen ceiling, producing ERR_WORKER_OUT_OF_MEMORY roughly every 25 s. Fixed by PR #147 (fix: bound replication receive memory to stop worker OOM crash loops), which yields when the consumer queue exceeds RECEIVE_EVENT_HIGH_WATER_MARK = 100.
  2. When saveBlob rejected (the same node was logging "Blob save failed for X from peer Y" at ~35/s, with ENOENT during catch-up), the raw rejecting promise sat in outstandingBlobsToFinish, and await Promise.all(...) inside end_txn's onCommit propagated the rejection out as an uncaughtException. Fixed by PR #149 (fix(replication): swallow blob save rejection in outstandingBlobsToFinish), which stores the catch-handled promise instead.
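
The first defect is a missing-backpressure problem. A minimal sketch of the yielding pattern follows, assuming a plain array as the consumer queue and a timer-driven consumer. Only the constant's name and its value of 100 come from this PR; `decodeWithBackpressure`, `startConsumer`, and `yieldToEventLoop` are hypothetical names, and this is not the code from PR #147.

```javascript
// Illustrative only: not Harper's onWSMessage. The constant's name and value
// come from the PR description; every other name here is hypothetical.
const RECEIVE_EVENT_HIGH_WATER_MARK = 100;

// Resolves on a later event-loop turn so the consumer gets a chance to run.
const yieldToEventLoop = () => new Promise((resolve) => setImmediate(resolve));

// Push records one at a time and pause whenever the consumer queue exceeds
// the high-water mark, instead of decoding the whole message in one go.
async function decodeWithBackpressure(records, queue) {
  const stats = { peakQueue: 0 };
  for (const record of records) {
    queue.push(record);
    stats.peakQueue = Math.max(stats.peakQueue, queue.length);
    while (queue.length > RECEIVE_EVENT_HIGH_WATER_MARK) {
      await yieldToEventLoop(); // let the consumer drain the queue
    }
  }
  return stats;
}

// Hypothetical consumer: removes a small batch from the queue on each tick.
// Returns a function that stops the consumer.
function startConsumer(queue, batchSize = 10) {
  const timer = setInterval(() => queue.splice(0, batchSize), 0);
  return () => clearInterval(timer);
}
```

With this shape the queue never grows beyond the high-water mark plus one entry, regardless of how many records a single message carries.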

These tests reproduce both conditions directly so future changes can't quietly reintroduce either.
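
The second defect comes down to which promise object is stored. A minimal sketch of the difference, assuming a stand-in `failingSave` in place of saveBlob and a plain array in place of outstandingBlobsToFinish; all function names here are hypothetical and this is not the code from PR #149.

```javascript
// Illustrative only: `failingSave`, `trackRaw`, `trackHandled`, and `commit`
// are hypothetical stand-ins, not Harper's replication code.
const log = [];

// Stand-in for a blob save that fails, e.g. with ENOENT during catch-up.
function failingSave(id) {
  const error = Object.assign(new Error(`no such file: ${id}`), { code: "ENOENT" });
  return Promise.reject(error);
}

// Unsafe: a .catch() is attached for logging, but the raw promise is what
// gets stored, so Promise.all() in the caller still rejects.
function trackRaw(outstanding, id) {
  const promise = failingSave(id);
  promise.catch((err) => log.push(`Blob save failed for ${id}: ${err.code}`));
  outstanding.push(promise);
}

// Safe: the promise returned by .catch() is stored; it always fulfills.
function trackHandled(outstanding, id) {
  const handled = failingSave(id).catch((err) => {
    log.push(`Blob save failed for ${id}: ${err.code}`);
  });
  outstanding.push(handled);
}

// Stand-in for the commit path that awaits all outstanding blob saves.
async function commit(outstanding) {
  await Promise.all(outstanding); // rejects only if a stored promise rejects
  return "committed";
}
```

In both variants the failure is logged; only the second keeps the rejection from reaching the caller of `commit`.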

What each test does

receiveBacklogMemory.test.mjs

  • Brings up a 2-node cluster.
  • killHarper(B), then bursts 40 transactions × 500 records each on A — each transaction is a single WS message with 500 audit entries (well past the HWM of 100). Total backlog: 20 k records.
  • Restarts B; samples system_information.memory.rss every 500 ms while B catches up.
  • Polls describe_table.record_count for an unambiguous catch-up signal (no dependency on cluster_status.lastReceivedVersion shape).
  • Asserts: catch-up completes, no ERR_WORKER_OUT_OF_MEMORY in log, peak RSS < 1.5 GB.
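
The sampling loop above can be sketched as follows. `readRss` and `isCaughtUp` are hypothetical callbacks; in the test they would wrap the system_information and describe_table operations against B. The 120 s timeout and the reading of "1.5 GB" as 1.5 GiB are assumptions.

```javascript
// Illustrative only: `readRss` and `isCaughtUp` are hypothetical callbacks,
// and this is not the helper code added to clusterShared.mjs.
const SAMPLE_INTERVAL_MS = 500;
const PEAK_RSS_LIMIT_BYTES = 1.5 * 1024 ** 3; // the 1.5 GB bound, read as GiB
const TRANSACTIONS = 40;
const RECORDS_PER_TRANSACTION = 500;
const EXPECTED_RECORDS = TRANSACTIONS * RECORDS_PER_TRANSACTION; // 20,000

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Sample RSS on a fixed interval until the receiver has caught up or the
// timeout expires; report the highest sample seen.
async function samplePeakRssUntil(readRss, isCaughtUp, options = {}) {
  const { intervalMs = SAMPLE_INTERVAL_MS, timeoutMs = 120_000 } = options;
  const deadline = Date.now() + timeoutMs;
  let peak = 0;
  while (Date.now() < deadline) {
    peak = Math.max(peak, await readRss());
    if (await isCaughtUp()) return { caughtUp: true, peak };
    await sleep(intervalMs);
  }
  return { caughtUp: false, peak };
}
```

A caller would then assert `result.caughtUp` and `result.peak < PEAK_RSS_LIMIT_BYTES`, with `isCaughtUp` comparing the receiver's record count against `EXPECTED_RECORDS`.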

blobSaveRejectionContainment.test.mjs

  • Brings up a 2-node cluster with the existing Location blob-bearing fixture deployed to both.
  • Pre-installs a new fault-injection component on B only (via setupHarperWithFixture) that monkey-patches fs.createWriteStream to emit ENOENT for every 7th call targeting /blobs/. The injector arms only when HARPER_TEST_BLOB_FAIL_INTERVAL is set, so it's inert if the fixture is ever picked up elsewhere.
  • Drives 400 /Location/{n} requests on A → each creates a blob via sourcedFrom → replicates to B → every 7th blob save on B trips the injector.
  • Asserts:
    1. The injector actually fired (test is non-vacuous).
    2. [error] [replication]: Blob save failed for <id> from <peer> appears (the .catch ran).
    3. No uncaughtException lines mentioning Blob/ENOENT.*blobs — this is the regression.
    4. B still reports A connected in cluster_status.
    5. A fresh write on A still propagates to B after the failures (liveness).
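
Assertions 2 and 3 are log scans. A minimal sketch, assuming the log is available as one string; the regular expressions are inferred from the assertion list above rather than copied from the test file, and `checkReceiverLog` is a hypothetical name.

```javascript
// Illustrative only: the patterns are inferred from the assertion list above,
// not copied from blobSaveRejectionContainment.test.mjs.
const BLOB_SAVE_FAILED = /\[error\] \[replication\]: Blob save failed for \S+ from \S+/;
const UNCAUGHT_BLOB = /uncaughtException.*(Blob|ENOENT.*blobs)/;

// Count contained failures (the .catch ran) and escaped failures (the
// regression) in the receiver's log text.
function checkReceiverLog(logText) {
  const lines = logText.split("\n");
  return {
    handledFailures: lines.filter((line) => BLOB_SAVE_FAILED.test(line)).length,
    escapedFailures: lines.filter((line) => UNCAUGHT_BLOB.test(line)).length,
  };
}
```

The test passes these checks when `handledFailures` is greater than zero and `escapedFailures` is exactly zero.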

Where to look

  • fixture-blob-fail-injector/resources.js — the monkey-patch uses createRequire to get the CJS fs module (ESM namespace objects are frozen) and replaces createWriteStream. Harper's dist code uses require('node:fs').createWriteStream(...) at call time, so the patch is picked up live. Confirmed via cross-model review.
  • The BATCH_SIZE = 500 and FAIL_INTERVAL = 7 are tuned to comfortably exceed the relevant thresholds without blowing test time. Could be made more aggressive if CI runners turn out to be faster than expected; the bounds in the assertions stay valid either way.
  • The 1.5 GB peak-RSS bound is intentionally generous. The bug burst past 2 GB; anything well under that means the HWM-driven pause is taking effect. Tuning down later is easy and orthogonal.
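
For reviewers looking at the fault injector, the monkey-patch idea from the first bullet can be sketched as below. This is not the fixture's resources.js: `installBlobFailInjector` and `armFromEnv` are hypothetical names, the fs module is passed in as a parameter, and the failing stream is an assumption about how an ENOENT could be surfaced. In an ES module, `require` would come from createRequire(import.meta.url), as described above.

```javascript
// Illustrative only: not the fixture's resources.js.
const { Writable } = require("node:stream");

// Patch `fsModule.createWriteStream` so that every `interval`-th call whose
// path contains /blobs/ returns a stream that fails with ENOENT.
function installBlobFailInjector(fsModule, interval) {
  const original = fsModule.createWriteStream;
  const state = {
    calls: 0,
    injected: 0,
    restore: () => {
      fsModule.createWriteStream = original;
    },
  };
  fsModule.createWriteStream = function patched(path, options) {
    if (String(path).includes("/blobs/")) {
      state.calls += 1;
      if (state.calls % interval === 0) {
        state.injected += 1;
        const error = Object.assign(new Error(`ENOENT: injected failure, open '${path}'`), { code: "ENOENT" });
        const stream = new Writable({
          write(chunk, encoding, callback) {
            callback(error);
          },
        });
        // Destroy on the next tick so the caller can attach 'error' listeners.
        process.nextTick(() => stream.destroy(error));
        return stream;
      }
    }
    return original.call(this, path, options);
  };
  return state;
}

// Arm only when the env var is set, mirroring the opt-in described above.
function armFromEnv(fsModule, env = process.env) {
  const interval = Number(env.HARPER_TEST_BLOB_FAIL_INTERVAL);
  return interval > 0 ? installBlobFailInjector(fsModule, interval) : null;
}
```

Because the replacement is assigned onto the module object, it only takes effect for callers that look up `createWriteStream` on that object at call time, which is the property the first bullet relies on.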

Risk / known-flaky areas

  • Local validation hit a pre-existing harness/env issue ("Maximum call stack size exceeded" inside replicationTopology.test.mjs setup) that also affects main. The new tests are structurally identical to replicationLoad.test.mjs and should run cleanly in CI even though the local run was noisy.
  • blobSaveRejectionContainment will fail on main today because PR #149 (fix(replication): swallow blob save rejection in outstandingBlobsToFinish) isn't merged yet. Either land that first, or treat this PR as the trailing half of the same change and land it afterwards.

Testing

  • npx oxlint --quiet on all new files → 0 errors, only the pre-existing new Array(concurrency) warning in clusterShared.mjs that lint:required already tolerates on main
  • npx prettier --check → clean
  • node --check on all new mjs/js → clean
  • Cross-model review (Gemini): positive; no required changes

Test plan

🤖 Generated with Claude Code

@kriszyp kriszyp requested a review from a team as a code owner May 15, 2026 00:03

claude Bot commented May 15, 2026

Reviewed; no blockers found.

kriszyp and others added 2 commits May 18, 2026 22:47
…lob save)

Two cluster integration tests covering the receive-side failure modes that took a
prod node off the cluster:

1. receiveBacklogMemory.test.mjs — guards PR #147's
   RECEIVE_EVENT_HIGH_WATER_MARK fix. Kills receiver B, bursts 40 transactions
   of 500 records each on A (each transaction = one WS message → 500 audit
   entries decoded), restarts B, samples memory while it catches up, asserts
   peak RSS < 1.5 GB and no ERR_WORKER_OUT_OF_MEMORY in the log.

2. blobSaveRejectionContainment.test.mjs — guards PR #149's contract that a
   rejected saveBlob promise is logged exactly once and never escapes onCommit
   as uncaughtException. Installs a fault-injection component on B only that
   monkey-patches fs.createWriteStream to fail every 7th /blobs/ write with
   ENOENT, drives Location-component blob traffic from A, asserts the
   "Blob save failed for ..." line appears but uncaughtException lines do not,
   and that liveness (a fresh write) still propagates after failures.

Adds shared helpers to clusterShared.mjs: readLog, waitForCatchUp,
getMemoryInfo, peakMemory. The fault-injection fixture lives at
integrationTests/cluster/fixture-blob-fail-injector/ and is opt-in via
HARPER_TEST_BLOB_FAIL_INTERVAL env var.

These exercise the same failure surface that affected wtk-ap-west-1 in May:
unbounded synchronous decode on receive, and blob save rejections escaping
the commit confirmation path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI sets HARPER_INTEGRATION_TEST_LOG_DIR, which causes the integration-testing
harness to redirect Harper's `logging.root` to a per-suite directory exposed
on `ctx.harper.logDir` rather than `{dataRootDir}/log/hdb.log`.

readLog() was only checking the dataRootDir path, so in CI it returned an
empty string — making the blob-fail-injector banner assertion fail even when
the component had loaded correctly. Check both locations now.

The receiveBacklogMemory test was also affected (its no-OOM assertion was
reading the wrong file) but happened to pass vacuously.
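
A minimal sketch of a two-location lookup, for illustration. It is not the readLog helper from clusterShared.mjs: the `hdb.log` file name inside `logDir` and the choice to throw when no file is found are assumptions.

```javascript
// Illustrative only: not the helper from clusterShared.mjs.
const fs = require("node:fs");
const path = require("node:path");

// Look for the log in the CI per-suite directory first, then fall back to
// the default location under the data root.
function readLog({ dataRootDir, logDir }) {
  const candidates = [];
  if (logDir) candidates.push(path.join(logDir, "hdb.log")); // file name assumed
  if (dataRootDir) candidates.push(path.join(dataRootDir, "log", "hdb.log"));
  for (const candidate of candidates) {
    if (fs.existsSync(candidate)) return fs.readFileSync(candidate, "utf8");
  }
  // Throwing avoids "no OOM line found" passing vacuously on a missing file.
  throw new Error(`no log file found; tried: ${candidates.join(", ")}`);
}
```

Failing loudly on a missing file is one way to keep negative assertions, such as "no ERR_WORKER_OUT_OF_MEMORY in log", from passing against an empty string.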

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@kriszyp kriszyp force-pushed the stress/replication-receive-tests branch from 3337190 to e51f137 Compare May 19, 2026 04:48
kriszyp and others added 2 commits May 18, 2026 22:55
The shared `fixture/` Location source produces 7,500-byte blobs, which fall
under Harper's FILE_STORAGE_THRESHOLD (8192 bytes) and are stored inline in
the record. With no file write, the receiver's createWriteStream is never
called and the fault injector has nothing to intercept — assertion #2
("Blob save failed line appeared") failed even though the injector loaded
correctly (banner present in B's log twice).

Add a dedicated `fixture-large-blob-source/` with a `LargeLocation` table
whose `sourcedFrom` produces 50 KB streamed blobs — comfortably above the
threshold, guaranteed to take the file-backed write path on the receiver.
Switch the test to deploy/hit this fixture.
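
The size reasoning can be checked with a few lines. The threshold value is taken from this commit message; `takesFileWritePath` is a hypothetical helper, and whether a blob of exactly 8192 bytes is stored inline or on disk is an assumption.

```javascript
// Illustrative only: the threshold value is quoted from the commit message;
// the strict comparison at the boundary is an assumption.
const FILE_STORAGE_THRESHOLD = 8192; // bytes

function takesFileWritePath(blobSizeBytes) {
  return blobSizeBytes > FILE_STORAGE_THRESHOLD;
}

const oldFixtureBlob = 7500; // stored inline, so createWriteStream is never called
const newFixtureBlob = 50 * 1024; // 50 KB, written to a file on the receiver
```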

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
GET routing on a `sourcedFrom` table can be subtle: hitting /LargeLocation/{id}
on the receiver of a partially-populated record may re-invoke the cache-miss
handler instead of returning the locally stored record, and the request can
fail in ways that don't show up as Harper log lines.

The previous run confirmed assertions 1-4 are airtight (35 Blob save failed
log entries, 0 uncaughtException, still connected per cluster_status) — the
test was failing only on the liveness GET timing out. Switch to comparing
describe_table.record_count before/after the upsert: a direct, unambiguous
signal that doesn't depend on REST GET semantics for sourcedFrom tables.
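
A minimal sketch of the before/after comparison, assuming a hypothetical `getRecordCount` callback that wraps describe_table on the receiver; the timeout and polling interval are placeholder values.

```javascript
// Illustrative only: `getRecordCount` is a hypothetical callback and the
// default timings are placeholders, not values from the test.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Poll until the receiver's record count rises above the baseline captured
// before the upsert, or fail once the timeout has passed.
async function waitForRecordCountAbove(getRecordCount, baseline, options = {}) {
  const { timeoutMs = 30_000, intervalMs = 250 } = options;
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    const count = await getRecordCount();
    if (count > baseline) return count;
    if (Date.now() >= deadline) {
      throw new Error(`record_count stayed at ${count} (baseline ${baseline})`);
    }
    await sleep(intervalMs);
  }
}
```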

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>