test: replication receive-side stress regressions (catch-up memory + blob save)#150
Open
kriszyp wants to merge 4 commits into
Open
test: replication receive-side stress regressions (catch-up memory + blob save)#150kriszyp wants to merge 4 commits into
kriszyp wants to merge 4 commits into
Conversation
Contributor
|
Reviewed; no blockers found. |
…lob save) Two cluster integration tests covering the receive-side failure modes that took a prod node off the cluster: 1. receiveBacklogMemory.test.mjs — guards PR #147's RECEIVE_EVENT_HIGH_WATER_MARK fix. Kills receiver B, bursts 40 transactions of 500 records each on A (each transaction = one WS message → 500 audit entries decoded), restarts B, samples memory while it catches up, asserts peak RSS < 1.5 GB and no ERR_WORKER_OUT_OF_MEMORY in the log. 2. blobSaveRejectionContainment.test.mjs — guards PR #149's contract that a rejected saveBlob promise is logged exactly once and never escapes onCommit as uncaughtException. Installs a fault-injection component on B only that monkey-patches fs.createWriteStream to fail every 7th /blobs/ write with ENOENT, drives Location-component blob traffic from A, asserts the "Blob save failed for ..." line appears but uncaughtException lines do not, and that liveness (a fresh write) still propagates after failures. Adds shared helpers to clusterShared.mjs: readLog, waitForCatchUp, getMemoryInfo, peakMemory. The fault-injection fixture lives at integrationTests/cluster/fixture-blob-fail-injector/ and is opt-in via HARPER_TEST_BLOB_FAIL_INTERVAL env var. These exercise the same failure surface that affected wtk-ap-west-1 in May: unbounded synchronous decode on receive, and blob save rejections escaping the commit confirmation path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI sets HARPER_INTEGRATION_TEST_LOG_DIR, which causes the integration-testing
harness to redirect Harper's `logging.root` to a per-suite directory exposed
on `ctx.harper.logDir` rather than `{dataRootDir}/log/hdb.log`.
readLog() was only checking the dataRootDir path, so in CI it returned an
empty string — making the blob-fail-injector banner assertion fail even when
the component had loaded correctly. Check both locations now.
The receiveBacklogMemory test was also affected (its no-OOM assertion was
reading the wrong file) but happened to pass vacuously.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
3337190 to
e51f137
Compare
The shared `fixture/` Location source produces 7,500-byte blobs, which fall under Harper's FILE_STORAGE_THRESHOLD (8192 bytes) and are stored inline in the record. With no file write, the receiver's createWriteStream is never called and the fault injector has nothing to intercept — assertion #2 ("Blob save failed line appeared") failed even though the injector loaded correctly (banner present in B's log twice). Add a dedicated `fixture-large-blob-source/` with a `LargeLocation` table whose `sourcedFrom` produces 50 KB streamed blobs — comfortably above the threshold, guaranteed to take the file-backed write path on the receiver. Switch the test to deploy/hit this fixture. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
GET routing on a `sourcedFrom` table can be subtle: hitting /LargeLocation/{id}
on the receiver of a partially-populated record may re-invoke the cache-miss
handler instead of returning the locally stored record, and the request can
fail in ways that don't show up as Harper log lines.
The previous run confirmed assertions 1-4 are airtight (35 Blob save failed
log entries, 0 uncaughtException, still connected per cluster_status) — the
test was failing only on the liveness GET timing out. Switch to comparing
describe_table.record_count before/after the upsert: a direct, unambiguous
signal that doesn't depend on REST GET semantics for sourcedFrom tables.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Two cluster integration tests targeting the receive-side failure modes that knocked a prod node off the cluster earlier this month. Both are guards against regression of fixes already on
main/ pending review.receiveBacklogMemory.test.mjsRECEIVE_EVENT_HIGH_WATER_MARKbackpressure)blobSaveRejectionContainment.test.mjsoutstandingBlobsToFinishstores catch-handled promise)Why
In early May a Harper Pro 5.0.16 node OOM'd repeatedly during peer catch-up. Two distinct receive-side defects compounded:
onWSMessagedecoded every audit record in a WS message synchronously inside a tightdo { ... } while (...)loop. A single message with thousands of records ballooned heap past the 2 GB old-gen ceiling —ERR_WORKER_OUT_OF_MEMORYevery ~25 s. Fixed by PR fix: bound replication receive memory to stop worker OOM crash loops #147, which yields when the consumer queue exceedsRECEIVE_EVENT_HIGH_WATER_MARK = 100.saveBlobrejected (the same node was producingBlob save failed for X from peer Yat ~35/s, ENOENT during catch-up), the raw rejecting promise sat inoutstandingBlobsToFinishandawait Promise.all(...)insideend_txn'sonCommitpropagated it out asuncaughtException. Fixed by PR fix(replication): swallow blob save rejection in outstandingBlobsToFinish #149, which stores the catch-handled promise instead.These tests reproduce both conditions directly so future changes can't quietly reintroduce either.
What each test does
receiveBacklogMemory.test.mjskillHarper(B), then bursts 40 transactions × 500 records each on A — each transaction is a single WS message with 500 audit entries (well past the HWM of 100). Total backlog: 20 k records.system_information.memory.rssevery 500 ms while B catches up.describe_table.record_countfor an unambiguous catch-up signal (no dependency oncluster_status.lastReceivedVersionshape).ERR_WORKER_OUT_OF_MEMORYin log, peak RSS < 1.5 GB.blobSaveRejectionContainment.test.mjsLocationblob-bearing fixture deployed to both.setupHarperWithFixture) that monkey-patchesfs.createWriteStreamto emitENOENTfor every 7th call targeting/blobs/. The injector arms only whenHARPER_TEST_BLOB_FAIL_INTERVALis set, so it's inert if the fixture is ever picked up elsewhere./Location/{n}requests on A → each creates a blob viasourcedFrom→ replicates to B → every 7th blob save on B trips the injector.[error] [replication]: Blob save failed for <id> from <peer>appears (the.catchran).uncaughtExceptionlines mentioningBlob/ENOENT.*blobs— this is the regression.cluster_status.Where to look
fixture-blob-fail-injector/resources.js— the monkey-patch usescreateRequireto get the CJSfsmodule (ESM namespace objects are frozen) and replacescreateWriteStream. Harper's dist code usesrequire('node:fs').createWriteStream(...)at call time, so the patch is picked up live. Confirmed via cross-model review.BATCH_SIZE = 500andFAIL_INTERVAL = 7are tuned to comfortably exceed the relevant thresholds without blowing test time. Could be made more aggressive if CI runners turn out to be faster than expected; the bounds in the assertions stay valid either way.Risk / known-flaky areas
replicationTopology.test.mjssetup) that also affectsmain. The new tests are structurally identical toreplicationLoad.test.mjsand should run cleanly in CI even though local was noisy.blobSaveRejectionContainmentwill fail onmaintoday because PR fix(replication): swallow blob save rejection in outstandingBlobsToFinish #149 isn't merged yet. Either land that first, or this PR is the trailing half of the same change and lands after.Testing
npx oxlint --quieton all new files → 0 errors, only the pre-existingnew Array(concurrency)warning inclusterShared.mjsthatlint:requiredalready tolerates on mainnpx prettier --check→ cleannode --checkon all new mjs/js → cleanTest plan
receiveBacklogMemory(it should — PR fix: bound replication receive memory to stop worker OOM crash loops #147 is in main)blobSaveRejectionContainmentpassesgrep '[blob-fail-injector] installed' ...) — assertion 1 should fail loudly if not🤖 Generated with Claude Code