test(stress): long-running soak + worker cascade + orphan + adversity #171
kriszyp wants to merge 5 commits into
Four new long-running stress tests in integrationTests/stress/, gated on HARPER_RUN_STRESS_TESTS=1 so the normal integration suite skips them. Tests run from a new stress-tests.yaml workflow on workflow_dispatch or the weekly Sunday cron.

Each test targets a production failure mode observed on wtk-ap-west-1 that the existing PR-blocking suite can't reach:

- soakWithRollingRestarts: 4-node mesh, prerender-style mixed-blob workload, rolling SIGKILL+restart cycle. Default 4 hours in workflow, configurable via HARPER_STRESS_SOAK_MINUTES. Asserts no OOM, no uncaughtException, no `Error sending blob ENOENT`, and ≤1% record-count drift across all nodes after a final convergence wait.
- workerExitCascade: PR #147 stagger fix coverage that the existing receiveBacklogMemory test can't exercise (it runs with THREADS_COUNT=1). Boots a 4-worker target node, sustained writes across 6 databases, kills exactly one worker via the new SuicideWorker component, then inspects post-kill `Setting up subscription with leader` log timestamps for ≥80ms pairwise spacing and ensures no cascade (≤1 new worker PID).
- blobOrphanRace: investigation test for the qub `Error sending blob ENOENT` pattern we couldn't fully diagnose. Heavy supersede churn over a small keyspace + a mid-test B restart. Default 15 min locally / 60 in workflow. If we ever do reproduce the orphan, the test description is set up so it points reviewers at the bug.
- rapidReconnectAdversity: pivoted from a tc/netem design (needed root) to rapid kill+restart cycles, ~15s apart, across 3 nodes. Same code-path coverage as a network-adversity proxy (connect, resubscribe, blob stream resumption, listener cleanup) without any sudo. Asserts no MaxListenersExceededWarning (the recent #161/#173 leak fixes are what this is guarding) plus the usual no-OOM/no-uncaught/convergence trio.
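The ≥80ms pairwise-spacing check described for workerExitCascade can be sketched roughly as follows. This is an illustration only, not the test's actual code: the helper names `minPairwiseGapMs` and `subscriptionTimestamps` are hypothetical, and the sketch assumes each log line begins with an ISO-8601 timestamp followed by a space, which may not match Harper's real log format.

```javascript
// Hypothetical helper: smallest gap (ms) between any two timestamps.
// After sorting, the minimum pairwise gap is always between neighbours,
// so checking adjacent entries is enough.
function minPairwiseGapMs(timestampsMs) {
  const sorted = [...timestampsMs].sort((a, b) => a - b);
  let minGap = Infinity;
  for (let i = 1; i < sorted.length; i++) {
    minGap = Math.min(minGap, sorted[i] - sorted[i - 1]);
  }
  return minGap; // Infinity when there are fewer than two timestamps
}

// Hypothetical extraction step. ASSUMPTION: every matching log line starts
// with an ISO-8601 timestamp, then a space.
function subscriptionTimestamps(logText, marker = 'Setting up subscription with leader') {
  return logText
    .split('\n')
    .filter((line) => line.includes(marker))
    .map((line) => Date.parse(line.split(' ')[0]))
    .filter((t) => !Number.isNaN(t));
}
```

A test would then assert something like `minPairwiseGapMs(subscriptionTimestamps(postKillLog)) >= 80`.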
Adds two fixtures:

- fixture-prerender-workload: sourcedFrom-backed Prerender table with deterministic-but-bimodal blob sizes (60% inline / 40% file-backed) to exercise both blob storage paths in one workload.
- fixture-suicide-worker: REST endpoint /SuicideWorker that calls process.exit(137) on the worker that handles the request.

Also adds shared helpers in stressShared.mjs (clusterSnapshot, waitForAllConnected, sampleMetrics, summariseSamples, prerenderId) plus an in-process metrics sampler that captures memory/threads at fixed intervals for post-run analysis.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
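One way to get "deterministic-but-bimodal" blob sizes is to hash the record id and use the hash to pick a size bucket. The sketch below is an assumption about how such a rule could look, not the fixture's actual code: `blobSizeFor`, the FNV-1a hash choice, and the byte ranges are all illustrative.

```javascript
// FNV-1a 32-bit: a well-known deterministic string hash.
function fnv1a(str) {
  let h = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0; // keep the value an unsigned 32-bit int
  }
  return h;
}

// Hypothetical sizing rule: roughly 60% of ids get a small size (inline
// storage), roughly 40% get a large size (file-backed storage).
// The exact byte ranges here are illustrative assumptions.
function blobSizeFor(id) {
  const h = fnv1a(String(id));
  const bucket = h % 100;
  if (bucket < 60) return 512 + (h % 4096); // 512..4607 bytes, under 8 KB
  return 16 * 1024 + (h % (48 * 1024)); // 16 KB up to just under 64 KB
}
```

Because the size depends only on the id, re-running the workload regenerates the same mix of inline and file-backed blobs for the same keys.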
Two iteration fixes from local validation runs:

1. After SIGKILL+restart, update `ctx.nodes[idx]` with the new harper handle. Without this, round-robin cycles that wrap (cycle N where N >= NODE_COUNT) try to kill the original ChildProcess reference, which is already dead, and then startHarper attempts to bind the same hostname:port as the still-running harper from the previous restart, which fails. Observed locally on the adversity test (4 cycles × 3 nodes = wrap on cycle 4) but applies to the soak too. Same fix in both files.
2. Relax convergence checks from strict equality to ≤1% drift, plus poll for convergence (60–120s) before asserting. Strict equality is too brittle under heavy restart churn: at the moment traffic stops there can be a 1–2 record tail in flight that hasn't replicated yet. Same relaxation across soak, orphan, and adversity.

Production-pattern assertions (no OOM / uncaught / orphan / listener leak) remain strict: those are the failures we actually want to catch. Local runs of orphan and adversity both reported 0 across all four markers; only the convergence assertion failed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
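The relaxed convergence check (poll, then assert ≤1% drift) could look something like the sketch below. The names `driftPercent` and `waitForConvergence`, the `getCounts` callback, and the choice to measure drift relative to the largest count are assumptions made for illustration; the tests in this PR may define drift differently.

```javascript
// Relative drift between the largest and smallest per-node record counts,
// in percent. ASSUMPTION: drift is measured against the largest count.
function driftPercent(counts) {
  const max = Math.max(...counts);
  const min = Math.min(...counts);
  return max === 0 ? 0 : ((max - min) / max) * 100;
}

// Poll `getCounts` (a hypothetical async callback returning one record count
// per node) until drift is within tolerance or the deadline passes.
async function waitForConvergence(
  getCounts,
  { maxDriftPct = 1, timeoutMs = 120_000, intervalMs = 1_000 } = {}
) {
  const deadline = Date.now() + timeoutMs;
  let counts = await getCounts();
  while (driftPercent(counts) > maxDriftPct && Date.now() < deadline) {
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
    counts = await getCounts();
  }
  const driftPct = driftPercent(counts);
  return { counts, driftPct, converged: driftPct <= maxDriftPct };
}
```

Returning the final counts alongside the verdict lets the caller print them in the assertion message, which helps when a run fails only on convergence.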
Local validation (orphan2 run, 5 min churn over 80 keys, mid-test B restart) reached the convergence assertion with A=915 B=915: converged exactly but well above the 80-key keyspace. Harper's `sourcedFrom` cache evidently creates new record versions per cache miss under high churn, so describe_table.record_count can substantially exceed unique key count. The orphan repro is about blob lifecycle, not exact record cardinality, so drop the upper bound and keep the drift check + nonzero check.

Production-pattern assertions on this same run:

- 0 "Error sending blob ENOENT" on both nodes
- 0 uncaughtException on both nodes
- record_count converged exactly (915 = 915) under churn

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
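Counting the production-pattern markers in a node log can be as simple as the sketch below. The marker strings for the orphan, uncaught, and listener-leak cases come from the descriptions above; the OOM pattern is an assumption (V8's fatal "heap out of memory" message), and `countMarkers` is a hypothetical name rather than a helper from this PR.

```javascript
// Patterns to scan for. The `oom` pattern is an ASSUMPTION based on V8's
// fatal-error text; the real tests may look for something else.
const MARKERS = {
  orphan: /Error sending blob.*ENOENT/,
  uncaught: /uncaughtException/,
  listenerLeak: /MaxListenersExceededWarning/,
  oom: /heap out of memory/i,
};

// Count how many log lines match each marker.
function countMarkers(logText) {
  const counts = Object.fromEntries(Object.keys(MARKERS).map((name) => [name, 0]));
  for (const line of logText.split('\n')) {
    for (const [name, pattern] of Object.entries(MARKERS)) {
      if (pattern.test(line)) counts[name]++;
    }
  }
  return counts;
}
```

A strict assertion would then require every count to be zero for every node's log.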
await killHarper({ harper: B });
await startHarper(
  { name: ctx.name, harper: { dataRootDir: B.dataRootDir, hostname: B.hostname } },
  {
    config: {
      analytics: { aggregatePeriod: -1 },
      logging: { colors: false, console: true, level: 'debug' },
      replication: { securePort: B.hostname + ':9933' },
      threads: { count: THREADS_PER_NODE },
    },
    env: { HARPER_NO_FLUSH_ON_EXIT: true },
  }
);
restarted = true;
Stale B reference after restart — post-restart log assertions silently incomplete in CI
startHarper receives a fresh { dataRootDir, hostname } object and mutates it in-place (adding logDir, httpURL, etc.). The new logDir is on that fresh object; the original B / ctx.nodes[1] still holds the pre-restart logDir. Because readLog checks node.logDir first and returns immediately when the file exists, the post-restart assertions (#1 orphans on B, #2 uncaughtException on B) silently read from B's pre-restart log in CI where HARPER_INTEGRATION_TEST_LOG_DIR is set. If the orphan manifests during B's reconnection window — the riskiest window — it would be missed.
The soak and adversity tests both fix this with ctx.nodes[victimIdx] = restartCtx.harper. Apply the same pattern here:
console.log(`[orphan] mid-test restart of B (${B.hostname})`);
await killHarper({ harper: B });
const restartCtx = {
  name: ctx.name,
  harper: { dataRootDir: B.dataRootDir, hostname: B.hostname },
};
await startHarper(restartCtx, {
  config: {
    analytics: { aggregatePeriod: -1 },
    logging: { colors: false, console: true, level: 'debug' },
    replication: { securePort: B.hostname + ':9933' },
    threads: { count: THREADS_PER_NODE },
  },
  env: { HARPER_NO_FLUSH_ON_EXIT: true },
});
ctx.nodes[1] = restartCtx.harper;
restarted = true;
Then re-destructure const [A, B] = ctx.nodes; just before the drain loop and log assertions (or use ctx.nodes[0] / ctx.nodes[1] directly) so readLog(B) and sendOperation(B, ...) pick up the updated handle.
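The stale-reference problem can be reproduced in isolation with a stand-in for the harness. In this sketch `fakeStartHarper` is a made-up substitute for `startHarper` that only mimics the "mutates the passed-in object in place" behaviour described above, and the per-boot `logDir` naming is contrived so the difference is visible; neither reflects the real harness internals.

```javascript
// Stand-in for startHarper: mutates ctx.harper in place. The logDir naming
// is an ASSUMPTION made up for this demonstration.
let bootCount = 0;
async function fakeStartHarper(ctx) {
  bootCount += 1;
  ctx.harper.logDir = `/tmp/logs/${ctx.harper.hostname}-boot${bootCount}`;
  return ctx;
}

async function demo() {
  const ctx = { name: 'orphan', nodes: [] };
  const first = { name: ctx.name, harper: { dataRootDir: '/tmp/b', hostname: 'node-b' } };
  await fakeStartHarper(first);
  ctx.nodes[1] = first.harper;
  let B = ctx.nodes[1];

  // Restart: a fresh object receives the new logDir, the old one does not.
  const restartCtx = {
    name: ctx.name,
    harper: { dataRootDir: B.dataRootDir, hostname: B.hostname },
  };
  await fakeStartHarper(restartCtx);
  const staleLogDir = B.logDir; // still the pre-restart value

  // Fix: write the new handle back and refresh the local alias.
  ctx.nodes[1] = restartCtx.harper;
  B = ctx.nodes[1];
  return { staleLogDir, freshLogDir: B.logDir };
}
```

Running `demo()` shows the alias pointing at the first boot's directory until the handle is written back, which is the situation the suggested change avoids.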
Fixed in e97a3a3 — captured the restarted handle, updated ctx.nodes[1] and the local B alias to match the soak + adversity pattern.
🤖 AI-generated reply
Claude-bot review on PR #171 flagged that blobOrphanRace doesn't update ctx.nodes[1] after the mid-test B restart. While the logDir happens to be hostname-derived and stable across restarts (so readLog still captures post-restart logs), the suggested fix is correct hygiene and matches the pattern used in soak + adversity. Captures the restarted harper handle into ctx.nodes[1] and the local `B` alias so all later references see the new context. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ow consumer, partition)

Adds four long-running stress tests aimed at replication failure modes that the existing stress suite doesn't cover, plus a small TCP-proxy helper for the partition case.

- replayCatchupSeam: SIGKILL with HARPER_NO_FLUSH_ON_EXIT forces replayLogs to fire on restart while peers are mid catch-up. Asserts no double-apply / no loss across the seam and that the replay path actually ran (>=1 "Replayed N records" warn). Local: 5,020 writes, 200/200/200 convergence, 133s.
- backlogRecovery: kill one peer for 5 min while three others churn, restart, watch the drain. Asserts peer-side per-peer queue stays bounded (peer RSS held at ~450 MB across both 2-min and 5-min offline windows - same number with 2x the backlog) and the rejoining node catches up without OOM. Local: 28k writes, 800/800/800/800 convergence.
- slowConsumerBackpressure: A (4 worker threads) vs B (1 worker thread) + high-concurrency churn. Asserts A's RSS stays bounded under sustained pressure and the cluster reconverges. backPressurePercent observation is a soft warn rather than a hard assertion since the loopback asymmetry may not be enough to actually trigger backpressure (and didn't, in local validation). Local: 32k writes, 500/500 convergence.
- partitionHealConvergence + replicationProxy: split-brain test driving a userspace TCP proxy between two nodes. Currently SKIPPED by default (gated on HARPER_STRESS_ALLOW_INSECURE_REPLICATION=1 in addition to the usual HARPER_RUN_STRESS_TESTS=1). Blocker: Harper's replication WS validates the cert SAN/altnames against the dial target hostname, and self-signed replication certs don't include the proxy's hostname. File header documents the three paths to unblock plus the secondary concern that even after the TLS path opens, the test needs to verify Harper's replication isn't bidirectionally dialed (else blocking only the B->A proxy won't actually partition).
All four tests follow the existing stress-suite conventions: gated on HARPER_RUN_STRESS_TESTS=1, skipped placeholder otherwise, deterministic prerender-style workload via the existing fixture, drift < 1% convergence threshold matching the rest of the suite. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
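The "replay path actually ran" assertion for replayCatchupSeam amounts to finding at least one "Replayed N records" line in the restarted node's log. A minimal sketch follows; `replayedRecordCounts` is a hypothetical name, and the regular expression assumes the log text literally reads "Replayed <number> records", which may differ from Harper's exact wording.

```javascript
// Extract the N from every "Replayed N records" occurrence in a log.
// ASSUMPTION: the message uses exactly this wording with a decimal count.
function replayedRecordCounts(logText) {
  return [...logText.matchAll(/Replayed (\d+) records/g)].map((match) => Number(match[1]));
}
```

An empty result means the seam was never exercised, so a test would fail on `replayedRecordCounts(log).length === 0` before looking at convergence at all.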
test:
  - name: 'worker-exit-cascade'
    file: 'integrationTests/stress/workerExitCascade.test.mjs'
    env_vars: ''
  - name: 'soak-rolling-restarts'
    file: 'integrationTests/stress/soakWithRollingRestarts.test.mjs'
    env_vars: 'HARPER_STRESS_SOAK_MINUTES'
  - name: 'blob-orphan-race'
    file: 'integrationTests/stress/blobOrphanRace.test.mjs'
    env_vars: 'HARPER_STRESS_ORPHAN_MINUTES'
  - name: 'rapid-reconnect-adversity'
    file: 'integrationTests/stress/rapidReconnectAdversity.test.mjs'
    env_vars: 'HARPER_STRESS_ADVERSITY_MINUTES'
The matrix only covers the four original tests. The three runnable new tests added in this push — backlogRecovery, replayCatchupSeam, and slowConsumerBackpressure — are absent, so they'll never execute in CI (weekly cron or manual dispatch). partitionHealConvergence is intentionally blocked and documented, so excluding that one is fine.
Suggested additions:
test:
  - name: 'worker-exit-cascade'
    file: 'integrationTests/stress/workerExitCascade.test.mjs'
    env_vars: ''
  - name: 'soak-rolling-restarts'
    file: 'integrationTests/stress/soakWithRollingRestarts.test.mjs'
    env_vars: 'HARPER_STRESS_SOAK_MINUTES'
  - name: 'blob-orphan-race'
    file: 'integrationTests/stress/blobOrphanRace.test.mjs'
    env_vars: 'HARPER_STRESS_ORPHAN_MINUTES'
  - name: 'rapid-reconnect-adversity'
    file: 'integrationTests/stress/rapidReconnectAdversity.test.mjs'
    env_vars: 'HARPER_STRESS_ADVERSITY_MINUTES'
  - name: 'replay-catchup-seam'
    file: 'integrationTests/stress/replayCatchupSeam.test.mjs'
    env_vars: ''
  - name: 'backlog-recovery'
    file: 'integrationTests/stress/backlogRecovery.test.mjs'
    env_vars: 'HARPER_STRESS_BACKLOG_OFFLINE_MINUTES'
  - name: 'slow-consumer-backpressure'
    file: 'integrationTests/stress/slowConsumerBackpressure.test.mjs'
    env_vars: 'HARPER_STRESS_SLOW_MINUTES'
You'll also want to expose the new duration knobs (backlog-offline-minutes, slow-minutes) in the workflow_dispatch.inputs block and wire them into the env: section of the Run step, matching the pattern the existing four tests use.
Summary
Four long-running replication stress tests targeting the production failure modes observed on wtk-ap-west-1 in early May, plus workflow scaffolding to run them weekly (or on demand) without affecting the PR-blocking integration matrix.
- `soakWithRollingRestarts`: …
- `workerExitCascade`: … `WORKER_EXIT_REASSIGN_STAGGER_MS = 100`. The existing `receiveBacklogMemory` test can't exercise this because it runs with `THREADS_COUNT=1`.
- `blobOrphanRace`: … `Error sending blob … ENOENT` pattern we couldn't fully diagnose. Heavy supersede churn over a small keyspace + a mid-test restart. If we ever do hit the orphan, the test description points reviewers at the bug.
- `rapidReconnectAdversity`: … `tc/netem` adversity proxy (connect, resubscribe, blob resume, listener cleanup) but without `NET_ADMIN`. Asserts no `MaxListenersExceededWarning` in node logs, which guards the recent #161/#173 leak fixes.

All four are gated on `HARPER_RUN_STRESS_TESTS=1`. The disabled branch registers a single skipped placeholder test so the normal integration runner treats the file as a no-op.

Why its own workflow

The longest of these (soak) runs ~4 h with defaults. PR shards time out at 15 min and the matrix is already 12 jobs. Stress tests live in `.github/workflows/stress-tests.yaml`, triggered by `workflow_dispatch` (per-test duration knobs in the dialog) plus a weekly Sunday cron, so they don't slow PR signal and still get regular coverage.

Local validation runs
(Note: the integration-testing harness emits `MaxListenersExceededWarning: 11 exit listeners added to [process]` around cycle 8 of adversity. That's the harness's `process.on('exit', …)` accumulating across many `startHarper` calls, not anything in Harper. My assertion scans node logs only, so it doesn't fail the test. Worth filing separately against the harness.)

A longer soak attempt (10 min, ~10 cycles, 2.5 wraps) failed with no test-body console output captured; it looked like a node:test reporter/buffering interaction triggered around a wrap cycle. All assertions appear to have held (the harper logs showed 0 OOM / 0 uncaught / 0 orphans), but the test result line lacked detail. The 5-min single-wrap run above passes cleanly, and the 4-h CI run will produce richer artifacts. Worth investigating if it recurs.

Two iteration fixes from local validation

1. `ctx.nodes[idx]` update after restart. I originally didn't write the new `harper` handle back to the array. Round-robin cycles that wrap (cycle ≥ `NODE_COUNT`) then killed the stale `ChildProcess` reference, which was already dead, and `startHarper` then tried to bind a hostname:port already held by the still-running harper from the previous restart. Now `ctx.nodes[idx] = restartCtx.harper`. Applied to soak + adversity. Confirmed working on soak4 (5 cycles with 1 wrap, exact convergence) and adversity (10 cycles with 3 wraps, exact convergence).
2. … (`blobOrphanRace` also drops a bogus `maxCount <= KEYSPACE` upper bound: Harper's `sourcedFrom` cache produces more record versions than the unique-key count under churn.)

Fixtures
- `fixture-prerender-workload/`: `sourcedFrom` cache table with deterministic-but-bimodal blob sizing (60 % under 8 KB → inline, 40 % above → file-backed). Matches the wtk production mix and exercises both blob storage paths in one workload.
- `fixture-suicide-worker/`: single REST endpoint `/SuicideWorker` that returns `{ threadId, pid }` and schedules `process.exit(137)` via `setImmediate` so the HTTP response flushes before the worker dies. Lets the cascade test kill exactly one worker on demand.

Shared helpers (`stressShared.mjs`)

- `clusterSnapshot(node)`: uniform-shape view of `cluster_status`.
- `waitForAllConnected(node)`, `waitForRecordCount(node, table, target)`: bounded polling.
- `sampleMetrics(node, { intervalMs })` + `summariseSamples(samples)`: in-process memory/thread sampler for post-run analysis.
- `fetchWithRetry` with `AbortSignal.timeout` so a stalled connection to a mid-kill node can't hang the test.
- `readLog(node)`: checks `node.logDir` first (set by the harness when `HARPER_INTEGRATION_TEST_LOG_DIR` is in env) before falling back to `{dataRootDir}/log/hdb.log`. Same fix as "test: replication receive-side stress regressions (catch-up memory + blob save)" #150.

Where to look
- … `fail-fast: false` and `timeout-minutes: 260` to fit the 240 min soak default. Per-test duration knobs surface in the `workflow_dispatch` dialog.
- `stressShared.mjs:fetchWithRetry`: added a 5 s per-attempt `AbortSignal.timeout`. Without it a fetch against a mid-kill node could hold the test for the full retry budget. Bump to 10 s if CI runners turn out to be slower than expected.
- `fixture-suicide-worker` uses `setImmediate(() => process.exit(137))` so the HTTP response flushes before the worker dies. The cascade test's `[cascade] suicide response: {"threadId":1,...}` line confirms this works.

How to run

What this PR does NOT do

- … `blobOrphanRace` reproduces it in the workflow, the test description sets up reviewers to act on it.

🤖 Generated with Claude Code
Update 2026-05-20: replication-correctness suite
Adds four more long-running stress tests in commit `feaa1a2`, aimed at replication scenarios the original four don't cover. Same gating, same workload fixture, same convergence/drift conventions.

- `replayCatchupSeam`: … `replayLogs.ts` (on-restart audit-tail replay) and live replication catch-up. SIGKILL with `HARPER_NO_FLUSH_ON_EXIT=true` forces replay to fire; meanwhile peers send catch-up. Asserts no double-apply / no loss and that the replay path actually executed (≥1 `"Replayed N records"` warn; otherwise the seam wasn't exercised).
- `backlogRecovery`: …
- `slowConsumerBackpressure`: … `backPressurePercent` observation is a soft warn (logged, not asserted): on loopback, the asymmetry may not be enough to actually trigger backpressure, and in local validation it wasn't.
- `partitionHealConvergence` (+ `replicationProxy.mjs` helper): … `record_count` equality and per-key agreement on sampled ids. Skipped by default (requires both `HARPER_RUN_STRESS_TESTS=1` and `HARPER_STRESS_ALLOW_INSECURE_REPLICATION=1`); the blocker is in the file header.

Local validation
Cross-model review (agy / Antigravity CLI)
Ran a cross-model review on the new files. Acted on:
- … `replicationProxy.mjs` (close events fire on both ends; added a `cleaned` guard).

Declined:

- … `(ctx) => { ctx.nodes = ... }` pattern and pass; finding doesn't apply to this version of `node:test`.
- … `node:test` runs each test in process isolation, so the test-process exit reclaims `setInterval` handles. Not a real leak in practice.

Side observation (separate follow-up)
The `manageThreads.restartWorkers` path in core emits `MaxListenersExceededWarning: 11 exit listeners added to [Worker]` once during `deploy_component` with `restart: true`, on a node with `threads.count: 4`. Surfaced by `backlogRecovery` while I was developing it. Real but unrelated to this PR; flagged separately.

🤖 Generated with Claude Code