test(issue-135): integration repro for Resource SDK search after restart#140
test(issue-135): integration repro for Resource SDK search after restart#140kriszyp wants to merge 5 commits into
Conversation
Three distinct failures in tests that were never running because the glob only matched *.ts files: - cluster/fullyConnectedReplication, cluster/replicationLoad: imported targz from a relative path (../../core/integrationTests/utils/targz.ts) that no longer exists; moved to @harperfast/integration-testing which already exports it (matches how issue135 test does it) - analytics fixture: spawn() calls lacked the required name option (Harper's wrapped spawn enforces this to deduplicate long-lived processes); added unique name per child using Date.now() + index - cloneNode "three more nodes": waitForAvailableStatus only means the node is up, not that the full replication mesh has established; added a poll loop (same pattern as replicationTopology.test.mjs) that retries up to 20× with backoff before asserting all sockets are connected Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…ollow-tags only pushes annotated)
…restart Adds a single-node integration test and fixture component for serent-canopy issue #135 (tables.X.search returns subset after Harper restart while ops API returns full set). Results summary: - Scenario A (single-node graceful restart): PASSES — not reproduced here. - Scenarios B/C (multi-node replication): BLOCKED — cluster JWT key generation fails in the integration test environment, same as replicationLoad/ fullyConnectedReplication. Scenario B left as test.skip with explanation. - Unit-level results (subprocess restart, schema-change reindex): all pass. Bug is scoped to multi-node replication state. The most likely mechanism: index entries for rows received via replication replay are not rebuilt on restart, leaving them invisible to tables.X.search (secondary index walk) while ops API scan of primaryStore still sees them. Files added: - integrationTests/cluster/issue135-resource-search-after-restart.test.mjs - integrationTests/cluster/issue135-fixture/ (SearchCount resource + ScoreEvidence schema) - integrationTests/cluster/ISSUE-135-PLAN.md / ISSUE-135-FINDINGS.md Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
|
Reviewed; no blockers found. |
- Adds Scenario B (`issue135-replicated-search-after-restart.test.mjs`) which exercises the most likely repro vector for serent-canopy issue #135: insert rows on node 0 → wait for replication to drain to node 1 → restart node 1 → query node 1 via the Resource SDK (`tables.X.search`). Bug does not reproduce — both ops API and Resource SDK return the full row set. - Keeps the two scenarios in separate files because `node --test` runs top-level suites concurrently and the cluster startup of Scenario B interferes with Scenario A's single-node restart timing when colocated. - Adds a post-deploy `describe_table` poll to Scenario A — `pollHealth` alone doesn't guarantee the graphql schema has finished loading and the first insert sometimes returned "database 'data' does not exist". - Bumps `@harperfast/integration-testing` to ^0.3.0. - Bumps the core submodule pointer to include the companion fix (`fix: register operations in worker threads so replication WS can dispatch them`) which the cluster setup in Scenario B depends on. Updates ISSUE-135-FINDINGS.md to reflect the new test results. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
|
Review the following changes in direct dependencies. Learn more about Socket for GitHub.
|
There was a problem hiding this comment.
name is not a valid spawn option.
There was a problem hiding this comment.
It is with harper's spawn (where we replace the native node module with constrained child_process module), and name is required (it performs deduplication/checking if the process has already been spawned by another thread). This was based on our conversation where you said you thought it was best to preserve the existing spawn function for this functionality.
We still could add our own spawn if this creates confusion.
There was a problem hiding this comment.
I'm confused. node:child_process is not Node's node:child_process? I vividly remember this conversation. Given the context, I don't think we have many options.
There was a problem hiding this comment.
It is attenuated/constrained version of node:child_process, and returned(imported) for modules that import node:child_process. But maybe we should have the process de-duping version imported from harper? (but this goes back to the problem; I have never seen a module actually use node:child_process correctly. Failing fast isn't really any worse than feigning success 🤷 ).
There was a problem hiding this comment.
I'm confused. node:child_process is not Node's node:child_process? I vividly remember this conversation. Given the context, I don't think we have many options.
Summary
Adds integration tests reproducing canopy issue #135 (Resource SDK
tables.X.searchreturning subset after Harper restart while ops API returns full set), and bumps the core submodule to include the companion fix that enables multi-node cluster integration tests.Companion PR
Depends on HarperFast/harper#523 in core (
fix: register operations in worker threads so replication WS can dispatch them). Without that, Scenario B and any multi-node cluster test fails withOperation 'add_node_back' not found and connection was required to sign certificate.Test results
resetDatabases()+ indexed searchissue135-resource-search-after-restart.test.mjs)issue135-replicated-search-after-restart.test.mjs)The bug does not reproduce in any harness scenario. This is meaningful negative data: the issue #135 fingerprint is narrower than expected (single-node restart, replication-receiving-node restart, schema-change reindex with restart — none reproduce it). Likely triggers remaining: Fabric topology (>2 nodes, multi-region), production-scale load, or a specific combination of rolling
deploy_componentwith active replication backlog that this harness doesn't exercise.Changes in this PR
integrationTests/cluster/issue135-resource-search-after-restart.test.mjs— Scenario A (single-node).integrationTests/cluster/issue135-replicated-search-after-restart.test.mjs— Scenario B (multi-node, new).integrationTests/cluster/issue135-fixture/— fixture component (SearchCountResource +ScoreEvidenceschema).integrationTests/cluster/ISSUE-135-FINDINGS.md— what was tried, what passes, what the result implies.integrationTests/cluster/ISSUE-135-PLAN.md— original plan doc kept for context.@harperfast/integration-testingto^0.3.0.Review attention
node --testruns top-level suites concurrently — cluster startup in Scenario B interferes with Scenario A's single-node restart timing when colocated. Worth confirming this is the right pattern vs. forcing--test-concurrency=1.describe_tableafter the post-deploypollHealthto avoid a race where the graphql schema hasn't finished loading and the first insert hitsdatabase 'data' does not exist. This may be a Harper Pro issue worth investigating separately.SearchCountresource usesfor awaitovertables.X.search(an AsyncIterable). Worth confirming this is the canonical Resource handler pattern.🤖 Generated with Claude Sonnet 4.6 (1M context)