Skip to content

test(issue-135): integration repro for Resource SDK search after restart#140

Open
kriszyp wants to merge 5 commits into
mainfrom
test/issue-135-replication-repro
Open

test(issue-135): integration repro for Resource SDK search after restart#140
kriszyp wants to merge 5 commits into
mainfrom
test/issue-135-replication-repro

Conversation

@kriszyp
Copy link
Copy Markdown
Member

@kriszyp kriszyp commented May 12, 2026

Summary

Adds integration tests reproducing canopy issue #135 (Resource SDK tables.X.search returning subset after Harper restart while ops API returns full set), and bumps the core submodule to include the companion fix that enables multi-node cluster integration tests.

Companion PR

Depends on HarperFast/harper#523 in core (fix: register operations in worker threads so replication WS can dispatch them). Without that, Scenario B and any multi-node cluster test fails with Operation 'add_node_back' not found and connection was required to sign certificate.

Test results

Layer Result
Unit: in-process resetDatabases() + indexed search ✅ Pass
Unit: real subprocess boundary (fork writer, read in parent) ✅ Pass
Unit: schema-change reindex with mid-flight interruption ✅ Pass
Integration Scenario A: single-node graceful restart (issue135-resource-search-after-restart.test.mjs) ✅ Pass
Integration Scenario B: 2-node, write on node 0 → replicate to node 1 → restart node 1 (issue135-replicated-search-after-restart.test.mjs) ✅ Pass

The bug does not reproduce in any harness scenario. This is meaningful negative data: the issue #135 fingerprint is narrower than expected (single-node restart, replication-receiving-node restart, schema-change reindex with restart — none reproduce it). Likely triggers remaining: Fabric topology (>2 nodes, multi-region), production-scale load, or a specific combination of rolling deploy_component with active replication backlog that this harness doesn't exercise.

Changes in this PR

  • integrationTests/cluster/issue135-resource-search-after-restart.test.mjs — Scenario A (single-node).
  • integrationTests/cluster/issue135-replicated-search-after-restart.test.mjs — Scenario B (multi-node, new).
  • integrationTests/cluster/issue135-fixture/ — fixture component (SearchCount Resource + ScoreEvidence schema).
  • integrationTests/cluster/ISSUE-135-FINDINGS.md — what was tried, what passes, what the result implies.
  • integrationTests/cluster/ISSUE-135-PLAN.md — original plan doc kept for context.
  • Bump @harperfast/integration-testing to ^0.3.0.
  • Core submodule pointer bumped to include fix: register operations in worker threads so replication WS can dispatch them harper#523.

Review attention

  • Two suites are split across two files because node --test runs top-level suites concurrently — cluster startup in Scenario B interferes with Scenario A's single-node restart timing when colocated. Worth confirming this is the right pattern vs. forcing --test-concurrency=1.
  • Scenario A now polls describe_table after the post-deploy pollHealth to avoid a race where the graphql schema hasn't finished loading and the first insert hits database 'data' does not exist. This may be a Harper Pro issue worth investigating separately.
  • Fixture's SearchCount resource uses for await over tables.X.search (an AsyncIterable). Worth confirming this is the canonical Resource handler pattern.

🤖 Generated with Claude Sonnet 4.6 (1M context)

kriszyp and others added 4 commits May 12, 2026 16:01
Three distinct failures in tests that were never running because the
glob only matched *.ts files:

- cluster/fullyConnectedReplication, cluster/replicationLoad: imported
  targz from a relative path (../../core/integrationTests/utils/targz.ts)
  that no longer exists; moved to @harperfast/integration-testing which
  already exports it (matches how issue135 test does it)

- analytics fixture: spawn() calls lacked the required name option
  (Harper's wrapped spawn enforces this to deduplicate long-lived
  processes); added unique name per child using Date.now() + index

- cloneNode "three more nodes": waitForAvailableStatus only means the
  node is up, not that the full replication mesh has established; added
  a poll loop (same pattern as replicationTopology.test.mjs) that retries
  up to 20× with backoff before asserting all sockets are connected

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…restart

Adds a single-node integration test and fixture component for serent-canopy
issue #135 (tables.X.search returns subset after Harper restart while ops API
returns full set).

Results summary:
- Scenario A (single-node graceful restart): PASSES — not reproduced here.
- Scenarios B/C (multi-node replication): BLOCKED — cluster JWT key generation
  fails in the integration test environment, same as replicationLoad/
  fullyConnectedReplication. Scenario B left as test.skip with explanation.
- Unit-level results (subprocess restart, schema-change reindex): all pass.

Bug is scoped to multi-node replication state. The most likely mechanism:
index entries for rows received via replication replay are not rebuilt on
restart, leaving them invisible to tables.X.search (secondary index walk)
while ops API scan of primaryStore still sees them.

Files added:
- integrationTests/cluster/issue135-resource-search-after-restart.test.mjs
- integrationTests/cluster/issue135-fixture/ (SearchCount resource + ScoreEvidence schema)
- integrationTests/cluster/ISSUE-135-PLAN.md / ISSUE-135-FINDINGS.md

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@kriszyp kriszyp requested a review from a team as a code owner May 12, 2026 23:17
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 12, 2026

Reviewed; no blockers found.

- Adds Scenario B (`issue135-replicated-search-after-restart.test.mjs`) which
  exercises the most likely repro vector for serent-canopy issue #135:
  insert rows on node 0 → wait for replication to drain to node 1 → restart
  node 1 → query node 1 via the Resource SDK (`tables.X.search`). Bug does
  not reproduce — both ops API and Resource SDK return the full row set.
- Keeps the two scenarios in separate files because `node --test` runs
  top-level suites concurrently and the cluster startup of Scenario B
  interferes with Scenario A's single-node restart timing when colocated.
- Adds a post-deploy `describe_table` poll to Scenario A — `pollHealth`
  alone doesn't guarantee the graphql schema has finished loading and the
  first insert sometimes returned "database 'data' does not exist".
- Bumps `@harperfast/integration-testing` to ^0.3.0.
- Bumps the core submodule pointer to include the companion fix
  (`fix: register operations in worker threads so replication WS can
  dispatch them`) which the cluster setup in Scenario B depends on.

Updates ISSUE-135-FINDINGS.md to reflect the new test results.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@socket-security
Copy link
Copy Markdown

Review the following changes in direct dependencies. Learn more about Socket for GitHub.

Diff Package Supply Chain
Security
Vulnerability Quality Maintenance License
Updated@​harperfast/​integration-testing@​0.2.0 ⏵ 0.3.078 +310010094 +5100

View full report

Base automatically changed from re-enable-integration-testing to main May 13, 2026 14:54
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

name is not a valid spawn option.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is with harper's spawn (where we replace the native node module with constrained child_process module), and name is required (it performs deduplication/checking if the process has already been spawned by another thread). This was based on our conversation where you said you thought it was best to preserve the existing spawn function for this functionality.
We still could add our own spawn if this creates confusion.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused. node:child_process is not Node's node:child_process? I vividly remember this conversation. Given the context, I don't think we have many options.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is attenuated/constrained version of node:child_process, and returned(imported) for modules that import node:child_process. But maybe we should have the process de-duping version imported from harper? (but this goes back to the problem; I have never seen a module actually use node:child_process correctly. Failing fast isn't really any worse than feigning success 🤷 ).

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused. node:child_process is not Node's node:child_process? I vividly remember this conversation. Given the context, I don't think we have many options.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants