test: add stage-local contract tests for compiler and query pipelines by venikman · Pull Request #32 · venikman/fpf-memory

venikman · 2026-04-13T01:44:05Z

What

Add stage-local contract tests for both the compiler and query pipelines, so that a test failure pinpoints the specific broken stage rather than surfacing as a generic "end-to-end answer is wrong."

Closes #6

Why

Per FPF grounding (B.3 Trust & Assurance, A.15 Role–Method–Work Alignment, B.3.4 Evidence Decay): each pipeline stage needs independent evidence of correctness. The existing 6 test files (1,530 lines) are all end-to-end/integration tests — none isolate individual stages. Stage-local contract tests provide targeted evidence refresh rather than requiring full end-to-end reruns.

Type

Changes

tests/compiler-contracts.test.ts — 5 describe blocks (~20 test cases):
- Parser resilience: Non-trivial section/pattern/route/lexicon counts; non-empty node IDs; metadata preservation; anchor validity
- Graph closure: Bounded unresolved references; tracked duplicate IDs; no broken routes; outline/explicit_reference relations; route step resolution rate
- Index round-trip: Index→pattern resolution; alias index validity; lexicon links; status/route/family indexes
- Validation coverage: Bounded missing fields; plausible index counts
- Snapshot determinism: Byte-identical re-compile; different hash on source change
tests/query-contracts.test.ts — 7 describe blocks (~30 test cases), refactored to call stage modules directly (no QueryEngine import):
- Normalizer → normalizeQuery() from query-normalizer.ts
- Seeder → seedCandidates() from candidate-seeder.ts
- Ranker → rankCandidates() from candidate-ranker.ts
- Frontier expansion → expandGrounding() from frontier-expander.ts
- Projection stability → buildPatternAnswer(), confidenceFromTrace(), gapsFromTrace() from answer-projector.ts
- Synthesis isolation → synthesizeAnswer() from synthesis-adapter.ts
- Trace determinism — assembled from stages, not QueryEngine
Each describe block now provides true stage-local fault attribution: a regression in ranking cannot fail a test labeled "Normalizer."
Existing 6 end-to-end test files are untouched

Note: Rebased on main after PR #31 merge. Stage modules are now available as direct imports.

Validation

bun run lint passes locally
bun run check passes locally
bun run test passes locally (CI green)
No new warnings introduced
bun run build succeeds (if runtime/server code touched)
bun run docs:build succeeds (if docs touched)
Relevant docs updated (README, docs/, inline JSDoc if applicable)

Boundary check

Runtime source set is FPF-spec.md only — no additional corpora added
No vector database or remote indexing introduced
No Python code added
MCP tool contracts stay in src/mcp/tool-contracts.ts

Agent metadata

Field	Value
Agent	Devin
Session	https://app.devin.ai/sessions/8fa4d238c77d45b2bfa931f5547007bc
Prompt	Merger: address venikman's review on #32, refactor query tests to call stage modules directly

Requested by: @venikman

devin-ai-integration · 2026-04-13T01:44:07Z

Original prompt from Stanislau

how our issues doing?

devin-ai-integration · 2026-04-13T01:44:08Z

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

Disable automatic comment and CI monitoring

coderabbitai · 2026-04-13T01:44:20Z

📝 Walkthrough

Walkthrough

Adds two stage-scoped contract test suites: one for the compiler pipeline validating parser, graph, index, validation, and snapshot invariants; and one for the query pipeline validating normalizer, seeder, ranker, frontier, projection, and synthesis isolation behaviors.

Changes

Cohort / File(s)	Summary
Compiler Contract Tests `tests/compiler-contracts.test.ts`	Adds stage-local compiler contracts (+282 lines): caches compiler output, computes SHA‑256 sourceHash, validates parser resilience (counts, non-empty IDs, preserved metadata), graph closure (unresolved refs, duplicate IDs, relation edges, route integrity), index round-trip (aliases, lexicon, status, route name resolution), validation coverage, and snapshot determinism (byte-identical snapshot for same input, differing hash on source change).
Query Pipeline Contract Tests `tests/query-contracts.test.ts`	Adds stage-local query contracts (+356 lines): compiles a cached Snapshot, constructs QueryEngine helpers, and validates normalizer signals, candidate seeder categories (exact-id, lexical, route expansion), candidate ranking and selection, frontier expansion bounds and hop metadata, projection determinism and verbose/proof behaviors, and synthesis isolation/fallback behaviors.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Poem

🐇 With twitching nose I run my test,

I cache the build and do my best.
Parser, graph, and index all in sight,
Queries seeded, ranked, and bounded tight.
A carrot for each passing bite — hooray, good night!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and concisely describes the main change: adding stage-local contract tests for compiler and query pipelines.
Linked Issues check	✅ Passed	The PR fully satisfies all acceptance criteria from issue `#6`: stage-local tests for all compiler stages (parser, graph, index, validation, snapshot) and query stages (normalizer, seeder, ranker, frontier, projector, synthesis), with existing end-to-end tests untouched.
Out of Scope Changes check	✅ Passed	All changes are directly in-scope: two new test files (compiler-contracts.test.ts and query-contracts.test.ts) providing the stage-local contract coverage specified in issue `#6`; no unrelated modifications.
Description check	✅ Passed	The pull request description is comprehensive and well-structured, covering all major sections of the template with clear details about what is being added, why, and how it validates against defined checks.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch devin/1776043509-stage-contract-tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

gemini-code-assist

Code Review

This pull request introduces stage-local contract tests for the compiler and query pipelines in tests/compiler-contracts.test.ts and tests/query-contracts.test.ts. The reviewer provided feedback focused on improving test robustness and determinism, specifically recommending the use of relative paths instead of process.cwd() and fixed timestamps for snapshots to ensure reproducible results. Additionally, the feedback suggests replacing magic numbers with constants and adding explicit assertions to conditional tests to prevent silent passes and ensure meaningful coverage.

gemini-code-assist · 2026-04-13T01:46:03Z

+  if (cachedOutput) {
+    return cachedOutput;
+  }
+  const sourcePath = resolve(process.cwd(), 'FPF-spec.md');


Using process.cwd() to resolve the source file path is fragile as it depends on the directory from which the test runner is executed. This can lead to failures if tests are run from a different directory level (e.g., in a monorepo or a sub-package). It is safer to resolve paths relative to the test file's location.

gemini-code-assist · 2026-04-13T01:46:03Z

+  cachedOutput = compileFpfSource({
+    sourcePath,
+    sourceHash,
+    builtAt: new Date().toISOString(),


Using new Date().toISOString() for builtAt introduces non-determinism into the cached compiler output. For contract tests, especially those verifying snapshot consistency and determinism, using a fixed timestamp is preferred to ensure reproducible results across different runs.

Suggested change

builtAt: new Date().toISOString(),

builtAt: '2025-01-01T00:00:00.000Z',

gemini-code-assist · 2026-04-13T01:46:03Z

+    expect(validation.parsedSections).toBeGreaterThan(100);
+    expect(validation.parsedPatterns).toBeGreaterThan(50);
+    expect(validation.parsedRoutes).toBeGreaterThan(0);
+    expect(validation.parsedLexiconEntries).toBeGreaterThan(5);


These assertions use arbitrary magic numbers as thresholds. While they serve as a baseline for the current FPF-spec.md, they make the tests brittle and tightly coupled to the specific content of the source document. If the document is refactored or reduced in size, these tests may fail even if the compiler logic is correct. Consider defining these thresholds as constants.

gemini-code-assist · 2026-04-13T01:46:03Z

+  if (cachedSnapshot) {
+    return cachedSnapshot;
+  }
+  const sourcePath = resolve(process.cwd(), 'FPF-spec.md');


Using process.cwd() to resolve the source file path is fragile. It is recommended to use a path relative to the test file's location to ensure the tests run correctly regardless of the current working directory.

gemini-code-assist · 2026-04-13T01:46:03Z

+  const output = compileFpfSource({
+    sourcePath,
+    sourceHash,
+    builtAt: new Date().toISOString(),


Using a dynamic date for builtAt in the test snapshot can lead to inconsistent results. Using a fixed timestamp is better for determinism in contract tests.

Suggested change

builtAt: new Date().toISOString(),

builtAt: '2025-01-01T00:00:00.000Z',

gemini-code-assist · 2026-04-13T01:46:04Z

+    if (firstRoute) {
+      const trace = engine(snapshot).trace(`Tell me about the ${firstRoute} route`);
+      expect(trace.detected.routeNames).toContain(firstRoute);
+    }


This test is conditional on firstRoute being defined. If the parser fails to find any routes, the test will pass silently without actually asserting the behavior. It should explicitly check that routes exist before proceeding.

Suggested change

if (firstRoute) {

const trace = engine(snapshot).trace(`Tell me about the ${firstRoute} route`);

expect(trace.detected.routeNames).toContain(firstRoute);

}

expect(routeNames.length).toBeGreaterThan(0);

const firstRoute = routeNames[0]!;

const trace = engine(snapshot).trace('Tell me about the ' + firstRoute + ' route');

expect(trace.detected.routeNames).toContain(firstRoute);

gemini-code-assist · 2026-04-13T01:46:04Z

+    if (statusKeys.length > 0) {
+      const firstStatus = statusKeys[0]!;
+      const trace = engine(snapshot).trace(`Show me ${firstStatus} patterns`);
+      expect(trace.detected.statusTerms).toContain(firstStatus);
+    }


Similar to the route detection test, this test passes silently if statusKeys is empty. Asserting that status terms are present in the index ensures that the test coverage is meaningful and that the parser is working as expected.

expect(statusKeys.length).toBeGreaterThan(0); const firstStatus = statusKeys[0]!; const trace = engine(snapshot).trace('Show me ' + firstStatus + ' patterns'); expect(trace.detected.statusTerms).toContain(firstStatus);

gemini-code-assist · 2026-04-13T01:46:04Z

+    if (trace.retrievalHops.length > 0) {
+      const firstHop = trace.retrievalHops[0]!;
+      expect(firstHop.iteration).toBe(1);
+      expect(firstHop.reason.length).toBeGreaterThan(0);
+      expect(typeof firstHop.sufficientAfter).toBe('boolean');
+    }


The assertions for hop metadata are wrapped in an if block. If no hops are recorded, the assertions are skipped, and the test passes. This could mask a bug where the engine fails to perform any retrieval hops for a complex query. You should assert that at least one hop occurred.

expect(trace.retrievalHops.length).toBeGreaterThan(0); const firstHop = trace.retrievalHops[0]!; expect(firstHop.iteration).toBe(1); expect(firstHop.reason.length).toBeGreaterThan(0); expect(typeof firstHop.sufficientAfter).toBe('boolean');

gemini-code-assist · 2026-04-13T01:46:04Z

+    if (trace.sufficient) {
+      expect(trace.selectedAnchorIds.length).toBeGreaterThan(0);
+    }


This test only verifies anchor selection if the trace is already marked as sufficient. For a known pattern query like 'What is A.1.1?', the engine should always reach sufficiency. Asserting sufficiency directly makes the test more robust.

expect(trace.sufficient).toBe(true); expect(trace.selectedAnchorIds.length).toBeGreaterThan(0);

Copilot

Pull request overview

Adds stage-local contract tests for the runtime compiler and query engine so failures localize to a specific pipeline stage rather than surfacing only as end-to-end mismatches.

Changes:

Introduces compiler stage contract tests (parser, graph closure, indexes, validation, determinism) against FPF-spec.md.
Introduces query stage contract tests (normalization, seeding, ranking, frontier bounds, projection, synthesis isolation, trace determinism) against a compiled snapshot.
Keeps existing end-to-end/integration tests unchanged while expanding stage-level evidence coverage.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

File	Description
tests/compiler-contracts.test.ts	Adds stage-local compiler contract tests validating parse/graph/index/validation/determinism invariants.
tests/query-contracts.test.ts	Adds stage-local query contract tests validating normalization, candidate seeding/ranking, frontier bounds, projection behavior, synthesis isolation, and trace determinism.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-13T01:48:16Z

+    const routeNames = Object.values(snapshot.routeGraph.nodes).map((r) => r.name);
+    const firstRoute = routeNames[0];
+
+    if (firstRoute) {
+      const trace = engine(snapshot).trace(`Tell me about the ${firstRoute} route`);
+      expect(trace.detected.routeNames).toContain(firstRoute);
+    }


This test can silently pass if routeGraph.nodes is empty (or contains routes without names) because it guards on if (firstRoute) and otherwise performs no assertions. Since the contract is specifically about route-name detection, assert a non-empty route set (e.g., expect(routeNames.length).toBeGreaterThan(0)) and then run the detection expectations unconditionally so regressions can’t be skipped.

Suggested change

const routeNames = Object.values(snapshot.routeGraph.nodes).map((r) => r.name);

const firstRoute = routeNames[0];

if (firstRoute) {

const trace = engine(snapshot).trace(`Tell me about the ${firstRoute} route`);

expect(trace.detected.routeNames).toContain(firstRoute);

}

const routeNames = Object.values(snapshot.routeGraph.nodes)

.map((r) => r.name)

.filter((name): name is string => typeof name === 'string' && name.length > 0);

expect(routeNames.length).toBeGreaterThan(0);

const firstRoute = routeNames[0]!;

const trace = engine(snapshot).trace(`Tell me about the ${firstRoute} route`);

expect(trace.detected.routeNames).toContain(firstRoute);

Copilot · 2026-04-13T01:48:16Z

+    if (statusKeys.length > 0) {
+      const firstStatus = statusKeys[0]!;
+      const trace = engine(snapshot).trace(`Show me ${firstStatus} patterns`);
+      expect(trace.detected.statusTerms).toContain(firstStatus);
+    }


Similar to the route-name test: this can become a vacuous pass if the status index ever becomes empty because the assertions are skipped under if (statusKeys.length > 0). Consider asserting statusKeys.length > 0 (or using a known status term like stable) so the test reliably fails when status-term detection breaks.

Suggested change

if (statusKeys.length > 0) {

const firstStatus = statusKeys[0]!;

const trace = engine(snapshot).trace(`Show me ${firstStatus} patterns`);

expect(trace.detected.statusTerms).toContain(firstStatus);

}

expect(statusKeys.length).toBeGreaterThan(0);

const firstStatus = statusKeys[0]!;

const trace = engine(snapshot).trace(`Show me ${firstStatus} patterns`);

expect(trace.detected.statusTerms).toContain(firstStatus);

Copilot · 2026-04-13T01:48:17Z

+    const routeCandidates = trace.candidateScores.filter((c) => c.kind === 'route');
+    expect(routeCandidates.length).toBeGreaterThan(0);


This is intended to validate seeding route expansion candidates, but it inspects trace.candidateScores, which is a ranked + truncated view (slice(0, 16)). A route can be correctly seeded yet not appear in the top 16, causing a false failure. Prefer asserting against trace.frontierCandidates (e.g., presence of origin === 'route_expansion') which represents the seed set without truncation.

Suggested change

const routeCandidates = trace.candidateScores.filter((c) => c.kind === 'route');

expect(routeCandidates.length).toBeGreaterThan(0);

const routeFrontier = trace.frontierCandidates.filter(

(c) => c.origin === 'route_expansion',

);

expect(routeFrontier.length).toBeGreaterThan(0);

Copilot · 2026-04-13T01:48:17Z

+    // threshold (100) and total count stays low relative to the full catalog.
+    const highScoring = trace.candidateScores.filter((c) => c.score >= 100);
+    expect(highScoring.length).toBe(0);
+    expect(trace.candidateScores.length).toBeLessThan(


trace.candidateScores is always capped to the top 16 candidates (see QueryEngine.trace()), so the assertion about “total count stays low relative to the full catalog” is not actually testing what it claims—it will almost always be true regardless of how many candidates were generated. To make this contract meaningful, assert against an un-truncated signal (e.g., trace.frontierCandidates.length) or extend TraceResult to include the full candidate count.

Suggested change

expect(trace.candidateScores.length).toBeLessThan(

expect(trace.frontierCandidates.length).toBeLessThan(

Copilot · 2026-04-13T01:48:17Z

+  it('status index keys partition compiled nodes without overlap', async () => {
+    const { snapshot } = await getCompilerOutput();
+    const statusIndex = snapshot.indexes.statusIndex;
+
+    expect(Object.keys(statusIndex).length).toBeGreaterThan(0);
+
+    for (const [_status, nodeIds] of Object.entries(statusIndex)) {
+      for (const nodeId of nodeIds) {
+        expect(snapshot.compiledNodes[nodeId]).toBeDefined();
+      }
+    }
+  });


The test name/comment claims the status index “partitions compiled nodes without overlap”, but the assertions only check that referenced node IDs exist. Either tighten the assertions to actually verify the partition/overlap property (likely for pattern nodes only), or rename the test to reflect what it currently validates (e.g., that status index entries resolve to compiled nodes).

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c0595bfd6e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-04-13T01:51:45Z

+    if (statusKeys.length > 0) {
+      const firstStatus = statusKeys[0]!;
+      const trace = engine(snapshot).trace(`Show me ${firstStatus} patterns`);
+      expect(trace.detected.statusTerms).toContain(firstStatus);


Use a supported status token in normalizer contract test

This test picks statusKeys[0] from snapshot.indexes.statusIndex and expects it in trace.detected.statusTerms, but QueryEngine.trace() only emits the fixed tokens draft, stable, stub, and transitional. The status index currently also contains values like new, full text, and transitional stub, so a harmless change in insertion order or source ordering can make this test fail even when normalizer behavior is unchanged, creating a flaky CI gate.

Useful? React with 👍 / 👎.

Valid point — the normalizer only detects the fixed tokens draft, stable, stub, transitional via substring matching, so picking an arbitrary statusKeys[0] could yield a key like new or full text that the normalizer never emits.

Fixed in 3239888: the test now finds a known token from the hardcoded list that exists in the status index, ensuring the test validates actual normalizer behavior.

coderabbitai

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/compiler-contracts.test.ts`:
- Around line 193-203: The test currently only checks that each nodeId in
statusIndex exists in snapshot.compiledNodes but doesn't ensure nodeIds are
unique across status keys; update the test (referencing statusIndex, snapshot,
compiledNodes) to assert there is no overlap by tracking seen node IDs (e.g., a
Set) as you iterate all Object.entries(statusIndex) and for each nodeId assert
it's not already present before adding it, and optionally assert that the total
number of unique node IDs equals the sum of all nodeIds lengths to catch
duplicates.
- Around line 220-236: The tests currently only run against live production data
(getCompilerOutput) so they won't detect regressions that stop reporting issues;
add a deterministic malformed fixture and a unit test that feeds it through the
same validation path and asserts specific findings. Create a new test case
(e.g., in the same suite 'Compiler / Validation stage') that builds a small
synthetic snapshot with intentionally missing required fields and an unresolved
reference, call the same validation logic used by getCompilerOutput (reuse the
functions that populate snapshot.validation or invoke the validator directly),
then assert exact expected counts on validation.missingRequiredFields and
validation.unresolvedReferences (or indexMapNodes if applicable) to ensure the
validator flags the known-bad inputs. Ensure the test references
getCompilerOutput/validator entrypoint and the validation fields
missingRequiredFields and unresolvedReferences/indexMapNodes so future
regressions fail deterministically.
- Around line 38-82: The tests under "Compiler / Parser stage" currently only
validate the canonical snapshot returned by getCompilerOutput(); add additional
test cases that recompile semantic-equivalent variants (whitespace-only,
reordered sections, and prose-only edits) and assert the same invariants
(parsedSections/parsedPatterns/parsedRoutes/parsedLexiconEntries counts,
compiledNodes IDs, pattern metadata for 'A.1.1', and valid anchors) hold for
each variant; implement helper functions to produce or load variant inputs, call
the same getCompilerOutput or a variant-taking compile function, and run the
existing assertions against each resulting snapshot to ensure parser resilience.

In `@tests/query-contracts.test.ts`:
- Around line 268-316: The tests for fallback synthesis check status, ids,
citations, and relations but don’t assert that the fallback answer and
constraints remain identical to the deterministic baseline; update the three
tests using engine(...) and LocalAnswerSynthesizer (the "returns deterministic
answer when synthesizer is unavailable", "falls back to deterministic answer
when synthesizer throws", and "does not alter deterministic IDs or citations
when synthesis fails") to also capture the deterministic baseline by calling
eng.query(...) (or engine(snapshot) once) and then assert that result.answer and
result.constraints equal the baseline's answer and constraints (for both the
unavailable and throwing synthesizer cases) so any regression that changes
answer content or constraints will fail.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: bf33b511-7b48-4d5a-a9f4-7322e0c100c8

📥 Commits

Reviewing files that changed from the base of the PR and between aa03982 and c0595bf.

📒 Files selected for processing (2)

tests/compiler-contracts.test.ts
tests/query-contracts.test.ts

coderabbitai · 2026-04-13T01:52:54Z

+describe('Compiler / Parser stage', () => {
+  it('parses a non-trivial number of sections, patterns, routes, and lexicon entries', async () => {
+    const { snapshot } = await getCompilerOutput();
+    const { validation } = snapshot;
+
+    expect(validation.parsedSections).toBeGreaterThan(100);
+    expect(validation.parsedPatterns).toBeGreaterThan(50);
+    expect(validation.parsedRoutes).toBeGreaterThan(0);
+    expect(validation.parsedLexiconEntries).toBeGreaterThan(5);
+  });
+
+  it('assigns IDs to all compiled nodes and none are empty strings', async () => {
+    const { snapshot } = await getCompilerOutput();
+    const nodeIds = Object.keys(snapshot.compiledNodes);
+
+    expect(nodeIds.length).toBeGreaterThan(50);
+    for (const nodeId of nodeIds) {
+      expect(nodeId.length).toBeGreaterThan(0);
+    }
+  });
+
+  it('preserves pattern metadata fields (title, status, part)', async () => {
+    const { snapshot } = await getCompilerOutput();
+    const pattern = snapshot.patternGraph.nodes['A.1.1'];
+
+    expect(pattern).toBeDefined();
+    expect(pattern!.title.length).toBeGreaterThan(0);
+    expect(pattern!.status.length).toBeGreaterThan(0);
+    expect(pattern!.sectionIds.length).toBeGreaterThan(0);
+  });
+
+  it('produces anchors with valid line ranges', async () => {
+    const { snapshot } = await getCompilerOutput();
+    const anchors = Object.values(snapshot.anchorMap);
+
+    expect(anchors.length).toBeGreaterThan(50);
+    for (const anchor of anchors.slice(0, 20)) {
+      expect(anchor.lineStart).toBeGreaterThanOrEqual(0);
+      expect(anchor.lineEnd).toBeGreaterThan(anchor.lineStart);
+    }
+
+    const nonEmpty = anchors.filter((a) => a.text.length > 0);
+    expect(nonEmpty.length).toBeGreaterThan(anchors.length / 2);
+  });
+});


⚠️ Potential issue | 🟠 Major

Parser resilience is not actually under test here.

These assertions only inspect the canonical FPF-spec.md snapshot. They never recompile whitespace-only, reordered, or prose-only variants, so a regression on semantically equivalent source edits would still pass this whole block.

Suggested direction

+ it('preserves semantic IR under whitespace-only rewrites', async () => { + const sourcePath = resolve(process.cwd(), 'FPF-spec.md'); + const sourceText = await readFile(sourcePath, 'utf8'); + const rewritten = sourceText + .replace(/[ \t]+$/gm, '') + .replace(/\n{3,}/g, '\n\n'); + const builtAt = '2025-01-01T00:00:00.000Z'; + + const base = compileFpfSource({ + sourcePath, + sourceHash: createHash('sha256').update(sourceText).digest('hex'), + builtAt, + sourceText, + }); + const variant = compileFpfSource({ + sourcePath, + sourceHash: createHash('sha256').update(rewritten).digest('hex'), + builtAt, + sourceText: rewritten, + }); + + expect(Object.keys(variant.snapshot.patternGraph.nodes)).toEqual( + Object.keys(base.snapshot.patternGraph.nodes), + ); + expect(Object.keys(variant.snapshot.routeGraph.nodes)).toEqual( + Object.keys(base.snapshot.routeGraph.nodes), + ); + });

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tests/compiler-contracts.test.ts` around lines 38 - 82, The tests under "Compiler / Parser stage" currently only validate the canonical snapshot returned by getCompilerOutput(); add additional test cases that recompile semantic-equivalent variants (whitespace-only, reordered sections, and prose-only edits) and assert the same invariants (parsedSections/parsedPatterns/parsedRoutes/parsedLexiconEntries counts, compiledNodes IDs, pattern metadata for 'A.1.1', and valid anchors) hold for each variant; implement helper functions to produce or load variant inputs, call the same getCompilerOutput or a variant-taking compile function, and run the existing assertions against each resulting snapshot to ensure parser resilience.

coderabbitai · 2026-04-13T01:52:54Z

+  it('status index keys partition compiled nodes without overlap', async () => {
+    const { snapshot } = await getCompilerOutput();
+    const statusIndex = snapshot.indexes.statusIndex;
+
+    expect(Object.keys(statusIndex).length).toBeGreaterThan(0);
+
+    for (const [_status, nodeIds] of Object.entries(statusIndex)) {
+      for (const nodeId of nodeIds) {
+        expect(snapshot.compiledNodes[nodeId]).toBeDefined();
+      }
+    }


⚠️ Potential issue | 🟠 Major

This does not verify the “without overlap” part of the contract.

The loop only checks that every indexed ID exists in compiledNodes. If the same node is emitted under multiple status keys, this test still passes.

Suggested fix

expect(Object.keys(statusIndex).length).toBeGreaterThan(0); - for (const [_status, nodeIds] of Object.entries(statusIndex)) { + const seen = new Map<string, string>(); + for (const [status, nodeIds] of Object.entries(statusIndex)) { for (const nodeId of nodeIds) { expect(snapshot.compiledNodes[nodeId]).toBeDefined(); + expect(seen.has(nodeId)).toBe(false); + seen.set(nodeId, status); } }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

it('status index keys partition compiled nodes without overlap', async () => {

const { snapshot } = await getCompilerOutput();

const statusIndex = snapshot.indexes.statusIndex;

expect(Object.keys(statusIndex).length).toBeGreaterThan(0);

for (const [_status, nodeIds] of Object.entries(statusIndex)) {

for (const nodeId of nodeIds) {

expect(snapshot.compiledNodes[nodeId]).toBeDefined();

}

}

it('status index keys partition compiled nodes without overlap', async () => {

const { snapshot } = await getCompilerOutput();

const statusIndex = snapshot.indexes.statusIndex;

expect(Object.keys(statusIndex).length).toBeGreaterThan(0);

const seen = new Map<string, string>();

for (const [status, nodeIds] of Object.entries(statusIndex)) {

for (const nodeId of nodeIds) {

expect(snapshot.compiledNodes[nodeId]).toBeDefined();

expect(seen.has(nodeId)).toBe(false);

seen.set(nodeId, status);

}

}

});

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tests/compiler-contracts.test.ts` around lines 193 - 203, The test currently only checks that each nodeId in statusIndex exists in snapshot.compiledNodes but doesn't ensure nodeIds are unique across status keys; update the test (referencing statusIndex, snapshot, compiledNodes) to assert there is no overlap by tracking seen node IDs (e.g., a Set) as you iterate all Object.entries(statusIndex) and for each nodeId assert it's not already present before adding it, and optionally assert that the total number of unique node IDs equals the sum of all nodeIds lengths to catch duplicates.

The test was renamed from "status index keys partition compiled nodes without overlap" to "status index entries resolve to existing compiled nodes" in 3239888, so the name now accurately describes what the assertions check (existence only).

The overlap assertion is intentionally omitted because the FPF status index legitimately assigns nodes to multiple statuses (e.g., a pattern can be both "stub" and "transitional"). Adding a uniqueness constraint would assert a property the data model doesn't guarantee.

Skipped: comment is from another GitHub bot.

The test was renamed from "status index keys partition compiled nodes without overlap" to "status index entries resolve to existing compiled nodes" in 3239888, so the name now accurately describes what the assertions check (existence only).

The overlap assertion is intentionally omitted because the FPF status index legitimately assigns nodes to multiple statuses (e.g., a pattern can be both "stub" and "transitional"). Adding a uniqueness constraint would assert a property the data model doesn't guarantee.

Skipped: comment is from another GitHub bot.

coderabbitai · 2026-04-13T01:52:54Z

+describe('Compiler / Validation stage', () => {
+  it('keeps missing required fields bounded', async () => {
+    const { snapshot } = await getCompilerOutput();
+    const { validation } = snapshot;
+
+    // The FPF spec has a small number of patterns with incomplete metadata.
+    // The contract is that this stays bounded — a regression would spike it.
+    expect(validation.missingRequiredFields).toBeLessThan(25);
+  });
+
+  it('counts a plausible number of index map nodes', async () => {
+    const { snapshot } = await getCompilerOutput();
+    const { validation } = snapshot;
+
+    expect(validation.indexMapNodes).toBeGreaterThan(50);
+    expect(validation.indexMapNodes).toBe(Object.keys(snapshot.indexMap).length);
+  });


⚠️ Potential issue | 🟠 Major

The validator is never exercised with known-bad input.

Both checks run against the current production spec, so a regression that stops reporting missing fields or unresolved references can still go green here. This stage needs at least one deliberately malformed fixture and explicit findings assertions.

Suggested direction

+ it('reports specific findings for malformed input', () => { + const sourceText = ` +## A.1.1 +Status: draft + +References: Missing.Id +`; + const output = compileFpfSource({ + sourcePath: 'inline', + sourceHash: createHash('sha256').update(sourceText).digest('hex'), + builtAt: '2025-01-01T00:00:00.000Z', + sourceText, + }); + + expect(output.snapshot.validation.missingRequiredFields).toBeGreaterThan(0); + expect(output.snapshot.validation.unresolvedReferences.length).toBeGreaterThan(0); + });

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tests/compiler-contracts.test.ts` around lines 220 - 236, The tests currently only run against live production data (getCompilerOutput) so they won't detect regressions that stop reporting issues; add a deterministic malformed fixture and a unit test that feeds it through the same validation path and asserts specific findings. Create a new test case (e.g., in the same suite 'Compiler / Validation stage') that builds a small synthetic snapshot with intentionally missing required fields and an unresolved reference, call the same validation logic used by getCompilerOutput (reuse the functions that populate snapshot.validation or invoke the validator directly), then assert exact expected counts on validation.missingRequiredFields and validation.unresolvedReferences (or indexMapNodes if applicable) to ensure the validator flags the known-bad inputs. Ensure the test references getCompilerOutput/validator entrypoint and the validation fields missingRequiredFields and unresolvedReferences/indexMapNodes so future regressions fail deterministically.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 32398887bb

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-04-13T02:04:35Z

+      sourceText: `${sourceText}\n<!-- change -->\n`,
+    });
+
+    expect(first.snapshot.sourceHash).not.toBe(second.snapshot.sourceHash);


Assert source hash changes through runtime path

This check is tautological for the compiler contract: compileFpfSource stores the caller-provided sourceHash in the snapshot, so passing two already-different hashes guarantees this assertion passes even if source-text handling regresses. As written, the test validates the test’s own hash precomputation rather than any compiler behavior, so a real compiler regression could slip through with green CI.

Useful? React with 👍 / 👎.

Good catch — you're right, this assertion is tautological since compileFpfSource just stores the caller-provided sourceHash verbatim. The test validates the test's own hash precomputation, not any compiler behavior.

I'll fix this by verifying that the compiler actually produces structurally different output when the source text changes (e.g., different compiled node count or different searchable text), rather than just checking the pass-through hash.

venikman · 2026-04-14T03:28:08Z

Code Review

Overview

Adds two new test files — tests/compiler-contracts.test.ts (~295 lines) and tests/query-contracts.test.ts (~356 lines) — that assert stage-local invariants for each compiler stage (parser, graph closure, index round-trip, validation, determinism) and each query stage (normalizer, seeder, ranker, frontier, projection, synthesis isolation).

Strengths

Stage-local failure mode — this is the right framing. When a test fails, reviewer can tell immediately which stage broke instead of chasing a "answer is wrong" symptom across the whole pipeline.
Bounded assertions — MIN_SECTIONS = 100, MIN_PATTERNS = 50, duplicateIds < patternCount + 10 — loose enough to survive spec edits, tight enough to catch regressions.
Snapshot determinism test — compileFpfSource called twice yields byte-identical JSON. This is a real invariant of the compiler and a common thing to accidentally break.
Synthesis isolation — verifies that when synthesizer is unavailable/throws, deterministic IDs/citations are unchanged. Directly addresses the failure mode I flagged in the audit as risky.
Cached CompilerOutput across tests in the file — keeps the suite fast.

Issues & nits

Brittle to spec-specific IDs

expect(trace.detected.ids).toContain('A.1.1');
expect(snapshot.patternGraph.nodes['A.1.1']).toBeDefined();

Hardcoded IDs (A.1.1, C.17, B.3, etc.) appear in many tests. If the spec renames or renumbers, the whole suite breaks. Contract tests against a live spec are inherently coupled — acceptable, but worth a comment at the top explaining that A.1.1 is being used as a canonical fixture and where to update if it changes.

`xyzzy plugh` nonsense probe

const trace = engine(snapshot).trace('xyzzy plugh');
expect(trace.detected.ids).toEqual([]);

If the spec ever happens to mention either word, the test fails. Pick a truly impossible string like __FPFTEST_NONSENSE_999__ to be safe. Low-probability but zero-cost to fix.

Synthetic section test asserts weak invariant

const structurallyDifferent =
  secondSections > firstSections || secondIndexNodes > firstIndexNodes;
expect(structurallyDifferent).toBe(true);

Tests only that something grew. Doesn't verify the new section was actually parsed as Z.99. Stronger assertion: expect(second.snapshot.patternGraph.nodes['Z.99']).toBeDefined() or equivalent — proves the parser handled the new heading shape correctly.

Cached output across tests shares mutable state risk

cachedOutput is module-scoped. If any test mutates the returned snapshot, subsequent tests see the mutation. Today all consumers are read-only so this is fine; worth freezing the object (Object.freeze-ish) to make the invariant explicit.

Coverage gaps I'd add in a follow-up

Heuristic seed rules (landed in refactor: extract inline heuristic seeds and route constraints to snapshot-attached rules #28) — no test asserting that snapshot-attached rules fire correctly for a creativity/vocabulary query.
Session cache persistence (feat: persist session cache across process restarts #27) — no contract test for load/save round-trip.
IndexingView / refresh classification (feat: add stable indexing view and semantic refresh classification #29) — flagged in that PR's review as missing tests; belongs here.

Coupling with PR #31 split

This PR stands alone on current main, but the query-stage names (normalizer, seeder, ranker, frontier, projection, synthesis) map directly to the new files in #31. If #31 lands first, great — tests validate the split. If #32 lands first, the stage names are aspirational until #31 realigns. Would recommend landing #31 before or with this.

Recommendation

Approve. This is exactly the right test posture for a staged pipeline and directly supports the split refactor in #31. Small nits only — fix the nonsense-query token, strengthen the Z.99 assertion, and track the coverage-gap follow-ups.

devin-ai-integration · 2026-04-14T03:53:46Z

@venikman All three nits fixed in 5972811:

Nonsense probe — 'xyzzy plugh' → '__FPFTEST_NONSENSE_999__' across all three usages. Zero chance of spec collision.
Z.99 assertion strengthened — Now asserts secondSections > firstSections directly (not the weaker "something grew" disjunction). Added comment explaining why Z.99 appears as a parsed section, not a patternGraph node — the compiler only promotes headings that match spec-catalog entries.
Spec-ID fixture comments — Added canonical fixture ID documentation to both test file headers explaining that A.1.1, A.15, B.3 are stable spec anchors and where to update if renamed.

Re: coverage gaps (heuristic seed rules, session cache persistence, IndexingView classification) — agreed, these belong in follow-up PRs. The current suite covers the six query stages and five compiler stages.

Re: coupling with #31 — agreed on landing together. Stage names map 1:1 to #31's new files.

venikman

One substantive issue before merge:

tests/query-contracts.test.ts is described as stage-local coverage, but the file still routes every assertion through new QueryEngine(...).trace() or .query(). That means a regression in ranking, frontier expansion, or projection can still fail the blocks labeled Normalizer or Seeder, so the PR does not yet provide true stage-local fault attribution for the query pipeline.

If the goal for #6 is real per-stage isolation, the query-side tests need a smaller seam than QueryEngine itself, for example by exposing and testing the individual query stages directly or by extracting stage helpers behind explicit contracts.

I validated the branch locally in an isolated snapshot and did not find a runtime regression beyond this review point.

devin-ai-integration · 2026-04-14T05:38:42Z

Devin is archived and cannot be woken up. Please unarchive Devin if you want to continue using it.

Closes #6 - compiler-contracts.test.ts: 5 describe blocks covering Parser resilience, Graph closure, Index round-trip, Validation coverage, and Snapshot determinism - query-contracts.test.ts: 7 describe blocks covering Normalizer, Seeder, Ranker, Frontier expansion, Projection stability, Synthesis isolation, and cross-cutting Trace determinism Each test targets a specific pipeline stage promise so that a failure pinpoints the broken stage rather than surfacing as a generic 'end-to-end answer is wrong.' Existing end-to-end tests remain untouched. Co-Authored-By: Stanislau <nedbailov375426@gmail.com>

- Add total>0 guard before division in route resolution test (Devin Review) - Use hardcoded normalizer tokens for status term test (Codex) - Use frontierCandidates for route seeder and nonsense tests (Copilot) - Rename status index test to match actual assertions (CodeRabbit) - Fix hop metadata test to handle already-sufficient grounding - Use real FPF IDs (A.1.1, A.15, B.3) for hop metadata query Co-Authored-By: Stanislau <nedbailov375426@gmail.com>

Co-Authored-By: Stanislau <nedbailov375426@gmail.com>

…ec-ID fixture comments - Replace 'xyzzy plugh' with '__FPFTEST_NONSENSE_999__' to avoid false failures if spec ever mentions those words - Strengthen Z.99 structural assertion: verify parsedSections grows (not just 'something grew'), with comment explaining why Z.99 appears as a section not a pattern node - Add canonical fixture ID comments to both test files explaining that A.1.1/A.15/B.3 are stable spec anchors and where to update if renamed Co-Authored-By: Stanislau <nedbailov375426@gmail.com>

Co-Authored-By: Stanislau <nedbailov375426@gmail.com>

Address venikman's review: replace QueryEngine.trace()/query() with direct imports of normalizeQuery, seedCandidates, rankCandidates, expandGrounding, buildPatternAnswer, synthesizeAnswer, confidenceFromTrace, and gapsFromTrace. Each describe block now targets its stage function in isolation so a regression in one stage cannot masquerade as a failure in another. Co-Authored-By: Stanislau <nedbailov375426@gmail.com>

devin-ai-integration · 2026-04-14T05:46:01Z

Addressed in 13f7af7 — refactored the entire query contract test file to call stage modules directly instead of routing through QueryEngine.trace()/.query():

Before: Every test imported QueryEngine and called .trace() or .query() — a regression in ranking could fail a test labeled "Normalizer."

After: Each describe block imports and calls its stage function directly:

Normalizer → normalizeQuery() from query-normalizer.ts
Seeder → seedCandidates() from candidate-seeder.ts
Ranker → rankCandidates() from candidate-ranker.ts
Frontier → expandGrounding() from frontier-expander.ts
Projection → buildPatternAnswer(), confidenceFromTrace(), gapsFromTrace() from answer-projector.ts
Synthesis → synthesizeAnswer() from synthesis-adapter.ts

The assembleTrace() helper (used only by projection/synthesis tests that need a full TraceResult) chains the stages explicitly without going through QueryEngine, so the seam is transparent and each stage's output is independently inspectable.

QueryEngine is no longer imported in the test file at all.

Copilot AI review requested due to automatic review settings April 13, 2026 01:44

devin-ai-integration Bot assigned venikman Apr 13, 2026

Copilot started reviewing on behalf of venikman April 13, 2026 01:44 View session

gemini-code-assist Bot reviewed Apr 13, 2026

View reviewed changes

Copilot AI reviewed Apr 13, 2026

View reviewed changes

This comment was marked as resolved.

Sign in to view

chatgpt-codex-connector Bot reviewed Apr 13, 2026

View reviewed changes

coderabbitai Bot reviewed Apr 13, 2026

View reviewed changes

chatgpt-codex-connector Bot reviewed Apr 13, 2026

View reviewed changes

devin-ai-integration Bot force-pushed the devin/1776043509-stage-contract-tests branch 3 times, most recently from ccd523a to 2fe8114 Compare April 14, 2026 03:25

venikman mentioned this pull request Apr 14, 2026

feat: add stable indexing view and semantic refresh classification #29

Merged

12 tasks

This was referenced Apr 14, 2026

refactor: split query-engine.ts into retrieval stages #31

Merged

Split query-engine.ts into retrieval stages #2

Closed

Add stage-local contract tests for compiler and query pipelines #6

Closed

venikman commented Apr 14, 2026

View reviewed changes

devin-ai-integration Bot and others added 4 commits April 14, 2026 05:40

test: replace tautological sourceHash check with structural assertion

c4765d6

Co-Authored-By: Stanislau <nedbailov375426@gmail.com>

devin-ai-integration Bot and others added 2 commits April 14, 2026 05:40

fix: remove unused firstIndexNodes/secondIndexNodes variables

7a69ba0

Co-Authored-By: Stanislau <nedbailov375426@gmail.com>

devin-ai-integration Bot force-pushed the devin/1776043509-stage-contract-tests branch from 195ced1 to 13f7af7 Compare April 14, 2026 05:43

venikman merged commit 81f55e7 into main Apr 14, 2026
1 check passed

venikman deleted the devin/1776043509-stage-contract-tests branch May 9, 2026 10:59

	builtAt: new Date().toISOString(),
	builtAt: '2025-01-01T00:00:00.000Z',

		const routeCandidates = trace.candidateScores.filter((c) => c.kind === 'route');
		expect(routeCandidates.length).toBeGreaterThan(0);

-    const routeCandidates = trace.candidateScores.filter((c) => c.kind === 'route');
-    expect(routeCandidates.length).toBeGreaterThan(0);
+    const routeFrontier = trace.frontierCandidates.filter(
+      (c) => c.origin === 'route_expansion',
+    );
+    expect(routeFrontier.length).toBeGreaterThan(0);

	expect(trace.candidateScores.length).toBeLessThan(
	expect(trace.frontierCandidates.length).toBeLessThan(

Conversation

venikman commented Apr 13, 2026 • edited by devin-ai-integration Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What

Why

Type

Changes

Validation

Boundary check

Agent metadata

Uh oh!

devin-ai-integration Bot commented Apr 13, 2026

Uh oh!

devin-ai-integration Bot commented Apr 13, 2026

🤖 Devin AI Engineer

Uh oh!

coderabbitai Bot commented Apr 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Apr 13, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Apr 13, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Apr 13, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Apr 13, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Apr 13, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Apr 13, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Apr 13, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Apr 13, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Apr 13, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

Uh oh!

This comment was marked as resolved.

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

venikman commented Apr 13, 2026 •

edited by devin-ai-integration Bot

Loading

coderabbitai Bot commented Apr 13, 2026 •

edited

Loading

coderabbitai Bot Apr 13, 2026 •

edited

Loading

`xyzzy plugh` nonsense probe