Skip to content

test: add stage-local contract tests for compiler and query pipelines#32

Merged
venikman merged 6 commits into
mainfrom
devin/1776043509-stage-contract-tests
Apr 14, 2026
Merged

test: add stage-local contract tests for compiler and query pipelines#32
venikman merged 6 commits into
mainfrom
devin/1776043509-stage-contract-tests

Conversation

@venikman
Copy link
Copy Markdown
Owner

@venikman venikman commented Apr 13, 2026

What

Add stage-local contract tests for both the compiler and query pipelines, so that a test failure pinpoints the specific broken stage rather than surfacing as a generic "end-to-end answer is wrong."

Closes #6

Why

Per FPF grounding (B.3 Trust & Assurance, A.15 Role–Method–Work Alignment, B.3.4 Evidence Decay): each pipeline stage needs independent evidence of correctness. The existing 6 test files (1,530 lines) are all end-to-end/integration tests — none isolate individual stages. Stage-local contract tests provide targeted evidence refresh rather than requiring full end-to-end reruns.

Type

  • feat — new capability
  • fix — bug fix
  • refactor — code restructuring
  • docs — documentation only
  • chore — maintenance (deps, CI, cleanup)

Changes

  • tests/compiler-contracts.test.ts — 5 describe blocks (~20 test cases):

    • Parser resilience: Non-trivial section/pattern/route/lexicon counts; non-empty node IDs; metadata preservation; anchor validity
    • Graph closure: Bounded unresolved references; tracked duplicate IDs; no broken routes; outline/explicit_reference relations; route step resolution rate
    • Index round-trip: Index→pattern resolution; alias index validity; lexicon links; status/route/family indexes
    • Validation coverage: Bounded missing fields; plausible index counts
    • Snapshot determinism: Byte-identical re-compile; different hash on source change
  • tests/query-contracts.test.ts — 7 describe blocks (~30 test cases), refactored to call stage modules directly (no QueryEngine import):

    • NormalizernormalizeQuery() from query-normalizer.ts
    • SeederseedCandidates() from candidate-seeder.ts
    • RankerrankCandidates() from candidate-ranker.ts
    • Frontier expansionexpandGrounding() from frontier-expander.ts
    • Projection stabilitybuildPatternAnswer(), confidenceFromTrace(), gapsFromTrace() from answer-projector.ts
    • Synthesis isolationsynthesizeAnswer() from synthesis-adapter.ts
    • Trace determinism — assembled from stages, not QueryEngine

    Each describe block now provides true stage-local fault attribution: a regression in ranking cannot fail a test labeled "Normalizer."

  • Existing 6 end-to-end test files are untouched

Note: Rebased on main after PR #31 merge. Stage modules are now available as direct imports.

Validation

  • bun run lint passes locally
  • bun run check passes locally
  • bun run test passes locally (CI green)
  • No new warnings introduced
  • bun run build succeeds (if runtime/server code touched)
  • bun run docs:build succeeds (if docs touched)
  • Relevant docs updated (README, docs/, inline JSDoc if applicable)

Boundary check

  • Runtime source set is FPF-spec.md only — no additional corpora added
  • No vector database or remote indexing introduced
  • No Python code added
  • MCP tool contracts stay in src/mcp/tool-contracts.ts

Agent metadata

Field Value
Agent Devin
Session https://app.devin.ai/sessions/8fa4d238c77d45b2bfa931f5547007bc
Prompt Merger: address venikman's review on #32, refactor query tests to call stage modules directly

Requested by: @venikman

Copilot AI review requested due to automatic review settings April 13, 2026 01:44
@devin-ai-integration
Copy link
Copy Markdown
Contributor

Original prompt from Stanislau

how our issues doing?

@devin-ai-integration
Copy link
Copy Markdown
Contributor

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 13, 2026

📝 Walkthrough

Walkthrough

Adds two stage-scoped contract test suites: one for the compiler pipeline validating parser, graph, index, validation, and snapshot invariants; and one for the query pipeline validating normalizer, seeder, ranker, frontier, projection, and synthesis isolation behaviors.

Changes

Cohort / File(s) Summary
Compiler Contract Tests
tests/compiler-contracts.test.ts
Adds stage-local compiler contracts (+282 lines): caches compiler output, computes SHA‑256 sourceHash, validates parser resilience (counts, non-empty IDs, preserved metadata), graph closure (unresolved refs, duplicate IDs, relation edges, route integrity), index round-trip (aliases, lexicon, status, route name resolution), validation coverage, and snapshot determinism (byte-identical snapshot for same input, differing hash on source change).
Query Pipeline Contract Tests
tests/query-contracts.test.ts
Adds stage-local query contracts (+356 lines): compiles a cached Snapshot, constructs QueryEngine helpers, and validates normalizer signals, candidate seeder categories (exact-id, lexical, route expansion), candidate ranking and selection, frontier expansion bounds and hop metadata, projection determinism and verbose/proof behaviors, and synthesis isolation/fallback behaviors.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Poem

🐇 With twitching nose I run my test,

I cache the build and do my best.
Parser, graph, and index all in sight,
Queries seeded, ranked, and bounded tight.
A carrot for each passing bite — hooray, good night!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely describes the main change: adding stage-local contract tests for compiler and query pipelines.
Linked Issues check ✅ Passed The PR fully satisfies all acceptance criteria from issue #6: stage-local tests for all compiler stages (parser, graph, index, validation, snapshot) and query stages (normalizer, seeder, ranker, frontier, projector, synthesis), with existing end-to-end tests untouched.
Out of Scope Changes check ✅ Passed All changes are directly in-scope: two new test files (compiler-contracts.test.ts and query-contracts.test.ts) providing the stage-local contract coverage specified in issue #6; no unrelated modifications.
Description check ✅ Passed The pull request description is comprehensive and well-structured, covering all major sections of the template with clear details about what is being added, why, and how it validates against defined checks.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch devin/1776043509-stage-contract-tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces stage-local contract tests for the compiler and query pipelines in tests/compiler-contracts.test.ts and tests/query-contracts.test.ts. The reviewer provided feedback focused on improving test robustness and determinism, specifically recommending the use of relative paths instead of process.cwd() and fixed timestamps for snapshots to ensure reproducible results. Additionally, the feedback suggests replacing magic numbers with constants and adding explicit assertions to conditional tests to prevent silent passes and ensure meaningful coverage.

if (cachedOutput) {
return cachedOutput;
}
const sourcePath = resolve(process.cwd(), 'FPF-spec.md');
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using process.cwd() to resolve the source file path is fragile as it depends on the directory from which the test runner is executed. This can lead to failures if tests are run from a different directory level (e.g., in a monorepo or a sub-package). It is safer to resolve paths relative to the test file's location.

Comment thread tests/compiler-contracts.test.ts Outdated
cachedOutput = compileFpfSource({
sourcePath,
sourceHash,
builtAt: new Date().toISOString(),
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using new Date().toISOString() for builtAt introduces non-determinism into the cached compiler output. For contract tests, especially those verifying snapshot consistency and determinism, using a fixed timestamp is preferred to ensure reproducible results across different runs.

Suggested change
builtAt: new Date().toISOString(),
builtAt: '2025-01-01T00:00:00.000Z',

Comment thread tests/compiler-contracts.test.ts Outdated
Comment on lines +43 to +46
expect(validation.parsedSections).toBeGreaterThan(100);
expect(validation.parsedPatterns).toBeGreaterThan(50);
expect(validation.parsedRoutes).toBeGreaterThan(0);
expect(validation.parsedLexiconEntries).toBeGreaterThan(5);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

These assertions use arbitrary magic numbers as thresholds. While they serve as a baseline for the current FPF-spec.md, they make the tests brittle and tightly coupled to the specific content of the source document. If the document is refactored or reduced in size, these tests may fail even if the compiler logic is correct. Consider defining these thresholds as constants.

if (cachedSnapshot) {
return cachedSnapshot;
}
const sourcePath = resolve(process.cwd(), 'FPF-spec.md');
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using process.cwd() to resolve the source file path is fragile. It is recommended to use a path relative to the test file's location to ensure the tests run correctly regardless of the current working directory.

Comment thread tests/query-contracts.test.ts Outdated
const output = compileFpfSource({
sourcePath,
sourceHash,
builtAt: new Date().toISOString(),
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using a dynamic date for builtAt in the test snapshot can lead to inconsistent results. Using a fixed timestamp is better for determinism in contract tests.

Suggested change
builtAt: new Date().toISOString(),
builtAt: '2025-01-01T00:00:00.000Z',

Comment thread tests/query-contracts.test.ts Outdated
Comment on lines +59 to +62
if (firstRoute) {
const trace = engine(snapshot).trace(`Tell me about the ${firstRoute} route`);
expect(trace.detected.routeNames).toContain(firstRoute);
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This test is conditional on firstRoute being defined. If the parser fails to find any routes, the test will pass silently without actually asserting the behavior. It should explicitly check that routes exist before proceeding.

Suggested change
if (firstRoute) {
const trace = engine(snapshot).trace(`Tell me about the ${firstRoute} route`);
expect(trace.detected.routeNames).toContain(firstRoute);
}
expect(routeNames.length).toBeGreaterThan(0);
const firstRoute = routeNames[0]!;
const trace = engine(snapshot).trace('Tell me about the ' + firstRoute + ' route');
expect(trace.detected.routeNames).toContain(firstRoute);

Comment thread tests/query-contracts.test.ts Outdated
Comment on lines +69 to +73
if (statusKeys.length > 0) {
const firstStatus = statusKeys[0]!;
const trace = engine(snapshot).trace(`Show me ${firstStatus} patterns`);
expect(trace.detected.statusTerms).toContain(firstStatus);
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Similar to the route detection test, this test passes silently if statusKeys is empty. Asserting that status terms are present in the index ensures that the test coverage is meaningful and that the parser is working as expected.

    expect(statusKeys.length).toBeGreaterThan(0);
    const firstStatus = statusKeys[0]!;
    const trace = engine(snapshot).trace('Show me ' + firstStatus + ' patterns');
    expect(trace.detected.statusTerms).toContain(firstStatus);

Comment thread tests/query-contracts.test.ts Outdated
Comment on lines +191 to +196
if (trace.retrievalHops.length > 0) {
const firstHop = trace.retrievalHops[0]!;
expect(firstHop.iteration).toBe(1);
expect(firstHop.reason.length).toBeGreaterThan(0);
expect(typeof firstHop.sufficientAfter).toBe('boolean');
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The assertions for hop metadata are wrapped in an if block. If no hops are recorded, the assertions are skipped, and the test passes. This could mask a bug where the engine fails to perform any retrieval hops for a complex query. You should assert that at least one hop occurred.

    expect(trace.retrievalHops.length).toBeGreaterThan(0);
    const firstHop = trace.retrievalHops[0]!;
    expect(firstHop.iteration).toBe(1);
    expect(firstHop.reason.length).toBeGreaterThan(0);
    expect(typeof firstHop.sufficientAfter).toBe('boolean');

Comment thread tests/query-contracts.test.ts Outdated
Comment on lines +203 to +205
if (trace.sufficient) {
expect(trace.selectedAnchorIds.length).toBeGreaterThan(0);
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This test only verifies anchor selection if the trace is already marked as sufficient. For a known pattern query like 'What is A.1.1?', the engine should always reach sufficiency. Asserting sufficiency directly makes the test more robust.

    expect(trace.sufficient).toBe(true);
    expect(trace.selectedAnchorIds.length).toBeGreaterThan(0);

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds stage-local contract tests for the runtime compiler and query engine so failures localize to a specific pipeline stage rather than surfacing only as end-to-end mismatches.

Changes:

  • Introduces compiler stage contract tests (parser, graph closure, indexes, validation, determinism) against FPF-spec.md.
  • Introduces query stage contract tests (normalization, seeding, ranking, frontier bounds, projection, synthesis isolation, trace determinism) against a compiled snapshot.
  • Keeps existing end-to-end/integration tests unchanged while expanding stage-level evidence coverage.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

File Description
tests/compiler-contracts.test.ts Adds stage-local compiler contract tests validating parse/graph/index/validation/determinism invariants.
tests/query-contracts.test.ts Adds stage-local query contract tests validating normalization, candidate seeding/ranking, frontier bounds, projection behavior, synthesis isolation, and trace determinism.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread tests/query-contracts.test.ts Outdated
Comment on lines +56 to +62
const routeNames = Object.values(snapshot.routeGraph.nodes).map((r) => r.name);
const firstRoute = routeNames[0];

if (firstRoute) {
const trace = engine(snapshot).trace(`Tell me about the ${firstRoute} route`);
expect(trace.detected.routeNames).toContain(firstRoute);
}
Copy link

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test can silently pass if routeGraph.nodes is empty (or contains routes without names) because it guards on if (firstRoute) and otherwise performs no assertions. Since the contract is specifically about route-name detection, assert a non-empty route set (e.g., expect(routeNames.length).toBeGreaterThan(0)) and then run the detection expectations unconditionally so regressions can’t be skipped.

Suggested change
const routeNames = Object.values(snapshot.routeGraph.nodes).map((r) => r.name);
const firstRoute = routeNames[0];
if (firstRoute) {
const trace = engine(snapshot).trace(`Tell me about the ${firstRoute} route`);
expect(trace.detected.routeNames).toContain(firstRoute);
}
const routeNames = Object.values(snapshot.routeGraph.nodes)
.map((r) => r.name)
.filter((name): name is string => typeof name === 'string' && name.length > 0);
expect(routeNames.length).toBeGreaterThan(0);
const firstRoute = routeNames[0]!;
const trace = engine(snapshot).trace(`Tell me about the ${firstRoute} route`);
expect(trace.detected.routeNames).toContain(firstRoute);

Copilot uses AI. Check for mistakes.
Comment thread tests/query-contracts.test.ts Outdated
Comment on lines +69 to +73
if (statusKeys.length > 0) {
const firstStatus = statusKeys[0]!;
const trace = engine(snapshot).trace(`Show me ${firstStatus} patterns`);
expect(trace.detected.statusTerms).toContain(firstStatus);
}
Copy link

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to the route-name test: this can become a vacuous pass if the status index ever becomes empty because the assertions are skipped under if (statusKeys.length > 0). Consider asserting statusKeys.length > 0 (or using a known status term like stable) so the test reliably fails when status-term detection breaks.

Suggested change
if (statusKeys.length > 0) {
const firstStatus = statusKeys[0]!;
const trace = engine(snapshot).trace(`Show me ${firstStatus} patterns`);
expect(trace.detected.statusTerms).toContain(firstStatus);
}
expect(statusKeys.length).toBeGreaterThan(0);
const firstStatus = statusKeys[0]!;
const trace = engine(snapshot).trace(`Show me ${firstStatus} patterns`);
expect(trace.detected.statusTerms).toContain(firstStatus);

Copilot uses AI. Check for mistakes.
Comment thread tests/query-contracts.test.ts Outdated
Comment on lines +113 to +114
const routeCandidates = trace.candidateScores.filter((c) => c.kind === 'route');
expect(routeCandidates.length).toBeGreaterThan(0);
Copy link

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is intended to validate seeding route expansion candidates, but it inspects trace.candidateScores, which is a ranked + truncated view (slice(0, 16)). A route can be correctly seeded yet not appear in the top 16, causing a false failure. Prefer asserting against trace.frontierCandidates (e.g., presence of origin === 'route_expansion') which represents the seed set without truncation.

Suggested change
const routeCandidates = trace.candidateScores.filter((c) => c.kind === 'route');
expect(routeCandidates.length).toBeGreaterThan(0);
const routeFrontier = trace.frontierCandidates.filter(
(c) => c.origin === 'route_expansion',
);
expect(routeFrontier.length).toBeGreaterThan(0);

Copilot uses AI. Check for mistakes.
Comment thread tests/query-contracts.test.ts Outdated
// threshold (100) and total count stays low relative to the full catalog.
const highScoring = trace.candidateScores.filter((c) => c.score >= 100);
expect(highScoring.length).toBe(0);
expect(trace.candidateScores.length).toBeLessThan(
Copy link

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

trace.candidateScores is always capped to the top 16 candidates (see QueryEngine.trace()), so the assertion about “total count stays low relative to the full catalog” is not actually testing what it claims—it will almost always be true regardless of how many candidates were generated. To make this contract meaningful, assert against an un-truncated signal (e.g., trace.frontierCandidates.length) or extend TraceResult to include the full candidate count.

Suggested change
expect(trace.candidateScores.length).toBeLessThan(
expect(trace.frontierCandidates.length).toBeLessThan(

Copilot uses AI. Check for mistakes.
Comment thread tests/compiler-contracts.test.ts Outdated
Comment on lines +193 to +204
it('status index keys partition compiled nodes without overlap', async () => {
const { snapshot } = await getCompilerOutput();
const statusIndex = snapshot.indexes.statusIndex;

expect(Object.keys(statusIndex).length).toBeGreaterThan(0);

for (const [_status, nodeIds] of Object.entries(statusIndex)) {
for (const nodeId of nodeIds) {
expect(snapshot.compiledNodes[nodeId]).toBeDefined();
}
}
});
Copy link

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test name/comment claims the status index “partitions compiled nodes without overlap”, but the assertions only check that referenced node IDs exist. Either tighten the assertions to actually verify the partition/overlap property (likely for pattern nodes only), or rename the test to reflect what it currently validates (e.g., that status index entries resolve to compiled nodes).

Copilot uses AI. Check for mistakes.
devin-ai-integration[bot]

This comment was marked as resolved.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c0595bfd6e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread tests/query-contracts.test.ts Outdated
Comment on lines +69 to +72
if (statusKeys.length > 0) {
const firstStatus = statusKeys[0]!;
const trace = engine(snapshot).trace(`Show me ${firstStatus} patterns`);
expect(trace.detected.statusTerms).toContain(firstStatus);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Use a supported status token in normalizer contract test

This test picks statusKeys[0] from snapshot.indexes.statusIndex and expects it in trace.detected.statusTerms, but QueryEngine.trace() only emits the fixed tokens draft, stable, stub, and transitional. The status index currently also contains values like new, full text, and transitional stub, so a harmless change in insertion order or source ordering can make this test fail even when normalizer behavior is unchanged, creating a flaky CI gate.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Valid point — the normalizer only detects the fixed tokens draft, stable, stub, transitional via substring matching, so picking an arbitrary statusKeys[0] could yield a key like new or full text that the normalizer never emits.

Fixed in 3239888: the test now finds a known token from the hardcoded list that exists in the status index, ensuring the test validates actual normalizer behavior.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/compiler-contracts.test.ts`:
- Around line 193-203: The test currently only checks that each nodeId in
statusIndex exists in snapshot.compiledNodes but doesn't ensure nodeIds are
unique across status keys; update the test (referencing statusIndex, snapshot,
compiledNodes) to assert there is no overlap by tracking seen node IDs (e.g., a
Set) as you iterate all Object.entries(statusIndex) and for each nodeId assert
it's not already present before adding it, and optionally assert that the total
number of unique node IDs equals the sum of all nodeIds lengths to catch
duplicates.
- Around line 220-236: The tests currently only run against live production data
(getCompilerOutput) so they won't detect regressions that stop reporting issues;
add a deterministic malformed fixture and a unit test that feeds it through the
same validation path and asserts specific findings. Create a new test case
(e.g., in the same suite 'Compiler / Validation stage') that builds a small
synthetic snapshot with intentionally missing required fields and an unresolved
reference, call the same validation logic used by getCompilerOutput (reuse the
functions that populate snapshot.validation or invoke the validator directly),
then assert exact expected counts on validation.missingRequiredFields and
validation.unresolvedReferences (or indexMapNodes if applicable) to ensure the
validator flags the known-bad inputs. Ensure the test references
getCompilerOutput/validator entrypoint and the validation fields
missingRequiredFields and unresolvedReferences/indexMapNodes so future
regressions fail deterministically.
- Around line 38-82: The tests under "Compiler / Parser stage" currently only
validate the canonical snapshot returned by getCompilerOutput(); add additional
test cases that recompile semantic-equivalent variants (whitespace-only,
reordered sections, and prose-only edits) and assert the same invariants
(parsedSections/parsedPatterns/parsedRoutes/parsedLexiconEntries counts,
compiledNodes IDs, pattern metadata for 'A.1.1', and valid anchors) hold for
each variant; implement helper functions to produce or load variant inputs, call
the same getCompilerOutput or a variant-taking compile function, and run the
existing assertions against each resulting snapshot to ensure parser resilience.

In `@tests/query-contracts.test.ts`:
- Around line 268-316: The tests for fallback synthesis check status, ids,
citations, and relations but don’t assert that the fallback answer and
constraints remain identical to the deterministic baseline; update the three
tests using engine(...) and LocalAnswerSynthesizer (the "returns deterministic
answer when synthesizer is unavailable", "falls back to deterministic answer
when synthesizer throws", and "does not alter deterministic IDs or citations
when synthesis fails") to also capture the deterministic baseline by calling
eng.query(...) (or engine(snapshot) once) and then assert that result.answer and
result.constraints equal the baseline's answer and constraints (for both the
unavailable and throwing synthesizer cases) so any regression that changes
answer content or constraints will fail.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: bf33b511-7b48-4d5a-a9f4-7322e0c100c8

📥 Commits

Reviewing files that changed from the base of the PR and between aa03982 and c0595bf.

📒 Files selected for processing (2)
  • tests/compiler-contracts.test.ts
  • tests/query-contracts.test.ts

Comment on lines +38 to +82
describe('Compiler / Parser stage', () => {
it('parses a non-trivial number of sections, patterns, routes, and lexicon entries', async () => {
const { snapshot } = await getCompilerOutput();
const { validation } = snapshot;

expect(validation.parsedSections).toBeGreaterThan(100);
expect(validation.parsedPatterns).toBeGreaterThan(50);
expect(validation.parsedRoutes).toBeGreaterThan(0);
expect(validation.parsedLexiconEntries).toBeGreaterThan(5);
});

it('assigns IDs to all compiled nodes and none are empty strings', async () => {
const { snapshot } = await getCompilerOutput();
const nodeIds = Object.keys(snapshot.compiledNodes);

expect(nodeIds.length).toBeGreaterThan(50);
for (const nodeId of nodeIds) {
expect(nodeId.length).toBeGreaterThan(0);
}
});

it('preserves pattern metadata fields (title, status, part)', async () => {
const { snapshot } = await getCompilerOutput();
const pattern = snapshot.patternGraph.nodes['A.1.1'];

expect(pattern).toBeDefined();
expect(pattern!.title.length).toBeGreaterThan(0);
expect(pattern!.status.length).toBeGreaterThan(0);
expect(pattern!.sectionIds.length).toBeGreaterThan(0);
});

it('produces anchors with valid line ranges', async () => {
const { snapshot } = await getCompilerOutput();
const anchors = Object.values(snapshot.anchorMap);

expect(anchors.length).toBeGreaterThan(50);
for (const anchor of anchors.slice(0, 20)) {
expect(anchor.lineStart).toBeGreaterThanOrEqual(0);
expect(anchor.lineEnd).toBeGreaterThan(anchor.lineStart);
}

const nonEmpty = anchors.filter((a) => a.text.length > 0);
expect(nonEmpty.length).toBeGreaterThan(anchors.length / 2);
});
});
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Parser resilience is not actually under test here.

These assertions only inspect the canonical FPF-spec.md snapshot. They never recompile whitespace-only, reordered, or prose-only variants, so a regression on semantically equivalent source edits would still pass this whole block.

Suggested direction
+  it('preserves semantic IR under whitespace-only rewrites', async () => {
+    const sourcePath = resolve(process.cwd(), 'FPF-spec.md');
+    const sourceText = await readFile(sourcePath, 'utf8');
+    const rewritten = sourceText
+      .replace(/[ \t]+$/gm, '')
+      .replace(/\n{3,}/g, '\n\n');
+    const builtAt = '2025-01-01T00:00:00.000Z';
+
+    const base = compileFpfSource({
+      sourcePath,
+      sourceHash: createHash('sha256').update(sourceText).digest('hex'),
+      builtAt,
+      sourceText,
+    });
+    const variant = compileFpfSource({
+      sourcePath,
+      sourceHash: createHash('sha256').update(rewritten).digest('hex'),
+      builtAt,
+      sourceText: rewritten,
+    });
+
+    expect(Object.keys(variant.snapshot.patternGraph.nodes)).toEqual(
+      Object.keys(base.snapshot.patternGraph.nodes),
+    );
+    expect(Object.keys(variant.snapshot.routeGraph.nodes)).toEqual(
+      Object.keys(base.snapshot.routeGraph.nodes),
+    );
+  });
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/compiler-contracts.test.ts` around lines 38 - 82, The tests under
"Compiler / Parser stage" currently only validate the canonical snapshot
returned by getCompilerOutput(); add additional test cases that recompile
semantic-equivalent variants (whitespace-only, reordered sections, and
prose-only edits) and assert the same invariants
(parsedSections/parsedPatterns/parsedRoutes/parsedLexiconEntries counts,
compiledNodes IDs, pattern metadata for 'A.1.1', and valid anchors) hold for
each variant; implement helper functions to produce or load variant inputs, call
the same getCompilerOutput or a variant-taking compile function, and run the
existing assertions against each resulting snapshot to ensure parser resilience.

Comment thread tests/compiler-contracts.test.ts Outdated
Comment on lines +193 to +203
it('status index keys partition compiled nodes without overlap', async () => {
const { snapshot } = await getCompilerOutput();
const statusIndex = snapshot.indexes.statusIndex;

expect(Object.keys(statusIndex).length).toBeGreaterThan(0);

for (const [_status, nodeIds] of Object.entries(statusIndex)) {
for (const nodeId of nodeIds) {
expect(snapshot.compiledNodes[nodeId]).toBeDefined();
}
}
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

This does not verify the “without overlap” part of the contract.

The loop only checks that every indexed ID exists in compiledNodes. If the same node is emitted under multiple status keys, this test still passes.

Suggested fix
   expect(Object.keys(statusIndex).length).toBeGreaterThan(0);

-  for (const [_status, nodeIds] of Object.entries(statusIndex)) {
+  const seen = new Map<string, string>();
+  for (const [status, nodeIds] of Object.entries(statusIndex)) {
     for (const nodeId of nodeIds) {
       expect(snapshot.compiledNodes[nodeId]).toBeDefined();
+      expect(seen.has(nodeId)).toBe(false);
+      seen.set(nodeId, status);
     }
   }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
it('status index keys partition compiled nodes without overlap', async () => {
const { snapshot } = await getCompilerOutput();
const statusIndex = snapshot.indexes.statusIndex;
expect(Object.keys(statusIndex).length).toBeGreaterThan(0);
for (const [_status, nodeIds] of Object.entries(statusIndex)) {
for (const nodeId of nodeIds) {
expect(snapshot.compiledNodes[nodeId]).toBeDefined();
}
}
it('status index keys partition compiled nodes without overlap', async () => {
const { snapshot } = await getCompilerOutput();
const statusIndex = snapshot.indexes.statusIndex;
expect(Object.keys(statusIndex).length).toBeGreaterThan(0);
const seen = new Map<string, string>();
for (const [status, nodeIds] of Object.entries(statusIndex)) {
for (const nodeId of nodeIds) {
expect(snapshot.compiledNodes[nodeId]).toBeDefined();
expect(seen.has(nodeId)).toBe(false);
seen.set(nodeId, status);
}
}
});
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/compiler-contracts.test.ts` around lines 193 - 203, The test currently
only checks that each nodeId in statusIndex exists in snapshot.compiledNodes but
doesn't ensure nodeIds are unique across status keys; update the test
(referencing statusIndex, snapshot, compiledNodes) to assert there is no overlap
by tracking seen node IDs (e.g., a Set) as you iterate all
Object.entries(statusIndex) and for each nodeId assert it's not already present
before adding it, and optionally assert that the total number of unique node IDs
equals the sum of all nodeIds lengths to catch duplicates.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test was renamed from "status index keys partition compiled nodes without overlap" to "status index entries resolve to existing compiled nodes" in 3239888, so the name now accurately describes what the assertions check (existence only).

The overlap assertion is intentionally omitted because the FPF status index legitimately assigns nodes to multiple statuses (e.g., a pattern can be both "stub" and "transitional"). Adding a uniqueness constraint would assert a property the data model doesn't guarantee.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Skipped: comment is from another GitHub bot.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test was renamed from "status index keys partition compiled nodes without overlap" to "status index entries resolve to existing compiled nodes" in 3239888, so the name now accurately describes what the assertions check (existence only).

The overlap assertion is intentionally omitted because the FPF status index legitimately assigns nodes to multiple statuses (e.g., a pattern can be both "stub" and "transitional"). Adding a uniqueness constraint would assert a property the data model doesn't guarantee.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Skipped: comment is from another GitHub bot.

Comment on lines +220 to +236
describe('Compiler / Validation stage', () => {
it('keeps missing required fields bounded', async () => {
const { snapshot } = await getCompilerOutput();
const { validation } = snapshot;

// The FPF spec has a small number of patterns with incomplete metadata.
// The contract is that this stays bounded — a regression would spike it.
expect(validation.missingRequiredFields).toBeLessThan(25);
});

it('counts a plausible number of index map nodes', async () => {
const { snapshot } = await getCompilerOutput();
const { validation } = snapshot;

expect(validation.indexMapNodes).toBeGreaterThan(50);
expect(validation.indexMapNodes).toBe(Object.keys(snapshot.indexMap).length);
});
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

The validator is never exercised with known-bad input.

Both checks run against the current production spec, so a regression that stops reporting missing fields or unresolved references can still go green here. This stage needs at least one deliberately malformed fixture and explicit findings assertions.

Suggested direction
+  it('reports specific findings for malformed input', () => {
+    const sourceText = `
+## A.1.1
+Status: draft
+
+References: Missing.Id
+`;
+    const output = compileFpfSource({
+      sourcePath: 'inline',
+      sourceHash: createHash('sha256').update(sourceText).digest('hex'),
+      builtAt: '2025-01-01T00:00:00.000Z',
+      sourceText,
+    });
+
+    expect(output.snapshot.validation.missingRequiredFields).toBeGreaterThan(0);
+    expect(output.snapshot.validation.unresolvedReferences.length).toBeGreaterThan(0);
+  });
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/compiler-contracts.test.ts` around lines 220 - 236, The tests currently
only run against live production data (getCompilerOutput) so they won't detect
regressions that stop reporting issues; add a deterministic malformed fixture
and a unit test that feeds it through the same validation path and asserts
specific findings. Create a new test case (e.g., in the same suite 'Compiler /
Validation stage') that builds a small synthetic snapshot with intentionally
missing required fields and an unresolved reference, call the same validation
logic used by getCompilerOutput (reuse the functions that populate
snapshot.validation or invoke the validator directly), then assert exact
expected counts on validation.missingRequiredFields and
validation.unresolvedReferences (or indexMapNodes if applicable) to ensure the
validator flags the known-bad inputs. Ensure the test references
getCompilerOutput/validator entrypoint and the validation fields
missingRequiredFields and unresolvedReferences/indexMapNodes so future
regressions fail deterministically.

Comment thread tests/query-contracts.test.ts
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 32398887bb

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread tests/compiler-contracts.test.ts Outdated
sourceText: `${sourceText}\n<!-- change -->\n`,
});

expect(first.snapshot.sourceHash).not.toBe(second.snapshot.sourceHash);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Assert source hash changes through runtime path

This check is tautological for the compiler contract: compileFpfSource stores the caller-provided sourceHash in the snapshot, so passing two already-different hashes guarantees this assertion passes even if source-text handling regresses. As written, the test validates the test’s own hash precomputation rather than any compiler behavior, so a real compiler regression could slip through with green CI.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — you're right, this assertion is tautological since compileFpfSource just stores the caller-provided sourceHash verbatim. The test validates the test's own hash precomputation, not any compiler behavior.

I'll fix this by verifying that the compiler actually produces structurally different output when the source text changes (e.g., different compiled node count or different searchable text), rather than just checking the pass-through hash.

@devin-ai-integration devin-ai-integration Bot force-pushed the devin/1776043509-stage-contract-tests branch 3 times, most recently from ccd523a to 2fe8114 Compare April 14, 2026 03:25
@venikman
Copy link
Copy Markdown
Owner Author

Code Review

Overview

Adds two new test files — tests/compiler-contracts.test.ts (~295 lines) and tests/query-contracts.test.ts (~356 lines) — that assert stage-local invariants for each compiler stage (parser, graph closure, index round-trip, validation, determinism) and each query stage (normalizer, seeder, ranker, frontier, projection, synthesis isolation).

Strengths

  • Stage-local failure mode — this is the right framing. When a test fails, reviewer can tell immediately which stage broke instead of chasing a "answer is wrong" symptom across the whole pipeline.
  • Bounded assertionsMIN_SECTIONS = 100, MIN_PATTERNS = 50, duplicateIds < patternCount + 10 — loose enough to survive spec edits, tight enough to catch regressions.
  • Snapshot determinism testcompileFpfSource called twice yields byte-identical JSON. This is a real invariant of the compiler and a common thing to accidentally break.
  • Synthesis isolation — verifies that when synthesizer is unavailable/throws, deterministic IDs/citations are unchanged. Directly addresses the failure mode I flagged in the audit as risky.
  • Cached CompilerOutput across tests in the file — keeps the suite fast.

Issues & nits

Brittle to spec-specific IDs

expect(trace.detected.ids).toContain('A.1.1');
expect(snapshot.patternGraph.nodes['A.1.1']).toBeDefined();

Hardcoded IDs (A.1.1, C.17, B.3, etc.) appear in many tests. If the spec renames or renumbers, the whole suite breaks. Contract tests against a live spec are inherently coupled — acceptable, but worth a comment at the top explaining that A.1.1 is being used as a canonical fixture and where to update if it changes.

xyzzy plugh nonsense probe

const trace = engine(snapshot).trace('xyzzy plugh');
expect(trace.detected.ids).toEqual([]);

If the spec ever happens to mention either word, the test fails. Pick a truly impossible string like __FPFTEST_NONSENSE_999__ to be safe. Low-probability but zero-cost to fix.

Synthetic section test asserts weak invariant

const structurallyDifferent =
  secondSections > firstSections || secondIndexNodes > firstIndexNodes;
expect(structurallyDifferent).toBe(true);

Tests only that something grew. Doesn't verify the new section was actually parsed as Z.99. Stronger assertion: expect(second.snapshot.patternGraph.nodes['Z.99']).toBeDefined() or equivalent — proves the parser handled the new heading shape correctly.

Cached output across tests shares mutable state risk

cachedOutput is module-scoped. If any test mutates the returned snapshot, subsequent tests see the mutation. Today all consumers are read-only so this is fine; worth freezing the object (Object.freeze-ish) to make the invariant explicit.

Coverage gaps I'd add in a follow-up

Coupling with PR #31 split

This PR stands alone on current main, but the query-stage names (normalizer, seeder, ranker, frontier, projection, synthesis) map directly to the new files in #31. If #31 lands first, great — tests validate the split. If #32 lands first, the stage names are aspirational until #31 realigns. Would recommend landing #31 before or with this.

Recommendation

Approve. This is exactly the right test posture for a staged pipeline and directly supports the split refactor in #31. Small nits only — fix the nonsense-query token, strengthen the Z.99 assertion, and track the coverage-gap follow-ups.

@devin-ai-integration
Copy link
Copy Markdown
Contributor

@venikman All three nits fixed in 5972811:

  1. Nonsense probe'xyzzy plugh''__FPFTEST_NONSENSE_999__' across all three usages. Zero chance of spec collision.

  2. Z.99 assertion strengthened — Now asserts secondSections > firstSections directly (not the weaker "something grew" disjunction). Added comment explaining why Z.99 appears as a parsed section, not a patternGraph node — the compiler only promotes headings that match spec-catalog entries.

  3. Spec-ID fixture comments — Added canonical fixture ID documentation to both test file headers explaining that A.1.1, A.15, B.3 are stable spec anchors and where to update if renamed.

Re: coverage gaps (heuristic seed rules, session cache persistence, IndexingView classification) — agreed, these belong in follow-up PRs. The current suite covers the six query stages and five compiler stages.

Re: coupling with #31 — agreed on landing together. Stage names map 1:1 to #31's new files.

Copy link
Copy Markdown
Owner Author

@venikman venikman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One substantive issue before merge:

tests/query-contracts.test.ts is described as stage-local coverage, but the file still routes every assertion through new QueryEngine(...).trace() or .query(). That means a regression in ranking, frontier expansion, or projection can still fail the blocks labeled Normalizer or Seeder, so the PR does not yet provide true stage-local fault attribution for the query pipeline.

If the goal for #6 is real per-stage isolation, the query-side tests need a smaller seam than QueryEngine itself, for example by exposing and testing the individual query stages directly or by extracting stage helpers behind explicit contracts.

I validated the branch locally in an isolated snapshot and did not find a runtime regression beyond this review point.

@devin-ai-integration
Copy link
Copy Markdown
Contributor

Devin is archived and cannot be woken up. Please unarchive Devin if you want to continue using it.

devin-ai-integration Bot and others added 4 commits April 14, 2026 05:40
Closes #6

- compiler-contracts.test.ts: 5 describe blocks covering Parser resilience,
  Graph closure, Index round-trip, Validation coverage, and Snapshot determinism
- query-contracts.test.ts: 7 describe blocks covering Normalizer, Seeder,
  Ranker, Frontier expansion, Projection stability, Synthesis isolation,
  and cross-cutting Trace determinism

Each test targets a specific pipeline stage promise so that a failure pinpoints
the broken stage rather than surfacing as a generic 'end-to-end answer is wrong.'
Existing end-to-end tests remain untouched.

Co-Authored-By: Stanislau <nedbailov375426@gmail.com>
- Add total>0 guard before division in route resolution test (Devin Review)
- Use hardcoded normalizer tokens for status term test (Codex)
- Use frontierCandidates for route seeder and nonsense tests (Copilot)
- Rename status index test to match actual assertions (CodeRabbit)
- Fix hop metadata test to handle already-sufficient grounding
- Use real FPF IDs (A.1.1, A.15, B.3) for hop metadata query

Co-Authored-By: Stanislau <nedbailov375426@gmail.com>
Co-Authored-By: Stanislau <nedbailov375426@gmail.com>
…ec-ID fixture comments

- Replace 'xyzzy plugh' with '__FPFTEST_NONSENSE_999__' to avoid false
  failures if spec ever mentions those words
- Strengthen Z.99 structural assertion: verify parsedSections grows (not
  just 'something grew'), with comment explaining why Z.99 appears as a
  section not a pattern node
- Add canonical fixture ID comments to both test files explaining that
  A.1.1/A.15/B.3 are stable spec anchors and where to update if renamed

Co-Authored-By: Stanislau <nedbailov375426@gmail.com>
devin-ai-integration Bot and others added 2 commits April 14, 2026 05:40
Co-Authored-By: Stanislau <nedbailov375426@gmail.com>
Address venikman's review: replace QueryEngine.trace()/query() with direct
imports of normalizeQuery, seedCandidates, rankCandidates, expandGrounding,
buildPatternAnswer, synthesizeAnswer, confidenceFromTrace, and gapsFromTrace.

Each describe block now targets its stage function in isolation so a
regression in one stage cannot masquerade as a failure in another.

Co-Authored-By: Stanislau <nedbailov375426@gmail.com>
@devin-ai-integration devin-ai-integration Bot force-pushed the devin/1776043509-stage-contract-tests branch from 195ced1 to 13f7af7 Compare April 14, 2026 05:43
@devin-ai-integration
Copy link
Copy Markdown
Contributor

Addressed in 13f7af7 — refactored the entire query contract test file to call stage modules directly instead of routing through QueryEngine.trace()/.query():

Before: Every test imported QueryEngine and called .trace() or .query() — a regression in ranking could fail a test labeled "Normalizer."

After: Each describe block imports and calls its stage function directly:

  • NormalizernormalizeQuery() from query-normalizer.ts
  • SeederseedCandidates() from candidate-seeder.ts
  • RankerrankCandidates() from candidate-ranker.ts
  • FrontierexpandGrounding() from frontier-expander.ts
  • ProjectionbuildPatternAnswer(), confidenceFromTrace(), gapsFromTrace() from answer-projector.ts
  • SynthesissynthesizeAnswer() from synthesis-adapter.ts

The assembleTrace() helper (used only by projection/synthesis tests that need a full TraceResult) chains the stages explicitly without going through QueryEngine, so the seam is transparent and each stage's output is independently inspectable.

QueryEngine is no longer imported in the test file at all.

@venikman venikman merged commit 81f55e7 into main Apr 14, 2026
1 check passed
@venikman venikman deleted the devin/1776043509-stage-contract-tests branch May 9, 2026 10:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add stage-local contract tests for compiler and query pipelines

2 participants