Parse plain-text catalog dependency cells into typed relation edges#23
Parse plain-text catalog dependency cells into typed relation edges#23venikman wants to merge 1 commit into
Conversation
The parseLabeledRelations function only matched **bold:** markdown format, but catalog dependency cells use plain text like "Builds on: A.1, A.2. Constrains: B.3." This caused 148 of 237 patterns to lose typed semantic relations (builds_on, constrains, coordinates_with, etc.) from catalog metadata. Add a plain-text fallback parser that activates when the bold regex finds nothing. Uses lookahead on known RELATION_LABELS keys to split multi-label cells correctly. Extract pushRelationEdges helper to deduplicate the edge-creation logic. Impact: builds_on 353→1132, coordinates_with 167→583, prerequisite_for 37→68. Total +1396 typed relation edges. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
📝 WalkthroughWalkthroughModified the Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~12 minutes Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request refactors the parseLabeledRelations function to support both bold markdown and plain-text catalog formats for relations, introducing a pushRelationEdges helper to handle the edge creation logic. The reviewer suggested a performance optimization to move the dynamic regex construction for plain-text labels to a module-level constant or to memoize it, as the underlying labels are static and re-calculating the regex on every function call adds unnecessary overhead.
| const escapedLabels = Object.keys(RELATION_LABELS).map( | ||
| (label) => label.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'), | ||
| ); | ||
| const labelAlternation = escapedLabels.join('|'); | ||
| const plainRegex = new RegExp( | ||
| `(${labelAlternation}):\\s*(.*?)(?=(?:${labelAlternation}):|$)`, | ||
| 'gis', | ||
| ); |
There was a problem hiding this comment.
The plainRegex and its associated strings (escapedLabels, labelAlternation) are re-constructed on every call to parseLabeledRelations when bold relations are absent. Since RELATION_LABELS is a static constant, this regex should be defined as a module-level constant or memoized to avoid unnecessary overhead during the compilation process, especially as this function is called for every pattern in the catalog.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/runtime/compiler.ts`:
- Around line 1204-1216: The current logic skips plain-text label parsing when
relationEdges already contains matches (the if (relationEdges.length === 0)
branch), causing mixed-format cells to miss edges; remove the conditional so the
plainRegex pass always runs after the bold/initial parse: keep the escapedLabels
/ labelAlternation / plainRegex construction and the loop that calls
pushRelationEdges(relationEdges, sourceId, match[1] ?? '', match[2] ?? '',
sourceCitation), but execute it unconditionally (or after the bold-match pass
without gating on relationEdges.length) so both bold and plain labels in text
are parsed and appended to relationEdges.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
| if (relationEdges.length === 0) { | ||
| const escapedLabels = Object.keys(RELATION_LABELS).map( | ||
| (label) => label.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'), | ||
| ); | ||
| const labelAlternation = escapedLabels.join('|'); | ||
| const plainRegex = new RegExp( | ||
| `(${labelAlternation}):\\s*(.*?)(?=(?:${labelAlternation}):|$)`, | ||
| 'gis', | ||
| ); | ||
| for (const match of text.matchAll(plainRegex)) { | ||
| pushRelationEdges(relationEdges, sourceId, match[1] ?? '', match[2] ?? '', sourceCitation); | ||
| } | ||
| } |
There was a problem hiding this comment.
Parse plain-text labels even when bold labels already matched.
Line 1204 makes plain-text parsing conditional on zero bold matches, which drops edges for mixed-format cells (e.g., **Builds on:** A.1 Used by: B.2 loses used_by).
💡 Suggested fix
- if (relationEdges.length === 0) {
- const escapedLabels = Object.keys(RELATION_LABELS).map(
- (label) => label.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'),
- );
- const labelAlternation = escapedLabels.join('|');
- const plainRegex = new RegExp(
- `(${labelAlternation}):\\s*(.*?)(?=(?:${labelAlternation}):|$)`,
- 'gis',
- );
- for (const match of text.matchAll(plainRegex)) {
- pushRelationEdges(relationEdges, sourceId, match[1] ?? '', match[2] ?? '', sourceCitation);
- }
- }
+ const escapedLabels = Object.keys(RELATION_LABELS).map(
+ (label) => label.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'),
+ );
+ const labelAlternation = escapedLabels.join('|');
+ const plainRegex = new RegExp(
+ `(${labelAlternation}):\\s*(.*?)(?=(?:${labelAlternation}):|$)`,
+ 'gis',
+ );
+ for (const match of text.matchAll(plainRegex)) {
+ pushRelationEdges(relationEdges, sourceId, match[1] ?? '', match[2] ?? '', sourceCitation);
+ }📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if (relationEdges.length === 0) { | |
| const escapedLabels = Object.keys(RELATION_LABELS).map( | |
| (label) => label.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'), | |
| ); | |
| const labelAlternation = escapedLabels.join('|'); | |
| const plainRegex = new RegExp( | |
| `(${labelAlternation}):\\s*(.*?)(?=(?:${labelAlternation}):|$)`, | |
| 'gis', | |
| ); | |
| for (const match of text.matchAll(plainRegex)) { | |
| pushRelationEdges(relationEdges, sourceId, match[1] ?? '', match[2] ?? '', sourceCitation); | |
| } | |
| } | |
| const escapedLabels = Object.keys(RELATION_LABELS).map( | |
| (label) => label.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'), | |
| ); | |
| const labelAlternation = escapedLabels.join('|'); | |
| const plainRegex = new RegExp( | |
| `(${labelAlternation}):\\s*(.*?)(?=(?:${labelAlternation}):|$)`, | |
| 'gis', | |
| ); | |
| for (const match of text.matchAll(plainRegex)) { | |
| pushRelationEdges(relationEdges, sourceId, match[1] ?? '', match[2] ?? '', sourceCitation); | |
| } |
🧰 Tools
🪛 ast-grep (0.42.1)
[warning] 1208-1211: Regular expression constructed from variable input detected. This can lead to Regular Expression Denial of Service (ReDoS) attacks if the variable contains malicious patterns. Use libraries like 'recheck' to validate regex safety or use static patterns.
Context: new RegExp(
(${labelAlternation}):\\s*(.*?)(?=(?:${labelAlternation}):|$),
'gis',
)
Note: [CWE-1333] Inefficient Regular Expression Complexity [REFERENCES]
- https://owasp.org/www-community/attacks/Regular_expression_Denial_of_Service_-_ReDoS
- https://cwe.mitre.org/data/definitions/1333.html
(regexp-from-variable)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/runtime/compiler.ts` around lines 1204 - 1216, The current logic skips
plain-text label parsing when relationEdges already contains matches (the if
(relationEdges.length === 0) branch), causing mixed-format cells to miss edges;
remove the conditional so the plainRegex pass always runs after the bold/initial
parse: keep the escapedLabels / labelAlternation / plainRegex construction and
the loop that calls pushRelationEdges(relationEdges, sourceId, match[1] ?? '',
match[2] ?? '', sourceCitation), but execute it unconditionally (or after the
bold-match pass without gating on relationEdges.length) so both bold and plain
labels in text are parsed and appended to relationEdges.
There was a problem hiding this comment.
Pull request overview
Adds a plain-text fallback for parsing typed relation edges from catalog dependency cells/sections when relation labels are written as Builds on: A.1 (non-bold), and refactors edge creation into a shared helper to avoid duplication.
Changes:
- Renamed the existing bold-markdown relation regex and routed edge creation through a new
pushRelationEdgeshelper. - Added a plain-text label parser that activates as a fallback when no relation edges were produced by the bold pass.
- Introduced
pushRelationEdgesto centralize label normalization, relation kind lookup, and target ID extraction.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // Plain-text catalog format: Builds on: A.1, A.2. Constrains: B.3. | ||
| // Uses lookahead to split on the next known label or end-of-string. | ||
| if (relationEdges.length === 0) { | ||
| const escapedLabels = Object.keys(RELATION_LABELS).map( |
There was a problem hiding this comment.
The fallback condition is relationEdges.length === 0, which means the plain-text parser will run not only when the bold regex finds zero matches, but also when bold matches exist but produce zero edges (e.g., unknown labels or labels with no parseable IDs). If the intended behavior is “only fallback when the bold-markdown regex finds zero matches” (per PR description), track whether boldRegex matched at least once (separately from edge count) and use that to gate the fallback, or update the PR description to reflect the actual behavior.
| if (relationEdges.length === 0) { | ||
| const escapedLabels = Object.keys(RELATION_LABELS).map( | ||
| (label) => label.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'), | ||
| ); | ||
| const labelAlternation = escapedLabels.join('|'); | ||
| const plainRegex = new RegExp( | ||
| `(${labelAlternation}):\\s*(.*?)(?=(?:${labelAlternation}):|$)`, | ||
| 'gis', | ||
| ); |
There was a problem hiding this comment.
escapedLabels, labelAlternation, and the plainRegex are rebuilt on every parseLabeledRelations call that reaches the fallback. Since RELATION_LABELS is constant, consider hoisting a precomputed plainRegex (or precomputed labelAlternation) to module scope to avoid repeated allocations during full-spec compilation.
| // Plain-text catalog format: Builds on: A.1, A.2. Constrains: B.3. | ||
| // Uses lookahead to split on the next known label or end-of-string. | ||
| if (relationEdges.length === 0) { | ||
| const escapedLabels = Object.keys(RELATION_LABELS).map( | ||
| (label) => label.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'), | ||
| ); | ||
| const labelAlternation = escapedLabels.join('|'); | ||
| const plainRegex = new RegExp( | ||
| `(${labelAlternation}):\\s*(.*?)(?=(?:${labelAlternation}):|$)`, | ||
| 'gis', | ||
| ); | ||
| for (const match of text.matchAll(plainRegex)) { | ||
| pushRelationEdges(relationEdges, sourceId, match[1] ?? '', match[2] ?? '', sourceCitation); | ||
| } | ||
| } |
There was a problem hiding this comment.
New plain-text relation parsing is introduced here, but the existing integration tests only assert that the snapshot has some relations (e.g., outline_child / explicit_reference) and don’t validate any typed catalog relations. Consider adding an assertion in the runtime snapshot test to check at least one known builds_on / coordinates_with edge from a section that uses the plain Builds on: ... format, so regressions in this parser are caught.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 737c03c56e
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| ); | ||
| const labelAlternation = escapedLabels.join('|'); | ||
| const plainRegex = new RegExp( | ||
| `(${labelAlternation}):\\s*(.*?)(?=(?:${labelAlternation}):|$)`, |
There was a problem hiding this comment.
Stop plain-text parser at unknown label boundaries
The plain-text regex only terminates a relation value at the next known relation label or end-of-string, so IDs that belong to other labels get misattributed to the previous relation. For example, FPF-spec.md row A.6.1 has Builds on: ... Instances: USM (A.2.6), UNM (A.19), and this parser currently emits builds_on edges to A.2.6 and A.19. That silently corrupts the relation graph and can skew retrieval/provenance because extractIds runs over text that should have been excluded from the Builds on segment.
Useful? React with 👍 / 👎.
|
Superseded by #24 which includes the full Mastra convention restructure |
Summary
parseLabeledRelationsincompiler.tsfor catalog dependency cells that useBuilds on: A.1format instead of**Builds on:** A.1pushRelationEdgeshelper to deduplicate edge-creation logic between bold and plain-text parsersImpact
builds_oncoordinates_withprerequisite_forused_by218 of 224 patterns with dependency text now have parsed typed relations (was 76 before). The remaining 6 reference non-standard IDs (
P-4,CHR-CAL,Part C (CHR)) thatextractIdscannot parse.Test plan
npx rstest run)refresh_fpf_indexMCP toolbuilds_onandcoordinates_withedges (was onlyexplicit_reference)ask_fpf,query_fpf_spec,inspect_fpf_node,trace_fpf_pathall return correct results against the improved index🤖 Generated with Claude Code
Summary by CodeRabbit