parser: detect bold-as-heading + collapse letter-spacing (fixes filing parse) #12

Merged
hallelx2 merged 4 commits into main from fix/parser-bold-headings
May 26, 2026

@hallelx2 hallelx2 commented May 26, 2026

pkg/parser/pdf.go: detect bold-as-heading + collapse letter-spacing

SEC filings have no PDF outline and use bold at body font size (not larger
fonts) for section headings, so the size-only heading heuristic missed every
real section and collapsed the entire body into one giant block. Wide
letter-tracking on cover/header rows also extracted as "U N I T E D".

Three targeted changes:

  • Per-row bold detection from the glyph font name. Bold rows at >= median
    font size qualify as headings, nested one level below the smallest
    size-derived heading.

  • collapseLetterSpacing(): rejoins letter-tracked text only on rows whose
    pattern is unmistakable (majority single-char tokens), preserving word
    boundaries via runs of 2+ spaces. Normal prose is untouched.

  • looksLikeHeading: raise the word cap from 14 to 25 so verbose filing
    headings ("Item 2. Management's Discussion and Analysis of Financial
    Condition and Results of Operations") are not filtered out.

Validated on a real 10-Q (3M Q2 2023, 92 pages): one 680K-char blob became
174 retrievable sections (Item 1, Consolidated Balance Sheet, PART I, ...);
title "U N I T E D S T A T E S" became "UNITED STATES". All existing parser
tests pass; no regression.
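
As an illustration of the letter-spacing rule described above, here is a minimal sketch in Go. It reuses the names `collapseLetterSpacing`, `looksLetterSpaced`, and `multiSpaceRe` from this PR, but the bodies are a reconstruction from the description, not the merged code; the minimum of four tokens in particular is an assumption of this sketch.

```go
package main

import (
	"regexp"
	"strings"
)

// multiSpaceRe matches runs of two or more spaces, which are treated as
// word boundaries inside letter-tracked text.
var multiSpaceRe = regexp.MustCompile(` {2,}`)

// looksLetterSpaced reports whether a strict majority of the
// whitespace-separated tokens are single characters.
// The minimum of 4 tokens is an assumption, not taken from the PR.
func looksLetterSpaced(s string) bool {
	toks := strings.Fields(s)
	if len(toks) < 4 {
		return false
	}
	single := 0
	for _, t := range toks {
		if len([]rune(t)) == 1 {
			single++
		}
	}
	return single*2 > len(toks)
}

// collapseLetterSpacing rejoins letter-tracked rows. Rows that do not look
// letter-spaced are returned unchanged.
func collapseLetterSpacing(s string) string {
	if !looksLetterSpaced(s) {
		return s
	}
	words := multiSpaceRe.Split(strings.TrimSpace(s), -1)
	for i, w := range words {
		parts := strings.Fields(w)
		allSingle := true
		for _, p := range parts {
			if len([]rune(p)) > 1 {
				allSingle = false
				break
			}
		}
		// Only collapse a word when every token in it is a lone glyph.
		if allSingle {
			words[i] = strings.Join(parts, "")
		}
	}
	return strings.Join(words, " ")
}
```

In this sketch a row such as `"U N I T E D  S T A T E S"` (two spaces between the words) collapses to `"UNITED STATES"`, while ordinary prose fails the majority test and passes through untouched.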

Summary by Sourcery

Improve PDF filing parsing by treating bold body-size text as headings and collapsing artificial letter spacing to recover proper section structure and titles.

New Features:

  • Infer heading levels from bold font usage at or above the median body font size, nested beneath size-based headings.
  • Detect and collapse letter-spaced text patterns to reconstruct normal words in PDF rows.

Enhancements:

  • Relax heading word-length limits so long, verbose section titles in filings are still recognized as headings.

Summary by CodeRabbit

  • New Features
    • Oversized document sections are auto-split into smaller chunks with derived titles.
  • Improvements
    • Better heading detection using bold/typography and expanded heading heuristics.
    • Collapses spaced-out letter runs into normal words during text assembly.
    • Summarization prompts now produce a single retrieval-focused sentence per profile.
  • Reliability
    • Selection requests retry on parse failures, return empty selection on final failure, and aggregate usage/model info.
  • Tests
    • Added tests for chunking behavior and graceful non-JSON selection handling.

Copilot AI review requested due to automatic review settings May 26, 2026 01:51

sourcery-ai Bot commented May 26, 2026

Reviewer's Guide

Adds bold-weight-aware heading detection and robust letter-spacing collapsing to the PDF parser so SEC filings without outlines are split into sensible sections instead of a single blob, while expanding the heading heuristic to allow longer titles.

Flow diagram for bold-aware heading detection and letter-spacing collapsing in PDF parser

flowchart TD
  subgraph Extraction[Row extraction]
    A[extractPDFRows] --> B[Iterate glyphs in block]
    B --> C[Build text with spacing]
    C --> D[collapseLetterSpacing]
    B --> E[Count boldGlyphs using isBoldFont]
    D --> F[Create pdfRow.text]
    E --> G[Set pdfRow.bold]
  end

  subgraph HeadingDetection[Heading detection in Parse]
    H[Parse] --> I[buildHeadingLevelMap]
    I --> J[Compute boldLevel from levelForSize]
    J --> K[Iterate pdfRow]
    K --> L[Lookup lvl,isHeading from levelForSize]
    L --> M{!isHeading<br/>row.bold<br/>row.fontSize >= median<br/>looksLikeHeading}
    M -->|true| N[Set isHeading=true<br/>lvl=boldLevel]
    M -->|false| O[Keep original lvl,isHeading]
    N --> P{isHeading<br/>looksLikeHeading}
    O --> P
    P -->|true| Q[Adjust level via numberedHeadingDepth]
  end

  F --> K

File-Level Changes

| Change | Details | Files |
| --- | --- | --- |
| Augment heading detection to treat bold rows at body size as headings nested under size-derived headings. | • Compute a bold heading level one deeper than the deepest size-derived heading level, clamped to level 6.<br>• Extend the per-row heading classification to mark rows as headings when they are bold, at least median font size, and pass the heading heuristic.<br>• Preserve existing size-based heading detection and numbered sub-heading depth logic. | pkg/parser/pdf.go |
| Track boldness per row based on glyph font names and derive a row-level bold flag. | • Accumulate counts of non-whitespace glyphs and those using bold fonts while building row text from PDF glyphs.<br>• Introduce a bold field on pdfRow and set it when a strict majority of glyphs in the row are bold-weight.<br>• Add isBoldFont helper to recognize bold font faces via common font-name patterns such as 'bold', '-bd', and ',bd'. | pkg/parser/pdf.go |
| Collapse wide letter-spacing patterns into normal words while preserving word boundaries for non-letterspaced text. | • Introduce looksLetterSpaced to detect rows dominated by single-character tokens indicative of letter-tracked text.<br>• Implement collapseLetterSpacing to rejoin single-character tokens into words, preserving word boundaries via runs of multiple spaces and leaving non-letterspaced rows unchanged.<br>• Apply collapseLetterSpacing to each extracted row’s text before trimming and filtering empty rows.<br>• Add a multiSpaceRe regexp helper to split on runs of 2+ spaces when collapsing letter-spacing. | pkg/parser/pdf.go |
| Broaden the heading text heuristic to accept longer headings typical of SEC filings. | • Increase the maximum allowed word count in looksLikeHeading from 14 to 25.<br>• Update the accompanying documentation comment to describe the new limit and justify it with filing-style headings. | pkg/parser/pdf.go |
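
The row-level bold flag summarized above can be sketched roughly as follows. The `glyph` type and the `rowIsBold` helper are hypothetical stand-ins for whatever the parser's PDF library actually yields; only the `isBoldFont` name, the three font-name patterns, and the strict-majority rule come from the review text.

```go
package main

import "strings"

// glyph is a hypothetical stand-in for one extracted PDF character.
type glyph struct {
	Text string
	Font string
}

// isBoldFont recognizes bold faces from common font-name patterns
// ("bold", "-bd", ",bd"), compared case-insensitively.
func isBoldFont(name string) bool {
	n := strings.ToLower(name)
	return strings.Contains(n, "bold") ||
		strings.Contains(n, "-bd") ||
		strings.Contains(n, ",bd")
}

// rowIsBold (hypothetical helper) reports whether a strict majority of the
// non-whitespace glyphs in a row use a bold font.
func rowIsBold(glyphs []glyph) bool {
	total, bold := 0, 0
	for _, g := range glyphs {
		if strings.TrimSpace(g.Text) == "" {
			continue // whitespace glyphs do not count toward the ratio
		}
		total++
		if isBoldFont(g.Font) {
			bold++
		}
	}
	return total > 0 && bold*2 > total
}
```

Using a strict majority rather than "any bold glyph" keeps a body row with a single emphasized word from being promoted to a heading.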

coderabbitai Bot commented May 26, 2026

Walkthrough

Adds bold glyph tracking and letter-spaced normalization to PDF parsing, promotes bold rows to headings and splits oversized leaf sections into word-boundary chunks; introduces a selection LLM retry helper that aggregates usage and degrades to empty selection on persistent parse failures; updates ingest summarization prompts.

Changes

Bold Typography Heading Detection & Chunking

| Layer / File(s) | Summary |
| --- | --- |
| Row extraction: bold tracking and collapse letter-spacing<br>pkg/parser/pdf.go | Initialize per-row bold counters; detect bold fonts while assembling rows; collapse letter-spaced glyph runs into normal words; set pdfRow.bold from bold/total glyph ratios. |
| Heading detection and bold-level computation<br>pkg/parser/pdf.go | Compute a document-wide boldLevel from font-derived heading buckets; promote bold, median-or-larger rows that pass heading heuristics to headings using boldLevel; increase allowed heading word count. |
| Oversized leaf chunking and title derivation<br>pkg/parser/pdf.go | Post-process parsed sections with chunkOversizedLeaves; split oversized leaf content near word boundaries using thresholds and derive child titles from colon-terminated phrases or leading words. |
| Chunking unit tests<br>pkg/parser/chunk_test.go | Adds tests for oversized-leaf splitting, colon-header-derived titles, fallback titles, small-section pass-through, and recursive chunking of nested internals. |

Selection LLM Retry and Integration

| Layer / File(s) | Summary |
| --- | --- |
| Selection retry helper and SinglePass integration<br>pkg/retrieval/single_pass.go | Adds runSelectionWithRetry, imports log, refactors SinglePass.SelectWithCost to use the helper, aggregates usage across attempts, sets ModelUsed from budget, and returns empty selection after parse exhaustion while logging. |
| Caller update and test for non-JSON responses<br>pkg/retrieval/chunked_tree.go, pkg/retrieval/retrieval_test.go | reasonOverSliceWithCost builds a llmgate.Request and delegates selection/usage parsing to the retry helper; adds TestSinglePassGracefulOnNonJSON to assert graceful degradation and cumulative LLM call accounting when the model returns prose. |

Ingest summarization prompts

| Layer / File(s) | Summary |
| --- | --- |
| Summarization request and system prompt changes<br>pkg/ingest/ingest.go | Increase MaxTokens for summarization calls and rewrite user/system prompts to require a single retrieval-focused sentence (≤60 words), with profile-specific system instructions for research/medical/default. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I nibble glyphs and chase the bold,

I stitch spaced letters into words of gold.
Headings rise where font-weights show,
big leaves split so searches grow.
I hop, I test, the parser sings — hooray!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 58.82% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'parser: detect bold-as-heading + collapse letter-spacing (fixes filing parse)' directly corresponds to the main changes in pkg/parser/pdf.go: adding bold-typography support for heading detection and collapsing letter-tracked text runs. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@sourcery-ai sourcery-ai Bot left a comment

Hey - I've reviewed your changes and they look great!


Copilot AI left a comment

Pull request overview

This PR improves PDF parsing for SEC filings by enhancing heading detection (to handle bold, body-sized section headers) and by collapsing artificial letter-spacing in extracted rows to restore normal words.

Changes:

  • Add bold-row heading detection (based on glyph font names) and assign bold headings a level nested below size-derived headings (capped at level 6).
  • Collapse wide letter-tracking in extracted rows (e.g., "U N I T E D S T A T E S" → "UNITED STATES") while aiming to preserve word boundaries.
  • Relax looksLikeHeading word-count cap from 14 to 25 to allow verbose filing headings.
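
To make the level assignment in the first bullet concrete, here is a small sketch of how a bold heading level could be derived and applied. The helper names `boldLevelFor` and `classify` are hypothetical, and the `float64` map key is an assumption about what `roundSize` returns; the rule itself (one level below the deepest size-derived heading, capped at 6) is taken from the summary above.

```go
package main

// boldLevelFor (hypothetical helper) returns the level given to bold
// body-size headings: one deeper than the deepest size-derived level,
// capped at 6. With no size-derived headings it returns 1.
func boldLevelFor(levelForSize map[float64]int) int {
	deepest := 0
	for _, lvl := range levelForSize {
		if lvl > deepest {
			deepest = lvl
		}
	}
	lvl := deepest + 1
	if lvl > 6 {
		lvl = 6
	}
	return lvl
}

// classify (hypothetical helper) applies the per-row rule: size-derived
// headings keep their level; otherwise a bold row at or above the median
// size that also looks like a heading is promoted to boldLevel.
func classify(levelForSize map[float64]int, size, median float64,
	bold, headingLike bool, boldLevel int) (int, bool) {
	lvl, isHeading := levelForSize[size]
	if !isHeading && bold && size >= median && headingLike {
		return boldLevel, true
	}
	return lvl, isHeading
}
```

With size-derived levels for 18pt (level 1) and 14pt (level 2), a bold 10pt row at the median size would land at level 3 in this sketch.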

Comment thread pkg/parser/pdf.go
Comment on lines 150 to 155
lvl, isHeading := levelForSize[roundSize(row.fontSize)]
if !isHeading && row.bold && row.fontSize >= median && looksLikeHeading(text) {
	isHeading = true
	lvl = boldLevel
}
if isHeading && looksLikeHeading(text) {
Comment thread pkg/parser/pdf.go
Comment on lines +624 to +626
// runs of 2+ spaces; within each word the single spaces between solitary glyphs
// are removed ("F O R M 1 0 - Q" → "FORM 10-Q"). Rows that aren't
// letter-spaced are returned unchanged, so normal prose is never touched.
Comment thread pkg/parser/pdf.go
Comment on lines +615 to +619
for _, t := range toks {
	if len([]rune(t)) == 1 {
		single++
	}
}
Comment thread pkg/parser/pdf.go
Comment on lines +635 to +639
for _, p := range parts {
	if len([]rune(p)) > 1 {
		allSingle = false
		break
	}
@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/parser/pdf.go`:
- Around line 582-587: The PDF struct's documentation still claims headings are
"<= 14 words" but the code in function that parses headings (uses strings.Fields
and checks len(words) > 25) treats headings as up to 25 words; update the PDF
struct doc comment to reflect the current behavior (change the "<= 14 words"
phrase to "<= 25 words" or reword to match the implemented 25-word cap) so
documentation and the PDF struct comment are consistent with the heading
detection logic.

📥 Commits

Reviewing files that changed from the base of the PR and between 15940d3 and fe76029.

📒 Files selected for processing (1)
  • pkg/parser/pdf.go

Comment thread pkg/parser/pdf.go
Comment on lines +582 to +587
// Headings are rarely > 25 words and never end with sentence punctuation
// from the middle of a paragraph. (Filing headings like "Item 2.
// Management's Discussion and Analysis of Financial Condition and Results
// of Operations" run long, so the cap is generous.)
words := strings.Fields(s)
-if len(words) == 0 || len(words) > 14 {
+if len(words) == 0 || len(words) > 25 {
⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Documentation mismatch: struct comment still says 14 words.

Line 26 in the PDF struct doc comment states <= 14 words but the implementation now uses 25. Update the doc comment to stay in sync.

Suggested fix
-//  3. Treat any row whose font size exceeds a threshold (1.2× median)
-//     AND that is short (<= 14 words) as a heading candidate.
+//  3. Treat any row whose font size exceeds a threshold (1.2× median)
+//     AND that is short (<= 25 words) as a heading candidate.

…lures

The selection LLM call (chunked-tree slices and single-pass alike) sometimes
returns plain text instead of the JSON the schema asks for. Most often this
is Gemini briefly ignoring JSON mode. Today that surfaces as a 500 to the
SDK on every blip, plus the wasted LLM cost — and the SDK's transport-level
retry just repeats the same blow-up.

Wrap Complete + ParseSelection in a small retry loop (2 retries by default,
3 attempts total). On retry the last user message gets an extra "ONLY JSON,
no prose, no fences" reminder, which Gemini usually honors on the second
try. If all attempts still fail, log a warning and return an empty selection
so the HTTP request succeeds with no sections instead of erroring out — one
bad LLM response can no longer take down a multi-slice retrieval.

Test TestSinglePassGracefulOnNonJSON locks the behaviour: prose-only
response → empty selection, nil error, 3 LLM attempts counted in usage.
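
The retry behaviour described in this commit message can be sketched as below. This is not the `runSelectionWithRetry` helper from the PR: `Selection`, `completeFunc`, `selectWithRetry`, and the reminder wording are hypothetical stand-ins, and whether transport errors are retried is an assumption (here they are returned immediately).

```go
package main

import (
	"encoding/json"
	"log"
)

// Selection is a hypothetical stand-in for the parsed selection result.
type Selection struct {
	IDs []string `json:"ids"`
}

// completeFunc is a hypothetical stand-in for the LLM call: it returns the
// raw response text and the tokens used by that call.
type completeFunc func(prompt string) (text string, tokens int, err error)

// jsonReminder is appended on retries; the exact wording is an assumption.
const jsonReminder = "\n\nReturn ONLY JSON, no prose, no fences."

// selectWithRetry calls the model up to maxRetries+1 times, summing usage
// across attempts. If no attempt parses, it logs a warning and returns an
// empty selection with a nil error.
func selectWithRetry(complete completeFunc, prompt string, maxRetries int) (Selection, int, error) {
	totalTokens := 0
	var lastErr error
	for attempt := 0; attempt <= maxRetries; attempt++ {
		p := prompt
		if attempt > 0 {
			p += jsonReminder
		}
		text, tokens, err := complete(p)
		totalTokens += tokens
		if err != nil {
			// Assumption: transport errors are not retried at this layer.
			return Selection{}, totalTokens, err
		}
		var sel Selection
		parseErr := json.Unmarshal([]byte(text), &sel)
		if parseErr == nil {
			return sel, totalTokens, nil
		}
		lastErr = parseErr
	}
	log.Printf("selection: no valid JSON after %d attempts: %v", maxRetries+1, lastErr)
	return Selection{}, totalTokens, nil
}
```

Returning a nil error on exhaustion is the design choice the commit argues for: the caller sees "no sections selected" rather than a failed request, and the summed usage still reflects every attempt.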
@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
pkg/retrieval/chunked_tree.go (1)

112-115: 💤 Low value

Consider setting ModelUsed for consistency with SinglePass and Cached.

SinglePass.SelectWithCost and Cached.SelectWithCost both populate ModelUsed: budget.ModelName, but ChunkedTree leaves it empty. Downstream consumers (e.g., the query handler) may expect this field to be populated for usage reporting.

Suggested diff
 	return &Result{
 		SelectedIDs: c.Merge.Merge(allIDs),
+		ModelUsed:   budget.ModelName,
 		Usage:       totalUsage,
 	}, nil
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/retrieval/chunked_tree.go` around lines 112 - 115, The Result returned
from ChunkedTree.SelectWithCost doesn't set ModelUsed; update the return struct
in the ChunkedTree.SelectWithCost implementation to include ModelUsed:
budget.ModelName (matching SinglePass.SelectWithCost and Cached.SelectWithCost)
so downstream consumers get the model name for usage reporting.

📥 Commits

Reviewing files that changed from the base of the PR and between fe76029 and 1277324.

📒 Files selected for processing (3)
  • pkg/retrieval/chunked_tree.go
  • pkg/retrieval/retrieval_test.go
  • pkg/retrieval/single_pass.go

hallelx2 added 2 commits May 26, 2026 11:25
The current summary prompt asks for "a single factual sentence" — fine for
human reading, but the resulting summaries describe sections generically
("Cover page of 3M's 10-Q with company identification") instead of naming
their concrete topics ("registered debt securities, trading symbols MMM26
/ MMM30, NYSE listings, IRS employer ID"). The downstream retrieval LLM,
given only those summaries, then can't tell which section answers a
specific question — e.g. q_00941 ("Which debt securities are registered to
trade on a national exchange under 3M's name?") picks two "Long-Term Debt"
sections instead of the cover-page section that actually contains the
registration table.

Rewrite the summary prompt for retrieval: explicitly ask the model to name
the section's concrete entities, identifiers, table contents, named items,
and key numbers. One sentence, raised cap to ≤60 words (with MaxTokens
260) so dense sections aren't truncated mid-list. The domain framings
(research / medical / default) are preserved and now include the same
retrieval rule. Existing ingest tests pass.
…evable

Filing cover pages (and any other long, mixed-topic leaf section) produce
one 2-3k-char blob under a generic title like "3M COMPANY" — mixing
registration tables, addresses, IRS IDs, contact info. A single summary
can't cover all those topics, so retrieval picks unrelated "long-term
debt" sections instead of the one that actually holds the answer.

Add chunkOversizedLeaves: any LEAF section whose Content exceeds 2400
chars is replaced by a parent (title preserved) with smaller children at
the next level. Children are sized around 900 chars and split at word
boundaries. The chunk title prefers a natural colon-terminated header
within the first 80 chars ("Securities registered pursuant to Section
12(b) of the Act:") when available — exactly the pattern in filings —
otherwise the first ~60 chars trimmed at a word boundary, falling back
to "<parent title> — part N".
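
A rough sketch of the splitting step described above, under stated assumptions: the `Section` fields mirror those visible in the review diffs, while `splitAtWords` and `chunkOversized` are hypothetical names, sizes are measured in bytes, and whitespace inside a piece is normalized to single spaces. Title derivation is reduced to the fallback form here.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	maxLeafChars = 2400 // leaves longer than this are split (from the commit message)
	targetChunk  = 900  // approximate child size (from the commit message)
)

// Section mirrors the fields visible in the review diffs.
type Section struct {
	Level    int
	Title    string
	Content  string
	Children []Section
}

// splitAtWords greedily packs words into pieces of at most target bytes,
// never breaking inside a word. A single word longer than target becomes
// its own oversized piece.
func splitAtWords(content string, target int) []string {
	var pieces []string
	var cur strings.Builder
	for _, w := range strings.Fields(content) {
		if cur.Len() > 0 && cur.Len()+1+len(w) > target {
			pieces = append(pieces, cur.String())
			cur.Reset()
		}
		if cur.Len() > 0 {
			cur.WriteByte(' ')
		}
		cur.WriteString(w)
	}
	if cur.Len() > 0 {
		pieces = append(pieces, cur.String())
	}
	return pieces
}

// chunkOversized recurses into internal nodes and replaces each oversized
// leaf with a content-free parent plus children one level deeper.
func chunkOversized(secs []Section) []Section {
	out := make([]Section, 0, len(secs))
	for _, s := range secs {
		switch {
		case len(s.Children) > 0:
			s.Children = chunkOversized(s.Children)
		case len(s.Content) > maxLeafChars:
			parent := Section{Level: s.Level, Title: s.Title}
			for i, p := range splitAtWords(s.Content, targetChunk) {
				parent.Children = append(parent.Children, Section{
					Level:   s.Level + 1,
					Title:   fmt.Sprintf("%s — part %d", s.Title, i+1),
					Content: p,
				})
			}
			s = parent
		}
		out = append(out, s)
	}
	return out
}
```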

Internal nodes are recursed into but never split (they're already
structured). Threshold deliberately high (2400) so most paper sub-
sections aren't affected; combined with the retrieval-friendly summary
prompt (previous commit), each chunk gets a topic-rich summary downstream
so the retrieval LLM can match it to specific questions.

Tests in chunk_test.go: oversized leaf gets split with the parent title
preserved + children at level+1; first chunk takes the colon-header
title; small sections are untouched; oversized leaves nested inside
internal nodes are still split.
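
The chunk-title preference described in this commit message (colon-terminated header first, then leading words, then the fallback) might look like the sketch below. The signature matches the `deriveChunkTitle(piece, fallback)` call visible in the review diff, but the body is an assumption: it measures in bytes, treats any colon within the first 80 bytes as a header terminator, and falls back when no word boundary is found.

```go
package main

import "strings"

// deriveChunkTitle picks a title for one chunk of a split section.
func deriveChunkTitle(piece, fallback string) string {
	text := strings.TrimSpace(piece)
	if text == "" {
		return fallback
	}
	// 1. Prefer a colon-terminated header within the first 80 bytes.
	head := text
	if len(head) > 80 {
		head = head[:80]
	}
	if i := strings.Index(head, ":"); i > 0 {
		return strings.TrimSpace(text[:i+1])
	}
	// 2. Otherwise take the leading words, cut at the last space that
	// falls inside the first 60 bytes.
	if len(text) <= 60 {
		return text
	}
	cut := strings.LastIndex(text[:60], " ")
	if cut <= 0 {
		// 3. No usable word boundary: use the caller-supplied fallback.
		return fallback
	}
	return strings.TrimSpace(text[:cut])
}
```

A production version would likely need to be stricter about what counts as a header colon (for example, ignoring times such as "10:30"); this sketch does not attempt that.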
@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (2)
pkg/parser/pdf.go (2)

583-586: 💤 Low value

Consider applying chunking to the outline-based path for consistency.

The outline-based parsing path returns sections without chunkOversizedLeaves, while the heuristic path (line 233) applies it. Documents with outlines could still have oversized leaf sections (e.g., a long "Appendix" with no sub-bookmarks). Applying the same post-processing would ensure uniform behavior.

Suggested fix
 	return &ParsedDoc{
 		Title:    title,
-		Sections: rootSec.Children,
+		Sections: chunkOversizedLeaves(rootSec.Children),
 	}, true
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/parser/pdf.go` around lines 583 - 586, The outline-based return in the
PDF parser currently returns Sections directly (rootSec.Children) without
applying chunkOversizedLeaves, causing inconsistent handling versus the
heuristic path; modify the outline-based path that returns &ParsedDoc{Title:
title, Sections: rootSec.Children} to pass rootSec.Children through
chunkOversizedLeaves (the same helper used on the heuristic path) and use the
processed sections when constructing ParsedDoc so oversized leaf sections are
chunked consistently.

270-278: 💤 Low value

Child level may exceed the document-wide cap of 6.

When s.Level is already 6 (the max used elsewhere in heading detection), the children receive Level: 7. Other parts of the parser clamp levels to 6 (e.g., buildHeadingLevelMap, numberedHeadingDepth). For consistency:

Suggested fix
 		for i, piece := range pieces {
 			fallback := fmt.Sprintf("%s — part %d", s.Title, i+1)
+			childLevel := s.Level + 1
+			if childLevel > 6 {
+				childLevel = 6
+			}
 			parent.Children = append(parent.Children, Section{
-				Level:   s.Level + 1,
+				Level:   childLevel,
 				Title:   deriveChunkTitle(piece, fallback),
 				Content: piece,
 			})
 		}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/parser/pdf.go` around lines 270 - 278, Child sections created from pieces
can exceed the document cap of 6 (when s.Level == 6), so clamp the child Level
to a maximum of 6 when appending to parent.Children. In the loop that builds
children (referencing Section, s.Level, deriveChunkTitle, pieces and
parent.Children), compute childLevel := min(s.Level+1, 6) (or otherwise clamp to
6) and set Level: childLevel for each appended Section so child headings never
go above level 6.

📥 Commits

Reviewing files that changed from the base of the PR and between 1277324 and 5052ecb.

📒 Files selected for processing (3)
  • pkg/ingest/ingest.go
  • pkg/parser/chunk_test.go
  • pkg/parser/pdf.go

@hallelx2 hallelx2 merged commit 24672bd into main May 26, 2026
6 of 8 checks passed
@hallelx2 hallelx2 deleted the fix/parser-bold-headings branch May 26, 2026 23:45