parser: detect bold-as-heading + collapse letter-spacing (fixes filing parse) by hallelx2 · Pull Request #12 · hallelx2/vectorless-engine

hallelx2 · 2026-05-26T01:51:51Z

pkg/parser/pdf.go: detect bold-as-heading + collapse letter-spacing

SEC filings have no PDF outline and use bold at body font size (not larger
fonts) for section headings, so the size-only heading heuristic missed every
real section and collapsed the entire body into one giant block. Wide
letter-tracking on cover/header rows also extracted as "U N I T E D".

Three targeted changes:

- Per-row bold detection from the glyph font name. Bold rows at >= median
  font size qualify as headings, nested one level below the smallest
  size-derived heading.
- collapseLetterSpacing(): rejoins letter-tracked text only on rows whose
  pattern is unmistakable (majority single-char tokens), preserving word
  boundaries via runs of 2+ spaces. Normal prose is untouched.
- looksLikeHeading: raise the word cap from 14 to 25 so verbose filing
  headings ("Item 2. Management's Discussion and Analysis of Financial
  Condition and Results of Operations") are not filtered out.

Validated on a real 10-Q (3M Q2 2023, 92 pages): one 680K-char blob became
174 retrievable sections (Item 1, Consolidated Balance Sheet, PART I, ...);
title "U N I T E D S T A T E S" became "UNITED STATES". All existing parser
tests pass; no regression.
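
For readers who want to see the letter-spacing rule concretely, here is a minimal Go sketch of how such a collapse could work. It is not the code from pkg/parser/pdf.go: the names looksLetterSpaced, collapseLetterSpacing and multiSpaceRe come from the review summaries, but the bodies, the strict-majority test, and the minimum of four tokens are assumptions made for illustration.

```go
package main

import (
	"regexp"
	"strings"
)

// multiSpaceRe matches runs of two or more spaces, which this sketch treats
// as the real word boundaries inside a letter-tracked row.
var multiSpaceRe = regexp.MustCompile(` {2,}`)

// looksLetterSpaced reports whether a strict majority of the
// whitespace-separated tokens in s are single characters. The minimum of
// four tokens is an assumption of this sketch, not a value from the PR.
func looksLetterSpaced(s string) bool {
	toks := strings.Fields(s)
	if len(toks) < 4 {
		return false
	}
	single := 0
	for _, t := range toks {
		if len([]rune(t)) == 1 {
			single++
		}
	}
	return single*2 > len(toks)
}

// collapseLetterSpacing rejoins letter-tracked text such as
// "F O R M   1 0 - Q" into "FORM 10-Q". Rows that do not look
// letter-spaced are returned unchanged.
func collapseLetterSpacing(s string) string {
	if !looksLetterSpaced(s) {
		return s
	}
	words := multiSpaceRe.Split(strings.TrimSpace(s), -1)
	for i, w := range words {
		parts := strings.Fields(w)
		allSingle := true
		for _, p := range parts {
			if len([]rune(p)) > 1 {
				allSingle = false
				break
			}
		}
		if allSingle {
			// Every token is a solitary glyph: drop the tracking spaces.
			words[i] = strings.Join(parts, "")
		} else {
			// Mixed group: keep it as ordinary space-separated text.
			words[i] = strings.Join(parts, " ")
		}
	}
	return strings.Join(words, " ")
}
```

In this sketch the word boundary survives only when the extractor emitted at least two spaces between words; a row with uniform single spaces would be joined into one token.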

Summary by Sourcery

Improve PDF filing parsing by treating bold body-size text as headings and collapsing artificial letter spacing to recover proper section structure and titles.

New Features:

Infer heading levels from bold font usage at or above the median body font size, nested beneath size-based headings.
Detect and collapse letter-spaced text patterns to reconstruct normal words in PDF rows.

Enhancements:

Relax heading word-length limits so long, verbose section titles in filings are still recognized as headings.

Summary by CodeRabbit

New Features
- Oversized document sections are auto-split into smaller chunks with derived titles.
Improvements
- Better heading detection using bold/typography and expanded heading heuristics.
- Collapses spaced-out letter runs into normal words during text assembly.
- Summarization prompts now produce a single retrieval-focused sentence per profile.
Reliability
- Selection requests retry on parse failures, return empty selection on final failure, and aggregate usage/model info.
Tests
- Added tests for chunking behavior and graceful non-JSON selection handling.

sourcery-ai · 2026-05-26T01:51:56Z

Reviewer's Guide

Adds bold-weight-aware heading detection and robust letter-spacing collapsing to the PDF parser so SEC filings without outlines are split into sensible sections instead of a single blob, while expanding the heading heuristic to allow longer titles.

Flow diagram for bold-aware heading detection and letter-spacing collapsing in PDF parser

flowchart TD
  subgraph Extraction[Row extraction]
    A[extractPDFRows] --> B[Iterate glyphs in block]
    B --> C[Build text with spacing]
    C --> D[collapseLetterSpacing]
    B --> E[Count boldGlyphs using isBoldFont]
    D --> F[Create pdfRow.text]
    E --> G[Set pdfRow.bold]
  end

  subgraph HeadingDetection[Heading detection in Parse]
    H[Parse] --> I[buildHeadingLevelMap]
    I --> J[Compute boldLevel from levelForSize]
    J --> K[Iterate pdfRow]
    K --> L[Lookup lvl,isHeading from levelForSize]
    L --> M{!isHeading<br/>row.bold<br/>row.fontSize >= median<br/>looksLikeHeading}
    M -->|true| N[Set isHeading=true<br/>lvl=boldLevel]
    M -->|false| O[Keep original lvl,isHeading]
    N --> P{isHeading<br/>looksLikeHeading}
    O --> P
    P -->|true| Q[Adjust level via numberedHeadingDepth]
  end

  F --> K

File-Level Changes

Change	Details	Files
Augment heading detection to treat bold rows at body size as headings nested under size-derived headings.	Compute a bold heading level one deeper than the deepest size-derived heading level, clamped to level 6. Extend the per-row heading classification to mark rows as headings when they are bold, at least median font size, and pass the heading heuristic. Preserve existing size-based heading detection and numbered sub-heading depth logic.	`pkg/parser/pdf.go`
Track boldness per row based on glyph font names and derive a row-level bold flag.	Accumulate counts of non-whitespace glyphs and those using bold fonts while building row text from PDF glyphs. Introduce a bold field on pdfRow and set it when a strict majority of glyphs in the row are bold-weight. Add isBoldFont helper to recognize bold font faces via common font-name patterns such as 'bold', '-bd', and ',bd'.	`pkg/parser/pdf.go`
Collapse wide letter-spacing patterns into normal words while preserving word boundaries for non-letterspaced text.	Introduce looksLetterSpaced to detect rows dominated by single-character tokens indicative of letter-tracked text. Implement collapseLetterSpacing to rejoin single-character tokens into words, preserving word boundaries via runs of multiple spaces and leaving non-letterspaced rows unchanged. Apply collapseLetterSpacing to each extracted row’s text before trimming and filtering empty rows. Add a multiSpaceRe regexp helper to split on runs of 2+ spaces when collapsing letter-spacing.	`pkg/parser/pdf.go`
Broaden the heading text heuristic to accept longer headings typical of SEC filings.	Increase the maximum allowed word count in looksLikeHeading from 14 to 25. Update the accompanying documentation comment to describe the new limit and justify it with filing-style headings.	`pkg/parser/pdf.go`
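
The row-level bold flag described in the table can be sketched roughly as follows. This is an illustration only: the glyph type and the rowIsBold helper are hypothetical, and the isBoldFont body is a guess built from the font-name patterns the table lists ('bold', '-bd', ',bd'), not the merged implementation.

```go
package main

import "strings"

// glyph is a hypothetical stand-in for one extracted PDF glyph.
type glyph struct {
	text string // the glyph's text
	font string // font name as reported by the PDF
}

// isBoldFont reports whether a font name looks like a bold face, using the
// patterns named in the review table. Matching is case-insensitive here,
// which is an assumption of this sketch.
func isBoldFont(name string) bool {
	n := strings.ToLower(name)
	return strings.Contains(n, "bold") ||
		strings.Contains(n, "-bd") ||
		strings.Contains(n, ",bd")
}

// rowIsBold (hypothetical helper) returns true when a strict majority of the
// non-whitespace glyphs in the row use a bold font.
func rowIsBold(glyphs []glyph) bool {
	total, bold := 0, 0
	for _, g := range glyphs {
		if strings.TrimSpace(g.text) == "" {
			continue // whitespace glyphs do not vote
		}
		total++
		if isBoldFont(g.font) {
			bold++
		}
	}
	return total > 0 && bold*2 > total
}
```

A strict majority means a row that is exactly half bold (for example a bold label followed by an equally long regular value) stays non-bold in this sketch.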


coderabbitai · 2026-05-26T01:52:02Z

Walkthrough

Adds bold glyph tracking and letter-spaced normalization to PDF parsing, promotes bold rows to headings and splits oversized leaf sections into word-boundary chunks; introduces a selection LLM retry helper that aggregates usage and degrades to empty selection on persistent parse failures; updates ingest summarization prompts.

Changes

Bold Typography Heading Detection & Chunking

Layer / File(s)	Summary
Row extraction: bold tracking and collapse letter-spacing `pkg/parser/pdf.go`	Initialize per-row bold counters; detect bold fonts while assembling rows; collapse letter-spaced glyph runs into normal words; set `pdfRow.bold` from bold/total glyph ratios.
Heading detection and bold-level computation `pkg/parser/pdf.go`	Compute a document-wide `boldLevel` from font-derived heading buckets; promote bold, median-or-larger rows that pass heading heuristics to headings using `boldLevel`; increase allowed heading word count.
Oversized leaf chunking and title derivation `pkg/parser/pdf.go`	Post-process parsed sections with `chunkOversizedLeaves`; split oversized leaf content near word boundaries using thresholds and derive child titles from colon-terminated phrases or leading words.
Chunking unit tests `pkg/parser/chunk_test.go`	Adds tests for oversized-leaf splitting, colon-header-derived titles, fallback titles, small-section pass-through, and recursive chunking of nested internals.

Selection LLM Retry and Integration

Layer / File(s)	Summary
Selection retry helper and SinglePass integration `pkg/retrieval/single_pass.go`	Adds `runSelectionWithRetry`, imports `log`, refactors `SinglePass.SelectWithCost` to use the helper, aggregates usage across attempts, sets `ModelUsed` from budget, and returns empty selection after parse exhaustion while logging.
Caller update and test for non-JSON responses `pkg/retrieval/chunked_tree.go`, `pkg/retrieval/retrieval_test.go`	`reasonOverSliceWithCost` builds a `llmgate.Request` and delegates selection/usage parsing to the retry helper; adds `TestSinglePassGracefulOnNonJSON` to assert graceful degradation and cumulative LLM call accounting when the model returns prose.

Ingest summarization prompts

Layer / File(s)	Summary
Summarization request and system prompt changes `pkg/ingest/ingest.go`	Increase `MaxTokens` for summarization calls and rewrite user/system prompts to require a single retrieval-focused sentence (≤60 words), with profile-specific system instructions for research/medical/default.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I nibble glyphs and chase the bold,

I stitch spaced letters into words of gold.
Headings rise where font-weights show,
big leaves split so searches grow.
I hop, I test, the parser sings — hooray!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 58.82% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'parser: detect bold-as-heading + collapse letter-spacing (fixes filing parse)' directly corresponds to the main changes in pkg/parser/pdf.go: adding bold-typography support for heading detection and collapsing letter-tracked text runs.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.


sourcery-ai

Hey - I've reviewed your changes and they look great!


Copilot

Pull request overview

This PR improves PDF parsing for SEC filings by enhancing heading detection (to handle bold, body-sized section headers) and by collapsing artificial letter-spacing in extracted rows to restore normal words.

Changes:

Add bold-row heading detection (based on glyph font names) and assign bold headings a level nested below size-derived headings (capped at level 6).
Collapse wide letter-tracking in extracted rows (e.g., "U N I T E D S T A T E S" → "UNITED STATES") while aiming to preserve word boundaries.
Relax looksLikeHeading word-count cap from 14 to 25 to allow verbose filing headings.
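
As a rough illustration of the first bullet, the sketch below shows one way the bold level and the per-row decision could fit together. Only the condition inside classifyRow follows the diff hunk in this review; the pdfRow fields, the half-point roundSize, the boldHeadingLevel and classifyRow helpers, and the behavior when no size-derived headings exist are assumptions.

```go
package main

import "math"

// pdfRow is a hypothetical, trimmed-down row type for this sketch.
type pdfRow struct {
	text     string
	fontSize float64
	bold     bool
}

// roundSize buckets font sizes to the nearest half point. The granularity is
// an assumption; the real helper may round differently.
func roundSize(size float64) float64 {
	return math.Round(size*2) / 2
}

// boldHeadingLevel returns one level deeper than the deepest size-derived
// heading level, clamped to 6. With no size-derived headings it returns 1.
func boldHeadingLevel(levelForSize map[float64]int) int {
	deepest := 0
	for _, lvl := range levelForSize {
		if lvl > deepest {
			deepest = lvl
		}
	}
	if deepest+1 > 6 {
		return 6
	}
	return deepest + 1
}

// classifyRow (hypothetical helper) applies the size lookup first, then
// promotes bold rows at or above the median size that pass the heading
// heuristic.
func classifyRow(row pdfRow, levelForSize map[float64]int, median float64,
	boldLevel int, looksLikeHeading func(string) bool) (int, bool) {
	lvl, isHeading := levelForSize[roundSize(row.fontSize)]
	if !isHeading && row.bold && row.fontSize >= median && looksLikeHeading(row.text) {
		isHeading = true
		lvl = boldLevel
	}
	if isHeading && !looksLikeHeading(row.text) {
		// Size said "heading" but the text heuristic disagrees: treat as body.
		return 0, false
	}
	return lvl, isHeading
}
```

With size-derived levels 1 and 2, a bold 10pt row in a document whose median size is 10pt lands at level 3 in this sketch.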


 		lvl, isHeading := levelForSize[roundSize(row.fontSize)]
+		if !isHeading && row.bold && row.fontSize >= median && looksLikeHeading(text) {
+			isHeading = true
+			lvl = boldLevel
+		}
 		if isHeading && looksLikeHeading(text) {


+// runs of 2+ spaces; within each word the single spaces between solitary glyphs
+// are removed ("F O R M   1 0 - Q" → "FORM 10-Q"). Rows that aren't
+// letter-spaced are returned unchanged, so normal prose is never touched.


+	for _, t := range toks {
+		if len([]rune(t)) == 1 {
+			single++
+		}
+	}


+		for _, p := range parts {
+			if len([]rune(p)) > 1 {
+				allSingle = false
+				break
+			}


coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/parser/pdf.go`:
- Around line 582-587: The PDF struct's doc comment still says headings are
"<= 14 words", but looksLikeHeading (which uses strings.Fields and checks
len(words) > 25) now accepts headings of up to 25 words; update the doc comment
(change "<= 14 words" to "<= 25 words", or reword it to match the 25-word cap)
so the documentation is consistent with the heading detection logic.


ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b1096aaf-80c3-4eb9-b3a1-92aff1027b15

📥 Commits

Reviewing files that changed from the base of the PR and between 15940d3 and fe76029.

📒 Files selected for processing (1)

pkg/parser/pdf.go

coderabbitai · 2026-05-26T01:56:18Z

+	// Headings are rarely > 25 words and never end with sentence punctuation
+	// from the middle of a paragraph. (Filing headings like "Item 2.
+	// Management's Discussion and Analysis of Financial Condition and Results
+	// of Operations" run long, so the cap is generous.)
 	words := strings.Fields(s)
-	if len(words) == 0 || len(words) > 14 {
+	if len(words) == 0 || len(words) > 25 {


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Documentation mismatch: struct comment still says 14 words.

Line 26 in the PDF struct doc comment states <= 14 words but the implementation now uses 25. Update the doc comment to stay in sync.

Suggested fix

-// 3. Treat any row whose font size exceeds a threshold (1.2× median)
-// AND that is short (<= 14 words) as a heading candidate.
+// 3. Treat any row whose font size exceeds a threshold (1.2× median)
+// AND that is short (<= 25 words) as a heading candidate.


…lures

The selection LLM call (chunked-tree slices and single-pass alike) sometimes returns plain text instead of the JSON the schema asks for. Most often this is Gemini briefly ignoring JSON mode. Today that surfaces as a 500 to the SDK on every blip, plus the wasted LLM cost, and the SDK's transport-level retry just repeats the same blow-up.

Wrap Complete + ParseSelection in a small retry loop (2 retries by default, 3 attempts total). On retry the last user message gets an extra "ONLY JSON, no prose, no fences" reminder, which Gemini usually honors on the second try. If all attempts still fail, log a warning and return an empty selection so the HTTP request succeeds with no sections instead of erroring out; one bad LLM response can no longer take down a multi-slice retrieval.

Test TestSinglePassGracefulOnNonJSON locks the behaviour: prose-only response → empty selection, nil error, 3 LLM attempts counted in usage.
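
The retry behaviour described in this commit message could look roughly like the Go sketch below. It is a simplified illustration, not pkg/retrieval/single_pass.go: the usage and completeFunc types, the JSON shape with a selected_ids field, the reminder wording, and the choice not to retry transport errors are all assumptions standing in for the real llmgate types and schema.

```go
package main

import (
	"encoding/json"
	"log"
)

// usage is a hypothetical stand-in for the real per-call usage record.
type usage struct {
	Calls        int
	InputTokens  int
	OutputTokens int
}

// completeFunc is a hypothetical stand-in for the LLM client: it sends a
// prompt and returns the raw response text plus usage.
type completeFunc func(prompt string) (string, usage, error)

const jsonReminder = "\n\nRespond with ONLY JSON, no prose, no fences."

// parseSelection decodes {"selected_ids": [...]}. The schema is assumed.
func parseSelection(raw string) ([]string, error) {
	var out struct {
		SelectedIDs []string `json:"selected_ids"`
	}
	if err := json.Unmarshal([]byte(raw), &out); err != nil {
		return nil, err
	}
	return out.SelectedIDs, nil
}

// runSelectionWithRetry makes up to maxRetries+1 attempts, appends a JSON
// reminder on retries, sums usage across attempts, and degrades to an empty
// selection with a nil error when every attempt fails to parse.
func runSelectionWithRetry(complete completeFunc, prompt string, maxRetries int) ([]string, usage, error) {
	var total usage
	for attempt := 0; attempt <= maxRetries; attempt++ {
		p := prompt
		if attempt > 0 {
			p += jsonReminder
		}
		raw, u, err := complete(p)
		if err != nil {
			// Assumption: transport errors are returned, not retried here.
			return nil, total, err
		}
		total.Calls += u.Calls
		total.InputTokens += u.InputTokens
		total.OutputTokens += u.OutputTokens
		ids, perr := parseSelection(raw)
		if perr == nil {
			return ids, total, nil
		}
		log.Printf("selection: attempt %d returned non-JSON: %v", attempt+1, perr)
	}
	log.Printf("selection: giving up after %d attempts, returning empty selection", maxRetries+1)
	return []string{}, total, nil
}
```

Summing usage inside the loop is what lets a caller see the cost of failed attempts, which matches the "3 LLM attempts counted in usage" expectation stated for the test.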

coderabbitai

🧹 Nitpick comments (1)

pkg/retrieval/chunked_tree.go (1)
112-115: 💤 Low value

Consider setting ModelUsed for consistency with SinglePass and Cached.

SinglePass.SelectWithCost and Cached.SelectWithCost both populate ModelUsed: budget.ModelName, but ChunkedTree leaves it empty. Downstream consumers (e.g., the query handler) may expect this field to be populated for usage reporting.
Suggested diff
 	return &Result{
 		SelectedIDs: c.Merge.Merge(allIDs),
+		ModelUsed:   budget.ModelName,
 		Usage:       totalUsage,
 	}, nil
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/retrieval/chunked_tree.go` around lines 112 - 115, The Result returned
from ChunkedTree.SelectWithCost doesn't set ModelUsed; update the return struct
in the ChunkedTree.SelectWithCost implementation to include ModelUsed:
budget.ModelName (matching SinglePass.SelectWithCost and Cached.SelectWithCost)
so downstream consumers get the model name for usage reporting.


ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 10de44d1-b73a-4980-aa20-e83989e1c5c1

📥 Commits

Reviewing files that changed from the base of the PR and between fe76029 and 1277324.

📒 Files selected for processing (3)

pkg/retrieval/chunked_tree.go
pkg/retrieval/retrieval_test.go
pkg/retrieval/single_pass.go

The current summary prompt asks for "a single factual sentence". That is fine for human reading, but the resulting summaries describe sections generically ("Cover page of 3M's 10-Q with company identification") instead of naming their concrete topics ("registered debt securities, trading symbols MMM26 / MMM30, NYSE listings, IRS employer ID"). The downstream retrieval LLM, given only those summaries, then can't tell which section answers a specific question. For example, q_00941 ("Which debt securities are registered to trade on a national exchange under 3M's name?") picks two "Long-Term Debt" sections instead of the cover-page section that actually contains the registration table.

Rewrite the summary prompt for retrieval: explicitly ask the model to name the section's concrete entities, identifiers, table contents, named items, and key numbers. One sentence, with the cap raised to ≤60 words (and MaxTokens 260) so dense sections aren't truncated mid-list. The domain framings (research / medical / default) are preserved and now include the same retrieval rule.

Existing ingest tests pass.

…evable

Filing cover pages (and any other long, mixed-topic leaf section) produce one 2-3k-char blob under a generic title like "3M COMPANY", mixing registration tables, addresses, IRS IDs, and contact info. A single summary can't cover all those topics, so retrieval picks unrelated "long-term debt" sections instead of the one that actually holds the answer.

Add chunkOversizedLeaves: any LEAF section whose Content exceeds 2400 chars is replaced by a parent (title preserved) with smaller children at the next level. Children are sized around 900 chars and split at word boundaries. The chunk title prefers a natural colon-terminated header within the first 80 chars ("Securities registered pursuant to Section 12(b) of the Act:") when available, which is exactly the pattern in filings; otherwise it uses the first ~60 chars trimmed at a word boundary, falling back to "<parent title> — part N". Internal nodes are recursed into but never split (they're already structured).

The threshold is deliberately high (2400) so most paper sub-sections aren't affected. Combined with the retrieval-friendly summary prompt (previous commit), each chunk gets a topic-rich summary downstream so the retrieval LLM can match it to specific questions.

Tests in chunk_test.go: an oversized leaf gets split with the parent title preserved and children at level+1; the first chunk takes the colon-header title; small sections are untouched; oversized leaves nested inside internal nodes are still split.
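
To make the chunking rule easier to follow, here is a hedged Go sketch of the idea. The Section fields and the fallback format string match what appears in the review diffs, and the 2400 / 900 / 80 / 60 figures come from the commit message; the splitAtWords helper, the constant names, and the exact conditions under which the fallback title is used are assumptions, so the merged code may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// Section mirrors the fields visible in the review diff; the real type may
// carry more.
type Section struct {
	Level    int
	Title    string
	Content  string
	Children []Section
}

const (
	maxLeafChars = 2400 // leaves longer than this are split
	targetChunk  = 900  // approximate size of each child
)

// splitAtWords (hypothetical helper) packs whole words into pieces of at
// most target bytes, so no word is cut in half.
func splitAtWords(content string, target int) []string {
	var pieces []string
	var cur strings.Builder
	for _, w := range strings.Fields(content) {
		if cur.Len() > 0 && cur.Len()+1+len(w) > target {
			pieces = append(pieces, cur.String())
			cur.Reset()
		}
		if cur.Len() > 0 {
			cur.WriteByte(' ')
		}
		cur.WriteString(w)
	}
	if cur.Len() > 0 {
		pieces = append(pieces, cur.String())
	}
	return pieces
}

// deriveChunkTitle prefers a colon-terminated header in the first 80
// characters, then the first ~60 characters cut at a word boundary, then
// the fallback.
func deriveChunkTitle(piece, fallback string) string {
	r := []rune(strings.TrimSpace(piece))
	if len(r) == 0 {
		return fallback
	}
	head := r
	if len(head) > 80 {
		head = head[:80]
	}
	if i := strings.IndexRune(string(head), ':'); i > 0 {
		return string(head)[:i+1]
	}
	if len(r) <= 60 {
		return string(r)
	}
	cut := string(r[:60])
	if i := strings.LastIndex(cut, " "); i > 0 {
		return cut[:i]
	}
	return fallback
}

// chunkOversizedLeaves replaces each oversized leaf with a parent that keeps
// the title and holds smaller children one level down. Internal nodes are
// recursed into but never split themselves.
func chunkOversizedLeaves(secs []Section) []Section {
	out := make([]Section, 0, len(secs))
	for _, s := range secs {
		if len(s.Children) > 0 {
			s.Children = chunkOversizedLeaves(s.Children)
			out = append(out, s)
			continue
		}
		if len(s.Content) <= maxLeafChars {
			out = append(out, s)
			continue
		}
		parent := Section{Level: s.Level, Title: s.Title}
		for i, piece := range splitAtWords(s.Content, targetChunk) {
			fallback := fmt.Sprintf("%s — part %d", s.Title, i+1)
			parent.Children = append(parent.Children, Section{
				Level:   s.Level + 1,
				Title:   deriveChunkTitle(piece, fallback),
				Content: piece,
			})
		}
		out = append(out, parent)
	}
	return out
}
```

Note that this sketch does not clamp the child level, so a split leaf at level 6 would produce level 7 children.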

coderabbitai

🧹 Nitpick comments (2)

pkg/parser/pdf.go (2)

583-586: 💤 Low value

Consider applying chunking to the outline-based path for consistency.

The outline-based parsing path returns sections without chunkOversizedLeaves, while the heuristic path (line 233) applies it. Documents with outlines could still have oversized leaf sections (e.g., a long "Appendix" with no sub-bookmarks). Applying the same post-processing would ensure uniform behavior.

Suggested fix

 	return &ParsedDoc{
 		Title:    title,
-		Sections: rootSec.Children,
+		Sections: chunkOversizedLeaves(rootSec.Children),
 	}, true

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/parser/pdf.go` around lines 583 - 586, The outline-based return in the
PDF parser currently returns Sections directly (rootSec.Children) without
applying chunkOversizedLeaves, causing inconsistent handling versus the
heuristic path; modify the outline-based path that returns &ParsedDoc{Title:
title, Sections: rootSec.Children} to first call
chunkOversizedLeaves(rootSec.Children) (the same helper used on the heuristic
path) and use the processed children when constructing ParsedDoc so oversized
leaf sections are chunked consistently.

270-278: 💤 Low value

Child level may exceed the document-wide cap of 6.

When s.Level is already 6 (the max used elsewhere in heading detection), the children receive Level: 7. Other parts of the parser clamp levels to 6 (e.g., buildHeadingLevelMap, numberedHeadingDepth). For consistency:

Suggested fix

 		for i, piece := range pieces {
 			fallback := fmt.Sprintf("%s — part %d", s.Title, i+1)
+			childLevel := s.Level + 1
+			if childLevel > 6 {
+				childLevel = 6
+			}
 			parent.Children = append(parent.Children, Section{
-				Level:   s.Level + 1,
+				Level:   childLevel,
 				Title:   deriveChunkTitle(piece, fallback),
 				Content: piece,
 			})
 		}

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/parser/pdf.go` around lines 270 - 278, Child sections created from pieces
can exceed the document cap of 6 (when s.Level == 6), so clamp the child Level
to a maximum of 6 when appending to parent.Children. In the loop that builds
children (referencing Section, s.Level, deriveChunkTitle, pieces and
parent.Children), compute childLevel := min(s.Level+1, 6) (or otherwise clamp to
6) and set Level: childLevel for each appended Section so child headings never
go above level 6.


ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 78884a71-4a36-4414-90d9-621bf32250fb

📥 Commits

Reviewing files that changed from the base of the PR and between 1277324 and 5052ecb.

📒 Files selected for processing (3)

pkg/ingest/ingest.go
pkg/parser/chunk_test.go
pkg/parser/pdf.go

Copilot AI review requested due to automatic review settings May 26, 2026 01:51

Copilot started reviewing on behalf of hallelx2 May 26, 2026 01:51

sourcery-ai Bot reviewed May 26, 2026

Copilot AI reviewed May 26, 2026

coderabbitai Bot reviewed May 26, 2026

hallelx2 added 2 commits May 26, 2026 11:25

coderabbitai Bot reviewed May 26, 2026

hallelx2 mentioned this pull request May 26, 2026

Page citations + HyDE candidate questions #14

Merged

6 tasks

hallelx2 merged commit 24672bd into main May 26, 2026
6 of 8 checks passed

hallelx2 deleted the fix/parser-bold-headings branch May 26, 2026 23:45

hallelx2 merged 4 commits into main from fix/parser-bold-headings
