Skip to content

Phase 1.3.C: line-strategy table finding (v0.2.0)#3

Merged
hallelx2 merged 4 commits into
mainfrom
feat/table-finder-lines
May 27, 2026
Merged

Phase 1.3.C: line-strategy table finding (v0.2.0)#3
hallelx2 merged 4 commits into
mainfrom
feat/table-finder-lines

Conversation

@hallelx2
Copy link
Copy Markdown
Owner

@hallelx2 hallelx2 commented May 27, 2026

Summary

Ports pdfplumber's TableFinder (edges → intersections → cells →
tables pipeline) into Go and wires it into the Page interface via
two new methods, completing Phase 1.3.C of the pdftable port.

What's in

  • Page.FindTables(settings) ([]TableFinder, error) — geometry-only
    stage. Returns one TableFinder per detected table group,
    exposing the merged edges, intersections, raw cells, and
    assembled per-table CellsGrid for debugging / custom rendering.
  • Page.ExtractTables(settings) ([]*Table, error) — wraps
    FindTables, runs per-cell text extraction (chars whose centre
    lies in the cell bbox, joined via the existing extract_text
    path, whitespace-trimmed), returns *Table structs with rows /
    per-cell bbox grid / page number / union bbox.
  • TableSettings value type with DefaultTableSettings()
    carrying pdfplumber-matching defaults
    (snap_tolerance=3, join_tolerance=3, edge_min_length=3,
    edge_min_length_prefilter=1, intersection_tolerance=3,
    text_tolerance=3).
  • TableStrategy enum: StrategyLines (default — Lines + Rects +
    axis-aligned Curve segments), StrategyLinesStrict (only drawn
    Line segments).
  • Internal internal/layout package: Edge type with
    FromLine/FromRect/FromCurve plus the snap → join →
    filter pipeline. Built by the previous WIP commit; this PR adds
    the consumer.
  • Algorithm-level unit tests on hand-crafted edges (table_test.go)
    • a public-API integration test against the new
      testdata/table-2x3-ruled.pdf fixture.
  • pdfplumber parity golden test loader: any
    testdata/golden/*.tables.expected.json fixture is loaded and
    compared cell-for-cell after whitespace normalisation. First
    case: issue-466-example.pdf (copied from pdfplumber's own
    fixtures) — 4×3 + 2×3 ruled tables; pdftable matches pdfplumber
    cell-for-cell.
  • Runnable example at examples/extract_tables/main.go so the
    README snippet doesn't bit-rot.

What's out (deferred to v0.3.0)

  • StrategyText (infer edges from word alignment).
  • StrategyExplicit (caller-supplied edges only).

Both return ErrUnsupported with a clear "Phase 1.3.D" message so
callers don't silently get empty results. TableSettings.Explicit*Lines
fields ARE honoured under the lines strategies (added on top of
derived edges).

Test plan

  • go build ./... clean.
  • go vet ./... clean.
  • go test ./... all green.
  • Algorithm unit tests on hand-crafted edge lists exercise
    edgesToIntersections, intersectionsToCells,
    cellsToTables, assembleTableBox, runTableFinder.
  • Public-API integration test extracts the hand-crafted 2×3
    ruled fixture and asserts row count + cell text exactly:
    [[Name, Age], [Alice, 30], [Bob, 25]].
  • pdfplumber parity: issue-466-example.pdf matches
    cell-for-cell after whitespace normalisation.
  • README's examples/extract_tables/main.go runs against the
    parity fixture and prints the expected rows.

Known limitations (not regressions)

  • Cells whose glyphs use standard-14 fonts WITHOUT the bundled AFM
    tables can have intra-word gaps reported as "no gap"; this is the
    same word-bbox drift documented for v0.1.0 and bites
    la-precinct-bulletin-2014-p1.pdf cell text on parity. AFM bundle
    is on the v0.2.x roadmap.
  • senate-expenditures.pdf produces 7 cells where pdfplumber finds
    10; under investigation as a follow-up issue (snap+join unifies
    edges that share a near-collinear endpoint slightly differently
    from pdfplumber on this fixture).

Both fixtures are NOT in the golden set (CI would be noisy);
they're recorded in the CHANGELOG's "Known limitations" section.

Summary by Sourcery

Add line-based table detection and extraction to the Page API and supporting internals, plus pdfplumber parity tests and documentation for the new v0.2.0 release.

New Features:

  • Expose FindTables and ExtractTables on Page for ruled table detection and text extraction using configurable TableSettings and TableStrategy.
  • Introduce public table types (Table, TableFinder, TableBox, Intersection) and internal layout edge pipeline for geometry-only table finding.
  • Provide a runnable extract_tables example and README guidance mirroring pdfplumber usage.

Enhancements:

  • Add algorithm-level unit tests and golden parity tests against pdfplumber for tables, including a generated ruled-table PDF fixture and golden table JSONs.
  • Extend golden test harness and generation script to cover table outputs alongside existing text/word fixtures.
  • Update changelog and README to document the v0.2.0 table-finding release, roadmap, and package layout changes.

hallelx2 added 4 commits May 27, 2026 01:17
Lands the public types for line/lines_strict table finding plus
the internal edge-detection helpers, ahead of the
intersection-construction + page-integration work.

Types in table.go:
  - TableStrategy enum (lines, lines_strict, text, explicit).
  - TableSettings with pdfplumber-matching defaults
    (SnapTolerance, JoinTolerance, EdgeMinLength,
    IntersectionTolerance, TextTolerance, MinWordsVertical,
    MinWordsHorizontal, KeepBlankChars, ExplicitVerticalLines,
    ExplicitHorizontalLines).
  - Table struct (Rows [][]string, BBox, Page, CellsBBox).
  - DefaultTableSettings() / applyDefaults() helpers.

internal/layout/lines.go:
  - Edge value type + Orientation/Source enums.
  - FromLine / FromRect / FromCurve constructors that lift
    drawn primitives into orientation-tagged Edges.
  - SnapEdges + JoinEdges + MergeEdges — port of
    pdfplumber's edge-snapping/joining pipeline.
  - FilterEdges{ByLength,BySource,ByOrientation},
    SortEdges utilities.

NOT YET INCLUDED (next commits on this branch):
  - finder.go with getEdges/edgesToIntersections/
    intersectionsToCells/cellsToTables.
  - Page.FindTables / Page.ExtractTables.
  - Golden-file fixtures + tests.
Port pdfplumber's TableFinder pipeline (edges -> intersections ->
cells -> tables) into Go. Wire it into the Page interface via two
new methods: FindTables returns the geometry-only TableFinder per
detected table group; ExtractTables runs per-cell text extraction
and returns assembled Table structs with row/col text + bbox grid.

Only the 'lines' and 'lines_strict' strategies are implemented in
this commit. 'text' and 'explicit' return ErrUnsupported with a
clear Phase 1.3.D message so callers don't silently get empty
results.

Includes:
- finder.go: Intersection, TableBox, TableFinder types; pure
  algorithm functions (edgesToIntersections, intersectionsToCells,
  cellsToTables, assembleTableBox, runTableFinder).
- page.go: findTableEdges (derives + filters + merges edges from
  Lines/Rects/Curves), FindTables, ExtractTables, charsInCell
  predicate matching pdfplumber's char_in_bbox.
- testdata/fixtures.go: TableRuled() helper producing a minimal
  2-col x 3-row ruled fixture with known text content.
- scripts/gen_table_fixture.go: build-tag-gated generator script.
- table_test.go: algorithm unit tests on hand-crafted edge lists
  plus a public-API integration test against the ruled fixture.
- .gitattributes: mark *.pdf as binary so test fixtures don't get
  line-ending normalised on checkout.
Add a pdfplumber-parity golden test that loads any
testdata/golden/*.tables.expected.json fixture and compares
ExtractTables output cell-for-cell against the reference. Cells are
compared after whitespace normalisation (collapse runs of whitespace,
trim) so pdftable's dense extract_text output (space-joined within a
line) matches pdfplumber's intra-cell line-break output ("A\nB") on
token sequence.

Add issue-466-example.pdf from pdfplumber's own test fixtures as the
first parity case: 4r x 3c + 2r x 3c ruled tables with predictable
"T0-C0" / "T0-00" style cell labels. pdftable and pdfplumber agree
on cell count, row/column shape, and cell text for this fixture.

Extend scripts/gen_golden.py to emit the new *.tables.expected.json
alongside the existing *.expected.json (words + text golden), so a
single run of gen_golden.py keeps both v0.1.x and v0.2.0 fixtures in
sync.

Update TestGoldenAgainstPdfplumber to skip *.tables.expected.json
files (different schema, exercised by the new test instead).
Document v0.2.0 in CHANGELOG with the full added surface
(TableSettings, TableStrategy, Page.FindTables / ExtractTables,
TableFinder, TableBox, Table, Intersection, internal/layout
package), what's deferred to v0.3.0 (text + explicit strategies),
and the known limitations (cell text on standard-14 fonts without
the AFM bundle).

Update README with:
- Status bumped to v0.2.0.
- Page interface excerpt with the two new methods.
- A "Tables (lines strategy)" section showing how to call
  ExtractTables with default settings + switch to lines_strict.
- A side-by-side pdfplumber -> pdftable example for find_tables.
- Architecture diagram updated with table.go, finder.go, and the
  internal/layout package.
- Roadmap updated to push text + explicit strategies to v0.3.x.

Add examples/extract_tables/main.go: the runnable form of the
README's tables example. It's a compile target in `go build ./...`
so a future change to the public API breaks the snippet at build
time rather than letting it drift.
Copilot AI review requested due to automatic review settings May 27, 2026 00:37
@sourcery-ai
Copy link
Copy Markdown

sourcery-ai Bot commented May 27, 2026

Reviewer's Guide

Implements v0.2.0 table-finding via a Go port of pdfplumber’s TableFinder, wiring a geometry-only pipeline (edges → intersections → cells → tables) and text extraction into the Page API, backed by a new internal layout/edge-merging layer, tests, fixtures, golden parity, examples, and docs/changelog updates.

Sequence diagram for Page.ExtractTables pipeline

sequenceDiagram
    actor Caller
    participant Page
    participant Layout as layout
    participant Finder as runTableFinder

    Caller->>Page: ExtractTables(settings)
    Page->>Page: settings.applyDefaults()
    Page->>Page: ensureSupportedStrategies(settings)
    Page->>Page: FindTables(settings)
    Page->>Page: findTableEdges(settings)
    Page->>Page: Objects()
    Page->>Layout: FromLine/FromRect/FromCurve
    Page->>Layout: FilterEdgesByOrientation
    Page->>Layout: FilterEdgesBySource
    Page->>Layout: FilterEdgesByLength
    Page->>Layout: MergeEdges
    Page-->>Page: []layout.Edge
    Page->>Finder: runTableFinder(edges, IntersectionTolerance, IntersectionTolerance)
    Finder->>Finder: edgesToIntersections
    Finder->>Finder: intersectionsToCells
    Finder->>Finder: cellsToTables
    Finder->>Finder: assembleTableBox
    Finder-->>Page: []TableFinder
    Page->>Page: Chars()
    Page->>Page: assembleTableText(TableBox, chars, settings, pageNumber)
    Page-->>Caller: []*Table
Loading

File-Level Changes

Change Details Files
Extend Page API with geometry-only and text-extracting table detection operations and integrate them with existing primitives/char extraction.
  • Add FindTables and ExtractTables methods to the Page interface and implement them on page, including strategy validation via ensureSupportedStrategies.
  • Implement findTableEdges to derive layout.Edge values from page Lines/Rects/Curves, apply per-orientation source filtering for lines_strict, add explicit lines from settings, prefilter by length, snap+join via layout.MergeEdges, and post-filter by length.
  • Implement ExtractTables to call FindTables, fetch page chars, and fill Tables by assigning chars to cells via centre-in-bbox check and reuse of existing text extraction options with per-cell whitespace trimming.
  • Implement charsInCell and assembleTableText helpers to encapsulate char-in-cell hit-testing and per-cell text assembly using the existing word/text pipeline and TableSettings’ TextTolerance/KeepBlankChars.
page.go
finder.go
table.go
Introduce an internal layout package that normalizes, filters, and merges axis-aligned edges derived from Lines/Rects/Curves for use by the table-finding pipeline.
  • Define Edge type with orientation, source, stroke width, and helpers for length/position plus normalization invariants for horizontal/vertical edges.
  • Add FromLine/FromRect/FromCurve constructors over minimal segment structs with near-axis-aligned tolerance handling to turn primitives into edges.
  • Implement SnapEdges and JoinEdges (with MergeEdges wrapper) to cluster edges by perpendicular position within snap tolerance and then join collinear segments whose along-axis gaps are within join tolerance.
  • Provide FilterEdgesByLength, FilterEdgesBySource, FilterEdgesByOrientation, and SortEdges utilities to support both page-level edge derivation and downstream intersection logic.
internal/layout/lines.go
Port pdfplumber’s TableFinder algorithms (edges → intersections → cells → tables) and strategy validation into standalone, testable functions.
  • Define Intersection, TableBox, and TableFinder structs encapsulating edges, intersections, raw cell bboxes, and assembled table grids with helpers to flatten cells.
  • Implement edgesToIntersections that splits edges by orientation, sorts them deterministically, and finds crossing points within x/y tolerances while deduplicating and accumulating involved edges per intersection.
  • Implement intersectionsToCells and edgeConnects to locate minimal rectangular cells by scanning “below” and “right” intersections and ensuring corners share actual edges, and implement cellsToTables to group touching cells into tables and discard singleton cells.
  • Implement assembleTableBox to project flat cell lists into row/column grids (top-to-bottom, left-to-right) and runTableFinder as the composition of intersections, cell finding, table grouping, and TableBox assembly.
  • Add ensureSupportedStrategies to reject text/explicit strategies with ErrUnsupported and clear phase messaging, while accepting lines and lines_strict.
finder.go
Define the public table-related types/settings and hook them into the overall API surface and documentation.
  • Add TableStrategy enum (lines, lines_strict, text, explicit) and TableSettings struct with pdfplumber-matching defaults plus explicit-line and text-related fields, with DefaultTableSettings and applyDefaults to hydrate zero values.
  • Add Table struct (rows of strings, union bbox, page number, per-cell bbox grid) with helper Cells to flatten bboxes, and document coordinate conventions and behavioural diffs vs pdfplumber.
  • Update README to describe v0.2.0 status, document FindTables/ExtractTables usage and strategies, add a “Tables (lines strategy)” section with Go/pdfplumber side-by-side examples, update install version, and extend the layout overview with new files.
  • Update CHANGELOG with a detailed 0.2.0 entry covering new APIs, internal components, deferred strategies, and known limitations, and extend the roadmap version bullets accordingly.
table.go
README.md
CHANGELOG.md
Add golden-table parity tests, unit tests, fixtures, generator scripts, and example program to validate and demonstrate the table-finding pipeline.
  • Extend golden_test to skip .tables.expected.json in the existing text/word parity test, define goldenTables schema and loader, and add TestGoldenTablesAgainstPdfplumber that runs ExtractTables on fixtures and compares table/row/col counts and cell text after whitespace normalisation using normaliseCellText.
  • Add table_test with algorithm-level unit tests for edgesToIntersections, intersectionsToCells, cellsToTables, assembleTableBox, runTableFinder; strategy validation tests for ensureSupportedStrategies/applyDefaults; and public integration tests using the synthetic TableRuled fixture and unsupported-strategy cases.
  • Add TableRuled() PDF fixture generator in testdata/fixtures.go to produce a simple 2×3 ruled table; add scripts/gen_table_fixture.go (build-tagged) to write table-2x3-ruled.pdf to disk; and adjust testdata fixtures list accordingly.
  • Extend scripts/gen_golden.py to additionally emit .tables.expected.json capturing pdfplumber’s find_tables(lines) output per page for parity tests and to update its top-of-file docs; check in issue-466-example.tables.expected.json as the first golden tables fixture.
  • Add examples/extract_tables/main.go as the runnable form of the README tables example and include a .gitattributes stub plus the new golden fixture file path in the repo layout.
golden_test.go
table_test.go
testdata/fixtures.go
scripts/gen_table_fixture.go
scripts/gen_golden.py
examples/extract_tables/main.go
testdata/golden/issue-466-example.tables.expected.json
.gitattributes

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 27, 2026

Warning

Review limit reached

@hallelx2, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 16 minutes and 31 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3bd98be7-6d1a-49c5-ad56-f4ae2ab488c2

📥 Commits

Reviewing files that changed from the base of the PR and between 645955d and 14d2d8c.

⛔ Files ignored due to path filters (2)
  • testdata/golden/issue-466-example.pdf is excluded by !**/*.pdf
  • testdata/table-2x3-ruled.pdf is excluded by !**/*.pdf
📒 Files selected for processing (14)
  • .gitattributes
  • CHANGELOG.md
  • README.md
  • examples/extract_tables/main.go
  • finder.go
  • golden_test.go
  • internal/layout/lines.go
  • page.go
  • scripts/gen_golden.py
  • scripts/gen_table_fixture.go
  • table.go
  • table_test.go
  • testdata/fixtures.go
  • testdata/golden/issue-466-example.tables.expected.json
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/table-finder-lines

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@sourcery-ai sourcery-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey - I've found 3 issues, and left some high level feedback:

  • In assembleTableText, the locally constructed wordOpts is never used; either wire it through to the text extraction path if it’s supposed to control word grouping, or remove it to avoid dead code and confusion.
  • The hard-coded axis-alignment tolerance (tol := 0.1) in findTableEdges could be made configurable (e.g., via TableSettings) so callers dealing with very small or very large coordinate scales can tune what counts as ‘near-horizontal/vertical’.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- In assembleTableText, the locally constructed wordOpts is never used; either wire it through to the text extraction path if it’s supposed to control word grouping, or remove it to avoid dead code and confusion.
- The hard-coded axis-alignment tolerance (tol := 0.1) in findTableEdges could be made configurable (e.g., via TableSettings) so callers dealing with very small or very large coordinate scales can tune what counts as ‘near-horizontal/vertical’.

## Individual Comments

### Comment 1
<location path="page.go" line_range="589-592" />
<code_context>
+	textOpts.XTolerance = s.TextTolerance
+	textOpts.YTolerance = s.TextTolerance
+
+	wordOpts := DefaultWordOpts()
+	wordOpts.XTolerance = s.TextTolerance
+	wordOpts.YTolerance = s.TextTolerance
+	wordOpts.KeepBlankChars = s.KeepBlankChars
+
+	for ri, row := range tb.CellsGrid {
</code_context>
<issue_to_address>
**issue (bug_risk):** KeepBlankChars setting and wordOpts are currently unused in cell text extraction

`assembleTableText` builds `wordOpts` and sets `KeepBlankChars` from `TableSettings`, but `wordOpts` is never used (only `extractTextFromChars` with `textOpts` is called). This means `KeepBlankChars` does nothing for `ExtractTables`, breaking the documented behaviour and pdfplumber parity. Please either thread `wordOpts` / `KeepBlankChars` into the extraction path or remove them from `TableSettings` to avoid a misleading API.
</issue_to_address>

### Comment 2
<location path="table_test.go" line_range="92-95" />
<code_context>
+	}
+}
+
+// TestEdgesToIntersections_NoCrossing asserts that two parallel
+// edges, even when their Y / X spans overlap, produce no
+// intersections.
+func TestEdgesToIntersections_NoCrossing(t *testing.T) {
+	edges := []layout.Edge{
+		makeH(0, 100, 50),
</code_context>
<issue_to_address>
**issue (testing):** Test name and comment are misleading relative to what is actually asserted

The setup here (two horizontals, two verticals, expecting 4 intersections) doesn’t match the “no crossing” / parallel-edge description. Please either rename the test and adjust the comment to describe the actual behaviour, or change the fixture to truly parallel edges with zero intersections so the test intent is clear.
</issue_to_address>

### Comment 3
<location path="page.go" line_range="495" />
<code_context>
+	return merged, nil
+}
+
+// FindTables runs the geometry-only pipeline (edges → intersections
+// → cells → tables) and returns one TableFinder per detected table
+// group. Strategies "text" and "explicit" return ErrUnsupported.
</code_context>
<issue_to_address>
**issue (complexity):** Consider delegating FindTables to the shared runTableFinder pipeline and then reshaping its result to keep the per-table TableFinder API while avoiding duplicated geometry logic in page.go.

You can drop the duplicated geometry pipeline in `FindTables` and delegate to `runTableFinder`, keeping the per-table `TableFinder` behaviour by post-processing the result.

That keeps all the functionality you added (one `TableFinder` per table, with edges/intersections/cells exposed) but centralises the algorithm in `finder.go` so there’s a single source of truth.

For example, refactor `FindTables` like this:

```go
func (p *page) FindTables(settings TableSettings) ([]TableFinder, error) {
	s := settings.applyDefaults()
	if err := ensureSupportedStrategies(s); err != nil {
		return nil, err
	}

	edges, err := p.findTableEdges(s)
	if err != nil {
		return nil, err
	}
	if len(edges) == 0 {
		return nil, nil
	}

	// Use the centralised pipeline
	tf := runTableFinder(edges, s.IntersectionTolerance, s.IntersectionTolerance)
	if len(tf.Tables) == 0 {
		return nil, nil
	}

	// Preserve the “one TableFinder per table” API
	out := make([]TableFinder, 0, len(tf.Tables))
	for _, tb := range tf.Tables {
		out = append(out, TableFinder{
			Edges:         tf.Edges,
			Intersections: tf.Intersections,
			Cells:         tf.Cells,
			Tables:        []TableBox{tb},
		})
	}
	return out, nil
}
```

If `runTableFinder` currently computes intersections/cells/tables internally from `edges`, you can also remove the now-redundant calls to `edgesToIntersections`, `intersectionsToCells`, and `cellsToTables` from `page.go`. That keeps `page.go` focused on page-level orchestration (building `edges`) and leaves all geometry-only logic in `finder.go`.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

Comment thread page.go
Comment on lines +589 to +592
wordOpts := DefaultWordOpts()
wordOpts.XTolerance = s.TextTolerance
wordOpts.YTolerance = s.TextTolerance
wordOpts.KeepBlankChars = s.KeepBlankChars
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (bug_risk): KeepBlankChars setting and wordOpts are currently unused in cell text extraction

assembleTableText builds wordOpts and sets KeepBlankChars from TableSettings, but wordOpts is never used (only extractTextFromChars with textOpts is called). This means KeepBlankChars does nothing for ExtractTables, breaking the documented behaviour and pdfplumber parity. Please either thread wordOpts / KeepBlankChars into the extraction path or remove them from TableSettings to avoid a misleading API.

Comment thread table_test.go
Comment on lines +92 to +95
// TestEdgesToIntersections_NoCrossing asserts that two parallel
// edges, even when their Y / X spans overlap, produce no
// intersections.
func TestEdgesToIntersections_NoCrossing(t *testing.T) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (testing): Test name and comment are misleading relative to what is actually asserted

The setup here (two horizontals, two verticals, expecting 4 intersections) doesn’t match the “no crossing” / parallel-edge description. Please either rename the test and adjust the comment to describe the actual behaviour, or change the fixture to truly parallel edges with zero intersections so the test intent is clear.

Comment thread page.go
return merged, nil
}

// FindTables runs the geometry-only pipeline (edges → intersections
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (complexity): Consider delegating FindTables to the shared runTableFinder pipeline and then reshaping its result to keep the per-table TableFinder API while avoiding duplicated geometry logic in page.go.

You can drop the duplicated geometry pipeline in FindTables and delegate to runTableFinder, keeping the per-table TableFinder behaviour by post-processing the result.

That keeps all the functionality you added (one TableFinder per table, with edges/intersections/cells exposed) but centralises the algorithm in finder.go so there’s a single source of truth.

For example, refactor FindTables like this:

func (p *page) FindTables(settings TableSettings) ([]TableFinder, error) {
	s := settings.applyDefaults()
	if err := ensureSupportedStrategies(s); err != nil {
		return nil, err
	}

	edges, err := p.findTableEdges(s)
	if err != nil {
		return nil, err
	}
	if len(edges) == 0 {
		return nil, nil
	}

	// Use the centralised pipeline
	tf := runTableFinder(edges, s.IntersectionTolerance, s.IntersectionTolerance)
	if len(tf.Tables) == 0 {
		return nil, nil
	}

	// Preserve the “one TableFinder per table” API
	out := make([]TableFinder, 0, len(tf.Tables))
	for _, tb := range tf.Tables {
		out = append(out, TableFinder{
			Edges:         tf.Edges,
			Intersections: tf.Intersections,
			Cells:         tf.Cells,
			Tables:        []TableBox{tb},
		})
	}
	return out, nil
}

If runTableFinder currently computes intersections/cells/tables internally from edges, you can also remove the now-redundant calls to edgesToIntersections, intersectionsToCells, and cellsToTables from page.go. That keeps page.go focused on page-level orchestration (building edges) and leaves all geometry-only logic in finder.go.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@hallelx2 hallelx2 merged commit 599c309 into main May 27, 2026
5 of 6 checks passed
@hallelx2 hallelx2 deleted the feat/table-finder-lines branch May 27, 2026 00:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants