Phase 1.3.C: line-strategy table finding (v0.2.0)#3
Conversation
Lands the public types for line/lines_strict table finding plus
the internal edge-detection helpers, ahead of the
intersection-construction + page-integration work.
Types in table.go:
- TableStrategy enum (lines, lines_strict, text, explicit).
- TableSettings with pdfplumber-matching defaults
(SnapTolerance, JoinTolerance, EdgeMinLength,
IntersectionTolerance, TextTolerance, MinWordsVertical,
MinWordsHorizontal, KeepBlankChars, ExplicitVerticalLines,
ExplicitHorizontalLines).
- Table struct (Rows [][]string, BBox, Page, CellsBBox).
- DefaultTableSettings() / applyDefaults() helpers.
internal/layout/lines.go:
- Edge value type + Orientation/Source enums.
- FromLine / FromRect / FromCurve constructors that lift
drawn primitives into orientation-tagged Edges.
- SnapEdges + JoinEdges + MergeEdges — port of
pdfplumber's edge-snapping/joining pipeline.
- FilterEdges{ByLength,BySource,ByOrientation},
SortEdges utilities.
NOT YET INCLUDED (next commits on this branch):
- finder.go with getEdges/edgesToIntersections/
intersectionsToCells/cellsToTables.
- Page.FindTables / Page.ExtractTables.
- Golden-file fixtures + tests.
Port pdfplumber's TableFinder pipeline (edges -> intersections -> cells -> tables) into Go. Wire it into the Page interface via two new methods: FindTables returns the geometry-only TableFinder per detected table group; ExtractTables runs per-cell text extraction and returns assembled Table structs with row/col text + bbox grid. Only the 'lines' and 'lines_strict' strategies are implemented in this commit. 'text' and 'explicit' return ErrUnsupported with a clear Phase 1.3.D message so callers don't silently get empty results. Includes: - finder.go: Intersection, TableBox, TableFinder types; pure algorithm functions (edgesToIntersections, intersectionsToCells, cellsToTables, assembleTableBox, runTableFinder). - page.go: findTableEdges (derives + filters + merges edges from Lines/Rects/Curves), FindTables, ExtractTables, charsInCell predicate matching pdfplumber's char_in_bbox. - testdata/fixtures.go: TableRuled() helper producing a minimal 2-col x 3-row ruled fixture with known text content. - scripts/gen_table_fixture.go: build-tag-gated generator script. - table_test.go: algorithm unit tests on hand-crafted edge lists plus a public-API integration test against the ruled fixture. - .gitattributes: mark *.pdf as binary so test fixtures don't get line-ending normalised on checkout.
Add a pdfplumber-parity golden test that loads any
testdata/golden/*.tables.expected.json fixture and compares
ExtractTables output cell-for-cell against the reference. Cells are
compared after whitespace normalisation (collapse runs of whitespace,
trim) so pdftable's dense extract_text output (space-joined within a
line) matches pdfplumber's intra-cell line-break output ("A\nB") on
token sequence.
Add issue-466-example.pdf from pdfplumber's own test fixtures as the
first parity case: 4r x 3c + 2r x 3c ruled tables with predictable
"T0-C0" / "T0-00" style cell labels. pdftable and pdfplumber agree
on cell count, row/column shape, and cell text for this fixture.
Extend scripts/gen_golden.py to emit the new *.tables.expected.json
alongside the existing *.expected.json (words + text golden), so a
single run of gen_golden.py keeps both v0.1.x and v0.2.0 fixtures in
sync.
Update TestGoldenAgainstPdfplumber to skip *.tables.expected.json
files (different schema, exercised by the new test instead).
Document v0.2.0 in CHANGELOG with the full added surface (TableSettings, TableStrategy, Page.FindTables / ExtractTables, TableFinder, TableBox, Table, Intersection, internal/layout package), what's deferred to v0.3.0 (text + explicit strategies), and the known limitations (cell text on standard-14 fonts without the AFM bundle). Update README with: - Status bumped to v0.2.0. - Page interface excerpt with the two new methods. - A "Tables (lines strategy)" section showing how to call ExtractTables with default settings + switch to lines_strict. - A side-by-side pdfplumber -> pdftable example for find_tables. - Architecture diagram updated with table.go, finder.go, and the internal/layout package. - Roadmap updated to push text + explicit strategies to v0.3.x. Add examples/extract_tables/main.go: the runnable form of the README's tables example. It's a compile target in `go build ./...` so a future change to the public API breaks the snippet at build time rather than letting it drift.
Reviewer's GuideImplements v0.2.0 table-finding via a Go port of pdfplumber’s TableFinder, wiring a geometry-only pipeline (edges → intersections → cells → tables) and text extraction into the Page API, backed by a new internal layout/edge-merging layer, tests, fixtures, golden parity, examples, and docs/changelog updates. Sequence diagram for Page.ExtractTables pipelinesequenceDiagram
actor Caller
participant Page
participant Layout as layout
participant Finder as runTableFinder
Caller->>Page: ExtractTables(settings)
Page->>Page: settings.applyDefaults()
Page->>Page: ensureSupportedStrategies(settings)
Page->>Page: FindTables(settings)
Page->>Page: findTableEdges(settings)
Page->>Page: Objects()
Page->>Layout: FromLine/FromRect/FromCurve
Page->>Layout: FilterEdgesByOrientation
Page->>Layout: FilterEdgesBySource
Page->>Layout: FilterEdgesByLength
Page->>Layout: MergeEdges
Page-->>Page: []layout.Edge
Page->>Finder: runTableFinder(edges, IntersectionTolerance, IntersectionTolerance)
Finder->>Finder: edgesToIntersections
Finder->>Finder: intersectionsToCells
Finder->>Finder: cellsToTables
Finder->>Finder: assembleTableBox
Finder-->>Page: []TableFinder
Page->>Page: Chars()
Page->>Page: assembleTableText(TableBox, chars, settings, pageNumber)
Page-->>Caller: []*Table
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
|
Warning Review limit reached
More reviews will be available in 16 minutes and 31 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: ⛔ Files ignored due to path filters (2)
📒 Files selected for processing (14)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Hey - I've found 3 issues, and left some high level feedback:
- In assembleTableText, the locally constructed wordOpts is never used; either wire it through to the text extraction path if it’s supposed to control word grouping, or remove it to avoid dead code and confusion.
- The hard-coded axis-alignment tolerance (tol := 0.1) in findTableEdges could be made configurable (e.g., via TableSettings) so callers dealing with very small or very large coordinate scales can tune what counts as ‘near-horizontal/vertical’.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In assembleTableText, the locally constructed wordOpts is never used; either wire it through to the text extraction path if it’s supposed to control word grouping, or remove it to avoid dead code and confusion.
- The hard-coded axis-alignment tolerance (tol := 0.1) in findTableEdges could be made configurable (e.g., via TableSettings) so callers dealing with very small or very large coordinate scales can tune what counts as ‘near-horizontal/vertical’.
## Individual Comments
### Comment 1
<location path="page.go" line_range="589-592" />
<code_context>
+ textOpts.XTolerance = s.TextTolerance
+ textOpts.YTolerance = s.TextTolerance
+
+ wordOpts := DefaultWordOpts()
+ wordOpts.XTolerance = s.TextTolerance
+ wordOpts.YTolerance = s.TextTolerance
+ wordOpts.KeepBlankChars = s.KeepBlankChars
+
+ for ri, row := range tb.CellsGrid {
</code_context>
<issue_to_address>
**issue (bug_risk):** KeepBlankChars setting and wordOpts are currently unused in cell text extraction
`assembleTableText` builds `wordOpts` and sets `KeepBlankChars` from `TableSettings`, but `wordOpts` is never used (only `extractTextFromChars` with `textOpts` is called). This means `KeepBlankChars` does nothing for `ExtractTables`, breaking the documented behaviour and pdfplumber parity. Please either thread `wordOpts` / `KeepBlankChars` into the extraction path or remove them from `TableSettings` to avoid a misleading API.
</issue_to_address>
### Comment 2
<location path="table_test.go" line_range="92-95" />
<code_context>
+ }
+}
+
+// TestEdgesToIntersections_NoCrossing asserts that two parallel
+// edges, even when their Y / X spans overlap, produce no
+// intersections.
+func TestEdgesToIntersections_NoCrossing(t *testing.T) {
+ edges := []layout.Edge{
+ makeH(0, 100, 50),
</code_context>
<issue_to_address>
**issue (testing):** Test name and comment are misleading relative to what is actually asserted
The setup here (two horizontals, two verticals, expecting 4 intersections) doesn’t match the “no crossing” / parallel-edge description. Please either rename the test and adjust the comment to describe the actual behaviour, or change the fixture to truly parallel edges with zero intersections so the test intent is clear.
</issue_to_address>
### Comment 3
<location path="page.go" line_range="495" />
<code_context>
+ return merged, nil
+}
+
+// FindTables runs the geometry-only pipeline (edges → intersections
+// → cells → tables) and returns one TableFinder per detected table
+// group. Strategies "text" and "explicit" return ErrUnsupported.
</code_context>
<issue_to_address>
**issue (complexity):** Consider delegating FindTables to the shared runTableFinder pipeline and then reshaping its result to keep the per-table TableFinder API while avoiding duplicated geometry logic in page.go.
You can drop the duplicated geometry pipeline in `FindTables` and delegate to `runTableFinder`, keeping the per-table `TableFinder` behaviour by post-processing the result.
That keeps all the functionality you added (one `TableFinder` per table, with edges/intersections/cells exposed) but centralises the algorithm in `finder.go` so there’s a single source of truth.
For example, refactor `FindTables` like this:
```go
func (p *page) FindTables(settings TableSettings) ([]TableFinder, error) {
s := settings.applyDefaults()
if err := ensureSupportedStrategies(s); err != nil {
return nil, err
}
edges, err := p.findTableEdges(s)
if err != nil {
return nil, err
}
if len(edges) == 0 {
return nil, nil
}
// Use the centralised pipeline
tf := runTableFinder(edges, s.IntersectionTolerance, s.IntersectionTolerance)
if len(tf.Tables) == 0 {
return nil, nil
}
// Preserve the “one TableFinder per table” API
out := make([]TableFinder, 0, len(tf.Tables))
for _, tb := range tf.Tables {
out = append(out, TableFinder{
Edges: tf.Edges,
Intersections: tf.Intersections,
Cells: tf.Cells,
Tables: []TableBox{tb},
})
}
return out, nil
}
```
If `runTableFinder` currently computes intersections/cells/tables internally from `edges`, you can also remove the now-redundant calls to `edgesToIntersections`, `intersectionsToCells`, and `cellsToTables` from `page.go`. That keeps `page.go` focused on page-level orchestration (building `edges`) and leaves all geometry-only logic in `finder.go`.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
| wordOpts := DefaultWordOpts() | ||
| wordOpts.XTolerance = s.TextTolerance | ||
| wordOpts.YTolerance = s.TextTolerance | ||
| wordOpts.KeepBlankChars = s.KeepBlankChars |
There was a problem hiding this comment.
issue (bug_risk): KeepBlankChars setting and wordOpts are currently unused in cell text extraction
assembleTableText builds wordOpts and sets KeepBlankChars from TableSettings, but wordOpts is never used (only extractTextFromChars with textOpts is called). This means KeepBlankChars does nothing for ExtractTables, breaking the documented behaviour and pdfplumber parity. Please either thread wordOpts / KeepBlankChars into the extraction path or remove them from TableSettings to avoid a misleading API.
| // TestEdgesToIntersections_NoCrossing asserts that two parallel | ||
| // edges, even when their Y / X spans overlap, produce no | ||
| // intersections. | ||
| func TestEdgesToIntersections_NoCrossing(t *testing.T) { |
There was a problem hiding this comment.
issue (testing): Test name and comment are misleading relative to what is actually asserted
The setup here (two horizontals, two verticals, expecting 4 intersections) doesn’t match the “no crossing” / parallel-edge description. Please either rename the test and adjust the comment to describe the actual behaviour, or change the fixture to truly parallel edges with zero intersections so the test intent is clear.
| return merged, nil | ||
| } | ||
|
|
||
| // FindTables runs the geometry-only pipeline (edges → intersections |
There was a problem hiding this comment.
issue (complexity): Consider delegating FindTables to the shared runTableFinder pipeline and then reshaping its result to keep the per-table TableFinder API while avoiding duplicated geometry logic in page.go.
You can drop the duplicated geometry pipeline in FindTables and delegate to runTableFinder, keeping the per-table TableFinder behaviour by post-processing the result.
That keeps all the functionality you added (one TableFinder per table, with edges/intersections/cells exposed) but centralises the algorithm in finder.go so there’s a single source of truth.
For example, refactor FindTables like this:
func (p *page) FindTables(settings TableSettings) ([]TableFinder, error) {
s := settings.applyDefaults()
if err := ensureSupportedStrategies(s); err != nil {
return nil, err
}
edges, err := p.findTableEdges(s)
if err != nil {
return nil, err
}
if len(edges) == 0 {
return nil, nil
}
// Use the centralised pipeline
tf := runTableFinder(edges, s.IntersectionTolerance, s.IntersectionTolerance)
if len(tf.Tables) == 0 {
return nil, nil
}
// Preserve the “one TableFinder per table” API
out := make([]TableFinder, 0, len(tf.Tables))
for _, tb := range tf.Tables {
out = append(out, TableFinder{
Edges: tf.Edges,
Intersections: tf.Intersections,
Cells: tf.Cells,
Tables: []TableBox{tb},
})
}
return out, nil
}If runTableFinder currently computes intersections/cells/tables internally from edges, you can also remove the now-redundant calls to edgesToIntersections, intersectionsToCells, and cellsToTables from page.go. That keeps page.go focused on page-level orchestration (building edges) and leaves all geometry-only logic in finder.go.
Summary
Ports pdfplumber's
TableFinder(edges → intersections → cells →tables pipeline) into Go and wires it into the
Pageinterface viatwo new methods, completing Phase 1.3.C of the pdftable port.
What's in
Page.FindTables(settings) ([]TableFinder, error)— geometry-onlystage. Returns one
TableFinderper detected table group,exposing the merged edges, intersections, raw cells, and
assembled per-table
CellsGridfor debugging / custom rendering.Page.ExtractTables(settings) ([]*Table, error)— wrapsFindTables, runs per-cell text extraction (chars whose centrelies in the cell bbox, joined via the existing extract_text
path, whitespace-trimmed), returns
*Tablestructs with rows /per-cell bbox grid / page number / union bbox.
TableSettingsvalue type withDefaultTableSettings()carrying pdfplumber-matching defaults
(snap_tolerance=3, join_tolerance=3, edge_min_length=3,
edge_min_length_prefilter=1, intersection_tolerance=3,
text_tolerance=3).
TableStrategyenum:StrategyLines(default — Lines + Rects +axis-aligned Curve segments),
StrategyLinesStrict(only drawnLinesegments).internal/layoutpackage:Edgetype withFromLine/FromRect/FromCurveplus the snap → join →filter pipeline. Built by the previous WIP commit; this PR adds
the consumer.
table_test.go)testdata/table-2x3-ruled.pdffixture.testdata/golden/*.tables.expected.jsonfixture is loaded andcompared cell-for-cell after whitespace normalisation. First
case:
issue-466-example.pdf(copied from pdfplumber's ownfixtures) — 4×3 + 2×3 ruled tables; pdftable matches pdfplumber
cell-for-cell.
examples/extract_tables/main.goso theREADME snippet doesn't bit-rot.
What's out (deferred to v0.3.0)
StrategyText(infer edges from word alignment).StrategyExplicit(caller-supplied edges only).Both return
ErrUnsupportedwith a clear "Phase 1.3.D" message socallers don't silently get empty results.
TableSettings.Explicit*Linesfields ARE honoured under the lines strategies (added on top of
derived edges).
Test plan
go build ./...clean.go vet ./...clean.go test ./...all green.edgesToIntersections,intersectionsToCells,cellsToTables,assembleTableBox,runTableFinder.ruled fixture and asserts row count + cell text exactly:
[[Name, Age], [Alice, 30], [Bob, 25]].issue-466-example.pdfmatchescell-for-cell after whitespace normalisation.
examples/extract_tables/main.goruns against theparity fixture and prints the expected rows.
Known limitations (not regressions)
tables can have intra-word gaps reported as "no gap"; this is the
same word-bbox drift documented for v0.1.0 and bites
la-precinct-bulletin-2014-p1.pdfcell text on parity. AFM bundleis on the v0.2.x roadmap.
senate-expenditures.pdfproduces 7 cells where pdfplumber finds10; under investigation as a follow-up issue (snap+join unifies
edges that share a near-collinear endpoint slightly differently
from pdfplumber on this fixture).
Both fixtures are NOT in the golden set (CI would be noisy);
they're recorded in the CHANGELOG's "Known limitations" section.
Summary by Sourcery
Add line-based table detection and extraction to the Page API and supporting internals, plus pdfplumber parity tests and documentation for the new v0.2.0 release.
New Features:
Enhancements: