diff --git a/.gitattributes b/.gitattributes new file mode 100644 index 0000000..bd051a0 --- /dev/null +++ b/.gitattributes @@ -0,0 +1,4 @@ +* text=auto eol=lf + +# Binary fixtures — keep raw bytes intact across platforms. +*.pdf binary diff --git a/CHANGELOG.md b/CHANGELOG.md index 698032f..096b104 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -5,6 +5,98 @@ All notable changes to this project will be documented in this file. The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/), and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html). +## [0.2.0] - 2026-05-27 + +Phase 1.3.C — table-finding via ruled lines. Direct port of +pdfplumber's `TableFinder` + cells-from-edges algorithm (`table.py`). +The v0.1.x public API surface is unchanged; v0.2.0 only adds methods +to the `Page` interface and new top-level types, so existing callers +compile and run as-is. + +### Added + +- `Page.FindTables(settings TableSettings) ([]TableFinder, error)` — + geometry-only stage of the pipeline. Returns one TableFinder per + detected table group with the merged edges, intersections, raw + cells, and assembled per-table CellsGrid exposed for debugging / + custom rendering. +- `Page.ExtractTables(settings TableSettings) ([]*Table, error)` — + wraps FindTables, runs per-cell text extraction, returns fully + populated `Table` structs. Cell text is the dense extract\_text + output for chars whose centre point falls inside the cell bbox, + with leading / trailing whitespace stripped. Empty cells produce + `""`. +- `TableSettings` struct with `DefaultTableSettings()` constructor + carrying pdfplumber-matching defaults (snap\_tolerance=3, + join\_tolerance=3, edge\_min\_length=3, edge\_min\_length\_prefilter=1, + intersection\_tolerance=3, text\_tolerance=3). +- `TableStrategy` enum with constants `StrategyLines`, + `StrategyLinesStrict`, `StrategyText`, `StrategyExplicit`. 
Only + `StrategyLines` and `StrategyLinesStrict` are implemented in this + release; `StrategyText` and `StrategyExplicit` are deferred to + v0.3.0 and return `ErrUnsupported` (with a clear "Phase 1.3.D" + message) so callers don't get silent empty results. +- `Table` (rows × columns of cell text + bbox + per-cell bbox grid), + `TableFinder` (edges + intersections + cells + tables), `TableBox` + (one assembled table's geometry: bbox + Rows × Cols grid), + `Intersection` (one edge-crossing point with its participating + vertical and horizontal edges). +- Internal `internal/layout` package: `Edge` type with `FromLine`, + `FromRect`, `FromCurve` constructors, plus the snap → join → + filter pipeline (`SnapEdges`, `JoinEdges`, `MergeEdges`, + `FilterEdgesByLength`, `FilterEdgesBySource`, + `FilterEdgesByOrientation`, `SortEdges`). +- Golden-file parity test against pdfplumber's `find_tables({"lines"})` + on the `issue-466-example.pdf` fixture (4×3 + 2×3 ruled tables). + Test infrastructure (`TestGoldenTablesAgainstPdfplumber` in + `golden_test.go`) loads any `*.tables.expected.json` fixture in + `testdata/golden/` and compares cell-for-cell after whitespace + normalisation. Regenerate via `python scripts/gen_golden.py`. +- New hand-crafted fixture: `testdata.TableRuled()` — minimal + 2-column × 3-row ruled table with predictable text ("Name", "Age"; + "Alice", "30"; "Bob", "25") for unit testing the public API + surface without depending on third-party PDFs. Generator script + at `scripts/gen_table_fixture.go`. +- Algorithm-level unit tests in `table_test.go`: hand-crafted edge + lists exercising `edgesToIntersections`, `intersectionsToCells`, + `cellsToTables`, `assembleTableBox`, and the full `runTableFinder` + pipeline. +- README "Tables" section with a side-by-side Go / pdfplumber + example. The example is also extracted as a runnable program at + `examples/extract_tables/main.go` so changes to the API surface + break the example at build time. 
+ +### Deferred (planned for v0.3.0 — Phase 1.3.D) + +- `StrategyText`: infer table edges from word alignment (clusters of + words sharing x0 / x1 / centre, clusters of words sharing top / + bottom). Useful for PDFs whose tables have no ruled lines (e.g. + banking statements, scanned-then-OCR'd documents). +- `StrategyExplicit`: caller-supplied edges via + `TableSettings.ExplicitVerticalLines` / + `ExplicitHorizontalLines`. In v0.2.0 these settings are accepted + and added on top of the derived edges (helpful when a column + boundary isn't drawn), but they don't form the only source of + edges yet. + +### Known limitations + +- The cell-text extraction shares the v0.1.x word-grouping engine, + which depends on font metrics. Cells whose glyphs use standard-14 + fonts WITHOUT the bundled AFM tables can have intra-word gaps + reported as "no gap" — e.g. "Hello World" comes out as + "HelloWorld". This was already documented for v0.1.0; for v0.2.0 + it means the parity test against + `la-precinct-bulletin-2014-p1.pdf` (which uses Helvetica-Bold) + fails on cell text equality. The fixture is not checked in to + avoid CI noise; it'll be re-added once the AFM bundle lands in + v0.2.x. +- `senate-expenditures.pdf` produces 7 cells where pdfplumber finds + 10. The divergence is in how snap+join unifies edges that share a + near-collinear endpoint but differ slightly in the perpendicular + axis; under investigation as a follow-up issue. The fixture is + not in the golden set yet. + ## [0.1.1] - 2026-05-27 ### Fixed @@ -127,6 +219,7 @@ Initial release. Phase 1.3.A — content-stream primitives layer. - Type 3 fonts (their glyph procedures are themselves content streams). - Vertical writing mode. 
+[0.2.0]: https://github.com/hallelx2/pdftable/releases/tag/v0.2.0 [0.1.1]: https://github.com/hallelx2/pdftable/releases/tag/v0.1.1 [0.1.0]: https://github.com/hallelx2/pdftable/releases/tag/v0.1.0 [0.0.1]: https://github.com/hallelx2/pdftable/releases/tag/v0.0.1 diff --git a/README.md b/README.md index 64642dd..12b418d 100644 --- a/README.md +++ b/README.md @@ -19,9 +19,10 @@ heuristics on. This is that. ## Status -`v0.1.0` — words and text extraction. `Page.Words`, `Page.ExtractText`, -and `Page.ExtractTextSimple` ship with this release; table-finding -(`FindTables`, `ExtractTables`) is the next phase. +`v0.2.0` — line-strategy table finding. `Page.FindTables` and +`Page.ExtractTables` ship with this release covering the `lines` and +`lines_strict` strategies (PDFs with ruled tables). `text` and +`explicit` strategies return `ErrUnsupported` and land in v0.3.0. [![Go Reference](https://pkg.go.dev/badge/github.com/hallelx2/pdftable.svg)](https://pkg.go.dev/github.com/hallelx2/pdftable) [![CI](https://github.com/hallelx2/pdftable/actions/workflows/test.yml/badge.svg)](https://github.com/hallelx2/pdftable/actions/workflows/test.yml) @@ -30,7 +31,7 @@ and `Page.ExtractTextSimple` ship with this release; table-finding ## Install ```sh -go get github.com/hallelx2/pdftable@v0.1.0 +go get github.com/hallelx2/pdftable@v0.2.0 ``` Requires Go 1.25+ (uses the standard-library `iter` package for the `Pages()` range-over-func iterator, and pdfcpu v0.12+). @@ -111,6 +112,10 @@ type Page interface { Words(opts WordOpts) ([]Word, error) ExtractText(opts TextOpts) (string, error) ExtractTextSimple(xTolerance, yTolerance float64) (string, error) + + // New in v0.2.0: line-strategy table finding. + FindTables(settings TableSettings) ([]TableFinder, error) + ExtractTables(settings TableSettings) ([]*Table, error) } // Primitives. 
@@ -206,6 +211,91 @@ laid, _ := page.ExtractText(opts) fmt.Println(laid) ``` +## Tables (lines strategy) + +`Page.ExtractTables` is the table-detection entry point. It runs the +edges → intersections → cells → tables pipeline (a direct port of +pdfplumber's `TableFinder`) and returns one `*Table` per detected +ruled table, with cell text already extracted. + +```go +doc, _ := pdftable.OpenFile("invoice.pdf") +defer doc.Close() +page, _ := doc.Page(1) + +settings := pdftable.DefaultTableSettings() +// settings.VerticalStrategy = pdftable.StrategyLinesStrict // ignore rect outlines + +tables, _ := page.ExtractTables(settings) +for ti, t := range tables { + fmt.Printf("table %d: %d rows × %d cols at %+v\n", + ti, len(t.Rows), len(t.Rows[0]), t.BBox) + for _, row := range t.Rows { + fmt.Println(row) + } +} +``` + +`TableSettings` defaults match pdfplumber's +(`snap_tolerance=3`, `join_tolerance=3`, `edge_min_length=3`, +`intersection_tolerance=3`, `text_tolerance=3`). Override any field +on the value returned from `DefaultTableSettings()` to tighten or +loosen the heuristics. The two implemented strategies are: + +- `StrategyLines` — edges come from drawn `Line` segments, `Rect` + outlines (all four sides), and axis-aligned `Curve` segments. + Default. Best for typical PDFs whose tables have rule lines. +- `StrategyLinesStrict` — only drawn `Line` segments are used. Use + this when your PDF draws cell BACKGROUNDS as filled rectangles + that you do NOT want treated as row boundaries. + +`StrategyText` (word-alignment-based) and `StrategyExplicit` +(caller-supplied edges) return `ErrUnsupported` in v0.2.0 — they +land in v0.3.0. 
+ +### Side-by-side: pdfplumber → pdftable + +```python +# Python (pdfplumber) +import pdfplumber + +with pdfplumber.open("invoice.pdf") as pdf: + page = pdf.pages[0] + for table in page.find_tables({"vertical_strategy": "lines", + "horizontal_strategy": "lines"}): + for row in table.extract(): + print(row) +``` + +```go +// Go (pdftable) +import "github.com/hallelx2/pdftable" + +doc, _ := pdftable.OpenFile("invoice.pdf") +defer doc.Close() +page, _ := doc.Page(1) + +settings := pdftable.DefaultTableSettings() +settings.VerticalStrategy = pdftable.StrategyLines +settings.HorizontalStrategy = pdftable.StrategyLines + +tables, _ := page.ExtractTables(settings) +for _, t := range tables { + for _, row := range t.Rows { + fmt.Println(row) + } +} +``` + +The two outputs match cell-for-cell on ruled fixtures (see +`testdata/golden/issue-466-example.*` for the parity test). Field +naming differs in the obvious places: pdftable returns a slice of +`*Table` instead of `Table` objects you have to call `.extract()` +on; rows are `[]string` instead of `list[Optional[str]]` (missing +cells produce `""` rather than `None`); and table bboxes use +`(X0, Y0, X1, Y1)` PDF user space rather than pdfplumber's +image-space `(x0, top, x1, bottom)`. 
+ ## Side-by-side comparison with pdfplumber ```python @@ -299,16 +389,21 @@ pdftable/ ├── page.go // Page interface + implementation ├── char.go // Public Char / Line / Rect / Curve / Objects ├── text.go // Word + ExtractText + ExtractTextSimple (v0.1.0) +├── table.go // TableStrategy / TableSettings / Table types (v0.2.0) +├── finder.go // Cells-from-edges algorithm (v0.2.0) ├── clustering.go // 1-D clusterObjects, groupObjectsByAttr, dedupeChars ├── geometry.go // BBox helpers: Union, Intersect, Contains, Snap ├── errors.go // Sentinel errors -└── internal/pdf/ - ├── reader.go // pdfcpu bridge - ├── content.go // Content-stream interpreter - ├── ops.go // Operator dispatch table - ├── state.go // Graphics + text state, matrix math - ├── font.go // Font + encoding tables + glyph-name resolution - └── cmap.go // ToUnicode CMap parser +└── internal/ + ├── layout/ + │ └── lines.go // Edge type + snap/join/filter pipeline (v0.2.0) + └── pdf/ + ├── reader.go // pdfcpu bridge + ├── content.go // Content-stream interpreter + ├── ops.go // Operator dispatch table + ├── state.go // Graphics + text state, matrix math + ├── font.go // Font + encoding tables + glyph-name resolution + └── cmap.go // ToUnicode CMap parser ``` The public `pdftable` package is small and stable. The `internal/pdf` @@ -333,12 +428,15 @@ stdlib-only. - `v0.0.x` — content-stream primitives. - `v0.1.x` — text extraction: `Page.ExtractText`, `Page.Words`, - `Page.ExtractTextSimple` (this release). -- `v0.2.x` — table finding: `Page.FindTables` using ruling-line + - whitespace heuristics, `Page.ExtractTables` returning row/cell text. - Bundle the standard-14 AFM metrics so word bboxes match pdfplumber - to within 1 PDF point. -- `v0.3.x` — performance pass: parser benchmarking against pdfminer.six + `Page.ExtractTextSimple`. +- `v0.2.x` — table finding via ruling lines (this release): + `Page.FindTables` / `Page.ExtractTables` covering the `lines` and + `lines_strict` strategies. 
+- `v0.3.x` — remaining table strategies: `text` (word-alignment + edges) and `explicit` (caller-supplied edges). Bundle the + standard-14 AFM metrics so word bboxes (and therefore cell text) + match pdfplumber to within 1 PDF point on standard fonts. +- `v0.4.x` — performance pass: parser benchmarking against pdfminer.six and pdfplumber on a representative document corpus. ## License diff --git a/examples/extract_tables/main.go b/examples/extract_tables/main.go new file mode 100644 index 0000000..86ea48d --- /dev/null +++ b/examples/extract_tables/main.go @@ -0,0 +1,65 @@ +// Copyright (c) 2026 Halleluyah Oludele +// Licensed under the MIT License. + +// examples/extract_tables/main.go is the runnable form of the +// README's "Tables (lines strategy)" example. It exists so that +// changes to the public API surface break the example at build time +// rather than letting a stale snippet drift in the README. +// +// Run from the repo root: +// +// go run ./examples/extract_tables testdata/golden/issue-466-example.pdf +// +// The example uses the ExtractTables call with default settings +// (which select the "lines" strategy on both axes). It prints each +// detected table's rows × cols and dimensions, then each row as a +// flat slice — exactly the snippet documented in README.md. 
+package main + +import ( + "fmt" + "log" + "os" + + "github.com/hallelx2/pdftable" +) + +func main() { + if len(os.Args) < 2 { + fmt.Fprintln(os.Stderr, "usage: extract_tables ") + os.Exit(2) + } + path := os.Args[1] + + doc, err := pdftable.OpenFile(path) + if err != nil { + log.Fatalf("OpenFile %s: %v", path, err) + } + defer doc.Close() + + page, err := doc.Page(1) + if err != nil { + log.Fatalf("Page(1): %v", err) + } + + settings := pdftable.DefaultTableSettings() + // Uncomment to ignore Rect outlines (filled cell backgrounds + // that aren't real row boundaries): + // settings.VerticalStrategy = pdftable.StrategyLinesStrict + + tables, err := page.ExtractTables(settings) + if err != nil { + log.Fatalf("ExtractTables: %v", err) + } + for ti, t := range tables { + cols := 0 + if len(t.Rows) > 0 { + cols = len(t.Rows[0]) + } + fmt.Printf("table %d: %d rows × %d cols at %+v\n", + ti, len(t.Rows), cols, t.BBox) + for _, row := range t.Rows { + fmt.Println(row) + } + } +} diff --git a/finder.go b/finder.go new file mode 100644 index 0000000..9d1a75e --- /dev/null +++ b/finder.go @@ -0,0 +1,645 @@ +// Copyright (c) 2026 Halleluyah Oludele +// Licensed under the MIT License. + +package pdftable + +// finder.go is the Go port of pdfplumber/table.py's TableFinder. The +// algorithm runs in four stages, each implemented as a pure function +// below so it can be unit-tested without spinning up a Page: +// +// 1. getEdges — derive Lines/Rects/Curves from page primitives, +// apply prefilter, merge (snap + join), apply the +// post-merge min-length filter. Vertical edges +// from the "vertical_strategy" go in alongside the +// horizontal edges from the "horizontal_strategy". +// Implemented as Page.findTableEdges in page.go. +// 2. edgesToIntersections — pair every vertical edge with every +// horizontal edge and record the (x, y) +// intersection points whose perpendicular +// distance is within intersectionTolerance. +// 3. 
intersectionsToCells — for each intersection, walk down and right +// looking for the smallest closed rectangle +// whose four corners are all intersections +// joined by edges. Each found rectangle is +// one cell. +// 4. cellsToTables — group cells that share at least one corner into +// the same table. Tables sorted top-to-bottom- +// then-left-to-right. +// +// Coordinate system note: pdfplumber operates in image space (Y growing +// DOWN). pdftable uses PDF user space (Y growing UP). The intersection +// algorithm is invariant under that flip — "below" in image space is +// "below" in user space if we substitute "lower Y" — but the wording in +// pdfplumber's source talks about "directly below and directly right". +// We keep the same algorithm, sorted by (Y descending, X ascending) so +// "below" means smaller Y (visually lower) and "right" means larger X. +// pdfplumber's points list is sorted ascending in image-space Y (so the +// FIRST point is the visually-topmost), which maps in user space to +// DESCENDING Y. The intersection logic uses point equality on (x, y) +// so the sort order only affects iteration order and the resulting +// cell list is the same. + +import ( + "fmt" + "sort" + + "github.com/hallelx2/pdftable/internal/layout" +) + +// Intersection records one crossing point: an (x, y) tuple plus the +// vertical and horizontal edges that meet there. We need the edge sets +// (not just the count) because the cell-finder asks "does the same +// edge connect points p1 and p2?" — checking that two points lie on a +// shared edge is how the algorithm distinguishes "two intersections on +// the same ruler" from "two intersections on parallel rulers that +// happen to align". +// +// Field naming follows pdfplumber's intersections dict-of-dicts shape: +// the X/Y are the keys, V/H are the value lists. We keep them as slice +// fields so the struct is value-comparable on (X, Y) alone. 
+type Intersection struct { + X, Y float64 + V []layout.Edge // vertical edges passing through (X, Y) + H []layout.Edge // horizontal edges passing through (X, Y) +} + +// TableBox is one detected table, expressed as a bbox plus a 2-D grid +// of cell bboxes. Rows are visually top-to-bottom; columns are left-to- +// right. CellsGrid[i][j] gives the bbox of the cell at row i, column j; +// missing cells (rectangular gaps in the grid) are reported as the +// zero BBox, NOT removed — callers can detect "this cell was missing" +// by checking IsZero on the entry. +// +// This is the geometry-only intermediate between FindTables and +// ExtractTables: FindTables returns one of these per detected table; +// ExtractTables then runs text-extraction per cell and wraps the +// result in a Table. +type TableBox struct { + // BBox is the union of every cell's bbox. + BBox BBox + + // Rows is the row count. + Rows int + + // Cols is the column count. + Cols int + + // CellsGrid is the per-cell bbox aligned to Rows × Cols. The + // entry at [i][j] is the bbox of the cell at visual row i (0 is + // topmost) and column j (0 is leftmost). Empty cells are the zero + // BBox. + CellsGrid [][]BBox +} + +// Cells returns the cell bboxes flattened into reading order +// (left-to-right, top-to-bottom). Zero-bbox entries (holes in the +// grid) are skipped. Convenience helper for callers that want a single +// iterable. +func (t TableBox) Cells() []BBox { + out := make([]BBox, 0, t.Rows*t.Cols) + for _, row := range t.CellsGrid { + for _, c := range row { + if !c.IsZero() { + out = append(out, c) + } + } + } + return out +} + +// TableFinder is the geometry-only result of running the cells-from- +// edges pipeline on a page. It exposes the intermediate stages +// (edges, intersections, raw cells) alongside the assembled TableBox +// list so callers building debugging tools or custom text-extraction +// can see exactly what the pipeline produced. 
+// +// Pdfplumber bundles the page reference inside its TableFinder and +// exposes Table objects with an .extract() method; we keep the +// finder a pure value (no Page pointer) and let callers either grab +// the assembled Tables from Page.ExtractTables or compose their own +// text-fill loop using the public Cells and CellsGrid. +type TableFinder struct { + // Edges is the merged, length-filtered edge list used as the + // input to the intersection scan. Useful for debugging "why + // didn't this rule get picked up" issues. + Edges []layout.Edge + + // Intersections is the full set of edge crossings, keyed by + // (X, Y). The order is deterministic — sorted by Y descending, + // then X ascending — so callers can rely on iteration order. + Intersections []Intersection + + // Cells is the raw list of detected cell bboxes BEFORE grouping + // into tables. Each is a single rectangle whose four corners are + // intersections joined by shared edges. + Cells []BBox + + // Tables is the final list of detected tables. Each carries a + // bbox plus a CellsGrid aligned to row/column order. Tables are + // sorted top-to-bottom-then-left-to-right by their topmost cell. + Tables []TableBox +} + +// edgesToIntersections is the Go port of pdfplumber's +// edges_to_intersections. Given a slice of merged edges, return the +// list of crossing points where a vertical edge meets a horizontal +// edge within the supplied perpendicular tolerance. +// +// The algorithm: +// - Split edges by orientation. +// - Sort vertical edges by (X, smallest Y); sort horizontal edges +// by (Y descending so top-to-bottom in user space, X). +// - For each (v, h) pair, test whether the vertical edge's Y span +// covers h.Y (within yTol) AND h's X span covers v.X (within xTol). +// If yes, register (v.X, h.Y) as an intersection. 
+// +// We deduplicate intersection points on (X, Y) equality — if multiple +// edge pairs land on the same exact point, the V and H slices +// accumulate all participating edges. This matches pdfplumber's +// behaviour: its dict-keyed-on-vertex collapses repeats. +// +// xTol and yTol are used as given: zero means no slack. The caller +// (TableSettings) already substitutes the pdfplumber default (3) +// before reaching this function. +func edgesToIntersections(edges []layout.Edge, xTol, yTol float64) []Intersection { + if len(edges) == 0 { + return nil + } + + vEdges := layout.FilterEdgesByOrientation(edges, layout.Vertical) + hEdges := layout.FilterEdgesByOrientation(edges, layout.Horizontal) + + // Sort: pdfplumber sorts v by (x0, top) and h by (top, x0). In + // PDF user space "top" means LARGER Y. We sort v by (X0 asc, + // Y0 asc) and h by (Y0 desc, X0 asc) so iteration order matches + // the visual top-to-bottom traversal pdfplumber uses. + sort.SliceStable(vEdges, func(i, j int) bool { + if vEdges[i].X0 != vEdges[j].X0 { + return vEdges[i].X0 < vEdges[j].X0 + } + return vEdges[i].Y0 < vEdges[j].Y0 + }) + sort.SliceStable(hEdges, func(i, j int) bool { + if hEdges[i].Y0 != hEdges[j].Y0 { + return hEdges[i].Y0 > hEdges[j].Y0 + } + return hEdges[i].X0 < hEdges[j].X0 + }) + + // Key on (X, Y) using two floats — float64 comparisons are + // exact-equality after snap_edges has unified positions onto + // cluster means, so a struct key works without epsilon games. + type key struct { + x, y float64 + } + indexByKey := make(map[key]int) + var out []Intersection + + for _, v := range vEdges { + for _, h := range hEdges { + // pdfplumber's test (translated from image-space "top" / + // "bottom" to PDF user-space Y0 / Y1): + // + // image space: v.top <= h.top + yTol AND + // v.bottom >= h.top - yTol + // + // In image space "top" is the smaller Y (visually higher) + // and "bottom" is the larger Y. In PDF user space the + // orientation flips: "visually higher" means LARGER Y. 
+ // + // The two conditions in image space say: h.top is between + // v.top and v.bottom (i.e. v's Y range covers h's Y). In + // PDF user space the same constraint is: + // + // v.Y0 <= h.Y0 + yTol AND v.Y1 >= h.Y0 - yTol + // + // — the horizontal edge's Y (which equals h.Y0 == h.Y1) + // must lie within the vertical edge's [Y0, Y1] span, + // with yTol slack on both ends. + if v.Y0 > h.Y0+yTol { + continue + } + if v.Y1 < h.Y0-yTol { + continue + } + // h must cover v.X0 with xTol slack. + if v.X0 < h.X0-xTol { + continue + } + if v.X0 > h.X1+xTol { + continue + } + k := key{x: v.X0, y: h.Y0} + if idx, ok := indexByKey[k]; ok { + out[idx].V = append(out[idx].V, v) + out[idx].H = append(out[idx].H, h) + } else { + indexByKey[k] = len(out) + out = append(out, Intersection{ + X: v.X0, + Y: h.Y0, + V: []layout.Edge{v}, + H: []layout.Edge{h}, + }) + } + } + } + + // Deterministic order: pdfplumber sorts the keys ascending in + // (image-space) top, then x0 — which in user space is Y + // DESCENDING (visually top first), then X ascending. + sort.SliceStable(out, func(i, j int) bool { + if out[i].Y != out[j].Y { + return out[i].Y > out[j].Y + } + return out[i].X < out[j].X + }) + return out +} + +// edgeSharedKey is the equality test used to decide whether two +// intersection lists share an edge. We hash edges by their full +// geometric tuple so that two intersections that lie on the SAME +// merged edge (post snap+join, so identical X/Y) share the key. +type edgeSharedKey struct { + x0, y0, x1, y1 float64 + o layout.Orientation +} + +func edgeKey(e layout.Edge) edgeSharedKey { + return edgeSharedKey{x0: e.X0, y0: e.Y0, x1: e.X1, y1: e.Y1, o: e.Orientation} +} + +// edgeConnects reports whether p1 and p2 share an edge — i.e. lie on +// the same merged ruler. This is the predicate pdfplumber uses to +// distinguish "two points on the same line" from "two points on +// parallel lines that happen to align". 
+// +// p1 and p2 must share an axis (same X for vertical-shared, same Y +// for horizontal-shared); the function returns false otherwise. +func edgeConnects(p1, p2 Intersection) bool { + if p1.X == p2.X { + // Look for a vertical edge present in both intersections' + // V slices. + seen := make(map[edgeSharedKey]struct{}, len(p1.V)) + for _, e := range p1.V { + seen[edgeKey(e)] = struct{}{} + } + for _, e := range p2.V { + if _, ok := seen[edgeKey(e)]; ok { + return true + } + } + } + if p1.Y == p2.Y { + seen := make(map[edgeSharedKey]struct{}, len(p1.H)) + for _, e := range p1.H { + seen[edgeKey(e)] = struct{}{} + } + for _, e := range p2.H { + if _, ok := seen[edgeKey(e)]; ok { + return true + } + } + } + return false +} + +// intersectionsToCells is the Go port of pdfplumber's +// intersections_to_cells. Given a sorted-by-(Y desc, X asc) list of +// intersections, return the smallest closed rectangle anchored at +// each intersection — that's one cell. +// +// The algorithm at each point pt: +// - Find all `below` points (same X as pt, lower Y in user space). +// - Find all `right` points (same Y as pt, larger X). +// - For every below_pt with which pt shares a vertical edge, and +// every right_pt with which pt shares a horizontal edge, check +// whether the diagonal corner (right_pt.X, below_pt.Y) is also an +// intersection AND shares the necessary edges to close the +// rectangle. If yes, that's the smallest cell — return it. +// +// "Smallest" comes from the sort order: `below` and `right` slices +// preserve the sorted suffix of the intersections list, so the +// nearest points are tried first. +func intersectionsToCells(intersections []Intersection) []BBox { + if len(intersections) == 0 { + return nil + } + + // Index by (X, Y) for fast "is this corner an intersection" lookup. 
+ type key struct { + x, y float64 + } + idxByKey := make(map[key]int, len(intersections)) + for i, p := range intersections { + idxByKey[key{x: p.X, y: p.Y}] = i + } + + // In user space the intersections list is sorted by Y descending + // (visually top first) then X ascending. "directly below pt" + // means same X with smaller Y; "directly right" means same Y + // with larger X. Both are in the SUFFIX of the sorted list, + // matching pdfplumber's "rest = points[i+1:]" walk. + var cells []BBox + for i, pt := range intersections { + if i == len(intersections)-1 { + break + } + rest := intersections[i+1:] + + // below: same X, smaller Y (already true for the suffix + // because the list is Y-descending and X-ascending; same X + // entries that follow pt all have smaller Y). + var below []Intersection + var right []Intersection + for _, q := range rest { + if q.X == pt.X && q.Y < pt.Y { + below = append(below, q) + } + if q.Y == pt.Y && q.X > pt.X { + right = append(right, q) + } + } + + found := false + for _, bp := range below { + if !edgeConnects(pt, bp) { + continue + } + for _, rp := range right { + if !edgeConnects(pt, rp) { + continue + } + cornerKey := key{x: rp.X, y: bp.Y} + cornerIdx, ok := idxByKey[cornerKey] + if !ok { + continue + } + corner := intersections[cornerIdx] + if !edgeConnects(corner, rp) { + continue + } + if !edgeConnects(corner, bp) { + continue + } + // Cell corners in PDF user space: + // pt (top-left) : (pt.X, pt.Y) + // rp (top-right) : (rp.X, pt.Y) + // bp (bottom-left) : (pt.X, bp.Y) + // corner (bottom-right): (rp.X, bp.Y) + // Cell bbox: x0 = pt.X, x1 = rp.X, y0 = bp.Y, y1 = pt.Y. + cells = append(cells, NewBBox(pt.X, bp.Y, rp.X, pt.Y)) + found = true + break + } + if found { + break + } + } + } + return cells +} + +// cellsToTables is the Go port of pdfplumber's cells_to_tables. Given +// the raw cells from intersectionsToCells, group cells that share at +// least one corner into the same table. 
Standalone cells (those that +// don't touch any other cell on a corner) are dropped — a "table" of +// one cell is almost always a decorative box, not real tabular data. +// +// The implementation: +// - Initialise a current table with the first remaining cell. +// - In a pass over the remaining cells, append every cell that shares +// at least one corner with the current table; remove appended +// cells from the remaining list. +// - Repeat until no more cells get appended in a pass; close the +// current table and start a new one with the next remaining cell. +// - Drop tables with fewer than 2 cells. +// +// Returns each table as a 1-D slice of its constituent cells; the +// caller (assembleTableBox) then projects the cells into the row / +// column grid. +func cellsToTables(cells []BBox) [][]BBox { + if len(cells) == 0 { + return nil + } + remaining := make([]BBox, len(cells)) + copy(remaining, cells) + + type corner struct { + x, y float64 + } + bboxCorners := func(b BBox) [4]corner { + return [4]corner{ + {b.X0, b.Y1}, // top-left + {b.X0, b.Y0}, // bottom-left + {b.X1, b.Y1}, // top-right + {b.X1, b.Y0}, // bottom-right + } + } + + var tables [][]BBox + currentCells := make([]BBox, 0) + currentCorners := make(map[corner]struct{}) + + for len(remaining) > 0 { + initialCount := len(currentCells) + // One pass over remaining; collect newly assigned indices to + // remove after the pass so we don't disturb the slice during + // iteration. 
+ assigned := make([]int, 0, len(remaining)) + for i, cell := range remaining { + cc := bboxCorners(cell) + if len(currentCells) == 0 { + for _, c := range cc { + currentCorners[c] = struct{}{} + } + currentCells = append(currentCells, cell) + assigned = append(assigned, i) + continue + } + cornerCount := 0 + for _, c := range cc { + if _, ok := currentCorners[c]; ok { + cornerCount++ + } + } + if cornerCount > 0 { + for _, c := range cc { + currentCorners[c] = struct{}{} + } + currentCells = append(currentCells, cell) + assigned = append(assigned, i) + } + } + // Apply removals in reverse so indices stay valid. + sort.Sort(sort.Reverse(sort.IntSlice(assigned))) + for _, idx := range assigned { + remaining = append(remaining[:idx], remaining[idx+1:]...) + } + + if len(currentCells) == initialCount { + // Nothing was added this pass — close the table and start + // a fresh one with the next remaining cell. + tables = append(tables, currentCells) + currentCells = make([]BBox, 0) + currentCorners = make(map[corner]struct{}) + } + } + if len(currentCells) > 0 { + tables = append(tables, currentCells) + } + + // Sort tables visually top-to-bottom, then left-to-right. We key + // each table on its topmost-leftmost cell — pdfplumber uses + // min((top, x0)) (image space); in user space that's the cell + // with the LARGEST Y1 (visual top), tie-broken by smallest X0. + sort.SliceStable(tables, func(i, j int) bool { + ti := pickTopLeft(tables[i]) + tj := pickTopLeft(tables[j]) + if ti.Y1 != tj.Y1 { + return ti.Y1 > tj.Y1 + } + return ti.X0 < tj.X0 + }) + + // Drop standalone-cell tables (pdfplumber: `len(t) > 1`). + filtered := tables[:0] + for _, t := range tables { + if len(t) > 1 { + filtered = append(filtered, t) + } + } + return filtered +} + +// pickTopLeft returns the cell that's visually topmost (largest Y1) +// and leftmost on ties (smallest X0). 
+func pickTopLeft(cells []BBox) BBox { + best := cells[0] + for _, c := range cells[1:] { + if c.Y1 > best.Y1 || (c.Y1 == best.Y1 && c.X0 < best.X0) { + best = c + } + } + return best +} + +// assembleTableBox projects a flat list of cells into a 2-D +// row/column grid. The algorithm collects the unique X0 values +// (column lefts) and Y1 values (row tops) across all cells, sorts +// them, and indexes each cell by (row from Y1, column from X0). +// +// Holes in the grid (a cell that should be at row i, col j but +// wasn't detected) remain as zero BBox entries — the caller can +// detect them with IsZero. We don't try to "fill" them by inferring +// boundaries; pdfplumber doesn't either. +func assembleTableBox(cells []BBox) TableBox { + if len(cells) == 0 { + return TableBox{} + } + + // Collect unique row/column anchor positions. Row 0 is visually + // topmost — largest Y1; column 0 is leftmost — smallest X0. + xs := make(map[float64]struct{}) + ys := make(map[float64]struct{}) + for _, c := range cells { + xs[c.X0] = struct{}{} + ys[c.Y1] = struct{}{} + } + xList := make([]float64, 0, len(xs)) + for x := range xs { + xList = append(xList, x) + } + yList := make([]float64, 0, len(ys)) + for y := range ys { + yList = append(yList, y) + } + sort.Float64s(xList) + sort.Slice(yList, func(i, j int) bool { return yList[i] > yList[j] }) + + xIndex := make(map[float64]int, len(xList)) + for i, x := range xList { + xIndex[x] = i + } + yIndex := make(map[float64]int, len(yList)) + for i, y := range yList { + yIndex[y] = i + } + + rows := len(yList) + cols := len(xList) + grid := make([][]BBox, rows) + for i := range grid { + grid[i] = make([]BBox, cols) + } + + // Union bbox. 
+ bbox := cells[0] + for _, c := range cells { + ri, ok1 := yIndex[c.Y1] + ci, ok2 := xIndex[c.X0] + if ok1 && ok2 { + grid[ri][ci] = c + } + bbox = bbox.Union(c) + } + + return TableBox{ + BBox: bbox, + Rows: rows, + Cols: cols, + CellsGrid: grid, + } +} + +// runTableFinder is the geometry-only pipeline: given the page's +// edges (already merged + length-filtered by Page.findTableEdges) and +// the x / y intersection tolerances, build a TableFinder. +// +// This is the seam between page.go (which knows how to enumerate the +// page primitives) and the algorithms in this file. Splitting it out +// keeps the algorithm tests in table_test.go fast: they construct +// edges in-memory and call this function directly without ever +// opening a PDF. +func runTableFinder(edges []layout.Edge, xTol, yTol float64) TableFinder { + intersections := edgesToIntersections(edges, xTol, yTol) + cells := intersectionsToCells(intersections) + groups := cellsToTables(cells) + tables := make([]TableBox, 0, len(groups)) + for _, g := range groups { + tables = append(tables, assembleTableBox(g)) + } + return TableFinder{ + Edges: edges, + Intersections: intersections, + Cells: cells, + Tables: tables, + } +} + +// ensureSupportedStrategies returns an error if either strategy is +// "text" or "explicit" (deferred to Phase 1.3.D, v0.3.0) or unknown. +// Returning a clear ErrUnsupported keeps callers from silently getting +// empty results when they ask for a strategy we don't implement yet.
+func ensureSupportedStrategies(s TableSettings) error { + for _, pair := range []struct { + axis string + strategy TableStrategy + }{ + {"vertical", s.VerticalStrategy}, + {"horizontal", s.HorizontalStrategy}, + } { + switch pair.strategy { + case StrategyLines, StrategyLinesStrict: + // ok + case StrategyText: + return fmt.Errorf("%w: %s_strategy=%q (Phase 1.3.D)", ErrUnsupported, pair.axis, pair.strategy) + case StrategyExplicit: + return fmt.Errorf("%w: %s_strategy=%q (Phase 1.3.D)", ErrUnsupported, pair.axis, pair.strategy) + default: + return fmt.Errorf("%w: unknown %s_strategy %q", ErrUnsupported, pair.axis, pair.strategy) + } + } + return nil +} diff --git a/golden_test.go b/golden_test.go index 2453e81..d6e9dcc 100644 --- a/golden_test.go +++ b/golden_test.go @@ -72,7 +72,10 @@ func TestGoldenAgainstPdfplumber(t *testing.T) { t.Fatalf("read golden dir: %v", err) } - // Find every .expected.json and run a sub-test for each. + // Find every .expected.json (but NOT .tables.expected.json — the + // tables golden files have a different schema and are exercised + // by TestGoldenTablesAgainstPdfplumber below) and run a sub-test + // for each. for _, e := range entries { if e.IsDir() { continue @@ -81,6 +84,9 @@ func TestGoldenAgainstPdfplumber(t *testing.T) { if !strings.HasSuffix(name, ".expected.json") { continue } + if strings.HasSuffix(name, ".tables.expected.json") { + continue + } stem := strings.TrimSuffix(name, ".expected.json") t.Run(stem, func(t *testing.T) { runGoldenCase(t, dir, stem) @@ -88,6 +94,130 @@ func TestGoldenAgainstPdfplumber(t *testing.T) { } } +// goldenTables is the schema written by scripts/gen_golden.py +// (or the inline snippet documented in CHANGELOG / README): one +// entry per page, with the page's `find_tables(...)` output captured +// as one [][]string grid of cell text per table.
+// +// Cells in pdfplumber's output that contain "\n" run their cell text +// across multiple visual lines; we compare after whitespace +// normalisation so the Go output (which assembles cell text via the +// dense extract_text path) is allowed to use either "\n" or " " as +// the intra-cell separator. The token sequence must match exactly. +type goldenTables struct { + Name string `json:"name"` + Pages []goldenTablesPage `json:"pages"` +} + +type goldenTablesPage struct { + Number int `json:"number"` + Width float64 `json:"width"` + Height float64 `json:"height"` + Tables [][][]string `json:"tables"` +} + +// TestGoldenTablesAgainstPdfplumber runs ExtractTables against each +// fixture and asserts the table count, row count, column count, and +// cell text (after whitespace normalisation) matches pdfplumber's +// reference output. +// +// Pdfplumber occasionally produces cells whose text contains "\n" +// (from intra-cell line breaks); our dense extract_text path joins +// chars on the same visual line with spaces and lines with newlines, +// so a 3-line cell in pdfplumber might be "A\nB\nC" while we +// produce "A B C". We normalise both sides by collapsing runs of +// whitespace into a single space and stripping leading/trailing +// space. 
+func TestGoldenTablesAgainstPdfplumber(t *testing.T) { + dir := filepath.Join("testdata", "golden") + entries, err := os.ReadDir(dir) + if err != nil { + t.Fatalf("read golden dir: %v", err) + } + for _, e := range entries { + if e.IsDir() { + continue + } + name := e.Name() + if !strings.HasSuffix(name, ".tables.expected.json") { + continue + } + stem := strings.TrimSuffix(name, ".tables.expected.json") + t.Run(stem, func(t *testing.T) { + runGoldenTablesCase(t, dir, stem) + }) + } +} + +func runGoldenTablesCase(t *testing.T, dir, stem string) { + t.Helper() + pdfPath := filepath.Join(dir, stem+".pdf") + jsonPath := filepath.Join(dir, stem+".tables.expected.json") + + data, err := os.ReadFile(jsonPath) + if err != nil { + t.Fatalf("read %s: %v", jsonPath, err) + } + var g goldenTables + if err := json.Unmarshal(data, &g); err != nil { + t.Fatalf("parse %s: %v", jsonPath, err) + } + + doc, err := pdftable.OpenFile(pdfPath) + if err != nil { + t.Fatalf("OpenFile %s: %v", pdfPath, err) + } + defer doc.Close() + + for _, expPage := range g.Pages { + p, err := doc.Page(expPage.Number) + if err != nil { + t.Fatalf("Page(%d): %v", expPage.Number, err) + } + gotTables, err := p.ExtractTables(pdftable.DefaultTableSettings()) + if err != nil { + t.Fatalf("ExtractTables: %v", err) + } + if len(gotTables) != len(expPage.Tables) { + t.Fatalf("page %d: got %d tables, want %d", + expPage.Number, len(gotTables), len(expPage.Tables)) + } + for ti := range expPage.Tables { + wantTable := expPage.Tables[ti] + gotTable := gotTables[ti] + if len(gotTable.Rows) != len(wantTable) { + t.Errorf("page %d table %d: rows got %d, want %d", + expPage.Number, ti, len(gotTable.Rows), len(wantTable)) + continue + } + for ri := range wantTable { + if len(gotTable.Rows[ri]) != len(wantTable[ri]) { + t.Errorf("page %d table %d row %d: cols got %d, want %d", + expPage.Number, ti, ri, len(gotTable.Rows[ri]), len(wantTable[ri])) + continue + } + for ci := range wantTable[ri] { + got := 
normaliseCellText(gotTable.Rows[ri][ci]) + want := normaliseCellText(wantTable[ri][ci]) + if got != want { + t.Errorf("page %d table %d cell [%d][%d]: got %q, want %q", + expPage.Number, ti, ri, ci, got, want) + } + } + } + } + } +} + +// normaliseCellText collapses runs of whitespace into single spaces +// and strips leading / trailing whitespace. Used to compare +// pdftable's cell output against pdfplumber's: both sides should +// agree on the token sequence even if the intra-cell line break +// convention differs. +func normaliseCellText(s string) string { + return strings.Join(strings.Fields(s), " ") +} + func runGoldenCase(t *testing.T, dir, stem string) { t.Helper() diff --git a/internal/layout/lines.go b/internal/layout/lines.go new file mode 100644 index 0000000..3a94e85 --- /dev/null +++ b/internal/layout/lines.go @@ -0,0 +1,557 @@ +// Copyright (c) 2026 Halleluyah Oludele +// Licensed under the MIT License. + +// Package layout owns the lower-level geometry primitives that drive +// table-finding: edges, edge-derivation from Lines/Rects/Curves, and +// edge merging (snap + join). +// +// The split between this internal package and the public pdftable +// package mirrors the pdfplumber split between +// pdfplumber/utils/geometry.py (edge maths) and pdfplumber/table.py +// (the TableFinder). Keeping the edge maths here lets us evolve the +// representation freely while the public surface in pdftable's +// table.go / finder.go stays stable. +// +// Coordinate system: PDF user space — origin at bottom-left, Y growing +// UP. An edge is a single-axis line segment: +// +// - Horizontal edge: Y0 == Y1, X0 <= X1. Orientation "h". +// - Vertical edge: X0 == X1, Y0 <= Y1. Orientation "v". +// +// (Pdfplumber operates in IMAGE space, Y growing down; its "top" is +// our larger Y, its "bottom" is our smaller Y. The algorithms here +// are the same, only the coordinate sign is flipped — see comments at +// each step for the explicit mapping.) 
+package layout + +import ( + "math" + "sort" +) + +// Orientation is the axis an edge lies along. "h" = horizontal, +// "v" = vertical. Diagonal lines never become edges; they're dropped +// at derivation time. +type Orientation string + +const ( + Horizontal Orientation = "h" + Vertical Orientation = "v" +) + +// Source tags say which kind of drawn primitive produced an edge. +// pdfplumber distinguishes between "line", "rect_edge", and +// "curve_edge" so that lines_strict mode can ignore everything that +// isn't a literal stroked line. We carry the same distinction. +type Source uint8 + +const ( + // SourceLine: an edge derived from a stroked Line. + SourceLine Source = iota + // SourceRect: an edge derived from one side of a Rect. + SourceRect + // SourceCurve: an edge derived from a Curve's straight segment. + // We accept curve edges in "lines" mode but not "lines_strict". + SourceCurve + // SourceExplicit: an edge constructed from an + // ExplicitVerticalLines / ExplicitHorizontalLines setting. + SourceExplicit +) + +// Edge is one axis-aligned line segment carrying the data the table- +// finder needs for snap + join + intersection. +// +// Invariants (enforced by normalise, which every From* constructor calls): +// +// - Orientation == "h" → Y0 == Y1, X0 <= X1. +// - Orientation == "v" → X0 == X1, Y0 <= Y1. +// +// Length() returns the along-direction extent; Pos() returns the +// perpendicular position (the "snap axis" for merging). +type Edge struct { + X0, Y0, X1, Y1 float64 + Orientation Orientation + Source Source + // Width is the stroke width of the originating primitive (used by + // some callers to filter hair-thin construction lines). Zero is + // "unknown". + Width float64 +} + +// Length returns the edge's extent along its orientation axis. +// +// - Horizontal edge → X1 - X0. +// - Vertical edge → Y1 - Y0. 
+func (e Edge) Length() float64 { + if e.Orientation == Horizontal { + return e.X1 - e.X0 + } + return e.Y1 - e.Y0 +} + +// Pos returns the perpendicular position of the edge — the coordinate +// that's constant along its length. We use this as the snap-axis key +// for grouping edges that lie on the same infinite line. +// +// - Horizontal edge → Y0 (which equals Y1). +// - Vertical edge → X0 (which equals X1). +func (e Edge) Pos() float64 { + if e.Orientation == Horizontal { + return e.Y0 + } + return e.X0 +} + +// normalise enforces the invariants. The FromLine / FromRect / +// FromCurve helpers always return a normalised result; code that +// builds an Edge literal directly must uphold the invariants itself. +func (e Edge) normalise() Edge { + if e.Orientation == Horizontal { + // Ensure Y0 == Y1 (use the average if a caller passed a near- + // horizontal edge; the table-finding code only ever sees fully + // horizontal edges so this branch is a guardrail). + y := (e.Y0 + e.Y1) / 2 + e.Y0 = y + e.Y1 = y + if e.X1 < e.X0 { + e.X0, e.X1 = e.X1, e.X0 + } + return e + } + // Vertical. + x := (e.X0 + e.X1) / 2 + e.X0 = x + e.X1 = x + if e.Y1 < e.Y0 { + e.Y0, e.Y1 = e.Y1, e.Y0 + } + return e +} + +// LineSegment is a minimal struct describing a drawn straight-line +// segment. It exists so this package doesn't have to import the +// public pdftable types — keeps the dependency direction one-way +// (pdftable depends on layout, not the other way round). +type LineSegment struct { + X0, Y0, X1, Y1 float64 + Width float64 +} + +// RectSegment is a minimal Rect descriptor for the same reason. Only +// the bbox matters for edge derivation, so the Stroke/Fill flags are +// not carried here; FromRect derives edges from stroked and +// filled-only rects alike. +type RectSegment struct { + X0, Y0, X1, Y1 float64 + Width float64 +} + +// CurveSegment is the point list of a curve path. 
We turn each +// horizontal or vertical pair of consecutive points into an edge — +// curves that are entirely diagonal contribute zero edges. +type CurveSegment struct { + Points [][2]float64 + Width float64 +} + +// FromLine returns one Edge for a horizontal or vertical line +// segment, or (zero, false) for diagonal lines (they aren't axis- +// aligned and so can't be table rules). +// +// Tolerance is the same near-axis-aligned slack pdfplumber's +// line_to_edge predicate uses implicitly: a "horizontal" line is one +// whose Y0 and Y1 are equal. We treat them as equal when their +// difference is below the supplied tolerance to absorb floating- +// point drift from the content-stream interpreter. +func FromLine(l LineSegment, tolerance float64) (Edge, bool) { + dx := math.Abs(l.X1 - l.X0) + dy := math.Abs(l.Y1 - l.Y0) + switch { + case dy <= tolerance && dx > tolerance: + // Horizontal. + e := Edge{ + X0: l.X0, X1: l.X1, + Y0: l.Y0, Y1: l.Y1, + Orientation: Horizontal, + Source: SourceLine, + Width: l.Width, + } + return e.normalise(), true + case dx <= tolerance && dy > tolerance: + // Vertical. + e := Edge{ + X0: l.X0, X1: l.X1, + Y0: l.Y0, Y1: l.Y1, + Orientation: Vertical, + Source: SourceLine, + Width: l.Width, + } + return e.normalise(), true + } + return Edge{}, false +} + +// FromRect returns the four edges of a rectangle's outline. We mirror +// pdfplumber's rect_to_edges: top, bottom, left, right — all tagged +// SourceRect. Filled-only (non-stroked) rectangles still produce +// edges because pdfplumber's edges property aggregates BOTH stroked +// and filled rects (a filled cell-background still defines a row +// boundary). 
+func FromRect(r RectSegment) []Edge { + bottom := Edge{ + X0: r.X0, X1: r.X1, + Y0: r.Y0, Y1: r.Y0, + Orientation: Horizontal, + Source: SourceRect, + Width: r.Width, + } + top := Edge{ + X0: r.X0, X1: r.X1, + Y0: r.Y1, Y1: r.Y1, + Orientation: Horizontal, + Source: SourceRect, + Width: r.Width, + } + left := Edge{ + X0: r.X0, X1: r.X0, + Y0: r.Y0, Y1: r.Y1, + Orientation: Vertical, + Source: SourceRect, + Width: r.Width, + } + right := Edge{ + X0: r.X1, X1: r.X1, + Y0: r.Y0, Y1: r.Y1, + Orientation: Vertical, + Source: SourceRect, + Width: r.Width, + } + return []Edge{ + top.normalise(), + bottom.normalise(), + left.normalise(), + right.normalise(), + } +} + +// FromCurve returns one edge per consecutive pair of points that lies +// on the same axis. Diagonal pairs are dropped. pdfplumber does the +// same in curve_to_edges. +func FromCurve(c CurveSegment, tolerance float64) []Edge { + if len(c.Points) < 2 { + return nil + } + out := make([]Edge, 0, len(c.Points)-1) + for i := 1; i < len(c.Points); i++ { + p0 := c.Points[i-1] + p1 := c.Points[i] + dx := math.Abs(p1[0] - p0[0]) + dy := math.Abs(p1[1] - p0[1]) + switch { + case dy <= tolerance && dx > tolerance: + e := Edge{ + X0: p0[0], X1: p1[0], + Y0: p0[1], Y1: p1[1], + Orientation: Horizontal, + Source: SourceCurve, + Width: c.Width, + } + out = append(out, e.normalise()) + case dx <= tolerance && dy > tolerance: + e := Edge{ + X0: p0[0], X1: p1[0], + Y0: p0[1], Y1: p1[1], + Orientation: Vertical, + Source: SourceCurve, + Width: c.Width, + } + out = append(out, e.normalise()) + } + } + return out +} + +// SnapEdges replaces near-collinear edges with edges sharing the +// average perpendicular position. Horizontal edges within +// snapYTolerance of each other on the Y axis get unified onto their +// mean Y; vertical edges within snapXTolerance of each other on the +// X axis get unified onto their mean X. 
+// +// This is the Go port of pdfplumber's table.snap_edges, which +// dispatches into utils.snap_objects per orientation. +// +// A tolerance of 0 leaves the edges unchanged for that orientation. +func SnapEdges(edges []Edge, snapXTolerance, snapYTolerance float64) []Edge { + if len(edges) == 0 { + return nil + } + // Partition. + var vEdges, hEdges []Edge + for _, e := range edges { + if e.Orientation == Horizontal { + hEdges = append(hEdges, e) + } else { + vEdges = append(vEdges, e) + } + } + + snappedV := snapAlongAxis(vEdges, snapXTolerance, true) + snappedH := snapAlongAxis(hEdges, snapYTolerance, false) + + out := make([]Edge, 0, len(snappedV)+len(snappedH)) + out = append(out, snappedV...) + out = append(out, snappedH...) + return out +} + +// snapAlongAxis groups edges by their perpendicular position +// (vertical=true → group by X0; vertical=false → group by Y0), forms +// clusters where consecutive sorted positions differ by <= tolerance, +// and replaces each edge's position with the cluster mean. +// +// Tolerance <= 0 returns edges unchanged. +func snapAlongAxis(edges []Edge, tolerance float64, vertical bool) []Edge { + if len(edges) == 0 || tolerance <= 0 { + return edges + } + // Indices sorted by perpendicular position. + type item struct { + idx int + pos float64 + } + items := make([]item, len(edges)) + for i, e := range edges { + items[i] = item{idx: i, pos: e.Pos()} + } + sort.SliceStable(items, func(i, j int) bool { + return items[i].pos < items[j].pos + }) + + // Cluster: greedy single-pass over the sorted positions, opening + // a new cluster every time the gap to the previous position + // exceeds tolerance. Pdfplumber's cluster_objects uses the same + // single-pass with cluster_list semantics — adjacent items within + // tolerance form one cluster. 
+ clusters := make([][]int, 0) + current := []int{items[0].idx} + last := items[0].pos + for _, it := range items[1:] { + if it.pos-last <= tolerance { + current = append(current, it.idx) + } else { + clusters = append(clusters, current) + current = []int{it.idx} + } + last = it.pos + } + clusters = append(clusters, current) + + // Per-cluster mean and snap. + out := make([]Edge, len(edges)) + copy(out, edges) + for _, c := range clusters { + var sum float64 + for _, i := range c { + sum += edges[i].Pos() + } + mean := sum / float64(len(c)) + for _, i := range c { + if vertical { + out[i].X0 = mean + out[i].X1 = mean + } else { + out[i].Y0 = mean + out[i].Y1 = mean + } + } + } + return out +} + +// JoinEdges merges collinear edges whose along-direction extents +// touch (within joinTolerance). Two horizontal edges with the same +// Y and overlapping or near-touching X ranges become one edge +// spanning their union; similarly for vertical edges. +// +// Edges that don't overlap and aren't within the join tolerance pass +// through unchanged. +// +// This is the Go port of pdfplumber's table.join_edge_group, called +// once per (orientation, perpendicular-position) group. +func JoinEdges(edges []Edge, joinXTolerance, joinYTolerance float64) []Edge { + if len(edges) == 0 { + return nil + } + // Group by (orientation, perpendicular position). Pdfplumber does + // itertools.groupby over a sorted iterable; we do the same. + type groupKey struct { + o Orientation + p float64 + } + sorted := make([]Edge, len(edges)) + copy(sorted, edges) + sort.SliceStable(sorted, func(i, j int) bool { + gi := groupKey{o: sorted[i].Orientation, p: sorted[i].Pos()} + gj := groupKey{o: sorted[j].Orientation, p: sorted[j].Pos()} + if gi.o != gj.o { + return gi.o < gj.o + } + if gi.p != gj.p { + return gi.p < gj.p + } + // Within a group sort by along-direction start, so the join + // scan can walk the line left-to-right or bottom-to-top. 
+ if sorted[i].Orientation == Horizontal { + return sorted[i].X0 < sorted[j].X0 + } + return sorted[i].Y0 < sorted[j].Y0 + }) + + var out []Edge + i := 0 + for i < len(sorted) { + j := i + 1 + k := groupKey{o: sorted[i].Orientation, p: sorted[i].Pos()} + for j < len(sorted) { + kj := groupKey{o: sorted[j].Orientation, p: sorted[j].Pos()} + if kj != k { + break + } + j++ + } + group := sorted[i:j] + tol := joinXTolerance + if k.o == Vertical { + tol = joinYTolerance + } + out = append(out, joinGroup(group, tol)...) + i = j + } + return out +} + +// joinGroup expects all edges in `group` to share the same +// orientation AND perpendicular position, sorted by along-direction +// start. It walks them in order, merging each edge into its +// predecessor when the gap between them is <= tolerance. +func joinGroup(group []Edge, tolerance float64) []Edge { + if len(group) == 0 { + return nil + } + joined := []Edge{group[0]} + if group[0].Orientation == Horizontal { + for _, e := range group[1:] { + last := &joined[len(joined)-1] + if e.X0 <= last.X1+tolerance { + if e.X1 > last.X1 { + last.X1 = e.X1 + } + } else { + joined = append(joined, e) + } + } + } else { + for _, e := range group[1:] { + last := &joined[len(joined)-1] + if e.Y0 <= last.Y1+tolerance { + if e.Y1 > last.Y1 { + last.Y1 = e.Y1 + } + } else { + joined = append(joined, e) + } + } + } + return joined +} + +// MergeEdges is the snap-then-join pipeline. page.go's findTableEdges +// calls it; it mirrors pdfplumber's table.merge_edges. +// +// Order: +// 1. Snap (collapse near-collinear edges onto their mean position). +// 2. Group by (orientation, position) and join within each group. 
+func MergeEdges(edges []Edge, snapXTol, snapYTol, joinXTol, joinYTol float64) []Edge { + if len(edges) == 0 { + return nil + } + snapped := edges + if snapXTol > 0 || snapYTol > 0 { + snapped = SnapEdges(edges, snapXTol, snapYTol) + } + return JoinEdges(snapped, joinXTol, joinYTol) +} + +// FilterEdgesByLength drops edges whose along-direction length is +// below minLength. Pdfplumber calls this twice in the TableFinder +// pipeline: once with edge_min_length_prefilter on the raw edges +// (default 1 — drop hairline construction lines) and once with +// edge_min_length on the merged set (default 3 — drop short stubs +// after snap+join). +func FilterEdgesByLength(edges []Edge, minLength float64) []Edge { + if minLength <= 0 { + return edges + } + out := edges[:0:0] + for _, e := range edges { + if e.Length() >= minLength { + out = append(out, e) + } + } + return out +} + +// FilterEdgesBySource keeps only edges produced by an allowed Source. +// Used by lines_strict mode to drop rect and curve edges. +func FilterEdgesBySource(edges []Edge, allow ...Source) []Edge { + if len(allow) == 0 { + return edges + } + allowed := make(map[Source]struct{}, len(allow)) + for _, s := range allow { + allowed[s] = struct{}{} + } + out := edges[:0:0] + for _, e := range edges { + if _, ok := allowed[e.Source]; ok { + out = append(out, e) + } + } + return out +} + +// FilterEdgesByOrientation keeps only edges with the given +// orientation. Convenience over a one-line loop, used by the +// TableFinder when separating the v / h edge lists post-merge for +// intersection detection. +func FilterEdgesByOrientation(edges []Edge, o Orientation) []Edge { + out := edges[:0:0] + for _, e := range edges { + if e.Orientation == o { + out = append(out, e) + } + } + return out +} + +// SortEdges returns a stable-sorted copy of edges keyed by +// (orientation, perpendicular position, along-direction start). 
This +// is the deterministic order downstream stages rely on for +// intersection enumeration. +func SortEdges(edges []Edge) []Edge { + out := make([]Edge, len(edges)) + copy(out, edges) + sort.SliceStable(out, func(i, j int) bool { + if out[i].Orientation != out[j].Orientation { + return out[i].Orientation < out[j].Orientation + } + if pa, pb := out[i].Pos(), out[j].Pos(); pa != pb { + return pa < pb + } + if out[i].Orientation == Horizontal { + return out[i].X0 < out[j].X0 + } + return out[i].Y0 < out[j].Y0 + }) + return out +} diff --git a/page.go b/page.go index 996e7db..48e0225 100644 --- a/page.go +++ b/page.go @@ -7,6 +7,7 @@ import ( "fmt" "strings" + "github.com/hallelx2/pdftable/internal/layout" "github.com/hallelx2/pdftable/internal/pdf" ) @@ -92,6 +93,28 @@ type Page interface { // when ExtractText's word-grouping heuristics produce undesired // results on adversarial input. ExtractTextSimple(xTolerance, yTolerance float64) (string, error) + + // FindTables runs the geometry-only stage of the table-finding + // pipeline: derive edges from the page primitives, snap+join + // into rulers, scan for intersections, assemble cells, group + // cells into tables. Returns one TableFinder per detected + // table-group so callers building debugging tools can inspect + // the intermediate stages (edges / intersections / raw cells) + // alongside the assembled per-table CellsGrid. + // + // v0.2.0 supports VerticalStrategy / HorizontalStrategy values + // of "lines" and "lines_strict". Passing "text" or "explicit" + // returns ErrUnsupported — those strategies land in Phase 1.3.D + // (v0.3.0). + FindTables(settings TableSettings) ([]TableFinder, error) + + // ExtractTables wraps FindTables and runs per-cell text + // extraction on every detected table. Cells with no chars + // produce an empty string. Leading and trailing whitespace + // inside each cell is stripped. 
Returns the slice of fully + // populated Table structs in visual top-to-bottom-left-to-right + // order. + ExtractTables(settings TableSettings) ([]*Table, error) } // page is the unexported implementation backing the Page interface. @@ -361,3 +384,254 @@ func charsJoinedText(chars []Char) string { } return sb.String() } + +// findTableEdges derives the input-edge slice for the table-finding +// pipeline from the page's primitives. The pipeline: +// +// 1. Walk the page once via Objects(), so we pay the content-stream +// parse cost a single time rather than once per primitive type. +// 2. Convert every Line / Rect / Curve into one or more layout.Edge +// instances. Lines produce 0 or 1 edge; Rects produce 4; Curves +// produce one per axis-aligned segment. +// 3. For lines_strict, drop SourceRect and SourceCurve edges before +// the prefilter. +// 4. Apply the prefilter (drop edges shorter than +// EdgeMinLengthPrefilter — pdfplumber default 1 pt). +// 5. Merge (snap onto cluster means, then join collinear edges +// within JoinTolerance). +// 6. Apply the post-merge length filter (drop edges shorter than +// EdgeMinLength — pdfplumber default 3 pt). +// +// The returned slice is the input both the vertical and horizontal +// stages share — pdfplumber's TableFinder.get_edges takes the union +// of vertical-strategy edges and horizontal-strategy edges and runs +// the merge across both at once. We do the same, but with one +// wrinkle: if the two strategies differ ("lines" + "lines_strict"), +// we apply the source-filter PER ORIENTATION so a strict horizontal +// strategy can still benefit from rect-derived vertical edges and +// vice versa. The pdfplumber implementation handles this implicitly +// because its filter_edges receives the requested orientation as an +// argument; our code mirrors that branch explicitly. 
+func (p *page) findTableEdges(s TableSettings) ([]layout.Edge, error) { + objs, err := p.Objects() + if err != nil { + return nil, err + } + + tol := 0.1 // near-axis-aligned slack for FromLine/FromCurve + rawEdges := make([]layout.Edge, 0, len(objs.Lines)+4*len(objs.Rects)+2*len(objs.Curves)) + + for _, l := range objs.Lines { + if e, ok := layout.FromLine(layout.LineSegment{ + X0: l.X0, Y0: l.Y0, X1: l.X1, Y1: l.Y1, Width: l.Width, + }, tol); ok { + rawEdges = append(rawEdges, e) + } + } + for _, r := range objs.Rects { + rawEdges = append(rawEdges, layout.FromRect(layout.RectSegment{ + X0: r.X0, Y0: r.Y0, X1: r.X1, Y1: r.Y1, Width: r.Width, + })...) + } + for _, c := range objs.Curves { + rawEdges = append(rawEdges, layout.FromCurve(layout.CurveSegment{ + Points: c.Points, Width: c.Width, + }, tol)...) + } + + // Per-orientation source filter: lines_strict on an axis drops + // non-line sources on that axis. We split into v/h, filter each + // according to its own strategy, then recombine before the + // length filter and merge. + vEdges := layout.FilterEdgesByOrientation(rawEdges, layout.Vertical) + hEdges := layout.FilterEdgesByOrientation(rawEdges, layout.Horizontal) + + if s.VerticalStrategy == StrategyLinesStrict { + vEdges = layout.FilterEdgesBySource(vEdges, layout.SourceLine, layout.SourceExplicit) + } + if s.HorizontalStrategy == StrategyLinesStrict { + hEdges = layout.FilterEdgesBySource(hEdges, layout.SourceLine, layout.SourceExplicit) + } + + // Explicit overrides are added on top of the derived edges. + // pdfplumber accepts these even with the lines / lines_strict + // strategies (the "explicit" strategy itself replaces the + // derived edges; that strategy is deferred to v0.3.0). 
+ for _, x := range s.ExplicitVerticalLines { + vEdges = append(vEdges, layout.Edge{ + X0: x, X1: x, + Y0: 0, Y1: p.Height(), + Orientation: layout.Vertical, + Source: layout.SourceExplicit, + }) + } + for _, y := range s.ExplicitHorizontalLines { + hEdges = append(hEdges, layout.Edge{ + X0: 0, X1: p.Width(), + Y0: y, Y1: y, + Orientation: layout.Horizontal, + Source: layout.SourceExplicit, + }) + } + + combined := make([]layout.Edge, 0, len(vEdges)+len(hEdges)) + combined = append(combined, vEdges...) + combined = append(combined, hEdges...) + + // Prefilter (drop hairline construction segments). pdfplumber + // applies edge_min_length_prefilter BEFORE merging. + combined = layout.FilterEdgesByLength(combined, s.EdgeMinLengthPrefilter) + + // Merge: snap onto cluster means, then join collinear segments + // that touch within JoinTolerance. + merged := layout.MergeEdges(combined, s.SnapTolerance, s.SnapTolerance, s.JoinTolerance, s.JoinTolerance) + + // Post-merge length filter (drop short stubs). + merged = layout.FilterEdgesByLength(merged, s.EdgeMinLength) + + return merged, nil +} + +// FindTables runs the geometry-only pipeline (edges → intersections +// → cells → tables) and returns one TableFinder per detected table +// group. Strategies "text" and "explicit" return ErrUnsupported. +// +// The returned slice is in visual top-to-bottom-left-to-right order +// (sorted by the topmost-leftmost cell of each table). 
+func (p *page) FindTables(settings TableSettings) ([]TableFinder, error) { + s := settings.applyDefaults() + if err := ensureSupportedStrategies(s); err != nil { + return nil, err + } + + edges, err := p.findTableEdges(s) + if err != nil { + return nil, err + } + if len(edges) == 0 { + return nil, nil + } + + intersections := edgesToIntersections(edges, s.IntersectionTolerance, s.IntersectionTolerance) + cells := intersectionsToCells(intersections) + groups := cellsToTables(cells) + + out := make([]TableFinder, 0, len(groups)) + for _, g := range groups { + out = append(out, TableFinder{ + Edges: edges, + Intersections: intersections, + Cells: cells, + Tables: []TableBox{assembleTableBox(g)}, + }) + } + return out, nil +} + +// ExtractTables runs FindTables then fills each cell with the text +// of the chars whose centre lies inside the cell's bbox. Empty cells +// produce "". Leading/trailing whitespace is stripped per cell. +// +// The char-in-cell test matches pdfplumber's Table.extract: +// +// v_mid = (char.top + char.bottom) / 2 +// h_mid = (char.x0 + char.x1) / 2 +// (h_mid >= x0) AND (h_mid < x1) AND (v_mid >= top) AND (v_mid < bottom) +// +// Translating from pdfplumber's image space to PDF user space: "top" +// becomes the larger Y, "bottom" becomes the smaller Y; the inclusive +// / exclusive convention on the right and bottom edges stays the +// same so a glyph whose centre sits EXACTLY on a column boundary +// lands in the right column (x0 is inclusive, x1 exclusive). 
+func (p *page) ExtractTables(settings TableSettings) ([]*Table, error) { + s := settings.applyDefaults() + if err := ensureSupportedStrategies(s); err != nil { + return nil, err + } + + finders, err := p.FindTables(s) + if err != nil { + return nil, err + } + if len(finders) == 0 { + return nil, nil + } + + chars, err := p.Chars() + if err != nil { + return nil, err + } + + tables := make([]*Table, 0, len(finders)) + for _, f := range finders { + for _, tb := range f.Tables { + t := assembleTableText(tb, chars, s, p.number) + tables = append(tables, t) + } + } + return tables, nil +} + +// assembleTableText fills the per-cell text grid for one TableBox by +// (a) filtering the page's chars by centre-in-cell-bbox, (b) running +// the existing word extractor + dense text join over the cell's +// chars, and (c) trimming whitespace. +func assembleTableText(tb TableBox, chars []Char, s TableSettings, pageNumber int) *Table { + rows := make([][]string, tb.Rows) + for i := range rows { + rows[i] = make([]string, tb.Cols) + } + + textOpts := DefaultTextOpts() + textOpts.XTolerance = s.TextTolerance + textOpts.YTolerance = s.TextTolerance + + wordOpts := DefaultWordOpts() + wordOpts.XTolerance = s.TextTolerance + wordOpts.YTolerance = s.TextTolerance + wordOpts.KeepBlankChars = s.KeepBlankChars + + for ri, row := range tb.CellsGrid { + for ci, cell := range row { + if cell.IsZero() { + rows[ri][ci] = "" + continue + } + cellChars := charsInCell(chars, cell) + if len(cellChars) == 0 { + rows[ri][ci] = "" + continue + } + text := extractTextFromChars(cellChars, textOpts) + rows[ri][ci] = strings.TrimSpace(text) + } + } + + return &Table{ + Rows: rows, + BBox: tb.BBox, + Page: pageNumber, + CellsBBox: tb.CellsGrid, + } +} + +// charsInCell returns the chars whose centre point lies inside the +// supplied bbox. 
Mirrors pdfplumber's char_in_bbox predicate, which +// is the right thing for cell extraction: a glyph that straddles two +// cells gets assigned to the cell containing its visual centre, not +// to both. +// +// Inclusive on x0 / y0 (bottom-left), exclusive on x1 / y1 +// (top-right). The half-open interval makes the boundary +// deterministic: a glyph whose centre sits exactly on a column +// boundary goes into the right-hand column, and one exactly on a row +// boundary goes into the upper row. +func charsInCell(chars []Char, cell BBox) []Char { + out := make([]Char, 0, len(chars)) + for _, c := range chars { + hMid := (c.X0 + c.X1) / 2 + vMid := (c.Y0 + c.Y1) / 2 + if hMid >= cell.X0 && hMid < cell.X1 && vMid >= cell.Y0 && vMid < cell.Y1 { + out = append(out, c) + } + } + return out +} diff --git a/scripts/gen_golden.py b/scripts/gen_golden.py index 3af3366..892b379 100644 --- a/scripts/gen_golden.py +++ b/scripts/gen_golden.py @@ -6,15 +6,19 @@ pip install pdfplumber python scripts/gen_golden.py -The script reads every *.pdf in testdata/golden/, runs pdfplumber's -extract_text() and extract_words() on each page, and writes the result -as .expected.json next to the PDF. +The script reads every *.pdf in testdata/golden/ and writes TWO +sibling JSON files: -Coordinate-system note: pdfplumber emits word "top" and "bottom" in -image space (origin at top-left, Y growing DOWN). pdftable uses PDF -user space (origin at bottom-left, Y growing UP). We translate -pdfplumber's coords into PDF-user-space here so the JSON matches the -y0/y1 fields on pdftable.Word directly. + * .expected.json — words + extract_text (v0.1.0 tests). + * .tables.expected.json — find_tables/extract output for the + `lines` strategy (v0.2.0 tests). + +Word goldens use image-space "top" / "bottom" translated into +PDF-user-space y0 / y1 so they match pdftable.Word fields directly. 
+Table goldens are the raw [[[str]]] output of pdfplumber's +Table.extract() — pdftable's parity test normalises whitespace before +comparing, so intra-cell line breaks ("A\\nB") match space-separated +output ("A B") and vice versa. To regenerate after upgrading pdfplumber, simply re-run this script. The file outputs are deterministic and stable. @@ -83,6 +87,30 @@ def main() -> int: json.dump(out, f, ensure_ascii=False, indent=2) nwords = sum(len(pp["extract_words"]) for pp in out["pages"]) print(f"wrote {expected}: {len(out['pages'])} pages, {nwords} words") + + # Also emit the tables-format golden so the v0.2.0 parity + # test (TestGoldenTablesAgainstPdfplumber) has reference + # output. Empty `pages[].tables` entries are fine — the test + # just compares cell counts and contents, and a page with + # zero detected tables produces an empty list on both sides. + tables_out = {"name": name, "pages": []} + with pdfplumber.open(pdf_path) as pdf: + for p in pdf.pages: + tbls = p.find_tables( + {"vertical_strategy": "lines", "horizontal_strategy": "lines"} + ) + page_obj = { + "number": p.page_number, + "width": p.width, + "height": p.height, + "tables": [t.extract() for t in tbls], + } + tables_out["pages"].append(page_obj) + tables_expected = os.path.join(target, f"{name}.tables.expected.json") + with open(tables_expected, "w", encoding="utf-8") as f: + json.dump(tables_out, f, ensure_ascii=False, indent=2) + ntables = sum(len(pp["tables"]) for pp in tables_out["pages"]) + print(f"wrote {tables_expected}: {ntables} tables across all pages") return 0 diff --git a/scripts/gen_table_fixture.go b/scripts/gen_table_fixture.go new file mode 100644 index 0000000..042e7f6 --- /dev/null +++ b/scripts/gen_table_fixture.go @@ -0,0 +1,55 @@ +// Copyright (c) 2026 Halleluyah Oludele +// Licensed under the MIT License. + +//go:build ignore + +// gen_table_fixture.go writes out the binary fixture PDFs used by the +// table-finding parity tests. 
The PDFs are generated by the +// testdata.TableRuled() helper (and any future ones we add) so that +// the source-of-truth for what's in the file is Go code, not a +// binary blob in git. +// +// Usage from the repo root: +// +// go run ./scripts/gen_table_fixture.go +// +// The script is build-tag-gated so it doesn't get picked up by +// `go build ./...` and `go test ./...`. It writes: +// +// testdata/table-2x3-ruled.pdf — 2-col x 3-row ruled table +// +// Re-run after editing testdata/fixtures.go's TableRuled function. +// The resulting PDFs should also be checked into git so callers +// without a Go toolchain can still run the tests. +package main + +import ( + "fmt" + "os" + "path/filepath" + + "github.com/hallelx2/pdftable/testdata" +) + +func main() { + outputs := []struct { + path string + data []byte + }{ + { + path: filepath.Join("testdata", "table-2x3-ruled.pdf"), + data: testdata.TableRuled(), + }, + } + for _, o := range outputs { + if err := os.MkdirAll(filepath.Dir(o.path), 0o755); err != nil { + fmt.Fprintf(os.Stderr, "mkdir %s: %v\n", filepath.Dir(o.path), err) + os.Exit(1) + } + if err := os.WriteFile(o.path, o.data, 0o644); err != nil { + fmt.Fprintf(os.Stderr, "write %s: %v\n", o.path, err) + os.Exit(1) + } + fmt.Printf("wrote %s (%d bytes)\n", o.path, len(o.data)) + } +} diff --git a/table.go b/table.go new file mode 100644 index 0000000..359a915 --- /dev/null +++ b/table.go @@ -0,0 +1,233 @@ +// Copyright (c) 2026 Halleluyah Oludele +// Licensed under the MIT License. + +package pdftable + +// This file defines the public types of the table-finding pipeline: +// TableSettings (with pdfplumber-matching defaults), Table (the +// extracted result), and TableFinder (the intermediate object that +// exposes edges, intersections, and cell bboxes without running text +// extraction). +// +// Algorithm and field names are direct ports of pdfplumber's +// TableSettings / TableFinder / Table from pdfplumber/table.py. 
The +// public surface differs in two ways: +// +// - Field names follow Go conventions (CamelCase, exported) rather +// than pdfplumber's snake_case dict keys. +// - Coordinates are PDF user space (origin at bottom-left, Y growing +// up). pdfplumber emits image-space coordinates ("top" / "bottom" +// with Y growing down); we use Y0/Y1 throughout. The intersection +// geometry is invariant under that flip; only the comments +// describing "above" / "below" swap their meaning. + +// TableStrategy is the enum of edge-derivation strategies. Each axis +// (vertical, horizontal) picks one. v0.2.0 implements "lines" and +// "lines_strict"; "text" and "explicit" are reserved for the next +// release (Phase 1.3.D) and return ErrUnsupported if requested. +type TableStrategy string + +const ( + // StrategyLines derives edges from drawn Lines, Rects (all four + // sides), and Curves whose segments lie on an axis. Snap and join + // tolerances are at their defaults — looser than lines_strict so + // hand-drawn or jittery rules still merge. + StrategyLines TableStrategy = "lines" + + // StrategyLinesStrict derives edges ONLY from drawn Lines. + // Rectangle outlines and curve segments are ignored, even if they + // look like a table grid. Use this when your PDF draws cell + // backgrounds as filled rects that you do NOT want treated as row + // boundaries. + StrategyLinesStrict TableStrategy = "lines_strict" + + // StrategyText (Phase 1.3.D) infers edges from word alignment. + StrategyText TableStrategy = "text" + + // StrategyExplicit (Phase 1.3.D) uses caller-supplied lines. + StrategyExplicit TableStrategy = "explicit" +) + +// TableSettings controls table finding. Construct via +// DefaultTableSettings() and override the fields you need. A literal +// with unset fields also works: FindTables and ExtractTables replace +// zero-valued tolerances and empty strategies with the defaults via +// applyDefaults, so an explicit zero tolerance cannot be requested. +// +// Field naming and defaults are 1:1 with pdfplumber's TableSettings +// dataclass (see pdfplumber/table.py:486-555). 
Where pdfplumber +// supports independent x/y tolerances via *_x_tolerance / *_y_tolerance +// fallbacks, we expose the shared field directly; explicit per-axis +// overrides can be added later if a real-world need surfaces. +type TableSettings struct { + // VerticalStrategy picks the source of vertical edges. + // Default: StrategyLines. + VerticalStrategy TableStrategy + + // HorizontalStrategy picks the source of horizontal edges. + // Default: StrategyLines. + HorizontalStrategy TableStrategy + + // SnapTolerance is the perpendicular-axis tolerance for clustering + // near-collinear edges before joining (PDF points). Default: 3. + SnapTolerance float64 + + // JoinTolerance is the along-direction gap that still gets merged + // during the join pass (PDF points). Default: 3. + JoinTolerance float64 + + // EdgeMinLength drops merged edges shorter than this (PDF points). + // Default: 3. + EdgeMinLength float64 + + // EdgeMinLengthPrefilter drops raw edges before merging + // (PDF points). Default: 1 — kills hairline construction + // segments that snap+join shouldn't pull together. + EdgeMinLengthPrefilter float64 + + // IntersectionTolerance is the slack used when testing whether a + // vertical edge crosses a horizontal edge — accounts for tiny + // gaps between the end of a stroked line and the start of the + // next (PDF points). Default: 3. + IntersectionTolerance float64 + + // TextTolerance is forwarded to the per-cell text-extraction call + // inside ExtractTables. It overrides both x_tolerance and + // y_tolerance of the underlying WordExtractor. Default: 3. + TextTolerance float64 + + // MinWordsVertical / MinWordsHorizontal control the "text" + // strategy thresholds (Phase 1.3.D). They have no effect when + // both strategies are "lines" / "lines_strict" — kept on this + // struct so callers don't have to switch types when migrating to + // the text strategy later. 
+ MinWordsVertical int + MinWordsHorizontal int + + // KeepBlankChars is forwarded to the per-cell WordExtractor. + // Default: false (matches pdfplumber's text_keep_blank_chars). + KeepBlankChars bool + + // ExplicitVerticalLines / ExplicitHorizontalLines hold caller- + // supplied edge positions. With StrategyLines or + // StrategyLinesStrict they are ADDED to the derived edges; with + // StrategyExplicit they ARE the edges. In v0.2.0 the explicit + // strategy is not yet implemented; these slices have effect only + // when one of the lines strategies is in use and you want extra + // hand-placed rules (e.g. when your column boundary isn't drawn). + // + // Values are X coordinates for vertical lines, Y coordinates for + // horizontal lines, both in PDF user-space points. + ExplicitVerticalLines []float64 + ExplicitHorizontalLines []float64 +} + +// DefaultTableSettings returns settings with the pdfplumber default +// values pre-populated. The intended pattern is: +// +// settings := pdftable.DefaultTableSettings() +// settings.VerticalStrategy = pdftable.StrategyLinesStrict +// tables, err := page.ExtractTables(settings) +// +// pdfplumber's defaults (table.py lines 9-12, 486-503): +// +// DEFAULT_SNAP_TOLERANCE = 3 +// DEFAULT_JOIN_TOLERANCE = 3 +// DEFAULT_MIN_WORDS_VERTICAL = 3 +// DEFAULT_MIN_WORDS_HORIZONTAL = 1 +// edge_min_length = 3 +// edge_min_length_prefilter = 1 +// intersection_tolerance = 3 +// vertical_strategy = "lines" +// horizontal_strategy = "lines" +// text_x_tolerance/y_tolerance = 3 +func DefaultTableSettings() TableSettings { + return TableSettings{ + VerticalStrategy: StrategyLines, + HorizontalStrategy: StrategyLines, + SnapTolerance: 3, + JoinTolerance: 3, + EdgeMinLength: 3, + EdgeMinLengthPrefilter: 1, + IntersectionTolerance: 3, + TextTolerance: 3, + MinWordsVertical: 3, + MinWordsHorizontal: 1, + } +} + +// applyDefaults fills in zero-valued fields with pdfplumber-matching +// defaults. 
Callers who construct a TableSettings literal and only set +// the fields they care about get the same defaults as if they'd used +// DefaultTableSettings(). +func (s TableSettings) applyDefaults() TableSettings { + if s.VerticalStrategy == "" { + s.VerticalStrategy = StrategyLines + } + if s.HorizontalStrategy == "" { + s.HorizontalStrategy = StrategyLines + } + if s.SnapTolerance == 0 { + s.SnapTolerance = 3 + } + if s.JoinTolerance == 0 { + s.JoinTolerance = 3 + } + if s.EdgeMinLength == 0 { + s.EdgeMinLength = 3 + } + if s.EdgeMinLengthPrefilter == 0 { + s.EdgeMinLengthPrefilter = 1 + } + if s.IntersectionTolerance == 0 { + s.IntersectionTolerance = 3 + } + if s.TextTolerance == 0 { + s.TextTolerance = 3 + } + if s.MinWordsVertical == 0 { + s.MinWordsVertical = 3 + } + if s.MinWordsHorizontal == 0 { + s.MinWordsHorizontal = 1 + } + return s +} + +// Table is the extracted result for one detected table. It carries +// the assembled cell texts plus the geometry needed for downstream +// consumers (re-rendering, click-through to source positions). +type Table struct { + // Rows is the table's text content as a 2-D slice. Row 0 is the + // VISUALLY TOP row of the table; column 0 is the leftmost. Empty + // cells appear as "". Missing cells (when a row has fewer columns + // than the table's column count, because the underlying cell + // detection found a hole) are also "" — we promote missing to + // empty so callers don't have to nil-check every entry. + Rows [][]string + + // BBox is the union of every cell's bbox, in PDF user-space + // coordinates (origin bottom-left, Y growing up). + BBox BBox + + // Page is the 1-based page number the table was found on, copied + // from the originating Page so callers can carry results across + // page boundaries without holding Page references. + Page int + + // CellsBBox is the per-cell bbox aligned to Rows: CellsBBox[i][j] + // is the bbox of Rows[i][j]. 
Useful for re-rendering with + // highlight overlays, or for re-cropping the page to extract the + // cell's contents in a richer format than plain text. + CellsBBox [][]BBox +} + +// Cells returns the cell bboxes flattened into reading order +// (left-to-right, top-to-bottom). Provided as a convenience for +// callers that want a single iterable rather than a nested slice. +func (t Table) Cells() []BBox { + out := make([]BBox, 0) + for _, row := range t.CellsBBox { + out = append(out, row...) + } + return out +} diff --git a/table_test.go b/table_test.go new file mode 100644 index 0000000..09eb4e7 --- /dev/null +++ b/table_test.go @@ -0,0 +1,443 @@ +// Copyright (c) 2026 Halleluyah Oludele +// Licensed under the MIT License. + +package pdftable + +// table_test.go is intentionally in the pdftable package (not +// pdftable_test) so it can reach the unexported algorithm functions: +// edgesToIntersections, intersectionsToCells, cellsToTables, +// assembleTableBox, runTableFinder. The public-API integration test +// (TestExtractTables_RuledFixture) lives at the end and uses the +// public surface only. + +import ( + "strings" + "testing" + + "github.com/hallelx2/pdftable/internal/layout" + "github.com/hallelx2/pdftable/testdata" +) + +// makeH builds a horizontal edge at Y = y from X0 = x0 to X1 = x1. +// Tests never construct layout.Edge values directly because the +// invariants (X0 <= X1 for h, Y0 == Y1 for h) are constructor- +// enforced; this helper centralises the common case. +func makeH(x0, x1, y float64) layout.Edge { + return layout.Edge{ + X0: x0, X1: x1, Y0: y, Y1: y, + Orientation: layout.Horizontal, + Source: layout.SourceLine, + } +} + +// makeV builds a vertical edge at X = x from Y0 = y0 to Y1 = y1. 
+func makeV(x, y0, y1 float64) layout.Edge { + return layout.Edge{ + X0: x, X1: x, Y0: y0, Y1: y1, + Orientation: layout.Vertical, + Source: layout.SourceLine, + } +} + +// TestEdgesToIntersections_Grid2x2 sets up a 2×2 cell grid (3 +// horizontal and 3 vertical edges) and asserts the intersection +// scanner finds exactly 9 crossing points at the expected (X, Y) +// pairs. +// +// Geometry (PDF user space, Y growing up): +// +// Y=100 ───────── +// │ │ │ │ +// Y= 50 ───────── +// │ │ │ │ +// Y= 0 ───────── +// X=0 X=50 X=100 +func TestEdgesToIntersections_Grid2x2(t *testing.T) { + edges := []layout.Edge{ + makeH(0, 100, 0), + makeH(0, 100, 50), + makeH(0, 100, 100), + makeV(0, 0, 100), + makeV(50, 0, 100), + makeV(100, 0, 100), + } + ints := edgesToIntersections(edges, 0.1, 0.1) + if len(ints) != 9 { + t.Fatalf("intersections: got %d, want 9", len(ints)) + } + + // Build a set of (X, Y) for easy lookup. + type pt struct{ x, y float64 } + got := make(map[pt]bool, len(ints)) + for _, p := range ints { + got[pt{x: p.X, y: p.Y}] = true + } + for _, x := range []float64{0, 50, 100} { + for _, y := range []float64{0, 50, 100} { + if !got[pt{x: x, y: y}] { + t.Errorf("missing intersection at (%v, %v)", x, y) + } + } + } + + // Each intersection should have at least one V and one H edge. + for _, p := range ints { + if len(p.V) == 0 || len(p.H) == 0 { + t.Errorf("intersection (%v,%v): V=%d H=%d, want both > 0", + p.X, p.Y, len(p.V), len(p.H)) + } + } +} + +// TestEdgesToIntersections_NoCrossing asserts that parallel edges +// never intersect each other, even when their X / Y spans overlap: +// two horizontal and two vertical edges yield only the four H×V +// crossings. +func TestEdgesToIntersections_NoCrossing(t *testing.T) { + edges := []layout.Edge{ + makeH(0, 100, 50), + makeH(0, 100, 60), + makeV(0, 0, 100), + makeV(50, 0, 100), + } + // Expected: 4 intersections (each H crosses each V). 
+ ints := edgesToIntersections(edges, 0.1, 0.1) + if len(ints) != 4 { + t.Fatalf("got %d intersections, want 4", len(ints)) + } +} + +// TestEdgesToIntersections_Tolerance asserts the perpendicular +// tolerance lets a near-miss crossing register. The vertical edge +// ends at Y=49.5 but the horizontal edge sits at Y=50: with yTol=1 +// the intersection should still be picked up. +func TestEdgesToIntersections_Tolerance(t *testing.T) { + edges := []layout.Edge{ + makeH(0, 100, 50), + makeV(50, 0, 49.5), + } + if got := len(edgesToIntersections(edges, 1, 1)); got != 1 { + t.Errorf("with yTol=1: got %d intersections, want 1", got) + } + if got := len(edgesToIntersections(edges, 0.1, 0.1)); got != 0 { + t.Errorf("with yTol=0.1: got %d intersections, want 0", got) + } +} + +// TestIntersectionsToCells_Single2x2 feeds the 9 intersections of a +// 2×2 grid and asserts the cell finder produces exactly 4 cells +// covering the expected bboxes. +func TestIntersectionsToCells_Single2x2(t *testing.T) { + edges := []layout.Edge{ + makeH(0, 100, 0), + makeH(0, 100, 50), + makeH(0, 100, 100), + makeV(0, 0, 100), + makeV(50, 0, 100), + makeV(100, 0, 100), + } + ints := edgesToIntersections(edges, 0.1, 0.1) + cells := intersectionsToCells(ints) + if len(cells) != 4 { + t.Fatalf("got %d cells, want 4", len(cells)) + } + + type want struct{ x0, y0, x1, y1 float64 } + expected := []want{ + {0, 50, 50, 100}, + {50, 50, 100, 100}, + {0, 0, 50, 50}, + {50, 0, 100, 50}, + } + for _, w := range expected { + found := false + for _, c := range cells { + if c.X0 == w.x0 && c.Y0 == w.y0 && c.X1 == w.x1 && c.Y1 == w.y1 { + found = true + break + } + } + if !found { + t.Errorf("missing cell (%v,%v)-(%v,%v)", w.x0, w.y0, w.x1, w.y1) + } + } +} + +// TestCellsToTables_GroupsTouching asserts that cells sharing a +// corner end up in the same table, and that singletons get dropped. +func TestCellsToTables_GroupsTouching(t *testing.T) { + // Two 2x2 grids that don't touch. 
+ left := []BBox{ + NewBBox(0, 0, 50, 50), + NewBBox(50, 0, 100, 50), + NewBBox(0, 50, 50, 100), + NewBBox(50, 50, 100, 100), + } + right := []BBox{ + NewBBox(200, 0, 250, 50), + NewBBox(250, 0, 300, 50), + NewBBox(200, 50, 250, 100), + NewBBox(250, 50, 300, 100), + } + // One standalone cell that should be filtered out. + stray := []BBox{NewBBox(400, 400, 410, 410)} + + all := append([]BBox{}, left...) + all = append(all, right...) + all = append(all, stray...) + + tables := cellsToTables(all) + if len(tables) != 2 { + t.Fatalf("got %d tables, want 2", len(tables)) + } + for i, tbl := range tables { + if len(tbl) != 4 { + t.Errorf("table %d: got %d cells, want 4", i, len(tbl)) + } + } +} + +// TestAssembleTableBox_2x2Grid asserts the row/column projection +// builds the expected 2×2 grid with cells in the right slots. +// +// Row 0 is visually TOP (larger Y1). Column 0 is visually LEFT +// (smaller X0). +func TestAssembleTableBox_2x2Grid(t *testing.T) { + cells := []BBox{ + NewBBox(0, 50, 50, 100), // top-left + NewBBox(50, 50, 100, 100), // top-right + NewBBox(0, 0, 50, 50), // bottom-left + NewBBox(50, 0, 100, 50), // bottom-right + } + tb := assembleTableBox(cells) + if tb.Rows != 2 || tb.Cols != 2 { + t.Fatalf("got Rows=%d Cols=%d, want 2x2", tb.Rows, tb.Cols) + } + if tb.BBox != (BBox{X0: 0, Y0: 0, X1: 100, Y1: 100}) { + t.Errorf("bbox: got %+v, want (0,0)-(100,100)", tb.BBox) + } + // Row 0 = top = larger Y. Row 0, Col 0 should be (0, 50, 50, 100). 
+ if c := tb.CellsGrid[0][0]; c.X0 != 0 || c.Y1 != 100 { + t.Errorf("CellsGrid[0][0]: got %+v, want top-left", c) + } + if c := tb.CellsGrid[0][1]; c.X0 != 50 || c.Y1 != 100 { + t.Errorf("CellsGrid[0][1]: got %+v, want top-right", c) + } + if c := tb.CellsGrid[1][0]; c.X0 != 0 || c.Y1 != 50 { + t.Errorf("CellsGrid[1][0]: got %+v, want bottom-left", c) + } + if c := tb.CellsGrid[1][1]; c.X0 != 50 || c.Y1 != 50 { + t.Errorf("CellsGrid[1][1]: got %+v, want bottom-right", c) + } +} + +// TestRunTableFinder_2x3Grid is an end-to-end algorithm test that +// builds the edges for a 2-column × 3-row grid by hand, runs the +// full pipeline, and asserts the output has the right shape: +// 12 intersections, 6 cells, 1 table, 3×2 grid. +func TestRunTableFinder_2x3Grid(t *testing.T) { + // Columns at X = 100, 200, 300; rows at Y = 0, 50, 100, 150. + edges := []layout.Edge{ + makeH(100, 300, 0), + makeH(100, 300, 50), + makeH(100, 300, 100), + makeH(100, 300, 150), + makeV(100, 0, 150), + makeV(200, 0, 150), + makeV(300, 0, 150), + } + finder := runTableFinder(edges, 0.1, 0.1) + if got := len(finder.Intersections); got != 12 { + t.Errorf("intersections: got %d, want 12", got) + } + if got := len(finder.Cells); got != 6 { + t.Errorf("cells: got %d, want 6", got) + } + if got := len(finder.Tables); got != 1 { + t.Fatalf("tables: got %d, want 1", got) + } + tb := finder.Tables[0] + if tb.Rows != 3 || tb.Cols != 2 { + t.Errorf("grid: got %dx%d, want 3x2", tb.Rows, tb.Cols) + } +} + +// TestEnsureSupportedStrategies_RejectsTextAndExplicit asserts that +// the v0.3.0 strategies return ErrUnsupported rather than silently +// running an empty pipeline. 
+func TestEnsureSupportedStrategies_RejectsTextAndExplicit(t *testing.T) { + cases := []struct { + name string + s TableSettings + }{ + {"text/lines", TableSettings{VerticalStrategy: StrategyText, HorizontalStrategy: StrategyLines}}, + {"lines/text", TableSettings{VerticalStrategy: StrategyLines, HorizontalStrategy: StrategyText}}, + {"explicit/lines", TableSettings{VerticalStrategy: StrategyExplicit, HorizontalStrategy: StrategyLines}}, + {"lines/explicit", TableSettings{VerticalStrategy: StrategyLines, HorizontalStrategy: StrategyExplicit}}, + } + for _, c := range cases { + t.Run(c.name, func(t *testing.T) { + err := ensureSupportedStrategies(c.s.applyDefaults()) + if err == nil { + t.Fatal("got nil error, want ErrUnsupported") + } + if !errIs(err, ErrUnsupported) { + t.Errorf("got %v, want ErrUnsupported", err) + } + }) + } +} + +// TestEnsureSupportedStrategies_AcceptsLines asserts that both lines +// strategies pass validation. +func TestEnsureSupportedStrategies_AcceptsLines(t *testing.T) { + cases := []struct { + name string + s TableSettings + }{ + {"lines/lines", TableSettings{VerticalStrategy: StrategyLines, HorizontalStrategy: StrategyLines}}, + {"strict/lines", TableSettings{VerticalStrategy: StrategyLinesStrict, HorizontalStrategy: StrategyLines}}, + {"lines/strict", TableSettings{VerticalStrategy: StrategyLines, HorizontalStrategy: StrategyLinesStrict}}, + {"strict/strict", TableSettings{VerticalStrategy: StrategyLinesStrict, HorizontalStrategy: StrategyLinesStrict}}, + } + for _, c := range cases { + t.Run(c.name, func(t *testing.T) { + if err := ensureSupportedStrategies(c.s.applyDefaults()); err != nil { + t.Errorf("got %v, want nil", err) + } + }) + } +} + +// TestApplyDefaults_FillsZeroFields verifies the zero-value defaults +// match pdfplumber's constants. 
+func TestApplyDefaults_FillsZeroFields(t *testing.T) { + s := TableSettings{}.applyDefaults() + if s.VerticalStrategy != StrategyLines { + t.Errorf("VerticalStrategy: got %q, want %q", s.VerticalStrategy, StrategyLines) + } + if s.SnapTolerance != 3 { + t.Errorf("SnapTolerance: got %v, want 3", s.SnapTolerance) + } + if s.JoinTolerance != 3 { + t.Errorf("JoinTolerance: got %v, want 3", s.JoinTolerance) + } + if s.EdgeMinLength != 3 { + t.Errorf("EdgeMinLength: got %v, want 3", s.EdgeMinLength) + } + if s.EdgeMinLengthPrefilter != 1 { + t.Errorf("EdgeMinLengthPrefilter: got %v, want 1", s.EdgeMinLengthPrefilter) + } + if s.IntersectionTolerance != 3 { + t.Errorf("IntersectionTolerance: got %v, want 3", s.IntersectionTolerance) + } + if s.TextTolerance != 3 { + t.Errorf("TextTolerance: got %v, want 3", s.TextTolerance) + } +} + +// TestExtractTables_RuledFixture is the end-to-end integration test +// against the hand-crafted 2×3 ruled-table fixture. It opens the +// generated PDF, runs ExtractTables with default settings, and +// asserts the row count + cell text matches the fixture's known +// content. +// +// This test uses the public API only — the unit tests above cover +// the unexported algorithm functions. +func TestExtractTables_RuledFixture(t *testing.T) { + // Import path is package-internal here (we're in the pdftable + // package, not _test), so OpenBytes is unqualified. 
+ doc, err := OpenBytes(testdata.TableRuled()) + if err != nil { + t.Fatalf("OpenBytes: %v", err) + } + defer doc.Close() + + p, err := doc.Page(1) + if err != nil { + t.Fatalf("Page(1): %v", err) + } + + tables, err := p.ExtractTables(DefaultTableSettings()) + if err != nil { + t.Fatalf("ExtractTables: %v", err) + } + if len(tables) != 1 { + t.Fatalf("got %d tables, want 1", len(tables)) + } + tbl := tables[0] + if len(tbl.Rows) != 3 { + t.Fatalf("rows: got %d, want 3", len(tbl.Rows)) + } + for i, row := range tbl.Rows { + if len(row) != 2 { + t.Errorf("row %d: got %d cols, want 2", i, len(row)) + } + } + // Row 0 (visually top): Name | Age + // Row 1: Alice | 30 + // Row 2: Bob | 25 + want := [][]string{ + {"Name", "Age"}, + {"Alice", "30"}, + {"Bob", "25"}, + } + for i := range want { + for j := range want[i] { + got := tbl.Rows[i][j] + if got != want[i][j] { + t.Errorf("Rows[%d][%d]: got %q, want %q", i, j, got, want[i][j]) + } + } + } + if tbl.Page != 1 { + t.Errorf("Page: got %d, want 1", tbl.Page) + } + if tbl.BBox.IsZero() { + t.Error("BBox is zero") + } +} + +// TestExtractTables_UnsupportedStrategyReturnsErrUnsupported asserts +// the public API surfaces ErrUnsupported when callers request "text" +// or "explicit" strategies. +func TestExtractTables_UnsupportedStrategyReturnsErrUnsupported(t *testing.T) { + doc, err := OpenBytes(testdata.TableRuled()) + if err != nil { + t.Fatalf("OpenBytes: %v", err) + } + defer doc.Close() + p, _ := doc.Page(1) + + settings := DefaultTableSettings() + settings.VerticalStrategy = StrategyText + _, err = p.ExtractTables(settings) + if err == nil { + t.Fatal("got nil, want ErrUnsupported") + } + if !errIs(err, ErrUnsupported) { + t.Errorf("got %v, want ErrUnsupported", err) + } + // The error should mention what was unsupported and the phase. 
+ if !strings.Contains(err.Error(), "text") { + t.Errorf("error %q should name the strategy", err.Error()) + } +} + +// TestFindTables_NoEdgesReturnsEmpty asserts that a page with no +// edges (e.g. a text-only page) returns an empty slice, not an +// error. +func TestFindTables_NoEdgesReturnsEmpty(t *testing.T) { + doc, err := OpenBytes(testdata.Hello()) + if err != nil { + t.Fatalf("OpenBytes: %v", err) + } + defer doc.Close() + p, _ := doc.Page(1) + finders, err := p.FindTables(DefaultTableSettings()) + if err != nil { + t.Errorf("FindTables on text-only page: got %v, want nil", err) + } + if len(finders) != 0 { + t.Errorf("got %d finders, want 0 (text-only page)", len(finders)) + } +} diff --git a/testdata/fixtures.go b/testdata/fixtures.go index 4753b2e..0f7caba 100644 --- a/testdata/fixtures.go +++ b/testdata/fixtures.go @@ -52,6 +52,88 @@ ET `) } +// TableRuled returns a minimal PDF whose content stream draws a +// 2-column × 3-row ruled table containing predictable text. The +// table is positioned in user space so the cells are easy to reason +// about: each cell is 100×30 PDF points, the top-left cell sits at +// (100, 700), and the grid extends down to (300, 610). +// +// The grid is drawn by four horizontal rules (at Y = 610, 640, 670, +// 700) and three vertical rules (at X = 100, 200, 300). Cell content +// is placed near the top-left of each cell using `Td` offsets: +// +// row 0: Name Age <- header +// row 1: Alice 30 +// row 2: Bob 25 +// +// Coordinates use PDF user space (Y growing UP), so row 0 is the +// VISUALLY TOP row. The fixture is intentionally simple — no +// kerning, no rotated text, no shaded backgrounds — so the +// expected output is uncontroversial. We use Helvetica from the +// standard 14 so no font program needs to be embedded. +func TableRuled() []byte { + // Lay out the seven ruling lines (4 horizontal, 3 vertical). + // Each line uses a separate moveto/lineto/stroke triple. 
We + // could collapse them into one painting operation but separate + // `S` calls produce cleaner per-line objects in the parser's + // output, which keeps the test's "we have N lines" assertions + // straightforward. + // + // Horizontal rules: Y = 610 / 640 / 670 / 700, X from 100 to 300. + // Vertical rules: X = 100 / 200 / 300, Y from 610 to 700. + const grid = `1 w +% Horizontal rules. +100 610 m +300 610 l +S +100 640 m +300 640 l +S +100 670 m +300 670 l +S +100 700 m +300 700 l +S +% Vertical rules. +100 610 m +100 700 l +S +200 610 m +200 700 l +S +300 610 m +300 700 l +S +` + // Text in each cell. Y values target a few points below the + // top of the cell so the glyph baseline sits within the cell + // bbox even after Helvetica's descender drops below the + // baseline. Cell top is at Y = top of cell; we use Td to move + // to (x_offset, top - 22) where 22 leaves room for the cap + // height of 10pt Helvetica. + const text = `BT +/F1 10 Tf +% Row 0 (header): Y top = 700, baseline ≈ 678. +110 678 Td +(Name) Tj +100 0 Td +(Age) Tj +% Row 1: Y top = 670, baseline ≈ 648. Move back to col 0 then down. +-100 -30 Td +(Alice) Tj +100 0 Td +(30) Tj +% Row 2: Y top = 640, baseline ≈ 618. +-100 -30 Td +(Bob) Tj +100 0 Td +(25) Tj +ET +` + return BuildSinglePage(grid + text) +} + // Rules returns a minimal PDF whose content stream draws four lines // (two horizontal, two vertical) and one rectangle. 
We use simple // coordinates: a 100x100 box with one stroked diagonal line and one diff --git a/testdata/golden/issue-466-example.pdf b/testdata/golden/issue-466-example.pdf new file mode 100644 index 0000000..7a4e084 Binary files /dev/null and b/testdata/golden/issue-466-example.pdf differ diff --git a/testdata/golden/issue-466-example.tables.expected.json b/testdata/golden/issue-466-example.tables.expected.json new file mode 100644 index 0000000..3a4faf6 --- /dev/null +++ b/testdata/golden/issue-466-example.tables.expected.json @@ -0,0 +1,46 @@ +{ + "name": "issue-466-example", + "pages": [ + { + "number": 1, + "width": 595.275590551181, + "height": 841.861417322835, + "tables": [ + [ + [ + "T0-C0", + "T0-C1", + "T0-C2" + ], + [ + "T0-00", + "T0-01", + "T0-02" + ], + [ + "T0-10", + "T0-11", + "T0-12" + ], + [ + "T0-20-last", + "T0-21-last", + "T0-22-last" + ] + ], + [ + [ + "T3-C0", + "T3-C1", + "T3-C2" + ], + [ + "T3-00\nT3-10\nT3-20-last", + "T3-01\nT3-11\nT3-21-last", + "T3-02\nT3-12\nT3-22-last" + ] + ] + ] + } + ] +} \ No newline at end of file diff --git a/testdata/table-2x3-ruled.pdf b/testdata/table-2x3-ruled.pdf new file mode 100644 index 0000000..68a6ad7 Binary files /dev/null and b/testdata/table-2x3-ruled.pdf differ