hallelx2 · May 27, 2026
diff --git a/.gitattributes b/.gitattributes
@@ -0,0 +1,4 @@
+* text=auto eol=lf
+
+# Binary fixtures — keep raw bytes intact across platforms.
+*.pdf binary
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -5,6 +5,98 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [0.2.0] - 2026-05-27
+
+Phase 1.3.C — table-finding via ruled lines. Direct port of
+pdfplumber's `TableFinder` + cells-from-edges algorithm (`table.py`).
+The v0.1.x public API surface is unchanged; v0.2.0 only adds methods
+to the `Page` interface and new top-level types, so existing callers
+compile and run as-is.
+
+### Added
+
+- `Page.FindTables(settings TableSettings) ([]TableFinder, error)` —
+  geometry-only stage of the pipeline. Returns one TableFinder per
+  detected table group with the merged edges, intersections, raw
+  cells, and assembled per-table CellsGrid exposed for debugging /
+  custom rendering.
+- `Page.ExtractTables(settings TableSettings) ([]*Table, error)` —
+  wraps FindTables, runs per-cell text extraction, returns fully
+  populated `Table` structs. Cell text is the dense `extract_text`
+  output for chars whose centre point falls inside the cell bbox,
+  with leading / trailing whitespace stripped. Empty cells produce
+  `""`.
+- `TableSettings` struct with `DefaultTableSettings()` constructor
+  carrying pdfplumber-matching defaults (`snap_tolerance=3`,
+  `join_tolerance=3`, `edge_min_length=3`, `edge_min_length_prefilter=1`,
+  `intersection_tolerance=3`, `text_tolerance=3`).
+- `TableStrategy` enum with constants `StrategyLines`,
+  `StrategyLinesStrict`, `StrategyText`, `StrategyExplicit`. Only
+  `StrategyLines` and `StrategyLinesStrict` are implemented in this
+  release; `StrategyText` and `StrategyExplicit` are deferred to
+  v0.3.0 and return `ErrUnsupported` (with a clear "Phase 1.3.D"
+  message) so callers don't get silent empty results.
+- `Table` (rows × columns of cell text + bbox + per-cell bbox grid),
+  `TableFinder` (edges + intersections + cells + tables), `TableBox`
+  (one assembled table's geometry: bbox + Rows × Cols grid),
+  `Intersection` (one edge-crossing point with its participating
+  vertical and horizontal edges).
+- Internal `internal/layout` package: `Edge` type with `FromLine`,
+  `FromRect`, `FromCurve` constructors, plus the snap → join →
+  filter pipeline (`SnapEdges`, `JoinEdges`, `MergeEdges`,
+  `FilterEdgesByLength`, `FilterEdgesBySource`,
+  `FilterEdgesByOrientation`, `SortEdges`).
+- Golden-file parity test against pdfplumber's `lines`-strategy `find_tables`
+  on the `issue-466-example.pdf` fixture (4×3 + 2×3 ruled tables).
+  Test infrastructure (`TestGoldenTablesAgainstPdfplumber` in
+  `golden_test.go`) loads any `*.tables.expected.json` fixture in
+  `testdata/golden/` and compares cell-for-cell after whitespace
+  normalisation. Regenerate via `python scripts/gen_golden.py`.
+- New hand-crafted fixture: `testdata.TableRuled()` — minimal
+  2-column × 3-row ruled table with predictable text ("Name", "Age";
+  "Alice", "30"; "Bob", "25") for unit testing the public API
+  surface without depending on third-party PDFs. Generator script
+  at `scripts/gen_table_fixture.go`.
+- Algorithm-level unit tests in `table_test.go`: hand-crafted edge
+  lists exercising `edgesToIntersections`, `intersectionsToCells`,
+  `cellsToTables`, `assembleTableBox`, and the full `runTableFinder`
+  pipeline.
+- README "Tables" section with a side-by-side Go / pdfplumber
+  example. The example is also extracted as a runnable program at
+  `examples/extract_tables/main.go` so changes to the API surface
+  break the example at build time.
+
+### Deferred (planned for v0.3.0 — Phase 1.3.D)
+
+- `StrategyText`: infer table edges from word alignment (clusters of
+  words sharing x0 / x1 / centre, clusters of words sharing top /
+  bottom). Useful for PDFs whose tables have no ruled lines (e.g.
+  banking statements, scanned-then-OCR'd documents).
+- `StrategyExplicit`: caller-supplied edges via
+  `TableSettings.ExplicitVerticalLines` /
+  `ExplicitHorizontalLines`. In v0.2.0 these settings are accepted
+  and added on top of the derived edges (helpful when a column
+  boundary isn't drawn), but they don't form the only source of
+  edges yet.
+
+### Known limitations
+
+- The cell-text extraction shares the v0.1.x word-grouping engine,
+  which depends on font metrics. Cells whose glyphs use standard-14
+  fonts WITHOUT the bundled AFM tables can have inter-word gaps
+  reported as "no gap" — e.g. "Hello World" comes out as
+  "HelloWorld". This was already documented for v0.1.0; for v0.2.0
+  it means the parity test against
+  `la-precinct-bulletin-2014-p1.pdf` (which uses Helvetica-Bold)
+  fails on cell text equality. The fixture is not checked in to
+  avoid CI noise; it'll be re-added once the AFM bundle lands in
+  v0.2.x.
+- `senate-expenditures.pdf` produces 7 cells where pdfplumber finds
+  10. The divergence is in how snap+join unifies edges that share a
+  near-collinear endpoint but differ slightly in the perpendicular
+  axis; under investigation as a follow-up issue. The fixture is
+  not in the golden set yet.
+
 ## [0.1.1] - 2026-05-27
 
 ### Fixed
@@ -127,6 +219,7 @@ Initial release. Phase 1.3.A — content-stream primitives layer.
 - Type 3 fonts (their glyph procedures are themselves content streams).
 - Vertical writing mode.
 
+[0.2.0]: https://github.com/hallelx2/pdftable/releases/tag/v0.2.0
 [0.1.1]: https://github.com/hallelx2/pdftable/releases/tag/v0.1.1
 [0.1.0]: https://github.com/hallelx2/pdftable/releases/tag/v0.1.0
 [0.0.1]: https://github.com/hallelx2/pdftable/releases/tag/v0.0.1
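The intersections → cells stage named in the changelog entry can be illustrated with a small standalone sketch. It follows the general shape of pdfplumber's cells-from-edges approach but is not the code in `finder.go`: the `VEdge`, `HEdge`, `Point`, `Cell`, and `crossing` types and the `findIntersections` / `findCells` helpers are hypothetical names chosen for illustration, coordinates are assumed to be top-origin and already snapped (so points compare with exact equality), and only the tolerance check mirrors the `intersection_tolerance` setting.

```go
package main

import (
	"fmt"
	"sort"
)

// Hypothetical stand-ins; pdftable's internal types may differ.
type VEdge struct{ X, Top, Bottom float64 } // vertical rule
type HEdge struct{ Y, X0, X1 float64 }      // horizontal rule
type Point struct{ X, Y float64 }
type Cell struct{ X0, Top, X1, Bottom float64 }

// crossing records which edges pass through one intersection point.
type crossing struct{ v, h map[int]bool }

// findIntersections pairs every vertical edge with every horizontal edge
// and keeps the points where the two cross within tol.
func findIntersections(vs []VEdge, hs []HEdge, tol float64) map[Point]*crossing {
	out := map[Point]*crossing{}
	for vi, v := range vs {
		for hi, h := range hs {
			if v.Top <= h.Y+tol && v.Bottom >= h.Y-tol &&
				v.X >= h.X0-tol && v.X <= h.X1+tol {
				p := Point{v.X, h.Y}
				if out[p] == nil {
					out[p] = &crossing{v: map[int]bool{}, h: map[int]bool{}}
				}
				out[p].v[vi] = true
				out[p].h[hi] = true
			}
		}
	}
	return out
}

// shares reports whether two points lie on at least one common edge.
func shares(a, b map[int]bool) bool {
	for k := range a {
		if b[k] {
			return true
		}
	}
	return false
}

// findCells treats each intersection as a candidate top-left corner and
// emits the smallest rectangle whose four sides are all backed by edges.
func findCells(ix map[Point]*crossing) []Cell {
	pts := make([]Point, 0, len(ix))
	for p := range ix {
		pts = append(pts, p)
	}
	// Reading order: top to bottom, then left to right.
	sort.Slice(pts, func(i, j int) bool {
		if pts[i].Y != pts[j].Y {
			return pts[i].Y < pts[j].Y
		}
		return pts[i].X < pts[j].X
	})
	var cells []Cell
	for _, p := range pts {
	search:
		for _, below := range pts {
			if below.X != p.X || below.Y <= p.Y || !shares(ix[p].v, ix[below].v) {
				continue
			}
			for _, right := range pts {
				if right.Y != p.Y || right.X <= p.X || !shares(ix[p].h, ix[right].h) {
					continue
				}
				br, ok := ix[Point{right.X, below.Y}]
				if ok && shares(br.v, ix[right].v) && shares(br.h, ix[below].h) {
					cells = append(cells, Cell{p.X, p.Y, right.X, below.Y})
					break search
				}
			}
		}
	}
	return cells
}

func main() {
	// A 2×2 grid: three vertical and three horizontal rules.
	vs := []VEdge{{0, 0, 40}, {50, 0, 40}, {100, 0, 40}}
	hs := []HEdge{{0, 0, 100}, {20, 0, 100}, {40, 0, 100}}
	for _, c := range findCells(findIntersections(vs, hs, 3)) {
		fmt.Printf("%+v\n", c)
	}
}
```

Tracking which edges pass through each point is what stops two points that merely line up from being treated as connected: a cell is only emitted when every side lies on a real edge.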
diff --git a/README.md b/README.md
@@ -19,9 +19,10 @@ heuristics on. This is that.
 
 ## Status
 
-`v0.1.0` — words and text extraction. `Page.Words`, `Page.ExtractText`,
-and `Page.ExtractTextSimple` ship with this release; table-finding
-(`FindTables`, `ExtractTables`) is the next phase.
+`v0.2.0` — line-strategy table finding. `Page.FindTables` and
+`Page.ExtractTables` ship with this release covering the `lines` and
+`lines_strict` strategies (PDFs with ruled tables). `text` and
+`explicit` strategies return `ErrUnsupported` and land in v0.3.0.
 
 [![Go Reference](https://pkg.go.dev/badge/github.com/hallelx2/pdftable.svg)](https://pkg.go.dev/github.com/hallelx2/pdftable)
 [![CI](https://github.com/hallelx2/pdftable/actions/workflows/test.yml/badge.svg)](https://github.com/hallelx2/pdftable/actions/workflows/test.yml)
@@ -30,7 +31,7 @@ and `Page.ExtractTextSimple` ship with this release; table-finding
 ## Install
 
 ```sh
-go get github.com/hallelx2/pdftable@v0.1.0
+go get github.com/hallelx2/pdftable@v0.2.0
 ```
 
 Requires Go 1.25+ (uses the standard-library `iter` package for the `Pages()` range-over-func iterator, and pdfcpu v0.12+).
@@ -111,6 +112,10 @@ type Page interface {
     Words(opts WordOpts) ([]Word, error)
     ExtractText(opts TextOpts) (string, error)
     ExtractTextSimple(xTolerance, yTolerance float64) (string, error)
+
+    // New in v0.2.0: line-strategy table finding.
+    FindTables(settings TableSettings) ([]TableFinder, error)
+    ExtractTables(settings TableSettings) ([]*Table, error)
 }
 
 // Primitives.
@@ -206,6 +211,91 @@ laid, _ := page.ExtractText(opts)
 fmt.Println(laid)
 ```
 
+## Tables (lines strategy)
+
+`Page.ExtractTables` is the table-detection entry point. It runs the
+edges → intersections → cells → tables pipeline (a direct port of
+pdfplumber's `TableFinder`) and returns one `*Table` per detected
+ruled table, with cell text already extracted.
+
+```go
+doc, _ := pdftable.OpenFile("invoice.pdf")
+defer doc.Close()
+page, _ := doc.Page(1)
+
+settings := pdftable.DefaultTableSettings()
+// settings.VerticalStrategy = pdftable.StrategyLinesStrict  // ignore rect outlines
+
+tables, _ := page.ExtractTables(settings)
+for ti, t := range tables {
+    fmt.Printf("table %d: %d rows × %d cols at %+v\n",
+        ti, len(t.Rows), len(t.Rows[0]), t.BBox)
+    for _, row := range t.Rows {
+        fmt.Println(row)
+    }
+}
+```
+
+`TableSettings` defaults match pdfplumber's
+(`snap_tolerance=3`, `join_tolerance=3`, `edge_min_length=3`,
+`intersection_tolerance=3`, `text_tolerance=3`). Override any field
+on the value returned from `DefaultTableSettings()` to tighten or
+loosen the heuristics. The two implemented strategies are:
+
+- `StrategyLines` — edges come from drawn `Line` segments, `Rect`
+  outlines (all four sides), and axis-aligned `Curve` segments.
+  Default. Best for typical PDFs whose tables have rule lines.
+- `StrategyLinesStrict` — only drawn `Line` segments are used. Use
+  this when your PDF draws cell BACKGROUNDS as filled rectangles
+  that you do NOT want treated as row or column boundaries.
+
+`StrategyText` (word-alignment-based) and `StrategyExplicit`
+(caller-supplied edges) return `ErrUnsupported` in v0.2.0 — they
+land in v0.3.0.
+
+### Side-by-side: pdfplumber → pdftable
+
+```python
+# Python (pdfplumber)
+import pdfplumber
+
+with pdfplumber.open("invoice.pdf") as pdf:
+    page = pdf.pages[0]
+    for table in page.find_tables({"vertical_strategy": "lines",
+                                    "horizontal_strategy": "lines"}):
+        for row in table.extract():
+            print(row)
+```
+
+```go
+// Go (pdftable)
+import "github.com/hallelx2/pdftable"
+
+doc, _ := pdftable.OpenFile("invoice.pdf")
+defer doc.Close()
+page, _ := doc.Page(1)
+
+settings := pdftable.DefaultTableSettings()
+settings.VerticalStrategy = pdftable.StrategyLines
+settings.HorizontalStrategy = pdftable.StrategyLines
+
+tables, _ := page.ExtractTables(settings)
+for _, t := range tables {
+    for _, row := range t.Rows {
+        fmt.Println(row)
+    }
+}
+```
+
+The two outputs match cell-for-cell on ruled fixtures (see
+`testdata/golden/issue-466-example.*` for the parity test). The API
+shape differs in the obvious places: pdftable returns a slice of
+`*Table` with text already extracted, instead of `Table` objects you
+call `.extract()` on; rows are `[]string` instead of
+`list[Optional[str]]` (missing cells produce `""` rather than `None`);
+and table bboxes use `(X0, Y0, X1, Y1)` PDF user space rather than
+pdfplumber's top-left-origin `(x0, top, x1, bottom)`.
+
 ## Side-by-side comparison with pdfplumber
 
 ```python
@@ -299,16 +389,21 @@ pdftable/
 ├── page.go            // Page interface + implementation
 ├── char.go            // Public Char / Line / Rect / Curve / Objects
 ├── text.go            // Word + ExtractText + ExtractTextSimple (v0.1.0)
+├── table.go           // TableStrategy / TableSettings / Table types (v0.2.0)
+├── finder.go          // Cells-from-edges algorithm (v0.2.0)
 ├── clustering.go      // 1-D clusterObjects, groupObjectsByAttr, dedupeChars
 ├── geometry.go        // BBox helpers: Union, Intersect, Contains, Snap
 ├── errors.go          // Sentinel errors
-└── internal/pdf/
-    ├── reader.go      // pdfcpu bridge
-    ├── content.go     // Content-stream interpreter
-    ├── ops.go         // Operator dispatch table
-    ├── state.go       // Graphics + text state, matrix math
-    ├── font.go        // Font + encoding tables + glyph-name resolution
-    └── cmap.go        // ToUnicode CMap parser
+└── internal/
+    ├── layout/
+    │   └── lines.go   // Edge type + snap/join/filter pipeline (v0.2.0)
+    └── pdf/
+        ├── reader.go      // pdfcpu bridge
+        ├── content.go     // Content-stream interpreter
+        ├── ops.go         // Operator dispatch table
+        ├── state.go       // Graphics + text state, matrix math
+        ├── font.go        // Font + encoding tables + glyph-name resolution
+        └── cmap.go        // ToUnicode CMap parser
 ```
 
 The public `pdftable` package is small and stable. The `internal/pdf`
@@ -333,12 +428,15 @@ stdlib-only.
 
 - `v0.0.x` — content-stream primitives.
 - `v0.1.x` — text extraction: `Page.ExtractText`, `Page.Words`,
-  `Page.ExtractTextSimple` (this release).
-- `v0.2.x` — table finding: `Page.FindTables` using ruling-line +
-  whitespace heuristics, `Page.ExtractTables` returning row/cell text.
-  Bundle the standard-14 AFM metrics so word bboxes match pdfplumber
-  to within 1 PDF point.
-- `v0.3.x` — performance pass: parser benchmarking against pdfminer.six
+  `Page.ExtractTextSimple`.
+- `v0.2.x` — table finding via ruling lines (this release):
+  `Page.FindTables` / `Page.ExtractTables` covering the `lines` and
+  `lines_strict` strategies.
+- `v0.3.x` — remaining table strategies: `text` (word-alignment
+  edges) and `explicit` (caller-supplied edges). Bundle the
+  standard-14 AFM metrics so word bboxes (and therefore cell text)
+  match pdfplumber to within 1 PDF point on standard fonts.
+- `v0.4.x` — performance pass: parser benchmarking against pdfminer.six
   and pdfplumber on a representative document corpus.
 
 ## License

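The README notes that table bboxes are reported in PDF user space rather than pdfplumber's `(x0, top, x1, bottom)` layout. When comparing the two side by side, the y axis has to be flipped. The sketch below shows that arithmetic under stated assumptions: the page is unrotated, its box starts at the origin, `Y0 < Y1`, and the page height is known. The `BBox` and `TopBottom` structs and the `toTopOrigin` helper are hypothetical stand-ins, not part of pdftable's API.

```go
package main

import "fmt"

// BBox mirrors the (X0, Y0, X1, Y1) user-space layout the README describes.
// It is a stand-in for illustration, not pdftable's own type.
type BBox struct{ X0, Y0, X1, Y1 float64 }

// TopBottom is the pdfplumber-style layout: Top and Bottom are measured
// downward from the top edge of the page.
type TopBottom struct{ X0, Top, X1, Bottom float64 }

// toTopOrigin flips the y axis. It assumes an unrotated page whose box
// starts at (0, 0) and is pageHeight points tall.
func toTopOrigin(b BBox, pageHeight float64) TopBottom {
	return TopBottom{
		X0:     b.X0,
		Top:    pageHeight - b.Y1, // the upper edge in user space becomes Top
		X1:     b.X1,
		Bottom: pageHeight - b.Y0, // the lower edge in user space becomes Bottom
	}
}

func main() {
	// US Letter is 792 points tall.
	b := BBox{X0: 72, Y0: 600, X1: 300, Y1: 700}
	fmt.Printf("%+v\n", toTopOrigin(b, 792))
}
```

Pages with a non-zero box origin or a rotation would need the corresponding offset or transform applied first; that case is left out here.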
diff --git a/examples/extract_tables/main.go b/examples/extract_tables/main.go
@@ -0,0 +1,65 @@
+// Copyright (c) 2026 Halleluyah Oludele
+// Licensed under the MIT License.
+
+// examples/extract_tables/main.go is the runnable form of the
+// README's "Tables (lines strategy)" example. It exists so that
+// changes to the public API surface break the example at build time
+// rather than letting a stale snippet drift in the README.
+//
+// Run from the repo root:
+//
+//	go run ./examples/extract_tables testdata/golden/issue-466-example.pdf
+//
+// The example calls ExtractTables with default settings (which
+// select the "lines" strategy on both axes). It prints each detected
+// table's row and column counts and its bounding box, then each row
+// as a flat slice, mirroring the snippet documented in README.md.
+package main
+
+import (
+	"fmt"
+	"log"
+	"os"
+
+	"github.com/hallelx2/pdftable"
+)
+
+func main() {
+	if len(os.Args) < 2 {
+		fmt.Fprintln(os.Stderr, "usage: extract_tables <file.pdf>")
+		os.Exit(2)
+	}
+	path := os.Args[1]
+
+	doc, err := pdftable.OpenFile(path)
+	if err != nil {
+		log.Fatalf("OpenFile %s: %v", path, err)
+	}
+	defer doc.Close()
+
+	page, err := doc.Page(1)
+	if err != nil {
+		log.Fatalf("Page(1): %v", err)
+	}
+
+	settings := pdftable.DefaultTableSettings()
+	// Uncomment to ignore Rect outlines (filled cell backgrounds
+	// that aren't real column boundaries):
+	// settings.VerticalStrategy = pdftable.StrategyLinesStrict
+
+	tables, err := page.ExtractTables(settings)
+	if err != nil {
+		log.Fatalf("ExtractTables: %v", err)
+	}
+	for ti, t := range tables {
+		cols := 0
+		if len(t.Rows) > 0 {
+			cols = len(t.Rows[0])
+		}
+		fmt.Printf("table %d: %d rows × %d cols at %+v\n",
+			ti, len(t.Rows), cols, t.BBox)
+		for _, row := range t.Rows {
+			fmt.Println(row)
+		}
+	}
+}
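The changelog describes the per-cell text rule behind `ExtractTables`: a character belongs to a cell when its centre point falls inside the cell bbox, the result is whitespace-trimmed, and empty cells yield `""`. A minimal sketch of that rule follows. The `Char` and `Cell` structs and the `cellText` helper are hypothetical, coordinates are assumed top-origin, and the characters are simply concatenated in reading order rather than run through the library's dense text layout.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Hypothetical stand-ins for illustration; not pdftable's own types.
type Char struct {
	Text                string
	X0, Top, X1, Bottom float64
}
type Cell struct{ X0, Top, X1, Bottom float64 }

// cellText keeps the chars whose centre lies inside the cell, orders them
// top-to-bottom then left-to-right, and trims surrounding whitespace.
func cellText(chars []Char, c Cell) string {
	var in []Char
	for _, ch := range chars {
		cx := (ch.X0 + ch.X1) / 2
		cy := (ch.Top + ch.Bottom) / 2
		if cx >= c.X0 && cx < c.X1 && cy >= c.Top && cy < c.Bottom {
			in = append(in, ch)
		}
	}
	sort.SliceStable(in, func(i, j int) bool {
		if in[i].Top != in[j].Top {
			return in[i].Top < in[j].Top
		}
		return in[i].X0 < in[j].X0
	})
	var sb strings.Builder
	for _, ch := range in {
		sb.WriteString(ch.Text)
	}
	return strings.TrimSpace(sb.String())
}

func main() {
	chars := []Char{
		{"A", 2, 2, 8, 12}, {"g", 8, 2, 14, 12}, {"e", 14, 2, 20, 12},
		{" ", 20, 2, 24, 12}, // trailing space inside the first cell
		{"3", 52, 2, 58, 12}, {"0", 58, 2, 64, 12},
	}
	left := Cell{0, 0, 50, 20}
	right := Cell{50, 0, 100, 20}
	empty := Cell{0, 20, 50, 40}
	fmt.Printf("%q %q %q\n", cellText(chars, left), cellText(chars, right), cellText(chars, empty))
}
```

Testing the centre point rather than the full glyph box means a character that slightly overhangs a rule still lands in exactly one cell.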