Merged
4 changes: 4 additions & 0 deletions .gitattributes
@@ -0,0 +1,4 @@
* text=auto eol=lf

# Binary fixtures — keep raw bytes intact across platforms.
*.pdf binary
93 changes: 93 additions & 0 deletions CHANGELOG.md
@@ -5,6 +5,98 @@ All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [0.2.0] - 2026-05-27

Phase 1.3.C — table-finding via ruled lines. Direct port of
pdfplumber's `TableFinder` + cells-from-edges algorithm (`table.py`).
The v0.1.x public API surface is unchanged; v0.2.0 only adds methods
to the `Page` interface and new top-level types, so existing callers
compile and run as-is.

### Added

- `Page.FindTables(settings TableSettings) ([]TableFinder, error)` —
geometry-only stage of the pipeline. Returns one TableFinder per
detected table group with the merged edges, intersections, raw
cells, and assembled per-table CellsGrid exposed for debugging /
custom rendering.
- `Page.ExtractTables(settings TableSettings) ([]*Table, error)` —
wraps FindTables, runs per-cell text extraction, returns fully
populated `Table` structs. Cell text is the dense extract_text
output for chars whose centre point falls inside the cell bbox,
with leading / trailing whitespace stripped. Empty cells produce
`""`.
- `TableSettings` struct with `DefaultTableSettings()` constructor
carrying pdfplumber-matching defaults (snap_tolerance=3,
join_tolerance=3, edge_min_length=3, edge_min_length_prefilter=1,
intersection_tolerance=3, text_tolerance=3).
- `TableStrategy` enum with constants `StrategyLines`,
`StrategyLinesStrict`, `StrategyText`, `StrategyExplicit`. Only
`StrategyLines` and `StrategyLinesStrict` are implemented in this
release; `StrategyText` and `StrategyExplicit` are deferred to
v0.3.0 and return `ErrUnsupported` (with a clear "Phase 1.3.D"
message) so callers don't get silent empty results.
- `Table` (rows × columns of cell text + bbox + per-cell bbox grid),
`TableFinder` (edges + intersections + cells + tables), `TableBox`
(one assembled table's geometry: bbox + Rows × Cols grid),
`Intersection` (one edge-crossing point with its participating
vertical and horizontal edges).
- Internal `internal/layout` package: `Edge` type with `FromLine`,
`FromRect`, `FromCurve` constructors, plus the snap → join →
filter pipeline (`SnapEdges`, `JoinEdges`, `MergeEdges`,
`FilterEdgesByLength`, `FilterEdgesBySource`,
`FilterEdgesByOrientation`, `SortEdges`).
- Golden-file parity test against pdfplumber's `find_tables` (run
with the `lines` strategy) on the `issue-466-example.pdf` fixture
(4×3 + 2×3 ruled tables).
Test infrastructure (`TestGoldenTablesAgainstPdfplumber` in
`golden_test.go`) loads any `*.tables.expected.json` fixture in
`testdata/golden/` and compares cell-for-cell after whitespace
normalisation. Regenerate via `python scripts/gen_golden.py`.
- New hand-crafted fixture: `testdata.TableRuled()` — minimal
2-column × 3-row ruled table with predictable text ("Name", "Age";
"Alice", "30"; "Bob", "25") for unit testing the public API
surface without depending on third-party PDFs. Generator script
at `scripts/gen_table_fixture.go`.
- Algorithm-level unit tests in `table_test.go`: hand-crafted edge
lists exercising `edgesToIntersections`, `intersectionsToCells`,
`cellsToTables`, `assembleTableBox`, and the full `runTableFinder`
pipeline.
- README "Tables" section with a side-by-side Go / pdfplumber
example. The example is also extracted as a runnable program at
`examples/extract_tables/main.go` so changes to the API surface
break the example at build time.
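
The centre-point rule described for `ExtractTables` can be sketched as follows. This is a simplified illustration, not the library's implementation: `BBox`, `Char`, `centreInside`, and `cellText` are hypothetical names, and the text is joined in input order rather than run through the dense extract_text layout. The half-open bounds are an assumption about how ties on a shared cell border are resolved.

```go
package main

import (
	"fmt"
	"strings"
)

// BBox is an axis-aligned box; the field names are illustrative only.
type BBox struct{ X0, Y0, X1, Y1 float64 }

// Char is a hypothetical stand-in for a positioned glyph.
type Char struct {
	Text string
	Box  BBox
}

// centreInside reports whether the centre point of c lies inside cell.
// The bounds are half-open so a centre on a shared border belongs to
// exactly one of two adjacent cells.
func centreInside(c Char, cell BBox) bool {
	cx := (c.Box.X0 + c.Box.X1) / 2
	cy := (c.Box.Y0 + c.Box.Y1) / 2
	return cx >= cell.X0 && cx < cell.X1 && cy >= cell.Y0 && cy < cell.Y1
}

// cellText joins, in input order, the chars whose centre falls inside
// cell, then strips leading and trailing whitespace.
func cellText(chars []Char, cell BBox) string {
	var sb strings.Builder
	for _, c := range chars {
		if centreInside(c, cell) {
			sb.WriteString(c.Text)
		}
	}
	return strings.TrimSpace(sb.String())
}

func main() {
	cell := BBox{0, 0, 50, 20}
	chars := []Char{
		{" ", BBox{1, 5, 4, 15}},
		{"H", BBox{5, 5, 12, 15}},
		{"i", BBox{12, 5, 15, 15}},
		{"X", BBox{48, 5, 56, 15}}, // centre x = 52, outside the cell
	}
	fmt.Printf("%q\n", cellText(chars, cell))
}
```

A glyph that straddles a cell border is assigned by its centre alone, which is why the trailing "X" above is excluded even though part of its box overlaps the cell.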

### Deferred (planned for v0.3.0 — Phase 1.3.D)

- `StrategyText`: infer table edges from word alignment (clusters of
words sharing x0 / x1 / centre, clusters of words sharing top /
bottom). Useful for PDFs whose tables have no ruled lines (e.g.
banking statements, scanned-then-OCR'd documents).
- `StrategyExplicit`: caller-supplied edges via
`TableSettings.ExplicitVerticalLines` /
`ExplicitHorizontalLines`. In v0.2.0 these settings are accepted
and added on top of the derived edges (helpful when a column
boundary isn't drawn), but they don't form the only source of
edges yet.

### Known limitations

- The cell-text extraction shares the v0.1.x word-grouping engine,
which depends on font metrics. Cells whose glyphs use standard-14
fonts WITHOUT the bundled AFM tables can have inter-word gaps
reported as "no gap", so "Hello World" comes out as "HelloWorld".
This was already documented for v0.1.0; for v0.2.0 it means the
parity test against `la-precinct-bulletin-2014-p1.pdf` (which uses
Helvetica-Bold) fails on cell text equality. The fixture is not
checked in, to avoid CI noise; it'll be re-added once the AFM
bundle lands in v0.2.x.
- `senate-expenditures.pdf` produces 7 cells where pdfplumber finds
10. The divergence is in how snap+join unifies edges that share a
near-collinear endpoint but differ slightly in the perpendicular
axis; under investigation as a follow-up issue. The fixture is
not in the golden set yet.

## [0.1.1] - 2026-05-27

### Fixed
@@ -127,6 +219,7 @@ Initial release. Phase 1.3.A — content-stream primitives layer.
- Type 3 fonts (their glyph procedures are themselves content streams).
- Vertical writing mode.

[0.2.0]: https://github.com/hallelx2/pdftable/releases/tag/v0.2.0
[0.1.1]: https://github.com/hallelx2/pdftable/releases/tag/v0.1.1
[0.1.0]: https://github.com/hallelx2/pdftable/releases/tag/v0.1.0
[0.0.1]: https://github.com/hallelx2/pdftable/releases/tag/v0.0.1
132 changes: 115 additions & 17 deletions README.md
@@ -19,9 +19,10 @@ heuristics on. This is that.

## Status

`v0.1.0` — words and text extraction. `Page.Words`, `Page.ExtractText`,
and `Page.ExtractTextSimple` ship with this release; table-finding
(`FindTables`, `ExtractTables`) is the next phase.
`v0.2.0` — line-strategy table finding. `Page.FindTables` and
`Page.ExtractTables` ship with this release covering the `lines` and
`lines_strict` strategies (PDFs with ruled tables). `text` and
`explicit` strategies return `ErrUnsupported` and land in v0.3.0.

[![Go Reference](https://pkg.go.dev/badge/github.com/hallelx2/pdftable.svg)](https://pkg.go.dev/github.com/hallelx2/pdftable)
[![CI](https://github.com/hallelx2/pdftable/actions/workflows/test.yml/badge.svg)](https://github.com/hallelx2/pdftable/actions/workflows/test.yml)
@@ -30,7 +31,7 @@ and `Page.ExtractTextSimple` ship with this release; table-finding
## Install

```sh
go get github.com/hallelx2/pdftable@v0.1.0
go get github.com/hallelx2/pdftable@v0.2.0
```

Requires Go 1.25+ (uses the standard-library `iter` package for the `Pages()` range-over-func iterator, and pdfcpu v0.12+).
@@ -111,6 +112,10 @@ type Page interface {
Words(opts WordOpts) ([]Word, error)
ExtractText(opts TextOpts) (string, error)
ExtractTextSimple(xTolerance, yTolerance float64) (string, error)

// New in v0.2.0: line-strategy table finding.
FindTables(settings TableSettings) ([]TableFinder, error)
ExtractTables(settings TableSettings) ([]*Table, error)
}

// Primitives.
@@ -206,6 +211,91 @@ laid, _ := page.ExtractText(opts)
fmt.Println(laid)
```

## Tables (lines strategy)

`Page.ExtractTables` is the table-detection entry point. It runs the
edges → intersections → cells → tables pipeline (a direct port of
pdfplumber's `TableFinder`) and returns one `*Table` per detected
ruled table, with cell text already extracted.

```go
doc, _ := pdftable.OpenFile("invoice.pdf")
defer doc.Close()
page, _ := doc.Page(1)

settings := pdftable.DefaultTableSettings()
// settings.VerticalStrategy = pdftable.StrategyLinesStrict // ignore rect outlines

tables, _ := page.ExtractTables(settings)
for ti, t := range tables {
	cols := 0
	if len(t.Rows) > 0 { // guard: never index Rows[0] on an empty table
		cols = len(t.Rows[0])
	}
	fmt.Printf("table %d: %d rows × %d cols at %+v\n",
		ti, len(t.Rows), cols, t.BBox)
	for _, row := range t.Rows {
		fmt.Println(row)
	}
}
```
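
The edges → intersections step of that pipeline can be illustrated with a small standalone sketch. It follows the general shape of pdfplumber's edge-crossing test (a vertical edge crosses a horizontal edge when each spans the other's position, within a tolerance), but `Edge`, `Point`, and `intersections` are hypothetical names, not pdftable's API. It assumes every vertical edge has `Y0 <= Y1` and every horizontal edge has `X0 <= X1`.

```go
package main

import "fmt"

// Edge is a hypothetical axis-aligned segment. A vertical edge has
// X0 == X1; a horizontal edge has Y0 == Y1.
type Edge struct{ X0, Y0, X1, Y1 float64 }

// Point is one crossing of a vertical and a horizontal edge.
type Point struct{ X, Y float64 }

// intersections returns every point where a vertical edge crosses a
// horizontal edge, allowing tol units of slack at the edge ends.
func intersections(vs, hs []Edge, tol float64) []Point {
	var pts []Point
	for _, v := range vs {
		for _, h := range hs {
			spansY := v.Y0 <= h.Y0+tol && v.Y1 >= h.Y0-tol
			spansX := v.X0 >= h.X0-tol && v.X0 <= h.X1+tol
			if spansY && spansX {
				pts = append(pts, Point{v.X0, h.Y0})
			}
		}
	}
	return pts
}

func main() {
	// A single ruled box: two vertical and two horizontal edges.
	vs := []Edge{{0, 0, 0, 10}, {20, 0, 20, 10}}
	hs := []Edge{{0, 0, 20, 0}, {0, 10, 20, 10}}
	fmt.Println(intersections(vs, hs, 3))
}
```

The four corners of the box are the intersections; a later stage would pair them up into cells by checking that adjacent corners are connected by an edge.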

`TableSettings` defaults match pdfplumber's
(`snap_tolerance=3`, `join_tolerance=3`, `edge_min_length=3`,
`intersection_tolerance=3`, `text_tolerance=3`). Override any field
on the value returned from `DefaultTableSettings()` to tighten or
loosen the heuristics. The two implemented strategies are:

- `StrategyLines` — edges come from drawn `Line` segments, `Rect`
outlines (all four sides), and axis-aligned `Curve` segments.
Default. Best for typical PDFs whose tables have rule lines.
- `StrategyLinesStrict` — only drawn `Line` segments are used. Use
this when your PDF draws cell BACKGROUNDS as filled rectangles
that you do NOT want treated as row boundaries.
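
To make the effect of `snap_tolerance` concrete, here is a rough one-dimensional sketch of tolerance-based snapping: nearby coordinates are clustered, then each is moved to its cluster mean. This follows the approach pdfplumber is generally understood to use, but `clusterValues` and `snapValues` are hypothetical names rather than the `internal/layout` API, and the real pipeline operates on whole edges, not bare numbers.

```go
package main

import (
	"fmt"
	"sort"
)

// clusterValues sorts vals and groups them so that each value is
// within tol of the previous value in its cluster.
func clusterValues(vals []float64, tol float64) [][]float64 {
	if len(vals) == 0 {
		return nil
	}
	sorted := append([]float64(nil), vals...)
	sort.Float64s(sorted)
	clusters := [][]float64{{sorted[0]}}
	for _, v := range sorted[1:] {
		last := clusters[len(clusters)-1]
		if v <= last[len(last)-1]+tol {
			clusters[len(clusters)-1] = append(last, v)
		} else {
			clusters = append(clusters, []float64{v})
		}
	}
	return clusters
}

// snapValues maps every input value to the mean of its cluster.
func snapValues(vals []float64, tol float64) map[float64]float64 {
	out := make(map[float64]float64, len(vals))
	for _, c := range clusterValues(vals, tol) {
		sum := 0.0
		for _, v := range c {
			sum += v
		}
		mean := sum / float64(len(c))
		for _, v := range c {
			out[v] = mean
		}
	}
	return out
}

func main() {
	// x positions of four vertical edges; the first three nearly coincide.
	xs := []float64{100, 101, 102, 200}
	snapped := snapValues(xs, 3)
	for _, x := range xs {
		fmt.Printf("%v -> %v\n", x, snapped[x])
	}
}
```

Because clustering chains neighbours together, a larger tolerance can merge rules that were drawn as separate lines, while a smaller one can leave a slightly misaligned border split into several edges.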

`StrategyText` (word-alignment-based) and `StrategyExplicit`
(caller-supplied edges) return `ErrUnsupported` in v0.2.0 — they
land in v0.3.0.

### Side-by-side: pdfplumber → pdftable

```python
# Python (pdfplumber)
import pdfplumber

with pdfplumber.open("invoice.pdf") as pdf:
    page = pdf.pages[0]
    for table in page.find_tables({"vertical_strategy": "lines",
                                   "horizontal_strategy": "lines"}):
        for row in table.extract():
            print(row)
```

```go
// Go (pdftable)
import "github.com/hallelx2/pdftable"

doc, _ := pdftable.OpenFile("invoice.pdf")
defer doc.Close()
page, _ := doc.Page(1)

settings := pdftable.DefaultTableSettings()
settings.VerticalStrategy = pdftable.StrategyLines
settings.HorizontalStrategy = pdftable.StrategyLines

tables, _ := page.ExtractTables(settings)
for _, t := range tables {
	for _, row := range t.Rows {
		fmt.Println(row)
	}
}
```

The two outputs match cell-for-cell on ruled fixtures (see
`testdata/golden/issue-466-example.*` for the parity test). The APIs
differ in three places: pdftable returns a slice of `*Table` with
text already extracted, where pdfplumber returns `Table` objects
that you call `.extract()` on; rows are `[]string` instead of
`list[Optional[str]]` (missing cells produce `""` rather than
`None`); and table bboxes use `(X0, Y0, X1, Y1)` in PDF user space
rather than pdfplumber's top-down `(x0, top, x1, bottom)`.
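
When comparing bboxes across the two libraries, the coordinate difference has to be undone first. The following sketch converts a PDF user-space box (origin at the bottom-left, y growing upward) to pdfplumber's top-based form. `UserBBox`, `TopBBox`, and `toTopBased` are hypothetical helpers, not part of pdftable, and the conversion assumes an unrotated page whose origin is at (0, 0) and a box with `Y0 <= Y1`.

```go
package main

import "fmt"

// UserBBox is in PDF user space: origin bottom-left, y grows upward.
type UserBBox struct{ X0, Y0, X1, Y1 float64 }

// TopBBox follows pdfplumber's convention: top and bottom are
// distances measured downward from the top of the page.
type TopBBox struct{ X0, Top, X1, Bottom float64 }

// toTopBased flips the y axis around the page height. The upper edge
// of the box (Y1) becomes Top, and the lower edge (Y0) becomes Bottom.
func toTopBased(b UserBBox, pageHeight float64) TopBBox {
	return TopBBox{
		X0:     b.X0,
		Top:    pageHeight - b.Y1,
		X1:     b.X1,
		Bottom: pageHeight - b.Y0,
	}
}

func main() {
	// A box on a US Letter page (792 points tall).
	b := UserBBox{X0: 72, Y0: 600, X1: 300, Y1: 700}
	fmt.Printf("%+v\n", toTopBased(b, 792))
}
```

The x coordinates pass through unchanged; only the vertical axis is mirrored.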

## Side-by-side comparison with pdfplumber

```python
@@ -299,16 +389,21 @@ pdftable/
├── page.go // Page interface + implementation
├── char.go // Public Char / Line / Rect / Curve / Objects
├── text.go // Word + ExtractText + ExtractTextSimple (v0.1.0)
├── table.go // TableStrategy / TableSettings / Table types (v0.2.0)
├── finder.go // Cells-from-edges algorithm (v0.2.0)
├── clustering.go // 1-D clusterObjects, groupObjectsByAttr, dedupeChars
├── geometry.go // BBox helpers: Union, Intersect, Contains, Snap
├── errors.go // Sentinel errors
└── internal/pdf/
├── reader.go // pdfcpu bridge
├── content.go // Content-stream interpreter
├── ops.go // Operator dispatch table
├── state.go // Graphics + text state, matrix math
├── font.go // Font + encoding tables + glyph-name resolution
└── cmap.go // ToUnicode CMap parser
└── internal/
├── layout/
│ └── lines.go // Edge type + snap/join/filter pipeline (v0.2.0)
└── pdf/
├── reader.go // pdfcpu bridge
├── content.go // Content-stream interpreter
├── ops.go // Operator dispatch table
├── state.go // Graphics + text state, matrix math
├── font.go // Font + encoding tables + glyph-name resolution
└── cmap.go // ToUnicode CMap parser
```

The public `pdftable` package is small and stable. The `internal/pdf`
@@ -333,12 +428,15 @@ stdlib-only.

- `v0.0.x` — content-stream primitives.
- `v0.1.x` — text extraction: `Page.ExtractText`, `Page.Words`,
`Page.ExtractTextSimple` (this release).
- `v0.2.x` — table finding: `Page.FindTables` using ruling-line +
whitespace heuristics, `Page.ExtractTables` returning row/cell text.
Bundle the standard-14 AFM metrics so word bboxes match pdfplumber
to within 1 PDF point.
- `v0.3.x` — performance pass: parser benchmarking against pdfminer.six
`Page.ExtractTextSimple`.
- `v0.2.x` — table finding via ruling lines (this release):
`Page.FindTables` / `Page.ExtractTables` covering the `lines` and
`lines_strict` strategies.
- `v0.3.x` — remaining table strategies: `text` (word-alignment
edges) and `explicit` (caller-supplied edges). Bundle the
standard-14 AFM metrics so word bboxes (and therefore cell text)
match pdfplumber to within 1 PDF point on standard fonts.
- `v0.4.x` — performance pass: parser benchmarking against pdfminer.six
and pdfplumber on a representative document corpus.

## License
65 changes: 65 additions & 0 deletions examples/extract_tables/main.go
@@ -0,0 +1,65 @@
// Copyright (c) 2026 Halleluyah Oludele
// Licensed under the MIT License.

// examples/extract_tables/main.go is the runnable form of the
// README's "Tables (lines strategy)" example. It exists so that
// changes to the public API surface break the example at build time
// rather than letting a stale snippet drift in the README.
//
// Run from the repo root:
//
// go run ./examples/extract_tables testdata/golden/issue-466-example.pdf
//
// The example calls ExtractTables with default settings (which
// select the "lines" strategy on both axes). It prints each detected
// table's row and column counts and bounding box, then each row as a
// flat slice, mirroring the snippet documented in README.md.
package main

import (
	"fmt"
	"log"
	"os"

	"github.com/hallelx2/pdftable"
)

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: extract_tables <file.pdf>")
		os.Exit(2)
	}
	path := os.Args[1]

	doc, err := pdftable.OpenFile(path)
	if err != nil {
		log.Fatalf("OpenFile %s: %v", path, err)
	}
	defer doc.Close()

	page, err := doc.Page(1)
	if err != nil {
		log.Fatalf("Page(1): %v", err)
	}

	settings := pdftable.DefaultTableSettings()
	// Uncomment to ignore Rect outlines (filled cell backgrounds
	// that aren't real row boundaries):
	// settings.VerticalStrategy = pdftable.StrategyLinesStrict

	tables, err := page.ExtractTables(settings)
	if err != nil {
		log.Fatalf("ExtractTables: %v", err)
	}
	for ti, t := range tables {
		cols := 0
		if len(t.Rows) > 0 {
			cols = len(t.Rows[0])
		}
		fmt.Printf("table %d: %d rows × %d cols at %+v\n",
			ti, len(t.Rows), cols, t.BBox)
		for _, row := range t.Rows {
			fmt.Println(row)
		}
	}
}