feat(manifest): ✨ add per-file column statistics for codec-agnostic pruning #103

Merged: justapithecus merged 1 commit into main from andrew/feat/manifest/file-stats on Feb 7, 2026

@justapithecus
Owner

Summary

Adds per-file column statistics to manifest FileRefs, enabling external readers to prune files without opening them. The design is codec-agnostic via an optional StatisticalCodec interface; the Parquet codec is the first implementation.
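
To make the pruning idea concrete, here is a minimal sketch of how an external reader might skip files using per-file min/max values. The struct fields (`Min`, `Max`, `NullCount`, `Columns`, `Path`) and the helpers `mayContain` and `prune` are assumptions made for illustration; the PR only names the `FileStats`, `ColumnStats`, and `FileRef` types, and the real layout in api.go may differ.

```go
package main

import "fmt"

// ColumnStats is a hypothetical shape for per-column statistics.
// The field names here are illustrative, not taken from the PR.
type ColumnStats struct {
	Min       *int64 // nil when the codec reported no minimum
	Max       *int64 // nil when the codec reported no maximum
	NullCount int64
}

// FileStats and FileRef are likewise illustrative stand-ins.
type FileStats struct {
	Columns map[string]ColumnStats
}

type FileRef struct {
	Path  string
	Stats *FileStats // nil for writes that produce no stats (e.g. JSONL)
}

// mayContain reports whether a file could hold a row where col == v.
// It returns true whenever stats are missing, so pruning stays conservative.
func mayContain(f FileRef, col string, v int64) bool {
	if f.Stats == nil {
		return true
	}
	cs, ok := f.Stats.Columns[col]
	if !ok || cs.Min == nil || cs.Max == nil {
		return true
	}
	return v >= *cs.Min && v <= *cs.Max
}

// prune keeps only the files that might match col == v.
func prune(files []FileRef, col string, v int64) []FileRef {
	var kept []FileRef
	for _, f := range files {
		if mayContain(f, col, v) {
			kept = append(kept, f)
		}
	}
	return kept
}

func i64(v int64) *int64 { return &v }

func main() {
	files := []FileRef{
		{Path: "a.parquet", Stats: &FileStats{Columns: map[string]ColumnStats{"id": {Min: i64(1), Max: i64(100)}}}},
		{Path: "b.parquet", Stats: &FileStats{Columns: map[string]ColumnStats{"id": {Min: i64(101), Max: i64(200)}}}},
		{Path: "c.jsonl"}, // no stats: must always be read
	}
	for _, f := range prune(files, "id", 150) {
		fmt.Println(f.Path)
	}
}
```

The key design point in this sketch is that absent stats never exclude a file: a reader can only prune when the manifest gives it positive evidence that a value is out of range.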

Highlights

  • New FileStats and ColumnStats types on FileRef (omitempty for backward compat)
  • New StatisticalCodec and StatisticalStreamEncoder opt-in interfaces
  • Parquet codec reports min/max for orderable types (int32, int64, float32, float64, string, timestamp) and null count for all columns
  • JSONL and raw blob writes produce no stats (FileRef.Stats is nil)
  • Write path wired in both writeDataFile (batch) and StreamWriteRecords (streaming)
  • 14 new tests: codec stats, write-path wiring, JSON round-trip, backward compat
  • Contracts updated: CONTRACT_CORE.md, CONTRACT_PARQUET.md, CONTRACT_WRITE_API.md
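
As a rough illustration of the opt-in design and the `omitempty` behaviour listed above, the sketch below uses a type assertion to detect a stats-capable codec and shows that a codec without stats yields a manifest entry with no `stats` key. The JSON tags, the `Codec` base interface, the `Stats()` method, and the `jsonlCodec`/`parquetCodec` stand-ins are all assumptions; only the names `FileRef`, `FileStats`, `ColumnStats`, and `StatisticalCodec` come from the PR.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Hypothetical manifest types; the JSON tags are guesses.
type ColumnStats struct {
	Min       any   `json:"min,omitempty"`
	Max       any   `json:"max,omitempty"`
	NullCount int64 `json:"null_count"`
}

type FileStats struct {
	Columns map[string]ColumnStats `json:"columns"`
}

type FileRef struct {
	Path  string     `json:"path"`
	Stats *FileStats `json:"stats,omitempty"` // nil pointer is omitted from JSON
}

// Codec stands in for the library's base codec interface.
type Codec interface {
	Name() string
}

// StatisticalCodec is the opt-in extension; this method set is assumed.
type StatisticalCodec interface {
	Codec
	Stats() *FileStats
}

type jsonlCodec struct{}

func (jsonlCodec) Name() string { return "jsonl" }

type parquetCodec struct{ stats *FileStats }

func (parquetCodec) Name() string        { return "parquet" }
func (p parquetCodec) Stats() *FileStats { return p.stats }

// statsFor returns the codec's stats if it opted in, else nil.
func statsFor(c Codec) *FileStats {
	if sc, ok := c.(StatisticalCodec); ok {
		return sc.Stats()
	}
	return nil
}

// encodeRef marshals the manifest entry a write with codec c would produce.
func encodeRef(path string, c Codec) string {
	b, err := json.Marshal(FileRef{Path: path, Stats: statsFor(c)})
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	pq := parquetCodec{stats: &FileStats{Columns: map[string]ColumnStats{
		"id": {Min: 1, Max: 100, NullCount: 0},
	}}}
	fmt.Println(encodeRef("a.jsonl", jsonlCodec{}))
	fmt.Println(encodeRef("a.parquet", pq))
}
```

Because `Stats` is a pointer tagged `omitempty`, manifests written before this change and manifests written by codecs that never opt in serialize identically, which is one way the backward-compatibility claim could hold.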

Test plan

  • All 14 new stats tests pass
  • All existing tests still pass
  • golangci-lint clean
  • Examples compile
  • JSONL manifest JSON has no stats key; Parquet manifest JSON has stats with columns

🤖 Generated with Claude Code

…runing

Add FileStats and ColumnStats types to FileRef for per-file column
statistics. Codecs opt in via StatisticalCodec interface; Parquet is
the first implementation, reporting min/max for orderable types and
null count for all columns. JSONL and raw blob writes produce no stats.

- New types: FileStats, ColumnStats (api.go)
- New interfaces: StatisticalCodec, StatisticalStreamEncoder (api.go)
- Stats field on FileRef with omitempty for backward compat
- Parquet accumulates stats during Encode via table-driven comparison
- Write path wired in writeDataFile and StreamWriteRecords
- 14 new tests covering codec stats, write-path wiring, and JSON round-trip
- Contracts updated: CONTRACT_CORE, CONTRACT_PARQUET, CONTRACT_WRITE_API
- PUBLIC_API.md updated with new interfaces, types, and usage example
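
One possible reading of "table-driven comparison" is a lookup table from value kind to a less-than function, consulted once per value while encoding. The sketch below follows that reading; it is not the PR's code. The names `comparators`, `colAcc`, `observe`, and `accumulate` are hypothetical, the table covers only a subset of the orderable types the PR lists, and it assumes each column holds a single Go type and ignores NaN handling.

```go
package main

import (
	"fmt"
	"reflect"
)

// less compares two values of the same dynamic type.
type less func(a, b any) bool

// comparators is the lookup table: kinds absent from it are treated as
// non-orderable, so only their null count is tracked.
var comparators = map[reflect.Kind]less{
	reflect.Int64:   func(a, b any) bool { return a.(int64) < b.(int64) },
	reflect.Float64: func(a, b any) bool { return a.(float64) < b.(float64) },
	reflect.String:  func(a, b any) bool { return a.(string) < b.(string) },
}

// colAcc accumulates statistics for one column.
type colAcc struct {
	min, max  any
	nullCount int64
}

// observe folds a single value into the running statistics.
func (c *colAcc) observe(v any) {
	if v == nil {
		c.nullCount++
		return
	}
	lt, ok := comparators[reflect.TypeOf(v).Kind()]
	if !ok {
		return // non-orderable type: min/max stay unset
	}
	if c.min == nil || lt(v, c.min) {
		c.min = v
	}
	if c.max == nil || lt(c.max, v) {
		c.max = v
	}
}

// accumulate walks all records once, updating one accumulator per column.
func accumulate(records []map[string]any) map[string]*colAcc {
	out := map[string]*colAcc{}
	for _, rec := range records {
		for name, v := range rec {
			acc, ok := out[name]
			if !ok {
				acc = &colAcc{}
				out[name] = acc
			}
			acc.observe(v)
		}
	}
	return out
}

func main() {
	stats := accumulate([]map[string]any{
		{"id": int64(3), "name": "carol", "ok": true},
		{"id": int64(1), "name": nil, "ok": nil},
		{"id": int64(2), "name": "alice", "ok": false},
	})
	for _, col := range []string{"id", "name", "ok"} {
		s := stats[col]
		fmt.Println(col, s.min, s.max, s.nullCount)
	}
}
```

Supporting another orderable type in this sketch means adding one table entry rather than another branch in `observe`, which is the usual motivation for a table-driven approach.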

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@justapithecus merged commit 344f2e7 into main on Feb 7, 2026
5 checks passed
@justapithecus deleted the andrew/feat/manifest/file-stats branch on February 7, 2026 02:54