docs: restructure README to lead with primary surface and use cases by clee704 · Pull Request #63 · clee704/parquet-analyzer

clee704 · 2026-06-11T04:05:20Z

What & why

Restructures README.md to lead with the tool's value proposition and
audience — AI agents and humans verifying encoding-level parquet behavior —
with a verb-noun example in the first paragraph and the primary CLI surface
(verb-noun subcommands) clearly introduced before the legacy modes.

The previous README led with a generic one-liner and an HTML report link,
with the subcommand surface buried halfway down. This underrepresented what
the tool is now primarily useful for.

Key changes:

- First screen: a 3-line verb-noun example block and a short paragraph establish the tool's purpose and JSON/AI-agent-friendly design before any prose sections.
- "What you can do" section: 5 concrete investigation flows (encoding audit, byte-range lookup, top-down navigation, page decode, file validation), each with a real command example and a link to the subcommand reference.
- "Compared to" section: explicit positioning vs parquet-tools, pqrs, the pyarrow.parquet library, and DuckDB parquet_metadata().
- CLI Reference reorganized: verb-noun subcommands first (primary surface), legacy --output-mode modes collapsed into a "Legacy whole-file modes" subsection (secondary surface).
- All examples run on real data: every code block uses actual titanic.parquet output (or a small pyarrow-generated file for the ARROW:schema KV example); no '...' placeholder tokens.
- New "Design" section: points to docs/output-principles.md and docs/tree-schema.md for contributors and sophisticated consumers.
- Preserved content: Library API, decoder usage, dev/build instructions, benchmarks, footer cache, and Thrift details are all retained and reorganized; Development moved to the bottom.

Closes #23

Refactoring checkpoint

I ran the refactoring checkpoint on the touched code and its design context.
Fixed in this PR: N/A — docs-only change; no code touched.
Deferred (issue links): none.

Dogfooding

I drove the change on realistic data and assessed performance + UX (or: this PR is exempt — internal/docs/test-only).
Scenarios exercised: Ran every verb-noun subcommand (file summary, file kv, file schema, file validate, rowgroup list, rowgroup show, column list, column show, page list, page header, page decode) against tests/data/titanic.parquet; also generated a small pyarrow file to capture the ARROW:schema KV key output. All real output pasted directly — no manual editing.
Perf / UX findings: Docs-only PR; no behavior changed.
Fixed / deferred: none.

Tests

hatch run dev:check is green (format, lint, type-check, tests, per-module 95% coverage).
New / changed code paths assert observable behavior, not just execute lines. (N/A — docs only.)

Reorganize the README to lead with the tool's value proposition and audience (AI agents and humans verifying encoding-level parquet behavior), with verb-noun subcommand examples in the first screen.

Changes:
- First paragraph states purpose and audience clearly; adds a 3-line verb-noun example before any prose
- New 'What you can do' section with 5 concrete investigation flows (encoding audit, byte-range lookup, top-down navigation, page decode, file validation), each linked to the relevant subcommand section
- New 'Compared to' section positioning the tool vs parquet-tools, pqrs, pyarrow.parquet library, and DuckDB parquet_metadata()
- CLI Reference reorganized: verb-noun subcommands first (primary surface), legacy --output-mode modes collapsed into a 'Legacy whole-file modes' subsection (secondary surface)
- All examples updated to use actual titanic.parquet output; removed placeholder '...' tokens throughout
- New 'Design' section pointing to docs/output-principles.md and docs/tree-schema.md
- Existing Library API, decoder usage, dev instructions, benchmarks, and technical details preserved and reorganized; Development section moved to the bottom

Closes #23

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Widen tagline to 'CLI and Python library' so Python library consumers see the tool is relevant on first glance
- Add flow 6 in 'What you can do': library API usage with a real runnable snippet against titanic.parquet showing lazy open, column walk, page.decode() RLE run inspection, and physical_values()
- Move '## Library API' above '### Legacy whole-file modes' so the primary Python API surface sits above the secondary legacy CLI modes in the reference section

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

clee704 and others added 2 commits June 11, 2026 04:04

clee704 wants to merge 2 commits into master from docs/readme-restructure-23
