
docs: restructure README to lead with primary surface and use cases#63

Draft
clee704 wants to merge 2 commits into master from docs/readme-restructure-23


clee704 (Owner) commented Jun 11, 2026

What & why

Restructures README.md to lead with the tool's value proposition and
audience — AI agents and humans verifying encoding-level parquet behavior —
with a verb-noun example in the first paragraph and the primary CLI surface
(verb-noun subcommands) clearly introduced before the legacy modes.

The previous README led with a generic one-liner and an HTML report link,
with the subcommand surface buried halfway down. This underrepresented what
the tool is now primarily useful for.

Key changes:

  • First screen: a 3-line verb-noun example block and a short paragraph
    establish the tool's purpose and JSON/AI-agent-friendly design before any
    prose sections.
  • "What you can do" section: 5 concrete investigation flows (encoding
    audit, byte-range lookup, top-down navigation, page decode, file
    validation), each with a real command example and a link to the subcommand
    reference.
  • "Compared to" section: explicit positioning vs parquet-tools, pqrs,
    pyarrow.parquet library, and DuckDB parquet_metadata().
  • CLI Reference reorganized: verb-noun subcommands first (primary
    surface), legacy --output-mode modes collapsed into a "Legacy
    whole-file modes" subsection (secondary surface).
  • All examples run on real data: every code block uses actual
    titanic.parquet output (or a small pyarrow-generated file for the
    ARROW:schema KV example); no ... placeholder tokens.
  • New "Design" section: points to docs/output-principles.md and
    docs/tree-schema.md for contributors and sophisticated consumers.
  • Preserved content: Library API, decoder usage, dev/build
    instructions, benchmarks, footer cache, and Thrift details are all
    retained and reorganized; Development moved to the bottom.
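
For readers unfamiliar with what "byte-range lookup" and "file validation" mean at the format level, here is a minimal sketch of the Parquet framing those flows build on. It is not this tool's implementation. `footer_byte_range` is a hypothetical helper, and the demo file is synthetic: it has correct `PAR1` framing and a placeholder footer, so it is not a readable parquet file. The sketch assumes an unencrypted footer.

```python
import os
import struct
import tempfile
from pathlib import Path

MAGIC = b"PAR1"  # 4-byte magic at the start and end of a plaintext-footer parquet file


def footer_byte_range(path):
    """Return (start, end) byte offsets of the footer metadata, end exclusive.

    Hypothetical helper for illustration only; it checks framing, not content.
    """
    size = Path(path).stat().st_size
    if size < 12:  # leading magic + footer length + trailing magic
        raise ValueError("file too small to be parquet")
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-8, os.SEEK_END)
        tail = f.read(8)
    if head != MAGIC or tail[4:] != MAGIC:
        raise ValueError("missing PAR1 magic")
    # The 4 bytes before the trailing magic hold the footer length (little-endian).
    (footer_len,) = struct.unpack("<I", tail[:4])
    end = size - 8
    start = end - footer_len
    if start < 4:
        raise ValueError("footer length exceeds file size")
    return start, end


# Synthetic file: magic, 16 bytes of fake page data, a 6-byte fake footer,
# the footer length, and the trailing magic.
fake_footer = b"FOOTER"
demo_path = Path(tempfile.mkdtemp()) / "framing_demo.bin"
demo_path.write_bytes(
    MAGIC + b"\x00" * 16 + fake_footer + struct.pack("<I", len(fake_footer)) + MAGIC
)

demo_range = footer_byte_range(demo_path)
print(demo_range)  # (20, 26)
```

On a real file, the returned range covers the Thrift-encoded file metadata, which is where row group and column chunk offsets are recorded.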

Closes #23

Refactoring checkpoint

  • I ran the refactoring checkpoint on the touched code and its design context.
  • Fixed in this PR: N/A — docs-only change; no code touched.
  • Deferred (issue links): none.

Dogfooding

  • I drove the change on realistic data and assessed performance + UX (or: this PR is exempt — internal/docs/test-only).
  • Scenarios exercised: Ran every verb-noun subcommand (file summary, file kv, file schema, file validate, rowgroup list, rowgroup show, column list, column show, page list, page header, page decode) against tests/data/titanic.parquet; also generated a small pyarrow file to capture the ARROW:schema KV key output. All real output pasted directly — no manual editing.
  • Perf / UX findings: Docs-only PR; no behavior changed.
  • Fixed / deferred: none.

Tests

  • hatch run dev:check is green (format, lint, type-check, tests, per-module 95% coverage).
  • New / changed code paths assert observable behavior, not just execute lines. (N/A — docs only.)

clee704 and others added 2 commits June 11, 2026 04:04
Reorganize the README to lead with the tool's value proposition and
audience (AI agents and humans verifying encoding-level parquet behavior),
with verb-noun subcommand examples in the first screen.

Changes:
- First paragraph states purpose and audience clearly; adds a 3-line
  verb-noun example before any prose
- New 'What you can do' section with 5 concrete investigation flows
  (encoding audit, byte-range lookup, top-down navigation, page decode,
  file validation), each linked to the relevant subcommand section
- New 'Compared to' section positioning the tool vs parquet-tools,
  pqrs, pyarrow.parquet library, and DuckDB parquet_metadata()
- CLI Reference reorganized: verb-noun subcommands first (primary
  surface), legacy --output-mode modes collapsed into a 'Legacy
  whole-file modes' subsection (secondary surface)
- All examples updated to use actual titanic.parquet output; removed
  placeholder '...' tokens throughout
- New 'Design' section pointing to docs/output-principles.md and
  docs/tree-schema.md
- Existing Library API, decoder usage, dev instructions, benchmarks,
  and technical details preserved and reorganized; Development section
  moved to the bottom

Closes #23

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Widen tagline to 'CLI and Python library' so Python library consumers
  see the tool is relevant on first glance
- Add flow 6 in 'What you can do': library API usage with a real
  runnable snippet against titanic.parquet showing lazy open, column
  walk, page.decode() RLE run inspection, and physical_values()
- Move '## Library API' above '### Legacy whole-file modes' so the
  primary Python API surface sits above the secondary legacy CLI modes
  in the reference section

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Linked issue (may be closed on merge): docs: restructure README to lead with primary surface + use cases