Skip to content

Address review feedback#18

Merged
vinniefalco merged 14 commits into
cppalliance:masterfrom
gregjkal:gk/review-fixes
Apr 16, 2026
Merged

Address review feedback#18
vinniefalco merged 14 commits into
cppalliance:masterfrom
gregjkal:gk/review-fixes

Conversation

@gregjkal
Copy link
Copy Markdown
Contributor

Summary

Fixes flagged by the Apr 14 agent review (tomd/review.md, tomd/lib-review.md) plus packaging, CI, and test coverage so the project is installable and gated. One follow-up wave (sourced from the Apr 15 agent review in tomd/review.md) is deliberately deferred, see "Not in this PR" below.

What's in it

Bug fixes (user-visible)

  • Nested-list duplication (HTML): renderer doubled text in nested lists; sublists are now detached before inline capture.
  • Dehyphenation single-span (PDF): dehyphenation duplicated a word when the next line was a single fully-consumed span.
  • Spatial sort stability: spatial extraction sorted by y only; now sorts by y-band then x for deterministic reading order.
  • similar() short-circuit: returned False for identical strings over 200 chars, breaking TOC match on long titles.
  • Hidden-regions API fallback: resolved by removing the try/except AttributeError around page.get_texttrace(). With pymupdf~=1.27.0 pinned, the pin is the contract; silent degradation would hide dep drift.
  • Nested <table> rows: HTML renderer's find_all("tr") flattened nested rows into the parent.

Duplication / contract fidelity

  • Author-parsing tests: the shared parser had already landed; adds direct tests for the helper's pending-name pairing, trailing-name flush, blank-line skip, and callback injection.
  • Regex pattern dedup: doc-number and section-number shapes now live in one place (lib/__init__.py); lib/pdf/types.py builds labeled variants on them. The metadata-field pattern picks up SD-N coverage for free.
  • Block.font_size docstring corrected (line-count, not character-count); _extract_metadata stops mutating its input and uses replace() instead.

Coverage (previously untested areas)

  • Tests for get_edge_items, detect_repeating, strip_repeating.
  • Tests for find_hidden_regions and strip_hidden_blocks (plus a cleanup.py bug surfaced in the process).
  • Tests for _detect_lists_by_position, _split_section_by_position, _join_bullet_marker_lines.
  • Fixture-backed tests for Bikeshed, HackMD, and hand-written HTML: detection, metadata, render.

Infrastructure

  • Packaging: pyproject.toml + __init__.py so pip install -e tomd produces a working tomd console script. Pins pymupdf~=1.27.0 and beautifulsoup4~=4.14.0 to patch releases. python tomd/main.py is dropped in favor of tomd paper.pdf or python -m tomd.main.
  • LICENSE + README: BSL-1.0 LICENSE matches pyproject.toml; README expanded from install-only stub to cover usage, output format, uncertain-region markers, limitations, design, and development.
  • CI: .github/workflows/tomd-tests.yml with a pytest matrix on Python 3.12 and 3.13, scoped to tomd/** path changes. requires-python bumped to >=3.12.

Not in this PR

Follow-ups from the Apr 15 kimi-k2.5 review (tomd/review.md), to land separately:

  • _compute_bbox empty-list guard plus reuse from cleanup.py (eliminates a third inline min/max).
  • Rename classify_monospace to is_monospace (pure rename; matches the bool it sets).
  • Bound the reply-to continuation loop in extract_metadata_from_blocks so malformed page 0s cannot sweep body text into metadata["reply-to"].
  • Whitespace-regex consolidation decision: three overlapping multi-space patterns in cleanup.py, structure.py, and render.py. Either consolidate or document the span-vs-DOM distinction at each site.
  • Single-source the PDF metadata pathway documentation (three pathways merge in convert_pdf; precedence is currently undocumented).
  • Evaluate whether structure.py._word_similarity can share implementation with lib/similarity.py._jaccard_similarity; different denominators today.
  • Narrow (or document as won't-fix) the except Exception in collect_line_drawings.

Two kimi findings are scoped out: the __init__.py:133 font_counts guard is a false positive (empty Counter is falsy), and the doc-number regex-docs item is subsumed by the regex-dedup work above.

gregjkal and others added 14 commits April 15, 2026 18:10
Adds pyproject.toml, __init__.py, and a minimal README so `pip install -e tomd`
produces a working `tomd` console script. Pins pymupdf~=1.27 and
beautifulsoup4~=4.14 to protect against silent PyMuPDF API drift (get_texttrace,
get_text dict/rawdict, get_drawings). Switches main.py to relative imports so
tomd.main:main resolves under the console script; the legacy
`python tomd/main.py` invocation is dropped in favor of `tomd paper.pdf` or
`python -m tomd.main`.
papers/ is gitignored, so fresh clones saw 7 AssertionError failures
before any conversion logic ran. Replace the file-existence assert with
pytest.skip() so the tests stay granular (per-stem) and only run when
their input HTML is present.
Adds .github/workflows/tomd-tests.yml with a pytest matrix scoped to
tomd/** path changes. Bumps requires-python to >=3.12 to match the
tested range.
BSL-1.0 LICENSE file matches the license already declared in
pyproject.toml; README grows from an install-only stub to cover usage,
output, uncertain-region markers, limitations, design-doc links, and
development. Closes issues/15.
The doc-number shape lived in both lib/__init__.py (DOC_NUM_RE) and
lib/pdf/types.py (DOC_FIELD_RE), and the dotted-decimal section-number
shape lived in both SECTION_NUM_PREFIX_RE and SECTION_NUM_RE. The
tomd-specific rule "regex patterns for metadata fields must be defined
in one place" was violated; the pair of doc-number patterns also
disagreed on SD-N coverage.

Extract two core shape strings (DOC_NUM_PATTERN, SECTION_NUM_PATTERN)
in lib/__init__.py and rebuild the four callers on them.  DOC_FIELD_RE
picks up SD-N support for free.  No call-site changes: every existing
group(0)/group(1)/match/search semantic is preserved (verified by grep
for .groups() tuple unpacking; there are none). Closes issues/08.
The shared author state machine in lib/__init__.py was already
deduplicated (both lib/pdf/wg21.py and lib/html/extract.py delegate to
parse_author_lines with per-caller clean_line/skip_line callbacks). Only
the pattern-level coverage proposed in issues/07 was missing: the helper
is exercised today only through the callers' tests. Add tests/test_authors.py
covering the pending-name pairing, trailing-name flush, blank-line skip,
and clean_line/skip_line injection points. Closes issues/07.
With pymupdf and beautifulsoup4 pinned in requirements.txt and
pyproject.toml, the try/except AttributeError guard around
page.get_texttrace() in find_hidden_regions was defending against an
unreachable path. Silent degradation on a dep mismatch would hide real
bugs; the pin is the contract, so a missing API should surface as a
clear AttributeError.

Tightened the pins from ~=1.27 / ~=4.14 to ~=1.27.0 / ~=4.14.0 so that
only patch releases are accepted. The looser form actually allowed any
minor bump up to the next major, which contradicted the comment in
requirements.txt flagging PyMuPDF minor-version API drift as the
concern.
* tomd: Add tests for header footer functions

* hidden regions tests

*  position-based list tests

*  HTML generator fixtures

* PR feedback
@vinniefalco
Copy link
Copy Markdown
Member

Well this looks very nice !!

@vinniefalco vinniefalco merged commit be4f82f into cppalliance:master Apr 16, 2026
2 checks passed
gregjkal added a commit to gregjkal/wg21-papers that referenced this pull request Apr 16, 2026
Addresses the deferred items from PR cppalliance#18's "Not in this PR" section,
sourced from the Apr 15 kimi-k2.5 review.

- Promote _compute_bbox from extract.py to types.py as public
  compute_bbox. No silent empty-list guard: callers already enforce
  non-empty input, and a ValueError on empty is the correct contract
  (crash loudly, don't degrade). Replace the inline min/max duplicate
  in cleanup.py._join_cross_page with a call to the shared helper.

- Cap the reply-to continuation loop in wg21.py at
  REPLY_TO_CONTINUATION_CAP (5) blocks. On a malformed PDF where
  page 0 body text contains email addresses, the unbounded loop
  could sweep arbitrary content into metadata["reply-to"].
gregjkal added a commit to gregjkal/wg21-papers that referenced this pull request Apr 16, 2026
Addresses the deferred items from PR cppalliance#18's "Not in this PR" section,
sourced from the Apr 15 kimi-k2.5 review.

- Promote _compute_bbox from extract.py to types.py as public
  compute_bbox. No silent empty-list guard: callers already enforce
  non-empty input, and a ValueError on empty is the correct contract
  (crash loudly, don't degrade). Replace the inline min/max duplicate
  in cleanup.py._join_cross_page with a call to the shared helper.

- Cap the reply-to continuation loop in wg21.py at
  REPLY_TO_CONTINUATION_CAP (5) blocks. On a malformed PDF where
  page 0 body text contains email addresses, the unbounded loop
  could sweep arbitrary content into metadata["reply-to"].
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants