Address review feedback by gregjkal · Pull Request #18 · cppalliance/wg21-papers

gregjkal · 2026-04-15T22:29:20Z

Summary

Fixes flagged by the Apr 14 agent review (tomd/review.md, tomd/lib-review.md) plus packaging, CI, and test coverage so the project is installable and gated. One follow-up wave (sourced from the Apr 15 agent review in tomd/review.md) is deliberately deferred, see "Not in this PR" below.

What's in it

Bug fixes (user-visible)

Nested-list duplication (HTML): renderer doubled text in nested lists; sublists are now detached before inline capture.
Dehyphenation single-span (PDF): dehyphenation duplicated a word when the next line was a single fully-consumed span.
Spatial sort stability: spatial extraction sorted by y only; now sorts by y-band then x for deterministic reading order.
similar() short-circuit: returned False for identical strings over 200 chars, breaking TOC match on long titles.
Hidden-regions API fallback: resolved by removing the try/except AttributeError around page.get_texttrace(). With pymupdf~=1.27.0 pinned, the pin is the contract; silent degradation would hide dep drift.
Nested <table> rows: HTML renderer's find_all("tr") flattened nested rows into the parent.

Duplication / contract fidelity

Author-parsing tests: the shared parser had already landed; adds direct tests for the helper's pending-name pairing, trailing-name flush, blank-line skip, and callback injection.
Regex pattern dedup: doc-number and section-number shapes now live in one place (lib/__init__.py); lib/pdf/types.py builds labeled variants on them. The metadata-field pattern picks up SD-N coverage for free.
Block.font_size docstring corrected (line-count, not character-count); _extract_metadata stops mutating its input and uses replace() instead.

Coverage (previously untested areas)

Tests for get_edge_items, detect_repeating, strip_repeating.
Tests for find_hidden_regions and strip_hidden_blocks (plus a cleanup.py bug surfaced in the process).
Tests for _detect_lists_by_position, _split_section_by_position, _join_bullet_marker_lines.
Fixture-backed tests for Bikeshed, HackMD, and hand-written HTML: detection, metadata, render.

Infrastructure

Packaging: pyproject.toml + __init__.py so pip install -e tomd produces a working tomd console script. Pins pymupdf~=1.27.0 and beautifulsoup4~=4.14.0 to patch releases. python tomd/main.py is dropped in favor of tomd paper.pdf or python -m tomd.main.
LICENSE + README: BSL-1.0 LICENSE matches pyproject.toml; README expanded from install-only stub to cover usage, output format, uncertain-region markers, limitations, design, and development.
CI: .github/workflows/tomd-tests.yml with a pytest matrix on Python 3.12 and 3.13, scoped to tomd/** path changes. requires-python bumped to >=3.12.

Not in this PR

Follow-ups from the Apr 15 kimi-k2.5 review (tomd/review.md), to land separately:

_compute_bbox empty-list guard plus reuse from cleanup.py (eliminates a third inline min/max).
Rename classify_monospace to is_monospace (pure rename; matches the bool it sets).
Bound the reply-to continuation loop in extract_metadata_from_blocks so malformed page 0s cannot sweep body text into metadata["reply-to"].
Whitespace-regex consolidation decision: three overlapping multi-space patterns in cleanup.py, structure.py, and render.py. Either consolidate or document the span-vs-DOM distinction at each site.
Single-source the PDF metadata pathway documentation (three pathways merge in convert_pdf; precedence is currently undocumented).
Evaluate whether structure.py._word_similarity can share implementation with lib/similarity.py._jaccard_similarity; different denominators today.
Narrow (or document as won't-fix) the except Exception in collect_line_drawings.

Two kimi findings are scoped out: the __init__.py:133 font_counts guard is a false positive (empty Counter is falsy), and the doc-number regex-docs item is subsumed by the regex-dedup work above.

…ing order

…ct_metadata

…uplication

Adds pyproject.toml, __init__.py, and a minimal README so `pip install -e tomd` produces a working `tomd` console script. Pins pymupdf~=1.27 and beautifulsoup4~=4.14 to protect against silent PyMuPDF API drift (get_texttrace, get_text dict/rawdict, get_drawings). Switches main.py to relative imports so tomd.main:main resolves under the console script; the legacy `python tomd/main.py` invocation is dropped in favor of `tomd paper.pdf` or `python -m tomd.main`.

papers/ is gitignored, so fresh clones saw 7 AssertionError failures before any conversion logic ran. Replace the file-existence assert with pytest.skip() so the tests stay granular (per-stem) and only run when their input HTML is present.

Adds .github/workflows/tomd-tests.yml with a pytest matrix scoped to tomd/** path changes. Bumps requires-python to >=3.12 to match the tested range.

BSL-1.0 LICENSE file matches the license already declared in pyproject.toml; README grows from an install-only stub to cover usage, output, uncertain-region markers, limitations, design-doc links, and development. Closes issues/15.

The doc-number shape lived in both lib/__init__.py (DOC_NUM_RE) and lib/pdf/types.py (DOC_FIELD_RE), and the dotted-decimal section-number shape lived in both SECTION_NUM_PREFIX_RE and SECTION_NUM_RE. The tomd-specific rule "regex patterns for metadata fields must be defined in one place" was violated; the pair of doc-number patterns also disagreed on SD-N coverage. Extract two core shape strings (DOC_NUM_PATTERN, SECTION_NUM_PATTERN) in lib/__init__.py and rebuild the four callers on them. DOC_FIELD_RE picks up SD-N support for free. No call-site changes: every existing group(0)/group(1)/match/search semantic is preserved (verified by grep for .groups() tuple unpacking; there are none). Closes issues/08.

The shared author state machine in lib/__init__.py was already deduplicated (both lib/pdf/wg21.py and lib/html/extract.py delegate to parse_author_lines with per-caller clean_line/skip_line callbacks). Only the pattern-level coverage proposed in issues/07 was missing: the helper is exercised today only through the callers' tests. Add tests/test_authors.py covering the pending-name pairing, trailing-name flush, blank-line skip, and clean_line/skip_line injection points. Closes issues/07.

With pymupdf and beautifulsoup4 pinned in requirements.txt and pyproject.toml, the try/except AttributeError guard around page.get_texttrace() in find_hidden_regions was defending against an unreachable path. Silent degradation on a dep mismatch would hide real bugs; the pin is the contract, so a missing API should surface as a clear AttributeError. Tightened the pins from ~=1.27 / ~=4.14 to ~=1.27.0 / ~=4.14.0 so that only patch releases are accepted. The looser form actually allowed any minor bump up to the next major, which contradicted the comment in requirements.txt flagging PyMuPDF minor-version API drift as the concern.

* tomd: Add tests for header footer functions * hidden regions tests * position-based list tests * HTML generator fixtures * PR feedback

vinniefalco · 2026-04-15T22:41:28Z

Well this looks very nice !!

Addresses the deferred items from PR cppalliance#18's "Not in this PR" section, sourced from the Apr 15 kimi-k2.5 review. - Promote _compute_bbox from extract.py to types.py as public compute_bbox. No silent empty-list guard: callers already enforce non-empty input, and a ValueError on empty is the correct contract (crash loudly, don't degrade). Replace the inline min/max duplicate in cleanup.py._join_cross_page with a call to the shared helper. - Cap the reply-to continuation loop in wg21.py at REPLY_TO_CONTINUATION_CAP (5) blocks. On a malformed PDF where page 0 body text contains email addresses, the unbounded loop could sweep arbitrary content into metadata["reply-to"].

gregjkal and others added 14 commits April 15, 2026 18:10

tomd: ignore .venv/

b8ba5d2

tomd: sort spatial characters by y-band then x for deterministic read…

74a4ac8

…ing order

tomd: short-circuit similar() on identical strings regardless of length

eb454f9

tomd: fix Block.font_size docstring and stop mutating input in _extra…

2a3daf7

…ct_metadata

tomd: detach HTML sublists before inline capture to fix nested-list d…

3236b8a

…uplication

tomd: fix dehyphenation duplicating word when next line is single span

065a1b8

tomd: add pytest CI workflow for Python 3.12 and 3.13

f9b211c

Adds .github/workflows/tomd-tests.yml with a pytest matrix scoped to tomd/** path changes. Bumps requires-python to >=3.12 to match the tested range.

tomd: add LICENSE and expand user-facing README

25bb253

BSL-1.0 LICENSE file matches the license already declared in pyproject.toml; README grows from an install-only stub to cover usage, output, uncertain-region markers, limitations, design-doc links, and development. Closes issues/15.

Add missing tests and fix cleanup.py bug

be4f82f

* tomd: Add tests for header footer functions * hidden regions tests * position-based list tests * HTML generator fixtures * PR feedback

vinniefalco merged commit be4f82f into cppalliance:master Apr 16, 2026
2 checks passed

gregjkal mentioned this pull request Apr 16, 2026

tomd: kimi review follow-ups #19

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Address review feedback#18

Address review feedback#18
vinniefalco merged 14 commits into
cppalliance:masterfrom
gregjkal:gk/review-fixes

gregjkal commented Apr 15, 2026

Uh oh!

vinniefalco commented Apr 15, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

gregjkal commented Apr 15, 2026

Summary

What's in it

Bug fixes (user-visible)

Duplication / contract fidelity

Coverage (previously untested areas)

Infrastructure

Not in this PR

Uh oh!

vinniefalco commented Apr 15, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants