Address review feedback#18
Merged
Merged
Conversation
Adds pyproject.toml, __init__.py, and a minimal README so `pip install -e tomd` produces a working `tomd` console script. Pins pymupdf~=1.27 and beautifulsoup4~=4.14 to protect against silent PyMuPDF API drift (get_texttrace, get_text dict/rawdict, get_drawings). Switches main.py to relative imports so tomd.main:main resolves under the console script; the legacy `python tomd/main.py` invocation is dropped in favor of `tomd paper.pdf` or `python -m tomd.main`.
papers/ is gitignored, so fresh clones saw 7 AssertionError failures before any conversion logic ran. Replace the file-existence assert with pytest.skip() so the tests stay granular (per-stem) and only run when their input HTML is present.
Adds .github/workflows/tomd-tests.yml with a pytest matrix scoped to tomd/** path changes. Bumps requires-python to >=3.12 to match the tested range.
BSL-1.0 LICENSE file matches the license already declared in pyproject.toml; README grows from an install-only stub to cover usage, output, uncertain-region markers, limitations, design-doc links, and development. Closes issues/15.
The doc-number shape lived in both lib/__init__.py (DOC_NUM_RE) and lib/pdf/types.py (DOC_FIELD_RE), and the dotted-decimal section-number shape lived in both SECTION_NUM_PREFIX_RE and SECTION_NUM_RE. The tomd-specific rule "regex patterns for metadata fields must be defined in one place" was violated; the pair of doc-number patterns also disagreed on SD-N coverage. Extract two core shape strings (DOC_NUM_PATTERN, SECTION_NUM_PATTERN) in lib/__init__.py and rebuild the four callers on them. DOC_FIELD_RE picks up SD-N support for free. No call-site changes: every existing group(0)/group(1)/match/search semantic is preserved (verified by grep for .groups() tuple unpacking; there are none). Closes issues/08.
The shared author state machine in lib/__init__.py was already deduplicated (both lib/pdf/wg21.py and lib/html/extract.py delegate to parse_author_lines with per-caller clean_line/skip_line callbacks). Only the pattern-level coverage proposed in issues/07 was missing: the helper is exercised today only through the callers' tests. Add tests/test_authors.py covering the pending-name pairing, trailing-name flush, blank-line skip, and clean_line/skip_line injection points. Closes issues/07.
With pymupdf and beautifulsoup4 pinned in requirements.txt and pyproject.toml, the try/except AttributeError guard around page.get_texttrace() in find_hidden_regions was defending against an unreachable path. Silent degradation on a dep mismatch would hide real bugs; the pin is the contract, so a missing API should surface as a clear AttributeError. Tightened the pins from ~=1.27 / ~=4.14 to ~=1.27.0 / ~=4.14.0 so that only patch releases are accepted. The looser form actually allowed any minor bump up to the next major, which contradicted the comment in requirements.txt flagging PyMuPDF minor-version API drift as the concern.
* tomd: Add tests for header footer functions * hidden regions tests * position-based list tests * HTML generator fixtures * PR feedback
Member
|
Well this looks very nice !! |
gregjkal
added a commit
to gregjkal/wg21-papers
that referenced
this pull request
Apr 16, 2026
Addresses the deferred items from PR cppalliance#18's "Not in this PR" section, sourced from the Apr 15 kimi-k2.5 review. - Promote _compute_bbox from extract.py to types.py as public compute_bbox. No silent empty-list guard: callers already enforce non-empty input, and a ValueError on empty is the correct contract (crash loudly, don't degrade). Replace the inline min/max duplicate in cleanup.py._join_cross_page with a call to the shared helper. - Cap the reply-to continuation loop in wg21.py at REPLY_TO_CONTINUATION_CAP (5) blocks. On a malformed PDF where page 0 body text contains email addresses, the unbounded loop could sweep arbitrary content into metadata["reply-to"].
gregjkal
added a commit
to gregjkal/wg21-papers
that referenced
this pull request
Apr 16, 2026
Addresses the deferred items from PR cppalliance#18's "Not in this PR" section, sourced from the Apr 15 kimi-k2.5 review. - Promote _compute_bbox from extract.py to types.py as public compute_bbox. No silent empty-list guard: callers already enforce non-empty input, and a ValueError on empty is the correct contract (crash loudly, don't degrade). Replace the inline min/max duplicate in cleanup.py._join_cross_page with a call to the shared helper. - Cap the reply-to continuation loop in wg21.py at REPLY_TO_CONTINUATION_CAP (5) blocks. On a malformed PDF where page 0 body text contains email addresses, the unbounded loop could sweep arbitrary content into metadata["reply-to"].
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes flagged by the Apr 14 agent review (
tomd/review.md,tomd/lib-review.md) plus packaging, CI, and test coverage so the project is installable and gated. One follow-up wave (sourced from the Apr 15 agent review intomd/review.md) is deliberately deferred, see "Not in this PR" below.What's in it
Bug fixes (user-visible)
similar()short-circuit: returned False for identical strings over 200 chars, breaking TOC match on long titles.try/except AttributeErroraroundpage.get_texttrace(). Withpymupdf~=1.27.0pinned, the pin is the contract; silent degradation would hide dep drift.<table>rows: HTML renderer'sfind_all("tr")flattened nested rows into the parent.Duplication / contract fidelity
lib/__init__.py);lib/pdf/types.pybuilds labeled variants on them. The metadata-field pattern picks upSD-Ncoverage for free.Block.font_sizedocstring corrected (line-count, not character-count);_extract_metadatastops mutating its input and usesreplace()instead.Coverage (previously untested areas)
get_edge_items,detect_repeating,strip_repeating.find_hidden_regionsandstrip_hidden_blocks(plus acleanup.pybug surfaced in the process)._detect_lists_by_position,_split_section_by_position,_join_bullet_marker_lines.Infrastructure
pyproject.toml+__init__.pysopip install -e tomdproduces a workingtomdconsole script. Pinspymupdf~=1.27.0andbeautifulsoup4~=4.14.0to patch releases.python tomd/main.pyis dropped in favor oftomd paper.pdforpython -m tomd.main.LICENSEmatchespyproject.toml; README expanded from install-only stub to cover usage, output format, uncertain-region markers, limitations, design, and development..github/workflows/tomd-tests.ymlwith a pytest matrix on Python 3.12 and 3.13, scoped totomd/**path changes.requires-pythonbumped to>=3.12.Not in this PR
Follow-ups from the Apr 15 kimi-k2.5 review (
tomd/review.md), to land separately:_compute_bboxempty-list guard plus reuse fromcleanup.py(eliminates a third inline min/max).classify_monospacetois_monospace(pure rename; matches the bool it sets).extract_metadata_from_blocksso malformed page 0s cannot sweep body text intometadata["reply-to"].cleanup.py,structure.py, andrender.py. Either consolidate or document the span-vs-DOM distinction at each site.convert_pdf; precedence is currently undocumented).structure.py._word_similaritycan share implementation withlib/similarity.py._jaccard_similarity; different denominators today.except Exceptionincollect_line_drawings.Two kimi findings are scoped out: the
__init__.py:133font_countsguard is a false positive (empty Counter is falsy), and the doc-number regex-docs item is subsumed by the regex-dedup work above.