tomd: kimi review follow-ups by gregjkal · Pull Request #19 · cppalliance/wg21-papers

gregjkal · 2026-04-16T14:11:39Z

Follow-up to #18, addressing two items from the "Not in this PR" section (sourced from the Apr 15 kimi-k2.5 review). The other suggestions outlined there were dropped.

Stuff we did

`compute_bbox` dedup

Promoted from extract.py to types.py as a public helper
Replaced the inline min/max duplicate in cleanup.py.
Did not do: Review suggested a silent empty-list guard, but callers already enforce passing non-empty lists, so we should prefer raising ValueError on empty lists to avoid silent degradation from a future mistake.

Reply-to continuation cap

Reply-to loop in wg21.py now stops after 5 blocks, preventing malformed PDFs from sweeping body text into metadata.

Stuff we didn't do

The remaining five items from the review were evaluated and dropped:

kimi #4 (classify_monospace rename): it's a triple-signal classification algorithm, not a simple predicate. The name fits.
kimi #6 (whitespace-regex consolidation): the three patterns have intentionally different semantics. Different enough to keep separate.
kimi #8 (metadata pathway docs): turns out this was already addressed in Address review feedback #18.
kimi #9 (multiset similarity sharing): _word_similarity and _jaccard_similarity use different denominators for different purposes. Not worth unifying.
kimi #10 (except Exception narrowing): a good general best practice, but here we already log with exc_info=True, and unfortunately, MuPDF doesn't appear to have a clean exception hierarchy to narrow to anyway.

Addresses the deferred items from PR cppalliance#18's "Not in this PR" section, sourced from the Apr 15 kimi-k2.5 review. - Promote _compute_bbox from extract.py to types.py as public compute_bbox. No silent empty-list guard: callers already enforce non-empty input, and a ValueError on empty is the correct contract (crash loudly, don't degrade). Replace the inline min/max duplicate in cleanup.py._join_cross_page with a call to the shared helper. - Cap the reply-to continuation loop in wg21.py at REPLY_TO_CONTINUATION_CAP (5) blocks. On a malformed PDF where page 0 body text contains email addresses, the unbounded loop could sweep arbitrary content into metadata["reply-to"].

jlchilders11

One change to make the pull request description accurate

gregjkal force-pushed the gk/kimi-review-followups branch from df16e13 to f65ad96 Compare April 16, 2026 14:24

jlchilders11 requested changes Apr 16, 2026

View reviewed changes

Comment thread tomd/lib/pdf/cleanup.py

jlchilders11 approved these changes Apr 16, 2026

View reviewed changes

gregjkal marked this pull request as ready for review April 16, 2026 14:43

gregjkal merged commit 829f6d4 into cppalliance:master Apr 16, 2026
2 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

tomd: kimi review follow-ups#19

tomd: kimi review follow-ups#19
gregjkal merged 1 commit into
cppalliance:masterfrom
gregjkal:gk/kimi-review-followups

gregjkal commented Apr 16, 2026 •

edited

Loading

Uh oh!

jlchilders11 left a comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

gregjkal commented Apr 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Stuff we did

compute_bbox dedup

Reply-to continuation cap

Stuff we didn't do

Uh oh!

jlchilders11 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

gregjkal commented Apr 16, 2026 •

edited

Loading

`compute_bbox` dedup