Skip to content

tomd: kimi review follow-ups#19

Merged
gregjkal merged 1 commit into
cppalliance:masterfrom
gregjkal:gk/kimi-review-followups
Apr 16, 2026
Merged

tomd: kimi review follow-ups#19
gregjkal merged 1 commit into
cppalliance:masterfrom
gregjkal:gk/kimi-review-followups

Conversation

@gregjkal
Copy link
Copy Markdown
Contributor

@gregjkal gregjkal commented Apr 16, 2026

Follow-up to #18, addressing two items from the "Not in this PR" section (sourced from the Apr 15 kimi-k2.5 review). The other suggestions outlined there were dropped.

Stuff we did

compute_bbox dedup

  • Promoted from extract.py to types.py as a public helper
  • Replaced the inline min/max duplicate in cleanup.py.
  • Did not do: Review suggested a silent empty-list guard, but callers already enforce passing non-empty lists, so we should prefer raising ValueError on empty lists to avoid silent degradation from a future mistake.

Reply-to continuation cap

  • Reply-to loop in wg21.py now stops after 5 blocks, preventing malformed PDFs from sweeping body text into metadata.

Stuff we didn't do

The remaining five items from the review were evaluated and dropped:

  • kimi #4 (classify_monospace rename): it's a triple-signal classification algorithm, not a simple predicate. The name fits.
  • kimi #6 (whitespace-regex consolidation): the three patterns have intentionally different semantics. Different enough to keep separate.
  • kimi #8 (metadata pathway docs): turns out this was already addressed in Address review feedback #18.
  • kimi #9 (multiset similarity sharing): _word_similarity and _jaccard_similarity use different denominators for different purposes. Not worth unifying.
  • kimi #10 (except Exception narrowing): a good general best practice, but here we already log with exc_info=True, and unfortunately, MuPDF doesn't appear to have a clean exception hierarchy to narrow to anyway.

Addresses the deferred items from PR cppalliance#18's "Not in this PR" section,
sourced from the Apr 15 kimi-k2.5 review.

- Promote _compute_bbox from extract.py to types.py as public
  compute_bbox. No silent empty-list guard: callers already enforce
  non-empty input, and a ValueError on empty is the correct contract
  (crash loudly, don't degrade). Replace the inline min/max duplicate
  in cleanup.py._join_cross_page with a call to the shared helper.

- Cap the reply-to continuation loop in wg21.py at
  REPLY_TO_CONTINUATION_CAP (5) blocks. On a malformed PDF where
  page 0 body text contains email addresses, the unbounded loop
  could sweep arbitrary content into metadata["reply-to"].
@gregjkal gregjkal force-pushed the gk/kimi-review-followups branch from df16e13 to f65ad96 Compare April 16, 2026 14:24
Copy link
Copy Markdown
Contributor

@jlchilders11 jlchilders11 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One change to make the pull request description accurate

Comment thread tomd/lib/pdf/cleanup.py
@gregjkal gregjkal marked this pull request as ready for review April 16, 2026 14:43
@gregjkal gregjkal merged commit 829f6d4 into cppalliance:master Apr 16, 2026
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants