feat: hwp2hwpx CLI with lossless verification gates + serializer fidelity fixes #1366
idaeho wants to merge 2 commits into
…lity fixes

Ported from a v0.7.15-based working branch onto devel, deduplicated against
devel's own edwardkim#1326 work (header/footer/pageHiding/pageNum/newNum/autoNum
serialization already present; this commit keeps devel's implementations).

New in this commit:
- 'rhwp hwp2hwpx <in.hwp> <out.hwpx> [--verify] [--verify-pages]' subcommand
  (--verify: IR diff gate via ir-diff engine, exit 3 on loss;
  --verify-pages: render page count gate, exit 4 on mismatch)
- Fix table pageBreak Hancom semantics (serializer wrote CellBreak->CELL,
  parser maps CELL->RowBreak; now serializer mirrors the parser:
  CELL<->RowBreak, TABLE<->CellBreak)
- Preserve char-shape run boundaries (multi-run): split runs at char_shapes
  boundaries in the slot path (controls, fieldEnd-adjacent boundaries,
  trailing boundaries), in header/footer subLists, and in table cells
- Treat SectionDef/ColumnDef as 8-unit inline slots (consume position, no
  output; secPr/colPr serialized separately) with adaptive fallback for IR
  lacking gap markers; count trailing markers in slot inference
- Map BinData manifest keys via DocInfo BIN_DATA 1-based index (fixes
  'binaryItemIDRef not registered' on documents with non-contiguous storage ids)
- Replace template page margins / colPr / footNotePr / endNotePr with IR
  values (fixes ~2x page-count divergence on multi-column and endnote-heavy
  documents)
- Preserve secPr/colPr order when ColumnDef precedes SectionDef
- ir-diff: return diff count from ir_diff(); exclude HWP5 vendor tab padding
  (code units 3-5, not representable in OWPML <hp:tab>, Hancom's own
  converter drops it too) from semantic comparison

Verification on an 11-sample suite (complex tables, nested tables,
cropped/header pictures, equations, 2-column exam docs, 4-10MB real documents):
- IR diff vs original HWP: 1074 (v0.7.15) -> 3 total, 9/11 samples at zero
- Render page count match: 11/11
- Known residuals: paragraph char_count raw mismatch (HWP5 nchars variant
  without trailing marker, not representable in HWPX), and a char-shape
  boundary position between trailing markers interacting with devel's
  AutoNumber slot handling (multi-table samples, 2 diffs)

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
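The exit-code contract of the two gates (3 on IR loss, 4 on page-count mismatch) can be sketched as below. This is only an illustration, not the code in this PR: `GateResults` and `exit_code` are hypothetical names, and checking the IR gate before the page gate is an assumption the commit message does not state.

```rust
/// Hypothetical outcome of the two optional gates.
/// `None` means the corresponding flag was not passed.
pub struct GateResults {
    /// Number of IR differences reported by the diff engine (--verify).
    pub ir_diffs: Option<usize>,
    /// (source pages, converted pages) from rendering (--verify-pages).
    pub page_counts: Option<(usize, usize)>,
}

/// Maps gate results to a process exit code:
/// 0 = ok, 3 = IR loss, 4 = page count mismatch.
/// Assumption: the IR gate takes precedence when both gates fail.
pub fn exit_code(r: &GateResults) -> i32 {
    if let Some(n) = r.ir_diffs {
        if n > 0 {
            return 3;
        }
    }
    if let Some((src, dst)) = r.page_counts {
        if src != dst {
            return 4;
        }
    }
    0
}
```

A caller would pass the result to `std::process::exit`, so that scripts wrapping the converter can tell content loss apart from layout drift.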
… pairs
Follow-up to the hwp2hwpx lossless gate: with these comparator fixes the
29-pair Hancom-converted corpus (exam papers, LH press releases, hwp3->hwp5
conversions) goes from 2,126 reported diffs to 1 (an empty-paragraph raw
vpos=0 oddity), so the --verify gate now reports only real content
differences.
- char_shapes: compare referenced CharShape *content* (font names resolved
through FACE_NAME, size, ratios/spacings, bold/italic/underline, colors)
instead of raw table ids — Hancom rebuilds both CharShape and FaceName
tables when saving HWPX (measured 317 vs 307 entries on the same document)
- table text_wrap: treat as equal when both sides are treat_as_char=true —
wrap is meaningless for inline tables and Hancom normalizes the dormant
field to TOP_AND_BOTTOM while HWP5 raw keeps a stale Square
- paragraph char_count: report only when the paragraph has substantive
diffs (text/offsets/controls/char_shapes) — raw nchars varies by trailing
marker convention and HWPX has no carrier field
- HWP3-conversion variant pairs (is_hwp3_variant on one side only):
* ParaShape indent/spacing: accept the exact-2x delta introduced by the
parser's variant normalization (parser/mod.rs Task edwardkim#1042)
* line_segs: skip — typeset-derived values that Hancom itself recomputes
* TabDef positions: ignore u32 negative-range sentinels from converted raw
- parser/hwpx: document that HWP5 raw paragraph margins are uniformly 2x
(PARA_SHAPE raw hex verified; earlier per-document confusion came from
the variant normalization above)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
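Three of the comparator rules above, reduced to a minimal sketch. All types and function names here are hypothetical stand-ins rather than the actual ir-diff code, the `CharShape` struct carries only a few of the fields the commit lists, and accepting the 2x delta in either direction is an assumption.

```rust
/// Hypothetical, trimmed-down character shape (the real one has more fields).
#[derive(Debug, Clone, PartialEq)]
pub struct CharShape {
    pub face_id: usize, // index into the face-name table
    pub size: u32,
    pub bold: bool,
    pub italic: bool,
}

/// Compares two char-shape references by resolved content, not by table id,
/// since both tables may be rebuilt on save. Out-of-range ids compare unequal.
pub fn char_shape_refs_equal(
    a_id: usize, a_shapes: &[CharShape], a_faces: &[String],
    b_id: usize, b_shapes: &[CharShape], b_faces: &[String],
) -> bool {
    let (Some(a), Some(b)) = (a_shapes.get(a_id), b_shapes.get(b_id)) else {
        return false;
    };
    let (Some(fa), Some(fb)) = (a_faces.get(a.face_id), b_faces.get(b.face_id)) else {
        return false;
    };
    fa == fb && a.size == b.size && a.bold == b.bold && a.italic == b.italic
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum TextWrap {
    Square,
    TopAndBottom,
}

/// Wrap mode is ignored when both tables are inline (treat_as_char = true).
pub fn text_wrap_equal(a: TextWrap, a_inline: bool, b: TextWrap, b_inline: bool) -> bool {
    (a_inline && b_inline) || a == b
}

/// Accepts an exact 2x delta only for pairs where one side is an HWP3 variant.
/// Assumption: either side may hold the doubled value.
pub fn indent_equal(a: i32, b: i32, hwp3_variant_one_side: bool) -> bool {
    let (a, b) = (a as i64, b as i64); // widen to avoid overflow on doubling
    a == b || (hwp3_variant_one_side && (a == 2 * b || b == 2 * a))
}
```

Keeping each rule as a small pure predicate makes it easy to see which normalization is responsible when a pair is reported as equal.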
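Hello, 김대호. Thank you for sending your first contribution to rhwp. This PR […] However, the current PR […] the latest […] The main points confirmed are as follows.
So, although it is a hassle, the latest […]
For reference, locally, based on a small sample […] Thank you for a good first contribution. The latest […]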
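Hello, 김대호. The situation has changed since the review, so here is a further update. #1379 (HWPX serializer: preserve controls inside subList) was merged into devel today, and it overlaps with some of the changes in this PR. To spare you unnecessary conflict resolution when rebasing, here is a summary of what is currently duplicated and what is unique.
Parts we recommend dropping because they are already implemented on devel or are planned to be handled internally
Parts we recommend keeping because they have unique value
The earlier change requests (fmt, restoring useFontSpace parsing, etc.) remain valid. As the scope shrinks, we expect the burden to shrink as well. If you have any questions along the way, feel free to ask. Thank you.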
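Closes #1365
Related issue: #1365
Summary
An HWP→HWPX conversion CLI (hwp2hwpx) for environments without Hancom installed, with lossless verification gates (--verify / --verify-pages), plus a bundle of serializer fidelity fixes found along the way. Where this overlaps with devel's #1326 work (header/footer/pageHiding/pageNum/newNum/autoNum serialization), devel's implementation is kept and only the difference is ported.
Changes
- rhwp hwp2hwpx <in.hwp> <out.hwpx> [--verify] [--verify-pages]: --verify reuses the ir-diff engine (diff count > 0 → exit 3); --verify-pages compares rendered page counts (mismatch → exit 4)
- ir_diff() returns the diff count; HWP5 vendor tab padding (code units 3-5, not carried by OWPML and measured to be dropped by Hancom's converter as well) is excluded from semantic comparison
Measurements (11 samples)
Known residuals (3)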
🤖 Generated with Claude Code