Add opt-in duplicate-disambiguation metadata alongside content_fingerprint in read_file(format="json")
Problem
Downstream consumers (e.g. legal-context's manual-DOCX ingest pipeline) sometimes need to reference a specific paragraph from outside the safe-docx session, for example by storing a stable paragraph ID alongside a database row so that a re-extraction matches the same paragraph. safe-docx already exposes the core primitive for this: content_fingerprint (opt-in via include_fingerprint: true on read_file), a normalized-text hash that is stable across reads, machines, and re-uploads.
The remaining gap is duplicate disambiguation when the same normalized paragraph text appears multiple times in one document. With only content_fingerprint, two paragraphs whose normalized text is identical get the same fingerprint, so a consumer can't reference "the second occurrence of WHEREAS …" without computing its own ordinal.
Every consumer who hits this has to:
- Request include_fingerprint: true
- Group paragraphs by content_fingerprint
- Assign document-order ordinals inside each duplicate group
- Build their own composite key
That logic is small, but separate implementations can easily drift apart across consumers.
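For illustration, the consumer-side workaround described above might look roughly like the sketch below. The ParagraphNode shape, the DisambiguatedParagraph type, and the assignOrdinals helper are hypothetical names, not part of safe-docx; the only assumption about the real output is that each paragraph node carries id and content_fingerprint and that the nodes arrive in document order.

```typescript
// Hypothetical minimal shape of a paragraph node from read_file(format="json")
// with include_fingerprint: true. Only the two fields used here are modelled.
interface ParagraphNode {
  id: string;
  content_fingerprint: string;
}

// Hypothetical consumer-side result type.
interface DisambiguatedParagraph extends ParagraphNode {
  ordinal: number; // 1-based position among paragraphs sharing the fingerprint
  count: number; // total paragraphs sharing the fingerprint
}

// Assumes `paragraphs` is already in document order.
function assignOrdinals(paragraphs: ParagraphNode[]): DisambiguatedParagraph[] {
  // First pass: count how often each fingerprint occurs.
  const counts = new Map<string, number>();
  for (const p of paragraphs) {
    counts.set(p.content_fingerprint, (counts.get(p.content_fingerprint) ?? 0) + 1);
  }

  // Second pass: hand out ordinals in document order within each group.
  const seen = new Map<string, number>();
  return paragraphs.map((p) => {
    const ordinal = (seen.get(p.content_fingerprint) ?? 0) + 1;
    seen.set(p.content_fingerprint, ordinal);
    return { ...p, ordinal, count: counts.get(p.content_fingerprint) ?? 1 };
  });
}
```

Every consumer currently has to carry some variant of these two passes, which is the duplication this issue aims to remove.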
Proposal
Keep _bk_* and content_fingerprint exactly as-is. Do not change id, do not change edit-anchor semantics, and do not change TOON/simple output or Google Docs behavior.
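Instead, add an opt-in JSON-only duplicate-disambiguation surface for DOCX sessions:

    {
      "file_path": "/path/to/doc.docx",
      "format": "json",
      "include_fingerprint": true,
      "include_fingerprint_ordinal": true // new; default false
    }

When both flags are enabled, each paragraph node would include two new fields:

    {
      "id": "_bk_a3f29c10b8e4",
      "content_fingerprint": "sha256:nfkc:5d2e8f1a4c5b7d2e8f1a4c5b7d2e8f1a",
      "content_fingerprint_ordinal": 1, // new; 1-based, document-order among duplicates
      "content_fingerprint_count_in_document": 2 // new; total paragraphs sharing this fingerprint
    }
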
Optional convenience field if maintainers prefer a single composite string:
Happy with either structured fields alone, or structured fields plus the convenience composite.
Semantics
- content_fingerprint keeps its current algorithm (NFKC normalization, Cf/invisible stripping, whitespace collapse + trim, sha256:nfkc:<32hex>).
- content_fingerprint_ordinal is 1-based.
- Ordinals assigned in document order among paragraphs with the same content_fingerprint.
- content_fingerprint_count_in_document is the total count of paragraphs sharing that fingerprint.
- For unique fingerprints: ordinal 1, count 1.
- Reordering duplicate paragraphs may change ordinals; that's acceptable because this metadata is a read-only disambiguator, not an edit anchor.
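One way to read the semantics above is as a set of invariants that any output carrying the new fields should satisfy. The sketch below checks them; FingerprintedParagraph and checkOrdinalInvariants are hypothetical names used only for illustration, and the field names are taken from this proposal rather than from any shipped API.

```typescript
// Hypothetical shape of a paragraph node once the proposed fields exist.
interface FingerprintedParagraph {
  id: string;
  content_fingerprint: string;
  content_fingerprint_ordinal: number;
  content_fingerprint_count_in_document: number;
}

// Returns a list of violations; an empty list means the semantics hold.
// Assumes `paragraphs` is in document order.
function checkOrdinalInvariants(paragraphs: FingerprintedParagraph[]): string[] {
  const problems: string[] = [];
  const nextExpected = new Map<string, number>(); // fingerprint -> next ordinal
  const declaredCount = new Map<string, number>(); // fingerprint -> declared count

  for (const p of paragraphs) {
    const fp = p.content_fingerprint;

    // Ordinals must run 1, 2, 3, ... in document order within a group.
    const expected = nextExpected.get(fp) ?? 1;
    if (p.content_fingerprint_ordinal !== expected) {
      problems.push(`${p.id}: ordinal ${p.content_fingerprint_ordinal}, expected ${expected}`);
    }
    nextExpected.set(fp, expected + 1);

    // Every member of a group must report the same count.
    const count = declaredCount.get(fp);
    if (count === undefined) {
      declaredCount.set(fp, p.content_fingerprint_count_in_document);
    } else if (count !== p.content_fingerprint_count_in_document) {
      problems.push(`${p.id}: inconsistent count for ${fp}`);
    }
  }

  // The declared count must equal the number of paragraphs actually seen.
  declaredCount.forEach((count, fp) => {
    const actual = (nextExpected.get(fp) ?? 1) - 1;
    if (actual !== count) {
      problems.push(`${fp}: declared count ${count}, found ${actual}`);
    }
  });

  return problems;
}
```

A checker like this could also back the acceptance tests, since it is independent of how the ordinals are produced.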
Why this shape (vs. introducing _p_* IDs)
Surfaced during pre-filing review against the actual safe-docx codebase:
- _bk_* IDs are already deterministic and stable across reopens for identical stored DOCX bytes, not session-scoped (see skills/docx-editing/SKILL.md:187, packages/docx-mcp/README.md:180, packages/docx-mcp/src/tools/paragraph_id_stability.traceability.test.ts:25).
- The content_fingerprint mechanism already exists and is exactly the right primitive (see packages/docx-core/src/primitives/content_fingerprint.ts:29).
- Bookmark resolution only recognizes names starting with _bk_, so introducing _p_* would require deep anchor plumbing across replace_text, insert_paragraph, apply_plan, add_comment, add_footnote, and other paragraph-ID consumers (packages/docx-core/src/primitives/bookmarks.ts:128,161,276).
- node_ids filtering on read_file matches _bk_* (packages/docx-mcp/src/tools/read_file.ts:328); an ID-shape change would require redefining that too.
This issue avoids all of that by staying additive on the existing JSON output.
Acceptance criteria
- read_file(format="json", include_fingerprint=true) remains backward-compatible.
- read_file(format="json", include_fingerprint=true, include_fingerprint_ordinal=true) adds the two new metadata fields (and optionally portable_paragraph_ref) while leaving id unchanged.
- id continues to be _bk_*, and edit tools continue to accept only _bk_*.
- The existing fingerprint algorithm remains unchanged (NFKC, Cf/invisible stripping, whitespace collapse, sha256:nfkc:<32hex>).
- New tests cover:
  - Unique paragraph → ordinal 1, count 1
  - Duplicate normalized text → deterministic ordinals in document order
  - Whitespace-only variants share the same content_fingerprint and get distinct ordinals
  - TOON/simple output unchanged regardless of flag
  - Google Docs ignores the new flag (or it's rejected / documented as DOCX-only)
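To make the whitespace-variant test case concrete, here is a rough approximation of the normalization steps named in this issue (NFKC, Cf stripping, whitespace collapse + trim). It is not the code in content_fingerprint.ts, the function name normalizeForFingerprint is made up, the order of the steps is a guess, and the hashing step is omitted entirely; the point is only to show why such variants end up in one duplicate group.

```typescript
// Approximation of the normalization described in this issue. This is NOT the
// safe-docx implementation; the real algorithm may differ in details.
function normalizeForFingerprint(text: string): string {
  return text
    .normalize("NFKC") // compatibility normalization, e.g. NBSP becomes a plain space
    .replace(/\p{Cf}/gu, "") // strip format characters such as U+200B and U+FEFF
    .replace(/\s+/g, " ") // collapse runs of whitespace to a single space
    .trim();
}

// Two paragraphs that differ only in whitespace and invisible characters.
const variantA = normalizeForFingerprint("WHEREAS  the\u00A0parties agree ");
const variantB = normalizeForFingerprint("WHEREAS the parties\u200B agree");

// Both normalize to the same string, so they would share a fingerprint and
// need distinct ordinals to be told apart.
console.log(variantA === variantB);
```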
Out of scope
- Replacing _bk_* with _p_*
- Making portable references valid edit anchors
- Changing TOON / simple output
- Changing Google Docs output
- Any new normalization algorithm or shorter hash format
- Footnote / endnote / comment fingerprints (separate request if anyone needs them)
Why this helps downstream consumers
For manual-ingest / citation pipelines, this removes the need to re-implement duplicate grouping while preserving safe-docx's existing "stable edit anchor (_bk_*) vs portable content hash (content_fingerprint)" split.
Consumers wanting a document-local portable key can store:
- document identifier
- content_fingerprint
- content_fingerprint_ordinal
Consumers wanting same-text grouping across documents continue using content_fingerprint alone.
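A sketch of how a consumer might persist and later re-resolve such a document-local key, assuming the proposed content_fingerprint_ordinal field exists. StoredParagraphRef, ReadParagraph, and resolveRef are hypothetical names, and documentId is whatever identifier the consumer already uses; none of this is safe-docx API.

```typescript
// Hypothetical record a consumer might store next to a database row.
interface StoredParagraphRef {
  documentId: string; // consumer-defined document identifier
  contentFingerprint: string; // content_fingerprint at extraction time
  ordinal: number; // content_fingerprint_ordinal at extraction time
}

// Hypothetical shape of a paragraph node with the proposed field present.
interface ReadParagraph {
  id: string;
  content_fingerprint: string;
  content_fingerprint_ordinal: number;
}

// Resolve a stored reference against a fresh read of the same document.
// Returns undefined if the paragraph text changed or the duplicate is gone.
function resolveRef(
  ref: StoredParagraphRef,
  paragraphs: ReadParagraph[],
): ReadParagraph | undefined {
  return paragraphs.find(
    (p) =>
      p.content_fingerprint === ref.contentFingerprint &&
      p.content_fingerprint_ordinal === ref.ordinal,
  );
}
```

The resolved node's id is still the _bk_* anchor, so any follow-up edit would go through the existing edit tools unchanged.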
Pre-filing review note: I initially proposed a bookmark_mode=content_addressable parameter that would have swapped _bk_* for _p_* hashes. Dynamic peer review against the safe-docx source (Codex + Gemini, both with direct repo access) found my premise was wrong: _bk_* is already stable, content_fingerprint already exists, and the ID-swap would have broken every paragraph-ID-consuming tool. This is the rewritten ask: additive duplicate metadata on top of the existing fingerprint surface.