Add opt-in duplicate-disambiguation metadata alongside content_fingerprint in read_file #205

@stevenobiajulu

Add opt-in duplicate-disambiguation metadata alongside content_fingerprint in read_file(format="json")

Problem

Some downstream consumers, such as legal-context's manual-DOCX ingest pipeline, need to reference a specific paragraph from outside the safe-docx session. A typical case is storing a stable paragraph ID alongside a database row so that a re-extraction matches the same paragraph. For that, safe-docx already exposes the core primitive: content_fingerprint (opt-in via include_fingerprint: true on read_file), a normalized-text hash that is stable across reads, machines, and re-uploads.

The remaining gap is duplicate disambiguation when the same normalized paragraph text appears more than once in a document. With only content_fingerprint, two paragraphs whose normalized text is identical get the same fingerprint, so a consumer can't reference "the second occurrence of WHEREAS …" without computing its own ordinal.

Every consumer who hits this has to:

  1. Request include_fingerprint: true
  2. Group paragraphs by content_fingerprint
  3. Assign document-order ordinals inside each duplicate group
  4. Build their own composite key

That logic is small, but independent implementations can easily drift apart across consumers.
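
For reference, the consumer-side workaround looks roughly like the sketch below. It is a minimal illustration, not code from any existing consumer: the ParagraphNode shape is reduced to the two fields the sketch needs, and the names disambiguate, ordinal, countInDocument, and compositeKey are hypothetical. The "#" separator in the composite key is also just one possible choice.

```typescript
// Hypothetical minimal view of a paragraph node returned by
// read_file(format="json", include_fingerprint=true).
interface ParagraphNode {
  id: string;
  content_fingerprint: string;
}

// Hypothetical consumer-side result type.
interface DisambiguatedParagraph extends ParagraphNode {
  ordinal: number;         // 1-based position among paragraphs with the same fingerprint
  countInDocument: number; // how many paragraphs share the fingerprint
  compositeKey: string;    // consumer-defined key; the format is an assumption
}

function disambiguate(paragraphs: ParagraphNode[]): DisambiguatedParagraph[] {
  // Pass 1: count how many paragraphs share each fingerprint.
  const totals = new Map<string, number>();
  for (const p of paragraphs) {
    totals.set(p.content_fingerprint, (totals.get(p.content_fingerprint) ?? 0) + 1);
  }

  // Pass 2: walk in document order and hand out ordinals per fingerprint.
  const seen = new Map<string, number>();
  return paragraphs.map((p) => {
    const ordinal = (seen.get(p.content_fingerprint) ?? 0) + 1;
    seen.set(p.content_fingerprint, ordinal);
    return {
      ...p,
      ordinal,
      countInDocument: totals.get(p.content_fingerprint) ?? 1,
      compositeKey: `${p.content_fingerprint}#${ordinal}`,
    };
  });
}
```

The input array is assumed to already be in document order, which is what makes the ordinals deterministic.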

Proposal

Keep _bk_* and content_fingerprint exactly as-is. Do not change id, do not change edit-anchor semantics, and do not change TOON/simple output or Google Docs behavior.

Instead, add an opt-in JSON-only duplicate-disambiguation surface for DOCX sessions:

{
  "file_path": "/path/to/doc.docx",
  "format": "json",
  "include_fingerprint": true,
  "include_fingerprint_ordinal": true  // new; default false
}

When both flags are enabled, each paragraph node would include two new fields:

{
  "id": "_bk_a3f29c10b8e4",
  "content_fingerprint": "sha256:nfkc:5d2e8f1a4c5b7d2e8f1a4c5b7d2e8f1a",
  "content_fingerprint_ordinal": 1,             // new — 1-based, document-order among duplicates
  "content_fingerprint_count_in_document": 2    // new — total paragraphs sharing this fingerprint
}

Optional convenience field if maintainers prefer a single composite string:

"portable_paragraph_ref": "sha256:nfkc:5d2e8f1a4c5b7d2e8f1a4c5b7d2e8f1a#1"

Happy with either structured fields alone, or structured fields plus the convenience composite.
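
If the composite string is adopted, consumers would likely want to build and split it without ad-hoc string handling. The sketch below assumes the "<fingerprint>#<ordinal>" layout shown in the example above and lowercase hex digits in the fingerprint; formatPortableRef, parsePortableRef, and PortableParagraphRef are hypothetical names, not part of safe-docx.

```typescript
// Hypothetical structured form of the proposed portable_paragraph_ref.
interface PortableParagraphRef {
  fingerprint: string; // e.g. "sha256:nfkc:<32hex>"
  ordinal: number;     // 1-based
}

function formatPortableRef(ref: PortableParagraphRef): string {
  return `${ref.fingerprint}#${ref.ordinal}`;
}

// Returns null when the input does not match the assumed layout.
function parsePortableRef(value: string): PortableParagraphRef | null {
  // Assumption: 32 lowercase hex digits, then "#", then a positive integer.
  const match = /^(sha256:nfkc:[0-9a-f]{32})#([1-9][0-9]*)$/.exec(value);
  if (!match) return null;
  return { fingerprint: match[1], ordinal: Number(match[2]) };
}
```

Rejecting an ordinal of 0 in the parser mirrors the 1-based convention proposed here.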

Semantics

  • content_fingerprint keeps its current algorithm (NFKC normalization, Cf/invisible stripping, whitespace collapse + trim, sha256:nfkc:<32hex>).
  • content_fingerprint_ordinal is 1-based.
  • Ordinals are assigned in document order among paragraphs with the same content_fingerprint.
  • content_fingerprint_count_in_document is the total count of paragraphs sharing that fingerprint.
  • For unique fingerprints: ordinal 1, count 1.
  • Reordering duplicate paragraphs may change ordinals; that's acceptable because this metadata is a read-only disambiguator, not an edit anchor.
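
To illustrate why duplicates can arise even when the raw text differs, here is a rough sketch of the normalization steps named in the first bullet. It is not the code in content_fingerprint.ts: the step order, the use of \p{Cf} for "Cf/invisible stripping", and the function name normalizeForFingerprint are all assumptions, and the hashing step is left out.

```typescript
// Illustrative only: approximates the normalization described above.
// Assumed order: NFKC, strip format (Cf) characters, collapse whitespace, trim.
function normalizeForFingerprint(text: string): string {
  return text
    .normalize("NFKC")        // compatibility normalization (e.g. NBSP -> space, ligatures expanded)
    .replace(/\p{Cf}/gu, "")  // drop Unicode format characters such as zero-width space
    .replace(/\s+/g, " ")     // collapse whitespace runs to a single space
    .trim();
}
```

Under these assumptions, "WHEREAS  the Parties" with a doubled space and "WHEREAS the Parties" normalize to the same string, so they would share a fingerprint and need ordinals to tell them apart.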

Why this shape (vs. introducing _p_* IDs)

Surfaced during pre-filing review against the actual safe-docx codebase:

  • _bk_* IDs are already deterministic and stable across reopens for identical stored DOCX bytes — not session-scoped (see skills/docx-editing/SKILL.md:187, packages/docx-mcp/README.md:180, packages/docx-mcp/src/tools/paragraph_id_stability.traceability.test.ts:25).
  • The content_fingerprint mechanism already exists and is exactly the right primitive (see packages/docx-core/src/primitives/content_fingerprint.ts:29).
  • Bookmark resolution only recognizes names starting with _bk_, so introducing _p_* would require deep anchor plumbing across replace_text, insert_paragraph, apply_plan, add_comment, add_footnote, and other paragraph-ID consumers (packages/docx-core/src/primitives/bookmarks.ts:128,161,276).
  • node_ids filtering on read_file matches _bk_* (packages/docx-mcp/src/tools/read_file.ts:328) — an ID-shape change would require redefining that too.

This issue avoids all of that by staying additive on the existing JSON output.

Acceptance criteria

  • read_file(format="json", include_fingerprint=true) remains backward-compatible.
  • read_file(format="json", include_fingerprint=true, include_fingerprint_ordinal=true) adds the two new metadata fields (and optionally portable_paragraph_ref) while leaving id unchanged.
  • id continues to be _bk_*, and edit tools continue to accept only _bk_*.
  • The existing fingerprint algorithm remains unchanged (NFKC, Cf/invisible stripping, whitespace collapse, sha256:nfkc:<32hex>).
  • New tests cover:
    • Unique paragraph → ordinal 1, count 1
    • Duplicate normalized text → deterministic ordinals in document order
    • Whitespace-only variants share the same content_fingerprint and get distinct ordinals
    • TOON/simple output unchanged regardless of flag
    • Google Docs ignores the new flag (or it's rejected / documented as DOCX-only)

Out of scope

  • Replacing _bk_* with _p_*
  • Making portable references valid edit anchors
  • Changing TOON / simple output
  • Changing Google Docs output
  • Any new normalization algorithm or shorter hash format
  • Footnote / endnote / comment fingerprints (separate request if anyone needs them)

Why this helps downstream consumers

For manual-ingest / citation pipelines, this removes the need to re-implement duplicate grouping while preserving safe-docx's existing "stable edit anchor (_bk_*) vs portable content hash (content_fingerprint)" split.

Consumers wanting a document-local portable key can store:

  • document identifier
  • content_fingerprint
  • content_fingerprint_ordinal

Consumers wanting same-text grouping across documents continue using content_fingerprint alone.
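
A sketch of how a stored document-local key might be resolved against a later re-extraction, assuming the two proposed fields are present on each paragraph node. StoredParagraphKey, ReadParagraph, resolveStoredKey, and the documentId field are hypothetical names chosen for illustration; the caller is assumed to have already loaded the paragraphs for the document that documentId refers to.

```typescript
// Hypothetical row stored by a consumer next to its own data.
interface StoredParagraphKey {
  documentId: string; // consumer's own document identifier
  content_fingerprint: string;
  content_fingerprint_ordinal: number;
}

// Hypothetical minimal view of a paragraph node with the proposed fields.
interface ReadParagraph {
  id: string; // the _bk_* edit anchor, unchanged by this proposal
  content_fingerprint: string;
  content_fingerprint_ordinal: number;
}

// Finds the paragraph a stored key points at, or undefined if the text
// changed or that duplicate no longer exists in the document.
function resolveStoredKey(
  key: StoredParagraphKey,
  paragraphs: ReadParagraph[],
): ReadParagraph | undefined {
  return paragraphs.find(
    (p) =>
      p.content_fingerprint === key.content_fingerprint &&
      p.content_fingerprint_ordinal === key.content_fingerprint_ordinal,
  );
}
```

The returned id can then be passed to the edit tools, which keeps the portable reference and the edit anchor as separate concepts.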


Pre-filing review note: I initially proposed a bookmark_mode=content_addressable parameter that would have swapped _bk_* for _p_* hashes. Dynamic peer review against the safe-docx source (Codex + Gemini, both with direct repo access) found my premise was wrong: _bk_* is already stable, content_fingerprint already exists, and the ID-swap would have broken every paragraph-ID-consuming tool. This is the rewritten ask: additive duplicate metadata on top of the existing fingerprint surface.
