Add opt-in duplicate-disambiguation metadata alongside content_fingerprint in read_file #205

# Add opt-in duplicate-disambiguation metadata alongside `content_fingerprint` in `read_file(format="json")`

## Problem

Downstream consumers (e.g. legal-context's manual-DOCX ingest pipeline) sometimes need to reference a specific paragraph from outside the safe-docx session, for example by storing a stable paragraph ID alongside a database row so that a re-extraction matches the same paragraph. safe-docx already exposes the core primitive for this: `content_fingerprint` (opt-in via `include_fingerprint: true` on `read_file`), a normalized-text hash that is stable across reads, machines, and re-uploads.

The remaining gap is **duplicate disambiguation when the same normalized paragraph text appears multiple times in one document**. With only `content_fingerprint`, two paragraphs whose text normalizes to the same string (byte-identical text, or variants that differ only in whitespace) get the same fingerprint, so a consumer can't reference "the second occurrence of WHEREAS …" without computing its own ordinal.

Every consumer who hits this has to:
1. Request `include_fingerprint: true`
2. Group paragraphs by `content_fingerprint`
3. Assign document-order ordinals inside each duplicate group
4. Build their own composite key

That logic is small, but independent implementations can easily drift apart across consumers.
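For reference, the consumer-side workaround in steps 2 to 4 might look roughly like the sketch below. The `ParagraphNode` shape, the `assignOrdinals` helper, and the `<fingerprint>#<ordinal>` key format are illustrative assumptions, not safe-docx APIs, and the IDs and fingerprints are placeholders.

```typescript
// Hypothetical minimal shape of a paragraph node from
// read_file(format="json", include_fingerprint=true); only the fields used here.
interface ParagraphNode {
  id: string;
  content_fingerprint: string;
}

interface KeyedParagraph extends ParagraphNode {
  ordinal: number;      // 1-based position among paragraphs sharing the fingerprint
  compositeKey: string; // consumer-defined key: "<fingerprint>#<ordinal>"
}

// Steps 2-4: group by fingerprint, number duplicates in document order,
// and build a composite key. Assumes `paragraphs` is already in document order.
function assignOrdinals(paragraphs: ParagraphNode[]): KeyedParagraph[] {
  const seen = new Map<string, number>(); // fingerprint -> occurrences so far
  return paragraphs.map((p) => {
    const ordinal = (seen.get(p.content_fingerprint) ?? 0) + 1;
    seen.set(p.content_fingerprint, ordinal);
    return { ...p, ordinal, compositeKey: `${p.content_fingerprint}#${ordinal}` };
  });
}

// Placeholder data: the first and third paragraphs share a fingerprint.
const keyed = assignOrdinals([
  { id: "_bk_000000000001", content_fingerprint: "sha256:nfkc:5d2e8f1a4c5b7d2e8f1a4c5b7d2e8f1a" },
  { id: "_bk_000000000002", content_fingerprint: "sha256:nfkc:00112233445566778899aabbccddeeff" },
  { id: "_bk_000000000003", content_fingerprint: "sha256:nfkc:5d2e8f1a4c5b7d2e8f1a4c5b7d2e8f1a" },
]);
```

Each consumer that writes its own version of this has to make the same choices (1-based vs. 0-based, key separator, ordering), which is where drift tends to creep in.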

## Proposal

Keep `_bk_*` and `content_fingerprint` exactly as-is. Do **not** change `id`, do **not** change edit-anchor semantics, and do **not** change TOON/simple output or Google Docs behavior.

Instead, add an opt-in JSON-only duplicate-disambiguation surface for DOCX sessions:

```jsonc
{
  "file_path": "/path/to/doc.docx",
  "format": "json",
  "include_fingerprint": true,
  "include_fingerprint_ordinal": true  // new; default false
}
```

When both flags are enabled, each paragraph node would include two new fields:

```jsonc
{
  "id": "_bk_a3f29c10b8e4",
  "content_fingerprint": "sha256:nfkc:5d2e8f1a4c5b7d2e8f1a4c5b7d2e8f1a",
  "content_fingerprint_ordinal": 1,             // new — 1-based, document-order among duplicates
  "content_fingerprint_count_in_document": 2    // new — total paragraphs sharing this fingerprint
}
```

Optional convenience field if maintainers prefer a single composite string:

```jsonc
"portable_paragraph_ref": "sha256:nfkc:5d2e8f1a4c5b7d2e8f1a4c5b7d2e8f1a#1"
```

Happy with either structured fields alone, or structured fields plus the convenience composite.
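If the composite string is adopted, a consumer would likely want to split it back into its parts. The sketch below assumes the `<fingerprint>#<ordinal>` layout shown in the example above and assumes `#` never appears inside a fingerprint (which holds for the `sha256:nfkc:<32hex>` format); `PortableRef`, `formatPortableRef`, and `parsePortableRef` are hypothetical helper names, not part of safe-docx.

```typescript
// Hypothetical structured form of a portable paragraph reference.
interface PortableRef {
  fingerprint: string;
  ordinal: number; // 1-based
}

// Join the two parts into the composite string form.
function formatPortableRef(ref: PortableRef): string {
  return `${ref.fingerprint}#${ref.ordinal}`;
}

// Split on the LAST "#", since the fingerprint itself contains ":" separators
// and we only assume it contains no "#". Returns null for malformed input.
function parsePortableRef(value: string): PortableRef | null {
  const hashIndex = value.lastIndexOf("#");
  if (hashIndex <= 0) return null; // no separator, or empty fingerprint
  const fingerprint = value.slice(0, hashIndex);
  const ordinalText = value.slice(hashIndex + 1);
  // Ordinals are 1-based positive integers without leading zeros.
  if (!/^[1-9][0-9]*$/.test(ordinalText)) return null;
  return { fingerprint, ordinal: Number(ordinalText) };
}

const exampleRef = parsePortableRef("sha256:nfkc:5d2e8f1a4c5b7d2e8f1a4c5b7d2e8f1a#1");
```

The need for this kind of parsing is one argument for treating the structured fields as primary and the composite as a convenience.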

## Semantics

- `content_fingerprint` keeps its current algorithm (NFKC normalization, Cf/invisible stripping, whitespace collapse + trim, `sha256:nfkc:<32hex>`).
- `content_fingerprint_ordinal` is 1-based.
- Ordinals are assigned in document order among paragraphs with the same `content_fingerprint`.
- `content_fingerprint_count_in_document` is the total count of paragraphs sharing that fingerprint.
- For unique fingerprints: ordinal `1`, count `1`.
- Reordering duplicate paragraphs may change ordinals; that's acceptable because this metadata is a read-only disambiguator, not an edit anchor.

## Why this shape (vs. introducing `_p_*` IDs)

The following points surfaced during pre-filing review against the actual safe-docx codebase:

- `_bk_*` IDs are **already** deterministic and stable across reopens for identical stored DOCX bytes — not session-scoped (see `skills/docx-editing/SKILL.md:187`, `packages/docx-mcp/README.md:180`, `packages/docx-mcp/src/tools/paragraph_id_stability.traceability.test.ts:25`).
- The `content_fingerprint` mechanism already exists and is exactly the right primitive (see `packages/docx-core/src/primitives/content_fingerprint.ts:29`).
- Bookmark resolution only recognizes names starting with `_bk_`, so introducing `_p_*` would require deep anchor plumbing across `replace_text`, `insert_paragraph`, `apply_plan`, `add_comment`, `add_footnote`, and other paragraph-ID consumers (`packages/docx-core/src/primitives/bookmarks.ts:128,161,276`).
- `node_ids` filtering on `read_file` matches `_bk_*` (`packages/docx-mcp/src/tools/read_file.ts:328`) — an ID-shape change would require redefining that too.

This issue avoids all of that by staying additive on the existing JSON output.

## Acceptance criteria

- `read_file(format="json", include_fingerprint=true)` remains backward-compatible.
- `read_file(format="json", include_fingerprint=true, include_fingerprint_ordinal=true)` adds the two new metadata fields (and optionally `portable_paragraph_ref`) while leaving `id` unchanged.
- `id` continues to be `_bk_*`, and edit tools continue to accept only `_bk_*`.
- The existing fingerprint algorithm remains unchanged (NFKC, Cf/invisible stripping, whitespace collapse, `sha256:nfkc:<32hex>`).
- New tests cover:
  - Unique paragraph → ordinal `1`, count `1`
  - Duplicate normalized text → deterministic ordinals in document order
  - Whitespace-only variants share the same `content_fingerprint` and get distinct ordinals
  - TOON/simple output unchanged regardless of flag
  - Google Docs ignores the new flag (or it's rejected / documented as DOCX-only)

## Out of scope

- Replacing `_bk_*` with `_p_*`
- Making portable references valid edit anchors
- Changing TOON / simple output
- Changing Google Docs output
- Any new normalization algorithm or shorter hash format
- Footnote / endnote / comment fingerprints (separate request if anyone needs them)

## Why this helps downstream consumers

For manual-ingest / citation pipelines, this removes the need to re-implement duplicate grouping while preserving safe-docx's existing "stable edit anchor (`_bk_*`) vs portable content hash (`content_fingerprint`)" split.

Consumers wanting a document-local portable key can store:
- document identifier
- `content_fingerprint`
- `content_fingerprint_ordinal`

Consumers wanting same-text grouping across documents continue using `content_fingerprint` alone.
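As an illustration of the document-local case, a consumer could resolve a stored key against a fresh `read_file` result roughly as follows. This is a sketch under assumptions: `StoredParagraphKey` and `resolveStoredKey` are hypothetical consumer-side names, the paragraph shape models only the proposed fields, and the IDs, fingerprints, and document identifier are placeholders.

```typescript
// Hypothetical shape of a paragraph node with both flags enabled.
interface FingerprintedParagraph {
  id: string;
  content_fingerprint: string;
  content_fingerprint_ordinal: number;
  content_fingerprint_count_in_document: number;
}

// Hypothetical key a consumer stores alongside a database row.
interface StoredParagraphKey {
  documentId: string;
  contentFingerprint: string;
  ordinal: number;
}

// Find the paragraph in a fresh read that matches the stored key, or
// undefined if the text changed or that occurrence no longer exists.
function resolveStoredKey(
  key: StoredParagraphKey,
  paragraphs: FingerprintedParagraph[],
): FingerprintedParagraph | undefined {
  for (const p of paragraphs) {
    if (
      p.content_fingerprint === key.contentFingerprint &&
      p.content_fingerprint_ordinal === key.ordinal
    ) {
      return p;
    }
  }
  return undefined;
}

// Placeholder fresh read: two paragraphs share a fingerprint.
const freshRead: FingerprintedParagraph[] = [
  { id: "_bk_000000000001", content_fingerprint: "sha256:nfkc:5d2e8f1a4c5b7d2e8f1a4c5b7d2e8f1a", content_fingerprint_ordinal: 1, content_fingerprint_count_in_document: 2 },
  { id: "_bk_000000000002", content_fingerprint: "sha256:nfkc:00112233445566778899aabbccddeeff", content_fingerprint_ordinal: 1, content_fingerprint_count_in_document: 1 },
  { id: "_bk_000000000003", content_fingerprint: "sha256:nfkc:5d2e8f1a4c5b7d2e8f1a4c5b7d2e8f1a", content_fingerprint_ordinal: 2, content_fingerprint_count_in_document: 2 },
];

const secondOccurrence = resolveStoredKey(
  { documentId: "doc-42", contentFingerprint: "sha256:nfkc:5d2e8f1a4c5b7d2e8f1a4c5b7d2e8f1a", ordinal: 2 },
  freshRead,
);
```

Because ordinals can shift when duplicate paragraphs are added, removed, or reordered, a cautious consumer might also record `content_fingerprint_count_in_document` at storage time and treat a changed count as a signal to re-verify the match.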

---

*Pre-filing review note: I initially proposed a `bookmark_mode=content_addressable` parameter that would have swapped `_bk_*` for `_p_*` hashes. Dynamic peer review against the safe-docx source (Codex + Gemini, both with direct repo access) found my premise was wrong: `_bk_*` is already stable, `content_fingerprint` already exists, and the ID-swap would have broken every paragraph-ID-consuming tool. This is the rewritten ask: additive duplicate metadata on top of the existing fingerprint surface.*
