Add opt-in include_footnotes to read_file for single-call body + footnotes retrieval

# Add opt-in `include_footnotes` to `read_file` for single-call body + footnotes retrieval

## Problem

Today `read_file` returns the contents of `word/document.xml` — body paragraphs with inline footnote-reference markers `[^N]`. Footnotes themselves live in `word/footnotes.xml` and require a separate `get_footnotes` call. To fully reconstruct a document an agent must:

1. Call `read_file(format="json")` to get body paragraphs with `[^N]` markers.
2. Call `get_footnotes()` to get footnote bodies.
3. Stitch them together, preserving order and any multi-paragraph footnote bodies.

That's fine for "body only, footnotes are noise" use cases. But for full-fidelity ingest (converting a DOCX to canonical Markdown for storage, citation, or downstream search indexing) it takes two calls plus a manual stitch, which is easy to get wrong, especially when a footnote body contains multiple paragraphs or its own inline formatting.
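
To make "easy to get wrong" concrete, here is a minimal sketch of the cross-check a consumer ends up writing after the two calls today. The shapes `BodyParagraph` and `FlatFootnote` and the function names are hypothetical, not part of safe-docx, and the sketch assumes markers are always numeric (`[^1]`, `[^2]`, ...).

```typescript
// Hypothetical shapes for what a consumer holds after the two calls.
interface BodyParagraph { id: string; text: string }
interface FlatFootnote { id: string; text: string }

// Collect every [^N] marker in the body, in reading order (assumes numeric N).
function collectMarkerIds(body: BodyParagraph[]): string[] {
  const ids: string[] = [];
  for (const p of body) {
    const re = /\[\^(\d+)\]/g; // fresh regex per paragraph so lastIndex starts at 0
    let m: RegExpExecArray | null;
    while ((m = re.exec(p.text)) !== null) {
      const id = m[1];
      if (id !== undefined) ids.push(id);
    }
  }
  return ids;
}

// Report markers with no footnote body, and footnote bodies never referenced.
function checkStitch(
  body: BodyParagraph[],
  notes: FlatFootnote[],
): { missingBodies: string[]; unreferenced: string[] } {
  const marked = collectMarkerIds(body);
  const markedSet = new Set(marked);
  const definedSet = new Set(notes.map((n) => n.id));
  return {
    // De-duplicate so a marker used twice is reported once.
    missingBodies: marked.filter((id, i) => !definedSet.has(id) && marked.indexOf(id) === i),
    unreferenced: notes.map((n) => n.id).filter((id) => !markedSet.has(id)),
  };
}
```

Every consumer that stitches by hand needs some version of this check; returning both halves in one response would let the server guarantee the pairing instead.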

**Concrete example**: the NVCA Model Stock Purchase Agreement has 109 footnotes totaling ~43,000 characters of substantive drafting guidance. Those footnotes ARE the document's value for downstream legal-AI consumers — they explain why each section is drafted the way it is. Losing them, or mis-stitching them, means losing the most useful part of the document.

## Proposal

Add an opt-in parameter:

```jsonc
{
  "name": "read_file",
  "input": {
    "path": "/path/to/doc.docx",
    "format": "json",
    "include_footnotes": true  // new; default false
  }
}
```

### Recommended shape — top-level `footnotes` field

Avoid inlining footnote bodies into `content[]` nodes — that would break the 1:1 `content[]` array index invariant that edit tooling relies on. Instead, sidecar them at the top level:

```jsonc
{
  "content": [
    { "id": "_bk_a1b2c3", "text": "This Agreement is made as of...[^1]", ... },
    ...
  ],
  "footnotes": [
    {
      "id": "1",                                 // matches the [^N] in body
      "ref_paragraph_ids": ["_bk_a1b2c3"],       // ARRAY — usually 1 elt, but DOCX has been
                                                 // observed (illegally) reusing the same
                                                 // footnote ID from multiple paragraphs
      "paragraphs": [                            // multi-paragraph footnote support
        { "text": "See generally Smith v. Jones, ...", "style": "FootnoteText" },
        { "text": "Continuation paragraph.", "style": "FootnoteText" }
      ]
    },
    ...
  ]
}
```

When `include_footnotes=false` (or absent), output is byte-identical to today's `read_file`.
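
For consumers, the proposed shape could be typed roughly as follows. This is a sketch transcribed from the JSON above, not a published type: fields elided with `...` in the proposal are not modelled, and `footnotesForParagraph` is a hypothetical helper.

```typescript
// Proposed response shape, transcribed from the JSON example in this issue.
interface FootnoteParagraph { text: string; style: string }

interface FootnoteEntry {
  id: string;                   // matches the [^N] marker in the body
  ref_paragraph_ids: string[];  // usually one element
  paragraphs: FootnoteParagraph[];
}

interface ReadFileJson {
  content: { id: string; text: string }[];
  footnotes?: FootnoteEntry[];  // absent unless include_footnotes=true
}

// Hypothetical helper: footnotes referenced from one body paragraph,
// resolved through the sidecar array rather than by touching content[].
function footnotesForParagraph(doc: ReadFileJson, paragraphId: string): FootnoteEntry[] {
  return (doc.footnotes ?? []).filter(
    (f) => f.ref_paragraph_ids.indexOf(paragraphId) !== -1,
  );
}
```

Because `content[]` is untouched, index-based edit tooling sees the same array with or without the flag.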

### `toon` format support

`renderToonWithCommentEndnotes` (`packages/docx-core/src/primitives/document_view.ts:897`) already supports sidecar blocks for comments via `#COMMENTS`. A symmetric `#FOOTNOTES` block at the end of the toon document is the obvious shape — same conceptual model as comments, no new invariant.

## Why opt-in

Footnote bodies can double or triple response size. For NVCA SPA: ~16k body words → ~25k more with footnotes. Consumers who only want body shouldn't pay for that.

## Architectural implication for `docx-core` (called out explicitly)

The current `Footnote` type in `packages/docx-core/src/primitives/footnotes.ts:42` is:

```typescript
{ id, displayNumber, text: string, anchoredParagraphId }
```

— a flat `text: string` joined with `\n` (`extractFootnoteText` at `footnotes.ts:284-312`). It does **not** preserve multi-paragraph structure or internal formatting.

Implementing this issue's `paragraphs: [...]` shape **requires upgrading the core `Footnote` model** in `docx-core` to retain paragraph node structure and run-level formatting, not just flat text. That's a model change, not just an output-shape change. Calling this out explicitly so the maintainer knows the scope before accepting.
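
One possible direction for the upgraded model, sketched under assumptions: the names `FootnoteRun`, `FootnoteParagraphNode`, and `StructuredFootnote` and the run-level fields are illustrative only, and `displayNumber` and the anchor field are omitted for brevity. The point is that the flat `\n`-joined text can still be derived from the structured form, so existing callers need not break.

```typescript
// Hypothetical structured model; field names are illustrative, not docx-core's.
interface FootnoteRun { text: string; bold?: boolean; italic?: boolean }
interface FootnoteParagraphNode { style?: string; runs: FootnoteRun[] }
interface StructuredFootnote { id: string; paragraphs: FootnoteParagraphNode[] }

// Concatenate the runs of one paragraph, dropping formatting.
function paragraphText(p: FootnoteParagraphNode): string {
  return p.runs.map((r) => r.text).join("");
}

// Derive the legacy flat text: paragraphs joined with "\n", as today.
function flattenFootnoteText(fn: StructuredFootnote): string {
  return fn.paragraphs.map(paragraphText).join("\n");
}
```

Deriving `text` from `paragraphs` keeps one source of truth; the reverse direction is lossy, which is why this is a model change rather than an output-shape change.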

The paragraph linkage (today's single-valued `anchoredParagraphId`) is already computed: `getFootnotes` at `footnotes.ts:234-250` walks up to the containing `<w:p>` via `getParagraphBookmarkId`. It stores only the first reference, however. Making `ref_paragraph_ids` plural is a small change in `anchorMap`.
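
A minimal sketch of the plural version, assuming the references have already been collected in document order. `FootnoteRef` and `buildAnchorMap` are hypothetical names, and the real `anchorMap` in `footnotes.ts` may be structured differently.

```typescript
// Hypothetical input: one record per footnote reference found in the body.
interface FootnoteRef { footnoteId: string; paragraphId: string }

// Map each footnote ID to every paragraph that references it,
// preserving first-seen order and skipping duplicates.
function buildAnchorMap(refs: FootnoteRef[]): Map<string, string[]> {
  const anchorMap = new Map<string, string[]>();
  for (const ref of refs) {
    const ids = anchorMap.get(ref.footnoteId) ?? [];
    if (ids.indexOf(ref.paragraphId) === -1) ids.push(ref.paragraphId);
    anchorMap.set(ref.footnoteId, ids);
  }
  return anchorMap;
}
```

The first element of each array is the same value the single-valued field would have held, so a caller that only wants the first anchor can read index 0.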

## Acceptance criteria

- `read_file(format="json")` with no flag or `include_footnotes=false` → output byte-identical to today.
- `read_file(format="json", include_footnotes=true)` → adds a top-level `footnotes` array in the shape above.
- Core `Footnote` model upgraded: multi-paragraph footnote bodies preserved at the same node-level fidelity as body paragraphs.
- Footnote-internal formatting preserved (bold, italic, citation runs).
- `ref_paragraph_ids` is an array — handles the (illegal but observed) case of the same footnote ID referenced from multiple paragraphs.
- `toon` format: when `include_footnotes=true`, emits footnotes in a sidecar `#FOOTNOTES` block at the end, conceptually identical to `#COMMENTS`.
- Tests cover:
  - Zero-footnote document → `footnotes` is an empty array (or omitted)
  - Single-footnote document
  - Multi-paragraph footnote body → `paragraphs[]` has the right shape and count
  - Footnote that itself contains a footnote (legal in OOXML, rare)
  - NVCA SPA-scale fixture (109 footnotes): exits cleanly, all 109 represented
  - `include_footnotes=false` (default) → response is byte-identical to current
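
Several of these cases could share one consistency check over the returned payload. The sketch below is an assumption about how such a test helper might look; `validateFootnoteSidecar` and its types are hypothetical and only cover the fields shown in the proposal.

```typescript
interface ContentNode { id: string; text: string }
interface SidecarFootnote {
  id: string;
  ref_paragraph_ids: string[];
  paragraphs: { text: string }[];
}

// Return human-readable problems; an empty array means the payload is consistent.
function validateFootnoteSidecar(
  content: ContentNode[],
  footnotes: SidecarFootnote[],
): string[] {
  const textById = new Map<string, string>();
  for (const node of content) textById.set(node.id, node.text);

  const problems: string[] = [];
  for (const fn of footnotes) {
    for (const pid of fn.ref_paragraph_ids) {
      const text = textById.get(pid);
      if (text === undefined) {
        // The sidecar points at a paragraph that is not in content[].
        problems.push(`footnote ${fn.id}: unknown paragraph ${pid}`);
      } else if (text.indexOf(`[^${fn.id}]`) === -1) {
        // The paragraph exists but does not carry the matching marker.
        problems.push(`footnote ${fn.id}: marker missing in ${pid}`);
      }
    }
  }
  return problems;
}
```

A fixture test can then assert that the result is empty and that `footnotes.length` equals the expected count (109 for the NVCA fixture).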

## Out of scope

- Endnotes (`word/endnotes.xml`). Separate ticket if needed — model is similar but not identical.
- Comments (`word/comments.xml`). `get_comments` already exists.
- Modifying footnotes via `read_file` — still goes through `add_footnote` / `update_footnote` / `delete_footnote`.
- `grep` searching inside footnote bodies. Probably a good idea behind its own flag; file separately.

## Reference: downstream consumer

legal-context's `/lc:ingest-manual-source` skill is currently doing two-pass extraction (`read_file` + `get_footnotes`) and manually stitching `[^N]` references. With this change it becomes a single-call read plus a render pass that emits GFM footnote definitions.
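
The render pass could be as small as the sketch below. It is not the skill's actual code; `RenderableFootnote`, `renderFootnoteDefinition`, and `renderMarkdown` are hypothetical names, and inline formatting is ignored. The four-space indent on continuation paragraphs follows the usual Markdown footnote convention for keeping later paragraphs inside the definition.

```typescript
// Subset of the proposed footnote shape that the renderer needs.
interface RenderableFootnote { id: string; paragraphs: { text: string }[] }

// Emit one footnote definition. Continuation paragraphs are separated by a
// blank line and indented four spaces so they stay part of the definition.
function renderFootnoteDefinition(fn: RenderableFootnote): string {
  const [first, ...rest] = fn.paragraphs.map((p) => p.text);
  const head = `[^${fn.id}]: ${first ?? ""}`;
  return [head, ...rest.map((t) => `    ${t}`)].join("\n\n");
}

// Body paragraphs first, then all footnote definitions, blank-line separated.
function renderMarkdown(bodyTexts: string[], footnotes: RenderableFootnote[]): string {
  const blocks = bodyTexts.concat(footnotes.map(renderFootnoteDefinition));
  return blocks.join("\n\n");
}
```

With `paragraphs[]` supplied by the server, the renderer never has to guess whether a `\n` inside a flat string was a paragraph break.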

---

*Pre-filing review note: dynamic peer review against the safe-docx source confirmed that (a) `read_file` today does NOT include footnote bodies — only marker suffixes via `collectFootnoteMarkerSuffix` (`packages/docx-mcp/src/tools/read_file.ts:168`); (b) the current `Footnote` type is flat-text (`footnotes.ts:42`), so this issue inherently requires a docx-core model upgrade — called out explicitly above; (c) `ref_paragraph_id` linkage is already computed but is currently single-valued, so the proposal changes it to an array (`ref_paragraph_ids`); (d) toon's existing `#COMMENTS` sidecar block (`document_view.ts:897`) is the right precedent for a symmetric `#FOOTNOTES` block.*

Add opt-in include_footnotes to read_file for single-call body + footnotes retrieval #207
