Add opt-in include_footnotes to read_file for single-call body + footnotes retrieval #207

@stevenobiajulu

Problem

Today read_file returns the contents of word/document.xml — body paragraphs with inline footnote-reference markers [^N]. Footnotes themselves live in word/footnotes.xml and require a separate get_footnotes call. To fully reconstruct a document an agent must:

  1. Call read_file(format="json") to get body paragraphs with [^N] markers.
  2. Call get_footnotes() to get footnote bodies.
  3. Stitch them together, preserving order and any multi-paragraph footnote bodies.

That's fine for "body only, footnotes are noise" use cases. But full-fidelity ingest (converting a DOCX to canonical Markdown for storage, citation, or downstream search indexing) takes two calls plus a manual stitch that is easy to get wrong, especially when a footnote body contains multiple paragraphs or its own inline formatting.

Concrete example: the NVCA Model Stock Purchase Agreement has 109 footnotes totaling ~43,000 characters of substantive drafting guidance. Those footnotes ARE the document's value for downstream legal-AI consumers — they explain why each section is drafted the way it is. Losing them, or mis-stitching them, means losing the most useful part of the document.

Proposal

Add an opt-in parameter:

{
  "name": "read_file",
  "input": {
    "path": "/path/to/doc.docx",
    "format": "json",
    "include_footnotes": true  // new; default false
  }
}

Recommended shape — top-level footnotes field

Avoid inlining footnote bodies into content[] nodes — that would break the 1:1 content[] array index invariant that edit tooling relies on. Instead, sidecar them at the top level:

{
  "content": [
    { "id": "_bk_a1b2c3", "text": "This Agreement is made as of...[^1]", ... },
    ...
  ],
  "footnotes": [
    {
      "id": "1",                                 // matches the [^N] in body
      "ref_paragraph_ids": ["_bk_a1b2c3"],       // ARRAY: usually one element, but DOCX files
                                                 // have been observed (illegally) reusing the
                                                 // same footnote ID from multiple paragraphs
      "paragraphs": [                            // multi-paragraph footnote support
        { "text": "See generally Smith v. Jones, ...", "style": "FootnoteText" },
        { "text": "Continuation paragraph.", "style": "FootnoteText" }
      ]
    },
    ...
  ]
}
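
For consumers, the shape above could be typed roughly as follows. This is a minimal sketch, not the docx-core model: the interface names (FootnoteParagraph, FootnoteEntry, ContentNode, ReadFileResult) and the helper findUnresolvedMarkers are hypothetical, and the sketch assumes body markers always take the form [^N] with a numeric N.

```typescript
// Hypothetical types mirroring the proposed JSON shape (not the real docx-core types).
interface FootnoteParagraph {
  text: string;
  style?: string;
}

interface FootnoteEntry {
  id: string;                  // matches the N in a body marker [^N]
  ref_paragraph_ids: string[]; // usually one element
  paragraphs: FootnoteParagraph[];
}

interface ContentNode {
  id: string;
  text: string;
}

interface ReadFileResult {
  content: ContentNode[];
  footnotes?: FootnoteEntry[]; // only present when include_footnotes=true
}

// Return the marker ids that appear in body text but have no entry in footnotes[].
function findUnresolvedMarkers(result: ReadFileResult): string[] {
  const known = new Set((result.footnotes ?? []).map((f) => f.id));
  const missing = new Set<string>();
  for (const node of result.content) {
    const marker = /\[\^(\d+)\]/g; // fresh regex per node so lastIndex starts at 0
    let match: RegExpExecArray | null;
    while ((match = marker.exec(node.text)) !== null) {
      if (!known.has(match[1])) missing.add(match[1]);
    }
  }
  return Array.from(missing);
}
```

A check like this lets a consumer confirm that the sidecar is complete before rendering, without touching the content[] indices.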

When include_footnotes=false (or absent), output is byte-identical to today's read_file.
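
One way to keep the default output byte-identical is to leave the footnotes key out entirely unless the flag is set, rather than emitting an empty array. The sketch below assumes the response is produced with JSON.stringify; buildResponse and its parameter names are hypothetical, not existing read_file internals.

```typescript
// Hypothetical types for illustration only.
interface BodyResult {
  content: { id: string; text: string }[];
}

interface FootnoteEntry {
  id: string;
  ref_paragraph_ids: string[];
  paragraphs: { text: string; style?: string }[];
}

// Serialize the response; the "footnotes" key is added only on opt-in,
// so the default serialization matches the body-only payload exactly.
function buildResponse(
  body: BodyResult,
  footnotes: FootnoteEntry[],
  includeFootnotes: boolean = false,
): string {
  const payload: Record<string, unknown> = { ...body };
  if (includeFootnotes) {
    payload.footnotes = footnotes;
  }
  return JSON.stringify(payload);
}
```

Because the key is appended after the existing fields, the opt-in response also keeps the existing fields in their current order.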

toon format support

renderToonWithCommentEndnotes (packages/docx-core/src/primitives/document_view.ts:897) already supports sidecar blocks for comments via #COMMENTS. A symmetric #FOOTNOTES block at the end of the toon document is the obvious shape — same conceptual model as comments, no new invariant.

Why opt-in

Footnote bodies can double or triple response size. For NVCA SPA: ~16k body words → ~25k more with footnotes. Consumers who only want body shouldn't pay for that.

Architectural implication for docx-core (called out explicitly)

The current Footnote type in packages/docx-core/src/primitives/footnotes.ts:42 is:

{ id, displayNumber, text: string, anchoredParagraphId }

— a flat text: string joined with \n (extractFootnoteText at footnotes.ts:284-312). It does not preserve multi-paragraph structure or internal formatting.

Implementing this issue's paragraphs: [...] shape requires upgrading the core Footnote model in docx-core to retain paragraph node structure and run-level formatting, not just flat text. That's a model change, not just an output-shape change. Calling this out explicitly so the maintainer knows the scope before accepting.

The paragraph linkage is already computed as anchoredParagraphId (getFootnotes at footnotes.ts:234-250 walks up to the containing <w:p> via getParagraphBookmarkId), though it stores only the first reference. Exposing it as a plural ref_paragraph_ids is a small change in anchorMap.
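
The plural linkage could look something like the sketch below. It is not the existing anchorMap code: FootnoteReference and buildAnchorMap are hypothetical names, and the input is assumed to be a list of (footnote ID, containing paragraph ID) pairs already collected in document order.

```typescript
// Hypothetical input: one entry per footnote reference found in the body.
interface FootnoteReference {
  footnoteId: string;
  paragraphId: string;
}

// Map each footnote ID to every paragraph that references it,
// in document order and without duplicates.
function buildAnchorMap(refs: FootnoteReference[]): Map<string, string[]> {
  const anchorMap = new Map<string, string[]>();
  for (const ref of refs) {
    const ids = anchorMap.get(ref.footnoteId);
    if (ids === undefined) {
      anchorMap.set(ref.footnoteId, [ref.paragraphId]);
    } else if (ids.indexOf(ref.paragraphId) === -1) {
      ids.push(ref.paragraphId);
    }
  }
  return anchorMap;
}
```

A single-valued consumer can still read the first element of each array, which matches the "first reference" behavior described above.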

Acceptance criteria

  • read_file(format="json") with no flag or include_footnotes=false → output byte-identical to today.
  • read_file(format="json", include_footnotes=true) → adds a top-level footnotes array in the shape above.
  • Core Footnote model upgraded: multi-paragraph footnote bodies preserved at the same node-level fidelity as body paragraphs.
  • Footnote-internal formatting preserved (bold, italic, citation runs).
  • ref_paragraph_ids is an array — handles the (illegal but observed) case of the same footnote ID referenced from multiple paragraphs.
  • toon format: when include_footnotes=true, emits footnotes in a sidecar #FOOTNOTES block at the end, conceptually identical to #COMMENTS.
  • Tests cover:
    • Zero-footnote document → footnotes is an empty array (or omitted)
    • Single-footnote document
    • Multi-paragraph footnote body → paragraphs[] has the right shape and count
    • Footnote that itself contains a footnote (legal in OOXML, rare)
    • NVCA SPA-scale fixture (109 footnotes) — exit cleanly, all 109 represented
    • include_footnotes=false (default) → response is byte-identical to current

Out of scope

  • Endnotes (word/endnotes.xml). Separate ticket if needed — model is similar but not identical.
  • Comments (word/comments.xml). get_comments already exists.
  • Modifying footnotes via read_file — still goes through add_footnote / update_footnote / delete_footnote.
  • grep searching inside footnote bodies. Probably a good idea behind its own flag; file separately.

Reference: downstream consumer

legal-context's /lc:ingest-manual-source skill is currently doing two-pass extraction (read_file + get_footnotes) and manually stitching [^N] references. With this change it becomes a single-call read plus a render pass that emits GFM footnote definitions.
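
The render pass could be as small as the sketch below. It is an illustration, not legal-context's actual code: renderFootnoteDefinitions is a hypothetical name, the input type mirrors the proposed footnotes array, and continuation paragraphs are assumed to follow the usual Markdown footnote convention of a blank line plus a four-space indent.

```typescript
// Hypothetical type mirroring one entry of the proposed footnotes array.
interface FootnoteEntry {
  id: string;
  ref_paragraph_ids: string[];
  paragraphs: { text: string; style?: string }[];
}

// Emit one footnote definition per entry. The first paragraph follows the
// "[^id]: " label; later paragraphs are separated by a blank line and
// indented four spaces so they stay attached to the same definition.
function renderFootnoteDefinitions(footnotes: FootnoteEntry[]): string {
  return footnotes
    .map((fn) => {
      const [first, ...rest] = fn.paragraphs.map((p) => p.text);
      const lines = [`[^${fn.id}]: ${first ?? ""}`];
      for (const text of rest) {
        lines.push("", `    ${text}`);
      }
      return lines.join("\n");
    })
    .join("\n\n");
}
```

Since the body text already carries the [^N] markers, appending this output after the body paragraphs is enough to link each marker to its definition.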


Pre-filing review note: dynamic peer review against the safe-docx source confirmed that (a) read_file today does NOT include footnote bodies — only marker suffixes via collectFootnoteMarkerSuffix (packages/docx-mcp/src/tools/read_file.ts:168); (b) the current Footnote type is flat-text (footnotes.ts:42), so this issue inherently requires a docx-core model upgrade — called out explicitly above; (c) ref_paragraph_id linkage is already computed but is currently single-valued, so the proposal changes it to an array (ref_paragraph_ids); (d) toon's existing #COMMENTS sidecar block (document_view.ts:897) is the right precedent for a symmetric #FOOTNOTES block.
