Robust heading detection — parse w:outlineLvl, expand HeadingValue source taxonomy, document precedence
Problem
DocumentViewNode already carries a heading field, and deriveHeading (packages/docx-core/src/primitives/document_view.ts:260) populates it — but the current implementation only matches paragraphs whose style is literally Heading 1 … Heading 6. That covers a small fraction of real-world DOCX in some domains.
Two concrete gaps confirmed against the current code:
-
w:outlineLvl is not parsed at all. Word lets authors mark a body paragraph as outline-level-N without applying a heading style; this is the closest thing in OOXML to "I meant this to be a heading even though I didn't use the heading style." extractParagraphFormatting in packages/docx-core/src/primitives/styles.ts:112 ignores w:outlineLvl today, so paragraphs that rely solely on it are invisible to deriveHeading.
-
No source taxonomy on HeadingValue. Today a downstream consumer can't tell whether a heading was detected via built-in style, list metadata, outline level, or a formatting heuristic — they all look the same. That matters because consumers need to decide how much to trust each path (high confidence for builtin_style, low for formatting_heuristic).
Concrete real-world case: the NVCA Model Stock Purchase Agreement (canonical Series A reference document, publicly distributed by NVCA) uses ZERO Word built-in heading styles. The title SERIES [___] PREFERRED STOCK PURCHASE AGREEMENT is a centered, bold body paragraph. Section labels like 1. Purchase and Sale of Preferred Stock are body paragraphs with manual bold runs. Lawyer-authored DOCX in general is closer to this pattern than to the "use the styles pane" pattern Word's UI assumes.
Proposal
Split into two phases. Phase A is small and deterministic; Phase B is heuristic and opt-in.
Phase A — Deterministic detection improvements (default-on, additive)
-
Parse w:outlineLvl. Add it to ParagraphFormatting (or wherever extractParagraphFormatting returns paragraph-level OOXML properties). Surface as outline_level: number | null.
-
Recognize linked-list-as-heading. When a paragraph belongs to an outline-numbered list whose level is linked to a heading style (w:lvl/w:pStyle referencing a Heading N style), classify it with source: "list_metadata" and the corresponding level. (numbering.ts's ListLevelInfo may need a pStyle field to make this surfaceable cheaply.)
-
Extend HeadingValue with a source field:
export type HeadingValue = {
text: string;
level: number | null;
source:
| "builtin_style" // Heading 1-9, Title, Subtitle, localized equivalents
| "list_metadata" // outline list linked to a heading style
| "outline_level" // w:outlineLvl on a body paragraph
| "formatting_heuristic"; // Phase B only
};
-
Recognize localized built-in style names (en, fr, de, es, ja minimum — Heading 1, Titre 1, Überschrift 1, Título 1, 見出し 1 etc.). Ideally driven by a maintained data file rather than hardcoded in a switch.
-
Document the precedence rules: builtin_style → list_metadata → outline_level → (Phase B) formatting_heuristic. First match wins.
These are deterministic. They should run by default — there's no good reason a consumer would want outline_level ignored if it's set on a body paragraph.
Phase B — Opt-in formatting heuristic
Add include_heading_inference: boolean = false to read_file. When true, run a low-confidence heuristic on paragraphs not classified by Phase A. The heuristic should NOT depend on font size (legal section headers are usually body-size); a better signal in the legal-DOCX corpus is alignment + weight + case:
A paragraph is heuristically a level-1 heading if all of: alignment == center, entire paragraph is bold, text is all-caps (or title-case), not inside a table cell, and followed by at least one non-empty body paragraph.
Mark these source: "formatting_heuristic" so consumers can flag them for review.
Explicitly negative cases the test suite should cover:
TOC 1 / TOC 2 styled entries look like headings — they aren't.
- One-word centered bold paragraphs mid-section (often emphasis, not headings).
- Headings inside table cells (should not be detected even if formatting matches).
Why split into two phases
Phase A is small, deterministic, and arguably should ship alone first — it's a strict expansion of existing capability and consumers can opt in safely. Phase B is opinionated about a heuristic and benefits from running the deterministic path first to see which paragraphs remain unclassified.
If maintainers prefer one issue per phase, happy to split this into two.
Acceptance criteria
Phase A
w:outlineLvl parsed in extractParagraphFormatting; exposed as outline_level on ParagraphFormatting (or equivalent).
HeadingValue.source field added with the four-value enum above.
deriveHeading updated to follow the precedence: builtin_style → list_metadata → outline_level, first-match-wins.
- Localized heading style names (W3-canonical set: en, fr, de, es, ja minimum) recognized; ideally as a maintained data file, not hardcoded.
- Tests cover:
w:outlineLvl=1 on a generic style → detected as level 1, source outline_level
- Numbered list linked to
Heading 2 style → detected as level 2, source list_metadata
- Localized style name (
Titre 1) → detected as level 1, source builtin_style
TOC 1 styled paragraph → NOT classified as a heading
- Nested headings (1 → 2 → 1) produce correct levels in order
Phase B (if shipped together)
include_heading_inference: boolean parameter on read_file; default false.
- When
true, paragraphs not matched by Phase A are evaluated against the alignment+bold+case heuristic.
- Heuristic-matched paragraphs get
source: "formatting_heuristic", level 1 (don't try to infer level 2+ from formatting alone).
- Tests cover:
- Centered + bold + all-caps + followed-by-body → detected only when flag is true
- Centered + bold mid-paragraph (one-word emphasis) → NOT detected
- Centered + bold inside table cell → NOT detected
- NVCA Model SPA fixture (or equivalent contract with zero built-in heading styles): heuristic catches the document title and top-level section labels, not mid-paragraph emphasis
Out of scope
- Auto-promoting heuristic-detected headings to real
Heading N styles
- Heading detection inside text boxes or SDT (Structured Document Tag) content
- TOC generation as a tool
- Headings in
gdocs output (Google Docs has its own heading model)
Reference: downstream consumer
legal-context's /lc:ingest-manual-source skill currently has to fall back to its own heading detection when ingesting documents like the NVCA Model SPA. With Phase A alone, that fallback collapses for any document whose author bothered to set w:outlineLvl. With Phase B, it collapses for the all-formatting-no-styles case too.
Pre-filing review note: dynamic peer review against the safe-docx source confirmed that (a) DocumentViewNode already has a heading field but deriveHeading only matches literal Heading 1-6 styles, (b) w:outlineLvl is genuinely not parsed today, and (c) the existing 1.5x-font-size heuristic I'd proposed is wrong for legal corpora — bold + centered + all-caps is a better signal because legal section labels are usually body-size.
Robust heading detection — parse
w:outlineLvl, expandHeadingValuesource taxonomy, document precedenceProblem
DocumentViewNodealready carries aheadingfield, andderiveHeading(packages/docx-core/src/primitives/document_view.ts:260) populates it — but the current implementation only matches paragraphs whose style is literallyHeading 1…Heading 6. That covers a small fraction of real-world DOCX in some domains.Two concrete gaps confirmed against the current code:
w:outlineLvlis not parsed at all. Word lets authors mark a body paragraph as outline-level-N without applying a heading style; this is the closest thing in OOXML to "I meant this to be a heading even though I didn't use the heading style."extractParagraphFormattinginpackages/docx-core/src/primitives/styles.ts:112ignoresw:outlineLvltoday, so paragraphs that rely solely on it are invisible toderiveHeading.No source taxonomy on
HeadingValue. Today a downstream consumer can't tell whether a heading was detected via built-in style, list metadata, outline level, or a formatting heuristic — they all look the same. That matters because consumers need to decide how much to trust each path (high confidence forbuiltin_style, low forformatting_heuristic).Concrete real-world case: the NVCA Model Stock Purchase Agreement (canonical Series A reference document, publicly distributed by NVCA) uses ZERO Word built-in heading styles. The title
SERIES [___] PREFERRED STOCK PURCHASE AGREEMENTis a centered, bold body paragraph. Section labels like1. Purchase and Sale of Preferred Stockare body paragraphs with manual bold runs. Lawyer-authored DOCX in general is closer to this pattern than to the "use the styles pane" pattern Word's UI assumes.Proposal
Split into two phases. Phase A is small and deterministic; Phase B is heuristic and opt-in.
Phase A — Deterministic detection improvements (default-on, additive)
Parse
w:outlineLvl. Add it toParagraphFormatting(or whereverextractParagraphFormattingreturns paragraph-level OOXML properties). Surface asoutline_level: number | null.Recognize linked-list-as-heading. When a paragraph belongs to an outline-numbered list whose level is linked to a heading style (
w:lvl/w:pStylereferencing aHeading Nstyle), classify it withsource: "list_metadata"and the corresponding level. (numbering.ts'sListLevelInfomay need apStylefield to make this surfaceable cheaply.)Extend
HeadingValuewith asourcefield:Recognize localized built-in style names (en, fr, de, es, ja minimum —
Heading 1,Titre 1,Überschrift 1,Título 1,見出し 1etc.). Ideally driven by a maintained data file rather than hardcoded in a switch.Document the precedence rules:
builtin_style→list_metadata→outline_level→ (Phase B)formatting_heuristic. First match wins.These are deterministic. They should run by default — there's no good reason a consumer would want
outline_levelignored if it's set on a body paragraph.Phase B — Opt-in formatting heuristic
Add
include_heading_inference: boolean = falsetoread_file. Whentrue, run a low-confidence heuristic on paragraphs not classified by Phase A. The heuristic should NOT depend on font size (legal section headers are usually body-size); a better signal in the legal-DOCX corpus is alignment + weight + case:Mark these
source: "formatting_heuristic"so consumers can flag them for review.Explicitly negative cases the test suite should cover:
TOC 1/TOC 2styled entries look like headings — they aren't.Why split into two phases
Phase A is small, deterministic, and arguably should ship alone first — it's a strict expansion of existing capability and consumers can opt in safely. Phase B is opinionated about a heuristic and benefits from running the deterministic path first to see which paragraphs remain unclassified.
If maintainers prefer one issue per phase, happy to split this into two.
Acceptance criteria
Phase A
w:outlineLvlparsed inextractParagraphFormatting; exposed asoutline_levelonParagraphFormatting(or equivalent).HeadingValue.sourcefield added with the four-value enum above.deriveHeadingupdated to follow the precedence:builtin_style→list_metadata→outline_level, first-match-wins.w:outlineLvl=1on a generic style → detected as level 1, sourceoutline_levelHeading 2style → detected as level 2, sourcelist_metadataTitre 1) → detected as level 1, sourcebuiltin_styleTOC 1styled paragraph → NOT classified as a headingPhase B (if shipped together)
include_heading_inference: booleanparameter onread_file; defaultfalse.true, paragraphs not matched by Phase A are evaluated against the alignment+bold+case heuristic.source: "formatting_heuristic", level1(don't try to infer level 2+ from formatting alone).Out of scope
Heading Nstylesgdocsoutput (Google Docs has its own heading model)Reference: downstream consumer
legal-context's
/lc:ingest-manual-sourceskill currently has to fall back to its own heading detection when ingesting documents like the NVCA Model SPA. With Phase A alone, that fallback collapses for any document whose author bothered to setw:outlineLvl. With Phase B, it collapses for the all-formatting-no-styles case too.Pre-filing review note: dynamic peer review against the safe-docx source confirmed that (a)
DocumentViewNodealready has aheadingfield butderiveHeadingonly matches literalHeading 1-6styles, (b)w:outlineLvlis genuinely not parsed today, and (c) the existing 1.5x-font-size heuristic I'd proposed is wrong for legal corpora —bold + centered + all-capsis a better signal because legal section labels are usually body-size.