Robust heading detection: parse w:outlineLvl, expand HeadingValue source taxonomy, document precedence

# Robust heading detection — parse `w:outlineLvl`, expand `HeadingValue` source taxonomy, document precedence

## Problem

`DocumentViewNode` already carries a `heading` field, and `deriveHeading` (`packages/docx-core/src/primitives/document_view.ts:260`) populates it — but the current implementation only matches paragraphs whose style is literally `Heading 1` … `Heading 6`. That covers a small fraction of real-world DOCX in some domains.

Two concrete gaps confirmed against the current code:

1. **`w:outlineLvl` is not parsed at all.** Word lets authors mark a body paragraph as outline-level-N without applying a heading style; this is the closest thing in OOXML to "I meant this to be a heading even though I didn't use the heading style." `extractParagraphFormatting` in `packages/docx-core/src/primitives/styles.ts:112` ignores `w:outlineLvl` today, so paragraphs that rely solely on it are invisible to `deriveHeading`.

2. **No source taxonomy on `HeadingValue`.** Today a downstream consumer can't tell whether a heading was detected via built-in style, list metadata, outline level, or a formatting heuristic — they all look the same. That matters because consumers need to decide how much to trust each path (high confidence for `builtin_style`, low for `formatting_heuristic`).

**Concrete real-world case**: the NVCA Model Stock Purchase Agreement (canonical Series A reference document, publicly distributed by NVCA) uses ZERO Word built-in heading styles. The title `SERIES [___] PREFERRED STOCK PURCHASE AGREEMENT` is a centered, bold body paragraph. Section labels like `1. Purchase and Sale of Preferred Stock` are body paragraphs with manual bold runs. Lawyer-authored DOCX in general is closer to this pattern than to the "use the styles pane" pattern Word's UI assumes.

## Proposal

Split into two phases. Phase A is small and deterministic; Phase B is heuristic and opt-in.

### Phase A — Deterministic detection improvements (default-on, additive)

1. **Parse `w:outlineLvl`.** Add it to `ParagraphFormatting` (or wherever `extractParagraphFormatting` returns paragraph-level OOXML properties). Surface as `outline_level: number | null`.

2. **Recognize linked-list-as-heading.** When a paragraph belongs to an outline-numbered list whose level is linked to a heading style (`w:lvl/w:pStyle` referencing a `Heading N` style), classify it with `source: "list_metadata"` and the corresponding level. (`numbering.ts`'s `ListLevelInfo` may need a `pStyle` field to make this surfaceable cheaply.)

3. **Extend `HeadingValue` with a `source` field**:

   ```typescript
   export type HeadingValue = {
     text: string;
     level: number | null;
     source:
       | "builtin_style"        // Heading 1-9, Title, Subtitle, localized equivalents
       | "list_metadata"         // outline list linked to a heading style
       | "outline_level"         // w:outlineLvl on a body paragraph
       | "formatting_heuristic"; // Phase B only
   };
   ```

4. **Recognize localized built-in style names** (en, fr, de, es, ja minimum — `Heading 1`, `Titre 1`, `Überschrift 1`, `Título 1`, `見出し 1` etc.). Ideally driven by a maintained data file rather than hardcoded in a switch.

5. **Document the precedence rules**: `builtin_style` → `list_metadata` → `outline_level` → (Phase B) `formatting_heuristic`. First match wins.

These are deterministic. They should run by default — there's no good reason a consumer would want `outline_level` ignored if it's set on a body paragraph.

### Phase B — Opt-in formatting heuristic

Add `include_heading_inference: boolean = false` to `read_file`. When `true`, run a low-confidence heuristic on paragraphs not classified by Phase A. The heuristic should NOT depend on font size (legal section headers are usually body-size); a better signal in the legal-DOCX corpus is **alignment + weight + case**:

> A paragraph is heuristically a level-1 heading if all of: `alignment == center`, `entire paragraph is bold`, `text is all-caps (or title-case)`, not inside a table cell, and followed by at least one non-empty body paragraph.

Mark these `source: "formatting_heuristic"` so consumers can flag them for review.

Explicitly negative cases the test suite should cover:
- `TOC 1` / `TOC 2` styled entries look like headings — they aren't.
- One-word centered bold paragraphs mid-section (often emphasis, not headings).
- Headings inside table cells (should not be detected even if formatting matches).

## Why split into two phases

Phase A is small, deterministic, and arguably should ship alone first — it's a strict expansion of existing capability and consumers can opt in safely. Phase B is opinionated about a heuristic and benefits from running the deterministic path first to see which paragraphs remain unclassified.

If maintainers prefer one issue per phase, happy to split this into two.

## Acceptance criteria

### Phase A

- `w:outlineLvl` parsed in `extractParagraphFormatting`; exposed as `outline_level` on `ParagraphFormatting` (or equivalent).
- `HeadingValue.source` field added with the four-value enum above.
- `deriveHeading` updated to follow the precedence: `builtin_style` → `list_metadata` → `outline_level`, first-match-wins.
- Localized heading style names (W3-canonical set: en, fr, de, es, ja minimum) recognized; ideally as a maintained data file, not hardcoded.
- Tests cover:
  - `w:outlineLvl=1` on a generic style → detected as level 1, source `outline_level`
  - Numbered list linked to `Heading 2` style → detected as level 2, source `list_metadata`
  - Localized style name (`Titre 1`) → detected as level 1, source `builtin_style`
  - `TOC 1` styled paragraph → NOT classified as a heading
  - Nested headings (1 → 2 → 1) produce correct levels in order

### Phase B (if shipped together)

- `include_heading_inference: boolean` parameter on `read_file`; default `false`.
- When `true`, paragraphs not matched by Phase A are evaluated against the alignment+bold+case heuristic.
- Heuristic-matched paragraphs get `source: "formatting_heuristic"`, level `1` (don't try to infer level 2+ from formatting alone).
- Tests cover:
  - Centered + bold + all-caps + followed-by-body → detected only when flag is true
  - Centered + bold mid-paragraph (one-word emphasis) → NOT detected
  - Centered + bold inside table cell → NOT detected
  - NVCA Model SPA fixture (or equivalent contract with zero built-in heading styles): heuristic catches the document title and top-level section labels, not mid-paragraph emphasis

## Out of scope

- Auto-promoting heuristic-detected headings to real `Heading N` styles
- Heading detection inside text boxes or SDT (Structured Document Tag) content
- TOC generation as a tool
- Headings in `gdocs` output (Google Docs has its own heading model)

## Reference: downstream consumer

legal-context's `/lc:ingest-manual-source` skill currently has to fall back to its own heading detection when ingesting documents like the NVCA Model SPA. With Phase A alone, that fallback collapses for any document whose author bothered to set `w:outlineLvl`. With Phase B, it collapses for the all-formatting-no-styles case too.

---

*Pre-filing review note: dynamic peer review against the safe-docx source confirmed that (a) `DocumentViewNode` already has a `heading` field but `deriveHeading` only matches literal `Heading 1-6` styles, (b) `w:outlineLvl` is genuinely not parsed today, and (c) the existing 1.5x-font-size heuristic I'd proposed is wrong for legal corpora — `bold + centered + all-caps` is a better signal because legal section labels are usually body-size.*

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Robust heading detection: parse w:outlineLvl, expand HeadingValue source taxonomy, document precedence #206

Robust heading detection — parse `w:outlineLvl`, expand `HeadingValue` source taxonomy, document precedence

Problem

Proposal

Phase A — Deterministic detection improvements (default-on, additive)

Phase B — Opt-in formatting heuristic

Why split into two phases

Acceptance criteria

Phase A

Phase B (if shipped together)

Out of scope

Reference: downstream consumer

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Robust heading detection: parse w:outlineLvl, expand HeadingValue source taxonomy, document precedence #206

Description

Robust heading detection — parse w:outlineLvl, expand HeadingValue source taxonomy, document precedence

Problem

Proposal

Phase A — Deterministic detection improvements (default-on, additive)

Phase B — Opt-in formatting heuristic

Why split into two phases

Acceptance criteria

Phase A

Phase B (if shipped together)

Out of scope

Reference: downstream consumer

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

Robust heading detection — parse `w:outlineLvl`, expand `HeadingValue` source taxonomy, document precedence