Skip to content

Robust heading detection: parse w:outlineLvl, expand HeadingValue source taxonomy, document precedence #206

@stevenobiajulu

Description

@stevenobiajulu

Robust heading detection — parse w:outlineLvl, expand HeadingValue source taxonomy, document precedence

Problem

DocumentViewNode already carries a heading field, and deriveHeading (packages/docx-core/src/primitives/document_view.ts:260) populates it — but the current implementation only matches paragraphs whose style is literally Heading 1Heading 6. That covers a small fraction of real-world DOCX in some domains.

Two concrete gaps confirmed against the current code:

  1. w:outlineLvl is not parsed at all. Word lets authors mark a body paragraph as outline-level-N without applying a heading style; this is the closest thing in OOXML to "I meant this to be a heading even though I didn't use the heading style." extractParagraphFormatting in packages/docx-core/src/primitives/styles.ts:112 ignores w:outlineLvl today, so paragraphs that rely solely on it are invisible to deriveHeading.

  2. No source taxonomy on HeadingValue. Today a downstream consumer can't tell whether a heading was detected via built-in style, list metadata, outline level, or a formatting heuristic — they all look the same. That matters because consumers need to decide how much to trust each path (high confidence for builtin_style, low for formatting_heuristic).

Concrete real-world case: the NVCA Model Stock Purchase Agreement (canonical Series A reference document, publicly distributed by NVCA) uses ZERO Word built-in heading styles. The title SERIES [___] PREFERRED STOCK PURCHASE AGREEMENT is a centered, bold body paragraph. Section labels like 1. Purchase and Sale of Preferred Stock are body paragraphs with manual bold runs. Lawyer-authored DOCX in general is closer to this pattern than to the "use the styles pane" pattern Word's UI assumes.

Proposal

Split into two phases. Phase A is small and deterministic; Phase B is heuristic and opt-in.

Phase A — Deterministic detection improvements (default-on, additive)

  1. Parse w:outlineLvl. Add it to ParagraphFormatting (or wherever extractParagraphFormatting returns paragraph-level OOXML properties). Surface as outline_level: number | null.

  2. Recognize linked-list-as-heading. When a paragraph belongs to an outline-numbered list whose level is linked to a heading style (w:lvl/w:pStyle referencing a Heading N style), classify it with source: "list_metadata" and the corresponding level. (numbering.ts's ListLevelInfo may need a pStyle field to make this surfaceable cheaply.)

  3. Extend HeadingValue with a source field:

    export type HeadingValue = {
      text: string;
      level: number | null;
      source:
        | "builtin_style"        // Heading 1-9, Title, Subtitle, localized equivalents
        | "list_metadata"         // outline list linked to a heading style
        | "outline_level"         // w:outlineLvl on a body paragraph
        | "formatting_heuristic"; // Phase B only
    };
  4. Recognize localized built-in style names (en, fr, de, es, ja minimum — Heading 1, Titre 1, Überschrift 1, Título 1, 見出し 1 etc.). Ideally driven by a maintained data file rather than hardcoded in a switch.

  5. Document the precedence rules: builtin_stylelist_metadataoutline_level → (Phase B) formatting_heuristic. First match wins.

These are deterministic. They should run by default — there's no good reason a consumer would want outline_level ignored if it's set on a body paragraph.

Phase B — Opt-in formatting heuristic

Add include_heading_inference: boolean = false to read_file. When true, run a low-confidence heuristic on paragraphs not classified by Phase A. The heuristic should NOT depend on font size (legal section headers are usually body-size); a better signal in the legal-DOCX corpus is alignment + weight + case:

A paragraph is heuristically a level-1 heading if all of: alignment == center, entire paragraph is bold, text is all-caps (or title-case), not inside a table cell, and followed by at least one non-empty body paragraph.

Mark these source: "formatting_heuristic" so consumers can flag them for review.

Explicitly negative cases the test suite should cover:

  • TOC 1 / TOC 2 styled entries look like headings — they aren't.
  • One-word centered bold paragraphs mid-section (often emphasis, not headings).
  • Headings inside table cells (should not be detected even if formatting matches).

Why split into two phases

Phase A is small, deterministic, and arguably should ship alone first — it's a strict expansion of existing capability and consumers can opt in safely. Phase B is opinionated about a heuristic and benefits from running the deterministic path first to see which paragraphs remain unclassified.

If maintainers prefer one issue per phase, happy to split this into two.

Acceptance criteria

Phase A

  • w:outlineLvl parsed in extractParagraphFormatting; exposed as outline_level on ParagraphFormatting (or equivalent).
  • HeadingValue.source field added with the four-value enum above.
  • deriveHeading updated to follow the precedence: builtin_stylelist_metadataoutline_level, first-match-wins.
  • Localized heading style names (W3-canonical set: en, fr, de, es, ja minimum) recognized; ideally as a maintained data file, not hardcoded.
  • Tests cover:
    • w:outlineLvl=1 on a generic style → detected as level 1, source outline_level
    • Numbered list linked to Heading 2 style → detected as level 2, source list_metadata
    • Localized style name (Titre 1) → detected as level 1, source builtin_style
    • TOC 1 styled paragraph → NOT classified as a heading
    • Nested headings (1 → 2 → 1) produce correct levels in order

Phase B (if shipped together)

  • include_heading_inference: boolean parameter on read_file; default false.
  • When true, paragraphs not matched by Phase A are evaluated against the alignment+bold+case heuristic.
  • Heuristic-matched paragraphs get source: "formatting_heuristic", level 1 (don't try to infer level 2+ from formatting alone).
  • Tests cover:
    • Centered + bold + all-caps + followed-by-body → detected only when flag is true
    • Centered + bold mid-paragraph (one-word emphasis) → NOT detected
    • Centered + bold inside table cell → NOT detected
    • NVCA Model SPA fixture (or equivalent contract with zero built-in heading styles): heuristic catches the document title and top-level section labels, not mid-paragraph emphasis

Out of scope

  • Auto-promoting heuristic-detected headings to real Heading N styles
  • Heading detection inside text boxes or SDT (Structured Document Tag) content
  • TOC generation as a tool
  • Headings in gdocs output (Google Docs has its own heading model)

Reference: downstream consumer

legal-context's /lc:ingest-manual-source skill currently has to fall back to its own heading detection when ingesting documents like the NVCA Model SPA. With Phase A alone, that fallback collapses for any document whose author bothered to set w:outlineLvl. With Phase B, it collapses for the all-formatting-no-styles case too.


Pre-filing review note: dynamic peer review against the safe-docx source confirmed that (a) DocumentViewNode already has a heading field but deriveHeading only matches literal Heading 1-6 styles, (b) w:outlineLvl is genuinely not parsed today, and (c) the existing 1.5x-font-size heuristic I'd proposed is wrong for legal corpora — bold + centered + all-caps is a better signal because legal section labels are usually body-size.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions