Refactoring HTML parsing implementation #169

@bradhanks

Proposal: Move HTML Tag Parsing from Regex to Pattern Matching

Motivation

We sometimes misparse HTML tags embedded in Markdown. Because the HTML5 specification defines a closed set of tags and attributes (a closed-world problem), refactoring the parser to use pattern matching instead of regex can improve correctness, performance, and maintainability.


Problem Statement

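Current Issues:

  1. HTML Tag Parsing: tags are currently recognized with regexes.
  2. Buggy Behavior: HTML tags embedded in Markdown are sometimes misparsed.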

Proposed Solution

Refine the existing HTML parsing steps based on the HTML5 specs:

  1. Tag Identification: Pattern match on HTML tags (e.g., <a>, <div>, <address>).
  2. Attribute Parsing: Handle attributes according to the tag, pattern matching through up to 10 attribute categories.
  3. Content Parsing: The content parsing process remains unchanged.
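
As a rough illustration of step 1, the sketch below generates one function clause per known tag at compile time, so tag identification becomes ordinary pattern-matched dispatch. The module name `TagDispatchSketch`, the function `identify/1`, and the four-tag list are hypothetical and chosen only for this example; a real implementation would generate the clauses from the full HTML5 tag list.

```elixir
defmodule TagDispatchSketch do
  # Hypothetical, deliberately tiny subset of the HTML5 tag list.
  @tags ~w(a div address img)

  # One clause per tag, generated at compile time. The guard on the byte
  # that follows the tag name keeps "<a" from matching "<address>".
  for tag <- @tags do
    def identify(<<"<", unquote(tag), c, rest::binary>>)
        when c in [?\s, ?\t, ?\n, ?>, ?/] do
      {:ok, unquote(String.to_atom(tag)), <<c, rest::binary>>}
    end
  end

  # Anything else is not a known opening tag.
  def identify(_other), do: :error
end
```

For instance, `TagDispatchSketch.identify("<address>x")` returns `{:ok, :address, ">x"}`, while an unknown tag such as `"<blink>"` returns `:error`. The sketch is case-sensitive and ignores closing tags.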

Implementation Steps

Phase 1: Specification Mapping & Data Modeling

  1. Enumerate Valid Tags and Attributes

    • Create a comprehensive list of all valid HTML5 tags (109 total).
    • Categorize attributes into 10 groups (e.g., global, ARIA, data-*).
    • Map element-specific attributes for each tag (e.g., <img> has src, alt, etc.).
  2. Define Types

    • Use Elixir types to model HTML nodes and attributes:
      @type html_tag :: :div | :a | :img | ...
      @type html_attr :: {atom(), String.t()}

      (A rough sketch; the exact types are still open.)
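
To make the Phase 1 data model more concrete, here is a minimal sketch of how the tag-to-attribute mapping could be stored and queried. Everything in it is an assumption for illustration: the module name `HtmlSpecSketch`, the function `allowed_attr?/2`, the three-tag subset, and the short attribute lists stand in for data that would come from the HTML5 spec. Attribute names are kept as strings here (unlike the `{atom(), String.t()}` idea above) to avoid creating atoms from untrusted input.

```elixir
defmodule HtmlSpecSketch do
  # Hypothetical subset of the tag type; the real union would list every tag.
  @type html_tag :: :a | :div | :img

  # Illustrative global attributes (one of the attribute categories).
  @global_attrs ~w(id class title lang)

  # Illustrative element-specific attributes, keyed by tag.
  @element_attrs %{
    a: ~w(href target rel),
    img: ~w(src alt width height),
    div: []
  }

  @spec allowed_attr?(html_tag(), String.t()) :: boolean()
  # Simplification: any data-* or aria-* name is accepted on any tag.
  def allowed_attr?(_tag, "data-" <> _), do: true
  def allowed_attr?(_tag, "aria-" <> _), do: true

  def allowed_attr?(tag, name) when is_atom(tag) and is_binary(name) do
    name in @global_attrs or name in Map.get(@element_attrs, tag, [])
  end
end
```

With this table, `HtmlSpecSketch.allowed_attr?(:img, "src")` is `true` and `HtmlSpecSketch.allowed_attr?(:div, "src")` is `false`.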

Phase 2: Parser Implementation

  1. Tag Parsing - Replace regex with pattern-matched dispatch
  2. Attribute Parsing - Context-aware parsing based on tag
  3. Performance Optimizations - Use binary matching (<<c::utf8, rest::binary>>) for zero-copy parsing.
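
The binary-matching idea in step 3 could look roughly like the sketch below, which scans a tag name off the front of the input one byte at a time. The module `BinaryScanSketch` and the function `take_name/1` are hypothetical names. Because tag names are ASCII, the sketch matches plain bytes; the `<<c::utf8, rest::binary>>` form would apply where non-ASCII text has to be decoded.

```elixir
defmodule BinaryScanSketch do
  # Returns {name, rest}, where name is the leading run of ASCII
  # letters and digits in `input` (possibly empty).
  def take_name(input) when is_binary(input), do: take_name(input, 0, input)

  # Consume one name byte and remember how many bytes were consumed.
  defp take_name(<<c, rest::binary>>, len, original)
       when c in ?a..?z or c in ?A..?Z or c in ?0..?9 do
    take_name(rest, len + 1, original)
  end

  # First non-name byte (or end of input): slice the name out of the
  # original binary instead of building it up piece by piece.
  defp take_name(rest, len, original) do
    {binary_part(original, 0, len), rest}
  end
end
```

For example, `BinaryScanSketch.take_name("div class=\"x\">")` returns `{"div", " class=\"x\">"}`. Counting bytes and slicing once at the end avoids building an intermediate list or string for the name.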

Phase 3: Testing and Validation

  1. Unit Tests

    • Cover all 109 tags and their attributes.
    • Test edge cases like malformed tags, nested tags, and special characters.
  2. Property-Based Testing

    • Use StreamData to generate valid/invalid HTML snippets and validate parser behavior.
  3. CI Enhancements

    • Add coverage checks for spec completeness.
    • Benchmark against the current regex implementation to measure performance gains.
  4. Error Handling

    • Provide clear error messages for invalid markup.
    • Support configurable strictness levels (warn vs. fail-fast).
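
One possible shape for the configurable strictness levels is sketched below. The module `StrictnessSketch`, the function `check_tags/2`, the `:warn` and `:fail_fast` atoms, the return tuples, and the message format are all assumptions made for this example, not an agreed API.

```elixir
defmodule StrictnessSketch do
  # Hypothetical subset of known tag names.
  @known_tags ~w(a div img)

  # Default to the lenient mode.
  def check_tags(tags, strictness \\ :warn)

  # :fail_fast stops at the first unknown tag and reports it as an error.
  def check_tags(tags, :fail_fast) do
    case Enum.find(tags, fn tag -> tag not in @known_tags end) do
      nil -> {:ok, tags, []}
      bad -> {:error, "unknown HTML tag <#{bad}>"}
    end
  end

  # :warn keeps the known tags and collects one message per unknown tag.
  def check_tags(tags, :warn) do
    {good, bad} = Enum.split_with(tags, fn tag -> tag in @known_tags end)
    {:ok, good, Enum.map(bad, fn tag -> "unknown HTML tag <#{tag}>" end)}
  end
end
```

Calling `StrictnessSketch.check_tags(["a", "blink", "div"])` returns `{:ok, ["a", "div"], ["unknown HTML tag <blink>"]}`, while passing `:fail_fast` as the second argument returns `{:error, "unknown HTML tag <blink>"}`.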

Benefits

  • Improved correctness, performance, and maintainability, as outlined under Motivation.

Next Steps

  • Finalize the HTML5 data spreadsheet (which tags need to look for which attributes).
  • Draft the EarmarkParser.HTML module skeleton.
