Proposal: Move HTML Tag Parsing from Regex to Pattern Matching
Motivation
The parser sometimes misparses HTML tags embedded in Markdown. Refactoring it to use pattern matching instead of regex should improve correctness, performance, and maintainability, because the HTML5 specification makes this a closed-world problem: the set of valid tags and attributes is finite and known in advance.
Problem Statement
Current Issues:
- HTML Tag Parsing: faulty parsing of | within HTML tags (see #148).
- Buggy Behavior
Proposed Solution
Refine the existing HTML parsing steps based on the HTML5 specs:
- Tag Identification: Pattern match on HTML tags (e.g., <a>, <div>, <address>).
- Attribute Parsing: Handle attributes based on the tag by pattern matching through up to 10 attribute categories.
- Content Parsing: The content parsing process remains unchanged.
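To make the tag identification step concrete, here is a minimal sketch of what pattern-matched dispatch could look like. The module name TagSketch, the identify/1 function, and the three-tag list are assumptions for illustration only; they are not existing EarmarkParser code, and the real list would cover every HTML5 tag.

```elixir
defmodule TagSketch do
  # Hypothetical subset; the proposal calls for the full HTML5 tag list.
  @tags ~w(a div address)

  # One function clause per tag is generated at compile time, so the
  # dispatch is done by binary pattern matching rather than by a regex.
  # The guard requires a delimiter after the name, so "<a" does not
  # accidentally match the start of "<address>".
  for tag <- @tags do
    def identify(<<"<", unquote(tag), next, rest::binary>>)
        when next in [?>, ?\s, ?/] do
      {:ok, unquote(String.to_atom(tag)), <<next, rest::binary>>}
    end
  end

  # Anything else is not a recognized opening tag.
  def identify(_other), do: :error
end
```

Because the clauses come from a list, adding a tag would only mean extending that list; unknown tags fall through to the final clause.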
Implementation Steps
Phase 1: Specification Mapping & Data Modeling
- Enumerate Valid Tags and Attributes
  - Create a comprehensive list of all valid HTML5 tags (109 total).
  - Categorize attributes into 10 groups (e.g., global, ARIA, data-*).
  - Map element-specific attributes for each tag (e.g., <img> has src, alt, etc.).
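As a rough illustration of how the tag-to-attribute mapping might be modeled, the sketch below classifies an attribute name for a given tag. The module AttrSpecSketch, the classify/2 function, the group names, and the short attribute lists are all assumptions; the real tables would come from the HTML5 data spreadsheet and would cover all 10 groups.

```elixir
defmodule AttrSpecSketch do
  # Hypothetical excerpt of the spec data, not the full HTML5 tables.
  @global ~w(id class style title lang hidden)
  @element_specific %{
    img: ~w(src alt width height),
    a: ~w(href target rel)
  }

  # Classifies an attribute name for a given tag into one of the
  # (illustrative) attribute groups, or :unknown if none applies.
  def classify(tag, name) when is_atom(tag) and is_binary(name) do
    cond do
      name in Map.get(@element_specific, tag, []) -> :element_specific
      name in @global -> :global
      String.starts_with?(name, "aria-") -> :aria
      String.starts_with?(name, "data-") -> :data
      true -> :unknown
    end
  end
end
```

For example, classify(:img, "src") would yield :element_specific, while classify(:div, "src") would yield :unknown, which is the kind of context-aware check the regex approach makes hard.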
- Define Types
  - Use Elixir types to model HTML nodes and attributes:
    @type html_tag :: :div | :a | :img | ...
    @type html_attr :: {atom(), String.t()}
    or something like that.
Phase 2: Parser Implementation
- Tag Parsing: Replace regex with pattern-matched dispatch.
- Attribute Parsing: Context-aware parsing based on the tag.
- Performance Optimizations: Use binary matching (<<c::utf8, rest::binary>>) for zero-copy parsing.
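The binary-matching idea can be sketched for a single double-quoted attribute as follows. The module AttrParseSketch and its functions are hypothetical, and the accepted name characters (lowercase ASCII letters and hyphens) are a simplifying assumption. Instead of accumulating characters, the sketch counts bytes and slices with binary-size, which should generally let the VM hand back sub-binaries of the input rather than copies.

```elixir
defmodule AttrParseSketch do
  # Parses a single name="value" pair from the front of the input.
  # Returns {:ok, {name, value}, rest} or :error.
  def parse_attr(input) when is_binary(input) do
    with {name, <<"=\"", after_quote::binary>>} when name != "" <- take_name(input, 0),
         {:ok, value, rest} <- take_until_quote(after_quote, 0) do
      {:ok, {name, value}, rest}
    else
      _ -> :error
    end
  end

  # Advances while the byte at offset `len` is a name character,
  # then splits the input once at that offset.
  defp take_name(input, len) do
    case input do
      <<_::binary-size(len), c, _::binary>> when c in ?a..?z or c == ?- ->
        take_name(input, len + 1)

      <<name::binary-size(len), rest::binary>> ->
        {name, rest}
    end
  end

  # Scans forward to the closing double quote; :error if there is none.
  defp take_until_quote(input, len) do
    case input do
      <<value::binary-size(len), "\"", rest::binary>> -> {:ok, value, rest}
      <<_::binary-size(len), _, _::binary>> -> take_until_quote(input, len + 1)
      _ -> :error
    end
  end
end
```

Unquoted and single-quoted values, as well as attributes without a value, are deliberately left out; each would be a further clause in a fuller parser.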
Phase 3: Testing and Validation
- Unit Tests
  - Cover all 109 tags and their attributes.
  - Test edge cases like malformed tags, nested tags, and special characters.
- Property-Based Testing
  - Use StreamData to generate valid/invalid HTML snippets and validate parser behavior.
- CI Enhancements
  - Add coverage checks for spec completeness.
  - Benchmark against the current regex implementation to measure performance gains.
- Error Handling
  - Provide clear error messages for invalid markup.
  - Support configurable strictness levels (warn vs. fail-fast).
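One possible shape for the configurable strictness levels is sketched below. Everything here is an assumption made for illustration: the module StrictnessSketch, the check_tag/2 function, the :strict option with its :warn and :fail_fast values, and the message wording.

```elixir
defmodule StrictnessSketch do
  # Hypothetical subset of known tags, for illustration only.
  @known_tags ~w(a div img)

  # Returns {:ok, name, warnings} or {:error, message}.
  # Under :warn (the assumed default) unknown tags pass with a warning;
  # under :fail_fast they produce an error immediately.
  def check_tag(name, opts \\ []) when is_binary(name) do
    strict = Keyword.get(opts, :strict, :warn)

    cond do
      name in @known_tags -> {:ok, name, []}
      strict == :fail_fast -> {:error, "unknown HTML tag <#{name}>"}
      true -> {:ok, name, [{:warning, "unknown HTML tag <#{name}>"}]}
    end
  end
end
```

Returning warnings as data rather than printing them would leave the caller free to decide how to report them.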
Benefits
- Maintainability: [...] behead/2.
- Performance: Efficient parsing with minimal backtracking.
- Correctness: [...]
Next Steps
- Finalize the HTML5 data spreadsheet, i.e. which tags need to look for which attributes.
- Draft the EarmarkParser.HTML module skeleton.