Refactoring HTML parsing implementation #169

# Proposal: Move HTML Tag Parsing from Regex to Pattern Matching

## Motivation
The parser sometimes misparses HTML tags embedded in Markdown. The [HTML5 specification](https://html.spec.whatwg.org/#elements-2) defines a closed set of elements and attributes, so this is a closed-world problem. That lets us refactor the parser to use **pattern matching** instead of regex, which should improve correctness, performance, and maintainability.

---

## Problem Statement
### Current Issues:
1. **HTML Tag Parsing**
   - Faulty parsing of nested HTML tags (#161).
   - Inconsistent handling of attributes (#102, #7).
   - Poor handling of special characters in attributes (#139, #144).
2. **Buggy Behavior**
   - Incorrect parsing of `|` within HTML tags (#148).
   - Spaces in front of HTML tags not allowed (#102).
   - Inline Attribute Lists (IAL) removed from fenced code blocks (#94).

## Proposed Solution
Refine the existing HTML parsing steps based on the [HTML5 spec](https://html.spec.whatwg.org/#elements-2):
1. **Tag Identification**: Pattern match on known HTML tags (e.g., `<a>`, `<div>`, `<address>`).
2. **Attribute Parsing**: Parse attributes according to the tag, pattern matching against up to [10 attribute categories](https://gist.github.com/bradhanks/ec078a7f1c1bbd442e40993581b6b8f4#file-gistfile1-txt).
3. **Content Parsing**: Unchanged from the current implementation.

## Implementation Steps
### Phase 1: Specification Mapping & Data Modeling
1. **Enumerate Valid Tags and Attributes**  
   - Create a comprehensive list of all valid HTML5 tags (109 total).  
   - Categorize attributes into 10 groups (e.g., global, ARIA, data-*).  
   - Map element-specific attributes for each tag (e.g., `<img>` has `src`, `alt`, etc.).  

2. **Define Types**  
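   - Use Elixir types to model HTML nodes and attributes, for example:  
     ```elixir
     # One atom per valid HTML5 tag; the list is abbreviated here.
     @type html_tag :: :div | :a | :img | ...
     # An attribute is a name/value pair.
     @type html_attr :: {atom(), String.t()}
     ```
     The exact shape of these types is still open for discussion.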
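
As a rough illustration of how the Phase 1 data could be stored and queried, here is a minimal sketch. The module name `HtmlSpecSketch`, the function names, and the three-tag data set are all assumptions made for this example; the real lists would come from the HTML5 spec data.

```elixir
defmodule HtmlSpecSketch do
  @moduledoc false
  # Hypothetical module: the tags and attributes below are a tiny
  # illustrative subset, not the full HTML5 data set.

  @type html_tag :: :a | :div | :img
  @type html_attr :: {atom(), String.t()}

  # Attributes allowed on every element (one of the attribute groups).
  @global_attrs ~w(class id title)a

  # Element-specific attributes, keyed by tag.
  @element_attrs %{
    a: ~w(href target)a,
    div: [],
    img: ~w(src alt)a
  }

  @doc "Returns true when `tag` is part of the known tag set."
  @spec known_tag?(atom()) :: boolean()
  def known_tag?(tag), do: Map.has_key?(@element_attrs, tag)

  @doc "Returns true when `attr` is global or specific to `tag`."
  @spec valid_attr?(atom(), atom()) :: boolean()
  def valid_attr?(tag, attr) do
    attr in @global_attrs or attr in Map.get(@element_attrs, tag, [])
  end
end
```

With this shape, `HtmlSpecSketch.valid_attr?(:img, :src)` is true while `HtmlSpecSketch.valid_attr?(:div, :href)` is false, and adding a tag means adding one map entry.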

### Phase 2: Parser Implementation
1. **Tag Parsing**  
   - Replace regex with pattern-matched dispatch.  

2. **Attribute Parsing**  
   - Context-aware parsing based on the tag.  

3. **Performance Optimizations**  
   - Use binary matching (`<<c::utf8, rest::binary>>`) for zero-copy parsing.  
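
One possible shape for the pattern-matched dispatch is sketched below. It generates one function clause per known tag at compile time and matches directly on the binary. The module name `TagDispatchSketch`, the function `open_tag/1`, and the return tuples are assumptions for illustration only; the sketch handles lowercase tag names and ignores attributes entirely.

```elixir
defmodule TagDispatchSketch do
  # Hypothetical module; only three tags are listed to keep the sketch short.
  @tags ~w(a div img)

  # One clause per known tag, generated at compile time. Longer names are
  # emitted first so that a short name such as "a" is never tried before a
  # longer name that starts with the same letters.
  for name <- Enum.sort_by(@tags, &byte_size/1, :desc) do
    atom = String.to_atom(name)

    def open_tag("<" <> unquote(name) <> rest) do
      after_name(unquote(atom), rest)
    end
  end

  # Catch-all: anything else is not a known opening tag.
  def open_tag(_other), do: :error

  # The tag name must be followed by whitespace, ">" or "/";
  # otherwise the input only shares a prefix with a known tag.
  defp after_name(tag, <<c, _::binary>> = rest)
       when c in [?\s, ?\t, ?\n, ?>, ?/] do
    {:ok, tag, rest}
  end

  defp after_name(_tag, _rest), do: :error
end
```

Here `TagDispatchSketch.open_tag("<img/>")` returns `{:ok, :img, "/>"}`, while `"<address>"` returns `:error` because `address` is not in this sketch's tag list and the `a` clause fails the boundary check.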

### Phase 3: Testing and Validation
1. **Unit Tests**  
   - Cover all 109 tags and their attributes.  
   - Test edge cases like malformed tags, nested tags, and special characters.

2. **Property-Based Testing**  
   - Use StreamData to generate valid/invalid HTML snippets and validate parser behavior.

3. **CI Enhancements**  
   - Add coverage checks for spec completeness.  
   - Benchmark against the current regex implementation to measure performance gains.

4. **Error Handling**  
   - Provide clear error messages for invalid markup.  
   - Support configurable strictness levels (warn vs. fail-fast).
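
The configurable strictness could look roughly like the sketch below. The `:strict` option name, the `check_tag/2` function, and the return shapes are assumptions made for this example, not an existing EarmarkParser API.

```elixir
defmodule StrictnessSketch do
  # Hypothetical API: the option name and return shapes are assumptions.
  @known_tags ~w(a div img)a

  @doc """
  Checks `tag` against the known set.

  In the default (warn) mode an unknown tag is accepted and a message is
  returned alongside it; with `strict: true` the check fails fast.
  """
  @spec check_tag(atom(), keyword()) ::
          {:ok, atom(), [String.t()]} | {:error, String.t()}
  def check_tag(tag, opts \\ []) do
    cond do
      tag in @known_tags -> {:ok, tag, []}
      Keyword.get(opts, :strict, false) -> {:error, message(tag)}
      true -> {:ok, tag, [message(tag)]}
    end
  end

  defp message(tag), do: "unknown HTML tag <#{tag}>"
end
```

Returning warnings as data rather than logging them keeps the function pure, so the caller decides how to surface them.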

## Benefits
- **Maintainability**:  
  1. Easy to extend for new HTML tags or attributes.  
  2. Straightforward to add custom global prefixes.  
  3. Eliminate helper functions like `behead/2`.  
  4. No need to be a regex wizard.

- **Performance**: Efficient parsing with minimal backtracking.

- **Correctness**:  
  1. Addresses issues such as nested HTML tags (#161) and attribute parsing (#102).  
  2. Retains a catch-all mechanism without relying on regex.

---

## Next Steps
- Finalize the HTML5 data spreadsheet (which tags need to look for which tags).  
- Draft the `EarmarkParser.HTML` module skeleton.  
