Change Request: Using `cheerio` library instead of manual regex for handling HTML

### Environment

ESLint version: HEAD
@eslint/markdown version: HEAD
Node version: 20.18.0
npm version: 10.9.2
Operating System: Windows 11


### What problem do you want to solve?

@eslint/eslint-team 

Hi team 😄 

I’d like to suggest using the [`cheerio`](https://cheerio.js.org/) library instead of manual regex for handling HTML, as it would provide a more robust and reliable approach.

(Handling HTML directly may not be the main focus for the team, since the `prelint` feature is currently under RFC. However, HTML is a standard feature of CommonMark, and we already have a lot of logic built around it. That’s why I believe this change is necessary.)

---

While working on the `@eslint/markdown` repository, I’ve seen a lot of regex-related fixes. Since Markdown is mostly natural-language text, regex is sometimes unavoidable.

However, lately many fixes have focused on false positives or negatives with HTML nodes, as seen in these issues and PRs:

- Ongoing:
    - https://github.com/eslint/markdown/issues/481
    - https://github.com/eslint/markdown/issues/464
    - https://github.com/eslint/markdown/pull/480
    - https://github.com/eslint/markdown/pull/468
- Merged:
    - https://github.com/eslint/markdown/pull/463
    - https://github.com/eslint/markdown/pull/465
    - https://github.com/eslint/markdown/pull/385

Manual regex for HTML nodes worked fine initially, but as the project has grown and more users are adopting this plugin, more problems are cropping up. Some examples:

- False positives and negatives
- Inconsistent HTML node handling across rules
- Potential security issues (ReDoS)
- Increased maintenance cost for reviewing regex-related issues and writing robust patterns
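
To make the false positive/negative point concrete, here is a minimal sketch. The two patterns and the `findImagesMissingAlt` helper are hypothetical and written only for illustration; they are not copied from any rule in the repository.

```typescript
// Hypothetical naive check (NOT the actual @eslint/markdown code):
// report every <img> tag that has no `alt` attribute.
const imgTagPattern = /<img\b[^>]*>/gi;
const altAttributePattern = /\balt\s*=/i;

function findImagesMissingAlt(html: string): string[] {
    const tags: string[] = html.match(imgTagPattern) ?? [];
    return tags.filter(tag => !altAttributePattern.test(tag));
}

// False positive: the ">" inside the quoted title ends the match early,
// so the `alt` attribute that follows it is never seen.
const quoted = findImagesMissingAlt('<img title="a > b" alt="ok">');

// False negative: `\balt` also matches the tail of `data-alt`,
// so an image without a real `alt` attribute is not reported.
const dataAttr = findImagesMissingAlt('<img src="a.png" data-alt="x">');

// False positive: the tag sits inside an HTML comment and is never rendered.
const commented = findImagesMissingAlt('<!-- <img src="a.png"> -->');

console.log({ quoted, dataAttr, commented });
```

Each of these cases can be patched with a more elaborate pattern, but every patch makes the regex harder to review, which is where the maintenance cost comes from.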

Because of all this, there’s a growing need for consistent HTML node handling. 

So, I’d like to propose introducing `cheerio` for HTML node handling.
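
As a rough idea of what this could look like, here is a small sketch of an "image without `alt`" check built on `cheerio`. The helper name `findImagesMissingAltWithCheerio` and the sample inputs are my own assumptions for illustration; this is not an existing `@eslint/markdown` API, and a real rule would still need to map results back to source locations.

```typescript
import * as cheerio from "cheerio";

// Hypothetical helper, for illustration only.
function findImagesMissingAltWithCheerio(html: string): string[] {
    // Passing `false` as the third argument parses the input as a fragment,
    // so cheerio does not wrap it in <html>, <head>, and <body>.
    const $ = cheerio.load(html, null, false);

    // The selector works on parsed elements and attributes, not on raw text.
    return $("img:not([alt])")
        .toArray()
        .map(element => $.html(element));
}

// ">" inside a quoted attribute value is handled by the parser.
const quoted = findImagesMissingAltWithCheerio('<img title="a > b" alt="ok">');

// `data-alt` is a different attribute from `alt`, so this image is reported.
const dataAttr = findImagesMissingAltWithCheerio('<img src="a.png" data-alt="x">');

// Content inside an HTML comment is a comment node, not an <img> element.
const commented = findImagesMissingAltWithCheerio('<!-- <img src="a.png"> -->');

console.log({ quoted, dataAttr, commented });
```

The point of the sketch is that quoting, attribute names, and comments are resolved by the HTML parser, so each rule only has to express what it is looking for.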

Here are some pros and cons:

**Pros:**  
- More robust code with fewer false positives/negatives  
- Consistent HTML node handling across the repository, reducing maintenance cost
- Safer from ReDoS attacks (we have already merged a PR about this)
- Less time spent reviewing and maintaining custom regex

**Cons:**  
- Adds a new dependency  
- Some learning curve for the library
- There’s an ongoing RFC for using `prelint` for HTML handling (though I’m not sure this covers all cases, since HTML is a standard part of CommonMark and we already have a lot of logic around it; a quicker solution may be needed, as `prelint` could take a while)

---

As for the roadmap, I don’t think the transition would be too costly:
- For ongoing issues/PRs: Authors could switch to `cheerio` instead of regex
- For others: I’m happy to take this on and refactor as needed

### What do you think is the correct solution?

I’d like to suggest using the [`cheerio`](https://cheerio.js.org/) library instead of manual regex for handling HTML, as it would provide a more robust and reliable approach.


### Participation

- [x] I am willing to submit a pull request for this change.

### Additional comments

If it would help make things more robust and reliable, I’m open to other suggestions. Let’s think about how we can handle HTML nodes in a consistent and safe way.

Change Request: Using `cheerio` library instead of manual regex for handling HTML #483
