Environment
ESLint version: HEAD
@eslint/markdown version: HEAD
Node version: 20.18.0
npm version: 10.9.2
Operating System: Windows 11
What problem do you want to solve?
@eslint/eslint-team
Hi team 😄
I’d like to suggest using the cheerio library instead of manual regex for handling HTML, as it would provide a more robust and reliable approach.
(Handling HTML directly may not be the main focus for the team, since the prelint feature is currently under RFC. However, HTML is a standard feature of CommonMark, and we already have a lot of logic built around it. That’s why I believe this change is necessary.)
While working on the @eslint/markdown repository, I’ve seen a lot of regex-related fixes. Since Markdown handles natural text, sometimes regex is unavoidable.
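However, lately many fixes have focused on false positives or negatives with HTML nodes, as seen in these issues and PRs:
- no-multiple-h1 and require-alt-text miss errors after a HTML comment is closed #464
- no-missing-link-fragment #465
- require-alt-text rule to ignore commented images #385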
Manual regex for HTML nodes worked fine initially, but as the project has grown and more users are adopting this plugin, more problems are cropping up. Some examples:
- False positives and negatives
- Inconsistent HTML node handling across rules
- Potential security issues (ReDoS)
- Increased maintenance cost for reviewing regex-related issues and writing robust patterns
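To make the first point concrete, here is a minimal sketch of how two plausible regex strategies can each go wrong around HTML comments. The helpers `imagesMissingAltRaw` and `imagesMissingAltSkipCommentNodes` are hypothetical and heavily simplified; they are assumptions for illustration, not code taken from this repository.

```javascript
// Hypothetical helpers for illustration only; not the actual @eslint/markdown code.

const IMG_TAG = /<img\b[^>]*>/gi;
const HAS_ALT = /\salt\s*=/i;

// Strategy A: scan the raw HTML and ignore comments entirely.
function imagesMissingAltRaw(html) {
    return (html.match(IMG_TAG) ?? []).filter(tag => !HAS_ALT.test(tag));
}

// Strategy B: skip the whole node when it begins with an HTML comment.
function imagesMissingAltSkipCommentNodes(html) {
    return html.trimStart().startsWith("<!--") ? [] : imagesMissingAltRaw(html);
}

const commentedOut = '<!-- <img src="old.png"> -->';
const afterComment = '<!-- note --><img src="new.png">';

// False positive: the image is commented out, yet strategy A still reports it.
console.log(imagesMissingAltRaw(commentedOut));

// False negative: the image follows a closed comment, yet strategy B reports nothing.
console.log(imagesMissingAltSkipCommentNodes(afterComment));
```

Each strategy can be patched with more regex, but every patch tends to open another edge case, which is the pattern I keep seeing in reviews.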
Because of all this, there’s a growing need for consistent HTML node handling.
So, I’d like to propose introducing cheerio for HTML node handling.
Here are some pros and cons:
Pros:
- More robust code with fewer false positives/negatives
- Consistent HTML node handling across the repository, reducing maintenance cost
- Safer from ReDoS attacks (we have already merged a PR about this)
- Less time spent reviewing and maintaining custom regex
Cons:
- Adds a new dependency
- Some learning curve for the library
- There’s an ongoing RFC for using prelint for HTML handling (though I’m not sure this covers all cases, since HTML is a standard part of CommonMark and we already have a lot of logic around it; a quicker solution may be needed, as prelint could take a while)
As for the roadmap, I don’t think the transition would be too costly:
- For ongoing issues/PRs: Authors could switch to cheerio instead of regex
- For others: I’m happy to take this on and refactor as needed
What do you think is the correct solution?
I’d like to suggest using the cheerio library instead of manual regex for handling HTML, as it would provide a more robust and reliable approach.
Participation
Additional comments
If it would help make things more robust and reliable, I’m open to other suggestions. Let’s think about how we can handle HTML nodes in a consistent and safe way.