feat: add htmlToHast for parsing HTML into HAST (#60) #140
Open
IEvangelist wants to merge 1 commit into
Adds an html5ever-backed parser that turns an HTML string into a HAST tree (elements, text, comments, doctype), mirroring hast-util-from-html in document mode. Exposed as `htmlToHast` from the npm package and `create_hast_handle_from_html` at the NAPI boundary.

The Rust parser lives behind an opt-in `from-html` feature on satteri-ast; the NAPI binding enables it by default (like `mdx`) so the full build ships it and lite (--no-default-features) builds drop it.

Refs bruits#60

Co-authored-by: Copilot App <223556219+Copilot@users.noreply.github.com>
Merging this PR will not alter performance
Summary
Prototype for #60: a built-in HTML → HAST parser, the Sätteri equivalent of `hast-util-from-html`/`rehype-raw`'s parsing step.

Adds `htmlToHast(html: string): HastNode` to the npm package (and `create_hast_handle_from_html` at the NAPI boundary). It parses an HTML string into a materialized HAST tree (elements, text, comments, and doctype) using `html5ever`'s spec-compliant tree builder, in document mode: the result is a `root` wrapping the implied `<html>` subtree, matching `hast-util-from-html`'s default.

Approach

Follows the parser recommendation from my benchmark writeup on the issue (#60 (comment)): `html5ever` driving a custom, arena-friendly `TreeSink` rather than pulling in `rcdom`.

- `crates/satteri-ast/src/hast/from_html.rs`: an index-addressed `TreeSink` (`Handle = usize` into a `Vec<Node>`) that mirrors rcdom's tree-mutation semantics (append + text coalescing, foster parenting, adoption-agency reparenting, `add_attrs_if_missing`, template contents). After parsing, the flat node list is walked once into an `ArenaBuilder<Hast>`, reusing the existing HAST codec so serialize/materialize/render all work unchanged.
- Attributes are kept verbatim, so `class`/`href`/etc. round-trip.

Feature gating

- `from-html` is opt-in on `satteri-ast` (not in its default features) so size-conscious consumers can drop `html5ever` entirely.
- The NAPI binding enables it by default (like `mdx`): the full native build ships it, and lite `--no-default-features` builds drop it.

Happy to flip this either way (e.g. make it fully opt-in on the binding too) depending on how you'd like to weigh the binary-size cost.

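For reviewers unfamiliar with the index-addressed design described under Approach, here is a minimal, std-only sketch of the idea. It is not the code in `from_html.rs` and does not implement html5ever's `TreeSink` trait; the `Sink`, `Node`, and `NodeKind` names and the `walk` output format are hypothetical, chosen only to show `usize` handles into a `Vec<Node>`, text coalescing on append, `add_attrs_if_missing`, and the single post-parse walk.

```rust
/// Index into `Sink::nodes`; stands in for the `Handle = usize` idea.
type Handle = usize;

#[derive(Debug, PartialEq)]
enum NodeKind {
    Root,
    Element { tag: String, attrs: Vec<(String, String)> },
    Text(String),
    Comment(String),
}

#[derive(Debug)]
struct Node {
    kind: NodeKind,
    parent: Option<Handle>,
    children: Vec<Handle>,
}

/// Hypothetical, simplified sink: every node lives in one flat `Vec`.
struct Sink {
    nodes: Vec<Node>,
}

impl Sink {
    /// Creates a sink whose node 0 is the document root.
    fn new() -> Self {
        Sink { nodes: vec![Node { kind: NodeKind::Root, parent: None, children: Vec::new() }] }
    }

    fn root(&self) -> Handle {
        0
    }

    /// Allocates a detached node and returns its index.
    fn create(&mut self, kind: NodeKind) -> Handle {
        self.nodes.push(Node { kind, parent: None, children: Vec::new() });
        self.nodes.len() - 1
    }

    /// Detaches `child` from its current parent, if it has one.
    fn remove_from_parent(&mut self, child: Handle) {
        if let Some(parent) = self.nodes[child].parent.take() {
            self.nodes[parent].children.retain(|&c| c != child);
        }
    }

    /// Appends `child` under `parent`, reparenting it if already attached.
    fn append(&mut self, parent: Handle, child: Handle) {
        self.remove_from_parent(child);
        self.nodes[child].parent = Some(parent);
        self.nodes[parent].children.push(child);
    }

    /// Appends text, merging into a trailing text node when one exists.
    fn append_text(&mut self, parent: Handle, text: &str) {
        let last = self.nodes[parent].children.last().copied();
        if let Some(last) = last {
            if let NodeKind::Text(existing) = &mut self.nodes[last].kind {
                existing.push_str(text);
                return;
            }
        }
        let node = self.create(NodeKind::Text(text.to_string()));
        self.append(parent, node);
    }

    /// Adds only the attributes whose names are not already present.
    fn add_attrs_if_missing(&mut self, target: Handle, attrs: Vec<(String, String)>) {
        if let NodeKind::Element { attrs: existing, .. } = &mut self.nodes[target].kind {
            for (name, value) in attrs {
                if !existing.iter().any(|(n, _)| *n == name) {
                    existing.push((name, value));
                }
            }
        }
    }

    /// Depth-first, pre-order walk; a stand-in for emitting into a builder.
    fn walk(&self, handle: Handle, out: &mut Vec<String>) {
        let node = &self.nodes[handle];
        out.push(match &node.kind {
            NodeKind::Root => "root".to_string(),
            NodeKind::Element { tag, .. } => format!("<{}>", tag),
            NodeKind::Text(t) => format!("text:{}", t),
            NodeKind::Comment(c) => format!("comment:{}", c),
        });
        for &child in &node.children {
            self.walk(child, out);
        }
    }
}
```

In this sketch, two consecutive `append_text` calls on the same parent produce a single text node, and reparenting is just an index move between two `children` vectors, with no reference counting involved.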
Tests
- `#[test]`s in `from_html.rs` covering document wrapping, attributes, text/comment/doctype emission, character references, raw-text elements, foster parenting, and adoption-agency reparenting.
- `packages/satteri/test/from-html.test.ts`: asserts the materialized tree shape (root → html/head/body, tag names, text/comment/doctype, verbatim properties) and a full round-trip through the `unified` + `rehype-stringify` ecosystem, which is the actual interop story from #60 (Add a built-in way to get HAST from HTML).

Known follow-ups (deliberately out of scope for this prototype)

- Properties are kept verbatim (`class`, not `className: [...]`; no boolean/number coercion). Full `property-information` handling (space/comma-separated lists, booleans, SVG casing) would be the natural next step for exact `hast-util-from-html` parity.
- `disabled` currently renders as `disabled=""`; SVG/xmlns prefixes aren't specialized yet.
- A fragment mode (`hast-util-from-html`'s `fragment: true`) could be added later.

A `minor` Sampo changeset is included.

Refs #60
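To make the property follow-up concrete, here is a rough sketch of what coercion could look like. Nothing like this is in the PR: `PropValue` and `coerce_property` are hypothetical names, and the hardcoded attribute list is a tiny stand-in for the much larger tables in `property-information`.

```rust
/// Hypothetical HAST property value; the prototype stores plain strings only.
#[derive(Debug, PartialEq)]
enum PropValue {
    Str(String),
    Bool(bool),
    List(Vec<String>),
}

/// Maps one verbatim HTML attribute to a HAST-style property.
/// Assumption: only a few attributes are special-cased for illustration.
fn coerce_property(name: &str, value: &str) -> (String, PropValue) {
    match name {
        // `class` becomes `className`, split on ASCII whitespace.
        "class" => (
            "className".to_string(),
            PropValue::List(value.split_ascii_whitespace().map(str::to_string).collect()),
        ),
        // Boolean attributes are true whenever they are present.
        "disabled" | "checked" | "hidden" => (name.to_string(), PropValue::Bool(true)),
        // Everything else stays verbatim, as in the current prototype.
        _ => (name.to_string(), PropValue::Str(value.to_string())),
    }
}
```

With a mapping like this, `disabled=""` would materialize as a boolean `true` rather than an empty string, which is the gap noted in the follow-ups.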