feat: add htmlToHast for parsing HTML into HAST (#60) #140

Open
IEvangelist wants to merge 1 commit into bruits:main from IEvangelist:dapine/hast-from-html
@IEvangelist

Summary

Prototype for #60 — a built-in HTML → HAST parser, the Sätteri equivalent of hast-util-from-html / rehype-raw's parsing step.

Adds htmlToHast(html: string): HastNode to the npm package (and create_hast_handle_from_html at the NAPI boundary). It parses an HTML string into a materialized HAST tree — elements, text, comments, and doctype — using html5ever's spec-compliant tree builder, in document mode (result is a root wrapping the implied <html> subtree, matching hast-util-from-html's default).

import { htmlToHast } from "satteri";

const tree = htmlToHast("<p>hi</p>");
// { type: "root", children: [{ type: "element", tagName: "html", ... }] }
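
Because the result is in document mode, callers who only care about the parsed content have to step through the implied html and body elements. The following is a minimal sketch of that walk, not part of this PR. The `SketchNode` type and the `bodyChildren` helper are hypothetical, and the literal tree is an assumed shape for `htmlToHast("<p>hi</p>")` based on the description above (root → html → head/body), not captured output.

```typescript
// Assumed, simplified node shape for illustration; the real HastNode type may differ.
type SketchNode =
  | { type: "root"; children: SketchNode[] }
  | { type: "element"; tagName: string; properties: Record<string, string>; children: SketchNode[] }
  | { type: "text"; value: string }
  | { type: "comment"; value: string }
  | { type: "doctype" };

// Hypothetical helper: return the children of <body> in a document-mode tree,
// or an empty array if the expected html/body wrappers are missing.
function bodyChildren(root: SketchNode): SketchNode[] {
  if (root.type !== "root") return [];
  const html = root.children.find((n) => n.type === "element" && n.tagName === "html");
  if (!html || html.type !== "element") return [];
  const body = html.children.find((n) => n.type === "element" && n.tagName === "body");
  if (!body || body.type !== "element") return [];
  return body.children;
}

// Hand-written stand-in for the parser output (assumed shape).
const doc: SketchNode = {
  type: "root",
  children: [
    {
      type: "element",
      tagName: "html",
      properties: {},
      children: [
        { type: "element", tagName: "head", properties: {}, children: [] },
        {
          type: "element",
          tagName: "body",
          properties: {},
          children: [
            { type: "element", tagName: "p", properties: {}, children: [{ type: "text", value: "hi" }] },
          ],
        },
      ],
    },
  ],
};

const content = bodyChildren(doc);
```

A fragment entry point would make a helper like this unnecessary; until then, something along these lines is one way to recover fragment-like content.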

Approach

Follows the parser recommendation from my benchmark writeup on the issue (#60 (comment)): html5ever driving a custom, arena-friendly TreeSink rather than pulling in rcdom.

  • crates/satteri-ast/src/hast/from_html.rs: an index-addressed TreeSink (Handle = usize into a Vec<Node>) that mirrors rcdom's tree-mutation semantics (append + text coalescing, foster parenting, adoption-agency reparenting, add_attrs_if_missing, template contents). After parsing, the flat node list is walked once into an ArenaBuilder<Hast>, reusing the existing HAST codec so serialize/materialize/render all work unchanged.
  • Attributes are stored verbatim as string properties; the existing renderer and materializer pass them straight through, so class/href/etc. round-trip.
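
To illustrate the index-addressed idea described above, here is a small TypeScript sketch. It is not the Rust code in from_html.rs: `FlatTree`, `FlatNode`, and `Nested` are hypothetical names, and only append, text coalescing, and the final walk are modelled (foster parenting, adoption-agency reparenting, and template contents are left out). Handles are plain indices into a flat array, mirroring `Handle = usize` into a `Vec<Node>`.

```typescript
// Hypothetical sketch; names and shapes are assumptions, not the PR's types.
type Handle = number; // index into FlatTree.nodes

type FlatNode =
  | { kind: "root"; children: Handle[] }
  | { kind: "element"; tagName: string; attrs: Record<string, string>; children: Handle[] }
  | { kind: "text"; value: string };

type Nested =
  | { type: "root"; children: Nested[] }
  | { type: "element"; tagName: string; properties: Record<string, string>; children: Nested[] }
  | { type: "text"; value: string };

class FlatTree {
  // Handle 0 is always the root.
  readonly nodes: FlatNode[] = [{ kind: "root", children: [] }];

  createElement(tagName: string, attrs: Record<string, string> = {}): Handle {
    this.nodes.push({ kind: "element", tagName, attrs, children: [] });
    return this.nodes.length - 1;
  }

  appendNode(parent: Handle, child: Handle): void {
    const p = this.nodes[parent];
    if (p.kind === "text") throw new Error("text nodes cannot have children");
    p.children.push(child);
  }

  // Text coalescing: if the parent's last child is already text, extend it
  // instead of allocating a second adjacent text node.
  appendText(parent: Handle, value: string): void {
    const p = this.nodes[parent];
    if (p.kind === "text") throw new Error("text nodes cannot have children");
    if (p.children.length > 0) {
      const prev = this.nodes[p.children[p.children.length - 1]];
      if (prev.kind === "text") {
        prev.value += value;
        return;
      }
    }
    this.nodes.push({ kind: "text", value });
    p.children.push(this.nodes.length - 1);
  }

  // Single walk from the flat list into a nested tree; attributes are
  // passed through verbatim as string properties.
  toNested(handle: Handle = 0): Nested {
    const node = this.nodes[handle];
    if (node.kind === "text") return { type: "text", value: node.value };
    const children = node.children.map((c) => this.toNested(c));
    return node.kind === "root"
      ? { type: "root", children }
      : { type: "element", tagName: node.tagName, properties: node.attrs, children };
  }
}

const flat = new FlatTree();
const para = flat.createElement("p", { class: "a b" });
flat.appendNode(0, para);
flat.appendText(para, "he");
flat.appendText(para, "llo"); // coalesces with the previous text node
const nested = flat.toNested();
```

Because nodes refer to each other by index rather than by reference-counted pointer, reparenting is a matter of editing child lists, which is presumably what makes this layout friendly to an arena builder.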

Feature gating

  • from-html is opt-in on satteri-ast (not in its default features) so size-conscious consumers can drop html5ever entirely.
  • The NAPI binding enables it by default (like mdx): the full native build ships it, and lite --no-default-features builds drop it.

Happy to flip this either way — e.g. make it fully opt-in on the binding too — depending on how you'd like to weigh the binary-size cost.

Tests

  • Unit (Rust): 11 #[test]s in from_html.rs covering document wrapping, attributes, text/comment/doctype emission, character references, raw-text elements, foster parenting, and adoption-agency reparenting.
  • End-to-end (vitest): packages/satteri/test/from-html.test.ts asserts the materialized tree shape (root → html/head/body, tag names, text/comment/doctype, verbatim properties) and a full round-trip through the unified + rehype-stringify ecosystem, which is the actual interop story from #60 ("Add a built-in way to get HAST from HTML").

Known follow-ups (deliberately out of scope for this prototype)

  • Property-information normalization — properties are kept verbatim (class, not className: [...]; no boolean/number coercion). Full property-information handling (space/comma-separated lists, booleans, SVG casing) would be the natural next step for exact hast-util-from-html parity.
  • Boolean / namespaced attributes — e.g. disabled currently renders as disabled=""; SVG/xmlns prefixes aren't specialized yet.
  • Fragment mode — only document mode is implemented; a fragment entry point (hast-util-from-html's fragment: true) could be added later.
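
For a sense of what the property-normalization follow-up involves, here is a deliberately tiny sketch of a post-processing step over verbatim properties. It is an assumption-laden illustration, not the planned implementation: `normalizeProps` and `BOOLEAN_ATTRS` are hypothetical, the boolean list is a hand-picked subset, and real parity would need the full property-information tables (comma-separated lists, number coercion, SVG casing, and so on).

```typescript
// Hypothetical sketch of normalizing verbatim attributes toward hast-style properties.
type VerbatimProps = Record<string, string>;
type NormalizedProps = Record<string, string | boolean | string[]>;

// Hand-picked subset for illustration only; not an exhaustive list.
const BOOLEAN_ATTRS = new Set(["disabled", "checked", "required"]);

function normalizeProps(props: VerbatimProps): NormalizedProps {
  const out: NormalizedProps = {};
  for (const [name, value] of Object.entries(props)) {
    if (name === "class") {
      // class="a b" becomes className: ["a", "b"]
      out.className = value.split(/\s+/).filter((token) => token.length > 0);
    } else if (BOOLEAN_ATTRS.has(name)) {
      // Presence of a boolean attribute means true, whatever its string value.
      out[name] = true;
    } else {
      // Everything else stays verbatim in this sketch.
      out[name] = value;
    }
  }
  return out;
}

const normalized = normalizeProps({ class: "a  b", disabled: "", href: "/x" });
```

With properties normalized this way, a renderer can emit a bare `disabled` instead of `disabled=""`, which ties the first two follow-ups together.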

A minor Sampo changeset is included.

Refs #60

Adds an html5ever-backed parser that turns an HTML string into a HAST tree (elements, text, comments, doctype), mirroring hast-util-from-html in document mode. Exposed as `htmlToHast` from the npm package and `create_hast_handle_from_html` at the NAPI boundary.

The Rust parser lives behind an opt-in `from-html` feature on satteri-ast; the NAPI binding enables it by default (like `mdx`) so the full build ships it and lite (--no-default-features) builds drop it.

Refs bruits#60

Co-authored-by: Copilot App <223556219+Copilot@users.noreply.github.com>
@Princesseuh added this to the 0.10.0 milestone Jul 1, 2026
@codspeed-hq

codspeed-hq Bot commented Jul 1, 2026

Merging this PR will not alter performance

✅ 24 untouched benchmarks
⏩ 8 skipped benchmarks [1]

Comparing IEvangelist:dapine/hast-from-html (608bf13) with main (ef3974e)

Footnotes

  1. 8 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

@IEvangelist marked this pull request as ready for review July 1, 2026 20:39