html Documentation

This is the canonical manual for usage, API, selector behavior, performance workflow, conformance expectations, and internals.

Requirements

  • Zig 0.16.0
  • Mutable input buffers ([]u8) for destructive parsing
  • []const u8 inputs are supported when ParseOptions.non_destructive = true

Quick Start

const std = @import("std");
const html = @import("html");
const options: html.ParseOptions = .{};

test "basic parse + query" {
    var input = "<div id='app'><a class='nav' href='/docs'>Docs</a></div>".*;
    var doc = try options.parse(std.testing.allocator, &input);
    defer doc.deinit();

    var links = doc.query("div#app > a.nav");
    const a = links.next() orelse return error.TestUnexpectedResult;
    const href = (try a.getAttributeValue(std.testing.allocator, "href")) orelse return error.TestUnexpectedResult;
    defer href.free(&doc, std.testing.allocator);
    try std.testing.expectEqualStrings("/docs", href.value);
}

Source examples:

  • examples/basic_parse_query.zig
  • examples/query_time_decode.zig

All examples are verified by running zig build examples-check.

Core API

Parse and document lifecycle

  • const opts: ParseOptions = .{};
  • var doc = try opts.parse(allocator, input);
  • doc.deinit()
  • doc.clear()
  • destructive options accept mutable input and parse it in place
  • non-destructive options accept read-only input and parse directly from the original bytes
  • maximum parseable input size is controlled at build time with -Dintlen

Query APIs

  • Compile-time selectors:
    • var it = doc.query(comptime selector); it.next()
  • Runtime selectors:
    • var it = doc.queryRuntime(compiled_selector); it.next()
    • compiled_selector is created via try Selector.compileRuntime(allocator, source)
  • Cached runtime selectors:
    • compile once with Selector.compileRuntime and pass the same compiled selector to every doc.queryRuntime call

Node APIs

  • Navigation:
    • tagName()
    • parentNode()
    • nextSibling()
    • prevSibling()
    • children() (iterator of wrapped child nodes; collect(allocator) returns an owned []Node)
    • children().last() only when ParseOptions.store_last_child = true
  • Text:
    • innerTextWithOptions(gpa, TextOptions) returns TextResult
    • TextResult.value
    • TextResult.free(doc, gpa)
    • innerTextOwnedWithOptions(gpa, TextOptions) always allocates
  • Attributes:
    • getAttributeValue(gpa, name) returns !?AttributeValueResult
    • AttributeValueResult.value
    • AttributeValueResult.free(doc, gpa)
    • getAttributeValueRaw(name) returns the current raw value bytes; destructive documents may expose bytes mutated by prior decoded lookups
  • Scoped queries:
    • same iterator-first query family as Document (query and queryRuntime)

Helpers

  • doc.html(), doc.head(), doc.body()
  • TextResult.isBorrowed(doc) to check whether text points into document source bytes

Parse/Text options

  • ParseOptions
    • drop_whitespace_text_nodes: bool = true
    • non_destructive: bool = false
  • build option:
    • -Dintlen=u16|u32|u64|usize
    • controls the integer width used for source spans and node indexes
    • too-small widths fail fast with error.InputTooLarge
  • TextOptions
    • normalize_whitespace: bool = true
    • unescape: bool = true
  • parse/query work split:
    • parse keeps raw text and attribute spans as source slices
    • destructive mode may decode attrs/text in place on query-time APIs
    • non-destructive mode keeps attrs/text read-only and materializes decoded output only when needed
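
To make the destructive/non-destructive split concrete, here is a minimal sketch of what "decode in place" means. It is not the library's decoder: decodeAmpInPlace is a hypothetical helper that handles only &amp;, and it exists purely to show why a destructive lookup leaves the caller's buffer changed.

```zig
const std = @import("std");

// Hypothetical helper (not part of html): rewrites every "&amp;" in `buf`
// to "&" in place and returns the shortened prefix of the same buffer.
fn decodeAmpInPlace(buf: []u8) []u8 {
    var read: usize = 0;
    var write: usize = 0;
    while (read < buf.len) {
        if (std.mem.startsWith(u8, buf[read..], "&amp;")) {
            buf[write] = '&';
            read += 5; // skip the whole entity
        } else {
            buf[write] = buf[read];
            read += 1;
        }
        write += 1;
    }
    return buf[0..write];
}

test "in-place decode mutates the caller buffer" {
    var input = "a&amp;b".*;
    const decoded = decodeAmpInPlace(&input);
    try std.testing.expectEqualStrings("a&b", decoded);
    // The source bytes were overwritten; the tail keeps stale bytes.
    try std.testing.expectEqualStrings("a&bmp;b", &input);
}
```

A non-destructive path would write the decoded bytes somewhere other than the source buffer, which is the tradeoff described in the list above.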

Design Notes

  • destructive parsing is the default because the parser and lazy decode paths mutate source bytes in place for throughput
  • non-destructive parsing avoids a full-source copy and instead moves lazy attr/text decoding out of the input buffer
  • nodes are stored in one contiguous array and linked by indexes rather than pointers to keep traversal cache-friendly and make -Dintlen effective
  • attribute storage stays span-based instead of building heap objects so parse cost scales with actual queries, not attribute count
  • query-time decoding keeps parse throughput high by avoiding eager entity decode and whitespace normalization for bytes that may never be read
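
The index-linked node layout can be sketched as follows. This is an illustration under stated assumptions, not the library's actual Node definition: the Index alias, the none sentinel, the field names, and checkedLen are all hypothetical. It shows how a narrower index type shrinks every link and why an oversized input has to be rejected up front.

```zig
const std = @import("std");

// Hypothetical index width; the real library selects it with -Dintlen.
const Index = u16;
// Sentinel meaning "no node" (an assumption of this sketch).
const none: Index = std.math.maxInt(Index);

const Node = struct {
    tag: []const u8,
    parent: Index = none,
    first_child: Index = none,
    next_sibling: Index = none,
};

// Count the direct children of `parent` by following index links.
fn countChildren(nodes: []const Node, parent: Index) usize {
    var count: usize = 0;
    var i = nodes[parent].first_child;
    while (i != none) : (i = nodes[i].next_sibling) count += 1;
    return count;
}

// Reject lengths that the chosen index type cannot represent.
fn checkedLen(len: usize) error{InputTooLarge}!Index {
    const narrowed = std.math.cast(Index, len) orelse return error.InputTooLarge;
    return narrowed;
}

test "index-linked nodes" {
    const nodes = [_]Node{
        .{ .tag = "div", .first_child = 1 },
        .{ .tag = "a", .parent = 0, .next_sibling = 2 },
        .{ .tag = "b", .parent = 0 },
    };
    try std.testing.expectEqual(@as(usize, 2), countChildren(&nodes, 0));
    try std.testing.expectError(error.InputTooLarge, checkedLen(70_000));
}
```

Because links are plain integers into one array, traversal touches contiguous memory and the per-node cost scales directly with the width of Index.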

Non-Destructive Parsing

Set ParseOptions.non_destructive = true when the caller's bytes must remain unchanged.

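const opts: html.ParseOptions = .{ .non_destructive = true };

test "non-destructive parse" {
    // Read-only input: the parser borrows these bytes and never rewrites them.
    const html_bytes = "<div id='x' data-v='a&amp;b'> hi &amp; bye </div>";
    var doc = try opts.parse(std.testing.allocator, html_bytes);
    defer doc.deinit();
}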

Behavior:

  • the default destructive path is unchanged and still parses caller memory directly
  • non-destructive mode does not allocate or rewrite a full source copy
  • lazy attribute reads never rewrite the source buffer
  • lazy text reads never rewrite the source buffer
  • text extraction allocates only when decoding or normalization requires materialized output
  • Document.writeHtml and Document.format return the exact original source bytes in non-destructive mode
  • node-level formatting still serializes from parsed state rather than replaying original source slices
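
The "allocate only when needed" behavior can be pictured with a borrowed-or-owned result. The sketch below is an assumption-laden illustration, not the library's TextResult: the Text struct, its owned flag, and normalizeSpaces are hypothetical names, and only space collapsing is modeled.

```zig
const std = @import("std");

// Hypothetical result type: either a view into the source or an owned copy.
const Text = struct {
    value: []const u8,
    owned: bool,

    fn free(self: Text, gpa: std.mem.Allocator) void {
        if (self.owned) gpa.free(self.value);
    }
};

// True when byte `i` is a space that directly follows another space.
fn isRepeatedSpace(s: []const u8, i: usize) bool {
    return s[i] == ' ' and i > 0 and s[i - 1] == ' ';
}

// Trims and collapses runs of spaces. Borrows from `src` when trimming alone
// is enough; allocates only when interior runs must be rewritten.
fn normalizeSpaces(gpa: std.mem.Allocator, src: []const u8) !Text {
    const trimmed = std.mem.trim(u8, src, " ");
    if (std.mem.indexOf(u8, trimmed, "  ") == null) {
        return .{ .value = trimmed, .owned = false };
    }
    var len: usize = 0;
    for (0..trimmed.len) |i| {
        if (!isRepeatedSpace(trimmed, i)) len += 1;
    }
    const out = try gpa.alloc(u8, len);
    var n: usize = 0;
    for (trimmed, 0..) |c, i| {
        if (isRepeatedSpace(trimmed, i)) continue;
        out[n] = c;
        n += 1;
    }
    return .{ .value = out, .owned = true };
}

test "borrow when possible, allocate when required" {
    const gpa = std.testing.allocator;
    const a = try normalizeSpaces(gpa, " hi ");
    defer a.free(gpa);
    try std.testing.expect(!a.owned);
    const b = try normalizeSpaces(gpa, "hi   bye");
    defer b.free(gpa);
    try std.testing.expect(b.owned);
    try std.testing.expectEqualStrings("hi bye", b.value);
}
```

Calling free unconditionally keeps call sites uniform: the caller does not need to know which path produced the value.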

Use cases:

  • parsing file-backed memory maps
  • preserving original bytes for hashing, diffing, or cache keys
  • running parser queries without allowing in-place mutation of shared buffers

Instrumentation wrappers

  • queryWithHooks(doc, comptime_selector, hooks)
  • queryRuntimeWithHooks(doc, compiled_selector, hooks)

Selector Support

Supported selectors:

  • tag selectors and universal *
  • #id, .class
  • attributes:
    • [a], [a=v], [a^=v], [a$=v], [a*=v], [a~=v], [a|=v]
  • combinators:
    • descendant (a b)
    • child (a > b)
    • adjacent sibling (a + b)
    • general sibling (a ~ b)
  • grouping: a, b, c
  • pseudo-classes:
    • :first-child
    • :last-child
    • :nth-child(An+B) with odd/even and forms like 3n+1, +3n-2, -n+6
    • :not(...) (simple selector payload)
  • parser guardrails:
    • multiple #id predicates in one compound (for example #a#b) are rejected as invalid
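
For readers unfamiliar with the An+B notation, the sketch below shows the standard CSS rule: an element at 1-based sibling position index matches when index = A*n + B for some integer n >= 0. The function name matchesNth is hypothetical, and this is not necessarily how the library's matcher is written.

```zig
const std = @import("std");

// Returns true when `index` (1-based sibling position) equals a*n + b
// for some integer n >= 0.
fn matchesNth(a: i32, b: i32, index: i32) bool {
    const diff = index - b;
    if (a == 0) return diff == 0;
    // n = diff / a must be non-negative, so diff is zero or shares a's sign.
    if (diff != 0 and (diff < 0) != (a < 0)) return false;
    // n must also be an integer, so |a| has to divide |diff|.
    return @abs(diff) % @abs(a) == 0;
}

test "An+B forms" {
    // odd is 2n+1
    try std.testing.expect(matchesNth(2, 1, 1));
    try std.testing.expect(!matchesNth(2, 1, 2));
    // +3n-2 matches positions 1, 4, 7, ...
    try std.testing.expect(matchesNth(3, -2, 4));
    // -n+6 matches the first six children only
    try std.testing.expect(matchesNth(-1, 6, 6));
    try std.testing.expect(!matchesNth(-1, 6, 7));
}
```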

Compilation modes:

  • comptime selectors fail at compile time when invalid
  • runtime selectors return error.InvalidSelector

Mode Guidance

html is permissive by design. Choose the document type by workload:

| Mode | Parse options | Best for | Tradeoffs |
| --- | --- | --- | --- |
| strictest | const opts = html.ParseOptions{ .drop_whitespace_text_nodes = false }; | traversal predictability and text fidelity | keeps whitespace-only text nodes |
| fastest | const opts = html.ParseOptions{}; | throughput-first scraping | whitespace-only text nodes dropped; raw node metadata is compact |
| non-destructive | const opts = html.ParseOptions{ .non_destructive = true }; | preserving input bytes, memory maps, exact whole-document formatting | decoded attrs/text are materialized outside the source buffer |
| full metadata | const opts = html.ParseOptions{ .store_last_child = true, .store_prev_sibling = true }; | O(1) children().last() and previous-sibling traversal | two extra persisted node indexes |

Fallback playbook:

  1. Start with fastest for bulk workloads.
  2. Move unstable domains to strictest.
  3. Compile runtime selectors once and reuse the compiled selector across queryRuntime calls for repeated queries.

Performance and Benchmarks

Run benchmarks:

zig build bench-compare
zig build tools -- run-benchmarks --profile quick
zig build tools -- run-benchmarks --profile stable

Artifacts:

  • bench/results/latest.md
  • bench/results/latest.json

Benchmark policy:

  • parse comparisons include strlen, lexbor, and parse-only lol-html
  • the query parse, match, and cached sections benchmark this library (html)
  • repeated runtime selector workloads should use cached selectors

Latest Benchmark Snapshot

Warning: throughput numbers are not conformance claims. This parser is permissive by design; see Conformance Status.

Source: bench/results/latest.json (stable profile).

Parse Throughput Comparison (MB/s)

| Fixture | ours-compact | ours-full | lol-html |
| --- | --- | --- | --- |
| rust-lang.html | 2559.69 | 2307.87 | 1463.58 |
| wiki-html.html | 1908.49 | 1950.56 | 1246.95 |
| mdn-html.html | 3089.15 | 3000.33 | 1803.77 |
| w3-html52.html | 1291.87 | 1321.23 | 688.85 |
| hn.html | 1583.62 | 1595.97 | 859.50 |
| python-org.html | 2062.70 | 1910.20 | 1319.32 |
| kernel-org.html | 1953.55 | 1802.75 | 1269.25 |
| gnu-org.html | 2501.68 | 2416.61 | 1471.70 |
| ziglang-org.html | 2009.32 | 1875.55 | 1209.16 |
| ziglang-doc-master.html | 1431.35 | 1388.04 | 1052.82 |
| wikipedia-unicode-list.html | 1805.46 | 1823.80 | 1117.75 |
| whatwg-html-spec.html | 1381.78 | 1356.35 | 923.50 |
| synthetic-forms.html | 1323.92 | 1298.26 | 723.53 |
| synthetic-table-grid.html | 1228.26 | 1221.23 | 716.57 |
| synthetic-list-nested.html | 1357.69 | 1312.12 | 667.09 |
| synthetic-comments-doctype.html | 2304.51 | 2144.48 | 918.52 |
| synthetic-template-rich.html | 1014.82 | 1051.87 | 485.01 |
| synthetic-whitespace-noise.html | 1716.97 | 1540.53 | 1021.84 |
| synthetic-news-feed.html | 1408.10 | 1309.01 | 645.40 |
| synthetic-ecommerce.html | 1246.06 | 1216.35 | 635.49 |
| synthetic-forum-thread.html | 1295.57 | 1251.38 | 622.70 |

Query Match Throughput

| Case | compact ops/s | compact ns/op | full ops/s | full ns/op |
| --- | --- | --- | --- | --- |
| attr-heavy-button | 167952.79 | 5954.05 | 147300.14 | 6788.86 |
| attr-heavy-nav | 99400.15 | 10060.35 | 110758.09 | 9028.69 |

Cached Query Throughput

| Case | compact ops/s | compact ns/op | full ops/s | full ns/op |
| --- | --- | --- | --- | --- |
| attr-heavy-button | 180998.57 | 5524.91 | 160690.48 | 6223.14 |
| attr-heavy-nav | 97343.93 | 10272.85 | 95827.84 | 10435.38 |

Query Parse Throughput (ours)

| Selector case | ops/s | ns/op |
| --- | --- | --- |
| simple | 9792520.96 | 102.12 |
| complex | 5450445.38 | 183.47 |
| grouped | 7317061.48 | 136.67 |
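
The ops/s and ns/op columns in the query tables are two views of the same measurement: ns/op is 10^9 divided by ops/s. A small sketch, with the hypothetical helper name nsPerOp, for converting between them when comparing against tools that report only one unit:

```zig
const std = @import("std");

// Convert operations per second to nanoseconds per operation.
fn nsPerOp(ops_per_sec: f64) f64 {
    return 1_000_000_000.0 / ops_per_sec;
}

test "ops/s and ns/op are reciprocal" {
    // attr-heavy-button, compact, from the Query Match Throughput table.
    try std.testing.expectApproxEqAbs(@as(f64, 5954.05), nsPerOp(167952.79), 0.01);
}
```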

For full per-parser, per-fixture tables and gate output:

  • bench/results/latest.md
  • bench/results/latest.json

Conformance Status

Run conformance suites:

zig build conformance
# or
zig build tools -- run-external-suites --mode both

Artifact: bench/results/external_suite_report.json

Tracked suites:

  • selector suites: nwmatcher, qwery_contextual
  • parser suites:
    • html5lib tree-construction subset
    • WHATWG HTML parsing corpus (via WPT html/syntax/parsing/html5lib_*.html)

Fetched suite repos are cached under bench/.cache/suites/ (gitignored).

Architecture

Core modules:

  • src/html/parser.zig: permissive parse pipeline
  • src/html/scanner.zig: byte-scanning hot-path helpers
  • src/html/tags.zig: tag metadata and hash dispatch
  • src/html/attr.zig: attribute scanning, lazy materialization, and decode helpers
  • src/html/entities.zig: entity decode utilities
  • src/selector/runtime.zig, src/selector/compile_time.zig: selector parsing
  • src/selector/matcher.zig: selector matching/combinator traversal

Data model highlights:

  • Document always owns node/index storage and may either parse a mutable caller buffer in place or borrow a read-only caller buffer unchanged
  • parser-only construction state stays in src/html/parser.zig; Document retains only post-parse/query state
  • nodes are contiguous and linked by indexes for traversal
  • attributes are traversed directly from source spans (no heap attribute objects)
  • the build-time -Dintlen option widens or shrinks those spans and indexes uniformly
  • destructive mode is the performance baseline; non-destructive mode exists as an opt-in isolation boundary

Troubleshooting

Query returns nothing

  • validate selector syntax with Selector.compileRuntime(allocator, source)
  • check scope (Document vs scoped Node)

Unexpected innerText

  • default innerText normalizes whitespace
  • use innerTextWithOptions(..., .{ .normalize_whitespace = false }) for raw spacing
  • use innerTextWithOptions(..., .{ .unescape = false }) to preserve entity escapes
  • use innerTextOwnedWithOptions(...) when output must always be allocated
  • call TextResult.free(doc, gpa) on results from innerTextWithOptions, which may borrow from the document or own an allocation

Runtime iterator invalidation

Runtime selector memory must outlive any iterator returned by queryRuntime.

Input buffer changed

Expected in destructive mode (the default): parse and lazy decode paths mutate source bytes in place.

If the bytes must not change, parse with ParseOptions.non_destructive = true.