This is the canonical manual for usage, API, selector behavior, performance workflow, conformance expectations, and internals.
- Requirements
- Quick Start
- Core API
- Non-Destructive Parsing
- Selector Support
- Mode Guidance
- Performance and Benchmarks
- Latest Benchmark Snapshot
- Conformance Status
- Architecture
- Troubleshooting
- Zig `0.16.0`
- Mutable input buffers (`[]u8`) for destructive parsing
- `[]const u8` inputs are supported when `ParseOptions.non_destructive = true`
const std = @import("std");
const html = @import("html");
const options: html.ParseOptions = .{};
test "basic parse + query" {
    var input = "<div id='app'><a class='nav' href='/docs'>Docs</a></div>".*;
    var doc = try options.parse(std.testing.allocator, &input);
    defer doc.deinit();
    var links = doc.query("div#app > a.nav");
    const a = links.next() orelse return error.TestUnexpectedResult;
    const href = (try a.getAttributeValue(std.testing.allocator, "href")) orelse return error.TestUnexpectedResult;
    defer href.free(&doc, std.testing.allocator);
    try std.testing.expectEqualStrings("/docs", href.value);
}

Source examples:
- `examples/basic_parse_query.zig`
- `examples/query_time_decode.zig`

All examples are verified by running `zig build examples-check`.
- `const opts: ParseOptions = .{};`
- `var doc = try opts.parse(allocator, input);`
- `doc.deinit()`
- `doc.clear()`
- destructive options accept mutable input and parse it in place
- non-destructive options accept read-only input and parse directly from the original bytes
- maximum parseable input size is controlled at build time with `-Dintlen`
- Compile-time selectors:
  - `var it = doc.query(comptime selector); it.next()`
  - `doc.query(comptime selector)`
- Runtime selectors:
  - `var it = doc.queryRuntime(compiled_selector); it.next()`
  - `doc.queryRuntime(compiled_selector)`
- Cached runtime selectors:
  - `var it = doc.queryRuntime(selector); it.next()`
  - `doc.queryRuntime(selector)`
  - selector created via `try Selector.compileRuntime(allocator, source)`
- Navigation:
  - `tagName()`
  - `parentNode()`
  - `nextSibling()`
  - `prevSibling()`
  - `children()` (iterator of wrapped child nodes; `collect(allocator)` returns an owned `[]Node`)
  - `children().last()` only when `ParseOptions.store_last_child = true`
- Text:
  - `innerTextWithOptions(gpa, TextOptions)` returns `TextResult`
  - `TextResult.value`
  - `TextResult.free(doc, gpa)`
  - `innerTextOwnedWithOptions(gpa, TextOptions)` always allocates
- Attributes:
  - `getAttributeValue(gpa, name)` returns `!?AttributeValueResult`
  - `AttributeValueResult.value`
  - `AttributeValueResult.free(doc, gpa)`
  - `getAttributeValueRaw(name)` returns the current raw value bytes; destructive documents may expose bytes mutated by prior decoded lookups
- Scoped queries:
  - same iterator-first query family as `Document` (`query` and `queryRuntime`)
- `doc.html()`, `doc.head()`, `doc.body()`
- `TextResult.isBorrowed(doc)` to check whether text points into document source bytes
- `ParseOptions`
  - `drop_whitespace_text_nodes: bool = true`
  - `non_destructive: bool = false`
- build option:
  - `-Dintlen=u16|u32|u64|usize`
  - controls the integer width used for source spans and node indexes
  - too-small widths fail fast with `error.InputTooLarge`
- `TextOptions`
  - `normalize_whitespace: bool = true`
  - `unescape: bool = true`
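
The fail-fast behavior of a too-small `-Dintlen` width can be pictured as a checked narrowing of the input length. The following is only a sketch of that idea: `spanLen` is a hypothetical helper, not part of the library, and the real parser may reserve sentinel values or check at a different point.

```zig
const std = @import("std");

/// Hypothetical helper (not the library's API): narrow an input length to
/// the configured span/index integer type, failing when it does not fit.
fn spanLen(comptime Int: type, len: usize) error{InputTooLarge}!Int {
    // std.math.cast returns null when the value is out of range for Int.
    const narrowed = std.math.cast(Int, len) orelse return error.InputTooLarge;
    return narrowed;
}

test "u16 spans reject inputs longer than 65535 bytes" {
    try std.testing.expectEqual(@as(u16, 100), try spanLen(u16, 100));
    try std.testing.expectError(error.InputTooLarge, spanLen(u16, 70_000));
    // The same length fits once the width is raised to u32.
    try std.testing.expectEqual(@as(u32, 70_000), try spanLen(u32, 70_000));
}
```

Under this reading, a narrower width saves memory per span and per node index but limits the largest document that can be parsed.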
- parse/query work split:
  - parse keeps raw text and attribute spans as source slices
  - destructive mode may decode attrs/text in place on query-time APIs
  - non-destructive mode keeps attrs/text read-only and materializes decoded output only when needed
- destructive parsing is the default because the parser and lazy decode paths mutate source bytes in place for throughput
- non-destructive parsing avoids a full-source copy and instead moves lazy attr/text decoding out of the input buffer
- nodes are stored in one contiguous array and linked by indexes rather than pointers to keep traversal cache-friendly and make `-Dintlen` effective
- attribute storage stays span-based instead of building heap objects so parse cost scales with actual queries, not attribute count
- query-time decoding keeps parse throughput high by avoiding eager entity decode and whitespace normalization for bytes that may never be read
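
To make the "decode in place" point concrete, here is a minimal sketch of why a destructive lazy decode changes the caller's bytes. `decodeAmpInPlace` is a hypothetical stand-in that handles only `&amp;`; the library's actual entity decoder is not shown in this manual and almost certainly covers far more cases.

```zig
const std = @import("std");

/// Hypothetical helper: rewrites `&amp;` to `&` inside the caller's buffer
/// and returns the (possibly shorter) decoded prefix. No allocation occurs
/// because decoded output is never longer than the encoded input.
fn decodeAmpInPlace(buf: []u8) []u8 {
    var read: usize = 0;
    var write: usize = 0;
    while (read < buf.len) {
        if (std.mem.startsWith(u8, buf[read..], "&amp;")) {
            buf[write] = '&';
            read += "&amp;".len;
        } else {
            buf[write] = buf[read];
            read += 1;
        }
        write += 1;
    }
    return buf[0..write];
}

test "in-place decode overwrites the source bytes" {
    var input = "a&amp;b".*;
    const decoded = decodeAmpInPlace(&input);
    try std.testing.expectEqualStrings("a&b", decoded);
    // The original buffer no longer holds the escaped text.
    try std.testing.expectEqualStrings("a&b", input[0..3]);
}
```

A non-destructive design would instead write the decoded bytes to separately allocated memory, which is the tradeoff the list above describes.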
Use a non-destructive document type when the caller bytes must remain unchanged.
const opts: html.ParseOptions = .{ .non_destructive = true };
const html_bytes = "<div id='x' data-v='a&b'> hi & bye </div>";
var doc = try opts.parse(std.testing.allocator, html_bytes);
defer doc.deinit();

Behavior:
- the default destructive path is unchanged and still parses caller memory directly
- non-destructive mode does not allocate or rewrite a full source copy
- lazy attribute reads never rewrite the source buffer
- lazy text reads never rewrite the source buffer
- text extraction allocates only when decoding or normalization requires materialized output
- `Document.writeHtml` and `Document.format` return the exact original source bytes in non-destructive mode
- node-level formatting still serializes from parsed state rather than replaying original source slices
Use cases:
- parsing file-backed memory maps
- preserving original bytes for hashing, diffing, or cache keys
- running parser queries without allowing in-place mutation of shared buffers
- `queryWithHooks(doc, comptime_selector, hooks)`
- `queryRuntimeWithHooks(doc, compiled_selector, hooks)`
Supported selectors:
- tag selectors and universal `*`
- `#id`, `.class`
- attributes: `[a]`, `[a=v]`, `[a^=v]`, `[a$=v]`, `[a*=v]`, `[a~=v]`, `[a|=v]`
- combinators:
  - descendant (`a b`)
  - child (`a > b`)
  - adjacent sibling (`a + b`)
  - general sibling (`a ~ b`)
- grouping: `a, b, c`
- pseudo-classes:
  - `:first-child`
  - `:last-child`
  - `:nth-child(An+B)` with `odd`/`even` and forms like `3n+1`, `+3n-2`, `-n+6`
  - `:not(...)` (simple selector payload)
- parser guardrails:
  - multiple `#id` predicates in one compound (for example `#a#b`) are rejected as invalid
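
For readers unfamiliar with the `An+B` notation listed above, the sketch below shows the standard CSS rule: a 1-based child position matches when it equals `A*n + B` for some integer `n >= 0`. `matchesNth` is a hypothetical function written for illustration; it is not taken from the library's matcher.

```zig
const std = @import("std");

/// Hypothetical illustration of the CSS An+B rule.
/// `index` is the 1-based position of the element among its siblings.
fn matchesNth(a: i32, b: i32, index: i32) bool {
    // With A == 0 the formula selects exactly one position.
    if (a == 0) return index == b;
    const diff = index - b;
    // index must be reachable from B in whole steps of A ...
    if (@rem(diff, a) != 0) return false;
    // ... and the step count n must not be negative.
    return @divTrunc(diff, a) >= 0;
}

test "An+B forms from the selector list" {
    // odd is 2n+1, even is 2n
    try std.testing.expect(matchesNth(2, 1, 3));
    try std.testing.expect(!matchesNth(2, 0, 3));
    // 3n+1 selects 1, 4, 7, ...
    try std.testing.expect(matchesNth(3, 1, 4));
    // +3n-2 also selects 1, 4, 7, ...
    try std.testing.expect(matchesNth(3, -2, 1));
    // -n+6 selects only the first six children
    try std.testing.expect(matchesNth(-1, 6, 6));
    try std.testing.expect(!matchesNth(-1, 6, 7));
}
```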
Compilation modes:
- comptime selectors fail at compile time when invalid
- runtime selectors return `error.InvalidSelector`
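
The attribute operators in the supported-selector list follow well-known CSS meanings. The sketch below spells them out using standard CSS semantics; `AttrOp` and `attrMatches` are hypothetical names, and whether the library treats edge cases (empty values, case sensitivity) identically is an assumption not confirmed by this manual.

```zig
const std = @import("std");

/// Hypothetical operator tags for [a], [a=v], [a^=v], [a$=v], [a*=v], [a~=v], [a|=v].
const AttrOp = enum { exists, equals, prefix, suffix, contains, word, dash_prefix };

/// Returns true when `needle` appears as a whitespace-separated word in `value`.
fn hasWord(value: []const u8, needle: []const u8) bool {
    if (needle.len == 0) return false;
    var it = std.mem.tokenizeAny(u8, value, " \t\n\r\x0c");
    while (it.next()) |word| {
        if (std.mem.eql(u8, word, needle)) return true;
    }
    return false;
}

/// Case-sensitive match of an attribute value against a selector operand.
fn attrMatches(op: AttrOp, value: []const u8, needle: []const u8) bool {
    return switch (op) {
        .exists => true,
        .equals => std.mem.eql(u8, value, needle),
        .prefix => needle.len > 0 and std.mem.startsWith(u8, value, needle),
        .suffix => needle.len > 0 and std.mem.endsWith(u8, value, needle),
        .contains => needle.len > 0 and std.mem.indexOf(u8, value, needle) != null,
        .word => hasWord(value, needle),
        // Exact match, or the operand followed immediately by '-'.
        .dash_prefix => std.mem.eql(u8, value, needle) or
            (std.mem.startsWith(u8, value, needle) and
                value.len > needle.len and value[needle.len] == '-'),
    };
}

test "attribute operators" {
    try std.testing.expect(attrMatches(.prefix, "/docs/api", "/docs"));
    try std.testing.expect(attrMatches(.word, "nav primary", "primary"));
    try std.testing.expect(!attrMatches(.word, "navprimary", "primary"));
    try std.testing.expect(attrMatches(.dash_prefix, "en-US", "en"));
    try std.testing.expect(!attrMatches(.dash_prefix, "english", "en"));
}
```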
html is permissive by design. Choose the document type by workload:
| Mode | Parse Options | Best For | Tradeoffs |
|---|---|---|---|
| `strictest` | `const opts = html.ParseOptions{ .drop_whitespace_text_nodes = .none };` | traversal predictability and text fidelity | keeps whitespace-only text nodes |
| `fastest` | `const opts = html.ParseOptions{};` | throughput-first scraping | whitespace-only text nodes dropped; raw node metadata is compact |
| `non-destructive` | `const opts = html.ParseOptions{ .non_destructive = true };` | preserving input bytes, memory maps, exact whole-document formatting | decoded attrs/text are materialized outside the source buffer |
| `full metadata` | `const opts = html.ParseOptions{ .store_last_child = true, .store_prev_sibling = true };` | O(1) `children().last()` and previous-sibling traversal | two extra persisted node indexes |
Fallback playbook:
- Start with `fastest` for bulk workloads.
- Move unstable domains to `strictest`.
- Compile runtime selectors once and reuse `queryRuntime` iterators for repeated queries.
Run benchmarks:
zig build bench-compare
zig build tools -- run-benchmarks --profile quick
zig build tools -- run-benchmarks --profile stable

Artifacts:
- `bench/results/latest.md`
- `bench/results/latest.json`
Benchmark policy:
- parse comparisons include `strlen`, `lexbor`, and parse-only `lol-html`
- query parse/match/cached sections benchmark `html`
- repeated runtime selector workloads should use cached selectors
Warning: throughput numbers are not conformance claims. This parser is permissive by design; see Conformance Status.
Source: bench/results/latest.json (stable profile).
| Fixture | ours-compact | ours-full | lol-html |
|---|---|---|---|
| `rust-lang.html` | 2559.69 | 2307.87 | 1463.58 |
| `wiki-html.html` | 1908.49 | 1950.56 | 1246.95 |
| `mdn-html.html` | 3089.15 | 3000.33 | 1803.77 |
| `w3-html52.html` | 1291.87 | 1321.23 | 688.85 |
| `hn.html` | 1583.62 | 1595.97 | 859.50 |
| `python-org.html` | 2062.70 | 1910.20 | 1319.32 |
| `kernel-org.html` | 1953.55 | 1802.75 | 1269.25 |
| `gnu-org.html` | 2501.68 | 2416.61 | 1471.70 |
| `ziglang-org.html` | 2009.32 | 1875.55 | 1209.16 |
| `ziglang-doc-master.html` | 1431.35 | 1388.04 | 1052.82 |
| `wikipedia-unicode-list.html` | 1805.46 | 1823.80 | 1117.75 |
| `whatwg-html-spec.html` | 1381.78 | 1356.35 | 923.50 |
| `synthetic-forms.html` | 1323.92 | 1298.26 | 723.53 |
| `synthetic-table-grid.html` | 1228.26 | 1221.23 | 716.57 |
| `synthetic-list-nested.html` | 1357.69 | 1312.12 | 667.09 |
| `synthetic-comments-doctype.html` | 2304.51 | 2144.48 | 918.52 |
| `synthetic-template-rich.html` | 1014.82 | 1051.87 | 485.01 |
| `synthetic-whitespace-noise.html` | 1716.97 | 1540.53 | 1021.84 |
| `synthetic-news-feed.html` | 1408.10 | 1309.01 | 645.40 |
| `synthetic-ecommerce.html` | 1246.06 | 1216.35 | 635.49 |
| `synthetic-forum-thread.html` | 1295.57 | 1251.38 | 622.70 |

| Case | compact ops/s | compact ns/op | full ops/s | full ns/op |
|---|---|---|---|---|
| `attr-heavy-button` | 167952.79 | 5954.05 | 147300.14 | 6788.86 |
| `attr-heavy-nav` | 99400.15 | 10060.35 | 110758.09 | 9028.69 |

| Case | compact ops/s | compact ns/op | full ops/s | full ns/op |
|---|---|---|---|---|
| `attr-heavy-button` | 180998.57 | 5524.91 | 160690.48 | 6223.14 |
| `attr-heavy-nav` | 97343.93 | 10272.85 | 95827.84 | 10435.38 |

| Selector case | Ops/s | ns/op |
|---|---|---|
| `simple` | 9792520.96 | 102.12 |
| `complex` | 5450445.38 | 183.47 |
| `grouped` | 7317061.48 | 136.67 |

For full per-parser, per-fixture tables and gate output:
- `bench/results/latest.md`
- `bench/results/latest.json`
Run conformance suites:
zig build conformance
# or
zig build tools -- run-external-suites --mode both

Artifact: `bench/results/external_suite_report.json`
Tracked suites:
- selector suites: `nwmatcher`, `qwery_contextual`
- parser suites:
  - html5lib tree-construction subset
  - WHATWG HTML parsing corpus (via WPT `html/syntax/parsing/html5lib_*.html`)
Fetched suite repos are cached under bench/.cache/suites/ (gitignored).
Core modules:
- `src/html/parser.zig`: permissive parse pipeline
- `src/html/scanner.zig`: byte-scanning hot-path helpers
- `src/html/tags.zig`: tag metadata and hash dispatch
- `src/html/attr.zig`: attribute scanning, lazy materialization, and decode helpers
- `src/html/entities.zig`: entity decode utilities
- `src/selector/runtime.zig`, `src/selector/compile_time.zig`: selector parsing
- `src/selector/matcher.zig`: selector matching/combinator traversal
Data model highlights:
- `Document` always owns node/index storage and may either parse a mutable caller buffer in place or borrow a read-only caller buffer unchanged
- parser-only construction state stays in `src/html/parser.zig`; `Document` retains only post-parse/query state
- nodes are contiguous and linked by indexes for traversal
- attributes are traversed directly from source spans (no heap attribute objects)
- the build-time `-Dintlen` option widens or shrinks those spans and indexes uniformly
- destructive mode is the performance baseline; non-destructive mode exists as an opt-in isolation boundary
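
As a rough picture of "contiguous nodes linked by indexes", the sketch below stores sibling and child links as integers into one array. The `Node` layout, the `none` sentinel, and `countChildren` are all hypothetical; the library's real node fields are not documented here. `Index` stands in for the width chosen with `-Dintlen`.

```zig
const std = @import("std");

// Assumed stand-in for the build-time index width.
const Index = u32;
// Hypothetical sentinel meaning "no such node".
const none: Index = std.math.maxInt(Index);

/// Hypothetical node record: links are array positions, not pointers,
/// so shrinking Index shrinks every node.
const Node = struct {
    parent: Index = none,
    first_child: Index = none,
    next_sibling: Index = none,
};

/// Walks the child list of `parent` by following index links.
fn countChildren(nodes: []const Node, parent: Index) usize {
    var count: usize = 0;
    var cur = nodes[parent].first_child;
    while (cur != none) : (cur = nodes[cur].next_sibling) count += 1;
    return count;
}

test "children are reached through index links" {
    // Node 0 is the root; nodes 1 and 2 are its children.
    const nodes = [_]Node{
        .{ .first_child = 1 },
        .{ .parent = 0, .next_sibling = 2 },
        .{ .parent = 0 },
    };
    try std.testing.expectEqual(@as(usize, 2), countChildren(&nodes, 0));
    try std.testing.expectEqual(@as(usize, 0), countChildren(&nodes, 1));
}
```

Because links are plain integers, the node array can grow or move without invalidating them, which is one common reason to prefer this layout over pointers.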
- validate selector syntax with `Selector.compileRuntime(allocator, source)`
- check scope (`Document` vs scoped `Node`)

- default `innerText` normalizes whitespace
- use `innerTextWithOptions(..., .{ .normalize_whitespace = false })` for raw spacing
- use `innerTextWithOptions(..., .{ .unescape = false })` to preserve entity escapes
- use `innerTextOwnedWithOptions(...)` when output must always be allocated
- call `TextResult.free(doc, gpa)` for non-owned text results

Runtime selector memory must outlive any iterator returned by queryRuntime.
Expected: parse and lazy decode paths mutate source bytes in place.
If the bytes must not change, instantiate a non-destructive document type.