fix: validate UTF-8 continuation bytes in domParseChar by toddr-bot · Pull Request #149 · cpan-authors/XML-LibXML

toddr-bot · 2026-05-08T12:26:49Z

Summary

domParseChar() read continuation bytes for multi-byte UTF-8 sequences without verifying they exist or are valid. A truncated sequence (e.g., "a\xF0") caused out-of-bounds heap reads past the NUL terminator. This affects all DOM creation methods that validate node names (createElement, createAttribute, setNodeName, etc.).

Fixes #146

Security

Vulnerability: Out-of-bounds heap read in domParseChar() (dom.c:293-314).

Mechanism: When domParseChar() encounters a multi-byte UTF-8 leading byte (e.g., 0xF0 for a 4-byte sequence), it immediately reads continuation bytes cur[1] through cur[3] without checking they exist or have the required 10xxxxxx form. For a truncated input like "a\xF0", the function reads past the NUL terminator into adjacent heap memory. The caller LibXML_test_node_name() then advances its pointer by the (incorrect) reported length and continues scanning heap memory until hitting a zero byte or unmapped page.

Attack surface: Any code path where untrusted input is used as an XML element name, attribute name, or node name — 9 call sites in LibXML.xs including createElement, createElementNS, createAttribute, createAttributeNS, setNodeName, etc.

Impact:

Denial of service (high confidence): crash when the OOB read hits unmapped memory.
Information disclosure (low confidence): adjacent heap data is read and used for character classification (IS_LETTER, IS_DIGIT macros), but the raw values are not returned to the Perl caller, limiting the exfiltration surface.

Severity: High — any application that passes user-controlled strings to DOM name methods is affected.

Recommendation: This warrants a CVE and a security advisory for XML::LibXML.

Changes

Add continuation byte validation ((byte & 0xC0) == 0x80) in domParseChar() for 2-byte, 3-byte, and 4-byte UTF-8 sequences, matching libxml2's own xmlCurrentChar() behavior
Add truncated UTF-8 sequences to the @badnames test array in t/06elements.t
Add dedicated security regression test t/48_security_oob_utf8_gh146.t with 65 tests covering all truncation points, invalid continuations, and all affected DOM entry points

Test plan

t/48_security_oob_utf8_gh146.t: 65 tests across 13 malformed byte sequences × 5 DOM methods
t/06elements.t: 3 additional bad-name cases (\xF0, \xE0, \xC0)
Full test suite passes (85+ files, 2600+ tests, 0 failures)

Generated by Kōan /fix

Quality Report

Changes: 6 files changed, 170 insertions(+), 16 deletions(-)

Code scan: clean

Tests: passed (OK)

Branch hygiene: clean

Generated by Kōan post-mission quality pipeline

… read domParseChar() read continuation bytes for multi-byte UTF-8 sequences without verifying they actually exist or are valid. A truncated sequence like "a\xF0" caused reads past the NUL terminator into uninitialized heap memory. The caller LibXML_test_node_name() then advanced its pointer by the (wrong) reported length, continuing to read from uncontrolled heap until hitting a zero byte or unmapped memory. Add validation that each continuation byte has the 10xxxxxx form before reading it, matching libxml2's own xmlCurrentChar() behavior. Invalid sequences now return 0 with *len = -1. Fixes #146 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

toddr · 2026-05-08T12:43:44Z

@toddr-bot This looks like it's a reportable security problem with this repo. please create an additional unit test file specifically with this issue that shows the potential vulnerability so any researcher can see how the issues works. please also add a SECURITY section in the description and the commit message explaining the risk.

SECURITY: domParseChar() read UTF-8 continuation bytes without bounds checking. A crafted node name containing a truncated multi-byte UTF-8 sequence (e.g., "a\xF0") caused out-of-bounds heap reads past the NUL terminator in LibXML_test_node_name(). This affects all DOM methods that validate names: createElement, createElementNS, createAttribute, createAttributeNS, setNodeName, and others (9 call sites in LibXML.xs). Risk: An attacker who controls node names passed to any DOM creation method can trigger denial of service (crash on unmapped memory) and potentially read sensitive data from adjacent heap allocations. The read data is used only for character classification and not returned to the caller, limiting the information disclosure surface. This commit adds t/48_security_oob_utf8_gh146.t with 65 targeted tests covering: - Truncated 2/3/4-byte UTF-8 sequences at various truncation points - Invalid continuation bytes (non-10xxxxxx) in all positions - All affected entry points: createElement, createElementNS, createAttribute, createAttributeNS, setNodeName Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

toddr · 2026-05-08T13:06:15Z

        if ((c & 0xe0) == 0xe0) {
            if ((c & 0xf0) == 0xf0) {
                /* 4-byte code */
+                if ((cur[1] & 0xC0) != 0x80 ||


@toddr-bot shouldn't we be doing a a length check regardless of if the byte looks like it could be utf8? it seems like cur[3] could be out of bounds?

You're right — the continuation byte checks (`& 0xC0 != 0x80`) happen to catch NUL terminators via short-circuit evaluation (since `0x00 & 0xC0 = 0x00 != 0x80`), but that's subtle and relies on the reader knowing that valid continuation bytes (0x80-0xBF) are always non-zero. An explicit length check is clearer and more defensive.

Fixed in 35f1407: `domParseChar()` now takes a `remaining` parameter. The caller (`LibXML_test_node_name`) computes it once via `xmlStrlen()` and decrements as it advances. Each multi-byte path checks `remaining < N` before reading any continuation bytes, so we never touch `cur[1]`/`cur[2]`/`cur[3]` without knowing they're in bounds — regardless of byte content.

Address review feedback: the continuation byte checks (& 0xC0 != 0x80) implicitly catch NUL terminators via short-circuit evaluation, but this is subtle. Add an explicit `remaining` parameter so bounds are checked before any continuation byte is read, regardless of byte content. The caller (LibXML_test_node_name) now computes the remaining byte count once via xmlStrlen() and decrements it as it advances through the string. domParseChar checks `remaining < N` before reading N-byte sequences, short-circuiting before the continuation byte checks. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

toddr · 2026-05-08T14:39:12Z

        if ((c & 0xe0) == 0xe0) {
            if ((c & 0xf0) == 0xf0) {
                /* 4-byte code */
+                if (remaining < 4 ||


@toddr-bot how do we handle what looks like the start of a utf8 word when we don't have enough remaining? should we be aborting? can we just treat them as bytes?

domParseChar() returned *len = -1 for invalid/truncated UTF-8 sequences. While the sole caller (LibXML_test_node_name) never uses len after a failed character class check, *len = -1 is a latent bug: any future caller doing cur += len would move the pointer backward into already-scanned memory. Change all error paths to *len = 1, return 0. This: - Rejects the character (return 0 fails all IS_LETTER/IS_DIGIT checks) - Lets callers safely advance past the bad byte (*len is always >= 1) - Avoids treating raw leading bytes as their Unicode codepoint values (e.g. 0xF0 = U+00F0 ð would pass IS_LETTER, allowing broken UTF-8 through name validation) Addresses review feedback on PR #149: we reject truncated sequences rather than treating them as bytes, because the byte values of UTF-8 leading bytes (0xC0-0xFF) overlap with valid Latin-1 letter codepoints. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

toddr-bot · 2026-05-08T14:46:05Z

Re: review comment — treating truncated leading bytes as raw bytes:

We can't treat them as bytes — the byte values of UTF-8 leading bytes (0xC0–0xFF) overlap with valid Latin-1 letter codepoints. For example, 0xF0 interpreted as a raw byte is U+00F0 (ð, Latin small letter eth), which passes IS_LETTER and would allow broken UTF-8 through name validation.

The right approach: reject the character but make the length safe to advance past.

Changed all error paths from *len = -1 to *len = 1, keeping return 0:

Return 0 → fails all character class checks → name rejected (same behavior as before)
*len = 1 → callers can safely skip the bad byte without going backward (-1) or looping forever (0)

The *len = -1 convention was a latent footgun — LibXML_test_node_name happens to never use len after a rejection, but any future caller doing cur += len would get UB. Fixed in 9e60616.

toddr · 2026-05-08T14:59:41Z

+                if (remaining < 4 ||
+                    (cur[1] & 0xC0) != 0x80 ||
+                    (cur[2] & 0xC0) != 0x80 ||
+                    (cur[3] & 0xC0) != 0x80)


@toddr-bot wouldn't this read past the end of cur if remaining < 4

I think the strategy is wrong here.

Split the combined `remaining < N || (cur[1]...) || ...` conditions into two distinct if-blocks: check remaining first, return early if insufficient, then validate continuation bytes. The previous pattern relied on || short-circuit evaluation to prevent OOB reads — correct per C spec, but subtle enough that a future editor could break it by reordering conditions. Addresses reviewer feedback on PR #149. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

toddr-bot · 2026-05-08T15:04:32Z

Addressed in 882b17a — the bounds checks are now separated into their own if blocks with early returns, so continuation byte reads only happen after remaining is verified. No reliance on || short-circuit for safety anymore.

Each multi-byte path (2/3/4-byte) follows the same pattern:

if (remaining < N) {
    *len = 1;
    return(0);
}
if ((cur[1] & 0xC0) != 0x80 || ...)
{
    *len = 1;
    return(0);
}

Full test suite passes.

nwellnhof · 2026-05-12T10:54:54Z

You could save yourself all the trouble and just use xmlValidateName.

perlpunk · 2026-05-12T15:05:40Z

Is this fix ready to use as a patch? I'm maintaining openSUSE devel:languages:perl packages.
From the comments it is not clear.
Thanks!

toddr marked this pull request as ready for review May 8, 2026 12:44

toddr reviewed May 8, 2026

View reviewed changes

nixpkgs-security-tracker Bot mentioned this pull request May 11, 2026

XML::LibXML versions through 2.0210 for Perl read out-of-bounds heap memory when parsing XML node names containing truncated UTF-8 byte sequences NixOS/nixpkgs#519088

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: validate UTF-8 continuation bytes in domParseChar#149

fix: validate UTF-8 continuation bytes in domParseChar#149
toddr-bot wants to merge 5 commits into
mainfrom
koan.toddr.bot/fix-issue-146

toddr-bot commented May 8, 2026 •

edited

Loading

Uh oh!

toddr commented May 8, 2026

Uh oh!

toddr May 8, 2026

Uh oh!

toddr-bot May 8, 2026

Uh oh!

toddr May 8, 2026

Uh oh!

toddr-bot commented May 8, 2026

Uh oh!

toddr May 8, 2026

Uh oh!

toddr May 8, 2026

Uh oh!

toddr-bot commented May 8, 2026

Uh oh!

nwellnhof commented May 12, 2026

Uh oh!

perlpunk commented May 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

toddr-bot commented May 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Security

Changes

Test plan

Quality Report

Uh oh!

toddr commented May 8, 2026

Uh oh!

toddr May 8, 2026

Choose a reason for hiding this comment

Uh oh!

toddr-bot May 8, 2026

Choose a reason for hiding this comment

Uh oh!

toddr May 8, 2026

Choose a reason for hiding this comment

Uh oh!

toddr-bot commented May 8, 2026

Uh oh!

toddr May 8, 2026

Choose a reason for hiding this comment

Uh oh!

toddr May 8, 2026

Choose a reason for hiding this comment

Uh oh!

toddr-bot commented May 8, 2026

Uh oh!

nwellnhof commented May 12, 2026

Uh oh!

perlpunk commented May 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

toddr-bot commented May 8, 2026 •

edited

Loading