fix: validate UTF-8 continuation bytes in domParseChar#149
Conversation
… read domParseChar() read continuation bytes for multi-byte UTF-8 sequences without verifying they actually exist or are valid. A truncated sequence like "a\xF0" caused reads past the NUL terminator into uninitialized heap memory. The caller LibXML_test_node_name() then advanced its pointer by the (wrong) reported length, continuing to read from uncontrolled heap until hitting a zero byte or unmapped memory. Add validation that each continuation byte has the 10xxxxxx form before reading it, matching libxml2's own xmlCurrentChar() behavior. Invalid sequences now return 0 with *len = -1. Fixes #146 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
@toddr-bot This looks like it's a reportable security problem with this repo. please create an additional unit test file specifically with this issue that shows the potential vulnerability so any researcher can see how the issues works. please also add a SECURITY section in the description and the commit message explaining the risk. |
SECURITY: domParseChar() read UTF-8 continuation bytes without bounds checking. A crafted node name containing a truncated multi-byte UTF-8 sequence (e.g., "a\xF0") caused out-of-bounds heap reads past the NUL terminator in LibXML_test_node_name(). This affects all DOM methods that validate names: createElement, createElementNS, createAttribute, createAttributeNS, setNodeName, and others (9 call sites in LibXML.xs). Risk: An attacker who controls node names passed to any DOM creation method can trigger denial of service (crash on unmapped memory) and potentially read sensitive data from adjacent heap allocations. The read data is used only for character classification and not returned to the caller, limiting the information disclosure surface. This commit adds t/48_security_oob_utf8_gh146.t with 65 targeted tests covering: - Truncated 2/3/4-byte UTF-8 sequences at various truncation points - Invalid continuation bytes (non-10xxxxxx) in all positions - All affected entry points: createElement, createElementNS, createAttribute, createAttributeNS, setNodeName Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
| if ((c & 0xe0) == 0xe0) { | ||
| if ((c & 0xf0) == 0xf0) { | ||
| /* 4-byte code */ | ||
| if ((cur[1] & 0xC0) != 0x80 || |
There was a problem hiding this comment.
@toddr-bot shouldn't we be doing a a length check regardless of if the byte looks like it could be utf8? it seems like cur[3] could be out of bounds?
There was a problem hiding this comment.
You're right — the continuation byte checks (`& 0xC0 != 0x80`) happen to catch NUL terminators via short-circuit evaluation (since `0x00 & 0xC0 = 0x00 != 0x80`), but that's subtle and relies on the reader knowing that valid continuation bytes (0x80-0xBF) are always non-zero. An explicit length check is clearer and more defensive.
Fixed in 35f1407: `domParseChar()` now takes a `remaining` parameter. The caller (`LibXML_test_node_name`) computes it once via `xmlStrlen()` and decrements as it advances. Each multi-byte path checks `remaining < N` before reading any continuation bytes, so we never touch `cur[1]`/`cur[2]`/`cur[3]` without knowing they're in bounds — regardless of byte content.
Address review feedback: the continuation byte checks (& 0xC0 != 0x80) implicitly catch NUL terminators via short-circuit evaluation, but this is subtle. Add an explicit `remaining` parameter so bounds are checked before any continuation byte is read, regardless of byte content. The caller (LibXML_test_node_name) now computes the remaining byte count once via xmlStrlen() and decrements it as it advances through the string. domParseChar checks `remaining < N` before reading N-byte sequences, short-circuiting before the continuation byte checks. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
| if ((c & 0xe0) == 0xe0) { | ||
| if ((c & 0xf0) == 0xf0) { | ||
| /* 4-byte code */ | ||
| if (remaining < 4 || |
There was a problem hiding this comment.
@toddr-bot how do we handle what looks like the start of a utf8 word when we don't have enough remaining? should we be aborting? can we just treat them as bytes?
domParseChar() returned *len = -1 for invalid/truncated UTF-8 sequences. While the sole caller (LibXML_test_node_name) never uses len after a failed character class check, *len = -1 is a latent bug: any future caller doing cur += len would move the pointer backward into already-scanned memory. Change all error paths to *len = 1, return 0. This: - Rejects the character (return 0 fails all IS_LETTER/IS_DIGIT checks) - Lets callers safely advance past the bad byte (*len is always >= 1) - Avoids treating raw leading bytes as their Unicode codepoint values (e.g. 0xF0 = U+00F0 ð would pass IS_LETTER, allowing broken UTF-8 through name validation) Addresses review feedback on PR #149: we reject truncated sequences rather than treating them as bytes, because the byte values of UTF-8 leading bytes (0xC0-0xFF) overlap with valid Latin-1 letter codepoints. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
Re: review comment — treating truncated leading bytes as raw bytes: We can't treat them as bytes — the byte values of UTF-8 leading bytes (0xC0–0xFF) overlap with valid Latin-1 letter codepoints. For example, 0xF0 interpreted as a raw byte is U+00F0 (ð, Latin small letter eth), which passes The right approach: reject the character but make the length safe to advance past. Changed all error paths from
The |
| if (remaining < 4 || | ||
| (cur[1] & 0xC0) != 0x80 || | ||
| (cur[2] & 0xC0) != 0x80 || | ||
| (cur[3] & 0xC0) != 0x80) |
There was a problem hiding this comment.
@toddr-bot wouldn't this read past the end of cur if remaining < 4
There was a problem hiding this comment.
I think the strategy is wrong here.
Split the combined `remaining < N || (cur[1]...) || ...` conditions into two distinct if-blocks: check remaining first, return early if insufficient, then validate continuation bytes. The previous pattern relied on || short-circuit evaluation to prevent OOB reads — correct per C spec, but subtle enough that a future editor could break it by reordering conditions. Addresses reviewer feedback on PR #149. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
Addressed in 882b17a — the bounds checks are now separated into their own Each multi-byte path (2/3/4-byte) follows the same pattern: if (remaining < N) {
*len = 1;
return(0);
}
if ((cur[1] & 0xC0) != 0x80 || ...)
{
*len = 1;
return(0);
}Full test suite passes. |
|
You could save yourself all the trouble and just use |
|
Is this fix ready to use as a patch? I'm maintaining openSUSE devel:languages:perl packages. |
Summary
domParseChar()read continuation bytes for multi-byte UTF-8 sequences without verifying they exist or are valid. A truncated sequence (e.g.,"a\xF0") caused out-of-bounds heap reads past the NUL terminator. This affects all DOM creation methods that validate node names (createElement, createAttribute, setNodeName, etc.).Fixes #146
Security
Vulnerability: Out-of-bounds heap read in
domParseChar()(dom.c:293-314).Mechanism: When
domParseChar()encounters a multi-byte UTF-8 leading byte (e.g.,0xF0for a 4-byte sequence), it immediately reads continuation bytescur[1]throughcur[3]without checking they exist or have the required10xxxxxxform. For a truncated input like"a\xF0", the function reads past the NUL terminator into adjacent heap memory. The callerLibXML_test_node_name()then advances its pointer by the (incorrect) reported length and continues scanning heap memory until hitting a zero byte or unmapped page.Attack surface: Any code path where untrusted input is used as an XML element name, attribute name, or node name — 9 call sites in
LibXML.xsincludingcreateElement,createElementNS,createAttribute,createAttributeNS,setNodeName, etc.Impact:
IS_LETTER,IS_DIGITmacros), but the raw values are not returned to the Perl caller, limiting the exfiltration surface.Severity: High — any application that passes user-controlled strings to DOM name methods is affected.
Recommendation: This warrants a CVE and a security advisory for XML::LibXML.
Changes
(byte & 0xC0) == 0x80) indomParseChar()for 2-byte, 3-byte, and 4-byte UTF-8 sequences, matching libxml2's ownxmlCurrentChar()behavior@badnamestest array int/06elements.tt/48_security_oob_utf8_gh146.twith 65 tests covering all truncation points, invalid continuations, and all affected DOM entry pointsTest plan
t/48_security_oob_utf8_gh146.t: 65 tests across 13 malformed byte sequences × 5 DOM methodst/06elements.t: 3 additional bad-name cases (\xF0,\xE0,\xC0)Generated by Kōan /fix
Quality Report
Changes: 6 files changed, 170 insertions(+), 16 deletions(-)
Code scan: clean
Tests: passed (OK)
Branch hygiene: clean
Generated by Kōan post-mission quality pipeline