Skip to content

fix: validate UTF-8 continuation bytes in domParseChar#149

Open
toddr-bot wants to merge 5 commits into
mainfrom
koan.toddr.bot/fix-issue-146
Open

fix: validate UTF-8 continuation bytes in domParseChar#149
toddr-bot wants to merge 5 commits into
mainfrom
koan.toddr.bot/fix-issue-146

Conversation

@toddr-bot
Copy link
Copy Markdown
Contributor

@toddr-bot toddr-bot commented May 8, 2026

Summary

domParseChar() read continuation bytes for multi-byte UTF-8 sequences without verifying they exist or are valid. A truncated sequence (e.g., "a\xF0") caused out-of-bounds heap reads past the NUL terminator. This affects all DOM creation methods that validate node names (createElement, createAttribute, setNodeName, etc.).

Fixes #146

Security

Vulnerability: Out-of-bounds heap read in domParseChar() (dom.c:293-314).

Mechanism: When domParseChar() encounters a multi-byte UTF-8 leading byte (e.g., 0xF0 for a 4-byte sequence), it immediately reads continuation bytes cur[1] through cur[3] without checking they exist or have the required 10xxxxxx form. For a truncated input like "a\xF0", the function reads past the NUL terminator into adjacent heap memory. The caller LibXML_test_node_name() then advances its pointer by the (incorrect) reported length and continues scanning heap memory until hitting a zero byte or unmapped page.

Attack surface: Any code path where untrusted input is used as an XML element name, attribute name, or node name — 9 call sites in LibXML.xs including createElement, createElementNS, createAttribute, createAttributeNS, setNodeName, etc.

Impact:

  • Denial of service (high confidence): crash when the OOB read hits unmapped memory.
  • Information disclosure (low confidence): adjacent heap data is read and used for character classification (IS_LETTER, IS_DIGIT macros), but the raw values are not returned to the Perl caller, limiting the exfiltration surface.

Severity: High — any application that passes user-controlled strings to DOM name methods is affected.

Recommendation: This warrants a CVE and a security advisory for XML::LibXML.

Changes

  • Add continuation byte validation ((byte & 0xC0) == 0x80) in domParseChar() for 2-byte, 3-byte, and 4-byte UTF-8 sequences, matching libxml2's own xmlCurrentChar() behavior
  • Add truncated UTF-8 sequences to the @badnames test array in t/06elements.t
  • Add dedicated security regression test t/48_security_oob_utf8_gh146.t with 65 tests covering all truncation points, invalid continuations, and all affected DOM entry points

Test plan

  • t/48_security_oob_utf8_gh146.t: 65 tests across 13 malformed byte sequences × 5 DOM methods
  • t/06elements.t: 3 additional bad-name cases (\xF0, \xE0, \xC0)
  • Full test suite passes (85+ files, 2600+ tests, 0 failures)

Generated by Kōan /fix


Quality Report

Changes: 6 files changed, 170 insertions(+), 16 deletions(-)

Code scan: clean

Tests: passed (OK)

Branch hygiene: clean

Generated by Kōan post-mission quality pipeline

… read

domParseChar() read continuation bytes for multi-byte UTF-8 sequences
without verifying they actually exist or are valid. A truncated sequence
like "a\xF0" caused reads past the NUL terminator into uninitialized
heap memory. The caller LibXML_test_node_name() then advanced its
pointer by the (wrong) reported length, continuing to read from
uncontrolled heap until hitting a zero byte or unmapped memory.

Add validation that each continuation byte has the 10xxxxxx form
before reading it, matching libxml2's own xmlCurrentChar() behavior.
Invalid sequences now return 0 with *len = -1.

Fixes #146

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@toddr
Copy link
Copy Markdown
Member

toddr commented May 8, 2026

@toddr-bot This looks like it's a reportable security problem with this repo. please create an additional unit test file specifically with this issue that shows the potential vulnerability so any researcher can see how the issues works. please also add a SECURITY section in the description and the commit message explaining the risk.

@toddr toddr marked this pull request as ready for review May 8, 2026 12:44
SECURITY: domParseChar() read UTF-8 continuation bytes without bounds
checking. A crafted node name containing a truncated multi-byte UTF-8
sequence (e.g., "a\xF0") caused out-of-bounds heap reads past the NUL
terminator in LibXML_test_node_name(). This affects all DOM methods
that validate names: createElement, createElementNS, createAttribute,
createAttributeNS, setNodeName, and others (9 call sites in
LibXML.xs).

Risk: An attacker who controls node names passed to any DOM creation
method can trigger denial of service (crash on unmapped memory) and
potentially read sensitive data from adjacent heap allocations. The
read data is used only for character classification and not returned
to the caller, limiting the information disclosure surface.

This commit adds t/48_security_oob_utf8_gh146.t with 65 targeted
tests covering:
- Truncated 2/3/4-byte UTF-8 sequences at various truncation points
- Invalid continuation bytes (non-10xxxxxx) in all positions
- All affected entry points: createElement, createElementNS,
  createAttribute, createAttributeNS, setNodeName

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Comment thread dom.c
if ((c & 0xe0) == 0xe0) {
if ((c & 0xf0) == 0xf0) {
/* 4-byte code */
if ((cur[1] & 0xC0) != 0x80 ||
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@toddr-bot shouldn't we be doing a a length check regardless of if the byte looks like it could be utf8? it seems like cur[3] could be out of bounds?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right — the continuation byte checks (`& 0xC0 != 0x80`) happen to catch NUL terminators via short-circuit evaluation (since `0x00 & 0xC0 = 0x00 != 0x80`), but that's subtle and relies on the reader knowing that valid continuation bytes (0x80-0xBF) are always non-zero. An explicit length check is clearer and more defensive.

Fixed in 35f1407: `domParseChar()` now takes a `remaining` parameter. The caller (`LibXML_test_node_name`) computes it once via `xmlStrlen()` and decrements as it advances. Each multi-byte path checks `remaining < N` before reading any continuation bytes, so we never touch `cur[1]`/`cur[2]`/`cur[3]` without knowing they're in bounds — regardless of byte content.

Address review feedback: the continuation byte checks (& 0xC0 != 0x80)
implicitly catch NUL terminators via short-circuit evaluation, but this
is subtle. Add an explicit `remaining` parameter so bounds are checked
before any continuation byte is read, regardless of byte content.

The caller (LibXML_test_node_name) now computes the remaining byte count
once via xmlStrlen() and decrements it as it advances through the string.
domParseChar checks `remaining < N` before reading N-byte sequences,
short-circuiting before the continuation byte checks.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Comment thread dom.c Outdated
if ((c & 0xe0) == 0xe0) {
if ((c & 0xf0) == 0xf0) {
/* 4-byte code */
if (remaining < 4 ||
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@toddr-bot how do we handle what looks like the start of a utf8 word when we don't have enough remaining? should we be aborting? can we just treat them as bytes?

domParseChar() returned *len = -1 for invalid/truncated UTF-8
sequences. While the sole caller (LibXML_test_node_name) never uses
len after a failed character class check, *len = -1 is a latent bug:
any future caller doing cur += len would move the pointer backward
into already-scanned memory.

Change all error paths to *len = 1, return 0. This:
- Rejects the character (return 0 fails all IS_LETTER/IS_DIGIT checks)
- Lets callers safely advance past the bad byte (*len is always >= 1)
- Avoids treating raw leading bytes as their Unicode codepoint values
  (e.g. 0xF0 = U+00F0 ð would pass IS_LETTER, allowing broken UTF-8
  through name validation)

Addresses review feedback on PR #149: we reject truncated sequences
rather than treating them as bytes, because the byte values of UTF-8
leading bytes (0xC0-0xFF) overlap with valid Latin-1 letter codepoints.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@toddr-bot
Copy link
Copy Markdown
Contributor Author

Re: review comment — treating truncated leading bytes as raw bytes:

We can't treat them as bytes — the byte values of UTF-8 leading bytes (0xC0–0xFF) overlap with valid Latin-1 letter codepoints. For example, 0xF0 interpreted as a raw byte is U+00F0 (ð, Latin small letter eth), which passes IS_LETTER and would allow broken UTF-8 through name validation.

The right approach: reject the character but make the length safe to advance past.

Changed all error paths from *len = -1 to *len = 1, keeping return 0:

  • Return 0 → fails all character class checks → name rejected (same behavior as before)
  • *len = 1 → callers can safely skip the bad byte without going backward (-1) or looping forever (0)

The *len = -1 convention was a latent footgun — LibXML_test_node_name happens to never use len after a rejection, but any future caller doing cur += len would get UB. Fixed in 9e60616.

Comment thread dom.c
if (remaining < 4 ||
(cur[1] & 0xC0) != 0x80 ||
(cur[2] & 0xC0) != 0x80 ||
(cur[3] & 0xC0) != 0x80)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@toddr-bot wouldn't this read past the end of cur if remaining < 4

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the strategy is wrong here.

Split the combined `remaining < N || (cur[1]...) || ...` conditions
into two distinct if-blocks: check remaining first, return early if
insufficient, then validate continuation bytes. The previous pattern
relied on || short-circuit evaluation to prevent OOB reads — correct
per C spec, but subtle enough that a future editor could break it by
reordering conditions.

Addresses reviewer feedback on PR #149.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@toddr-bot
Copy link
Copy Markdown
Contributor Author

Addressed in 882b17a — the bounds checks are now separated into their own if blocks with early returns, so continuation byte reads only happen after remaining is verified. No reliance on || short-circuit for safety anymore.

Each multi-byte path (2/3/4-byte) follows the same pattern:

if (remaining < N) {
    *len = 1;
    return(0);
}
if ((cur[1] & 0xC0) != 0x80 || ...)
{
    *len = 1;
    return(0);
}

Full test suite passes.

@nwellnhof
Copy link
Copy Markdown
Contributor

You could save yourself all the trouble and just use xmlValidateName.

@perlpunk
Copy link
Copy Markdown

Is this fix ready to use as a patch? I'm maintaining openSUSE devel:languages:perl packages.
From the comments it is not clear.
Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Security: Out-of-bounds heap read in domParseChar on truncated UTF-8 sequences

4 participants