24 changes: 21 additions & 3 deletions core/engine/src/builtins/regexp/mod.rs
@@ -1136,6 +1136,24 @@ impl RegExp {
// 9. If flags contains "u" or flags contains "v", let fullUnicode be true; else let fullUnicode be false.
let full_unicode = flags.contains(b'u') || flags.contains(b'v');

+// When the /u or /v flag is active, the input string is modeled as a sequence
+// of Unicode code points (§22.2.2). Since `last_index` is a UTF-16 code unit
+// index, it may point to the trailing half of a surrogate pair, which is not
+// a valid code point boundary. In that case, we adjust the matcher start
+// position to the preceding lead surrogate so matching begins at a valid
+// code point boundary.
+// Ref: https://tc39.es/ecma262/#sec-pattern-semantics
+let mut start_index = last_index;
+if full_unicode
+&& start_index > 0
+&& let Some(cu) = input.code_unit_at(start_index as usize)
+&& (0xDC00..=0xDFFF).contains(&cu)
+&& let Some(prev_cu) = input.code_unit_at(start_index as usize - 1)
Comment on lines +1149 to +1151
Copilot AI Apr 5, 2026

`start_index` is derived from `last_index` (`u64`) and cast to `usize` in `code_unit_at` before validating `last_index <= length`. On 32-bit targets (or with very large `lastIndex` values), the `as usize` cast can truncate and accidentally make the index appear in-bounds, leading to incorrect surrogate-boundary adjustment (and potentially incorrect match results). Consider moving the surrogate-boundary adjustment block to after the `if last_index > length { ... return }` early-exit, and/or using a checked conversion (`usize::try_from`) before calling `code_unit_at`.

Suggested change
-&& let Some(cu) = input.code_unit_at(start_index as usize)
-&& (0xDC00..=0xDFFF).contains(&cu)
-&& let Some(prev_cu) = input.code_unit_at(start_index as usize - 1)
+&& start_index <= length
+&& let Ok(start_index_usize) = usize::try_from(start_index)
+&& let Some(cu) = input.code_unit_at(start_index_usize)
+&& (0xDC00..=0xDFFF).contains(&cu)
+&& let Some(prev_cu) = input.code_unit_at(start_index_usize - 1)

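To make the truncation concern in this comment concrete, here is a minimal sketch of a checked conversion. Both functions are hypothetical helpers written for illustration; they are not part of the PR or of Boa, and `len` stands in for the input length in code units.

```rust
/// Hypothetical helper (not from the PR): converts a `u64` lastIndex into a
/// `usize` position. Returns `None` if the value does not fit in `usize` or
/// lies past the end of the input.
fn checked_start(last_index: u64, len: usize) -> Option<usize> {
    // `try_from` fails instead of silently truncating on 32-bit targets.
    let index = usize::try_from(last_index).ok()?;
    (index <= len).then_some(index)
}

/// Hypothetical helper: shows what a truncating cast to a 32-bit index
/// produces, mimicking `as usize` on a 32-bit target.
fn truncated_to_32_bits(last_index: u64) -> u32 {
    last_index as u32
}
```

For instance, `(1u64 << 32) + 1` truncates to `1` as a 32-bit index, which would look in-bounds for a two-unit string, whereas `checked_start` returns `None` for the same value.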
+&& (0xD800..=0xDBFF).contains(&prev_cu)
+{
+start_index -= 1;
+}

// NOTE: The following steps are taken care of by regress:
//
// SKIP: 10. Let matchSucceeded be false.
@@ -1163,13 +1181,13 @@ impl RegExp {
let input = input.to_vec();

// NOTE: We can use the faster ucs2 variant since there will never be two byte unicode.
-matcher.find_from_ucs2(&input, last_index as usize).next()
+matcher.find_from_ucs2(&input, start_index as usize).next()
}
(true, JsStrVariant::Utf16(input)) => {
-matcher.find_from_utf16(input, last_index as usize).next()
+matcher.find_from_utf16(input, start_index as usize).next()
}
(false, JsStrVariant::Utf16(input)) => {
-matcher.find_from_ucs2(input, last_index as usize).next()
+matcher.find_from_ucs2(input, start_index as usize).next()
Comment on lines 1183 to +1190
Copilot AI Apr 5, 2026

The matcher is now invoked with `start_index`, but sticky emulation still checks `match_value.start() != last_index`. When `/u` or `/v` adjusts `start_index` (e.g. `lastIndex` points into the trailing surrogate), this will incorrectly reject valid sticky matches because `match_value.start()` will equal `start_index` (adjusted) rather than `last_index` (original). The sticky check should compare against the actual matcher start position (or a computed “effective lastIndex” used to start matching) to avoid a behavior regression for `/yu` and `/yv`.

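A rough sketch of the sticky check this comment asks for, under the assumption that a match is represented as a range of code unit offsets. `apply_sticky` and its parameter names are hypothetical and do not correspond to Boa's or regress's actual API.

```rust
use std::ops::Range;

/// Hypothetical sticky filter (illustrative only): keeps a match only if it
/// begins exactly where the matcher was started. `effective_start` is the
/// position actually passed to the matcher, i.e. the adjusted index rather
/// than the original lastIndex.
fn apply_sticky(found: Option<Range<usize>>, effective_start: usize) -> Option<Range<usize>> {
    found.filter(|m| m.start == effective_start)
}
```

With an original `lastIndex` of 1 adjusted down to 0, a match spanning `0..2` is kept when compared against the adjusted position (`apply_sticky(Some(0..2), 0)`) and dropped when compared against the original one (`apply_sticky(Some(0..2), 1)`), which is the regression described above.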
}
};

48 changes: 48 additions & 0 deletions core/engine/src/builtins/regexp/tests.rs
@@ -262,3 +262,51 @@ fn regexp_no_panic_on_empty_class_quantifier() {
    // It should return null without panicking.
    run_test_actions([TestAction::assert_eq("/[]*1/u.exec()", JsValue::null())]);
}

+#[test]
+fn regexp_exec_coercion_order() {
+    // ECMAScript §21.2.5.2.1 - RegExpExec
+    // Ensures ToString(input) happens before accessing lastIndex
+    run_test_actions([TestAction::assert_eq(
+        indoc! {r#"
+            let log = [];
+            let re = /a/g;
+
+            re.lastIndex = {
+                valueOf() { log.push("lastIndex"); return 0; }
+            };
+
+            let str = {
+                toString() { log.push("string"); return "a"; }
+            };
+
+            re.exec(str);
+            log.join(",");
+        "#},
+        js_str!("string,lastIndex"),
+    )]);
+}

+#[test]
+fn regexp_unicode_lastindex_surrogate_boundary() {
+    run_test_actions([TestAction::assert_eq(
+        indoc! {r#"
+            let re = /./gu;
+            re.lastIndex = 1;
+            re.exec("💩")[0];
+        "#},
+        js_str!("💩"),
+    )]);
+}

+#[test]
+fn regexp_unicode_lastindex_no_adjustment() {
+    run_test_actions([TestAction::assert_eq(
+        indoc! {r#"
+            let re = /./gu;
+            re.lastIndex = 0;
+            re.exec("💩")[0];
+        "#},
+        js_str!("💩"),
+    )]);
+}
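
For reference, the adjustment that the two surrogate-boundary tests above exercise can be sketched on a plain UTF-16 slice. This is a standalone illustration under stated assumptions: `adjust_to_code_point_boundary` is a hypothetical name, and it works on `&[u16]` rather than on Boa's string type.

```rust
/// Hypothetical standalone version of the adjustment (not Boa's code):
/// if `index` points at a trail surrogate that is immediately preceded by a
/// lead surrogate, step back one code unit so the position lands on a code
/// point boundary. Otherwise the index is returned unchanged.
fn adjust_to_code_point_boundary(units: &[u16], index: usize) -> usize {
    if index == 0 {
        return index;
    }
    match (units.get(index - 1), units.get(index)) {
        (Some(prev), Some(cur))
            if (0xD800..=0xDBFF).contains(prev) && (0xDC00..=0xDFFF).contains(cur) =>
        {
            index - 1
        }
        // Out-of-range positions and lone surrogates are left as they are.
        _ => index,
    }
}
```

For the string "💩", which encodes as the two code units `0xD83D 0xDCA9`, index 1 is moved back to 0, while indices 0 and 2 are unchanged. A lone trail surrogate that follows a non-surrogate unit is also left alone.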