Add ascii fast path for unicode_word_indices and unicode_words by PSeitz-dd · Pull Request #147 · unicode-rs/unicode-segmentation

PSeitz-dd · 2025-07-11T05:54:09Z

This PR adds a fast path for unicode_word_indices and unicode_words.

Add an ASCII version behind an ASCII check
Add a benchmark
Add an enum iterator that wraps the ASCII and non-ASCII iterators in a single iterator
Add a proptest to make sure the ASCII and non-ASCII methods behave exactly the same.
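
The last item describes differential testing: run both paths on the same ASCII input and require identical output. A minimal sketch of that idea follows. It assumes nothing about the crate's internals: `split_general` and `split_ascii` are hypothetical stand-ins built on std's whitespace splitters, and a small hand-rolled generator replaces the proptest crate so the example has no dependencies.

```rust
/// Reference path: Unicode-aware splitting (stand-in for the general iterator).
fn split_general(s: &str) -> Vec<&str> {
    s.split_whitespace().collect()
}

/// Fast path: byte-oriented splitting, only meant for ASCII input.
fn split_ascii(s: &str) -> Vec<&str> {
    debug_assert!(s.is_ascii());
    s.split_ascii_whitespace().collect()
}

/// Deterministic pseudo-random ASCII strings (xorshift), so no crates are needed.
fn ascii_inputs(count: usize, max_len: usize) -> Vec<String> {
    // The alphabet deliberately omits U+000B (vertical tab): `char::is_whitespace`
    // accepts it while `u8::is_ascii_whitespace` does not, so the two stand-ins
    // would disagree on it.
    const ALPHABET: &[u8] = b"ab1 \t\n,.'_";
    let mut state: u64 = 0x9E37_79B9_7F4A_7C15;
    let mut next = move || {
        state ^= state << 13;
        state ^= state >> 7;
        state ^= state << 17;
        state
    };
    (0..count)
        .map(|_| {
            let len = (next() % (max_len as u64 + 1)) as usize;
            (0..len)
                .map(|_| ALPHABET[(next() % ALPHABET.len() as u64) as usize] as char)
                .collect::<String>()
        })
        .collect()
}

/// Returns the first input on which the two paths disagree, if any.
fn first_mismatch(inputs: &[String]) -> Option<&String> {
    inputs.iter().find(|s| split_ascii(s) != split_general(s))
}
```

The vertical-tab case noted in the comment is the kind of edge case such a test exists to catch: two implementations that look equivalent can still differ on a single byte.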

Performance for ASCII text is greatly improved, while everything else gets a slight performance hit.

unicode_word_indices/log
                        time:   [209.68 ns 212.35 ns 215.32 ns]
                        thrpt:  [761.82 MiB/s 772.45 MiB/s 782.29 MiB/s]
                 change:
                        time:   [-82.313% -82.117% -81.888%] (p = 0.00 < 0.05)
                        thrpt:  [+452.12% +459.18% +465.38%]
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild
unicode_word_indices/english
                        time:   [426.95 µs 428.03 µs 429.18 µs]
                        thrpt:  [110.42 MiB/s 110.72 MiB/s 110.99 MiB/s]
                 change:
                        time:   [+2.0057% +2.3798% +2.7729%] (p = 0.00 < 0.05)
                        thrpt:  [-2.6981% -2.3245% -1.9662%]
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) low mild
  4 (4.00%) high mild
unicode_word_indices/japanese
                        time:   [415.66 µs 416.25 µs 416.86 µs]
                        thrpt:  [116.02 MiB/s 116.18 MiB/s 116.35 MiB/s]
                 change:
                        time:   [+0.5766% +0.8184% +1.0398%] (p = 0.00 < 0.05)
                        thrpt:  [-1.0291% -0.8118% -0.5733%]
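
The `time` and `thrpt` change lines in this output are two views of the same measurement: for a fixed input, throughput is inversely proportional to run time. The helper functions below are hypothetical and only illustrate that arithmetic; they are not part of criterion or of this PR.

```rust
/// Converts a relative change in run time into the implied relative change
/// in throughput. Both are fractions: -0.82 means "82% less time".
fn throughput_change(time_change: f64) -> f64 {
    1.0 / (1.0 + time_change) - 1.0
}

/// Estimates the input size in bytes from a run time and a throughput in MiB/s.
fn input_bytes(time_secs: f64, mib_per_sec: f64) -> f64 {
    time_secs * mib_per_sec * 1024.0 * 1024.0
}
```

Plugging in the reported midpoints, a time change of -82.117% gives roughly +459.2% throughput and +2.3798% gives roughly -2.32%, both in line with the output above. By the same arithmetic, the log case (212.35 ns at 772.45 MiB/s) corresponds to an input of about 172 bytes.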

Manishearth · 2025-07-14T15:30:07Z

src/word.rs

-/// [`unicode_words`]: trait.UnicodeSegmentation.html#tymethod.unicode_words
-/// [`UnicodeSegmentation`]: trait.UnicodeSegmentation.html
-#[derive(Debug)]
-pub struct UnicodeWords<'a> {


issue: removing these types is a breaking change

We deliberately have concrete types here so that they can be used inside other things conveniently.

PSeitz · 2025-07-15

I updated the PR to keep the iterator structs.
(Personally I prefer iterators, though, as the contract is clearer.)
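
To illustrate the reviewer's point about concrete types: a caller that wants to store the iterator in a struct field needs a type it can name. The sketch below uses std's `SplitWhitespace` as a stand-in for `UnicodeWords`; the `Tokenizer` type is hypothetical and exists only for this example.

```rust
/// A caller-side struct that keeps an iterator in a field. The field needs a
/// nameable type; a function returning `impl Iterator` would not provide one.
struct Tokenizer<'a> {
    words: std::str::SplitWhitespace<'a>,
    emitted: usize,
}

impl<'a> Tokenizer<'a> {
    fn new(text: &'a str) -> Self {
        Tokenizer {
            words: text.split_whitespace(),
            emitted: 0,
        }
    }

    /// Pulls the next token and counts how many have been handed out.
    fn next_token(&mut self) -> Option<&'a str> {
        let token = self.words.next()?;
        self.emitted += 1;
        Some(token)
    }
}
```

If the library returned an opaque iterator instead, `Tokenizer` would have to become generic over the iterator type or box it.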

Manishearth · 2025-07-15T15:27:23Z

src/word.rs

 pub struct UnicodeWordIndices<'a> {
    #[allow(clippy::type_complexity)]
-    inner: Filter<UWordBoundIndices<'a>, fn(&(usize, &str)) -> bool>,
+    inner: Box<dyn DoubleEndedIterator<Item = (usize, &'a str)> + 'a>,


I don't want this to allocate either.

PSeitz · 2025-07-17

I replaced it with a custom enum iterator (it's also much faster :))

Manishearth

As it stands, I think this is still a ways away from landing, for spec correctness reasons and for performance reasons. We should not be allocating a trait object here; you can get the same effect with a custom iterator that wraps an enum around the two iterators. (Either also works for this, but I don't want to add a dependency to a somewhat bottom-of-the-deptree crate.)

PSeitz-dd · 2025-07-17T05:18:12Z

> As it stands, I think this is still a ways away from landing, for spec correctness reasons and for performance reasons. We should not be allocating a trait object here; you can get the same effect with a custom iterator that wraps an enum around the two iterators. (Either also works for this, but I don't want to add a dependency to a somewhat bottom-of-the-deptree crate.)

I replaced the Box with a custom enum. It's faster, and the performance hit for non-ASCII is quite low now.
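
The enum approach discussed here can be sketched as follows. This is not the code from the PR: `Words` is a hypothetical name, and std's whitespace splitters stand in for the crate's ASCII and Unicode word iterators. The shape is the point: one concrete type, chosen once by an ASCII check, with each method forwarding through a `match`, so no trait object is allocated.

```rust
/// One concrete type wrapping two iterators; no `Box<dyn ...>`, no allocation.
#[derive(Debug)]
enum Words<'a> {
    Ascii(std::str::SplitAsciiWhitespace<'a>),
    Unicode(std::str::SplitWhitespace<'a>),
}

impl<'a> Words<'a> {
    /// Picks the variant once, up front, based on an ASCII check.
    fn new(text: &'a str) -> Self {
        if text.is_ascii() {
            Words::Ascii(text.split_ascii_whitespace())
        } else {
            Words::Unicode(text.split_whitespace())
        }
    }
}

impl<'a> Iterator for Words<'a> {
    type Item = &'a str;

    #[inline]
    fn next(&mut self) -> Option<&'a str> {
        match self {
            Words::Ascii(it) => it.next(),
            Words::Unicode(it) => it.next(),
        }
    }
}

impl<'a> DoubleEndedIterator for Words<'a> {
    #[inline]
    fn next_back(&mut self) -> Option<&'a str> {
        match self {
            Words::Ascii(it) => it.next_back(),
            Words::Unicode(it) => it.next_back(),
        }
    }
}
```

Compared with a boxed trait object, the `match` costs one branch per call and can be inlined, which fits the small non-ASCII overhead reported in this thread.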

Manishearth · 2025-07-17T20:57:10Z

src/word.rs

+///
+/// Any other single ASCII byte is its own boundary (the default WB999).
+#[derive(Debug)]
+pub struct AsciiWordBoundIter<'a> {


This should not be pub; maybe pub(crate) if you need it to be.
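
A minimal sketch of the visibility pattern being requested, with hypothetical names (`AsciiRuns`, `WordRuns`) and a deliberately simplified rule (maximal runs of ASCII alphanumerics) that is not the UAX #29 logic from the PR. In a library these items would live in a module such as src/word.rs; the helper stays `pub(crate)` and only the wrapper is exported.

```rust
/// Implementation detail: usable anywhere in the crate, invisible downstream.
#[derive(Debug)]
pub(crate) struct AsciiRuns<'a> {
    rest: &'a str,
}

impl<'a> Iterator for AsciiRuns<'a> {
    type Item = &'a str;

    fn next(&mut self) -> Option<&'a str> {
        // Skip to the first alphanumeric byte; stop if there is none.
        let start = self.rest.bytes().position(|b| b.is_ascii_alphanumeric())?;
        let tail = &self.rest[start..];
        // The run ends at the first non-alphanumeric byte or at end of input.
        let end = tail
            .bytes()
            .position(|b| !b.is_ascii_alphanumeric())
            .unwrap_or(tail.len());
        let (run, rest) = tail.split_at(end);
        self.rest = rest;
        Some(run)
    }
}

/// Public wrapper. The field is private, so `AsciiRuns` never appears in the
/// public API and can change without a breaking release.
#[derive(Debug)]
pub struct WordRuns<'a> {
    inner: AsciiRuns<'a>,
}

impl<'a> WordRuns<'a> {
    pub fn new(text: &'a str) -> Self {
        WordRuns { inner: AsciiRuns { rest: text } }
    }
}

impl<'a> Iterator for WordRuns<'a> {
    type Item = &'a str;

    fn next(&mut self) -> Option<&'a str> {
        self.inner.next()
    }
}
```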

Manishearth

Thanks!!

PSeitz-dd added 3 commits July 11, 2025 13:01

add benchmark

592d99d

add ascii fastpath

eca9043

add test case IP

b5ed407

PSeitz-dd force-pushed the master branch from 9a85968 to b5ed407 July 11, 2025 06:27

add log to benches

9b1b7f9

Manishearth requested changes Jul 14, 2025

PSeitz-dd added 3 commits July 15, 2025 19:55

restore iterators

6f96a23

add backwards iterator

7beb8a6

restore test

a3881da

Manishearth reviewed Jul 15, 2025

PSeitz-dd added 2 commits July 16, 2025 20:15

replace Box with Enum

7599d62

add comments with reference to the spec

e29c432

PSeitz-dd added 3 commits July 17, 2025 13:18

remove unused alloc

5a09f28

readd Debug derive

f76a997

use import

b556333

Manishearth reviewed Jul 17, 2025

remove pub

0e7674a

Manishearth approved these changes Jul 28, 2025

Manishearth merged commit af87c8d into unicode-rs:master Jul 28, 2025
2 checks passed

Manishearth merged 13 commits into unicode-rs:master from PSeitz:master

3 participants
