Add ascii fast path for unicode_word_indices and unicode_words#147
Add ascii fast path for unicode_word_indices and unicode_words#147Manishearth merged 13 commits intounicode-rs:masterfrom
Conversation
| /// [`unicode_words`]: trait.UnicodeSegmentation.html#tymethod.unicode_words | ||
| /// [`UnicodeSegmentation`]: trait.UnicodeSegmentation.html | ||
| #[derive(Debug)] | ||
| pub struct UnicodeWords<'a> { |
There was a problem hiding this comment.
issue: removing these types is a breaking change
We deliberately have concrete types here so that they can be used inside other things conveniently.
There was a problem hiding this comment.
I updated the PR to keep the Iterator structs
(personally I prefer Iterators though as the contract is clearer)
src/word.rs
Outdated
| pub struct UnicodeWordIndices<'a> { | ||
| #[allow(clippy::type_complexity)] | ||
| inner: Filter<UWordBoundIndices<'a>, fn(&(usize, &str)) -> bool>, | ||
| inner: Box<dyn DoubleEndedIterator<Item = (usize, &'a str)> + 'a>, |
There was a problem hiding this comment.
I don't want this to allocate either.
There was a problem hiding this comment.
I replaced it with a custom enum iterator (it's also much faster :))
Manishearth
left a comment
There was a problem hiding this comment.
as it stands I think this is still a ways away from landing, for spec correctness reasons and for performance reasons. We should not be allocating a trait object here; you can get the same effect with a custom iterator that wraps an enum around the two iterators. (Either also works for this but I don't want to add a dependency to a somewhat bottom-of-the-deptree crate)
I replaced the Box with a custom enum. It's faster, and the performance hit for non-ascii is quite low now. |
src/word.rs
Outdated
| /// | ||
| /// Any other single ASCII byte is its own boundary (the default WB999). | ||
| #[derive(Debug)] | ||
| pub struct AsciiWordBoundIter<'a> { |
There was a problem hiding this comment.
This should not be pub, maybe pub(crate) if you need it to be
This PR adds a fast path for
unicode_word_indicesandunicode_words.Performance for ASCII text is greatly improved, while everything else gets a slight performance hit.