vello_common: Part Two of the strip rendering regression fix (make use of _mm_max_ps semantics)#1464
Open
tomcur wants to merge 3 commits intolinebender:mainfrom
Open
vello_common: Part Two of the strip rendering regression fix (make use of _mm_max_ps semantics)#1464tomcur wants to merge 3 commits intolinebender:mainfrom
_mm_max_ps semantics)#1464tomcur wants to merge 3 commits intolinebender:mainfrom
Conversation
…e of `_mm_max_ps` semantics)
Copying from the code comment:
> We know `ymin` and `ymax` are finite. We require the `max` operation to pick `ymin`
> if its first operand is NaN. On x86, that's precisely the semantics of `_mm_max_ps`,
> which `f32x4::max` emits. For AArch64, we do require the `f32x4::max_precise`
> semantics (as `max` returns NaN if either operand is NaN); however, the precise
> version should be comparatively less expensive than on x86. For `min`, we then know
> both operands are finite, so we can unambigously use the relaxed version.
This may be somewhat controversial, as it's `cfg`s behavior based on the
target architecture, and makes use of undocumented `fearless_simd`
behavior. The good part is that if this ever breaks, the tests will
fail loudly (because NaNs happen a lot here!).
Benches as follows on my x86 machine (i7-13700k):
```
render_strips/Ghostscript_Tiger_simd
time: [188.69 µs 189.12 µs 189.63 µs]
change: [-3.0455% -2.5888% -2.2385%] (p = 0.00 < 0.05)
Performance has improved.
Found 7 outliers among 50 measurements (14.00%)
7 (14.00%) high mild
render_strips/paris-30k_simd
time: [23.308 ms 23.373 ms 23.449 ms]
change: [-0.7675% +0.2120% +1.0973%] (p = 0.67 > 0.05)
No change in performance detected.
Found 3 outliers among 50 measurements (6.00%)
2 (4.00%) high mild
1 (2.00%) high severe
```
2671a93 to
be428af
Compare
LaurenzV
approved these changes
Feb 21, 2026
Collaborator
LaurenzV
left a comment
There was a problem hiding this comment.
I think this is reasonable, especially because we control fearless_simd (so if it ever changes there, we will be aware), and the tests do cover this. :)
| // both operands are finite, so we can unambigously use the relaxed version. If this | ||
| // ever breaks, tests should fail loudly, because NaNs happen a lot here! | ||
| trait F32x4MaxExt { | ||
| fn max_if_nan_take_second(self, rhs: Self) -> Self; |
Collaborator
There was a problem hiding this comment.
Should it maybe be max_if_one_nan_take_the_other to make the intention more clear? Up to you 😄
Member
Author
There was a problem hiding this comment.
Agree! I've named it max_if_first_nan_take_second as those are the semantics we need. The name max_if_one_nan_take_the_other doesn't quite work, because those are not the semantics that _mm_max_ps has: its semantics could be named max_if_either_nan_take_second.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
On top of Part One: #1463.
Copying from the code comment:
This may be somewhat controversial, as it
cfgs behavior based on the target architecture, and makes use of undocumentedfearless_simdbehavior. The good part is that if this ever breaks, the tests will fail loudly (because NaNs happen a lot here!).Benches as follows relative to
mainbefore bumpingfearless_simdon my x86 machine (i7-13700k).