vello_common: Part Two of the strip rendering regression fix (make use of `_mm_max_ps` semantics) by tomcur · Pull Request #1464 · linebender/vello

tomcur · 2026-02-20T22:49:13Z

On top of Part One: #1463.

Copying from the code comment:

We know ymin and ymax are finite. We require the max operation to pick ymin if its first operand is NaN. On x86, that's precisely the semantics of _mm_max_ps, which f32x4::max emits. For AArch64, we do require the f32x4::max_precise semantics (as max returns NaN if either operand is NaN); however, the precise version should be comparatively less expensive than on x86. For min, we then know both operands are finite, so we can unambigously use the relaxed version.

This may be somewhat controversial, as it cfgs behavior based on the target architecture, and makes use of undocumented fearless_simd behavior. The good part is that if this ever breaks, the tests will fail loudly (because NaNs happen a lot here!).

Benches as follows relative to main before bumping fearless_simd on my x86 machine (i7-13700k).

render_strips/Ghostscript_Tiger_simd
                        time:   [188.69 µs 189.12 µs 189.63 µs]
                        change: [-3.0455% -2.5888% -2.2385%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 50 measurements (14.00%)
  7 (14.00%) high mild
render_strips/paris-30k_simd
                        time:   [23.308 ms 23.373 ms 23.449 ms]
                        change: [-0.7675% +0.2120% +1.0973%] (p = 0.67 > 0.05)
                        No change in performance detected.
Found 3 outliers among 50 measurements (6.00%)
  2 (4.00%) high mild
  1 (2.

…e of `_mm_max_ps` semantics) Copying from the code comment: > We know `ymin` and `ymax` are finite. We require the `max` operation to pick `ymin` > if its first operand is NaN. On x86, that's precisely the semantics of `_mm_max_ps`, > which `f32x4::max` emits. For AArch64, we do require the `f32x4::max_precise` > semantics (as `max` returns NaN if either operand is NaN); however, the precise > version should be comparatively less expensive than on x86. For `min`, we then know > both operands are finite, so we can unambigously use the relaxed version. This may be somewhat controversial, as it's `cfg`s behavior based on the target architecture, and makes use of undocumented `fearless_simd` behavior. The good part is that if this ever breaks, the tests will fail loudly (because NaNs happen a lot here!). Benches as follows on my x86 machine (i7-13700k): ``` render_strips/Ghostscript_Tiger_simd time: [188.69 µs 189.12 µs 189.63 µs] change: [-3.0455% -2.5888% -2.2385%] (p = 0.00 < 0.05) Performance has improved. Found 7 outliers among 50 measurements (14.00%) 7 (14.00%) high mild render_strips/paris-30k_simd time: [23.308 ms 23.373 ms 23.449 ms] change: [-0.7675% +0.2120% +1.0973%] (p = 0.67 > 0.05) No change in performance detected. Found 3 outliers among 50 measurements (6.00%) 2 (4.00%) high mild 1 (2.00%) high severe ```

LaurenzV

I think this is reasonable, especially because we control fearless_simd (so if it ever changes there, we will be aware), and the tests do cover this. :)

LaurenzV · 2026-02-21T11:09:16Z

sparse_strips/vello_common/src/strip.rs

+            // both operands are finite, so we can unambigously use the relaxed version. If this
+            // ever breaks, tests should fail loudly, because NaNs happen a lot here!
+            trait F32x4MaxExt {
+                fn max_if_nan_take_second(self, rhs: Self) -> Self;


Should it maybe be max_if_one_nan_take_the_other to make the intention more clear? Up to you 😄

Agree! I've named it max_if_first_nan_take_second as those are the semantics we need. The name max_if_one_nan_take_the_other doesn't quite work, because those are not the semantics that _mm_max_ps has: its semantics could be named max_if_either_nan_take_second.

tomcur force-pushed the push-klrwpqzoltvm branch from 2671a93 to be428af Compare February 21, 2026 10:59

LaurenzV approved these changes Feb 21, 2026

View reviewed changes

tomcur added 2 commits February 22, 2026 10:46

Clarify

a46a86b

Typo

3eda7f6

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Comments

vello_common: Part Two of the strip rendering regression fix (make use of `_mm_max_ps` semantics)#1464

vello_common: Part Two of the strip rendering regression fix (make use of `_mm_max_ps` semantics)#1464
tomcur wants to merge 3 commits intolinebender:mainfrom
tomcur:push-klrwpqzoltvm

tomcur commented Feb 20, 2026 •

edited

Loading

Uh oh!

LaurenzV left a comment

Uh oh!

LaurenzV Feb 21, 2026

Uh oh!

tomcur Feb 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Comments

Conversation

tomcur commented Feb 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

LaurenzV left a comment

Choose a reason for hiding this comment

Uh oh!

LaurenzV Feb 21, 2026

Choose a reason for hiding this comment

Uh oh!

tomcur Feb 22, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

tomcur commented Feb 20, 2026 •

edited

Loading