Skip to content

Comments

vello_common: Part Two of the strip rendering regression fix (make use of _mm_max_ps semantics)#1464

Open
tomcur wants to merge 3 commits intolinebender:mainfrom
tomcur:push-klrwpqzoltvm
Open

vello_common: Part Two of the strip rendering regression fix (make use of _mm_max_ps semantics)#1464
tomcur wants to merge 3 commits intolinebender:mainfrom
tomcur:push-klrwpqzoltvm

Conversation

@tomcur
Copy link
Member

@tomcur tomcur commented Feb 20, 2026

On top of Part One: #1463.

Copying from the code comment:

We know ymin and ymax are finite. We require the max operation to pick ymin if its first operand is NaN. On x86, that's precisely the semantics of _mm_max_ps, which f32x4::max emits. For AArch64, we do require the f32x4::max_precise semantics (as max returns NaN if either operand is NaN); however, the precise version should be comparatively less expensive than on x86. For min, we then know both operands are finite, so we can unambigously use the relaxed version.

This may be somewhat controversial, as it cfgs behavior based on the target architecture, and makes use of undocumented fearless_simd behavior. The good part is that if this ever breaks, the tests will fail loudly (because NaNs happen a lot here!).

Benches as follows relative to main before bumping fearless_simd on my x86 machine (i7-13700k).

render_strips/Ghostscript_Tiger_simd
                        time:   [188.69 µs 189.12 µs 189.63 µs]
                        change: [-3.0455% -2.5888% -2.2385%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 50 measurements (14.00%)
  7 (14.00%) high mild
render_strips/paris-30k_simd
                        time:   [23.308 ms 23.373 ms 23.449 ms]
                        change: [-0.7675% +0.2120% +1.0973%] (p = 0.67 > 0.05)
                        No change in performance detected.
Found 3 outliers among 50 measurements (6.00%)
  2 (4.00%) high mild
  1 (2.

…e of `_mm_max_ps` semantics)

Copying from the code comment:

> We know `ymin` and `ymax` are finite. We require the `max` operation to pick `ymin`
> if its first operand is NaN. On x86, that's precisely the semantics of `_mm_max_ps`,
> which `f32x4::max` emits. For AArch64, we do require the `f32x4::max_precise`
> semantics (as `max` returns NaN if either operand is NaN); however, the precise
> version should be comparatively less expensive than on x86. For `min`, we then know
> both operands are finite, so we can unambigously use the relaxed version.

This may be somewhat controversial, as it's `cfg`s behavior based on the
target architecture, and makes use of undocumented `fearless_simd`
behavior. The good part is that if this ever breaks, the tests will
fail loudly (because NaNs happen a lot here!).

Benches as follows on my x86 machine (i7-13700k):

```
render_strips/Ghostscript_Tiger_simd
                        time:   [188.69 µs 189.12 µs 189.63 µs]
                        change: [-3.0455% -2.5888% -2.2385%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 50 measurements (14.00%)
  7 (14.00%) high mild
render_strips/paris-30k_simd
                        time:   [23.308 ms 23.373 ms 23.449 ms]
                        change: [-0.7675% +0.2120% +1.0973%] (p = 0.67 > 0.05)
                        No change in performance detected.
Found 3 outliers among 50 measurements (6.00%)
  2 (4.00%) high mild
  1 (2.00%) high severe
```
Copy link
Collaborator

@LaurenzV LaurenzV left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is reasonable, especially because we control fearless_simd (so if it ever changes there, we will be aware), and the tests do cover this. :)

// both operands are finite, so we can unambigously use the relaxed version. If this
// ever breaks, tests should fail loudly, because NaNs happen a lot here!
trait F32x4MaxExt {
fn max_if_nan_take_second(self, rhs: Self) -> Self;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should it maybe be max_if_one_nan_take_the_other to make the intention more clear? Up to you 😄

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree! I've named it max_if_first_nan_take_second as those are the semantics we need. The name max_if_one_nan_take_the_other doesn't quite work, because those are not the semantics that _mm_max_ps has: its semantics could be named max_if_either_nan_take_second.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants