Add dotproduct assembly documentation and godbolt links#270
miguelraz wants to merge 2 commits into rust-lang:master
Conversation
Force-pushed from 36ef56b to c97e141.
> This example code takes the dot product of two vectors. You are supposed to multiply each pair of elements and add them all together.
> The easiest way to inspect the assembly of the `scalar` code versions (the non-SIMD versions) is to [click this link](https://rust.godbolt.org/z/xM9Mxb14n) for a *mise en place* of what is going on.
I think it would be better to avoid non-English phrases, since not everyone knows French (I guess? I don't know what that phrase means).
> 1. SIMD comes in many flavors (instruction sets). These (like `sse`, `sse4.1`, `avx2`) describe the hardware capabilities of your current CPU. That is, if you don't have `avx512`, you physically do not have a SIMD vector that can hold 512 bytes at a time at most on your CPU.
Suggested change:

> 1. SIMD comes in many flavors (instruction sets). These (like `sse`, `sse4.1`, `avx2`) describe the hardware capabilities of your current CPU. That is, if you don't have `avx512`, you physically do not have any SIMD vector registers that can hold 512 bits at a time on your CPU.
> 2. You can switch between different instruction sets by changing the `#![target-feature(...)]` macro above the function, as well as declaring it unsafe.
Suggested change:

> 2. You can switch between different instruction sets by both changing the `#![target-feature(...)]` macro above the function and declaring it unsafe.
declaring it unsafe by itself doesn't change the target features.
> 3. Inside Godbolt, you can hover over an instruction to display a tooltip of what it says. Try hovering your mouse over `mulps` and reading what it says.
I suggest phrasing in terms of "what the instruction does" rather than "what it says".
> We need to find a way to reduce the amount of *data movement*. We're not doing enough work for all the moving floats into and out of the `xmm` registers. This isn't surprising if we stop and try to look at the code for a bit: `dot_prod_simd_0` is loading 4 floats into `xmm` `a`, then the corresponding 4 floats from `b`, multiplying them (the efficient part), and then doing a `reduce_sum`. In general, SIMD reductions inside a tight loop are a perf anti-pattern, and you should try and figure out a way to make those reductions `element-wise` and not `vector-wise`. This is what we see in the following snippet:
element-wise vs. vector-wise reductions -- not clear, should be rephrased, maybe by describing what they do rather than naming them.
> Now we will exploit the `mul_add` instruction. Open [this link to view the snippets side by side once again](https://rust.godbolt.org/z/vPTqG13vK). We've started off with a simple computation: adding and multiplying. Even though the arithmetic operations are not complicated, the performance payoff can come form knowing specific hardware capabilities like `mul_add`: in a single instruction, it can multiply 2 SIMD vectors and add them into a 3rd, which can cut swaths in the data movement overheads `xmm` registers can carry. Other instructions like inverse square roots are available (which are very popular for physics calculations), and it can get oodles more complex depending on the problem - there's published algorithms with `shuffles`, `swizzles` and `casts` for [decoding UTF8](https://arxiv.org/pdf/2010.03090.pdf), all in SIMD registers and with fancy table lookups. We won't talk about those here, but we just want to point out that firstly, reading the books can pay off drastically, and second, we're starting small to show the concepts, like using `mul_add` in the next snippet:
"can cut swaths in the data movement overheads xmm registers can carry" -- unclear, should be rephrased.
Urhengulas left a comment:

I just spotted two potential typos while reading through your PR 😄
> In `dot_prod_simd_1`, we tried out the `fold` patter from our previous `scalar` code snippet examples. This pattern, when implemented via SIMD instructions naively, means that for every `f32x4` `element`-wise multiplication, we accumulate into a (initially `0` valued `f32x4` SIMD vector) and then finally do a `reduce_sum` at the end to get the final result. This
Suggested change:

> In `dot_prod_simd_1`, we tried out the `fold` pattern from our previous `scalar` code snippet examples. This pattern, when implemented via SIMD instructions naively, means that for every `f32x4` `element`-wise multiplication, we accumulate into a (initially `0` valued `f32x4` SIMD vector) and then finally do a `reduce_sum` at the end to get the final result. This
Probably this should be "can come from", not "can come form".

Suggested change:

> Now we will exploit the `mul_add` instruction. Open [this link to view the snippets side by side once again](https://rust.godbolt.org/z/vPTqG13vK). We've started off with a simple computation: adding and multiplying. Even though the arithmetic operations are not complicated, the performance payoff can come from knowing specific hardware capabilities like `mul_add`: in a single instruction, it can multiply 2 SIMD vectors and add them into a 3rd, which can cut swaths in the data movement overheads `xmm` registers can carry. Other instructions like inverse square roots are available (which are very popular for physics calculations), and it can get oodles more complex depending on the problem - there's published algorithms with `shuffles`, `swizzles` and `casts` for [decoding UTF8](https://arxiv.org/pdf/2010.03090.pdf), all in SIMD registers and with fancy table lookups. We won't talk about those here, but we just want to point out that firstly, reading the books can pay off drastically, and second, we're starting small to show the concepts, like using `mul_add` in the next snippet:
Not yet finished but I wanted to save my work for a bit.
Adding a bunch of text to README.md, with some (may I say) nicely curated Rust godbolt links and displays.
The stdsimd docs don't yet have a "voice/tone", let me know if it needs a course correction.