Add dotproduct assembly documentation and godbolt links#270
miguelraz wants to merge 2 commits into rust-lang:master
Conversation
Force-pushed from 36ef56b to c97e141.
> This example code takes the dot product of two vectors. You are supposed to multiply each pair of elements and add them all together.
> The easiest way to inspect the assembly of the `scalar` code versions (the non-SIMD versions) is to [click this link](https://rust.godbolt.org/z/xM9Mxb14n) for a *mise en place* of what is going on.
I think it would be better to avoid non-English phrases, since not everyone knows French (I guess? I don't know what that phrase means).
> 1. SIMD comes in many flavors (instruction sets). These (like `sse`, `sse4.1`, `avx2`) describe the hardware capabilities of your current CPU. That is, if you don't have `avx512`, you physically do not have a SIMD vector that can hold 512 bytes at a time at most on your CPU.
Suggested change:

> 1. SIMD comes in many flavors (instruction sets). These (like `sse`, `sse4.1`, `avx2`) describe the hardware capabilities of your current CPU. That is, if you don't have `avx512`, you physically do not have any SIMD vector registers that can hold 512 bits at a time on your CPU.
> 2. You can switch between different instruction sets by changing the `#![target-feature(...)]` macro above the function, as well as declaring it unsafe.
Suggested change:

> 2. You can switch between different instruction sets by both changing the `#![target-feature(...)]` macro above the function and declaring it unsafe.
declaring it unsafe by itself doesn't change the target features.
> 3. Inside Godbolt, you can hover over an instruction to display a tooltip of what it says. Try hovering your mouse over `mulps` and reading what it says.
I suggest phrasing in terms of "what the instruction does" rather than "what it says".
> We need to find a way to reduce the amount of *data movement*. We're not doing enough work for all the moving floats into and out of the `xmm` registers. This isn't surprising if we stop and try to look at the code for a bit: `dot_prod_simd_0` is loading 4 floats into `xmm` `a`, then the corresponding 4 floats from `b`, multiplying them (the efficient part), and then doing a `reduce_sum`. In general, SIMD reductions inside a tight loop are a perf anti-pattern, and you should try and figure out a way to make those reductions `element-wise` and not `vector-wise`. This is what we see in the following snippet:
element-wise vs. vector-wise reductions -- not clear, should be rephrased, maybe by describing what they do rather than naming them.
> Now we will exploit the `mul_add` instruction. Open [this link to view the snippets side by side once again](https://rust.godbolt.org/z/vPTqG13vK). We've started off with a simple computation: adding and multiplying. Even though the arithmetic operations are not complicated, the performance payoff can come form knowing specific hardware capabilities like `mul_add`: in a single instruction, it can multiply 2 SIMD vectors and add them into a 3rd, which can cut swaths in the data movement overheads `xmm` registers can carry. Other instructions like inverse square roots are available (which are very popular for physics calculations), and it can get oodles more complex depending on the problem - there's published algorithms with `shuffles`, `swizzles` and `casts` for [decoding UTF8](https://arxiv.org/pdf/2010.03090.pdf), all in SIMD registers and with fancy table lookups. We won't talk about those here, but we just want to point out that firstly, reading the books can pay off drastically, and second, we're starting small to show the concepts, like using `mul_add` in the next snippet:
"can cut swaths in the data movement overheads xmm registers can carry" -- unclear, should be rephrased.
Urhengulas left a comment:

I just spotted two potential typos while reading through your PR 😄
> In `dot_prod_simd_1`, we tried out the `fold` patter from our previous `scalar` code snippet examples. This pattern, when implemented via SIMD instructions naively, means that for every `f32x4` `element`-wise multiplication, we accumulate into a (initially `0` valued `f32x4` SIMD vector) and then finally do a `reduce_sum` at the end to get the final result. This
Suggested change:

> In `dot_prod_simd_1`, we tried out the `fold` pattern from our previous `scalar` code snippet examples. This pattern, when implemented via SIMD instructions naively, means that for every `f32x4` `element`-wise multiplication, we accumulate into a (initially `0` valued `f32x4` SIMD vector) and then finally do a `reduce_sum` at the end to get the final result. This
Probably this should be "can come from", not "can come form".

Suggested change:

> Now we will exploit the `mul_add` instruction. Open [this link to view the snippets side by side once again](https://rust.godbolt.org/z/vPTqG13vK). We've started off with a simple computation: adding and multiplying. Even though the arithmetic operations are not complicated, the performance payoff can come from knowing specific hardware capabilities like `mul_add`: in a single instruction, it can multiply 2 SIMD vectors and add them into a 3rd, which can cut swaths in the data movement overheads `xmm` registers can carry. Other instructions like inverse square roots are available (which are very popular for physics calculations), and it can get oodles more complex depending on the problem - there's published algorithms with `shuffles`, `swizzles` and `casts` for [decoding UTF8](https://arxiv.org/pdf/2010.03090.pdf), all in SIMD registers and with fancy table lookups. We won't talk about those here, but we just want to point out that firstly, reading the books can pay off drastically, and second, we're starting small to show the concepts, like using `mul_add` in the next snippet:
Not yet finished but I wanted to save my work for a bit.
Adding a bunch of text to README.md, with some (may I say) nicely curated Rust godbolt links and displays.
The stdsimd docs don't yet have a "voice/tone", let me know if it needs a course correction.