Multithreaded array initialization by carstenbauer · Pull Request #68 · omlins/ParallelStencil.jl

carstenbauer · 2022-10-04T14:54:15Z

For better performance on systems with multiple NUMA domains. See my extensive comment on Discourse.

With this PR, I get about 40% speedup for this example (with USE_GPU=false) when using a full AMD Zen3 CPU (64 cores, 4 NUMA domains) of Noctua 2.

Timings (s) before

╭───────────┬─────────┬─────────┬─────────╮
│ # Threads │       1 │       8 │      64 │
├───────────┼─────────┼─────────┼─────────┤
│   compact │ 12.8708 │ 2.42357 │ 2.43713 │
│    spread │ 12.8708 │ 2.38331 │  3.3897 │
╰───────────┴─────────┴─────────┴─────────╯

Timings (s) after

╭───────────┬─────────┬─────────┬─────────╮
│ # Threads │       1 │       8 │      64 │
├───────────┼─────────┼─────────┼─────────┤
│   compact │ 12.8762 │ 2.41895 │ 1.51899 │
│    spread │ 12.8762 │ 2.35042 │ 2.08579 │
╰───────────┴─────────┴─────────┴─────────╯

Speedup in %

╭───────────┬─────┬─────┬──────╮
│ # Threads │   1 │   8 │   64 │
├───────────┼─────┼─────┼──────┤
│   compact │ 0.0 │ 0.0 │ 38.0 │
│    spread │ 0.0 │ 1.0 │ 38.0 │
╰───────────┴─────┴─────┴──────╯
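
For reference, the percentages can be reproduced from the two timing tables as the relative reduction in runtime. This is only a sketch: the helper name `speedup_percent` is made up for illustration, and clamping negative values (the 1-thread case is marginally slower after the change) to zero is an assumption about how the table was produced.

```julia
# Timings (s) copied from the tables above; rows are (compact, spread),
# columns are 1, 8 and 64 threads.
before = [12.8708 2.42357 2.43713;
          12.8708 2.38331 3.3897]
after  = [12.8762 2.41895 1.51899;
          12.8762 2.35042 2.08579]

# Hypothetical helper: relative runtime reduction in percent,
# clamped at zero (assumption) and rounded to whole percent.
speedup_percent(b, a) = round(max(0.0, 100 * (b - a) / b); digits=0)

# Broadcast over both matrices element by element.
speedup = speedup_percent.(before, after)
```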

NOTES:

- We see that the changes have essentially no impact on the single-threaded case but give speedups when run with many threads (on a multi-NUMA-domain system).
- We see that if we stay within one NUMA domain (e.g. 8 threads) we don't observe a speedup (as expected).
- `compact` and `spread` indicate the thread pinning strategy.
- Ideally, the access pattern of the parallel initialization should match the access pattern of the stencil as much as possible. In this PR, I just do the "trivial" parallel initialization. (In principle, one could think about passing the custom user kernel to `@zeros` and co, analyzing its structure, and then initializing "accordingly". But that's difficult...)
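
To illustrate what "trivial" parallel initialization means here, below is a minimal sketch of the general first-touch idea. It is not the code in this PR: `threaded_zeros` is a hypothetical helper and not part of the ParallelStencil API. It assumes the operating system places a memory page in the NUMA domain of the thread that first writes to it (the usual default on Linux) and that the Julia threads are pinned.

```julia
using Base.Threads

# Hypothetical helper (not ParallelStencil API): allocate an uninitialized
# array, then let every thread write ("first touch") its own part of it.
function threaded_zeros(::Type{T}, dims::Integer...) where {T}
    A = Array{T}(undef, dims...)
    # :static assigns each thread one fixed, contiguous block of iterations,
    # so each thread touches one contiguous region of the array.
    @threads :static for i in eachindex(A)
        @inbounds A[i] = zero(T)
    end
    return A
end

A = threaded_zeros(Float64, 64, 64, 64)
```

A serial `zeros(Float64, 64, 64, 64)` would instead have a single thread touch the whole array, so under a first-touch policy all of its pages would tend to end up in that one thread's NUMA domain. Whether the static chunking above matches the access pattern of a given stencil kernel depends on how that kernel distributes its iterations.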

cc @luraess @omlins

PS: Working on it at the GPU4GEO Hackathon in the Schwarzwald 😉

luraess · 2022-10-04T17:29:46Z

Thanks for the contribution. I guess having something in PS for the Threads backend to control pinning and the thread-to-core mapping (or to have a close-to-optimal default solution) would be great! Especially for AMD CPUs with many NUMA regions, where this becomes significant.

carstenbauer · 2022-10-05T08:11:38Z

BTW, @omlins, depending on how easy/difficult it would be to give me test access to Piz Daint I could run some benchmarks there as well.

omlins · 2022-10-06T16:13:02Z

@carstenbauer, as Ludovic probably told you already, Piz Daint does not have any AMD CPUs. Thus, for testing this, Superzack (Ludovic's cluster) will be better.

carstenbauer · 2022-10-21T12:26:03Z

I quickly tested another example, namely https://github.com/omlins/ParallelStencil.jl/blob/main/miniapps/acoustic3D.jl (with the visualization/animation part commented out). Same configuration as above, i.e. a 64-core node of Noctua 2 with 64 Julia threads that I pinned compactly. Below are the timings of the acoustic3D() function before and with this PR.

# Before PR: 44.315157 seconds (779.52 k allocations: 840.038 MiB, 1.09% gc time)
# With PR: 18.557505 seconds (791.20 k allocations: 840.475 MiB, 2.71% gc time)

This corresponds to about a 2.4x speedup. (cc @luraess)

omlins · 2022-12-12T10:35:27Z

This relates also to #53 (comment)

carstenbauer · 2023-07-03T07:38:45Z

What's holding back merging this?

ranocha · 2023-09-13T09:57:35Z

Bump

luraess · 2023-10-19T09:21:49Z

@omlins bump

carstenbauer added 4 commits October 4, 2022 11:54

- first try (ca855db)
- kinda works (da12413)
- works (04c2b11)
- undo (0234451)

omlins self-requested a review October 4, 2022 17:46

carstenbauer marked this pull request as ready for review December 2, 2022 11:23

carstenbauer wants to merge 4 commits into omlins:main from carstenbauer:cb/parallelinit

4 participants
