Interface ectrans with GPU backend #252
Tagging @fmahebert FYI
Hi @wdeconinck, what is the current status of this PR? I ask because I've tried to build and run it locally with a GPU-enabled build of ecTrans, but I'm getting errors when running some of the Atlas trans tests. For example, in atlas/src/tests/trans/test_trans.cc, line 377 in f988397: dist_spec).
I've built ecTrans using the NVHPC/25.1 compilers and the HPC-X MPI implementation (OpenMPI 4.1.7) that the SDK comes with, and all the ecTrans tests pass (CPU and GPU). However, when I link Atlas to ecTrans and run the tests, I get the failures mentioned above. I'm starting to wonder whether I'm building things incorrectly or missing some runtime flag. Would you be able to share how you've built this branch of Atlas and ecTrans?
Hi @l90lpa, I have just tested this with NVHPC 22.11 and saw no issues like that. My loaded modules:
Note that I am not using the OpenMPI that came with the SDK here. I built the following projects with these CMake options:
Force-pushed from cd69b7e to 537cc8c.
Now rebased on latest release.
Private downstream CI failed.
Hi @wdeconinck, thanks for getting back to me and sharing your build set-up! I'll try to recreate a similar environment and see if I have better luck.
Hi @wdeconinck, thanks again for sharing your build environment. I was able to get Atlas+ecTrans working using NVHPC 22.11. However, I've been having trouble building some of our code (and dependencies) with the NVHPC 22.11 compilers, so I was wondering whether you have a build environment with a recent version of NVHPC that you know works? I ask because I get test failures when I move to newer versions of NVHPC, as mentioned above.
I could reproduce some issues with nvidia/24.5. The issues seem not to stem from using ectrans-gpu. |
I have managed to compile atlas with nvidia/24.5 and nvidia/24.11 using #278. I have rebased this branch to include these changes, so it should now work. Another thing: by default, all atlas tests are run with floating-point-exception trapping enabled. This can trigger spurious signals in vectorised code such as `if(x!=0) atan2(y,x)`, because with AVX2 the masking comes after the signal has been sent. For this reason it may be necessary to turn off floating-point-exception trapping (only for running the tests). You can do this in the environment with export ATLAS_FPE=0
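To illustrate the masking issue described above, here is a minimal sketch, not taken from atlas or ectrans. It makes several assumptions: a plain division stands in for the `atan2` call (a scalar `atan2` need not raise any flag), a single value stands in for one SIMD lane, and the code inspects the floating-point status flags with `std::fetestexcept` instead of enabling traps. All function names are hypothetical. With trapping enabled, the flag raised in the second variant would be delivered as a signal even though the result is later masked out.

```cpp
#include <cassert>
#include <cfenv>

// Scalar form: the division only executes when the guard holds,
// so nothing is raised for x == 0.
double guarded_scalar(double y, double x) {
    if (x != 0.0) {
        volatile double q = y / x;  // volatile keeps the division inside the branch
        return q;
    }
    return 0.0;
}

// Emulates one lane of masked vector code: the operation runs for every
// lane first, and the mask is applied to the result afterwards.
double compute_then_mask(double y, double x) {
    volatile double q = y / x;    // executes unconditionally
    return (x != 0.0) ? q : 0.0;  // mask applied after the fact
}

// Returns true if evaluating f(y, x) leaves FE_DIVBYZERO or FE_INVALID set.
template <typename F>
bool raises_fp_flag(F f, double y, double x) {
    volatile double vy = y, vx = x;
    std::feclearexcept(FE_ALL_EXCEPT);
    double a = vy, b = vx;  // volatile reads happen after the flags are cleared
    volatile double r = f(a, b);
    (void)r;
    return std::fetestexcept(FE_DIVBYZERO | FE_INVALID) != 0;
}

// Usage sketch: both variants return the same value, but only the
// compute-then-mask variant leaves an exception flag behind.
void demo_fpe_masking() {
    assert(!raises_fp_flag(guarded_scalar, 1.0, 0.0));
    assert(raises_fp_flag(compute_then_mask, 1.0, 0.0));
}
```

Both functions return 0.0 for x == 0, so the difference is only visible through the floating-point environment, which is why disabling trapping for the test run is a reasonable workaround.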
Force-pushed from 537cc8c to 01b66a4.
When the feature "ECTRANS_GPU" is enabled, atlas now offloads all possible spectral transforms to ectrans with the GPU backend.
Note that not all functionality is implemented yet; unimplemented features throw a not-implemented exception.
By default, the unit tests ignore features that are not implemented, as signalled by such an exception.
The exception handling depends on an ectrans pull request: ecmwf-ifs/ectrans#193
Without that ectrans pull request, the tests will compile but abort/crash at run time.
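The test behaviour described above could look roughly like the following sketch. It is not the code in this PR: `NotImplementedError`, `Outcome`, `run_ignoring_not_implemented` and the two transform functions are hypothetical names introduced for illustration only. The point is that a test body is recorded as skipped when it throws the not-implemented exception, while any other exception still propagates as a failure. The pattern only works if the backend reports the missing feature by throwing; if it aborts instead, there is nothing for the test to catch.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>

// Hypothetical stand-in for the not-implemented exception.
struct NotImplementedError : std::runtime_error {
    using std::runtime_error::runtime_error;
};

enum class Outcome { Passed, Skipped };

// Runs a test body. A not-implemented exception is recorded as a skip;
// every other exception propagates to the caller as a real failure.
Outcome run_ignoring_not_implemented(const std::function<void()>& body) {
    try {
        body();
        return Outcome::Passed;
    } catch (const NotImplementedError&) {
        return Outcome::Skipped;
    }
}

// Hypothetical transform entry points used only for this example.
void implemented_transform() {}
void unimplemented_transform() {
    throw NotImplementedError("feature not available with the GPU backend");
}

// Usage sketch.
void demo_not_implemented() {
    assert(run_ignoring_not_implemented(implemented_transform) == Outcome::Passed);
    assert(run_ignoring_not_implemented(unimplemented_transform) == Outcome::Skipped);
}
```

Catching only the specific exception type keeps genuine errors visible, so a skip never hides a wrong result.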