Split torch tool for cpu, cuda and rocm #10390
smuzaffar merged 1 commit into IB/CMSSW_16_1_X/py312
|
enable gpu |
|
A new Pull Request was created by @smuzaffar for branch IB/CMSSW_16_1_X/py312. @akritkbehera, @cmsbuild, @iarspider, @raoatifshad, @smuzaffar can you please review it and eventually sign? Thanks. |
|
cms-bot internal usage |
|
please test with cms-sw/cmssw-config#117 using full cmssw |
|
-1 Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-b5dec0/51700/summary.html Failed External Build. I found compilation error when building: set_property could not find TARGET torch_cuda. Perhaps it has not yet been created. -- Configuring incomplete, errors occurred! error: Bad exit status from /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/tmp/rpm-tmp.GwcHyT (%build) RPM build errors: Bad exit status from /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/tmp/rpm-tmp.GwcHyT (%build) * The action "build-install-external+pytorch-cluster+1.6.3-24509dcecfe3ecca5dcdbeba91f58f58" was not completed successfully because The following dependencies could not complete: |
|
Pull request #10390 was updated. |
|
please test |
|
-1 Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-b5dec0/51750/summary.html Failed External Build. I found compilation error when building: + DUP_BIN=torchrun + set +x torchrun:py3-torch torchrun:py3-torch-cuda ERROR: Duplicate python binaries found. Please cleanup and make sure only one binary is available. error: Bad exit status from /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/tmp/rpm-tmp.MUT1X4 (%install) RPM build errors: Bad exit status from /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/tmp/rpm-tmp.MUT1X4 (%install) * The action "build-srpm-cms+cmssw-tools+6.0-87f515c1e140fba3b0dce87520911fe2" was not completed successfully because The following dependencies could not complete: |
|
Pull request #10390 was updated. |
|
please test with cms-sw/cmssw-config#117 using full cmssw |
|
+1 Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-b5dec0/51850/summary.html Comparison Summary:
|
|
Pull request #10390 was updated. |
35a63f5 to 72b83fd
|
Pull request #10390 was updated. |
|
please test with cms-sw/cms-bot#2702 using full cmssw |
|
+externals Good to go in Python 3.12 based IBs (i.e. PY312 and GCC15 IBs) |
|
This pull request is fully signed and it will be integrated in one of the next IB/CMSSW_16_1_X/py312 IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @mandrenguyen, @sextonkennedy, @ftenchini (and backports should be raised in the release meeting by the corresponding L2) |
This PR proposes to build Torch separately for cpu, cuda and rocm. For now torch-rocm is not enabled as it requires the full rocm distribution (nearly 11GB) [a]. We should be able to build most/all of [a] from sources using https://github.com/ROCm/rocm-libraries.

For now this PR proposes to split torch into cpu and cuda parts as a proof of concept that we can dynamically select the env depending on available resources. Currently this PR proposes the following:
- py3-torch: cpu only C++ API and python interface
- py3-torch-cuda: cuda and cpu C++ API and python interface
- py3-torch-<extensions> (e.g. cluster, scatter and sparse): cpu only python interface
- py3-torch-<extensions>-cuda (e.g. cluster, scatter and sparse): cuda and cpu python interface

As we do not have py3-torch-rocm yet (which needs the full 11GB rocm distribution), on amdgpu hosts we fall back to the cpu only interface. SCRAM will dynamically set the env, e.g.
- on cpu only hosts:
  - <path>/lib
  - <path>/lib/python3.12/site-packages
- on cuda hosts:
  - <path>/lib/scram_cuda:<path>/lib (where <path>/lib/scram_cuda contains libs for cuda and cpu torch)
  - <path>/lib/python3.12/site-packages/scram_cuda:<path>/lib/python3.12/site-packages
- for rocm, for now we fall back to cpu only libs/python modules.
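
To illustrate the path selection described above, here is a minimal Python sketch. It is not SCRAM's actual implementation: the function name `torch_search_paths`, its parameters and the backend strings are hypothetical, and only the directory layout (`lib`, `lib/scram_cuda`, `lib/python3.12/site-packages` and its `scram_cuda` subdirectory) is taken from this description. It assumes the backend has already been detected elsewhere.

```python
from pathlib import PurePosixPath


def torch_search_paths(base, backend, python_version="3.12"):
    """Return (library_path, python_path) as colon-separated strings.

    Hypothetical helper: `base` stands for <path>, `backend` is one of
    "cpu", "cuda" or "rocm" as detected elsewhere.
    """
    lib = PurePosixPath(base) / "lib"
    site = lib / f"python{python_version}" / "site-packages"

    lib_dirs = [lib]
    py_dirs = [site]

    if backend == "cuda":
        # The cuda directories go first so they take precedence over
        # the cpu only libs and modules.
        lib_dirs.insert(0, lib / "scram_cuda")
        py_dirs.insert(0, site / "scram_cuda")
    # "rocm" (and anything else) falls back to the cpu only paths,
    # since py3-torch-rocm is not enabled yet.

    return (
        ":".join(str(d) for d in lib_dirs),
        ":".join(str(d) for d in py_dirs),
    )


if __name__ == "__main__":
    for backend in ("cpu", "cuda", "rocm"):
        print(backend, torch_search_paths("<path>", backend))
```

With this layout a cuda host sees the `scram_cuda` directories ahead of the plain ones, while cpu only and amdgpu hosts get identical search paths.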
I have also added triton and aotriton specs so that, if we have to build torch-rocm fully from sources, we can use aotriton. For now torch-rocm downloads a prebuilt aotriton bundle at build time.
This needs cms-sw/cmssw-config#117 too.
[a]