[PY312] Build PyTorch with rocm, with cuda or without accelerators#10123
iarspider wants to merge 5 commits into IB/CMSSW_16_1_X/py312
|
please test |
|
A new Pull Request was created by @iarspider for branch IB/CMSSW_16_0_X/py312. @akritkbehera, @iarspider, @smuzaffar can you please review it and eventually sign? Thanks. |
|
cms-bot internal usage |
|
-1
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-f6000c/48668/summary.html
External Build
I found compilation error when building:
patching file torch/csrc/profiler/events.h
Hunk #1 FAILED at 11.
1 out of 2 hunks FAILED -- saving rejects to file torch/csrc/profiler/events.h.rej
patching file torch/csrc/profiler/orchestration/observer.h
Hunk #1 succeeded at 23 (offset -1 lines).
error: Bad exit status from /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/tmp/rpm-tmp.ai8irA (%prep)
RPM build errors:
Bad exit status from /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/tmp/rpm-tmp.ai8irA (%prep)
* The action "install-external+pytorch+2.6.0-46fb6e29091bb0ed685e2e887eb98b78" was not completed successfully because The following dependencies could not complete: |
|
please test |
|
Pull request #10123 was updated. |
|
-1
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-f6000c/48671/summary.html
External Build
I found compilation warning when building: See details on the summary page. |
7e4997c to 5c7e8a1
|
please test |
|
Pull request #10123 was updated. |
|
-1
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-f6000c/48672/summary.html
External Build
I found compilation warning when building: See details on the summary page. |
5c7e8a1 to b51cbc0
b51cbc0 to 6254b79
|
Pull request #10123 was updated. |
|
please test |
|
-1
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-f6000c/48677/summary.html
External Build
I found compilation error when building:
provides a separate development package or SDK, be sure it has been installed.
-- Configuring incomplete, errors occurred!
error: Bad exit status from /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/tmp/rpm-tmp.3vuYk2 (%build)
RPM build errors:
Bad exit status from /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/tmp/rpm-tmp.3vuYk2 (%build)
* The action "build-external+pytorch-cluster+1.6.3-9764097db7206957e6b17daaf095593e" was not completed successfully because Failed to build pytorch-cluster. Log file in /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/BUILD/el8_amd64_gcc13/external/pytorch-cluster/1.6.3-9764097db7206957e6b17daaf095593e/log. Final lines of the log file: |
|
please test |
|
Pull request #10123 was updated. |
|
-1
Failed Tests: Build
Build
I found compilation error when building:
>> Entering Package PhysicsTools/PyTorch
>> Leaving Package PhysicsTools/PyTorch
>> Package PhysicsTools/PyTorch built
>> Compiling src/PhysicsTools/PyTorch/test/testTorch.cc
/data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/el8_amd64_gcc13/external/gcc/13.4.0-6908cfdf803923e783448096ca4f0923/bin/c++ -c -DCMS_MICRO_ARCH='x86-64-v3' -DGNU_GCC -D_GNU_SOURCE -DCMSSW_GIT_HASH='CMSSW_16_0_PY312_X_2025-10-08-2300' -DPROJECT_NAME='CMSSW' -DPROJECT_VERSION='CMSSW_16_0_PY312_X_2025-10-08-2300' -Isrc -Ipoison -I/cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc13/cms/cmssw/CMSSW_16_0_PY312_X_2025-10-08-2300/src -O3 -pthread -pipe -Werror=main -Werror=pointer-arith -Werror=overlength-strings -Wno-vla -Werror=overflow -std=c++20 -ftree-vectorize -Werror=array-bounds -Werror=format-contains-nul -Werror=type-limits -fvisibility-inlines-hidden -fno-math-errno --param vect-max-version-for-alias-checks=50 -Xassembler --compress-debug-sections -Wno-error=array-bounds -Warray-bounds -fuse-ld=bfd -march=x86-64-v3 -felide-constructors -fmessage-length=0 -Wall -Wno-non-template-friend -Wno-long-long -Wreturn-type -Wextra -Wpessimizing-move -Wclass-memaccess -Wno-cast-function-type -Wno-unused-but-set-parameter -Wno-ignored-qualifiers -Wno-unused-parameter -Wunused -Wparentheses -Werror=return-type -Werror=missing-braces -Werror=unused-value -Werror=unused-label -Werror=address -Werror=format -Werror=sign-compare -Werror=write-strings -Werror=delete-non-virtual-dtor -Werror=strict-aliasing -Werror=narrowing -Werror=unused-but-set-variable -Werror=reorder -Werror=unused-variable -Werror=conversion-null -Werror=return-local-addr -Wnon-virtual-dtor -Werror=switch -fdiagnostics-show-option -Wno-unused-local-typedefs -Wno-attributes -Wno-psabi -DBOOST_DISABLE_ASSERTS -flto=auto -fipa-icf -flto-odr-type-merging -fno-fat-lto-objects -Wodr -fPIC -MMD -MF tmp/el8_amd64_gcc13/src/PhysicsTools/PyTorch/test/testTorch/testTorch.cc.d src/PhysicsTools/PyTorch/test/testTorch.cc -o tmp/el8_amd64_gcc13/src/PhysicsTools/PyTorch/test/testTorch/testTorch.cc.o
src/PhysicsTools/PyTorch/test/testTorch.cc:2:10: fatal error: torch/torch.h: No such file or directory
    2 | #include <torch/torch.h>
      |          ^~~~~~~~~~~~~~~
compilation terminated.
gmake: *** [tmp/el8_amd64_gcc13/src/PhysicsTools/PyTorch/test/testTorch/testTorch.cc.o] Error 1
>> Building binary testTorch
|
|
Pull request #10123 was updated. |
|
Hi @iarspider @smuzaffar! Just for my understanding, what is the current technical issue with the simultaneous ROCm and CUDA support for PyTorch compilation? Is there something we can do to help on this issue? Thanks a lot and sorry for asking again |
|
My understanding is that PyTorch does not support building for both CUDA and ROCm at the same time. The idea would be to have 2 or 3 independent builds from the same source tree (1 for CUDA, 1 for ROCm, maybe 1 for CPU-only), and somehow pick the correct one based on what CMSSW wants to use (CUDA, ROCm, or CPU-only). I'm not sure if there was any progress on this front, though. I think if you or the ML group wants to help, that would be very welcome :-) |
|
@valsdav, one can't build ROCm and CUDA versions from a single source tree because for ROCm the build process heavily modifies the source code (renames classes, methods, files) before the actual compilation happens. We never tested whether we can manually merge ROCm and CUDA installations after building, but I would expect a lot of headaches and weird side effects if someone tries. So, as @fwyzard said, the idea is to have 2 or 3 builds (CPU, CUDA, ROCm) and somehow switch between them at runtime. |
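Since each build would target exactly one backend, it can help to check which flavor a given installation actually is. A minimal sketch in Python, assuming the flavor is inferred from the toolkit versions that PyTorch reports in `torch.version.cuda` and `torch.version.hip` (each is `None` when the build does not target that toolkit); the helper name `build_flavor` is hypothetical and not part of PyTorch or CMSSW:

```python
from typing import Optional


def build_flavor(cuda_version: Optional[str], hip_version: Optional[str]) -> str:
    """Classify a PyTorch build as 'cuda', 'rocm' or 'cpu' (hypothetical helper)."""
    if cuda_version and hip_version:
        # Assumption from the discussion: one build never targets both toolkits.
        raise ValueError("a single build is not expected to report both CUDA and HIP")
    if hip_version:
        return "rocm"
    if cuda_version:
        return "cuda"
    return "cpu"


# With PyTorch installed, the inputs would come from the torch.version module:
#   import torch
#   build_flavor(torch.version.cuda, getattr(torch.version, "hip", None))
```

Passing the version strings in as arguments keeps the helper usable (and testable) on a machine where PyTorch itself is not installed.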
@valsdav, with the scram dynamic runtime env we should be able to get torch for cpu/cuda/rocm in cmssw. I have a rough idea of how to do it; let me first get torch 2.10.0 into the PY312 IBs, and then I will see what we can do to get torch built for rocm/cuda.
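The "several builds, pick one at runtime" idea could be sketched roughly as follows. This is only an illustration in Python, not how scram implements its dynamic runtime environment: the environment variable `TORCH_BACKEND`, the function `select_torch_build` and the install paths are all hypothetical.

```python
import os
from typing import Mapping, Optional, Tuple


def select_torch_build(
    available: Mapping[str, str],
    env: Optional[Mapping[str, str]] = None,
) -> Tuple[str, str]:
    """Return (backend, install_path) for the requested backend.

    `available` maps a backend name ('cpu', 'cuda', 'rocm') to the install
    path of the matching build. The request is read from the hypothetical
    TORCH_BACKEND variable and falls back to the CPU-only build.
    """
    env = os.environ if env is None else env
    requested = env.get("TORCH_BACKEND", "cpu").lower()
    if requested in available:
        return requested, available[requested]
    if "cpu" in available:
        # The requested accelerator build is missing: use the CPU-only build.
        return "cpu", available["cpu"]
    raise LookupError(f"no PyTorch build available for backend {requested!r}")


# Hypothetical install locations, for illustration only.
builds = {"cpu": "/opt/torch-cpu", "cuda": "/opt/torch-cuda"}
backend, path = select_torch_build(builds, {"TORCH_BACKEND": "cuda"})
print(backend, path)
```

Falling back to the CPU-only build rather than failing is a design choice of this sketch; a real setup might prefer a hard error when the requested accelerator build is missing.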
I think the PyTorch backend selection logic should honor CMSSW's |
|
this has been replaced by #10390 |
Continuation of #10056