
[PY312] Build PyTorch with rocm, with cuda or without accelerators #10123

Closed
iarspider wants to merge 5 commits into IB/CMSSW_16_1_X/py312 from pytorch-rocm-5

Conversation

@iarspider (Contributor)

Continuation of #10056

@iarspider (Contributor Author)

please test

@cmsbuild (Contributor)

A new Pull Request was created by @iarspider for branch IB/CMSSW_16_0_X/py312.

@akritkbehera, @iarspider, @smuzaffar can you please review it and eventually sign? Thanks.
@ftenchini, @mandrenguyen, @sextonkennedy you are the release managers for this.
cms-bot commands are listed here

@cmsbuild (Contributor)

cmsbuild commented Oct 14, 2025

cms-bot internal usage

@cmsbuild (Contributor)

-1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-f6000c/48668/summary.html
COMMIT: 9eb1b4d
CMSSW: CMSSW_16_0_PY312_X_2025-10-08-2300/el8_amd64_gcc13
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmsdist/10123/48668/install.sh to create a dev area with all the needed externals and cmssw changes.

External Build

I found compilation error when building:

patching file torch/csrc/profiler/events.h
Hunk #1 FAILED at 11.
1 out of 2 hunks FAILED -- saving rejects to file torch/csrc/profiler/events.h.rej
patching file torch/csrc/profiler/orchestration/observer.h
Hunk #1 succeeded at 23 (offset -1 lines).
error: Bad exit status from /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/tmp/rpm-tmp.ai8irA (%prep)

RPM build errors:
Bad exit status from /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/tmp/rpm-tmp.ai8irA (%prep)

* The action "install-external+pytorch+2.6.0-46fb6e29091bb0ed685e2e887eb98b78" was not completed successfully because The following dependencies could not complete:


@iarspider (Contributor Author)

please test

@cmsbuild (Contributor)

Pull request #10123 was updated.

@cmsbuild (Contributor)

-1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-f6000c/48671/summary.html
COMMIT: 7e4997c
CMSSW: CMSSW_16_0_PY312_X_2025-10-08-2300/el8_amd64_gcc13
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmsdist/10123/48671/install.sh to create a dev area with all the needed externals and cmssw changes.

External Build

I found compilation warning when building: See details on the summary page.

@iarspider (Contributor Author)

please test

@cmsbuild (Contributor)

Pull request #10123 was updated.

@cmsbuild (Contributor)

-1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-f6000c/48672/summary.html
COMMIT: 5c7e8a1
CMSSW: CMSSW_16_0_PY312_X_2025-10-08-2300/el8_amd64_gcc13
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmsdist/10123/48672/install.sh to create a dev area with all the needed externals and cmssw changes.

External Build

I found compilation warning when building: See details on the summary page.

@cmsbuild (Contributor)

Pull request #10123 was updated.

@iarspider (Contributor Author)

please test

@cmsbuild (Contributor)

-1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-f6000c/48677/summary.html
COMMIT: 6e28101
CMSSW: CMSSW_16_0_PY312_X_2025-10-08-2300/el8_amd64_gcc13
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmsdist/10123/48677/install.sh to create a dev area with all the needed externals and cmssw changes.

External Build

I found compilation error when building:

provides a separate development package or SDK, be sure it has been
installed.


-- Configuring incomplete, errors occurred!
error: Bad exit status from /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/tmp/rpm-tmp.3vuYk2 (%build)

RPM build errors:
Bad exit status from /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/tmp/rpm-tmp.3vuYk2 (%build)

* The action "build-external+pytorch-cluster+1.6.3-9764097db7206957e6b17daaf095593e" was not completed successfully because Failed to build pytorch-cluster. Log file in /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/BUILD/el8_amd64_gcc13/external/pytorch-cluster/1.6.3-9764097db7206957e6b17daaf095593e/log. Final lines of the log file:


@iarspider (Contributor Author)

please test

@cmsbuild (Contributor)

Pull request #10123 was updated.

@cmsbuild (Contributor)

-1

Failed Tests: Build
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-f6000c/48678/summary.html
COMMIT: 775e743
CMSSW: CMSSW_16_0_PY312_X_2025-10-08-2300/el8_amd64_gcc13
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmsdist/10123/48678/install.sh to create a dev area with all the needed externals and cmssw changes.

Build

I found compilation error when building:

>> Entering Package PhysicsTools/PyTorch
>> Leaving Package PhysicsTools/PyTorch
>> Package PhysicsTools/PyTorch built
>> Compiling  src/PhysicsTools/PyTorch/test/testTorch.cc
/data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/el8_amd64_gcc13/external/gcc/13.4.0-6908cfdf803923e783448096ca4f0923/bin/c++ -c -DCMS_MICRO_ARCH='x86-64-v3' -DGNU_GCC -D_GNU_SOURCE -DCMSSW_GIT_HASH='CMSSW_16_0_PY312_X_2025-10-08-2300' -DPROJECT_NAME='CMSSW' -DPROJECT_VERSION='CMSSW_16_0_PY312_X_2025-10-08-2300' -Isrc -Ipoison -I/cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc13/cms/cmssw/CMSSW_16_0_PY312_X_2025-10-08-2300/src -O3 -pthread -pipe -Werror=main -Werror=pointer-arith -Werror=overlength-strings -Wno-vla -Werror=overflow -std=c++20 -ftree-vectorize -Werror=array-bounds -Werror=format-contains-nul -Werror=type-limits -fvisibility-inlines-hidden -fno-math-errno --param vect-max-version-for-alias-checks=50 -Xassembler --compress-debug-sections -Wno-error=array-bounds -Warray-bounds -fuse-ld=bfd -march=x86-64-v3 -felide-constructors -fmessage-length=0 -Wall -Wno-non-template-friend -Wno-long-long -Wreturn-type -Wextra -Wpessimizing-move -Wclass-memaccess -Wno-cast-function-type -Wno-unused-but-set-parameter -Wno-ignored-qualifiers -Wno-unused-parameter -Wunused -Wparentheses -Werror=return-type -Werror=missing-braces -Werror=unused-value -Werror=unused-label -Werror=address -Werror=format -Werror=sign-compare -Werror=write-strings -Werror=delete-non-virtual-dtor -Werror=strict-aliasing -Werror=narrowing -Werror=unused-but-set-variable -Werror=reorder -Werror=unused-variable -Werror=conversion-null -Werror=return-local-addr -Wnon-virtual-dtor -Werror=switch -fdiagnostics-show-option -Wno-unused-local-typedefs -Wno-attributes -Wno-psabi -DBOOST_DISABLE_ASSERTS -flto=auto -fipa-icf -flto-odr-type-merging -fno-fat-lto-objects -Wodr -fPIC -MMD -MF tmp/el8_amd64_gcc13/src/PhysicsTools/PyTorch/test/testTorch/testTorch.cc.d src/PhysicsTools/PyTorch/test/testTorch.cc -o tmp/el8_amd64_gcc13/src/PhysicsTools/PyTorch/test/testTorch/testTorch.cc.o
src/PhysicsTools/PyTorch/test/testTorch.cc:2:10: fatal error: torch/torch.h: No such file or directory
    2 | #include <torch/torch.h>
      |          ^~~~~~~~~~~~~~~
compilation terminated.
gmake: *** [tmp/el8_amd64_gcc13/src/PhysicsTools/PyTorch/test/testTorch/testTorch.cc.o] Error 1
>> Building binary testTorch


@cmsbuild (Contributor)

Pull request #10123 was updated.

@smuzaffar smuzaffar changed the base branch from IB/CMSSW_16_0_X/py312 to IB/CMSSW_16_1_X/py312 December 18, 2025 13:47
@valsdav (Contributor)

valsdav commented Feb 23, 2026

Hi @iarspider @smuzaffar! Just for my understanding, what is the current technical issue with the simultaneous ROCm and CUDA support for PyTorch compilation? Is there something we can do to help on this issue?

Thanks a lot and sorry for asking again

@fwyzard (Contributor)

fwyzard commented Feb 24, 2026

My understanding is that PyTorch does not support building for both CUDA and ROCm at the same time.

The idea would be to have 2 or 3 independent builds from the same source tree (1 for CUDA, 1 for ROCm, maybe 1 for CPU-only), and somehow pick the correct one based on what CMSSW wants to use (CUDA, ROCm, or CPU-only).

I'm not sure if there was any progress on this front, though. If you or the ML group want to help, that would be very welcome :-)
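The "2 or 3 independent builds, pick one at runtime" idea above could be sketched roughly as follows. This is purely illustrative: the install layout under `/opt/pytorch/...` and the detection via `nvidia-smi`/`rocm-smi` are assumptions for the sketch, not the actual cmsdist or scram mechanism.

```python
# Hypothetical sketch: several independent PyTorch builds from the same
# source tree (cpu, cuda, rocm), with the correct one selected at runtime
# based on which accelerator tooling is present on the host.
import os
import shutil


def detect_accelerator():
    # Crude availability probe: presence of the vendor SMI tool on PATH.
    if shutil.which("nvidia-smi"):
        return "cuda"
    if shutil.which("rocm-smi"):
        return "rocm"
    return "cpu"


def torch_install_prefix(base="/opt/pytorch"):
    # One independent build per backend, e.g. /opt/pytorch/{cpu,cuda,rocm};
    # the chosen prefix would then be put on PYTHONPATH / LD_LIBRARY_PATH.
    return os.path.join(base, detect_accelerator())
```

In a real setup the selection would be driven by the job configuration rather than host probing alone (see the later comments in this thread).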

@iarspider (Contributor Author)

@valsdav one can't build ROCm and CUDA versions from a single source tree because for ROCm the build process modifies the source code heavily (renames classes, methods, files) before actual compilation happens. We never tested if we can manually merge ROCm and CUDA installations after building, but I would expect a lot of headache and weird side effects to happen if someone tries. So, as @fwyzard said, the idea is to have 2 or 3 builds (CPU, CUDA, ROCm) and somehow switch them at runtime.
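To make the "modifies the source code heavily" point concrete: the ROCm toolchain's HIPIFY step textually translates CUDA API names into their HIP equivalents before compilation. The toy function below mimics that idea with a tiny hand-picked rename table (the real tool covers far more and operates on whole source trees).

```python
# Illustrative only: a HIPIFY-style rename pass. Because the translation
# rewrites the sources in place before the build, one source tree cannot
# simultaneously serve a CUDA build and a ROCm build.
CUDA_TO_HIP = {
    "cudaMalloc": "hipMalloc",
    "cudaMemcpy": "hipMemcpy",
    "cudaFree": "hipFree",
    "cudaStream_t": "hipStream_t",
}


def hipify(source: str) -> str:
    # Naive textual substitution; the real hipify tools are AST/regex based.
    for cuda_name, hip_name in CUDA_TO_HIP.items():
        source = source.replace(cuda_name, hip_name)
    return source


print(hipify("cudaMalloc(&ptr, n); cudaFree(ptr);"))
# hipMalloc(&ptr, n); hipFree(ptr);
```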

@smuzaffar (Contributor)

Hi @iarspider @smuzaffar! Just for my understanding, what is the current technical issue with the simultaneous ROCm and CUDA support for PyTorch compilation? Is there something we can do to help on this issue?

@valsdav, with the scram dynamic runtime env we should be able to get torch for cpu/cuda/rocm in cmssw. I have a rough idea how to do it; let me first get torch 2.10.0 into the PY312 IBs, and then I will see what we can do to get torch built for rocm/cuda.

@makortel (Contributor)

with scram dynamic runtime env we should be able to get torch for cpu/cuda/rocm in cmssw.

I think the PyTorch backend selection logic should honor CMSSW's process.options.accelerators configuration parameter (https://twiki.cern.ch/twiki/bin/view/CMSPublic/SWGuideEDMParametersForModules#The_options_Parameter_Set).
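A minimal sketch of what "honoring `process.options.accelerators`" could look like, under assumptions: the label-to-backend mapping below (`gpu-nvidia`→cuda, `gpu-amd`→rocm) and the function name `pick_torch_backend` are hypothetical, introduced only to illustrate preferring the configured accelerator over blind host detection.

```python
# Sketch: choose the PyTorch backend from a configured accelerator list
# (in the spirit of CMSSW's process.options.accelerators) intersected
# with what is actually available on the host. Labels are assumptions.
def pick_torch_backend(configured, available):
    # Map accelerator labels to torch backend names; fall back to cpu.
    preference = {"gpu-nvidia": "cuda", "gpu-amd": "rocm", "cpu": "cpu"}
    for label in configured:
        backend = preference.get(label)
        if backend in available:
            return backend
    return "cpu"


print(pick_torch_backend(["gpu-nvidia", "cpu"], {"cuda", "cpu"}))  # cuda
```

The point of the design is that the same job configuration then steers both the framework's accelerator choice and the PyTorch build that gets loaded, so the two cannot disagree.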

@smuzaffar (Contributor)

this has been replaced by #10390

@smuzaffar smuzaffar closed this Mar 11, 2026
@smuzaffar smuzaffar deleted the pytorch-rocm-5 branch March 11, 2026 21:00
6 participants