Split torch tool for cpu, cuda and rocm#10390

Merged
smuzaffar merged 1 commit into IB/CMSSW_16_1_X/py312 from torch-cpu-gpu
Mar 10, 2026

@smuzaffar
Contributor

@smuzaffar smuzaffar commented Mar 2, 2026

This PR proposes building Torch separately for cpu, cuda and rocm. For now torch-rocm is not enabled because it requires the full rocm distribution (nearly 11GB) [a]. We should be able to build most or all of [a] from sources using https://github.com/ROCm/rocm-libraries.

For now this PR splits torch into cpu and cuda parts, as a proof of concept that we can dynamically select the env depending on the available resources. Currently this PR proposes the following packages:

  • py3-torch: Only cpu c++ API and python interface
  • py3-torch-cuda: cuda and cpu c++ API and python interface
  • py3-torch-<extensions> (e.g. cluster, scatter and sparse): cpu only python interface
  • py3-torch-<extensions>-cuda (e.g. cluster, scatter and sparse): cuda and cpu python interface
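To see which of the variants listed above is actually visible in a given environment, a quick check along these lines may help. This is only a sketch: the function name `describe_torch_variant` and its labels are made up for illustration, and it assumes that `torch.version.cuda` is `None` for a build compiled without CUDA support.

```python
import importlib


def describe_torch_variant():
    """Return a short label for the torch build visible in this environment.

    The labels are illustrative only; they are not produced by SCRAM or torch.
    """
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        return "torch not importable"
    # torch.version.cuda is None for builds compiled without CUDA support.
    cuda_version = getattr(torch.version, "cuda", None)
    if cuda_version is None:
        return "cpu-only build"
    # A CUDA build can still be loaded on a host without a usable GPU.
    if torch.cuda.is_available():
        return f"cuda build (CUDA {cuda_version}), GPU usable"
    return f"cuda build (CUDA {cuda_version}), no usable GPU"


if __name__ == "__main__":
    print(describe_torch_variant())
```

Running it once on a cpu only host and once on a host with cuda should show whether the expected variant was picked up.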

As we do not have py3-torch-rocm yet (it needs the full 11GB rocm distribution), on amdgpu hosts we fall back to the cpu only interface. SCRAM will dynamically set the env, e.g. on

  • cpu only host:
    • LD_LIBRARY_PATH: <path>/lib
    • PYTHON3PATH: <path>/lib/python3.12/site-packages
  • host with cuda:
    • LD_LIBRARY_PATH: <path>/lib/scram_cuda:<path>/lib (where /lib/scram_cuda contains libs for cuda and cpu torch)
    • PYTHON3PATH: <path>/lib/python3.12/site-packages/scram_cuda:<path>/lib/python3.12/site-packages

For rocm, for now we fall back to the cpu only libs/python modules.
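A minimal sketch of the path selection described above. It is not the SCRAM implementation: the function `scram_torch_env`, the backend labels and the `prefix` argument (standing in for `<path>`) are assumptions made for illustration. Only the directory layout and the rocm fallback come from the description.

```python
import posixpath


def scram_torch_env(prefix, backend, python_version="3.12"):
    """Build LD_LIBRARY_PATH and PYTHON3PATH values for a detected backend.

    `prefix` stands in for the <path> placeholder; `backend` is one of the
    hypothetical labels "cpu", "cuda" or "rocm".
    """
    if backend not in ("cpu", "cuda", "rocm"):
        raise ValueError(f"unknown backend: {backend!r}")
    lib = posixpath.join(prefix, "lib")
    site = posixpath.join(lib, f"python{python_version}", "site-packages")
    lib_dirs, py_dirs = [lib], [site]
    if backend == "cuda":
        # scram_cuda directories go first so they take precedence over cpu-only copies.
        lib_dirs.insert(0, posixpath.join(lib, "scram_cuda"))
        py_dirs.insert(0, posixpath.join(site, "scram_cuda"))
    # "rocm" deliberately falls through to the cpu-only paths for now.
    return {
        "LD_LIBRARY_PATH": ":".join(lib_dirs),
        "PYTHON3PATH": ":".join(py_dirs),
    }


if __name__ == "__main__":
    for backend in ("cpu", "cuda", "rocm"):
        print(backend, scram_torch_env("/opt/torch", backend))
```

Putting the `scram_cuda` directories ahead of the plain ones is what lets a single install area serve both host types: the same module name resolves to the cuda build when that directory is on the path and to the cpu build otherwise.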

I have also added triton and aotriton specs so that, if we have to build torch-rocm fully from sources, we can use aotriton. For now torch-rocm downloads a prebuilt aotriton bundle at build time.

This needs cms-sw/cmssw-config#117 too.

[a]

  hipblas-common-devel
  miopen-hip miopen-hip-devel
  hipfft hipfft-devel
  hipsparse hipsparse-devel
  rocprim-devel hipcub-devel
  rocthrust-devel
  hipsolver hipsolver-devel
  rocsolver rocsolver-devel
  hipblaslt hipblaslt-devel
  hipsparselt hipsparselt-devel
  roctracer roctracer-devel
  rocblas rocblas-devel

@smuzaffar
Contributor Author

enable gpu

@cmsbuild
Contributor

cmsbuild commented Mar 2, 2026

A new Pull Request was created by @smuzaffar for branch IB/CMSSW_16_1_X/py312.

@akritkbehera, @cmsbuild, @iarspider, @raoatifshad, @smuzaffar can you please review it and eventually sign? Thanks.
@ftenchini, @mandrenguyen, @sextonkennedy you are the release manager for this.
cms-bot commands are listed here

@cmsbuild
Contributor

cmsbuild commented Mar 2, 2026

cms-bot internal usage

@smuzaffar
Contributor Author

please test with cms-sw/cmssw-config#117 using full cmssw

@cmsbuild
Contributor

cmsbuild commented Mar 2, 2026

-1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-b5dec0/51700/summary.html
COMMIT: af9e68a
CMSSW: CMSSW_16_1_PY312_X_2026-03-01-2300/el8_amd64_gcc13
Additional Tests: GPU,AMD_MI300X,AMD_W7900,NVIDIA_H100,NVIDIA_L40S
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmsdist/10390/51700/install.sh to create a dev area with all the needed externals and cmssw changes.

Failed External Build

I found compilation error when building:

set_property could not find TARGET torch_cuda.  Perhaps it has not yet been
created.


-- Configuring incomplete, errors occurred!
error: Bad exit status from /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/tmp/rpm-tmp.GwcHyT (%build)

RPM build errors:
Bad exit status from /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/tmp/rpm-tmp.GwcHyT (%build)

* The action "build-install-external+pytorch-cluster+1.6.3-24509dcecfe3ecca5dcdbeba91f58f58" was not completed successfully because The following dependencies could not complete:


@cmsbuild
Contributor

cmsbuild commented Mar 4, 2026

Pull request #10390 was updated.

@smuzaffar
Contributor Author

please test

@cmsbuild
Contributor

cmsbuild commented Mar 4, 2026

-1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-b5dec0/51750/summary.html
COMMIT: 0739911
CMSSW: CMSSW_16_1_PY312_X_2026-03-03-2300/el8_amd64_gcc13
Additional Tests: GPU,AMD_MI300X,AMD_W7900,NVIDIA_H100,NVIDIA_L40S
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmsdist/10390/51750/install.sh to create a dev area with all the needed externals and cmssw changes.

Failed External Build

I found compilation error when building:

+ DUP_BIN=torchrun
+ set +x
torchrun:py3-torch
torchrun:py3-torch-cuda
ERROR: Duplicate python binaries found. Please cleanup and make sure only one binary is available.
error: Bad exit status from /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/tmp/rpm-tmp.MUT1X4 (%install)

RPM build errors:
Bad exit status from /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/tmp/rpm-tmp.MUT1X4 (%install)

* The action "build-srpm-cms+cmssw-tools+6.0-87f515c1e140fba3b0dce87520911fe2" was not completed successfully because The following dependencies could not complete:
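The duplicate-binary error above comes from two packages shipping the same executable name (`torchrun`). A check of that kind can be sketched as follows. This is not the actual cmssw-tools script: the function `find_duplicate_binaries` and the input mapping are hypothetical, and the `py3-numpy`/`f2py` entry in the example is included only as a non-clashing case.

```python
from collections import defaultdict


def find_duplicate_binaries(package_bins):
    """Map each binary name to the packages shipping it, keeping only clashes."""
    owners = defaultdict(list)
    for package, bins in package_bins.items():
        for name in bins:
            owners[name].append(package)
    return {name: sorted(pkgs) for name, pkgs in owners.items() if len(pkgs) > 1}


if __name__ == "__main__":
    # Hypothetical input: package name -> binaries installed in its bin directory.
    example = {
        "py3-torch": ["torchrun"],
        "py3-torch-cuda": ["torchrun"],
        "py3-numpy": ["f2py"],
    }
    for name, pkgs in find_duplicate_binaries(example).items():
        for pkg in pkgs:
            print(f"{name}:{pkg}")
```

With both py3-torch and py3-torch-cuda installing `torchrun`, any check of this shape fails until only one of them provides the binary.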


@cmsbuild
Contributor

cmsbuild commented Mar 4, 2026

Pull request #10390 was updated.

@smuzaffar
Contributor Author

please test with cms-sw/cmssw-config#117 using full cmssw

@cmsbuild
Contributor

cmsbuild commented Mar 9, 2026

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-b5dec0/51850/summary.html
COMMIT: aef3014
CMSSW: CMSSW_16_1_PY312_X_2026-03-08-2300/el8_amd64_gcc13
Additional Tests: GPU,AMD_MI300X,AMD_W7900,NVIDIA_H100,NVIDIA_L40S
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmsdist/10390/51850/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • You potentially added 42 lines to the logs
  • ROOTFileChecks: Some differences in event products or their sizes found
  • Reco comparison results: 2976 differences found in the comparisons
  • DQMHistoTests: Total files compared: 53
  • DQMHistoTests: Total histograms compared: 4181048
  • DQMHistoTests: Total failures: 6556
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 4174472
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 52 files compared)
  • Checked 227 log files, 198 edm output root files, 53 DQM output files
  • TriggerResults: no differences found
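As a sanity check on the DQMHistoTests figures above, the short sketch below verifies that failures, nulls, successes and skipped add up to the reported total and computes the failure fraction. It assumes those four categories are meant to partition the total; the helper name `summarize_dqm` is made up for illustration.

```python
def summarize_dqm(total, failures, nulls, successes, skipped):
    """Check that the categories account for every histogram and compute the failure fraction."""
    accounted = failures + nulls + successes + skipped
    return {
        "consistent": accounted == total,
        "failure_fraction": failures / total,
    }


# Numbers copied from the comparison summary above.
summary = summarize_dqm(
    total=4181048, failures=6556, nulls=0, successes=4174472, skipped=20
)

if __name__ == "__main__":
    print(summary["consistent"], f"{summary['failure_fraction']:.4%}")
```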

@cmsbuild
Contributor

Pull request #10390 was updated.

@cmsbuild
Contributor

Pull request #10390 was updated.

@smuzaffar
Contributor Author

please test with cms-sw/cms-bot#2702 using full cmssw

@cmsbuild
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-b5dec0/51878/summary.html
COMMIT: 72b83fd
CMSSW: CMSSW_16_1_PY312_X_2026-03-09-2300/el8_amd64_gcc13
Additional Tests: GPU,AMD_MI300X,AMD_W7900,NVIDIA_H100,NVIDIA_L40S
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmsdist/10390/51878/install.sh to create a dev area with all the needed externals and cmssw changes.

The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:

You can see more details here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-b5dec0/51878/git-recent-commits.json
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-b5dec0/51878/git-merge-result

Comparison Summary

Summary:

@smuzaffar
Contributor Author

+externals

Good to go in Python 3.12 based IBs (i.e. PY312 and GCC15 IBs).

@smuzaffar smuzaffar merged commit c2dffee into IB/CMSSW_16_1_X/py312 Mar 10, 2026
12 checks passed
@cmsbuild
Contributor

This pull request is fully signed and it will be integrated in one of the next IB/CMSSW_16_1_X/py312 IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @mandrenguyen, @sextonkennedy, @ftenchini (and backports should be raised in the release meeting by the corresponding L2)
Notice: This PR was tested with additional Pull Request(s), please also merge them if necessary: cms-sw/cms-bot#2702
