Update to torch=2.10 and rocm=7.1 and Pin Versions #17
Merged
Commits (5)
4ee7724 Update versions and pin distconv & ccl. Add separate install for pypi (michaelmckinsey1)
5f3833e Ensure libfabric (michaelmckinsey1)
b48c7bd Don't need spindle off anymore (michaelmckinsey1)
40247e3 Enforce cray-mpich 9.1.0 (michaelmckinsey1)
ba671a5 patch all so files (michaelmckinsey1)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,4 +1,4 @@ | ||
| ml load python/3.11.5 && python3 -m venv .venvs/scaffoldvenv-matrix && source .venvs/scaffoldvenv-matrix/bin/activate && pip install --upgrade pip | ||
| - ml cuda/12.6.0 gcc/12.1.1 mvapich2/2.3.7 | ||
| + ml cuda/12.9.1 gcc/13.3.1 mvapich2/2.3.7 | ||
| export LD_LIBRARY_PATH=/usr/lib64:$LD_LIBRARY_PATH | ||
| - pip install --no-binary=mpi4py -e .[cuda] --prefix=.venvs/scaffoldvenv-matrix --extra-index-url https://download.pytorch.org/whl/cu126 2>&1 | tee install.log | ||
| + pip install --no-binary=mpi4py -e .[cuda] --prefix=.venvs/scaffoldvenv-matrix --extra-index-url https://download.pytorch.org/whl/cu129 2>&1 | tee install.log |
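The change above moves the CUDA module and the PyTorch wheel index together (`cuda/12.6.0` with `cu126`, `cuda/12.9.1` with `cu129`). A minimal sketch of deriving one from the other follows; `torch_cuda_index_url` is a hypothetical helper that is not part of this PR, and it assumes the `cu<major><minor>` suffix seen in the two URLs. Not every CUDA version has a published index, so the result still needs checking.

```shell
# Hypothetical helper (not in the PR): map a CUDA module version such as
# 12.9.1 to the PyTorch wheel index URL that uses the cu<major><minor> suffix.
torch_cuda_index_url() {
    local cuda_version="$1" major rest minor
    case "$cuda_version" in
        *.*) ;;  # needs at least major.minor
        *)
            echo "expected a version like 12.9.1, got '$cuda_version'" >&2
            return 1
            ;;
    esac
    major="${cuda_version%%.*}"   # text before the first dot
    rest="${cuda_version#*.}"     # text after the first dot
    minor="${rest%%.*}"           # second component only
    echo "https://download.pytorch.org/whl/cu${major}${minor}"
}

# Possible use in the install line:
#   pip install ... --extra-index-url "$(torch_cuda_index_url 12.9.1)"
```

Keeping the version in one variable and deriving the URL from it would reduce the chance of the module and the wheel index drifting apart in a later update.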
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,31 @@ | ||
| #!/bin/bash | ||
| | ||
| # Exit if target directory already exists | ||
| if [ -d "aws-ofi-nccl.git" ]; then | ||
| echo "Directory 'aws-ofi-nccl.git' already exists. Exiting to avoid overwrite." | ||
| return 1 2>/dev/null || exit 1 | ||
| fi | ||
| | ||
| rocm_version=7.1.0 | ||
| | ||
| module swap PrgEnv-cray PrgEnv-gnu | ||
| module load rocm/$rocm_version | ||
| | ||
| git clone --recursive --branch v1.18.0 https://github.com/aws/aws-ofi-nccl.git aws-ofi-nccl.git | ||
| | ||
| cd aws-ofi-nccl.git | ||
| | ||
| installdir=$(pwd)/install | ||
| | ||
| ./autogen.sh | ||
| | ||
| export LD_LIBRARY_PATH=$PWD/../rccl/install/lib:/opt/rocm-$rocm_version/lib:$LD_LIBRARY_PATH | ||
| | ||
| #CC=hipcc CXX=hipcc CFLAGS=-I$PWD/../rccl/install/include/rccl ./configure \ | ||
| ./configure \ | ||
| --with-libfabric=/opt/cray/libfabric/2.1 \ | ||
| --with-rocm=$ROCM_PATH \ | ||
| --prefix=$installdir | ||
| | ||
| make | ||
| make install |
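The `./configure` step above depends on two directories: the pinned libfabric path and whatever `$ROCM_PATH` holds after `module load rocm/$rocm_version`. A small pre-flight check, sketched below, would fail early with a clear message instead of partway through the build. `require_dir` is a hypothetical helper, not something the PR defines, and it assumes the rocm module is what sets `ROCM_PATH`.

```shell
# Hypothetical helper (not in the PR): fail with a message when a directory
# that the build depends on is missing. An empty path also counts as missing,
# which catches an unset ROCM_PATH.
require_dir() {
    local label="$1" dir="$2"
    if [ ! -d "$dir" ]; then
        echo "Missing $label directory: '$dir'" >&2
        return 1
    fi
}

# Possible use before ./configure, with the paths from the script above:
#   require_dir libfabric /opt/cray/libfabric/2.1 || return 1 2>/dev/null || exit 1
#   require_dir ROCm "$ROCM_PATH"                 || return 1 2>/dev/null || exit 1
```

The `return 1 2>/dev/null || exit 1` idiom mirrors the script's own guard, so the check behaves the same whether the script is sourced or executed.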
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,4 @@ | ||
| . install-rccl.sh | ||
| ml load python/3.11.5 && python3 -m venv .venvs/scaffoldvenv-tuo-pypi && source .venvs/scaffoldvenv-tuo-pypi/bin/activate && pip install --upgrade pip | ||
| ml cce/21.0.0 cray-mpich/9.1.0 rocm/7.1.0 rccl/fast-env-slows-mpi | ||
| pip install -e .[rocm] --prefix=.venvs/scaffoldvenv-tuo-pypi --extra-index-url https://download.pytorch.org/whl/rocm7.1 2>&1 | tee install.log |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,3 +1,25 @@ | ||
| ml load python/3.11.5 && python3 -m venv .venvs/scaffoldvenv-tuo && source .venvs/scaffoldvenv-tuo/bin/activate && pip install --upgrade pip | ||
| - ml load rocm/6.4.2 rccl/fast-env-slows-mpi libfabric | ||
| + ml cce/21.0.0 cray-mpich/9.1.0 rocm/7.1.0 rccl/fast-env-slows-mpi | ||
| pip install -e .[rocmwci] --prefix=.venvs/scaffoldvenv-tuo 2>&1 | tee install.log | ||
| + # Needed until new wheel exists for torch using mpich 9.1.0 | ||
| + TORCH_LIB_DIR=".venvs/scaffoldvenv-tuo/lib/python3.11/site-packages/torch/lib" | ||
| + OLD="libmpi_gnu_112.so.12" | ||
| + NEW="libmpi_gnu.so.12" | ||
| + cd "$TORCH_LIB_DIR" || exit 1 | ||
| + # Patch every file that has OLD in its DT_NEEDED | ||
| + for f in *.so*; do | ||
| + [ -f "$f" ] || continue | ||
| + | ||
| + if patchelf --print-needed "$f" 2>/dev/null | grep -Fxq "$OLD"; then | ||
| + echo "Patching $f" | ||
| + patchelf --replace-needed "$OLD" "$NEW" "$f" | ||
| + fi | ||
| + done | ||
| + echo | ||
| + echo "Verification (should show no $OLD):" | ||
| + for f in *.so*; do | ||
| + [ -f "$f" ] || continue | ||
| + if patchelf --print-needed "$f" 2>/dev/null | grep -Fxq "$OLD"; then | ||
| + echo "STILL NEEDS $OLD -> $f" | ||
| + fi | ||
| + done | ||
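Both loops above decide whether a file needs patching with `grep -Fxq`. The sketch below isolates that test so the matching rule is easier to see: `-F` treats the library name as a literal string (the dots are not regex wildcards) and `-x` requires the whole line to match, so a shorter name such as `libmpi_gnu.so.1` does not match `libmpi_gnu.so.12`. `needs_library` is a hypothetical name used only for illustration.

```shell
# Hypothetical helper (not in the PR): read a DT_NEEDED list on stdin, one
# library name per line as printed by `patchelf --print-needed`, and succeed
# only if the given name appears as an exact, literal, whole-line match.
needs_library() {
    grep -Fxq -- "$1"
}

# Possible use inside the patch loop:
#   if patchelf --print-needed "$f" 2>/dev/null | needs_library "$OLD"; then
#       patchelf --replace-needed "$OLD" "$NEW" "$f"
#   fi
```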
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,24 @@ | ||
| #!/bin/bash | ||
| | ||
| # flux: --exclusive | ||
| # flux: -N 1 | ||
| # flux: -g=1 | ||
| # flux: -t 60m | ||
| # flux: -qpdebug | ||
| # flux: -B fractale | ||
| | ||
| ml cce/21.0.0 cray-mpich/9.1.0 rocm/7.1.0 rccl/fast-env-slows-mpi | ||
| | ||
| . .venvs/scaffoldvenv-tuo-pypi/bin/activate | ||
| | ||
| # Use ccl plugin that we manually built with install-rccl.sh | ||
| export NCCL_NET_PLUGIN=../aws-ofi-nccl.git/install/lib/librccl-net.so | ||
| export NCCL_NET="AWS Libfabric" | ||
| | ||
| torchrun-hpc -N 1 -n 1 $(which scaffold) generate_fractals -c $(pwd)/ScaFFold/configs/benchmark_default.yml | ||
| | ||
| # Uncomment if you want torch profiling | ||
| #export PROFILE_TORCH=ON | ||
| | ||
| torchrun-hpc -N 1 -n 4 --gpus-per-proc 1 $(which scaffold) benchmark -c $(pwd)/ScaFFold/configs/benchmark_default.yml | ||
| #torchrun-hpc -N 2 -n 4 --gpus-per-proc 1 $(which scaffold) benchmark -c $(pwd)/ScaFFold/configs/benchmark_default.yml |
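The job script above points `NCCL_NET_PLUGIN` at a path relative to the submission directory. As a hedged sketch, the helper below checks that the plugin file exists and prints an absolute path, on the assumption that the launched processes may not share the script's working directory. `resolve_plugin` is a hypothetical name, and whether an absolute path is actually needed on this system is not something the PR states.

```shell
# Hypothetical helper (not in the PR): verify that a plugin library exists
# and print its absolute path.
resolve_plugin() {
    local plugin="$1"
    if [ ! -f "$plugin" ]; then
        echo "NCCL net plugin not found: $plugin" >&2
        return 1
    fi
    # Resolve the directory part, then re-attach the file name.
    echo "$(cd "$(dirname "$plugin")" && pwd)/$(basename "$plugin")"
}

# Possible use in the job script:
#   NCCL_NET_PLUGIN="$(resolve_plugin ../aws-ofi-nccl.git/install/lib/librccl-net.so)" || exit 1
#   export NCCL_NET_PLUGIN
```

Failing before `torchrun-hpc` starts makes a missing or unbuilt plugin visible immediately rather than leaving the run to proceed without the intended network backend.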
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -7,15 +7,16 @@ | ||
| # flux: -qpdebug | ||
| # flux: -B fractale | ||
| | ||
| - ml rocm/6.4.2 rccl/fast-env-slows-mpi | ||
| + ml cce/21.0.0 cray-mpich/9.1.0 rocm/7.1.0 rccl/fast-env-slows-mpi | ||
| | ||
| . .venvs/scaffoldvenv-tuo/bin/activate | ||
| | ||
| - # Avoid spindle error | ||
| - export SPINDLE_FLUXOPT=off | ||
| + # (1) Avoid libmagma error | ||
| + # (2) Removing libmpi may cause segfault on mpi4py import | ||
| + export LD_PRELOAD="/opt/rocm-7.1.0/llvm/lib/libomp.so /opt/cray/pe/mpich/9.1.0/ofi/gnu/11.2/lib/libmpi_gnu.so.12" | ||
| Review comment (Collaborator, Author): The second path will go away once an updated wheel is provided, hopefully soon. | ||
| | ||
| - # Avoid libmagma error | ||
| - export LD_PRELOAD=/opt/rocm-6.4.2/llvm/lib/libomp.so | ||
| + # Ensure using libfabric. NCCL_NET_PLUGIN should be unnecessary to set for WCI wheel. | ||
| + export NCCL_NET="AWS Libfabric" | ||
| | ||
| torchrun-hpc -N 1 -n 1 $(which scaffold) generate_fractals -c $(pwd)/ScaFFold/configs/benchmark_default.yml | ||
| | ||
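The new `LD_PRELOAD` value hard-codes two versioned paths, one under `/opt/rocm-7.1.0` and one under the cray-mpich 9.1.0 install. The dynamic loader generally only warns about a preload entry it cannot open and then continues, so a stale path after a module update can be easy to miss. The sketch below checks each entry before launch; `check_preload` is a hypothetical helper that the PR does not include.

```shell
# Hypothetical helper (not in the PR): report LD_PRELOAD entries that do not
# exist on disk. Entries may be separated by spaces or colons.
check_preload() {
    local missing=0 lib
    for lib in $(printf '%s' "$1" | tr ':' ' '); do
        if [ ! -e "$lib" ]; then
            echo "LD_PRELOAD entry not found: $lib" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Possible use in the job script, after the export:
#   check_preload "$LD_PRELOAD" || exit 1
```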
This will go away once an updated wheel is provided, hopefully soon.