Follow Up on new wheel torch==2.10.0+rocm710 #20

@michaelmckinsey1

Description

Some hacks need to be cleaned up. The current wheel, torch==2.10.0+rocm710, comes with libraries linked against cray-mpich/9.0.1, which causes segfaults. As a workaround, we replace those paths with paths to cray-mpich/9.1.0.

cleanup1 #17 (comment)
cleanup2 #17 (comment)

The hacks rewrite the shared library links in .venvs/scaffoldvenv-tuo/lib/python3.11/site-packages/torch/lib/ so that references to libmpi_gnu_112.so.12 point to libmpi_gnu.so.12, which is the name used by 9.1. In the jobscript we can then set LD_PRELOAD to /opt/cray/pe/mpich/9.1.0/ofi/gnu/11.2/lib/libmpi_gnu.so.12 so that the correct libmpi is used.
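
For reference while cleaning this up, here is a minimal sketch of how the relinking step could be scripted. It is not the exact procedure from the linked comments: it assumes patchelf is available and uses its `--replace-needed` option, and the helper names `find_libs_referencing` and `patchelf_commands` are hypothetical. The affected files are found with a raw byte search, which is only a heuristic and not a real ELF parse. The commands are built but not run, so they can be reviewed first.

```python
from pathlib import Path

OLD_SONAME = "libmpi_gnu_112.so.12"  # name the wheel's libraries reference (from this issue)
NEW_SONAME = "libmpi_gnu.so.12"      # name used by cray-mpich 9.1 (from this issue)


def find_libs_referencing(lib_dir, soname=OLD_SONAME):
    """Return shared objects directly under lib_dir whose bytes mention soname.

    Hypothetical helper: a plain byte search, not an ELF parser, so treat
    the result as a list of candidates to double-check.
    """
    needle = soname.encode()
    hits = []
    for path in sorted(Path(lib_dir).glob("*.so*")):
        if path.is_file() and needle in path.read_bytes():
            hits.append(path)
    return hits


def patchelf_commands(lib_dir, old=OLD_SONAME, new=NEW_SONAME):
    """Build (but do not run) one patchelf command per candidate library.

    Assumption: patchelf is the relinking tool; the issue does not say
    which tool the original hack used.
    """
    return [
        ["patchelf", "--replace-needed", old, new, str(path)]
        for path in find_libs_referencing(lib_dir, old)
    ]
```

Calling `patchelf_commands(".venvs/scaffoldvenv-tuo/lib/python3.11/site-packages/torch/lib")` returns a list of argument lists. Each one can be printed for inspection or passed to `subprocess.run(cmd, check=True)` once the list looks right.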
