Modular rocSHMEM Fork

This fork enables the GDA backend (device-initiated RDMA over the NIC) to work with Mojo in thread-per-GPU mode, without OpenMPI or UCX dependencies.

Quickstart

To build for Mojo and MAX, run the build_rocshmem.sh script. Once it completes, it asks whether you want to upload the artifact to S3; make sure you're logged in with aws sso login first. You can then update common.MODULE.bazel in the modular monorepo with the hash and URL the script provides. To test locally instead, press n at the upload prompt, and the script will print a hash and local URL for common.MODULE.bazel.
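The monorepo update amounts to pointing the module at the new archive. The exact rule and attribute layout in common.MODULE.bazel may differ; this is a hypothetical sketch with placeholder values:

```starlark
# Hypothetical sketch; the actual declaration in common.MODULE.bazel may differ.
http_archive(
    name = "rocshmem",
    # URL printed by build_rocshmem.sh (S3 URL, or a local URL if you pressed n)
    urls = ["<URL from build_rocshmem.sh>"],
    # SHA-256 hash printed by build_rocshmem.sh
    sha256 = "<hash from build_rocshmem.sh>",
)
```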

Features

  • Support one GPU per thread instead of one GPU per process, enabling integration with MAX.
  • Each thread can now manage its own GPU with thread-local state for device IDs, initialization counters, and device state registration.
  • Separate host shared library (librocshmem_host.so) and device bitcode (librocshmem_device.bc), plus a function to initialize device state into constant memory. Required for integration with Mojo.
  • Multi-node TCP bootstrap: Create unique IDs from IP address and port for RDMA key exchange across nodes without MPI or process launcher dependency.
  • Support for the GDA backend with all RDMA drivers (ionic, mlx5, bnxt), with ionic prioritized.
  • Fixed the symmetric heap: it previously allocated pinned host memory and then copied to device memory; it now allocates directly on the device.
  • Set sensible default environment variables to reduce Mojo-side configuration.

Testing changes

The convenience script at projects/rocshmem/tests/build_and_run_multi_node_test.sh builds the library with the GDA backend and ionic driver, builds the test, and forwards its arguments to the test binary.

On Node 0 (server)

First, find node 0's IP address on the eno0 network management interface used for TCP bootstrapping, e.g.:

ip addr show eno0

9: eno0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP group default qlen 1000
    inet 10.24.8.110/21 brd 10.24.15.255 scope global eno0

We'd use 10.24.8.110.
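If you want to script this step, the address can be extracted from the `ip` output with a small awk filter (assuming a single IPv4 address on the interface):

```shell
# Parse the IPv4 address out of `ip addr show` output, dropping the /prefix length.
# In practice, pipe `ip -4 addr show eno0` into this function.
extract_ip() {
  awk '/inet /{split($2, a, "/"); print a[1]; exit}'
}

# Using the sample output from above:
printf 'inet 10.24.8.110/21 brd 10.24.15.255 scope global eno0\n' | extract_ip
# prints 10.24.8.110
```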

Replace <NODE0_IP> with the eno0 IP below:

./build_and_run_multi_node_test.sh --run --node 0 --total-nodes 2 --gpus-per-node 8 --server <NODE0_IP> --port 12345

On Node 1 (client)

./build_and_run_multi_node_test.sh --run --node 1 --total-nodes 2 --gpus-per-node 8 --server <NODE0_IP> --port 12345

What Happens:

  1. Node 0 creates a unique ID from the server IP/port and validation data
  2. Node 1 creates an identical unique ID
  3. Both nodes then launch 8 threads (one per GPU)
  4. All 16 PEs (8 per node) initialize rocSHMEM with an identical unique ID
  5. The TCP bootstrap process exchanges RDMA keys and other connection state between the nodes and threads
  6. Each PE performs a put operation to the next PE in a ring pattern
  7. Data flows across all GPUs on both nodes via RDMA

To rebuild and run the multi-node test with debug information, add the --debug flag on both the server and the client, e.g.:

./build_and_run_multi_node_test.sh --debug --run --node 0 --total-nodes 2 --gpus-per-node 8 --server <NODE0_IP> --port 12345

Original ROCm Systems README.md

Welcome to the ROCm Systems super-repo. This repository consolidates multiple ROCm systems projects into a single repository to streamline development, CI, and integration. The first set of projects focuses on requirements for building PyTorch.

Super-repo Status and CI Health

This table provides the current status of the migration of specific ROCm systems projects as well as a pointer to their current CI health.

Key:

  • Completed: Fully migrated and integrated. This super-repo should be considered the source of truth for this project. The old repo may still be used for release activities.
  • In Progress: Ongoing migration, tests, or integration. Please refrain from submitting new pull requests on the individual repo of the project, and develop on the super-repo.
  • Pending: Not yet started or in the early planning stages. The individual repo should be considered the source of truth for this project.
| Component | Source of Truth | Migration Status | Azure CI Status | Component CI Status |
| --- | --- | --- | --- | --- |
| amdsmi | EMU | Pending | | |
| aqlprofile | Public | Completed | Azure Pipelines | CodeQL, Continuous Integration |
| clr | Public | Completed | Azure Pipelines | |
| hip | Public | Completed | Azure Pipelines | |
| hipother | Public | Completed | Azure Pipelines | |
| hip-tests | Public | Completed | Azure Pipelines | |
| rdc | Public | Completed | Azure Pipelines | |
| rocm-core | Public | Completed | Azure Pipelines | |
| rocminfo | Public | Completed | Azure Pipelines | |
| rocm-smi-lib | Public | Completed | Azure Pipelines | |
| rocprofiler | Public | Completed | Azure Pipelines | |
| rocprofiler-compute | Public | Completed | Azure Pipelines | Formatting, rhel-8, tarball, ubuntu jammy |
| rocprofiler-register | Public | Completed | Azure Pipelines | Continuous Integration |
| rocprofiler-sdk | Public | Completed | Azure Pipelines | Code Coverage Integration, CodeQL, Continuous Integration, Documentation, Formatting, Python Linting, Restrictions, Release Compatibility |
| rocprofiler-systems | Public | Completed | Azure Pipelines | Containers, rocprofiler-systems GHCR Packages for CI Images, CPack, Formatting, OpenSUSE, Python Linting, RedHat Linux, Ubuntu Jammy, Ubuntu Noble |
| rocr-runtime | Public | Completed | Azure Pipelines | |
| roctracer | Public | Completed | Azure Pipelines | |

Tentative migration schedule

| Component | Tentative Date |
| --- | --- |

*Remaining schedule to be determined.

TheRock CI Status

Note: TheRock CI performs multi-component testing on top of builds produced by TheRock build system.



Nomenclature

Project names have been standardized to match the casing and punctuation of released packages. This removes inconsistent camel-casing and underscores used in legacy repositories.

Structure

The repository is organized as follows:

```
projects/
  amdsmi/
  aqlprofile/
  clr/
  hip/
  hipother/
  hip-tests/
  rccl/
  rdc/
  rocm-core/
  rocminfo/
  rocmsmilib/
  rocprofiler/
  rocprofiler-compute/
  rocprofiler-register/
  rocprofiler-sdk/
  rocprofiler-systems/
  rocrruntime/
  rocshmem/
  roctracer/
```
  • Each folder under projects/ corresponds to a ROCm systems project that was previously maintained in a standalone GitHub repository and released as distinct packages.
  • Each folder under shared/ contains code that existed in its own repository and is used as a dependency by multiple projects, but does not produce its own distinct packages in previous ROCm releases.

Goals

  • Enable unified build and test workflows across ROCm libraries.
  • Facilitate shared tooling, CI, and contributor experience.
  • Improve integration, visibility, and collaboration across ROCm library teams.

Getting Started

To begin contributing or building, see the CONTRIBUTING.md guide. It includes setup instructions, sparse-checkout configuration, development workflow, and pull request guidelines.

License

This super-repo contains multiple subprojects, each of which retains the license under which it was originally published.

πŸ“ Refer to the LICENSE, LICENSE.md, or LICENSE.txt file within each projects/ or shared/ directory for specific license terms. πŸ“„ Refer to the header notice in individual files outside projects/ or shared/ folders for their specific license terms.

Note: The root of this repository does not define a unified license across all components.

Questions or Feedback?

We're happy to help!
