[DistInf] Enable RDMA over Ionic AINICs for MoRI EP disaggregated inference #147

Draft
raviguptaamd wants to merge 2 commits into ROCm:develop from raviguptaamd:feat/ionic-rdma-mori-ep

@raviguptaamd
Contributor

Summary

  • Enable MoRI IO KV cache transfer over Ionic RDMA NICs on clusters where public IPs are not routable between compute nodes
  • Mount host RDMA libraries (libionic.so, libibverbs, librdmacm, and their provider directories) into the container so MoRI IO can discover Ionic NICs
  • Set VLLM_HOST_IP to each node's overlay IP so that the MoRIIO control plane (ZMQ handshake, block allocation notifications, proxy registration) routes through the overlay network instead of the unreachable public IPs
  • Pass through MORI RDMA env vars (MORI_IB_GID_INDEX, MORI_RDMA_DEVICES, MORI_IO_LOG_LEVEL) from the launcher into the container
  • Switch from Docker to Podman for rootless container execution
  • Use --overlap on srun commands so that additional steps can share resources with the running SLURM job step instead of blocking on it
  • Prefer 10.x.x.x overlay IPs for MASTER_ADDR and inter-node communication
  • Prefer MODEL_DIR for model path resolution before standard paths
  • Add PYTHONUNBUFFERED=1 for real-time Python log output
  • Add launch_mori_1p1d.sh convenience launcher for 1P/1D benchmarks
  • Update Dockerfile to install MORI from a pinned commit on main
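
The overlay-IP preference in the list above can be sketched as follows. This is a minimal illustration, not the code from this PR: `pick_overlay_ip` and `OVERLAY_NET` are hypothetical names, and treating the whole 10.0.0.0/8 range as "the overlay" is an assumption based on the 10.x.x.x wording.

```python
import ipaddress

# Assumption: the routable overlay network lives somewhere in 10.0.0.0/8.
OVERLAY_NET = ipaddress.ip_network("10.0.0.0/8")


def pick_overlay_ip(candidates):
    """Return the first candidate inside OVERLAY_NET.

    Falls back to the first candidate when none is in the overlay range,
    and to None when the list is empty.
    """
    parsed = [ipaddress.ip_address(c) for c in candidates]
    for addr in parsed:
        if addr.version == 4 and addr in OVERLAY_NET:
            return str(addr)
    return str(parsed[0]) if parsed else None


if __name__ == "__main__":
    # Documentation-range public address first, overlay address second.
    print(pick_overlay_ip(["192.0.2.10", "10.1.2.3"]))
```

The chosen address would then be exported as MASTER_ADDR and VLLM_HOST_IP on each node; how the candidate list is collected (for example from the node's interfaces) is left out here.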

Problem

On clusters with Ionic AINICs (back-end RDMA) and Broadcom NICs (front-end overlay network), the MoRIIO connector's get_ip() returns the public IP, which is not routable between compute nodes. As a result, the decode node cannot send block allocation notifications back to the prefill node. The prefill node waits for those notifications before it writes the KV cache, while the decode node waits for the KV transfer to arrive, so both sides hang indefinitely.

Solution

  1. Set VLLM_HOST_IP per node to the overlay IP (10.x.x.x); get_ip() checks this environment variable first
  2. Mount host Ionic RDMA libraries into the container so mori::io::RdmaManager can discover Ionic NICs
  3. Pass MORI_IB_GID_INDEX=1 to select the correct RoCE v2 GID for Ionic
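
Step 1 works because the environment override wins over address discovery. The sketch below illustrates that lookup order only; it is not vLLM's get_ip() implementation. `resolve_host_ip` is a hypothetical name, and the UDP-probe fallback (including the 8.8.8.8 probe target) is an assumption chosen to show how a node can end up reporting a non-overlay address when no override is set.

```python
import os
import socket


def resolve_host_ip(env=None):
    """Return VLLM_HOST_IP when set, otherwise a best-effort local address."""
    env = os.environ if env is None else env

    # The explicit override is consulted first, so the launcher can pin
    # each node to its overlay address.
    override = env.get("VLLM_HOST_IP")
    if override:
        return override

    # Assumed fallback: ask the kernel which source address it would use
    # for an outbound route. connect() on a UDP socket sends no packets.
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as probe:
        try:
            probe.connect(("8.8.8.8", 80))
            return probe.getsockname()[0]
        except OSError:
            return "127.0.0.1"


if __name__ == "__main__":
    print(resolve_host_ip({"VLLM_HOST_IP": "10.0.0.5"}))
```

With the variable set per node, every control-plane endpoint the connector advertises uses the overlay address, which is what lets the decode node reach the prefill node.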

Test Plan

  • DeepSeek-V3 1P/1D on 2x MI355X (8 GPUs each) with Ionic AINICs
  • Full benchmark suite: ISL/OSL 1024/1024, 8192/1024, 1024/8192
  • Concurrency sweep: 8, 16, 32, 64, 128, 256, 512
  • All requests successful, 0 failures across all configurations
  • RDMA over Ionic confirmed via RdmaBackend logs (nic=ionic)
  • MoRIIO handshake, KV transfer, and write worker all functional
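
For a sense of the size of the test matrix, the sketch below enumerates it under the assumption that the sweep is the full cross product of the listed ISL/OSL pairs and concurrency levels; the PR does not state this explicitly. `benchmark_matrix` is a hypothetical helper, not part of the benchmark suite.

```python
from itertools import product

# Values taken from the test plan above.
ISL_OSL_PAIRS = [(1024, 1024), (8192, 1024), (1024, 8192)]
CONCURRENCY_LEVELS = [8, 16, 32, 64, 128, 256, 512]


def benchmark_matrix():
    """Return (isl, osl, concurrency) tuples for every combination."""
    return [
        (isl, osl, conc)
        for (isl, osl), conc in product(ISL_OSL_PAIRS, CONCURRENCY_LEVELS)
    ]


if __name__ == "__main__":
    configs = benchmark_matrix()
    print(len(configs))  # 3 pairs x 7 concurrency levels
    for isl, osl, conc in configs:
        print(f"ISL={isl} OSL={osl} concurrency={conc}")
```

Under that assumption the "0 failures across all configurations" claim covers 21 runs.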

Made with Cursor

Ravi Gupta added 2 commits April 16, 2026 04:11
…erence

Enable MoRI IO KV cache transfer over Ionic RDMA NICs on clusters where
public IPs are not routable between compute nodes. Key changes:

- Mount host RDMA libraries (libionic, libibverbs, librdmacm) and provider
  directory into the container so MORI IO can discover Ionic NICs
- Set VLLM_HOST_IP to each node's overlay IP so MoRIIO control plane
  (ZMQ handshake, block allocation notifications, proxy registration)
  routes through the overlay network instead of unreachable public IPs
- Pass through MORI RDMA env vars (MORI_IB_GID_INDEX, MORI_RDMA_DEVICES,
  MORI_IO_LOG_LEVEL) from the launcher into the container
- Switch from docker to podman for rootless container execution
- Use --overlap on srun commands to avoid blocking the SLURM job step
- Prefer 10.x.x.x overlay IPs for MASTER_ADDR and inter-node comms
- Prefer MODEL_DIR for model path resolution before standard paths
- Add PYTHONUNBUFFERED=1 for real-time Python log output
- Add launch_mori_1p1d.sh convenience launcher for 1P/1D benchmarks
- Update Dockerfile to install MORI from pinned commit on main

Tested: DeepSeek-V3 1P/1D on 2x MI355X nodes with Ionic AINICs,
full benchmark suite (ISL/OSL: 1024/1024, 8192/1024, 1024/8192,
concurrency: 8-512), all requests successful with 0 failures.

Made-with: Cursor
…Ionic AINIC

Extends the Ionic AINIC RDMA support to multi-node disaggregated
inference with 2 Prefill + 2 Decode nodes (DP=16).

Key changes:
- Remove 1P/1D restriction from run_xPyD_models.slurm and
  vllm_disagg_mori_ep.sh to allow xP>1 / yD>1 topologies
- Add --ulimit memlock=-1:-1 to podman for large RDMA memory
  registrations (>32GB) required by MoRI IO
- Pass NCCL_IB_HCA, NCCL_IB_GID_INDEX, NCCL_NET_GDR_LEVEL,
  NCCL_CROSS_NIC, and MORI_SOCKET_IFNAME into containers for
  proper multi-node RCCL and MoRI bootstrap over Ionic AINICs
- Add apply_moriio_2pd_patches.sh for runtime vLLM patches
  (PR vllm-project/vllm#39276) fixing engine_id collisions and
  MoRIIO robustness in multi-node DP configurations
- Restrict --kv-transfer-config to master nodes only (child
  nodes join via --headless and participate in EP all-to-all)
- Add launch_mori_2p2d.sh example launcher for 2P/2D benchmarks

Tested on AAC MI355X cluster with Ionic RDMA NICs achieving
balanced RDMA traffic across all 4 nodes and 1,344 tok/s total
throughput on DeepSeek-V3-5layer.

Made-with: Cursor
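
The container-side changes from both commits (environment passthrough, PYTHONUNBUFFERED, and the memlock ulimit) can be pictured as one argument list. This is a sketch only: the launcher scripts themselves are not shown on this page, `podman_run_args` is a hypothetical helper, and the image name, library mounts, and command are deliberately omitted. The variable names and the `--ulimit memlock=-1:-1` flag come from the commit messages above.

```python
# Variables the commits say are forwarded from the launcher environment.
PASSTHROUGH_VARS = (
    "MORI_IB_GID_INDEX",
    "MORI_RDMA_DEVICES",
    "MORI_IO_LOG_LEVEL",
    "MORI_SOCKET_IFNAME",
    "NCCL_IB_HCA",
    "NCCL_IB_GID_INDEX",
    "NCCL_NET_GDR_LEVEL",
    "NCCL_CROSS_NIC",
)


def podman_run_args(env, host_ip):
    """Build the leading part of a `podman run` command line.

    Only variables actually present in `env` are forwarded, so unset
    variables keep whatever default applies inside the container.
    """
    args = [
        "podman", "run",
        # Unlimited locked memory for large RDMA registrations.
        "--ulimit", "memlock=-1:-1",
        "-e", f"VLLM_HOST_IP={host_ip}",
        "-e", "PYTHONUNBUFFERED=1",
    ]
    for name in PASSTHROUGH_VARS:
        if name in env:
            args += ["-e", f"{name}={env[name]}"]
    return args


if __name__ == "__main__":
    print(" ".join(podman_run_args({"MORI_IB_GID_INDEX": "1"}, "10.0.0.5")))
```

Forwarding only the variables that are set keeps the launcher usable on clusters that do not need every RDMA or RCCL override.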