Releases: ROCm/mori

v1.1.1

24 Apr 08:12
b8af93f

Highlights

  • Profiler Support — New profiler integration for EP intranode dispatch/combine with JIT kernel support and proper packaging
  • EP Auto-Tuning — New tuning config system that stores benchmark-derived optimal launch parameters (block_num, warp_per_block, etc.) in JSON files, auto-loaded at runtime based on GPU arch, EP size, token count, and hidden dim. Enable with MORI_EP_LAUNCH_CONFIG_MODE=AUTO, with fallback to hardcoded defaults
  • AINIC Environment Auto-Detection Tool — Two new scripts for AINIC (ionic) NIC setup and validation:
    • env_check.sh — 6-step validation (firmware, QoS/DSCP/PFC, DCQCN, intra/inter-node BW & latency). Usage: ./tools/env_check.sh [peer_ip] (omit peer_ip for intra-node checks only)
    • env_setup.sh — configures DCQCN and exports MORI_RDMA_SL/MORI_RDMA_TC. Usage: source tools/env_setup.sh [mori|dcqcn|all]
  • SGLang EPLB Support — New call_local_expert_count API for SGLang Expert-Parallel Load Balancing, exposing per-rank local expert counts via op.local_expert_count
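The EP auto-tuning lookup described above can be illustrated with a minimal sketch. Only the environment variable MORI_EP_LAUNCH_CONFIG_MODE=AUTO and the parameter names block_num and warp_per_block come from these notes. The file naming scheme, the JSON field names (max_tokens, hidden_dim), the default values, and the function load_launch_config are assumptions made for illustration and are not mori's actual layout or API.

```python
import json
import os
from pathlib import Path

# Stand-in for the hardcoded defaults; the numbers are made up for this sketch.
DEFAULT_LAUNCH = {"block_num": 64, "warp_per_block": 8}


def load_launch_config(config_dir, gpu_arch, ep_size, num_tokens, hidden_dim):
    """Return tuned launch parameters for a shape, or the defaults.

    Hypothetical layout: one JSON file per (GPU arch, EP size), containing a
    list of entries with "max_tokens", "hidden_dim", and launch parameters.
    """
    # Tuned configs are only consulted when the mode is AUTO.
    if os.environ.get("MORI_EP_LAUNCH_CONFIG_MODE", "").upper() != "AUTO":
        return dict(DEFAULT_LAUNCH)

    path = Path(config_dir) / f"{gpu_arch}_ep{ep_size}.json"
    if not path.is_file():
        return dict(DEFAULT_LAUNCH)  # no tuning data for this arch / EP size

    entries = json.loads(path.read_text())
    # Keep entries for this hidden dim whose token bucket covers num_tokens.
    candidates = [
        e for e in entries
        if e["hidden_dim"] == hidden_dim and e["max_tokens"] >= num_tokens
    ]
    if not candidates:
        return dict(DEFAULT_LAUNCH)  # shape not covered by the benchmarks

    # Choose the tightest bucket that still fits.
    best = min(candidates, key=lambda e: e["max_tokens"])
    return {
        "block_num": best["block_num"],
        "warp_per_block": best["warp_per_block"],
    }
```

In this sketch, any miss (mode not AUTO, no file for the arch, or no bucket covering the token count) returns the defaults, which mirrors the fallback behavior the highlight describes.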

What's Changed

EP (Expert Parallelism)

  • Add profiler support to intranode dispatch/combine (#285)
  • Add tuning config system for dispatch/combine (#242)
  • Add SGLang EPLB support (#254)
  • Fix EP index overflow (#260)

IO / RDMA

  • Improve RDMA flush diagnostics and log hygiene (#277)
  • Disable auto XGMI backend for cross-process IPC transfers (#267)
  • Avoid QPN collisions in multi-NIC setups (#261)

Shmem

  • Add FlyDSL integration for mori shmem device API (#280)

Profiler

  • Fix profile mode detection and doc corrections (#281)
  • Fix profile mode for JIT kernel (#265)
  • Enable JAX gfx950 CI tests and wrap profiler macros (#282)
  • Include tools/profiler in JIT sources (#270)

UMBP

  • Release GIL for all UMBPClient methods to fix distributed-mode scheduler starvation (#262)

Packaging & Build

  • Only pull SPDK submodule when BUILD_UMBP=ON (#283)

Tools

  • Auto-detect ionic environment problems on AINIC platforms (#278)

CI & Benchmark

  • Add MI355X-AINIC platform with podman, fix build for non-mlx5 NICs (#263)
  • Add MI300X_BNXT platform with Thor2 NIC support (#291)
  • Switch intranode tests to MI355X-AINIC-TW (#288)
  • Add shmem benchmark tests (#272, #290)

Doc

  • Update architecture image and add UMBP description (#287)

v1.1.0

13 Apr 07:08
c4d5877

Highlights

  • Universal Wheel Packaging — Torch-free pybind with framework-agnostic GPU tensor interop, host/device separation (zero hipcc at install), and JIT hardware auto-detection for GPU arch and NIC type
  • EP Enhancements — C++ launch API, async kernel optimization, and memory footprint reduction
  • JAX Support — JAX integration via XLA FFI custom calls

EP (Expert Parallelism)

  • C++ launch API with AOT + JIT cache support (#195)
  • Async kernel optimization (#185) and memory footprint reduction (#245)
  • Runtime hidden_dim (#189), runtime dtype (#221), independent combine dtype (#239)
  • float8_e8m0fnu support (#215)
  • Fix intranode overflow when token count >= 65536 (#252) and async kernel recv launch bug (#256)

IO / RDMA / XGMI

  • SQ-depth admission control with robust batch posting/rollback (#188)
  • Optimize XGMI batch_write/batch_read for non-contiguous buffer transfers (#205)
  • JIT-compile XGMI scatter/gather kernel (#232)
  • Harden RDMA CQE handling (#241)
  • Fix XGMI cross-process GPU routing and polling hangs (#247)
  • Hidden-device XGMI IPC support (#258)
  • Runtime dlopen for vendor dv libraries, removing compile-time linking (#237)
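As a rough illustration of what "SQ-depth admission control with batch posting/rollback" can mean, the toy model below tracks outstanding work requests against a fixed send-queue depth. It is a generic sketch of the technique, not mori's implementation: the class SendQueue, its methods, and the post_fn callback are all hypothetical names.

```python
class SendQueue:
    """Toy model of a send queue with a fixed depth (hypothetical)."""

    def __init__(self, depth):
        self.depth = depth
        self.outstanding = 0  # work requests posted but not yet completed

    def try_post_batch(self, work_requests, post_fn):
        """Post a whole batch, or nothing if it would overflow the queue."""
        n = len(work_requests)
        # Admission control: reject the batch before touching the queue.
        if self.outstanding + n > self.depth:
            return False

        self.outstanding += n  # reserve slots for the whole batch up front
        posted = 0
        try:
            for wr in work_requests:
                post_fn(wr)  # stand-in for the real post-send call
                posted += 1
        except Exception:
            # Rollback: release slots reserved for requests never posted.
            self.outstanding -= n - posted
            raise
        return True

    def on_completion(self, count=1):
        """Free slots once completions have been polled."""
        self.outstanding -= count
```

A caller that gets False back would typically poll for completions, call on_completion, and retry. Requests that were posted before a mid-batch failure stay counted as outstanding because they still occupy queue slots.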

Shmem

  • GET (remote read) device API with blocking and non-blocking modes (#193)
  • Increase default static heap to 4GB and VMM heap to 16GB (#204)
  • Expose mori_shmem_free_tensor for triton-dist low-latency kernel (#224)

Packaging & Build

  • Universal wheel with host/device separation and JIT hardware auto-detection (#182)
  • CMake C++ integration, lazy imports, optional MPI dependency (#203)
  • Build mori as a pip package (#236)
  • Unified NIC detection logic (#186, #212)

JAX

  • XLA FFI custom calls integration (#226)
  • JAX intranode test job (#243)

UMBP (WIP)

  • Local client with DRAM/SSD tiered storage (#191)
  • Segmented SSD log with pluggable IO backend (#206)
  • Control plane implementation (#180) and distributed integration (#209)
  • SPDK storage backend (#213) and multitenant SPDK proxy (#223)

CI & Tooling

  • Overhaul CI workflow with hybrid container and multi-runner support (#217)
  • Pre-commit CI with auto-fix bot (#219, #230)
  • Sphinx documentation website (#181)

release v0.1.0

30 Mar 06:26
