Releases: ROCm/mori
v1.1.1
Highlights
- Profiler Support: New profiler integration for EP intranode dispatch/combine, with JIT kernel support and proper packaging
- EP Auto-Tuning: New tuning config system that stores benchmark-derived optimal launch parameters (block_num, warp_per_block, etc.) in JSON files, auto-loaded at runtime based on GPU arch, EP size, token count, and hidden dim. Enable with MORI_EP_LAUNCH_CONFIG_MODE=AUTO; falls back to hardcoded defaults
- AINIC Environment Auto-Detection Tool: Two new scripts for AINIC (ionic) NIC setup and validation:
  - env_check.sh: 6-step validation (firmware, QoS/DSCP/PFC, DCQCN, intra/inter-node BW & latency). Usage: ./tools/env_check.sh [peer_ip] (omit peer_ip for intra-node checks only)
  - env_setup.sh: configures DCQCN and exports MORI_RDMA_SL/MORI_RDMA_TC. Usage: source tools/env_setup.sh [mori|dcqcn|all]
- SGLang EPLB Support: New call_local_expert_count API for SGLang Expert-Parallel Load Balancing, exposing per-rank local expert counts via op.local_expert_count
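The EP auto-tuning lookup in the highlights can be pictured with a small sketch. Only MORI_EP_LAUNCH_CONFIG_MODE=AUTO, block_num, and warp_per_block come from the release notes. The JSON layout, the key names (arch, ep_size, hidden_dim, max_tokens), the default values, and the select_launch_config helper are hypothetical and chosen for illustration; mori's actual config format and matching rules may differ.

```python
import json
import os

# Hypothetical hardcoded defaults used when no tuned entry matches.
DEFAULT_LAUNCH_CONFIG = {"block_num": 64, "warp_per_block": 8}

# Hypothetical tuning file contents; mori's real JSON schema may differ.
TUNING_JSON = """
[
  {"arch": "gfx942", "ep_size": 8, "hidden_dim": 7168, "max_tokens": 128,
   "block_num": 32, "warp_per_block": 4},
  {"arch": "gfx942", "ep_size": 8, "hidden_dim": 7168, "max_tokens": 4096,
   "block_num": 80, "warp_per_block": 16}
]
"""


def select_launch_config(arch, ep_size, num_tokens, hidden_dim, mode=None):
    """Pick tuned launch parameters, falling back to the defaults."""
    if mode is None:
        mode = os.environ.get("MORI_EP_LAUNCH_CONFIG_MODE", "")
    if mode.upper() != "AUTO":
        return dict(DEFAULT_LAUNCH_CONFIG)

    entries = json.loads(TUNING_JSON)
    # Keep entries for this hardware and problem shape whose token bucket fits.
    candidates = [
        e for e in entries
        if e["arch"] == arch
        and e["ep_size"] == ep_size
        and e["hidden_dim"] == hidden_dim
        and num_tokens <= e["max_tokens"]
    ]
    if not candidates:
        return dict(DEFAULT_LAUNCH_CONFIG)
    # Choose the tightest token bucket that still covers num_tokens.
    best = min(candidates, key=lambda e: e["max_tokens"])
    return {"block_num": best["block_num"],
            "warp_per_block": best["warp_per_block"]}


print(select_launch_config("gfx942", 8, 100, 7168, mode="AUTO"))
```

In this sketch, any miss (unknown arch, unlisted shape, or a token count above every bucket) returns the defaults, mirroring the fallback behavior the release notes describe.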
What's Changed
EP (Expert Parallelism)
- Add profiler support to intranode dispatch/combine (#285)
- Add tuning config system for dispatch/combine (#242)
- Add sgl EPLB support (#254)
- Fix EP index overflow (#260)
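For readers unfamiliar with EPLB, the quantity behind "per-rank local expert counts" can be sketched in plain Python. This is a conceptual illustration, not mori's implementation or API: the count_local_experts helper is hypothetical, and the contiguous expert-to-rank placement is an assumption.

```python
from collections import Counter


def count_local_experts(topk_ids, rank, experts_per_rank):
    """Count routed tokens for each expert hosted on `rank`.

    topk_ids: list of per-token lists of global expert ids.
    Assumes experts are placed contiguously: rank r hosts experts
    [r * experts_per_rank, (r + 1) * experts_per_rank).
    """
    first = rank * experts_per_rank
    counts = Counter(
        eid
        for token in topk_ids
        for eid in token
        if first <= eid < first + experts_per_rank
    )
    return [counts.get(first + i, 0) for i in range(experts_per_rank)]


# Three tokens, top-2 routing over 4 experts split across 2 ranks.
topk_ids = [[0, 3], [2, 3], [1, 3]]
print(count_local_experts(topk_ids, rank=0, experts_per_rank=2))  # [1, 1]
print(count_local_experts(topk_ids, rank=1, experts_per_rank=2))  # [1, 3]
```

A load balancer can use counts like these to spot hot experts (expert 3 above) and rebalance placement.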
IO / RDMA
- Improve RDMA flush diagnostics and log hygiene (#277)
- Disable auto XGMI backend for cross-process IPC transfers (#267)
- Avoid QPN collisions in multi-NIC setups (#261)
Shmem
- Add FlyDSL integration for mori shmem device API (#280)
Profiler
- Fix profile mode detection and doc corrections (#281)
- Fix profile mode for JIT kernel (#265)
- Enable JAX gfx950 CI tests and wrap profiler macros (#282)
- Include tools/profiler in JIT sources (#270)
UMBP
- Release GIL for all UMBPClient methods to fix distributed-mode scheduler starvation (#262)
Packaging & Build
- Only pull SPDK submodule when BUILD_UMBP=ON (#283)
Tools
- Auto detect ionic env problem for AINIC platforms (#278)
CI & Benchmark
- Add MI355X-AINIC platform with podman, fix build for non-mlx5 NICs (#263)
- Add MI300X_BNXT platform with Thor2 NIC support (#291)
- Switch intranode tests to MI355X-AINIC-TW (#288)
- Add shmem benchmark tests (#272, #290)
Doc
- Update arch image & add UMBP description (#287)
v1.1.0
Highlights
- Universal Wheel Packaging — Torch-free pybind with framework-agnostic GPU tensor interop, host/device separation (zero hipcc at install), and JIT hardware auto-detection for GPU arch and NIC type
- EP Enhancements — C++ launch API, async kernel optimization, and memory footprint reduction
- JAX Support — JAX integration via XLA FFI custom calls
EP (Expert Parallelism)
- C++ launch API with AOT + JIT cache support (#195)
- Async kernel optimization (#185) and memory footprint reduction (#245)
- Runtime hidden_dim (#189), runtime dtype (#221), independent combine dtype (#239)
- float8_e8m0fnu support (#215)
- Fix intranode overflow when token count >= 65536 (#252) and async kernel recv launch bug (#256)
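The 65536 threshold in the overflow fix equals 2^16, which is where a 16-bit index wraps around. The release notes do not state the root cause, so the sketch below only illustrates that general failure mode under the assumption of a too-narrow index type; the helper names are hypothetical.

```python
import ctypes


def pack_index_u16(token_idx):
    """Store a token index in 16 bits, as a too-narrow index type would."""
    # ctypes integer types do no overflow checking; the value wraps mod 2**16.
    return ctypes.c_uint16(token_idx).value


def pack_index_u32(token_idx):
    """Store the same index in 32 bits, which covers the range intact."""
    return ctypes.c_uint32(token_idx).value


for idx in (65535, 65536, 65537):
    print(idx, pack_index_u16(idx), pack_index_u32(idx))
```

With 16 bits, index 65536 aliases index 0, so two different tokens would map to the same slot; widening the type removes the collision.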
IO / RDMA / XGMI
- SQ-depth admission control with robust batch posting/rollback (#188)
- Optimize XGMI batch_write/batch_read for non-contiguous buffer transfers (#205)
- JIT-compile XGMI scatter/gather kernel (#232)
- Harden RDMA CQE handling (#241)
- Fix XGMI cross-process GPU routing and polling hangs (#247)
- Hidden-device XGMI IPC support (#258)
- Runtime dlopen for vendor dv libraries, removing compile-time linking (#237)
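The runtime dlopen pattern from the last item can be sketched with Python's ctypes, which calls dlopen() under the hood on Linux. The try_load_first helper and the library file names below are illustrative guesses, not the names mori actually probes.

```python
import ctypes


def try_load_first(candidates):
    """Return (name, handle) for the first library that loads.

    Returns (None, None) if none of the candidates can be opened, so the
    caller can disable that NIC backend instead of failing at import time.
    """
    for name in candidates:
        try:
            return name, ctypes.CDLL(name)
        except OSError:
            continue  # Library not installed on this host; try the next one.
    return None, None


# Hypothetical vendor library names, for illustration only.
name, handle = try_load_first(["libmlx5.so.1", "libbnxt_re.so", "libionic.so"])
print("loaded:", name)
```

Because nothing is linked at compile time, the same binary can run on hosts with different NIC vendors and pick up whichever library is present.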
Shmem
- GET (remote read) device API with blocking and non-blocking modes (#193)
- Increase default static heap to 4GB and VMM heap to 16GB (#204)
- Expose mori_shmem_free_tensor for triton-dist low-latency kernel (#224)
Packaging & Build
- Universal wheel with host/device separation and JIT hardware auto-detection (#182)
- CMake C++ integration, lazy imports, optional MPI dependency (#203)
- Build mori as a pip package (#236)
- Unified NIC detection logic (#186, #212)
JAX
UMBP (WIP)
- Local client with DRAM/SSD tiered storage (#191)
- Segmented SSD log with pluggable IO backend (#206)
- Control plane implementation (#180) and distributed integration (#209)
- SPDK storage backend (#213) and multitenant SPDK proxy (#223)
CI & Tooling
release v0.1.0