This repository contains the implementation and experimental code for the paper:
UniOrch: A Unified Mixed Framework for High-Efficiency LLM Training on Heterogeneous AI Chips
Submitted to IEEE Transactions on Parallel and Distributed Systems (TPDS)
UniOrch is a holistic coordination framework that provides unified control over heterogeneous AI chips (GPUs, NPUs, and DCUs) in data-center environments. It addresses the virtualization overhead, protocol fragmentation, and network partitioning that hinder efficient Large Language Model (LLM) training.
- **Hardware Unification Layer**: Consolidates GPUs, NPUs, and DCUs into a bare-metal cloud through programmable gateways and BGP EVPN overlay networks.
- **Software Standardization Engine**: A PyTorch-based Hardware Abstraction Layer (HAL) that masks hardware differences through standardized CUDA/CANN/ROCm mapping.
- **TCCL (Transformer Collective Communication Library)**: Unifies NCCL, HCCL, and OpenMPI protocols for efficient parameter synchronization across heterogeneous clusters.
- **HHEM (Heterogeneous Hybrid Estimation Model)**: Core scheduling engine with a two-stage cost model combining static analysis and dynamic runtime feedback.
```
UniOrch/
├── src/                                        # Source code
│   ├── hhem/                                   # HHEM scheduler implementation
│   │   ├── __init__.py                         # Module initialization
│   │   ├── cost_model.py                       # Two-stage cost model
│   │   ├── scheduler.py                        # Algorithm 1: HHEM_Scheduler
│   │   └── hybrid_placement.py                 # Algorithm 2: Transformer-Graph Hybrid Placement
│   ├── tccl/                                   # TCCL communication library
│   │   ├── __init__.py                         # Module initialization
│   │   └── collective_comm.py                  # Collective communication primitives & fault tolerance
│   └── training/                               # Training modules
│       ├── __init__.py                         # Module initialization
│       └── mixed_training.py                   # Mixed Parallel Training (DP × PP × TP)
├── experiments/                                # Experimental scripts
│   ├── cost_model/
│   │   └── evaluate_two_stage_model.py         # Two-stage cost model evaluation
│   ├── collective_comm/
│   │   └── benchmark_collective_ops.py         # Communication performance benchmark
│   ├── verify_all_algorithms.py                # Comprehensive algorithm verification
│   └── generate_paper_figures.py               # Paper figure generation
├── data/                                       # Experimental data (JSON format)
│   ├── two_stage_model_evaluation.json         # Cost model evaluation results
│   ├── collective_comm_benchmark_results.json  # Communication benchmark results
│   └── verification_results.json               # Algorithm verification results
├── figures/                                    # Generated figures (11 PNG files)
│   ├── network_topology.png                    # Network topology diagram
│   ├── two_stage_cost_model_evaluation.png     # Cost model evaluation
│   ├── two_stage_model_detailed.png            # Detailed two-stage model
│   ├── collective_comm_comparison.png          # Communication comparison
│   ├── collective_comm_detailed.png            # Detailed communication analysis
│   ├── tccl_speedup.png                        # TCCL speedup results
│   ├── fault_tolerance.png                     # Fault tolerance demonstration
│   ├── mixed_parallel_strategy.png             # Mixed parallelism strategy
│   ├── scalability_analysis.png                # Scalability analysis
│   ├── hardware_comparison_table.png           # Hardware comparison
│   └── training_performance_stats.png          # Training performance statistics
├── logs/                                       # Execution logs
│   ├── hhem_cost_model.log                     # Cost model experiment log
│   ├── collective_comm.log                     # Communication experiment log
│   └── paper_figures.log                       # Figure generation log
├── requirements.txt                            # Python dependencies
├── run_all_experiments.sh                      # One-click verification script
└── README.md                                   # This file
```
- Python 3.8+
- NumPy >= 1.20.0
- Matplotlib >= 3.4.0
```bash
# Clone the repository
git clone https://github.com/wangjingyi34/UniOrch.git
cd UniOrch

# Install dependencies
pip install -r requirements.txt
```

Run all experiments and verify all algorithms with a single command:

```bash
chmod +x run_all_experiments.sh
./run_all_experiments.sh
```

This will:
- Evaluate the HHEM two-stage cost model
- Benchmark collective communication performance
- Verify all 5 core algorithms
```bash
# Evaluate the two-stage cost model
python experiments/cost_model/evaluate_two_stage_model.py

# Benchmark collective communication
python experiments/collective_comm/benchmark_collective_ops.py

# Generate paper figures
python experiments/generate_paper_figures.py

# Verify all algorithms
python experiments/verify_all_algorithms.py
```

Transformer-aware heterogeneous scheduling algorithm with:
- Node priority scoring: `S_i = α*(1 - Util_i) + β*W_i`
- Atomic allocation and rollback mechanism
- Time budget constraint checking
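The node priority score `S_i` can be sketched in plain Python. The weights `alpha` and `beta` and the example utilization values below are illustrative assumptions, not values taken from the paper:

```python
def priority_score(utilization: float, workload_weight: float,
                   alpha: float = 0.6, beta: float = 0.4) -> float:
    """Compute S_i = alpha * (1 - Util_i) + beta * W_i.

    Idle capacity (1 - utilization) and the workload weight W_i are both
    assumed to be normalized to [0, 1]; alpha and beta are tunable weights.
    """
    return alpha * (1.0 - utilization) + beta * workload_weight

# A lightly loaded node outranks a busy one with the same workload weight.
nodes = {"NPU_0": (0.20, 0.5), "GPU_1": (0.85, 0.5)}
ranked = sorted(nodes, key=lambda n: priority_score(*nodes[n]), reverse=True)
print(ranked)  # ['NPU_0', 'GPU_1']
```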
```python
from src.hhem.scheduler import HHEMScheduler, SchedulingTask, HeterogeneousNode

# Create scheduler
scheduler = HHEMScheduler()

# Add heterogeneous nodes
scheduler.add_node(HeterogeneousNode(
    node_id="NPU_910ProB_Node_0",
    chip_type="NPU_910ProB",
    tflops=320.0,
    memory_gb=64.0,
    num_chips=8
))

# Schedule a task
result = scheduler.schedule(task)
```

Hybrid placement algorithm for Transformer models:
- Layer profiling and affinity scoring
- Cluster assignment optimization
- Pipeline construction and RDMA scheduling
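The affinity-scoring and cluster-assignment steps can be illustrated with a minimal greedy sketch. The `ClusterProfile` type, the affinity function, and all numbers below are illustrative assumptions, not the repository's API:

```python
from dataclasses import dataclass

@dataclass
class ClusterProfile:  # hypothetical, for illustration only
    name: str
    tflops: float      # aggregate compute of the cluster
    free_layers: int   # remaining layer slots

def affinity(layer_flops: float, cluster: ClusterProfile) -> float:
    """Toy affinity score: prefer clusters with more compute per unit of layer work."""
    return cluster.tflops / layer_flops

def greedy_assign(layer_flops_list, clusters):
    """Assign each layer to the highest-affinity cluster that still has capacity."""
    placement = {}
    for i, flops in enumerate(layer_flops_list):
        candidates = [c for c in clusters if c.free_layers > 0]
        best = max(candidates, key=lambda c: affinity(flops, c))
        best.free_layers -= 1
        placement[i] = best.name
    return placement

clusters = [ClusterProfile("GPU_A100", 312.0, 2), ClusterProfile("NPU_910", 320.0, 2)]
placement = greedy_assign([10.0, 10.0, 10.0], clusters)
print(placement)  # {0: 'NPU_910', 1: 'NPU_910', 2: 'GPU_A100'}
```

The real algorithm additionally builds pipeline stages and schedules RDMA transfers between clusters; this sketch covers only the assignment step.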
```python
from src.hhem.hybrid_placement import HybridPlacementEngine, TransformerModel

# Create placement engine
engine = HybridPlacementEngine()

# Create model and hardware profiles
model = TransformerModel(name="GLM-130B", num_layers=70, hidden_size=12288)

# Execute placement
result = engine.place(model, hardware_profiles)
```

Sophisticated cost estimation combining:
- **Stage 1 (Static Analysis)**: Offline analysis using operator FLOPs and hardware specs
- **Stage 2 (Dynamic Feedback)**: Runtime correction with learned factors

`Cost_effective = Cost_static × α_runtime`
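A minimal sketch of this two-stage estimate, assuming a simple compute-bound static model and an EMA-updated runtime correction factor (both are illustrative choices, not the repository's implementation):

```python
class TwoStageCost:
    """Stage 1: static FLOPs/throughput estimate.
    Stage 2: multiplicative runtime correction learned from measurements."""

    def __init__(self, peak_tflops: float, efficiency: float = 0.5):
        self.peak = peak_tflops * efficiency  # achievable TFLOPS (assumed)
        self.alpha = 1.0                      # runtime correction factor

    def static_cost(self, op_tflops: float) -> float:
        """Offline estimate in seconds from operator FLOPs alone."""
        return op_tflops / self.peak

    def observe(self, op_tflops: float, measured_s: float, ema: float = 0.9):
        """Update alpha from a measured runtime (exponential moving average)."""
        ratio = measured_s / self.static_cost(op_tflops)
        self.alpha = ema * self.alpha + (1 - ema) * ratio

    def effective_cost(self, op_tflops: float) -> float:
        """Cost_effective = Cost_static * alpha_runtime."""
        return self.static_cost(op_tflops) * self.alpha

model = TwoStageCost(peak_tflops=320.0)
model.observe(16.0, measured_s=0.125)   # measured slower than the static estimate
print(model.effective_cost(16.0))       # ~0.1025 s after correction
```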
Combines three parallelism strategies:
- **Data Parallelism (DP)**: Distributes data across devices
- **Pipeline Parallelism (PP)**: Splits model layers across stages
- **Tensor Parallelism (TP)**: Shards individual layers
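The three degrees multiply into the total device count, and the global batch is split across data-parallel replicas. A quick sanity-check helper (illustrative, not part of the repository):

```python
def parallel_layout(dp: int, pp: int, tp: int, global_batch: int):
    """Return (world_size, per-replica batch) for a DP x PP x TP layout."""
    world_size = dp * pp * tp  # total devices required
    assert global_batch % dp == 0, "global batch must divide evenly across DP replicas"
    return world_size, global_batch // dp

# The configuration used in the example below: 4 x 4 x 2 = 32 devices,
# and each DP replica sees 256 / 4 = 64 samples per global batch.
print(parallel_layout(dp=4, pp=4, tp=2, global_batch=256))  # (32, 64)
```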
```python
from src.training.mixed_training import MixedParallelTrainer, TrainingConfig

config = TrainingConfig(
    model_name="GLM-130B",
    dp_degree=4,
    pp_degree=4,
    tp_degree=2,
    global_batch_size=256
)

trainer = MixedParallelTrainer(config)
results = trainer.train(num_batches=10)
```

Unified communication library with:
- Protocol conversion (NCCL, HCCL, OpenMPI)
- Topology-aware ring algorithm
- RDMA acceleration
- Fault tolerance (heartbeat monitoring, CRC32 checksums)
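The CRC32 integrity check used for fault tolerance can be sketched with the standard library. The framing below (a 4-byte checksum appended to each payload) is an illustrative assumption about how a transfer might be verified, not TCCL's wire format:

```python
import zlib

def frame(payload: bytes) -> bytes:
    """Append a CRC32 checksum so the receiver can detect corruption."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")

def unframe(framed: bytes) -> bytes:
    """Verify the trailing CRC32; raise if the payload was corrupted in flight."""
    payload, crc = framed[:-4], int.from_bytes(framed[-4:], "big")
    if zlib.crc32(payload) != crc:
        raise ValueError("CRC32 mismatch: payload corrupted, request retransmission")
    return payload

msg = b"gradient shard 7"
assert unframe(frame(msg)) == msg                 # a clean transfer passes
corrupted = bytearray(frame(msg)); corrupted[0] ^= 0xFF  # flip one payload bit
try:
    unframe(bytes(corrupted))
except ValueError:
    print("corruption detected")
```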
All 5 core algorithms have been verified:

| Algorithm | File | Status |
|---|---|---|
| HHEM_Scheduler | `src/hhem/scheduler.py` | ✅ PASSED |
| Hybrid Placement | `src/hhem/hybrid_placement.py` | ✅ PASSED |
| Two-Stage Cost Model | `src/hhem/cost_model.py` | ✅ PASSED |
| Mixed Parallel Training | `src/training/mixed_training.py` | ✅ PASSED |
| TCCL Collective Comm | `src/tccl/collective_comm.py` | ✅ PASSED |
- **Resource Utilization**: 92% (a 35% improvement over the baseline)
- **Cross-chip Latency**: 15 ms (a 42% reduction)
- **Allreduce Efficiency**: 2.1 GB/s (a 75% improvement)
- **Workload Imbalance**: reduced by 45%
- GLM-130B
- BLOOM-176B
- LayoutLMv2
- YOLOv5
| File | Description |
|---|---|
| `data/two_stage_model_evaluation.json` | Cost model evaluation metrics |
| `data/collective_comm_benchmark_results.json` | Communication performance data |
| `data/verification_results.json` | Algorithm verification results |
| Figure | Description |
|---|---|
| `network_topology.png` | Spine-Leaf network topology |
| `two_stage_cost_model_evaluation.png` | Two-stage model performance |
| `collective_comm_comparison.png` | TCCL vs. baseline comparison |
| `fault_tolerance.png` | Fault tolerance mechanism |
| `mixed_parallel_strategy.png` | DP×PP×TP strategy layout |
| `scalability_analysis.png` | Scalability bottleneck analysis |
If you use UniOrch in your research, please cite:
```bibtex
@article{wang2025uniorch,
  title={UniOrch: A Unified Mixed Framework for High-Efficiency LLM Training on Heterogeneous AI Chips},
  author={Wang, Jingyi and Wang, Jia and Gao, Jun and others},
  journal={IEEE Transactions on Parallel and Distributed Systems},
  year={2025},
  note={Under Review}
}
```

This project is licensed under the MIT License; see the LICENSE file for details.
We thank China Construction Bank for supporting the deployment verification of this work, and the anonymous reviewers for their valuable feedback.
For questions or issues, please open an issue on GitHub or contact the authors.