This repository contains the implementation and experimental code for the paper:
UniOrch: A Unified Mixed Framework for High-Efficiency LLM Training on Heterogeneous AI Chips
Submitted to IEEE Transactions on Parallel and Distributed Systems (TPDS)
UniOrch is a holistic coordination framework that provides unified control over heterogeneous AI chips (GPUs, NPUs, and DCUs) in data-center environments. It addresses the virtualization overhead, protocol fragmentation, and network partitioning that hinder efficient Large Language Model (LLM) training.
- **Hardware Unification Layer**: Consolidates GPUs, NPUs, and DCUs into a bare-metal cloud through programmable gateways and BGP EVPN overlay networks.
- **Software Standardization Engine**: A PyTorch-based Hardware Abstraction Layer (HAL) that masks hardware differences through standardized CUDA/CANN/ROCm mapping.
- **TCCL (Transformer Collective Communication Library)**: Unifies NCCL, HCCL, and OpenMPI protocols for efficient parameter synchronization across heterogeneous clusters.
- **HHEM (Heterogeneous Hybrid Estimation Model)**: Core scheduling engine with a two-stage cost model combining static analysis and dynamic runtime feedback.
```
UniOrch/
├── src/                                        # Source code
│   ├── hhem/                                   # HHEM scheduler implementation
│   │   ├── __init__.py                         # Module initialization
│   │   ├── cost_model.py                       # Two-stage cost model
│   │   ├── scheduler.py                        # Algorithm 1: HHEM_Scheduler
│   │   └── hybrid_placement.py                 # Algorithm 2: Transformer-Graph Hybrid Placement
│   ├── tccl/                                   # TCCL communication library
│   │   ├── __init__.py                         # Module initialization
│   │   └── collective_comm.py                  # Collective communication primitives & fault tolerance
│   └── training/                               # Training modules
│       ├── __init__.py                         # Module initialization
│       └── mixed_training.py                   # Mixed Parallel Training (DP × PP × TP)
├── experiments/                                # Experimental scripts
│   ├── cost_model/
│   │   └── evaluate_two_stage_model.py         # Two-stage cost model evaluation
│   ├── collective_comm/
│   │   └── benchmark_collective_ops.py         # Communication performance benchmark
│   ├── verify_all_algorithms.py                # Comprehensive algorithm verification
│   └── generate_paper_figures.py               # Paper figure generation
├── data/                                       # Experimental data (JSON format)
│   ├── two_stage_model_evaluation.json         # Cost model evaluation results
│   ├── collective_comm_benchmark_results.json  # Communication benchmark results
│   └── verification_results.json               # Algorithm verification results
├── figures/                                    # Generated figures (11 PNG files)
│   ├── network_topology.png                    # Network topology diagram
│   ├── two_stage_cost_model_evaluation.png     # Cost model evaluation
│   ├── two_stage_model_detailed.png            # Detailed two-stage model
│   ├── collective_comm_comparison.png          # Communication comparison
│   ├── collective_comm_detailed.png            # Detailed communication analysis
│   ├── tccl_speedup.png                        # TCCL speedup results
│   ├── fault_tolerance.png                     # Fault tolerance demonstration
│   ├── mixed_parallel_strategy.png             # Mixed parallelism strategy
│   ├── scalability_analysis.png                # Scalability analysis
│   ├── hardware_comparison_table.png           # Hardware comparison
│   └── training_performance_stats.png          # Training performance statistics
├── logs/                                       # Execution logs
│   ├── hhem_cost_model.log                     # Cost model experiment log
│   ├── collective_comm.log                     # Communication experiment log
│   └── paper_figures.log                       # Figure generation log
├── requirements.txt                            # Python dependencies
├── run_all_experiments.sh                      # One-click verification script
└── README.md                                   # This file
```
- Python 3.8+
- NumPy >= 1.20.0
- Matplotlib >= 3.4.0
```bash
# Clone the repository
git clone https://github.com/wangjingyi34/UniOrch.git
cd UniOrch

# Install dependencies
pip install -r requirements.txt
```

Run all experiments and verify all algorithms with a single command:

```bash
chmod +x run_all_experiments.sh
./run_all_experiments.sh
```

This will:
- Evaluate the HHEM two-stage cost model
- Benchmark collective communication performance
- Verify all 5 core algorithms
```bash
# Evaluate the two-stage cost model
python experiments/cost_model/evaluate_two_stage_model.py

# Benchmark collective communication
python experiments/collective_comm/benchmark_collective_ops.py

# Generate paper figures
python experiments/generate_paper_figures.py

# Verify all algorithms
python experiments/verify_all_algorithms.py
```

Transformer-aware heterogeneous scheduling algorithm with:
- Node priority scoring: `S_i = α*(1 - Util_i) + β*W_i`
- Atomic allocation and rollback mechanism
- Time budget constraint checking
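The node priority score `S_i` can be sketched in plain Python. The weights `alpha` and `beta` and the example utilization values below are illustrative assumptions, not values taken from the paper:

```python
def priority_score(utilization: float, workload_weight: float,
                   alpha: float = 0.6, beta: float = 0.4) -> float:
    """Compute S_i = alpha * (1 - Util_i) + beta * W_i.

    Idle capacity (1 - utilization) and the workload weight W_i are both
    assumed to be normalized to [0, 1]; alpha and beta are tunable weights.
    """
    return alpha * (1.0 - utilization) + beta * workload_weight

# A lightly loaded node outranks a busy one with the same workload weight.
nodes = {"NPU_0": (0.20, 0.5), "GPU_1": (0.85, 0.5)}
ranked = sorted(nodes, key=lambda n: priority_score(*nodes[n]), reverse=True)
print(ranked)  # ['NPU_0', 'GPU_1']
```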
```python
from src.hhem.scheduler import HHEMScheduler, SchedulingTask, HeterogeneousNode

# Create scheduler
scheduler = HHEMScheduler()

# Add heterogeneous nodes
scheduler.add_node(HeterogeneousNode(
    node_id="NPU_910ProB_Node_0",
    chip_type="NPU_910ProB",
    tflops=320.0,
    memory_gb=64.0,
    num_chips=8
))

# Schedule a task
result = scheduler.schedule(task)
```

Hybrid placement algorithm for Transformer models:
- Layer profiling and affinity scoring
- Cluster assignment optimization
- Pipeline construction and RDMA scheduling
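The affinity-scoring and cluster-assignment steps can be illustrated with a minimal greedy sketch. The `ClusterProfile` type, the affinity function, and all numbers below are illustrative assumptions, not the repository's API:

```python
from dataclasses import dataclass

@dataclass
class ClusterProfile:  # hypothetical, for illustration only
    name: str
    tflops: float      # aggregate compute of the cluster
    free_layers: int   # remaining layer slots

def affinity(layer_flops: float, cluster: ClusterProfile) -> float:
    """Toy affinity score: prefer clusters with more compute per unit of layer work."""
    return cluster.tflops / layer_flops

def greedy_assign(layer_flops_list, clusters):
    """Assign each layer to the highest-affinity cluster that still has capacity."""
    placement = {}
    for i, flops in enumerate(layer_flops_list):
        candidates = [c for c in clusters if c.free_layers > 0]
        best = max(candidates, key=lambda c: affinity(flops, c))
        best.free_layers -= 1
        placement[i] = best.name
    return placement

clusters = [ClusterProfile("GPU_A100", 312.0, 2), ClusterProfile("NPU_910", 320.0, 2)]
placement = greedy_assign([10.0, 10.0, 10.0], clusters)
print(placement)  # {0: 'NPU_910', 1: 'NPU_910', 2: 'GPU_A100'}
```

The real algorithm additionally builds pipeline stages and schedules RDMA transfers between clusters; this sketch covers only the assignment step.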
```python
from src.hhem.hybrid_placement import HybridPlacementEngine, TransformerModel

# Create placement engine
engine = HybridPlacementEngine()

# Create model and hardware profiles
model = TransformerModel(name="GLM-130B", num_layers=70, hidden_size=12288)

# Execute placement
result = engine.place(model, hardware_profiles)
```

Sophisticated cost estimation combining:
- **Stage 1 (Static Analysis)**: Offline analysis using operator FLOPs and hardware specs
- **Stage 2 (Dynamic Feedback)**: Runtime correction with learned factors

`Cost_effective = Cost_static × α_runtime`
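A minimal sketch of this two-stage estimate, assuming a simple compute-bound static model and an EMA-updated runtime correction factor (both are illustrative choices, not the repository's implementation):

```python
class TwoStageCost:
    """Stage 1: static FLOPs/throughput estimate.
    Stage 2: multiplicative runtime correction learned from measurements."""

    def __init__(self, peak_tflops: float, efficiency: float = 0.5):
        self.peak = peak_tflops * efficiency  # achievable TFLOPS (assumed)
        self.alpha = 1.0                      # runtime correction factor

    def static_cost(self, op_tflops: float) -> float:
        """Offline estimate in seconds from operator FLOPs alone."""
        return op_tflops / self.peak

    def observe(self, op_tflops: float, measured_s: float, ema: float = 0.9):
        """Update alpha from a measured runtime (exponential moving average)."""
        ratio = measured_s / self.static_cost(op_tflops)
        self.alpha = ema * self.alpha + (1 - ema) * ratio

    def effective_cost(self, op_tflops: float) -> float:
        """Cost_effective = Cost_static * alpha_runtime."""
        return self.static_cost(op_tflops) * self.alpha

model = TwoStageCost(peak_tflops=320.0)
model.observe(16.0, measured_s=0.125)   # measured slower than the static estimate
print(model.effective_cost(16.0))       # ~0.1025 s after correction
```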
Combines three parallelism strategies:
- **Data Parallelism (DP)**: Distributes data across devices
- **Pipeline Parallelism (PP)**: Splits model layers across stages
- **Tensor Parallelism (TP)**: Shards individual layers
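The three degrees multiply into the total device count, and the global batch is split across data-parallel replicas. A quick sanity-check helper (illustrative, not part of the repository):

```python
def parallel_layout(dp: int, pp: int, tp: int, global_batch: int):
    """Return (world_size, per-replica batch) for a DP x PP x TP layout."""
    world_size = dp * pp * tp  # total devices required
    assert global_batch % dp == 0, "global batch must divide evenly across DP replicas"
    return world_size, global_batch // dp

# The configuration used in the example below: 4 x 4 x 2 = 32 devices,
# and each DP replica sees 256 / 4 = 64 samples per global batch.
print(parallel_layout(dp=4, pp=4, tp=2, global_batch=256))  # (32, 64)
```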
```python
from src.training.mixed_training import MixedParallelTrainer, TrainingConfig

config = TrainingConfig(
    model_name="GLM-130B",
    dp_degree=4,
    pp_degree=4,
    tp_degree=2,
    global_batch_size=256
)

trainer = MixedParallelTrainer(config)
results = trainer.train(num_batches=10)
```

Unified communication library with:
- Protocol conversion (NCCL, HCCL, OpenMPI)
- Topology-aware ring algorithm
- RDMA acceleration
- Fault tolerance (heartbeat monitoring, CRC32 checksums)
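The CRC32 integrity check used for fault tolerance can be sketched with the standard library. The framing below (a 4-byte checksum appended to each payload) is an illustrative assumption about how a transfer might be verified, not TCCL's wire format:

```python
import zlib

def frame(payload: bytes) -> bytes:
    """Append a CRC32 checksum so the receiver can detect corruption."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")

def unframe(framed: bytes) -> bytes:
    """Verify the trailing CRC32; raise if the payload was corrupted in flight."""
    payload, crc = framed[:-4], int.from_bytes(framed[-4:], "big")
    if zlib.crc32(payload) != crc:
        raise ValueError("CRC32 mismatch: payload corrupted, request retransmission")
    return payload

msg = b"gradient shard 7"
assert unframe(frame(msg)) == msg                 # a clean transfer passes
corrupted = bytearray(frame(msg)); corrupted[0] ^= 0xFF  # flip one payload bit
try:
    unframe(bytes(corrupted))
except ValueError:
    print("corruption detected")
```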
All 5 core algorithms have been verified:

| Algorithm | File | Status |
|---|---|---|
| HHEM_Scheduler | `src/hhem/scheduler.py` | ✅ PASSED |
| Hybrid Placement | `src/hhem/hybrid_placement.py` | ✅ PASSED |
| Two-Stage Cost Model | `src/hhem/cost_model.py` | ✅ PASSED |
| Mixed Parallel Training | `src/training/mixed_training.py` | ✅ PASSED |
| TCCL Collective Comm | `src/tccl/collective_comm.py` | ✅ PASSED |
- **Resource Utilization**: 92% (a 35% improvement over the baseline)
- **Cross-chip Latency**: 15 ms (a 42% reduction)
- **Allreduce Efficiency**: 2.1 GB/s (a 75% improvement)
- **Workload Imbalance**: reduced by 45%
- GLM-130B
- BLOOM-176B
- LayoutLMv2
- YOLOv5
| File | Description |
|---|---|
| `data/two_stage_model_evaluation.json` | Cost model evaluation metrics |
| `data/collective_comm_benchmark_results.json` | Communication performance data |
| `data/verification_results.json` | Algorithm verification results |
| Figure | Description |
|---|---|
| `network_topology.png` | Spine-Leaf network topology |
| `two_stage_cost_model_evaluation.png` | Two-stage model performance |
| `collective_comm_comparison.png` | TCCL vs. baseline comparison |
| `fault_tolerance.png` | Fault tolerance mechanism |
| `mixed_parallel_strategy.png` | DP×PP×TP strategy layout |
| `scalability_analysis.png` | Scalability bottleneck analysis |
If you use UniOrch in your research, please cite:
```bibtex
@article{wang2025uniorch,
  title={UniOrch: A Unified Mixed Framework for High-Efficiency LLM Training on Heterogeneous AI Chips},
  author={Wang, Jingyi and Wang, Jia and Gao, Jun and others},
  journal={IEEE Transactions on Parallel and Distributed Systems},
  year={2025},
  note={Under Review}
}
```

This project is licensed under the MIT License; see the LICENSE file for details.
We thank China Construction Bank for supporting the deployment verification of this work, and the anonymous reviewers for their valuable feedback.
For questions or issues, please open an issue on GitHub or contact the authors.