
UniOrch: A Unified Mixed Framework for High-Efficiency LLM Training on Heterogeneous AI Chips


This repository contains the implementation and experimental code for the paper:

UniOrch: A Unified Mixed Framework for High-Efficiency LLM Training on Heterogeneous AI Chips

Submitted to IEEE Transactions on Parallel and Distributed Systems (TPDS)

Overview

UniOrch is a holistic coordination framework for achieving unified control over heterogeneous AI chips (GPUs, NPUs, DCUs) in data center environments. It addresses the challenges of virtualization overhead, protocol fragmentation, and network partitioning that hinder efficient Large Language Model (LLM) training.

Key Components

  1. Hardware Unification Layer: Consolidates GPUs, NPUs, and DCUs into a bare-metal cloud through programmable gateways and BGP EVPN overlay networks.

  2. Software Standardization Engine: A PyTorch-based Hardware Abstraction Layer (HAL) that masks hardware differences through standardized CUDA/CANN/ROCm mapping.

  3. TCCL (Transformer Collective Communication Library): Unifies NCCL, HCCL, and OpenMPI protocols for efficient parameter synchronization across heterogeneous clusters.

  4. HHEM (Heterogeneous Hybrid Estimation Model): Core scheduling engine with a two-stage cost model combining static analysis with dynamic runtime feedback.

Repository Structure

UniOrch/
├── src/                                    # Source code
│   ├── hhem/                               # HHEM scheduler implementation
│   │   ├── __init__.py                     # Module initialization
│   │   ├── cost_model.py                   # Two-stage cost model
│   │   ├── scheduler.py                    # Algorithm 1: HHEM_Scheduler
│   │   └── hybrid_placement.py             # Algorithm 2: Transformer-Graph Hybrid Placement
│   ├── tccl/                               # TCCL communication library
│   │   ├── __init__.py                     # Module initialization
│   │   └── collective_comm.py              # Collective communication primitives & fault tolerance
│   └── training/                           # Training modules
│       ├── __init__.py                     # Module initialization
│       └── mixed_training.py               # Mixed Parallel Training (DP × PP × TP)
├── experiments/                            # Experimental scripts
│   ├── cost_model/
│   │   └── evaluate_two_stage_model.py     # Two-stage cost model evaluation
│   ├── collective_comm/
│   │   └── benchmark_collective_ops.py     # Communication performance benchmark
│   ├── verify_all_algorithms.py            # Comprehensive algorithm verification
│   └── generate_paper_figures.py           # Paper figure generation
├── data/                                   # Experimental data (JSON format)
│   ├── two_stage_model_evaluation.json     # Cost model evaluation results
│   ├── collective_comm_benchmark_results.json  # Communication benchmark results
│   └── verification_results.json           # Algorithm verification results
├── figures/                                # Generated figures (11 PNG files)
│   ├── network_topology.png                # Network topology diagram
│   ├── two_stage_cost_model_evaluation.png # Cost model evaluation
│   ├── two_stage_model_detailed.png        # Detailed two-stage model
│   ├── collective_comm_comparison.png      # Communication comparison
│   ├── collective_comm_detailed.png        # Detailed communication analysis
│   ├── tccl_speedup.png                    # TCCL speedup results
│   ├── fault_tolerance.png                 # Fault tolerance demonstration
│   ├── mixed_parallel_strategy.png         # Mixed parallelism strategy
│   ├── scalability_analysis.png            # Scalability analysis
│   ├── hardware_comparison_table.png       # Hardware comparison
│   └── training_performance_stats.png      # Training performance statistics
├── logs/                                   # Execution logs
│   ├── hhem_cost_model.log                 # Cost model experiment log
│   ├── collective_comm.log                 # Communication experiment log
│   └── paper_figures.log                   # Figure generation log
├── requirements.txt                        # Python dependencies
├── run_all_experiments.sh                  # One-click verification script
└── README.md                               # This file

Installation

Requirements

  • Python 3.8+
  • NumPy >= 1.20.0
  • Matplotlib >= 3.4.0

Setup

# Clone the repository
git clone https://github.com/wangjingyi34/UniOrch.git
cd UniOrch

# Install dependencies
pip install -r requirements.txt

Quick Start

One-Click Verification

Run all experiments and verify all algorithms with a single command:

chmod +x run_all_experiments.sh
./run_all_experiments.sh

This will:

  1. Evaluate the HHEM two-stage cost model
  2. Benchmark collective communication performance
  3. Verify all 5 core algorithms

Running Individual Experiments

# Evaluate the two-stage cost model
python experiments/cost_model/evaluate_two_stage_model.py

# Benchmark collective communication
python experiments/collective_comm/benchmark_collective_ops.py

# Generate paper figures
python experiments/generate_paper_figures.py

# Verify all algorithms
python experiments/verify_all_algorithms.py

Core Algorithms

Algorithm 1: HHEM_Scheduler (src/hhem/scheduler.py)

Transformer-aware heterogeneous scheduling algorithm with:

  • Node priority scoring: S_i = α*(1-Util_i) + β*W_i
  • Atomic allocation and rollback mechanism
  • Time budget constraint checking

from src.hhem.scheduler import HHEMScheduler, SchedulingTask, HeterogeneousNode

# Create scheduler
scheduler = HHEMScheduler()

# Add heterogeneous nodes
scheduler.add_node(HeterogeneousNode(
    node_id="NPU_910ProB_Node_0",
    chip_type="NPU_910ProB",
    tflops=320.0,
    memory_gb=64.0,
    num_chips=8
))

# Define the task to schedule (field names here are illustrative;
# see SchedulingTask for the actual signature)
task = SchedulingTask(task_id="glm130b_pretrain", required_tflops=256.0, required_memory_gb=48.0)

# Schedule the task
result = scheduler.schedule(task)
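
The scoring rule from the list above is simple enough to show directly. A minimal sketch, assuming α and β are tunable weights and that Util_i and W_i are both normalized to [0, 1] (the default weights and example values below are our assumptions, not taken from the paper):

def priority_score(utilization: float, workload_weight: float,
                   alpha: float = 0.6, beta: float = 0.4) -> float:
    """Node priority S_i = alpha*(1 - Util_i) + beta*W_i: idle nodes
    with a heavier pending workload weight score higher."""
    return alpha * (1.0 - utilization) + beta * workload_weight

# A node at 30% utilization with normalized workload weight 0.5:
print(priority_score(0.3, 0.5))  # 0.62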

Algorithm 2: Transformer-Graph Hybrid Placement (src/hhem/hybrid_placement.py)

Hybrid placement algorithm for Transformer models:

  • Layer profiling and affinity scoring
  • Cluster assignment optimization
  • Pipeline construction and RDMA scheduling

from src.hhem.hybrid_placement import HybridPlacementEngine, TransformerModel

# Create placement engine
engine = HybridPlacementEngine()

# Create model and hardware profiles
model = TransformerModel(name="GLM-130B", num_layers=70, hidden_size=12288)

# Hardware profiles for the target cluster (structure shown is illustrative;
# see HybridPlacementEngine.place for the expected format)
hardware_profiles = [
    {"chip_type": "NPU_910ProB", "tflops": 320.0, "memory_gb": 64.0, "num_chips": 8},
    {"chip_type": "GPU_A100",    "tflops": 312.0, "memory_gb": 80.0, "num_chips": 8},
]

# Execute placement
result = engine.place(model, hardware_profiles)
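
The README does not pin down the affinity score itself; a minimal sketch, assuming it rewards chips that run a layer quickly while treating memory capacity as a hard constraint (the formula is our illustration, not the paper's):

def affinity(layer_flops: float, layer_mem_gb: float,
             chip_tflops: float, chip_mem_gb: float) -> float:
    """Illustrative layer-to-chip affinity: 0 if the layer cannot fit
    in memory, otherwise higher for chips that run the layer faster."""
    if layer_mem_gb > chip_mem_gb:      # hard memory constraint
        return 0.0
    est_seconds = layer_flops / (chip_tflops * 1e12)  # time at peak throughput
    return 1.0 / (1.0 + est_seconds)    # maps (0, inf) onto (0, 1)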

Two-Stage Cost Model (src/hhem/cost_model.py)

Two-stage cost estimation combining:

  • Stage 1 - Static Analysis: Offline estimates from operator FLOPs and hardware specs
  • Stage 2 - Dynamic Feedback: Runtime correction with a learned factor

Cost_effective = Cost_static × α_runtime
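
A minimal sketch of how the two stages could interact, with illustrative names and an EMA-style update for α_runtime that is our assumption rather than the paper's exact rule:

class TwoStageCostModel:
    """Sketch: static FLOPs-based estimate (Stage 1) corrected by a
    runtime factor learned from measured step times (Stage 2)."""

    def __init__(self, smoothing: float = 0.9):
        self.alpha_runtime = 1.0     # learned correction factor
        self.smoothing = smoothing   # EMA weight on the old factor

    def estimate_static(self, op_flops: float, chip_tflops: float) -> float:
        # Stage 1: offline estimate from operator FLOPs and peak hardware specs
        return op_flops / (chip_tflops * 1e12)  # seconds at peak throughput

    def estimate(self, op_flops: float, chip_tflops: float) -> float:
        # Cost_effective = Cost_static * alpha_runtime
        return self.estimate_static(op_flops, chip_tflops) * self.alpha_runtime

    def update(self, predicted: float, measured: float) -> None:
        # Stage 2: fold a measured runtime back into the correction factor
        target = self.alpha_runtime * measured / max(predicted, 1e-12)
        self.alpha_runtime = (self.smoothing * self.alpha_runtime
                              + (1 - self.smoothing) * target)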

Mixed Parallel Training (src/training/mixed_training.py)

Combines three parallelism strategies:

  • Data Parallelism (DP): Distributes data across devices
  • Pipeline Parallelism (PP): Splits model layers across stages
  • Tensor Parallelism (TP): Shards individual layers

from src.training.mixed_training import MixedParallelTrainer, TrainingConfig

config = TrainingConfig(
    model_name="GLM-130B",
    dp_degree=4,
    pp_degree=4,
    tp_degree=2,
    global_batch_size=256
)

trainer = MixedParallelTrainer(config)
results = trainer.train(num_batches=10)
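
With this configuration the job spans dp_degree × pp_degree × tp_degree = 4 × 4 × 2 = 32 devices, and each of the 4 data-parallel replicas sees 256 / 4 = 64 samples per global batch.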

TCCL Collective Communication (src/tccl/collective_comm.py)

Unified communication library with:

  • Protocol conversion (NCCL, HCCL, OpenMPI)
  • Topology-aware ring algorithm
  • RDMA acceleration
  • Fault tolerance (heartbeat monitoring, CRC32 checksums); a checksum framing sketch follows below
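
TCCL's actual wire format is not part of this README; as a minimal sketch of the CRC32 checksum check used for fault tolerance, assuming a simple length-prefixed framing of our own design:

import struct
import zlib

def pack_message(payload: bytes) -> bytes:
    # Prefix the payload with its length and CRC32 so the receiver can detect corruption.
    return struct.pack("!II", len(payload), zlib.crc32(payload)) + payload

def unpack_message(frame: bytes) -> bytes:
    # Verify length and checksum before accepting the payload; raise to trigger a retransmit.
    length, checksum = struct.unpack("!II", frame[:8])
    payload = frame[8:8 + length]
    if len(payload) != length or zlib.crc32(payload) != checksum:
        raise ValueError("corrupted frame: CRC32/length mismatch")
    return payload

assert unpack_message(pack_message(b"gradient shard")) == b"gradient shard"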

Algorithm Verification Results

All 5 core algorithms have been verified:

| Algorithm               | File                           | Status    |
|-------------------------|--------------------------------|-----------|
| HHEM_Scheduler          | src/hhem/scheduler.py          | ✅ PASSED |
| Hybrid Placement        | src/hhem/hybrid_placement.py   | ✅ PASSED |
| Two-Stage Cost Model    | src/hhem/cost_model.py         | ✅ PASSED |
| Mixed Parallel Training | src/training/mixed_training.py | ✅ PASSED |
| TCCL Collective Comm    | src/tccl/collective_comm.py    | ✅ PASSED |

Experimental Results

Performance Highlights

  • Resource Utilization: 92% (35% improvement over baseline)
  • Cross-chip Latency: 15ms (42% reduction)
  • Allreduce Efficiency: 2.1 GB/s (75% improvement)
  • Workload Imbalance: Reduced by 45%

Supported Models

  • GLM-130B
  • BLOOM-176B
  • LayoutLMv2
  • YOLOv5

Data Files

| File                                        | Description                    |
|---------------------------------------------|--------------------------------|
| data/two_stage_model_evaluation.json        | Cost model evaluation metrics  |
| data/collective_comm_benchmark_results.json | Communication performance data |
| data/verification_results.json              | Algorithm verification results |

Generated Figures

| Figure                              | Description                      |
|-------------------------------------|----------------------------------|
| network_topology.png                | Spine-Leaf network topology      |
| two_stage_cost_model_evaluation.png | Two-stage model performance      |
| collective_comm_comparison.png      | TCCL vs baseline comparison      |
| fault_tolerance.png                 | Fault tolerance mechanism        |
| mixed_parallel_strategy.png         | DP×PP×TP strategy layout         |
| scalability_analysis.png            | Scalability bottleneck analysis  |

Citation

If you use UniOrch in your research, please cite:

@article{wang2025uniorch,
  title={UniOrch: A Unified Mixed Framework for High-Efficiency LLM Training on Heterogeneous AI Chips},
  author={Wang, Jingyi and Wang, Jia and Gao, Jun and others},
  journal={IEEE Transactions on Parallel and Distributed Systems},
  year={2025},
  note={Under Review}
}

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

This work was validated through a production deployment at China Construction Bank. We thank the anonymous reviewers for their valuable feedback.

Contact

For questions or issues, please open an issue on GitHub or contact the authors.
