- [2025/6]SynerFuse publishes framework of Release 1.0, which can provide heterogeneous pipeline parallelism and heterogeneous data parallelism capabilities.
- [2025/5] China Mobile has established the repository for SynerFuse and obtained the OpenSSF certification badge.
Currently, the “resource wall” between different GPUs makes it difficult to build one heterogeneous resource pool for Large-scale models training. Heterogeneous distributed training becomes a pressing challenge for the industry to solve. We brought up a cross-architecture unified heterogeneous training framework SynerFuse to deal with the problem.
SynerFuse enables multiple LLMs deployed and trained on multiple types of GPUs. The Inhomogeneous Task Distribution(ITD) algorithm for heterogeneous training task splitting is innovatively proposed, which supports heterogeneous data parallelism(DP) and heterogeneous pipeline parallelism(PP), and realizes the adaptive adjustment of parameters such as size and number of micro batches in DP, stages and layers in PP on heterogeneous GPUs.
We’ve verified our capability on Nvidia and other 4 types of GPUs. The acceleration ratio reached 95%, loss converges to 1.8 and PPL curve converged normally.
SynerFuse is recommended to run in a containerized environment based on NVIDIA's NGC PyTorch image, which ensures compatibility with GPU-accelerated libraries.
- Ubuntu 20.04 / 22.04
- Python 3.10+
- PyTorch 2.4.0+ (with GPU support)
- NVIDIA GPUs with a recent driver (the NGC container includes CUDA and cuDNN)
We recommend using NGC's PyTorch container (e.g., nvcr.io/nvidia/pytorch:24.02-py3) for setup.
-
Create and start the container:
docker run -itd \ --name synerfuse \ --gpus all \ --network=host \ --ipc=host \ --privileged \ nvcr.io/nvidia/pytorch:24.07-py3 \ bash
-
Access the container:
docker exec -it synerfuse bash -
Clone the repository:
git clone https://github.com/SynerFuse/SynerFuse
-
Install dependencies:
pip install sentencepiece
Before running training tasks, prepare the following data files:
- Tokenizer Model: Download or prepare a tokenizer model file (e.g.,
tokenizer.model). - Training Dataset: Prepare training data in Megatron-LM format (
.binand.idxfiles).
Update the following paths in run.sh according to your data location:
TOKENIZER_PATH=/path/to/your/tokenizer.model
DATA_PATH=/path/to/your/datasetConfigure directories for checkpoints and TensorBoard logs in run.sh:
CHECKPOINT_PATH=/path/to/your/checkpoints
TENSORBOARD_LOGS_PATH=/path/to/your/tensorboard_logsMake sure these directories exist inside the container, for example:
mkdir -p /workspace/checkpoints
mkdir -p /workspace/tensorboard_logsSynerFuse provides a simple bash script for running pre-training tasks. To start distributed heterogeneous training:
cd SynerFuse
bash run.shThe script launches a distributed training job with default configurations.
Monitor training progress through the console output and (optionally) TensorBoard if enabled.
The --hetero-pipeline-stages parameter configures different layer counts for different pipeline stages.
Format: n0 layers_0_0 layers_0_1 ... n1 layers_1_0 layers_1_1 ...
n0: Number of devices in the 0-th heterogeneous stage, followed by layer counts for each device in that stage (layers_0_0,layers_0_1, ...)n1: Number of devices in the 1-st heterogeneous stage, followed by layer counts for each device in that stage (layers_1_0,layers_1_1, ...)- Additional stages follow the same pattern:
n2 layers_2_0 layers_2_1 ...
Constraints:
Where:
-
$k$ is the number of heterogeneous stages -
$n_i$ is the number of devices in the i-th stage -
$\text{layers}_{i,j}$ is the layer count for the j-th device in the i-th stage
This parameter enables assigning more layers to more powerful GPUs and fewer layers to less powerful GPUs in a heterogeneous GPU cluster.
Example: For a pipeline with 2 stages and 8 layers in total:
--num-layers 8
--pipeline-model-parallel-size 2
--hetero-pipeline-stages 1 6 1 2This configuration specifies:
- Stage 0: 1 device with 6 layers (
layers_0_0 = 6) - Stage 1: 1 device with 2 layers (
layers_1_0 = 2)
Configure heterogeneous DP using three parameters:
--use-tp-pp-dp-mapping: Changes communication group order to enable heterogeneous DP--micro-batch-size-per-dp: Sets micro-batch size for different DP groups--num-micro-batches-per-dp: Sets number of micro-batches for different DP groups
Format for --micro-batch-size-per-dp: n0 mbs0 n1 mbs1 ...
n0, n1, ...: Number of consecutive devices within a DP groupmbs0, mbs1, ...: Micro-batch size for the corresponding device group
Format for --num-micro-batches-per-dp: n0 nmb0 n1 nmb1 ...
n0, n1, ...: Number of consecutive devices within a DP groupnmb0, nmb1, ...: Number of micro-batches for the corresponding device group
Constraints:
Example: For data-parallel-size=2 and global-batch-size=32:
--global-batch-size 32
--use-tp-pp-dp-mapping
--micro-batch-size-per-dp 1 6 1 2
--num-micro-batches-per-dp 1 4 1 4This configuration creates two DP groups: one device with micro-batch size 6 and 4 micro-batches (6×4=24), and another device with micro-batch size 2 and 4 micro-batches (2×4=8), summing to 32.
Modify variables in run.sh to customize training:
- Model Size: Adjust
HIDDEN_SIZE,NUM_LAYERS,NUM_HEADS, etc. - Training Steps: Set
TRAIN_STEPSto control iterations. - Batch Sizes: Modify
MBS(micro-batch size) andGBS(global batch size). - Parallelism: Adjust
TP(tensor parallel size) andPP(pipeline parallel size). - Heterogeneous Configuration: Modify
--hetero-pipeline-stagesand related parameters.
- Ensure available GPUs match
CUDA_VISIBLE_DEVICESandGPUS_PER_NODEconfigurations. - The training script uses
torchrunfor distributed training. - Check console output for training progress and potential errors.
- TensorBoard logs are saved to
TENSORBOARD_LOGS_PATHif configured.