A modular, scalable, high-performance training framework for LLMs, VLMs, diffusion, and embodied models.
🚀 Up to 5.04× training speedup · 🌐 Native NVIDIA GPU & Kunlun XPU support
📖 Quick Start · 📊 Benchmark · 🤖 Supported Models · 🚀 Roadmap
🐉 LoongForge is part of Baidu Baige's Loong open-source series — named after the traditional Chinese loong boat (龙舟), a symbol of coordinated power and forward momentum.
LoongForge is a unified training framework for LLMs, VLMs, diffusion, and embodied models, covering pre-training, continued pre-training, and SFT. Built upon Megatron-LM with deep systemic enhancements across model coverage, training performance, and hardware support, it delivers significant speedups over mainstream open-source baselines.
Before being open-sourced, LoongForge was developed as AIAK-Training-LLM, Baidu Baige's training acceleration stack. It has supported production training for enterprise customers across education, computer vision, and embodied AI, typically delivering a 30% to 50% speedup over customer baselines, with the largest production runs using 5,000+ XPUs.
- [2026/05] ⚡ Accelerated Wan 2.2 training by 116%, and added CP and data packing support.
- [2026/05] ✨ Added training support for Kimi K2.5 / K2.6, and introduced INT4 / NVFP4 PTQ.
- [2026/05] 🎉 v0.1.0 — first official tagged release of LoongForge.
- [2026/05] 🌟 Powered the training and public release of LLaVA-OneVision-2.0.
- [2026/05] 🤖 Expanded VLA coverage with GR00T N1.6; 60%+ speedup on Pi0.5 and GR00T training.
- [2026/04] 🧩 Added training support for MiniMax-M2.7 on both NVIDIA GPU and Kunlun XPU.
- [2026/04] 🚀 LoongForge source code publicly available on GitHub. [blog]
- [2025/10] 🌟 Powered the training and public release of LLaVA-OneVision-1.5 under AIAK-Training-LLM, the predecessor of LoongForge. [blog]
See the full documentation for installation, tutorials, and advanced usage — English · 中文.
1. Install — via Docker (prebuilt images coming soon) or source build:
   - NVIDIA GPU: Installation Guide
   - Kunlun XPU: Installation Guide
2. Launch your first training run — follow a tutorial for your target hardware and modality:
   - NVIDIA GPU: LLM · VLM · VLA · Diffusion (WAN)
   - Kunlun XPU: Kunlun XPU Tutorials
3. Explore — browse configs/models/ and examples/ / examples_xpu/ for ready-to-run scripts.
- 🧩 Flexible Multi-Modal Composition — Configuration-driven assembly of VLMs from interchangeable ViT and LLM components.
- ⚡ Heterogeneous Parallelism — Independent TP / DP / recompute per model component (e.g., ViT vs. LLM) for optimal throughput and memory. [blog]
- 🔀 Decoupled Encoder-Decoder Training — Separates ViT and LLM into independent tasks, eliminating encoder-induced pipeline bubbles.
- ⚖️ DP Load Balancing — Load-aware data redistribution mitigates sequence-packing imbalance, improving multi-node scaling efficiency. [blog]
- 🚀 MoE-Native Optimization — Overlapped All2All / activation offload / compute, with further memory reduction beyond upstream Megatron-LM on DeepSeek-V3, Qwen3-MoE, etc.
- 🔬 Adaptive FP8 Training — End-to-end FP8 for LLMs and VLMs with standard blockwise FP8; optional adaptive mode picks per-operator precision by GEMM shape and efficiency.
- 🔧 Custom Fused Operators — Fused kernels like FusedDSA for DSA-style models — TileLang version open-sourced, high-performance CUDA version available on Baidu Baige platform.
- 🔁 Flexible Checkpointing — Offline bidirectional Megatron ↔ HuggingFace conversion plus native online HF load/save — no format barriers across your workflow.
- 🧰 Versatile Pipelines & Data Tools — Out-of-the-box Pretrain / MidTrain / SFT / LoRA, with built-in dataset format conversion and sequence packing.
- 🌐 Heterogeneous Hardware — Native support for NVIDIA GPUs and Kunlun XPUs via a minimally-intrusive plugin design.
📖 Deep-dive: LLM features · VLM features
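The DP load-balancing idea in the feature list can be pictured as a greedy assignment problem: give each sequence to whichever data-parallel rank currently has the least work. The sketch below is a minimal illustration, not LoongForge's actual implementation. The function names `balance_across_dp_ranks` and `rank_loads` are hypothetical, and the quadratic cost model (attention cost growing with the square of sequence length) is an assumption chosen for the example.

```python
import heapq


def quadratic_cost(seq_len):
    # Assumed cost model: attention work grows with the square of the length.
    return seq_len * seq_len


def balance_across_dp_ranks(seq_lens, num_ranks, cost=quadratic_cost):
    """Greedy longest-processing-time assignment of sequences to DP ranks.

    Returns one list of sequence indices per rank.
    """
    # Min-heap of (accumulated cost, rank id); the least-loaded rank pops first.
    heap = [(0, rank) for rank in range(num_ranks)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_ranks)]

    # Place the most expensive sequences first so cheap ones can fill the gaps.
    order = sorted(range(len(seq_lens)),
                   key=lambda i: cost(seq_lens[i]),
                   reverse=True)
    for idx in order:
        load, rank = heapq.heappop(heap)
        assignment[rank].append(idx)
        heapq.heappush(heap, (load + cost(seq_lens[idx]), rank))
    return assignment


def rank_loads(seq_lens, assignment, cost=quadratic_cost):
    """Total estimated cost carried by each rank."""
    return [sum(cost(seq_lens[i]) for i in indices) for indices in assignment]


if __name__ == "__main__":
    lengths = [8, 1, 1, 1, 1, 6, 2, 4]  # toy sequence lengths
    plan = balance_across_dp_ranks(lengths, num_ranks=2)
    print("assignment:", plan)
    print("per-rank cost:", rank_loads(lengths, plan))
```

Because every rank waits for the slowest one at each gradient synchronization, reducing the maximum per-rank cost is what matters; a production system would likely use a measured cost model rather than the quadratic stand-in used here.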
Measured on v0.1.1 across LLM, VLM, VLA, and DiT workloads against mainstream open-source training baselines:
📋 Detailed configurations & footnotes
| Model | Type | Baseline | Configuration | Speedup |
|---|---|---|---|---|
| Qwen3-30B-A3B | MoE | Megatron-LM† | 32 × A800‡ · GBS 1024 · 32K | 1.16× |
| DeepSeek-V3.2 Lite § | MoE + DSA | Megatron-LM† | Reduced-layer · GBS 128 · 8K | 5.04× |
| Qwen3-VL-30B-A3B | VLM | VeOmni† | 32 × A800‡ · GBS 128 · 32K | 1.45× |
| GR00T N1.6 | VLA | LeRobot† | 8 × A800‡ · GBS 128 · 224×224 | 2.31× |
| Pi0.5 | VLA | OpenPI† | 8 × A800‡ · GBS 112 · 224×224 | 1.65× |
§ Due to test-bed scale limits, DeepSeek-V3.2 was validated separately on a reduced-layer configuration. LoongForge's DSA CUDA kernel optimizations still deliver a ~5× speedup over Megatron-LM and scale to a 64K sequence length, whereas the baseline runs out of memory beyond 8K.
† Numbers reflect baseline and LoongForge versions at the time of measurement, and may evolve as implementations change.
‡ Validation on additional hardware is rolling out in upcoming releases.
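To read the Speedup column in terms of wall-clock time, a small conversion helps. The sketch below assumes that a speedup of N× means N times the baseline throughput, so a training step takes 1/N of the baseline time; the table does not state this definition explicitly. The helper names `time_fraction` and `percent_faster_to_factor` are hypothetical.

```python
def time_fraction(speedup):
    """Fraction of baseline step time remaining at a given speedup factor."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 / speedup


def percent_faster_to_factor(percent):
    """Convert an 'X% faster' figure into a multiplicative factor.

    Assumes 'X% speedup' means throughput rose by X% over the baseline.
    """
    return 1.0 + percent / 100.0


# Speedup factors copied from the benchmark table above.
TABLE_SPEEDUPS = {
    "Qwen3-30B-A3B": 1.16,
    "DeepSeek-V3.2 Lite": 5.04,
    "Qwen3-VL-30B-A3B": 1.45,
    "GR00T N1.6": 2.31,
    "Pi0.5": 1.65,
}

if __name__ == "__main__":
    for model, factor in TABLE_SPEEDUPS.items():
        share = 100.0 * time_fraction(factor)
        print(f"{model}: {factor:.2f}x -> {share:.1f}% of baseline step time")
```

Under the same assumption, `percent_faster_to_factor(30)` and `percent_faster_to_factor(50)` put a 30% to 50% speedup at factors of 1.3× and 1.5×.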
- LLaVA-OneVision-2.0 — Next-generation multimodal model, with new VideoCaption and Spatial datasets.
- LLaVA-OneVision-1.5 — Fully open framework for democratized multimodal training.
- Qianfan-VL — Domain-Enhanced Vision-Language Models for Enterprise, 3B to 70B parameters.
LoongForge supports a broad range of state-of-the-art models across LLM, VLM, diffusion, and VLA.
| Modality | Architectures | Models |
|---|---|---|
| LLM | DeepSeek-V2 | deepseek-v2-lite, deepseek-v2 |
| | DeepSeek-V3 | deepseek-v3, deepseek-v32 |
| | LLaMA2 | llama2-7b, llama2-13b, llama2-70b |
| | LLaMA3 | llama3-8b, llama3-70b |
| | LLaMA3.1 | llama3.1-8b, llama3.1-70b, llama3.1-405b |
| | Qwen | qwen-1.8b → qwen-72b |
| | Qwen1.5 | qwen1.5-0.5b → qwen1.5-72b |
| | Qwen2 | qwen2-0.5b → qwen2-72b |
| | Qwen2.5 | qwen2.5-0.5b → qwen2.5-72b |
| | Qwen3 | qwen3-0.6b → qwen3-480b-a35b, qwen3-coder-30b-a3b |
| | Qwen3-Next | qwen3-next-80b-a3b |
| | MiniMax | minimax-m2.1, minimax-m2.5, minimax-m2.7 |
| | MIMO | mimo-7b |
| | GLM | glm5 |
| VLM | Qwen2.5-VL | qwen2.5-vl-3b → qwen2.5-vl-72b |
| | Qwen3-VL | qwen3-vl-30b-a3b, qwen3-vl-235b-a22b |
| | Qwen3.5 | qwen3.5-0.8b → qwen3.5-397b-a17b |
| | Qwen3.6 | qwen3.6-27b, qwen3.6-35b-a3b |
| | Kimi-K2.5 | kimi-k2.5, kimi-k2.6 |
| | ERNIE4.5-VL | ernie4.5vl-28b-a3b |
| | LLaVA-OneVision-1.5 | llava-onevision-1.5-4b |
| | InternVL2.5 | internvl2.5-8b → internvl2.5-78b |
| | InternVL3.5 | internvl3.5-8b → internvl3.5-241b-a28b |
| | CustomCombinedModel | Flexible ViT + LLM backbone configuration (example) |
| Diffusion | WAN2.2 | wan2.2_i2v_a14b |
| VLA | Pi | pi0.5 |
| | GR00T | groot-n1.6 |
📁 Directory tree
LoongForge/
├── loongforge/ # Core training framework
│ ├── train/ # Training entry points & trainers
│ │ ├── pretrain/ # Pretrain (LLM, VLM)
│ │ ├── sft/ # SFT (LLM, VLM, InternVL, ERNIE)
│ │ ├── diffusion/ # Diffusion (WAN)
│ │ └── embodied/ # Embodied AI (Pi0.5, GR00T)
│ ├── models/ # Unified model abstractions
│ │ ├── foundation/ # LLM backbones (LLaMA, Qwen, DeepSeek, ...)
│ │ ├── encoder/ # Vision encoders (ViT, Qwen-VL, InternVL, ...)
│ │ ├── omni_models/ # Multi-modal composition
│ │ ├── diffusion/ # Diffusion models (WAN)
│ │ ├── embodied/ # Embodied models (Pi0.5, GR00T)
│ │ └── common/ # Shared layers and utilities
│ ├── data/ # Data pipelines (multi-modal, video, DP balance)
│ ├── tokenizer/ # Tokenizers
│ └── utils/ # Config map, constants, etc.
├── third_party/Loong-Megatron/ # Patched Megatron-LM (git submodule)
├── configs/ # Hydra YAML configs (models, data)
├── examples/ # GPU launch scripts
├── examples_xpu/ # Kunlun XPU launch scripts
├── tools/ # Checkpoint conversion, data preprocessing
├── ops/ # Custom fused operators (incl. open-sourced TileLang)
├── patches/ # TransformerEngine patches
├── docker/ # Dockerfiles (GPU & XPU)
├── tests/ # E2E test suite (YAML-driven)
└── docs/ # Documentation
We warmly welcome community contributions — bug reports, feature proposals, and PRs alike. Please read our Contributing Guidelines before submitting.
LoongForge is released under the Apache License 2.0. Some files are derived from third-party open-source projects; please refer to the specific file headers for their respective copyright and attribution.
@software{LoongForge2026,
title = {LoongForge: A modular, scalable, high-performance training framework for LLMs, VLMs, diffusion, and embodied models},
author = {{The LoongForge Authors}},
year = {2026},
url = {https://github.com/baidu-baige/LoongForge}
}

LoongForge is built upon NVIDIA's Megatron-LM. We also drew inspiration from several excellent open-source projects, including but not limited to HuggingFace Transformers, LLaMA-Factory, and Megatron-Bridge. We sincerely thank these communities for their outstanding contributions.
Open a GitHub issue for questions, feedback, or feature requests. You can also join our developer community:
- WeChat — Scan QR code to join
- Slack — Join here
