[CICD] Refactor workflows, Add integration_tests, Switch to FlagCICD metax runner#60
Darryl233 merged 25 commits into flagos-ai:main from
Conversation
Pull request overview
This PR refactors the CI/CD pipeline to run a simplified, reusable test suite across both CUDA (A100) and Metax (C500) environments (using BAAI runner configs), while also addressing CUDA runtime library loading issues and adding an MCore (Megatron-LM-FL) integration test.
Changes:
- Refactors GitHub Actions into reusable workflows (lint/unit/integration), removes several legacy/disabled workflows, and centralizes per-platform configuration under `.github/configs/*.yml`.
- Adds/updates QA scripts for Metax/CUDA stability (skipping incompatible distributed tests on Metax; improving CUDA runtime discoverability).
- Introduces a Megatron-LM-FL MCore integration test path and wiring in CI.
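The per-platform entry points described above can be sketched as a thin wrapper around the shared workflow. This is an illustrative sketch only; the actual input names and trigger events in the PR may differ:

```yaml
# all_tests_cuda.yml (illustrative): a per-platform entry point that
# delegates to the common workflow and points it at a platform config.
# The `config_file` input name is an assumption, not the PR's exact API.
name: CUDA tests
on: [push, pull_request]
jobs:
  all_tests:
    uses: ./.github/workflows/all_tests_common.yml
    with:
      config_file: .github/configs/cuda.yml
```

The same pattern repeats for Metax with `.github/configs/metax.yml`, so platform differences live in config files rather than duplicated workflow logic.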
Reviewed changes
Copilot reviewed 27 out of 28 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| `transformer_engine/plugin/core/backends/vendor/cuda/cuda.py` | Adds CUDA library preloading with extra search locations (env, pip-installed NVIDIA libs, ldconfig). |
| `transformer_engine/common/__init__.py` | Adds `_load_cudart()` fallback chain and wires it into the core library load sequence. |
| `qa/L1_pytorch_mcore_integration/test.sh` | Reworks the MCore integration test runner (repo sync + torchrun + tensorboard/checkpoint output). |
| `qa/L1_pytorch_mcore_integration/test_bak.sh` | Adds a backup/legacy version of the integration script. |
| `qa/L1_pytorch_distributed_unittest/test.sh` | Adds CUDA runtime preload + Metax skip list; makes debug numerics optional if `nvdlfw_inspect` is unavailable. |
| `qa/L0_pytorch_unittest/test.sh` | Adjusts Metax skip patterns (removes `test_cuda_graphs.py` from the skip list). |
| `qa/L0_pytorch_debug_unittest/test.sh` | Updates Metax skip-matching patterns and adds local run hints. |
| `.github/workflows/unit_tests_common.yml` | Simplifies into a reusable unit-test workflow using `setup_script` + `build_env`; adds distributed unit tests to the matrix. |
| `.github/workflows/integration_tests_common.yml` | Introduces a reusable integration-test workflow (MCore integration). |
| `.github/workflows/all_tests_common.yml` | Centralizes platform config loading, adds a lint job, and orchestrates unit + integration flows. |
| `.github/workflows/all_tests_cuda.yml` | Enables unit + integration runs for CUDA via the common workflow. |
| `.github/workflows/all_tests_metax.yml` | Enables unit + integration runs for Metax via the common workflow. |
| `.github/workflows/lint_common.yml` | Adds a standalone reusable lint workflow. |
| `.github/workflows/qa-l1-te-cpp-pytorch-tests.yml` | Updates checkout behavior and adds the MCore integration test step. |
| `.github/scripts/setup_cuda.sh` | Adds a CUDA environment setup script for workflow reuse. |
| `.github/scripts/setup_metax.sh` | Adds a Metax environment setup script for workflow reuse. |
| `.github/configs/cuda.yml` | Switches to BAAI runner labels, adds `setup_script`, and defines CUDA build env vars. |
| `.github/configs/metax.yml` | Switches to BAAI runner labels/image, adds `setup_script`, and defines Metax build env vars. |
| `.github/workflows/upload-ci-logs.yml` | Removes legacy log upload workflow. |
| `.github/workflows/trigger-ci.yml` | Removes legacy trigger workflow. |
| `.github/workflows/lint.yml` | Removes legacy lint workflow. |
| `.github/workflows/functional_tests_common.yml` | Removes legacy functional training workflow. |
| `.github/workflows/qa-format.yml` | Removes legacy format-check workflow. |
| `.github/workflows/deploy_nightly_docs.yml` | Removes legacy nightly docs workflow. |
| `.github/workflows/blossom-ci.yml` | Removes legacy Blossom CI workflow. |
You can disable workflows forked from the original NVIDIA repository, but do not remove them.
I think we have an original lint.yml. Can we use the original one?
```yaml
env:
  TE_PATH: ${{ github.workspace }}
  TE_FL_PREFER: vendor
  MCORE_REPO_URL: https://github.com/BrianPei/Megatron-LM-FL.git
```
Should not use the forked repo
OK, we will edit this after Megatron-FL is updated; otherwise it can't support functional_tests.
```
- name: Execute Tests
  if: steps.should_run.outputs.should_run == 'true'
  if: inputs.setup_script != ''
```
Is this necessary in Execute Tests step?
```
bash $GITHUB_WORKSPACE/${{ inputs.setup_script }}
```

```
- name: Execute Tests
  if: inputs.setup_script != ''
```
Is this necessary in Execute Tests step?
```yaml
env:
  TE_PATH: .
  TE_FL_PREFER: vendor
  MCORE_REPO_URL: https://github.com/BrianPei/Megatron-LM-FL.git
```
Should not use the forked repo
If you feel it's hard to split L0, L1, L2 into unit tests, functional tests, and integration tests, I suggest following the original test division pattern. The names used in the division pattern are not the main concern.
```
if [ "$PLATFORM" = "metax" ]; then
  case "$test_path" in
    *"test_numerics.py" | *"test_api_features.py" | *"test_sanity.py")
    *tests/pytorch/test_numerics.py | *tests/pytorch/test_sanity.py)
```
Consider configuring ignore rules for future use later.
Updated
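The skip-pattern discussion above boils down to matching on full relative paths instead of bare file names. Here is a minimal, self-contained sketch of that pattern; the helper name `should_skip` and the specific test paths are illustrative, not the repository's actual skip set:

```shell
#!/usr/bin/env bash
# Illustrative sketch of a platform-aware skip list.
# should_skip PLATFORM TEST_PATH -> exit 0 (skip) or 1 (run)
should_skip() {
  local platform="$1" test_path="$2"
  if [ "$platform" = "metax" ]; then
    case "$test_path" in
      # Anchoring on the full relative path avoids accidentally skipping
      # similarly named files elsewhere in the tree.
      *tests/pytorch/test_numerics.py | *tests/pytorch/test_sanity.py)
        return 0 ;;
    esac
  fi
  return 1
}

should_skip metax tests/pytorch/test_numerics.py && echo "skip"
should_skip cuda tests/pytorch/test_numerics.py || echo "run"
```

Because the patterns include the `tests/pytorch/` prefix, a file such as `some_other_dir/test_numerics.py` would still run, which is the "ignore rules" behavior the reviewer asked about.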
```
: ${XML_LOG_DIR:=/logs}
mkdir -p "$XML_LOG_DIR"
```

```
# The current CUDA 12.8 test container hits a fused-attention runtime loader
```
Can you give a specific example to explain this issue?
This workaround avoids a runtime loader issue: when the fused-attention path enters the CUDA backend, it throws a "Cannot load any libcudart.so.* library" error.
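One way such a preload workaround can look is sketched below. This is a hedged illustration, not the PR's actual script: the `CUDART_PATH` variable name and the pip wheel layout (`nvidia/cuda_runtime/lib/`) are assumptions, though the search order (env override, pip-installed NVIDIA wheel, system loader cache) mirrors the fallback chain described in this PR:

```shell
#!/usr/bin/env bash
# Illustrative sketch: locate libcudart before tests run so the fused
# attention path does not fail with "Cannot load any libcudart.so.*".
find_cudart() {
  # 1. Explicit override via an environment variable (name is an assumption).
  if [ -n "${CUDART_PATH:-}" ] && [ -e "$CUDART_PATH" ]; then
    echo "$CUDART_PATH"; return 0
  fi
  # 2. pip-installed runtime (nvidia-cuda-runtime-cu* wheel layout).
  local pip_lib
  pip_lib=$(python3 -c 'import glob, site, sys
for d in site.getsitepackages():
    m = glob.glob(d + "/nvidia/cuda_runtime/lib/libcudart.so*")
    if m:
        print(m[0]); sys.exit()
' 2>/dev/null)
  if [ -n "$pip_lib" ]; then echo "$pip_lib"; return 0; fi
  # 3. System loader cache (ldconfig), if available.
  ldconfig -p 2>/dev/null | awk '/libcudart\.so/ {print $NF; exit}'
}

lib=$(find_cudart)
if [ -n "$lib" ]; then
  # Preload so dlopen() inside the CUDA backend finds the runtime.
  export LD_PRELOAD="$lib${LD_PRELOAD:+:$LD_PRELOAD}"
fi
```

On machines without CUDA the function simply returns nothing and no preload is set, so the script stays safe to source unconditionally.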
```
: ${MCORE_PATH:=${TE_PATH}/qa/L1_pytorch_mcore_integration/Megatron-LM}
: "${TE_PATH:=$(cd -- "${SCRIPT_DIR}/../.." && pwd)}"
: "${MCORE_PATH:=/workspace/Megatron-LM-FL}"
: "${MCORE_REPO_URL:=https://github.com/BrianPei/Megatron-LM-FL.git}"
```
Do not use the forked repo
```python
if not skip_cuda_build():
    _CUDNN_LIB_CTYPES = _load_cudnn()
    _NVRTC_LIB_CTYPES = _load_nvrtc()
    _CUDART_LIB_CTYPES = _load_nvidia_cuda_library("cuda_runtime")
```
Remove unnecessary changes.
Updated this part by restoring the original ordering here, so this block now only keeps the necessary changes.
```
hardware_name: cuda
display_name: 'NVIDIA CUDA (A100)'

# CI image for BAAI env
# - nvidia
# - gpu-8
```

```
# Runner labels for BAAI env
- X64
- metax
- dev
# CI image for BAAI env
```
```yaml
needs:
  - unit_tests
runs-on: ubuntu-latest
if: always() && inputs.run_unit_tests
```
It's necessary. If we remove the always(), this step will be skipped whenever any unit_tests job fails; then the all_tests_complete check step may be skipped too, and the whole workflow will report a wrong result.
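The point about `always()` can be illustrated with a minimal completion-gate job. This is a sketch, not the PR's exact workflow; the job name `all_tests_complete` comes from the discussion above, but the step contents are assumptions:

```yaml
# Without always(), a failed unit_tests job would cause this gate job to
# be *skipped*, and a branch-protection check that requires it could then
# report a misleading (green or neutral) result instead of a failure.
all_tests_complete:
  needs:
    - unit_tests
  runs-on: ubuntu-latest
  if: always() && inputs.run_unit_tests
  steps:
    - name: Check results
      run: |
        # needs.<job>.result is one of: success, failure, cancelled, skipped
        if [ "${{ needs.unit_tests.result }}" != "success" ]; then
          echo "unit tests did not succeed"
          exit 1
        fi
```

With `always()`, the gate job runs regardless of the unit-test outcome and converts any non-success result into an explicit failure.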
Description
Refactors CI/CD workflows to support both CUDA (NVIDIA A100) and Metax (C500) platforms, removes obsolete workflows, and fixes several platform-specific test failures. Adds functional testing and log reporting, significantly simplifies the workflows, and switches the Metax platform to BAAI runner configs.
Type of change
Changes
- `lint_common.yml` (runs in parallel); add `integration_tests_common.yml`
- Per-platform setup scripts (`setup_cuda.sh`/`setup_metax.sh`); switched Metax config to BAAI online environment; removed unsupported test types (JAX distributed) from the Metax matrix
- Skip incompatible distributed tests on Metax (`test_numerics`, `test_torch_fsdp2`, etc.) to prevent `torchrun` SIGSEGV
- Replaced `nvidia-smi`-only FP8 detection with a platform-aware check
- Fixed `libcudart` load failure when the runtime is pip-installed (added a proper fallback chain in `_load_cudart()` and `try_load_lib`)

Checklist