
[CICD] Refactor workflows, Add integration_tests, Switch to FlagCICD metax runner#60

Merged
Darryl233 merged 25 commits into flagos-ai:main from BrianPei:pr-0417
Apr 24, 2026

Conversation

@BrianPei
Collaborator

Description

Refactors CI/CD workflows to support both CUDA (NVIDIA A100) and Metax (C500) platforms, removes obsolete workflows, and fixes several platform-specific test failures. Adds functional testing and log reporting, significantly simplifies the workflows, and switches the Metax platform to the BAAI runner configs.


Type of change

  • New feature (non-breaking change which adds functionality)
  • Infra/Build change (changes to CI/CD workflows or build scripts)
  • Code refactoring
  • Bug fix
  • Documentation change
  • Breaking change

Changes

  • Workflow cleanup: Removed 7 obsolete workflows; extracted lint into a standalone reusable lint_common.yml (runs in parallel); added integration_tests_common.yml
  • Platform refactoring: Added per-platform setup scripts (setup_cuda.sh / setup_metax.sh); switched Metax config to BAAI online environment; removed unsupported test types (JAX distributed) from Metax matrix
  • Bug fixes:
    • Metax: skip incompatible distributed test files (test_numerics, test_torch_fsdp2, etc.) to prevent torchrun SIGSEGV
    • Metax: replace nvidia-smi-only FP8 detection with platform-aware check
    • CUDA: fix libcudart load failure when runtime is pip-installed (add proper fallback chain in _load_cudart() and try_load_lib)

Checklist

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in CI workflow setup steps
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added/updated tests that prove my feature works on CUDA and Metax platform
  • New and existing unit tests pass locally on CUDA and Metax platform

Copilot AI review requested due to automatic review settings April 17, 2026 07:34

Copilot AI left a comment


Pull request overview

This PR refactors the CI/CD pipeline to run a simplified, reusable test suite across both CUDA (A100) and Metax (C500) environments (using BAAI runner configs), while also addressing CUDA runtime library loading issues and adding an MCore (Megatron-LM-FL) integration test.

Changes:

  • Refactors GitHub Actions into reusable workflows (lint/unit/integration), removes several legacy/disabled workflows, and centralizes per-platform configuration under .github/configs/*.yml.
  • Adds/updates QA scripts for Metax/CUDA stability (skipping incompatible distributed tests on Metax; improving CUDA runtime discoverability).
  • Introduces a Megatron-LM-FL MCore integration test path and wiring in CI.

Reviewed changes

Copilot reviewed 27 out of 28 changed files in this pull request and generated 7 comments.

File Description
transformer_engine/plugin/core/backends/vendor/cuda/cuda.py Adds CUDA library preloading with extra search locations (env, pip-installed NVIDIA libs, ldconfig).
transformer_engine/common/__init__.py Adds _load_cudart() fallback chain and wires it into core library load sequence.
qa/L1_pytorch_mcore_integration/test.sh Reworks the MCore integration test runner (repo sync + torchrun + tensorboard/checkpoint output).
qa/L1_pytorch_mcore_integration/test_bak.sh Adds a backup/legacy version of the integration script.
qa/L1_pytorch_distributed_unittest/test.sh Adds CUDA runtime preload + Metax skip list + makes debug numerics optional if nvdlfw_inspect is unavailable.
qa/L0_pytorch_unittest/test.sh Adjusts Metax skip patterns (removes test_cuda_graphs.py from skip list).
qa/L0_pytorch_debug_unittest/test.sh Updates Metax skip matching patterns and adds local run hints.
.github/workflows/unit_tests_common.yml Simplifies into a reusable unit-test workflow using setup_script + build_env; adds distributed unit tests to matrix.
.github/workflows/integration_tests_common.yml Introduces a reusable integration-test workflow (MCore integration).
.github/workflows/all_tests_common.yml Centralizes platform config loading, adds lint job, and orchestrates unit + integration flows.
.github/workflows/all_tests_cuda.yml Enables unit + integration runs for CUDA via the common workflow.
.github/workflows/all_tests_metax.yml Enables unit + integration runs for Metax via the common workflow.
.github/workflows/lint_common.yml Adds a standalone reusable lint workflow.
.github/workflows/qa-l1-te-cpp-pytorch-tests.yml Updates checkout behavior and adds the MCore integration test step.
.github/scripts/setup_cuda.sh Adds CUDA environment setup script for workflow reuse.
.github/scripts/setup_metax.sh Adds Metax environment setup script for workflow reuse.
.github/configs/cuda.yml Switches to BAAI runner labels, adds setup_script, and defines CUDA build env vars.
.github/configs/metax.yml Switches to BAAI runner labels/image, adds setup_script, and defines Metax build env vars.
.github/workflows/upload-ci-logs.yml Removes legacy log upload workflow.
.github/workflows/trigger-ci.yml Removes legacy trigger workflow.
.github/workflows/lint.yml Removes legacy lint workflow.
.github/workflows/functional_tests_common.yml Removes legacy functional training workflow.
.github/workflows/qa-format.yml Removes legacy format-check workflow.
.github/workflows/deploy_nightly_docs.yml Removes legacy nightly docs workflow.
.github/workflows/blossom-ci.yml Removes legacy Blossom CI workflow.


Comment thread .github/workflows/integration_tests_common.yml
Comment thread transformer_engine/common/__init__.py Outdated
Comment thread qa/L1_pytorch_mcore_integration/test_bak.sh
Comment thread .github/workflows/all_tests_common.yml
Comment thread .github/workflows/all_tests_common.yml
Comment thread .github/workflows/unit_tests_common.yml Outdated
Comment thread .github/workflows/integration_tests_common.yml
@Darryl233
Collaborator

You can disable workflows forked from the original NVIDIA repository, but do not remove them.

Comment thread .github/workflows/lint_common.yml Outdated
Collaborator


I think we have an original lint.yml. Can we use the original one?

Collaborator Author


Updated

@Darryl233 Darryl233 changed the title [CICD] Refactor workflows, Add integration_tests, Switch to BAAI metax runner [CICD] Refactor workflows, Add integration_tests, Switch to FlagCICD metax runner Apr 20, 2026
env:
TE_PATH: ${{ github.workspace }}
TE_FL_PREFER: vendor
MCORE_REPO_URL: https://github.com/BrianPei/Megatron-LM-FL.git
Collaborator


Should not use the forked repo

Collaborator Author


OK, so we will edit this after Megatron-FL is updated; otherwise it can't support functional_tests
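One way to make that later swap a one-line change is to read the URL from a repository variable with the current fork as the fallback (a sketch; the `MCORE_REPO_URL` variable name is an assumption):

```yaml
env:
  # Override via a repository variable once upstream Megatron-FL
  # supports the functional tests; fall back to the current fork.
  MCORE_REPO_URL: ${{ vars.MCORE_REPO_URL || 'https://github.com/BrianPei/Megatron-LM-FL.git' }}
```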

Comment thread .github/workflows/unit_tests_common.yml Outdated

- name: Execute Tests
if: steps.should_run.outputs.should_run == 'true'
if: inputs.setup_script != ''
Collaborator


Is this necessary in Execute Tests step?

Collaborator Author


Updated

bash $GITHUB_WORKSPACE/${{ inputs.setup_script }}

- name: Execute Tests
if: inputs.setup_script != ''
Collaborator


Is this necessary in Execute Tests step?

Collaborator Author


Updated

env:
TE_PATH: .
TE_FL_PREFER: vendor
MCORE_REPO_URL: https://github.com/BrianPei/Megatron-LM-FL.git
Collaborator


Should not use the forked repo

Comment thread .github/workflows/qa-l1-te-cpp-pytorch-tests.yml
@Darryl233
Collaborator

Darryl233 commented Apr 20, 2026

If you feel it's hard to split L0, L1, L2 into unit tests, functional tests, and integration tests, I suggest following the original test division pattern. The names of the divisions are not the main concern

Comment thread qa/L0_pytorch_debug_unittest/test.sh Outdated
Comment thread qa/L0_pytorch_debug_unittest/test.sh Outdated
if [ "$PLATFORM" = "metax" ]; then
case "$test_path" in
*"test_numerics.py" | *"test_api_features.py" | *"test_sanity.py")
*tests/pytorch/test_numerics.py | *tests/pytorch/test_sanity.py)
Collaborator

@Darryl233 Darryl233 Apr 20, 2026


Consider configuring ignore rules for future use later.



Consider configuring ignore rules for future use later.

Updated
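The reviewer's suggestion of configurable ignore rules could be sketched like this: a per-platform skip file with one glob pattern per line, matched with the same `case` mechanism the script already uses. The `SKIP_FILE` path and layout are assumptions for illustration, not the repo's actual scripts.

```shell
# Hypothetical: drive the Metax skip list from a config file instead of
# hard-coded case patterns, so ignore rules can be extended later.
PLATFORM=${PLATFORM:-metax}
SKIP_FILE=${SKIP_FILE:-qa/configs/${PLATFORM}_skip.txt}

should_skip() {
    # Returns 0 (skip) if $1 matches any glob pattern in $SKIP_FILE.
    local test_path=$1
    [ -f "$SKIP_FILE" ] || return 1
    while IFS= read -r pattern; do
        case "$test_path" in
            $pattern) return 0 ;;
        esac
    done < "$SKIP_FILE"
    return 1
}
```

A skip file would then contain entries such as `*test_numerics.py`, keeping the test runner itself unchanged when the list grows.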

: ${XML_LOG_DIR:=/logs}
mkdir -p "$XML_LOG_DIR"

# The current CUDA 12.8 test container hits a fused-attention runtime loader
Collaborator


Can you give a specific example to explain this issue?



This workaround avoids a runtime loader issue: when the fused-attention path enters the CUDA backend, it fails with a "Cannot load any libcudart.so.* library" error.
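A preload step for this could look roughly as follows (a sketch, assuming the runtime comes from the `nvidia-cuda-runtime` pip wheel; the variable name and exact lookup are illustrative, not the script's actual code):

```shell
# Locate the pip-installed CUDA runtime and put its lib directory on
# LD_LIBRARY_PATH before the tests dlopen libcudart.
CUDART_DIR=$(python3 - <<'EOF'
try:
    import os
    import nvidia.cuda_runtime
    print(os.path.join(os.path.dirname(nvidia.cuda_runtime.__file__), "lib"))
except Exception:
    print("")  # wheel not installed; fall back to the system loader
EOF
)
if [ -n "$CUDART_DIR" ] && [ -d "$CUDART_DIR" ]; then
    export LD_LIBRARY_PATH="$CUDART_DIR${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
fi
```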

Comment thread qa/L1_pytorch_mcore_integration/test.sh Outdated
: ${MCORE_PATH:=${TE_PATH}/qa/L1_pytorch_mcore_integration/Megatron-LM}
: "${TE_PATH:=$(cd -- "${SCRIPT_DIR}/../.." && pwd)}"
: "${MCORE_PATH:=/workspace/Megatron-LM-FL}"
: "${MCORE_REPO_URL:=https://github.com/BrianPei/Megatron-LM-FL.git}"
Collaborator


Do not use the forked repo

Comment thread transformer_engine/common/__init__.py
Comment thread transformer_engine/plugin/core/backends/vendor/cuda/cuda.py
@CLAassistant

CLAassistant commented Apr 21, 2026

CLA assistant check
All committers have signed the CLA.

Comment thread transformer_engine/common/__init__.py Outdated
if not skip_cuda_build():
_CUDNN_LIB_CTYPES = _load_cudnn()
_NVRTC_LIB_CTYPES = _load_nvrtc()
_CUDART_LIB_CTYPES = _load_nvidia_cuda_library("cuda_runtime")
Collaborator


Remove unnecessary changes.



Remove unnecessary changes.

Updated by restoring the original ordering here, so this block now keeps only the necessary changes.

Collaborator

@Darryl233 Darryl233 left a comment


LGTM

Comment thread .github/configs/cuda.yml Outdated
hardware_name: cuda
display_name: 'NVIDIA CUDA (A100)'

# CI image for BAAI env
Collaborator


remove BAAI here?

Collaborator Author


updated

Comment thread .github/configs/cuda.yml Outdated
# - nvidia
# - gpu-8

# Runner labels for BAAI env
Collaborator


here

Collaborator Author


updated

Comment thread .github/configs/metax.yml Outdated
- X64
- metax
- dev
# CI image for BAAI env
Collaborator


here

Collaborator Author


updated

needs:
- unit_tests
runs-on: ubuntu-latest
if: always() && inputs.run_unit_tests
Collaborator


Is always() necessary?

Collaborator Author


It's necessary. If we remove the always(), this job will be skipped whenever any unit_tests job fails. The all_tests_complete check step may then be skipped too, and the whole workflow will report the wrong result.
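The gating pattern described above can be sketched as follows (job and input names taken from the thread; the step body is illustrative):

```yaml
all_tests_complete:
  needs:
    - unit_tests
  runs-on: ubuntu-latest
  # always() keeps this job from being skipped when unit_tests fails,
  # so the final status check still runs and can report the failure.
  if: always() && inputs.run_unit_tests
  steps:
    - name: Check unit test results
      run: |
        if [ "${{ needs.unit_tests.result }}" != "success" ]; then
          echo "unit_tests finished with: ${{ needs.unit_tests.result }}"
          exit 1
        fi
```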

@Darryl233 Darryl233 merged commit d7e9e7b into flagos-ai:main Apr 24, 2026
27 of 31 checks passed