[CICD] Refactor workflows, Add integration_tests, Switch to FlagCICD metax runner#60
Darryl233 merged 25 commits into flagos-ai:main from
Conversation
Pull request overview
This PR refactors the CI/CD pipeline to run a simplified, reusable test suite across both CUDA (A100) and Metax (C500) environments (using BAAI runner configs), while also addressing CUDA runtime library loading issues and adding an MCore (Megatron-LM-FL) integration test.
Changes:
- Refactors GitHub Actions into reusable workflows (lint/unit/integration), removes several legacy/disabled workflows, and centralizes per-platform configuration under `.github/configs/*.yml`.
- Adds/updates QA scripts for Metax/CUDA stability (skipping incompatible distributed tests on Metax; improving CUDA runtime discoverability).
- Introduces a Megatron-LM-FL MCore integration test path and wiring in CI.
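The per-platform entry points described above can be sketched as a thin wrapper around the shared workflow. This is an illustrative sketch only; the actual input names and trigger events in the PR may differ:

```yaml
# all_tests_cuda.yml (illustrative): a per-platform entry point that
# delegates to the common workflow and points it at a platform config.
# The `config_file` input name is an assumption, not the PR's exact API.
name: CUDA tests
on: [push, pull_request]
jobs:
  all_tests:
    uses: ./.github/workflows/all_tests_common.yml
    with:
      config_file: .github/configs/cuda.yml
```

The same pattern repeats for Metax with `.github/configs/metax.yml`, so platform differences live in config files rather than duplicated workflow logic.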
Reviewed changes
Copilot reviewed 27 out of 28 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| `transformer_engine/plugin/core/backends/vendor/cuda/cuda.py` | Adds CUDA library preloading with extra search locations (env, pip-installed NVIDIA libs, ldconfig). |
| `transformer_engine/common/__init__.py` | Adds `_load_cudart()` fallback chain and wires it into the core library load sequence. |
| `qa/L1_pytorch_mcore_integration/test.sh` | Reworks the MCore integration test runner (repo sync + torchrun + tensorboard/checkpoint output). |
| `qa/L1_pytorch_mcore_integration/test_bak.sh` | Adds a backup/legacy version of the integration script. |
| `qa/L1_pytorch_distributed_unittest/test.sh` | Adds CUDA runtime preload + Metax skip list; makes debug numerics optional if `nvdlfw_inspect` is unavailable. |
| `qa/L0_pytorch_unittest/test.sh` | Adjusts Metax skip patterns (removes `test_cuda_graphs.py` from the skip list). |
| `qa/L0_pytorch_debug_unittest/test.sh` | Updates Metax skip-matching patterns and adds local run hints. |
| `.github/workflows/unit_tests_common.yml` | Simplifies into a reusable unit-test workflow using `setup_script` + `build_env`; adds distributed unit tests to the matrix. |
| `.github/workflows/integration_tests_common.yml` | Introduces a reusable integration-test workflow (MCore integration). |
| `.github/workflows/all_tests_common.yml` | Centralizes platform config loading, adds a lint job, and orchestrates unit + integration flows. |
| `.github/workflows/all_tests_cuda.yml` | Enables unit + integration runs for CUDA via the common workflow. |
| `.github/workflows/all_tests_metax.yml` | Enables unit + integration runs for Metax via the common workflow. |
| `.github/workflows/lint_common.yml` | Adds a standalone reusable lint workflow. |
| `.github/workflows/qa-l1-te-cpp-pytorch-tests.yml` | Updates checkout behavior and adds the MCore integration test step. |
| `.github/scripts/setup_cuda.sh` | Adds a CUDA environment setup script for workflow reuse. |
| `.github/scripts/setup_metax.sh` | Adds a Metax environment setup script for workflow reuse. |
| `.github/configs/cuda.yml` | Switches to BAAI runner labels, adds `setup_script`, and defines CUDA build env vars. |
| `.github/configs/metax.yml` | Switches to BAAI runner labels/image, adds `setup_script`, and defines Metax build env vars. |
| `.github/workflows/upload-ci-logs.yml` | Removes legacy log upload workflow. |
| `.github/workflows/trigger-ci.yml` | Removes legacy trigger workflow. |
| `.github/workflows/lint.yml` | Removes legacy lint workflow. |
| `.github/workflows/functional_tests_common.yml` | Removes legacy functional training workflow. |
| `.github/workflows/qa-format.yml` | Removes legacy format-check workflow. |
| `.github/workflows/deploy_nightly_docs.yml` | Removes legacy nightly docs workflow. |
| `.github/workflows/blossom-ci.yml` | Removes legacy Blossom CI workflow. |
You can disable workflows forked from the original NVIDIA repository, but do not remove them.
I think we have an original lint.yml. Can we use the original one?
```yaml
env:
  TE_PATH: ${{ github.workspace }}
  TE_FL_PREFER: vendor
  MCORE_REPO_URL: https://github.com/BrianPei/Megatron-LM-FL.git
```
Should not use the forked repo
OK, we will edit this after Megatron-FL is updated; otherwise it can't support functional_tests.
```
- name: Execute Tests
  if: steps.should_run.outputs.should_run == 'true'
  if: inputs.setup_script != ''
```
Is this necessary in Execute Tests step?
```
bash $GITHUB_WORKSPACE/${{ inputs.setup_script }}
```

```
- name: Execute Tests
  if: inputs.setup_script != ''
```
Is this necessary in Execute Tests step?
```yaml
env:
  TE_PATH: .
  TE_FL_PREFER: vendor
  MCORE_REPO_URL: https://github.com/BrianPei/Megatron-LM-FL.git
```
Should not use the forked repo
If you feel it's hard to split L0, L1, L2 into unit tests, functional tests, and integration tests, I suggest following the original test division pattern. The names used in the division pattern are not the main concern.
```
if [ "$PLATFORM" = "metax" ]; then
  case "$test_path" in
    *"test_numerics.py" | *"test_api_features.py" | *"test_sanity.py")
    *tests/pytorch/test_numerics.py | *tests/pytorch/test_sanity.py)
```
Consider configuring ignore rules for future use later.
Updated
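The skip-pattern discussion above boils down to matching on full relative paths instead of bare file names. Here is a minimal, self-contained sketch of that pattern; the helper name `should_skip` and the specific test paths are illustrative, not the repository's actual skip set:

```shell
#!/usr/bin/env bash
# Illustrative sketch of a platform-aware skip list.
# should_skip PLATFORM TEST_PATH -> exit 0 (skip) or 1 (run)
should_skip() {
  local platform="$1" test_path="$2"
  if [ "$platform" = "metax" ]; then
    case "$test_path" in
      # Anchoring on the full relative path avoids accidentally skipping
      # similarly named files elsewhere in the tree.
      *tests/pytorch/test_numerics.py | *tests/pytorch/test_sanity.py)
        return 0 ;;
    esac
  fi
  return 1
}

should_skip metax tests/pytorch/test_numerics.py && echo "skip"
should_skip cuda tests/pytorch/test_numerics.py || echo "run"
```

Because the patterns include the `tests/pytorch/` prefix, a file such as `some_other_dir/test_numerics.py` would still run, which is the "ignore rules" behavior the reviewer asked about.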
```
: ${XML_LOG_DIR:=/logs}
mkdir -p "$XML_LOG_DIR"
```

```
# The current CUDA 12.8 test container hits a fused-attention runtime loader
```
Can you give a specific example to explain this issue?
This workaround avoids a runtime loader issue: when the fused-attention path enters the CUDA backend, it throws a "Cannot load any libcudart.so.* library" error.
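One way such a preload workaround can look is sketched below. This is a hedged illustration, not the PR's actual script: the `CUDART_PATH` variable name and the pip wheel layout (`nvidia/cuda_runtime/lib/`) are assumptions, though the search order (env override, pip-installed NVIDIA wheel, system loader cache) mirrors the fallback chain described in this PR:

```shell
#!/usr/bin/env bash
# Illustrative sketch: locate libcudart before tests run so the fused
# attention path does not fail with "Cannot load any libcudart.so.*".
find_cudart() {
  # 1. Explicit override via an environment variable (name is an assumption).
  if [ -n "${CUDART_PATH:-}" ] && [ -e "$CUDART_PATH" ]; then
    echo "$CUDART_PATH"; return 0
  fi
  # 2. pip-installed runtime (nvidia-cuda-runtime-cu* wheel layout).
  local pip_lib
  pip_lib=$(python3 -c 'import glob, site, sys
for d in site.getsitepackages():
    m = glob.glob(d + "/nvidia/cuda_runtime/lib/libcudart.so*")
    if m:
        print(m[0]); sys.exit()
' 2>/dev/null)
  if [ -n "$pip_lib" ]; then echo "$pip_lib"; return 0; fi
  # 3. System loader cache (ldconfig), if available.
  ldconfig -p 2>/dev/null | awk '/libcudart\.so/ {print $NF; exit}'
}

lib=$(find_cudart)
if [ -n "$lib" ]; then
  # Preload so dlopen() inside the CUDA backend finds the runtime.
  export LD_PRELOAD="$lib${LD_PRELOAD:+:$LD_PRELOAD}"
fi
```

On machines without CUDA the function simply returns nothing and no preload is set, so the script stays safe to source unconditionally.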
```
: ${MCORE_PATH:=${TE_PATH}/qa/L1_pytorch_mcore_integration/Megatron-LM}
: "${TE_PATH:=$(cd -- "${SCRIPT_DIR}/../.." && pwd)}"
: "${MCORE_PATH:=/workspace/Megatron-LM-FL}"
: "${MCORE_REPO_URL:=https://github.com/BrianPei/Megatron-LM-FL.git}"
```
Do not use the forked repo
```python
if not skip_cuda_build():
    _CUDNN_LIB_CTYPES = _load_cudnn()
    _NVRTC_LIB_CTYPES = _load_nvrtc()
    _CUDART_LIB_CTYPES = _load_nvidia_cuda_library("cuda_runtime")
```
Remove unnecessary changes.
Updated this part by restoring the original ordering here, so this block now only keeps the necessary changes.
```
hardware_name: cuda
display_name: 'NVIDIA CUDA (A100)'

# CI image for BAAI env
# - nvidia
# - gpu-8
```

```
# Runner labels for BAAI env
- X64
- metax
- dev
# CI image for BAAI env
```
```yaml
needs:
  - unit_tests
runs-on: ubuntu-latest
if: always() && inputs.run_unit_tests
```
It's necessary. If we remove the always(), this step will be skipped whenever any unit_tests job fails; then the all_tests_complete check step may be skipped too, and the whole workflow will report a wrong result.
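The point about `always()` can be illustrated with a minimal completion-gate job. This is a sketch, not the PR's exact workflow; the job name `all_tests_complete` comes from the discussion above, but the step contents are assumptions:

```yaml
# Without always(), a failed unit_tests job would cause this gate job to
# be *skipped*, and a branch-protection check that requires it could then
# report a misleading (green or neutral) result instead of a failure.
all_tests_complete:
  needs:
    - unit_tests
  runs-on: ubuntu-latest
  if: always() && inputs.run_unit_tests
  steps:
    - name: Check results
      run: |
        # needs.<job>.result is one of: success, failure, cancelled, skipped
        if [ "${{ needs.unit_tests.result }}" != "success" ]; then
          echo "unit tests did not succeed"
          exit 1
        fi
```

With `always()`, the gate job runs regardless of the unit-test outcome and converts any non-success result into an explicit failure.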
Description
Refactors CI/CD workflows to support both CUDA (NVIDIA A100) and Metax (C500) platforms, removes obsolete workflows, and fixes several platform-specific test failures. Adds functional testing and log reporting, significantly simplifies the workflows, and switches the Metax platform to BAAI runner configs.
Type of change
Changes
- `lint_common.yml` (runs in parallel); add `integration_tests_common.yml`
- Per-platform setup scripts (`setup_cuda.sh`/`setup_metax.sh`); switched Metax config to BAAI online environment; removed unsupported test types (JAX distributed) from the Metax matrix
- Skip incompatible distributed tests on Metax (`test_numerics`, `test_torch_fsdp2`, etc.) to prevent `torchrun` SIGSEGV
- Replaced `nvidia-smi`-only FP8 detection with a platform-aware check
- Fixed `libcudart` load failure when the runtime is pip-installed (added a proper fallback chain in `_load_cudart()` and `try_load_lib`)

Checklist