Commit 36d0dfe

Consolidate CI test jobs: merge GPU smoke test and add Python version matrix (#8)

* Consolidate CI test jobs: merge GPU smoke test and add Python version matrix

  - Remove separate smoke-test-gpu job (was serial after gpu-tests, increasing pipeline time). Smoke training+inference now runs in the same gpu-tests job.
  - Replace python-compat matrix (6 jobs, SKIP_TESTS=1) with two focused job groups that actually run tests:
    * gpu-tests: matrix over Python 3.11/3.12/3.13 on GPU runners; installs train deps, runs full test suite (CPU+GPU), then smoke training+inference.
    * inference-tests: matrix over Python 3.11/3.12/3.13 on CPU; installs inference deps, runs tests with pre-trained models (GPU tests auto-skip).

  Reduces total jobs from 11 to 9 while increasing actual test coverage.

  Made-with: Cursor

* Fix GPU CI: set DEBIAN_FRONTEND=noninteractive to prevent tzdata hang

  The deadsnakes PPA pulls in tzdata as a dependency, which triggers an interactive timezone configuration prompt in the container. This caused all 3 GPU matrix jobs to hang for 45 minutes until timeout.

  Made-with: Cursor

* Add pull_request trigger and gate GPU jobs to push/merge_group only

  Without the pull_request trigger, CI never fires on PRs; checks aren't even planned (e.g. PR #9 shows zero checks). GPU jobs are gated to push/merge_group events to avoid consuming self-hosted GPU runners on every PR update.

  Made-with: Cursor

* Remove event gate on GPU jobs so they run on PRs too

  GPU jobs complete in ~5-10 minutes and serve as a useful pre-merge check.

  Made-with: Cursor

* Remove pull-request/[0-9]+ from push trigger to fix duplicate CI runs

  The copy-pr-bot creates pull-request/N branches for each PR, which matched the push trigger and caused every CI job to run twice (once from pull_request, once from push). The pull_request trigger already covers PRs targeting main, so the push pattern is redundant.

  Made-with: Cursor

* Fix GPU CI: gate on event type, restore push trigger for copy-pr-bot

  NVIDIA self-hosted runners block pull_request events outright. GPU CI must run via push events, either to main or to pull-request/[0-9]+ branches created by copy-pr-bot for PR testing.

  - Restore "pull-request/[0-9]+" in push trigger
  - Gate gpu-tests with if: github.event_name != 'pull_request'
  - CPU jobs (inference-tests, unit-tests, etc.) still run on pull_request

  Made-with: Cursor

* Remove pull-request/[0-9]+ push pattern and pull_request gate on GPU jobs

  Simplify triggers: all jobs (including GPU) run on pull_request, push to main, and merge_group. The pull-request/[0-9]+ branch convention is not used by contributors.

  Made-with: Cursor

* Merge unit-tests + inference-tests, gate GPU jobs from pull_request

  - Combine unit-tests (py3.12) and inference-tests (py3.11/3.12/3.13) into a single unit-tests matrix job across all three Python versions. Both ran identical test suites with inference requirements.
  - Re-add if: github.event_name != 'pull_request' on gpu-tests since NVIDIA self-hosted runners block pull_request events entirely. GPU CI runs on push to main and merge_group.

  Made-with: Cursor

* Split GPU tests into separate workflow to avoid skipped PR noise

  NVIDIA self-hosted runners block pull_request events, so GPU jobs in the main CI workflow always showed as a single "Skipped" entry with unresolved matrix names on every PR. Move GPU jobs to ci-gpu.yml (triggers: push to main, merge_group, workflow_dispatch). The main ci.yml keeps CPU jobs only (triggers: pull_request, push to main, merge_group, workflow_dispatch).

  Made-with: Cursor

* Enable GPU CI on PRs via copy-pr-bot push trigger

  Add pull-request/[0-9]+ to ci-gpu.yml push trigger so GPU tests run when copy-pr-bot creates the corresponding branch for a PR.

  Made-with: Cursor

* Fix smoke test step: use bash shell for source command

  The container default shell is sh, which doesn't have the source builtin. Explicitly set shell: bash for the venv activation step.

  Made-with: Cursor

* Install gcc in GPU container for torch.compile/inductor

  The smoke training step uses torch.compile which invokes the inductor backend, requiring a C compiler. The ubuntu:22.04 container doesn't ship with gcc.

  Made-with: Cursor

* Switch CPU jobs to NVIDIA self-hosted linux-amd64-cpu4 runners

  Use nv-cpu-general runner group instead of GitHub-hosted ubuntu-latest. Also restore pull-request/[0-9]+ push trigger in case self-hosted CPU runners block pull_request events (same as GPU runners).

  Made-with: Cursor

* Remove pull_request trigger since all runners are NVIDIA self-hosted

  NVIDIA self-hosted runners block pull_request events. All CI (CPU and GPU) now runs via copy-pr-bot push to pull-request/[0-9]+ branches.

  Made-with: Cursor
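
Several of the squashed commits hinge on which branch names the push filter `pull-request/[0-9]+` accepts. As a rough illustration only, the POSIX shell function below approximates that filter together with `main`. The name `matches_gpu_push_filter` is hypothetical, and the matching logic is an approximation of GitHub's filter-pattern semantics, not GitHub's implementation.

```shell
# Hypothetical helper: approximates the push branch filter
# ("main" or "pull-request/[0-9]+") using plain POSIX sh pattern matching.
matches_gpu_push_filter() {
  case "$1" in
    main) return 0 ;;
    pull-request/*) ;;            # candidate; check the suffix below
    *) return 1 ;;
  esac
  digits="${1#pull-request/}"     # strip the fixed prefix
  case "$digits" in
    '' | *[!0-9]*) return 1 ;;    # empty or contains a non-digit
    *) return 0 ;;                # one or more digits
  esac
}

for b in main pull-request/9 pull-request/abc feature/x; do
  if matches_gpu_push_filter "$b"; then
    echo "$b: would trigger"
  else
    echo "$b: ignored"
  fi
done
```

Under this approximation, only `main` and branches such as `pull-request/9` pass, which is consistent with the message's point that GPU CI runs via copy-pr-bot branches rather than ordinary contributor branches.
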
1 parent 6f78996 commit 36d0dfe

2 files changed

Lines changed: 110 additions & 119 deletions

.github/workflows/ci-gpu.yml

Lines changed: 94 additions & 0 deletions
@@ -0,0 +1,94 @@
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: LicenseRef-NvidiaProprietary
+#
+# NVIDIA CORPORATION, its affiliates and licensors retain all intellectual
+# property and proprietary rights in and to this material, related
+# documentation and any modifications thereto. Any use, reproduction,
+# disclosure or distribution of this material and related documentation
+# without an express license agreement from NVIDIA CORPORATION or
+# its affiliates is strictly prohibited.
+
+# GPU tests live in a separate workflow because NVIDIA self-hosted runners
+# block pull_request events entirely. Keeping them here avoids a confusing
+# "Skipped" entry with unresolved matrix names on every PR.
+
+name: CI / GPU
+
+on:
+  workflow_dispatch:
+  push:
+    branches:
+      - main
+      - "pull-request/[0-9]+"
+  merge_group:
+    types:
+      - checks_requested
+
+concurrency:
+  group: ${{ github.workflow }}-${{ github.ref }}
+  cancel-in-progress: true
+
+env:
+  PIP_NO_CACHE_DIR: "1"
+  PIP_DISABLE_PIP_VERSION_CHECK: "1"
+  PIP_PREFER_BINARY: "1"
+
+jobs:
+  gpu-tests:
+    runs-on: linux-amd64-gpu-rtxpro6000-latest-1
+    container:
+      image: ubuntu:22.04
+      options: -u root --security-opt seccomp=unconfined --shm-size 16g
+      env:
+        NVIDIA_VISIBLE_DEVICES: ${{ env.NVIDIA_VISIBLE_DEVICES }}
+    timeout-minutes: 45
+    strategy:
+      fail-fast: false
+      matrix:
+        python-version: ["3.11", "3.12", "3.13"]
+    name: "gpu / py${{ matrix.python-version }}"
+    steps:
+      - name: Setup proxy cache
+        uses: nv-gha-runners/setup-proxy-cache@main
+        with:
+          enable-apt: true
+
+      - name: Install system dependencies
+        run: |
+          export DEBIAN_FRONTEND=noninteractive
+          apt-get update
+          apt-get install -y git git-lfs gcc software-properties-common
+          add-apt-repository -y ppa:deadsnakes/ppa
+          apt-get update
+          apt-get install -y \
+            python${{ matrix.python-version }} \
+            python${{ matrix.python-version }}-venv \
+            python${{ matrix.python-version }}-dev
+          git lfs install
+
+      - uses: actions/checkout@v4
+        with:
+          lfs: true
+
+      - name: Verify GPU
+        run: nvidia-smi
+
+      - name: Install dependencies and run tests
+        run: bash code/scripts/check_python_compat.sh
+        env:
+          PYTHON_BIN: python${{ matrix.python-version }}
+          MODE: train
+          SKIP_TESTS: "0"
+          REQUIRE_GPU: "1"
+
+      - name: Smoke training + inference
+        shell: bash
+        run: |
+          source .venv_train_${{ matrix.python-version }}/bin/activate
+          bash code/scripts/smoke_run.sh
+        env:
+          EXPERIMENT_NAME: ci_smoke
+          PREDECODER_TRAIN_SAMPLES: "4096"
+          PREDECODER_VAL_SAMPLES: "512"
+          PREDECODER_TEST_SAMPLES: "512"
+          PREDECODER_TRAIN_EPOCHS: "1"
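
The smoke step in ci-gpu.yml sets `shell: bash` because, per the commit message, the container's default `sh` has no `source` builtin. One portable alternative, offered here as an assumption and not as what the workflow does, is the POSIX `.` command, which both `sh` and `bash` provide. The helper name `activate_venv` and the stand-in activate script are hypothetical.

```shell
# Hypothetical helper: activate a virtual environment with the POSIX "."
# command instead of "source", so it also works when the step runs under sh.
activate_venv() {
  venv_dir="$1"
  if [ ! -f "$venv_dir/bin/activate" ]; then
    echo "no activate script in $venv_dir" >&2
    return 1
  fi
  # "." reads the file into the current shell, like "source" does in bash.
  . "$venv_dir/bin/activate"
}

# Demo with a stand-in activate script so the sketch runs without Python.
demo_dir="$(mktemp -d)"
mkdir -p "$demo_dir/bin"
echo 'DEMO_VENV_ACTIVE=1' > "$demo_dir/bin/activate"
activate_venv "$demo_dir" && echo "activated: $DEMO_VENV_ACTIVE"
```

Keeping `shell: bash`, as the commit does, is the smaller change; the `.` form only matters if a step has to stay on the container's default shell.
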

.github/workflows/ci.yml

Lines changed: 16 additions & 119 deletions
@@ -29,12 +29,9 @@ env:
   PIP_DISABLE_PIP_VERSION_CHECK: "1"
   PIP_PREFER_BINARY: "1"
 
-# ---------------------------------------------------------------------------
-# CPU jobs (GitHub-hosted runners)
-# ---------------------------------------------------------------------------
 jobs:
   spdx-header-check:
-    runs-on: ubuntu-latest
+    runs-on: linux-amd64-cpu4
     steps:
       - uses: actions/checkout@v4
       - uses: actions/setup-python@v5
@@ -43,24 +40,29 @@ jobs:
       - run: python3 code/scripts/spdx_headers.py --check
 
   unit-tests:
-    runs-on: ubuntu-latest
+    runs-on: linux-amd64-cpu4
+    strategy:
+      fail-fast: false
+      matrix:
+        python-version: ["3.11", "3.12", "3.13"]
+    name: "unit-tests / py${{ matrix.python-version }}"
     steps:
       - uses: actions/checkout@v4
         with:
           lfs: true
       - uses: actions/setup-python@v5
         with:
-          python-version: "3.12"
-      - name: Install dependencies
-        run: |
-          python -m pip install --upgrade pip setuptools wheel
-          pip install -r code/requirements_public_inference.txt \
-            --extra-index-url https://download.pytorch.org/whl/cpu
-      - name: Run tests
-        run: PYTHONPATH=code python -m unittest discover -s code/tests -p "test_*.py"
+          python-version: ${{ matrix.python-version }}
+      - name: Install dependencies and run tests
+        run: bash code/scripts/check_python_compat.sh
+        env:
+          PYTHON_BIN: python
+          MODE: inference
+          SKIP_TESTS: "0"
+          PIP_EXTRA_INDEX_URL: "https://download.pytorch.org/whl/cpu"
 
   unit-tests-coverage:
-    runs-on: ubuntu-latest
+    runs-on: linux-amd64-cpu4
     steps:
       - uses: actions/checkout@v4
         with:
@@ -88,108 +90,3 @@ jobs:
           path: |
             htmlcov/
             coverage.xml
-
-  python-compat:
-    runs-on: ubuntu-latest
-    strategy:
-      fail-fast: false
-      matrix:
-        python-version: ["3.11", "3.12", "3.13"]
-        mode: [inference, train]
-    name: "compat / py${{ matrix.python-version }} / ${{ matrix.mode }}"
-    steps:
-      - uses: actions/checkout@v4
-      - uses: actions/setup-python@v5
-        with:
-          python-version: ${{ matrix.python-version }}
-      - name: Check Python compatibility
-        run: bash code/scripts/check_python_compat.sh
-        env:
-          MODE: ${{ matrix.mode }}
-          PYTHON_BIN: python
-          SKIP_TESTS: "1"
-          PIP_EXTRA_INDEX_URL: "https://download.pytorch.org/whl/cpu"
-
-  # ---------------------------------------------------------------------------
-  # GPU jobs (self-hosted NVIDIA runners)
-  # ---------------------------------------------------------------------------
-  gpu-tests:
-    runs-on: linux-amd64-gpu-rtxpro6000-latest-1
-    container:
-      image: ubuntu:22.04
-      options: -u root --security-opt seccomp=unconfined --shm-size 16g
-      env:
-        NVIDIA_VISIBLE_DEVICES: ${{ env.NVIDIA_VISIBLE_DEVICES }}
-    timeout-minutes: 30
-    steps:
-      - name: Setup proxy cache
-        uses: nv-gha-runners/setup-proxy-cache@main
-        with:
-          enable-apt: true
-
-      - name: Install system dependencies
-        run: |
-          apt-get update
-          apt-get install -y git git-lfs python3 python3-pip python3-venv
-          git lfs install
-
-      - uses: actions/checkout@v4
-        with:
-          lfs: true
-
-      - name: Install Python dependencies
-        run: |
-          python3 -m pip install --upgrade pip setuptools wheel
-          pip install -r code/requirements_public_inference.txt
-
-      - name: Verify GPU
-        run: |
-          nvidia-smi
-          python3 -c "import torch; assert torch.cuda.is_available(), 'CUDA not available'; print(torch.cuda.get_device_name(0))"
-
-      - name: Run full test suite (CPU + GPU)
-        run: PYTHONPATH=code python3 -m unittest discover -s code/tests -p "test_*.py"
-
-  smoke-test-gpu:
-    runs-on: linux-amd64-gpu-rtxpro6000-latest-1
-    needs: gpu-tests
-    container:
-      image: ubuntu:22.04
-      options: -u root --security-opt seccomp=unconfined --shm-size 16g
-      env:
-        NVIDIA_VISIBLE_DEVICES: ${{ env.NVIDIA_VISIBLE_DEVICES }}
-    timeout-minutes: 30
-    steps:
-      - name: Setup proxy cache
-        uses: nv-gha-runners/setup-proxy-cache@main
-        with:
-          enable-apt: true
-
-      - name: Install system dependencies
-        run: |
-          apt-get update
-          apt-get install -y git git-lfs python3 python3-pip python3-venv
-          git lfs install
-
-      - uses: actions/checkout@v4
-        with:
-          lfs: true
-
-      - name: Install Python dependencies
-        run: |
-          python3 -m pip install --upgrade pip setuptools wheel
-          pip install -r code/requirements_public_train.txt
-
-      - name: Verify GPU
-        run: |
-          nvidia-smi
-          python3 -c "import torch; assert torch.cuda.is_available(), 'CUDA not available'"
-
-      - name: Smoke training + inference
-        run: bash code/scripts/smoke_run.sh
-        env:
-          EXPERIMENT_NAME: ci_smoke
-          PREDECODER_TRAIN_SAMPLES: "4096"
-          PREDECODER_VAL_SAMPLES: "512"
-          PREDECODER_TEST_SAMPLES: "512"
-          PREDECODER_TRAIN_EPOCHS: "1"
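
Both workflows drive `code/scripts/check_python_compat.sh` through the `PYTHON_BIN`, `MODE`, and `SKIP_TESTS` environment variables, but the script itself is not part of this diff. The sketch below is therefore hypothetical: it only illustrates how such variables could map to the requirements files named in the old workflow and to a venv path shaped like `.venv_train_3.12`. The function names and the inference venv name are assumptions.

```shell
# Hypothetical sketch: NOT the contents of check_python_compat.sh.
# Maps MODE to the requirements file names that appear in the old ci.yml.
requirements_for_mode() {
  case "$1" in
    inference) echo "code/requirements_public_inference.txt" ;;
    train)     echo "code/requirements_public_train.txt" ;;
    *)         echo "unknown MODE: $1" >&2; return 1 ;;
  esac
}

# Assumed venv naming, inferred from the ".venv_train_<version>" path that
# the GPU smoke step activates; the inference variant is a guess.
venv_dir_for() {
  echo ".venv_${1}_${2}"
}

MODE="${MODE:-inference}"        # same default-free variables the jobs set
SKIP_TESTS="${SKIP_TESTS:-0}"
echo "requirements: $(requirements_for_mode "$MODE")"
echo "venv:         $(venv_dir_for "$MODE" 3.12)"
if [ "$SKIP_TESTS" = "1" ]; then
  echo "tests: skipped (install-only check)"
else
  echo "tests: run"
fi
```

Read this way, the change from `SKIP_TESTS: "1"` in the removed python-compat job to `SKIP_TESTS: "0"` in the new jobs is what the commit message means by job groups "that actually run tests".
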
