Consolidate CI test jobs: merge GPU smoke test and add Python version matrix (#8)

ivanbasov · web-flow · commit 36d0dfe6100c · 2026-03-05T17:28:42.000-08:00
* Consolidate CI test jobs: merge GPU smoke test and add Python version matrix

  - Remove separate smoke-test-gpu job (was serial after gpu-tests, increasing pipeline time). Smoke training+inference now runs in the same gpu-tests job.
  - Replace python-compat matrix (6 jobs, SKIP_TESTS=1) with two focused job groups that actually run tests:
    * gpu-tests: matrix over Python 3.11/3.12/3.13 on GPU runners; installs train deps, runs full test suite (CPU+GPU), then smoke training+inference.
    * inference-tests: matrix over Python 3.11/3.12/3.13 on CPU; installs inference deps, runs tests with pre-trained models (GPU tests auto-skip).

  Reduces total jobs from 11 to 9 while increasing actual test coverage.

  Made-with: Cursor

* Fix GPU CI: set DEBIAN_FRONTEND=noninteractive to prevent tzdata hang

  The deadsnakes PPA pulls in tzdata as a dependency, which triggers an interactive timezone configuration prompt in the container. This caused all 3 GPU matrix jobs to hang for 45 minutes until timeout.

  Made-with: Cursor

* Add pull_request trigger and gate GPU jobs to push/merge_group only

  Without the pull_request trigger, CI never fires on PRs: checks aren't even planned (e.g. PR #9 shows zero checks). GPU jobs are gated to push/merge_group events to avoid consuming self-hosted GPU runners on every PR update.

  Made-with: Cursor

* Remove event gate on GPU jobs so they run on PRs too

  GPU jobs complete in ~5-10 minutes and serve as a useful pre-merge check.

  Made-with: Cursor

* Remove pull-request/[0-9]+ from push trigger to fix duplicate CI runs

  The copy-pr-bot creates pull-request/N branches for each PR, which matched the push trigger and caused every CI job to run twice (once from pull_request, once from push). The pull_request trigger already covers PRs targeting main, so the push pattern is redundant.

  Made-with: Cursor

* Fix GPU CI: gate on event type, restore push trigger for copy-pr-bot

  NVIDIA self-hosted runners block pull_request events outright. GPU CI must run via push events, either to main or to pull-request/[0-9]+ branches created by copy-pr-bot for PR testing.

  - Restore "pull-request/[0-9]+" in push trigger
  - Gate gpu-tests with if: github.event_name != 'pull_request'
  - CPU jobs (inference-tests, unit-tests, etc.) still run on pull_request

  Made-with: Cursor

* Remove pull-request/[0-9]+ push pattern and pull_request gate on GPU jobs

  Simplify triggers: all jobs (including GPU) run on pull_request, push to main, and merge_group. The pull-request/[0-9]+ branch convention is not used by contributors.

  Made-with: Cursor

* Merge unit-tests + inference-tests, gate GPU jobs from pull_request

  - Combine unit-tests (py3.12) and inference-tests (py3.11/3.12/3.13) into a single unit-tests matrix job across all three Python versions. Both ran identical test suites with inference requirements.
  - Re-add if: github.event_name != 'pull_request' on gpu-tests since NVIDIA self-hosted runners block pull_request events entirely. GPU CI runs on push to main and merge_group.

  Made-with: Cursor

* Split GPU tests into separate workflow to avoid skipped PR noise

  NVIDIA self-hosted runners block pull_request events, so GPU jobs in the main CI workflow always showed as a single "Skipped" entry with unresolved matrix names on every PR. Move GPU jobs to ci-gpu.yml (triggers: push to main, merge_group, workflow_dispatch). The main ci.yml keeps CPU jobs only (triggers: pull_request, push to main, merge_group, workflow_dispatch).

  Made-with: Cursor

* Enable GPU CI on PRs via copy-pr-bot push trigger

  Add pull-request/[0-9]+ to ci-gpu.yml push trigger so GPU tests run when copy-pr-bot creates the corresponding branch for a PR.

  Made-with: Cursor

* Fix smoke test step: use bash shell for source command

  The container default shell is sh, which doesn't have the source builtin. Explicitly set shell: bash for the venv activation step.

  Made-with: Cursor

* Install gcc in GPU container for torch.compile/inductor

  The smoke training step uses torch.compile which invokes the inductor backend, requiring a C compiler. The ubuntu:22.04 container doesn't ship with gcc.

  Made-with: Cursor

* Switch CPU jobs to NVIDIA self-hosted linux-amd64-cpu4 runners

  Use nv-cpu-general runner group instead of GitHub-hosted ubuntu-latest. Also restore pull-request/[0-9]+ push trigger in case self-hosted CPU runners block pull_request events (same as GPU runners).

  Made-with: Cursor

* Remove pull_request trigger since all runners are NVIDIA self-hosted

  NVIDIA self-hosted runners block pull_request events. All CI (CPU and GPU) now runs via copy-pr-bot push to pull-request/[0-9]+ branches.

  Made-with: Cursor
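
Reviewer note: several of the commits above turn on which branches the `pull-request/[0-9]+` push filter selects. The sketch below approximates that filter with an anchored regular expression so the intent is easy to check locally. It assumes the filter is matched against the whole branch name; `runs_gpu_ci_on_push` is a hypothetical helper, not something in this repository, and GitHub's own matcher remains the source of truth.

```shell
#!/usr/bin/env bash
set -e

# Hypothetical helper (not part of this repository): approximates the push
# branch filter of ci-gpu.yml ("main" or "pull-request/[0-9]+").
runs_gpu_ci_on_push() {
  local branch="$1"
  # Anchored on both ends, on the assumption that the whole name must match.
  local pattern='^pull-request/[0-9]+$'
  [[ "$branch" == "main" || "$branch" =~ $pattern ]]
}

# Sample branch names are illustrative only.
for branch in main pull-request/8 pull-request/ feature/pull-request/8 pull-request/8-fix; do
  if runs_gpu_ci_on_push "$branch"; then
    echo "$branch: triggers"
  else
    echo "$branch: ignored"
  fi
done
```

Under this approximation only `main` and the numbered branches created by copy-pr-bot start the workflow; ordinary contributor branches do not.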
diff --git a/.github/workflows/ci-gpu.yml b/.github/workflows/ci-gpu.yml
@@ -0,0 +1,94 @@
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: LicenseRef-NvidiaProprietary
+#
+# NVIDIA CORPORATION, its affiliates and licensors retain all intellectual
+# property and proprietary rights in and to this material, related
+# documentation and any modifications thereto. Any use, reproduction,
+# disclosure or distribution of this material and related documentation
+# without an express license agreement from NVIDIA CORPORATION or
+# its affiliates is strictly prohibited.
+
+# GPU tests live in a separate workflow because NVIDIA self-hosted runners
+# block pull_request events entirely. Keeping them here avoids a confusing
+# "Skipped" entry with unresolved matrix names on every PR.
+
+name: CI / GPU
+
+on:
+  workflow_dispatch:
+  push:
+    branches:
+      - main
+      - "pull-request/[0-9]+"
+  merge_group:
+    types:
+      - checks_requested
+
+concurrency:
+  group: ${{ github.workflow }}-${{ github.ref }}
+  cancel-in-progress: true
+
+env:
+  PIP_NO_CACHE_DIR: "1"
+  PIP_DISABLE_PIP_VERSION_CHECK: "1"
+  PIP_PREFER_BINARY: "1"
+
+jobs:
+  gpu-tests:
+    runs-on: linux-amd64-gpu-rtxpro6000-latest-1
+    container:
+      image: ubuntu:22.04
+      options: -u root --security-opt seccomp=unconfined --shm-size 16g
+      env:
+        NVIDIA_VISIBLE_DEVICES: ${{ env.NVIDIA_VISIBLE_DEVICES }}
+    timeout-minutes: 45
+    strategy:
+      fail-fast: false
+      matrix:
+        python-version: ["3.11", "3.12", "3.13"]
+    name: "gpu / py${{ matrix.python-version }}"
+    steps:
+      - name: Setup proxy cache
+        uses: nv-gha-runners/setup-proxy-cache@main
+        with:
+          enable-apt: true
+
+      - name: Install system dependencies
+        run: |
+          export DEBIAN_FRONTEND=noninteractive
+          apt-get update
+          apt-get install -y git git-lfs gcc software-properties-common
+          add-apt-repository -y ppa:deadsnakes/ppa
+          apt-get update
+          apt-get install -y \
+            python${{ matrix.python-version }} \
+            python${{ matrix.python-version }}-venv \
+            python${{ matrix.python-version }}-dev
+          git lfs install
+
+      - uses: actions/checkout@v4
+        with:
+          lfs: true
+
+      - name: Verify GPU
+        run: nvidia-smi
+
+      - name: Install dependencies and run tests
+        run: bash code/scripts/check_python_compat.sh
+        env:
+          PYTHON_BIN: python${{ matrix.python-version }}
+          MODE: train
+          SKIP_TESTS: "0"
+          REQUIRE_GPU: "1"
+
+      - name: Smoke training + inference
+        shell: bash
+        run: |
+          source .venv_train_${{ matrix.python-version }}/bin/activate
+          bash code/scripts/smoke_run.sh
+        env:
+          EXPERIMENT_NAME: ci_smoke
+          PREDECODER_TRAIN_SAMPLES: "4096"
+          PREDECODER_VAL_SAMPLES: "512"
+          PREDECODER_TEST_SAMPLES: "512"
+          PREDECODER_TRAIN_EPOCHS: "1"
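
Reviewer note: the smoke step above sets `shell: bash` because its activation line uses `source`, which the commit message reports is missing from the container's default `sh`. As a point of comparison, here is a minimal sketch of the POSIX `.` spelling, which works in `sh` as well as in bash. The activate script is a one-line stand-in written for this example (an assumption, not the venv that check_python_compat.sh creates).

```shell
#!/usr/bin/env bash
set -e

# Stand-in for a venv activate script (hypothetical content for illustration).
workdir="$(mktemp -d)"
mkdir -p "$workdir/bin"
printf 'VIRTUAL_ENV="%s"\nexport VIRTUAL_ENV\n' "$workdir" > "$workdir/bin/activate"

# "." is the POSIX form of "source", so this line runs under plain sh too.
activated="$(sh -c '. "$1/bin/activate"; printf "%s" "$VIRTUAL_ENV"' sh "$workdir")"
echo "activated: $activated"

# Clean up the temporary directory.
rm -rf "$workdir"
```

Either approach addresses the failure described in the commit message; keeping `shell: bash` has the advantage that the rest of the step can rely on bash features as well.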
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
@@ -29,12 +29,9 @@ env:
   PIP_DISABLE_PIP_VERSION_CHECK: "1"
   PIP_PREFER_BINARY: "1"
 
-# ---------------------------------------------------------------------------
-# CPU jobs (GitHub-hosted runners)
-# ---------------------------------------------------------------------------
 jobs:
   spdx-header-check:
-    runs-on: ubuntu-latest
+    runs-on: linux-amd64-cpu4
     steps:
       - uses: actions/checkout@v4
       - uses: actions/setup-python@v5
@@ -43,24 +40,29 @@ jobs:
       - run: python3 code/scripts/spdx_headers.py --check
 
   unit-tests:
-    runs-on: ubuntu-latest
+    runs-on: linux-amd64-cpu4
+    strategy:
+      fail-fast: false
+      matrix:
+        python-version: ["3.11", "3.12", "3.13"]
+    name: "unit-tests / py${{ matrix.python-version }}"
     steps:
       - uses: actions/checkout@v4
         with:
           lfs: true
       - uses: actions/setup-python@v5
         with:
-          python-version: "3.12"
-      - name: Install dependencies
-        run: |
-          python -m pip install --upgrade pip setuptools wheel
-          pip install -r code/requirements_public_inference.txt \
-            --extra-index-url https://download.pytorch.org/whl/cpu
-      - name: Run tests
-        run: PYTHONPATH=code python -m unittest discover -s code/tests -p "test_*.py"
+          python-version: ${{ matrix.python-version }}
+      - name: Install dependencies and run tests
+        run: bash code/scripts/check_python_compat.sh
+        env:
+          PYTHON_BIN: python
+          MODE: inference
+          SKIP_TESTS: "0"
+          PIP_EXTRA_INDEX_URL: "https://download.pytorch.org/whl/cpu"
 
   unit-tests-coverage:
-    runs-on: ubuntu-latest
+    runs-on: linux-amd64-cpu4
     steps:
       - uses: actions/checkout@v4
         with:
@@ -88,108 +90,3 @@ jobs:
           path: |
             htmlcov/
             coverage.xml
-
-  python-compat:
-    runs-on: ubuntu-latest
-    strategy:
-      fail-fast: false
-      matrix:
-        python-version: ["3.11", "3.12", "3.13"]
-        mode: [inference, train]
-    name: "compat / py${{ matrix.python-version }} / ${{ matrix.mode }}"
-    steps:
-      - uses: actions/checkout@v4
-      - uses: actions/setup-python@v5
-        with:
-          python-version: ${{ matrix.python-version }}
-      - name: Check Python compatibility
-        run: bash code/scripts/check_python_compat.sh
-        env:
-          MODE: ${{ matrix.mode }}
-          PYTHON_BIN: python
-          SKIP_TESTS: "1"
-          PIP_EXTRA_INDEX_URL: "https://download.pytorch.org/whl/cpu"
-
-  # ---------------------------------------------------------------------------
-  # GPU jobs (self-hosted NVIDIA runners)
-  # ---------------------------------------------------------------------------
-  gpu-tests:
-    runs-on: linux-amd64-gpu-rtxpro6000-latest-1
-    container:
-      image: ubuntu:22.04
-      options: -u root --security-opt seccomp=unconfined --shm-size 16g
-      env:
-        NVIDIA_VISIBLE_DEVICES: ${{ env.NVIDIA_VISIBLE_DEVICES }}
-    timeout-minutes: 30
-    steps:
-      - name: Setup proxy cache
-        uses: nv-gha-runners/setup-proxy-cache@main
-        with:
-          enable-apt: true
-
-      - name: Install system dependencies
-        run: |
-          apt-get update
-          apt-get install -y git git-lfs python3 python3-pip python3-venv
-          git lfs install
-
-      - uses: actions/checkout@v4
-        with:
-          lfs: true
-
-      - name: Install Python dependencies
-        run: |
-          python3 -m pip install --upgrade pip setuptools wheel
-          pip install -r code/requirements_public_inference.txt
-
-      - name: Verify GPU
-        run: |
-          nvidia-smi
-          python3 -c "import torch; assert torch.cuda.is_available(), 'CUDA not available'; print(torch.cuda.get_device_name(0))"
-
-      - name: Run full test suite (CPU + GPU)
-        run: PYTHONPATH=code python3 -m unittest discover -s code/tests -p "test_*.py"
-
-  smoke-test-gpu:
-    runs-on: linux-amd64-gpu-rtxpro6000-latest-1
-    needs: gpu-tests
-    container:
-      image: ubuntu:22.04
-      options: -u root --security-opt seccomp=unconfined --shm-size 16g
-      env:
-        NVIDIA_VISIBLE_DEVICES: ${{ env.NVIDIA_VISIBLE_DEVICES }}
-    timeout-minutes: 30
-    steps:
-      - name: Setup proxy cache
-        uses: nv-gha-runners/setup-proxy-cache@main
-        with:
-          enable-apt: true
-
-      - name: Install system dependencies
-        run: |
-          apt-get update
-          apt-get install -y git git-lfs python3 python3-pip python3-venv
-          git lfs install
-
-      - uses: actions/checkout@v4
-        with:
-          lfs: true
-
-      - name: Install Python dependencies
-        run: |
-          python3 -m pip install --upgrade pip setuptools wheel
-          pip install -r code/requirements_public_train.txt
-
-      - name: Verify GPU
-        run: |
-          nvidia-smi
-          python3 -c "import torch; assert torch.cuda.is_available(), 'CUDA not available'"
-
-      - name: Smoke training + inference
-        run: bash code/scripts/smoke_run.sh
-        env:
-          EXPERIMENT_NAME: ci_smoke
-          PREDECODER_TRAIN_SAMPLES: "4096"
-          PREDECODER_VAL_SAMPLES: "512"
-          PREDECODER_TEST_SAMPLES: "512"
-          PREDECODER_TRAIN_EPOCHS: "1"