Bump container hooks to v0.8.8 and harden smoke tests #434

Merged
jeanschmidt merged 2 commits into main from jeanschmidt/rpc
Apr 13, 2026

@jeanschmidt
Contributor

Impact: CI runners, integration tests, smoke tests
Risk: High

See jeanschmidt/runner-container-hooks#1 for details on what v0.8.7 introduces.

What

Bumps the patched runner-container-hooks from v0.8.5 to v0.8.7 (from jeanschmidt/runner-container-hooks fork), standardizes all integration test containers on ghcr.io/actions/actions-runner:latest, and hardens smoke tests against node churn edge cases.

Why

The container hooks v0.8.7 release includes fixes from jeanschmidt/runner-container-hooks#1. Smoke tests were flaking when pods remained scheduled on nodes that had already been terminated (disappeared from the API server) or on cordoned nodes.

How

  • Switched all test containers to actions-runner image — GPU tests rely on host-level NVIDIA drivers exposed into the container, not on CUDA libraries in the container image
  • Added "disappeared node" tracking in smoke tests — pods referencing nodes no longer in the API server are excluded from failure counts rather than causing false positives
  • Added cordoned node (spec.unschedulable) detection to the unstable node classifier
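
The cordoned-node check in the last bullet can be sketched roughly as below. This is a minimal illustration and not the code in helpers.py: it assumes nodes are plain dicts shaped like Kubernetes Node JSON, the function names are hypothetical, and the Ready-condition check is an assumed stand-in for whatever else `_is_node_unstable()` inspects.

```python
from typing import Any, Mapping


def is_cordoned(node: Mapping[str, Any]) -> bool:
    """Return True when the node has spec.unschedulable set to true."""
    spec = node.get("spec") or {}
    return bool(spec.get("unschedulable", False))


def is_node_unstable_sketch(node: Mapping[str, Any]) -> bool:
    """Hypothetical classifier: a cordoned or not-Ready node is unstable."""
    # Cordoned nodes are usually about to be drained or scaled down.
    if is_cordoned(node):
        return True
    # Assumption: a missing Ready condition, or Ready != "True", is unstable.
    conditions = (node.get("status") or {}).get("conditions") or []
    ready = [c for c in conditions if c.get("type") == "Ready"]
    return not ready or ready[0].get("status") != "True"
```

Checking `spec.unschedulable` first keeps the cordon rule independent of node conditions, so a node that still reports Ready while it is being drained is treated as unstable anyway.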

Changes

Container hooks bump:

  • hooks-warmer.yaml: HOOKS_VERSION 0.8.5 → 0.8.7

Integration test container standardization:

  • build-image.yaml: Replaced privileged moby/buildkit:v0.29.0 container with non-privileged actions-runner:latest + runtime buildctl install step
  • integration-test.yaml.tpl: GPU test jobs (test-gpu-t4, test-gpu-t4-multi, test-gpu-b200-2) switched from nvidia/cuda:12.6.3-base-ubuntu22.04 to actions-runner:latest
  • workflow_generator.py: GPU_CONTAINER constant changed from nvidia/cuda:12.6.3-runtime-ubuntu22.04 to actions-runner:latest

Smoke test hardening:

  • helpers.py: _is_node_unstable() now treats cordoned nodes (spec.unschedulable) as unstable; new get_all_node_names() helper
  • test_logging.py: test_alloy_pods_running now excludes pods on "disappeared" nodes (node no longer in API server) from failure assertions; reports count in error messages
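
One way the "disappeared node" exclusion could work is sketched below. It is an assumption-laden illustration, not the actual test_logging.py change: pods are plain dicts shaped like Kubernetes Pod JSON, `split_unhealthy_pods` is a hypothetical name, and `known_node_names` stands in for the result of the new `get_all_node_names()` helper, whose real signature is not shown in this PR description.

```python
from typing import Any, Iterable, List, Mapping, Set, Tuple


def split_unhealthy_pods(
    pods: Iterable[Mapping[str, Any]],
    known_node_names: Set[str],
) -> Tuple[List[str], List[str]]:
    """Split non-Running pods into (failures, disappeared).

    A pod lands in `disappeared` when it references a node name that the
    API server no longer lists; every other non-Running pod is a failure.
    """
    failures: List[str] = []
    disappeared: List[str] = []
    for pod in pods:
        phase = (pod.get("status") or {}).get("phase")
        if phase == "Running":
            continue  # healthy pods are never reported
        name = pod["metadata"]["name"]
        node_name = (pod.get("spec") or {}).get("nodeName")
        if node_name and node_name not in known_node_names:
            disappeared.append(name)
        else:
            failures.append(name)
    return failures, disappeared
```

A test could then assert that `failures` is empty and include `len(disappeared)` in the assertion message, so that excluded pods stay visible in the output without failing the run.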

Testing

$  just integration-test arc-staging
Updating kubeconfig for pytorch-arc-staging (us-west-1)...
Updated context pytorch-arc-staging in /Users/jschmidt/.kube/config
17:36:04 [INFO] Integration test for cluster: arc-staging (pytorch-arc-staging)
17:36:04 [INFO]   Runner prefix: 'c-mt-'
17:36:04 [INFO]   B200 enabled: False
17:36:04 [INFO]   Release runners: True
17:36:04 [INFO]   Cache enforcer: True
17:36:04 [INFO]   PyPI cache slugs: cpu cu126 cu128 cu130
17:36:04 [INFO]   Smoke tests: skip
17:36:04 [INFO]   Compactor tests: skip
17:36:04 [INFO]   Branch: osdc-integration-test-arc-staging
17:36:04 [INFO] Phase 0: Cleaning up stale PRs...
17:36:09 [INFO] Phase 1: Checking for active runner pods (arc-staging only)...
17:36:16 [INFO]   No runner pods active. Skipping pool clear.
17:36:16 [INFO]   Canary repo already cloned at /Users/jschmidt/meta/ciforge/osdc/upstream/osdc/.scratch/pytorch-canary, fetching...
17:36:17 [INFO] Phase 2: Preparing PR...
17:36:33 [INFO]   PR #411 created: https://github.com/pytorch/pytorch-canary/pull/411
17:36:33 [INFO] Phase 3: Running parallel validation...
17:36:33 [INFO] Phase 4: Waiting for PR workflow runs (timeout: 50 min, buffer: 10 min)...
17:36:33 [INFO]   Filtering to runs created after 2026-04-11T00:36:17.795084+00:00
17:36:40 [INFO]   Run: OSDC Integration Test — https://github.com/pytorch/pytorch-canary/actions/runs/24270460678
17:36:40 [INFO]   1/1 runs still in progress...
17:37:13 [INFO]   1/1 runs still in progress...
17:37:45 [INFO]   1/1 runs still in progress...
17:38:18 [INFO]   1/1 runs still in progress...
17:38:52 [INFO]   1/1 runs still in progress...
17:39:24 [INFO]   1/1 runs still in progress...
17:40:00 [INFO]   1/1 runs still in progress...
17:40:31 [INFO]   1/1 runs still in progress...
17:41:05 [INFO]   1/1 runs still in progress...
17:41:40 [INFO]   1/1 runs still in progress...
17:42:13 [INFO]   1/1 runs still in progress...
17:42:49 [INFO]   All 1 run(s) completed.


============================================================
  OSDC Integration Test Results
============================================================
  Cluster: arc-staging (pytorch-arc-staging)
  Date:    2026-04-11 00:42 UTC

  PR Workflow Jobs:
    ✓ test-gpu-t4                    success
    ✓ test-pypi-cache-defaults       success
    ✓ test-pypi-cache-action-cuda    success
    ✓ test-git-cache                 success
    ✓ test-pypi-cache-action-cpu     success
    ✓ test-cpu-x86-amx               success
    ✓ test-cpu-arm64                 success
    ✓ test-cpu-x86-avx512            success
    ✓ test-cache-enforcer            success
    ✓ test-release-arm64             success
    ✓ test-gpu-t4-multi              success
    ✓ test-harbor                    success
    ✓ build-amd64 / build            success
    ✓ build-arm64 / build            success

  Smoke            ⊘ SKIPPED
  Compactor        ⊘ SKIPPED

  Overall: PASSED
============================================================

17:42:53 [INFO] Phase 5: Closing PR #411...
17:42:56 [INFO] Total integration test time: 6m52s

- Bump runner-container-hooks from 0.8.5 to 0.8.7 (hooks-warmer DaemonSet)
- Replace nvidia/cuda container images with actions-runner in GPU integration
  tests and load-test generator (privileged buildkit container no longer needed)
- Install buildctl at runtime in build-image workflow instead of using
  privileged moby/buildkit container
- Treat cordoned nodes as unstable in smoke test helpers to avoid false failures
- Exclude pods on disappeared nodes from alloy-logging health checks

Notes:
The container image changes align GPU test jobs and the build-image workflow
with the runner-container-hooks PR (jeanschmidt/runner-container-hooks#1),
which removes the need for privileged containers. The hooks 0.8.7 release
includes fixes from that PR. The smoke test hardening addresses flaky failures
during node scale-down: cordoned nodes and pods orphaned on already-terminated
nodes are now properly excluded from health assertions.

Signed-off-by: Jean Schmidt <contato@jschmidt.me>
Contributor

@huydhn huydhn left a comment

LGTM!

@jeanschmidt jeanschmidt changed the title from "Bump container hooks to v0.8.7 and harden smoke tests" to "Bump container hooks to v0.8.8 and harden smoke tests" Apr 13, 2026
- Update HOOKS_VERSION from 0.8.7 to 0.8.8 in hooks-warmer DaemonSet
- Update release URL references in hooks-warmer.yaml and runner.yaml.tpl
- Update documentation to reflect v0.8.8 in node-warmup-and-scheduling-gates.md

Signed-off-by: Jean Schmidt <contato@jschmidt.me>
@jeanschmidt jeanschmidt added this pull request to the merge queue Apr 13, 2026
Merged via the queue into main with commit cfdba52 Apr 13, 2026
13 checks passed
@jeanschmidt jeanschmidt deleted the jeanschmidt/rpc branch April 13, 2026 22:49
jeanschmidt added a commit that referenced this pull request Apr 14, 2026