Bump container hooks to v0.8.8 and harden smoke tests #434
Merged
jeanschmidt merged 2 commits into main on Apr 13, 2026
- Bump runner-container-hooks from 0.8.5 to 0.8.7 (hooks-warmer DaemonSet)
- Replace nvidia/cuda container images with actions-runner in GPU integration tests and load-test generator (privileged buildkit container no longer needed)
- Install buildctl at runtime in build-image workflow instead of using privileged moby/buildkit container
- Treat cordoned nodes as unstable in smoke test helpers to avoid false failures
- Exclude pods on disappeared nodes from alloy-logging health checks

Notes: The container image changes align GPU test jobs and the build-image workflow with the runner-container-hooks PR (jeanschmidt/runner-container-hooks#1), which removes the need for privileged containers. The hooks 0.8.7 release includes fixes from that PR. The smoke test hardening addresses flaky failures during node scale-down: cordoned nodes and pods orphaned on already-terminated nodes are now properly excluded from health assertions.

Signed-off-by: Jean Schmidt <contato@jschmidt.me>
huydhn approved these changes on Apr 13, 2026
- Update HOOKS_VERSION from 0.8.7 to 0.8.8 in hooks-warmer DaemonSet
- Update release URL references in hooks-warmer.yaml and runner.yaml.tpl
- Update documentation to reflect v0.8.8 in node-warmup-and-scheduling-gates.md

Signed-off-by: Jean Schmidt <contato@jschmidt.me>
jeanschmidt added a commit that referenced this pull request on Apr 14, 2026
This reverts commit cfdba52.
Impact: CI runners, integration tests, smoke tests
Risk: High
See jeanschmidt/runner-container-hooks#1 for details on what v0.8.7 introduces.
What
Bumps the patched runner-container-hooks from v0.8.5 to v0.8.7 (from jeanschmidt/runner-container-hooks fork), standardizes all integration test containers on ghcr.io/actions/actions-runner:latest, and hardens smoke tests against node churn edge cases.
Why
The container hooks v0.8.7 release includes fixes from jeanschmidt/runner-container-hooks#1. Smoke tests were flaking when pods remained scheduled on nodes that had already been terminated (disappeared from the API server) or on cordoned nodes.
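The cordoned-node condition described above can be sketched in Python. This is a minimal illustration, not the PR's code: the function name is_node_unstable and the Ready-condition check are assumptions, and nodes are plain dicts shaped like the Kubernetes Node JSON, where spec.unschedulable is true for a cordoned node.

```python
from typing import Any, Mapping


def is_node_unstable(node: Mapping[str, Any]) -> bool:
    """Hypothetical stand-in for the helper the PR calls _is_node_unstable()."""
    spec = node.get("spec") or {}
    # A cordoned node carries spec.unschedulable = true in the Kubernetes API.
    if spec.get("unschedulable") is True:
        return True
    # Assumption: a node without a Ready=True condition also counts as unstable.
    conditions = (node.get("status") or {}).get("conditions") or []
    is_ready = any(
        cond.get("type") == "Ready" and cond.get("status") == "True"
        for cond in conditions
    )
    return not is_ready


# Illustrative node objects (made-up data, not taken from a real cluster).
READY_NODE = {
    "spec": {},
    "status": {"conditions": [{"type": "Ready", "status": "True"}]},
}
CORDONED_NODE = {
    "spec": {"unschedulable": True},
    "status": {"conditions": [{"type": "Ready", "status": "True"}]},
}
```

With this shape, a cordoned node is classified as unstable even while it still reports Ready, which is the case that would otherwise produce a false failure during scale-down.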
How
- Add cordoned-node (spec.unschedulable) detection to the unstable node classifier
Changes
Container hooks bump:
- hooks-warmer.yaml: HOOKS_VERSION 0.8.5 → 0.8.7
Integration test container standardization:
- build-image.yaml: Replaced privileged moby/buildkit:v0.29.0 container with non-privileged actions-runner:latest + runtime buildctl install step
- integration-test.yaml.tpl: GPU test jobs (test-gpu-t4, test-gpu-t4-multi, test-gpu-b200-2) switched from nvidia/cuda:12.6.3-base-ubuntu22.04 to actions-runner:latest
- workflow_generator.py: GPU_CONTAINER constant changed from nvidia/cuda:12.6.3-runtime-ubuntu22.04 to actions-runner:latest
Smoke test hardening:
- helpers.py: _is_node_unstable() now treats cordoned nodes (spec.unschedulable) as unstable; new get_all_node_names() helper
- test_logging.py: test_alloy_pods_running now excludes pods on "disappeared" nodes (node no longer in API server) from failure assertions; reports count in error messages
Testing
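To illustrate the test_logging.py change listed above, here is a minimal sketch of excluding pods on "disappeared" nodes from a health assertion. It is not the PR's code: split_unhealthy_pods and failure_message are hypothetical names, pods are plain dicts shaped like the Kubernetes Pod JSON (spec.nodeName, status.phase), and the set of live node names is assumed to come from something like the PR's get_all_node_names() helper, whose signature is not shown in the source.

```python
from typing import Any, Iterable, List, Mapping, Set, Tuple


def split_unhealthy_pods(
    pods: Iterable[Mapping[str, Any]], live_node_names: Set[str]
) -> Tuple[List[str], List[str]]:
    """Return (failures, disappeared) for pods that are not Running.

    A non-running pod whose node is no longer known to the API server is
    reported separately instead of being counted as a failure.
    """
    failures: List[str] = []
    disappeared: List[str] = []
    for pod in pods:
        if (pod.get("status") or {}).get("phase") == "Running":
            continue
        name = pod["metadata"]["name"]
        node_name = (pod.get("spec") or {}).get("nodeName")
        if node_name is not None and node_name not in live_node_names:
            disappeared.append(name)
        else:
            # Unscheduled pods (no nodeName) still count as failures.
            failures.append(name)
    return failures, disappeared


def failure_message(failures: List[str], disappeared: List[str]) -> str:
    """Build an error message that also reports the excluded-pod count."""
    return (
        f"{len(failures)} pod(s) not running: {failures} "
        f"({len(disappeared)} pod(s) on disappeared nodes excluded)"
    )
```

A test in this style would assert that the failures list is empty and pass failure_message(...) as the assertion message, so the excluded count stays visible when the assertion does fail.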