Fix cascading scheduling failures from stale compactor tokens, aggressive drift, and tight timeouts#435

Open
jeanschmidt wants to merge 1 commit into main from jeanschmidt/fix_compactor

Conversation


@jeanschmidt jeanschmidt commented Apr 13, 2026

Impact: All OSDC clusters — runner job scheduling, node compactor availability, Karpenter drift behavior
Risk: low

What

Fixes three independent issues that converged to cause recurring pod scheduling failures on arc-cbr-production (4 incidents in 4 days, ~95 pending jobs, trunk red). Adds alerting so the compactor going offline is detected before it causes capacity loss.

Why

Investigation of #1084 identified a cascading failure chain:

  1. Broken node-compactor — lightkube's Client() reads the projected SA token once at construction and caches it forever. After EKS rotated OIDC signing keys (~12 days into the pod's life), every API call returned 401 Unauthorized. The compactor's burst-absorption mechanism (untainting nodes when Pending pods accumulate) was silently offline for days.
  2. Unbounded Karpenter drift replacement — A disk size change across all 20 NodePools triggered simultaneous NodeClassDrift on every node. With the disruption budget effectively at 100%, Karpenter could cordon and replace all nodes of a given type at once, leaving zero schedulable capacity during demand bursts.
  3. Timeout too short for cold nodes — Fresh nodes require EC2 launch (1-3 min) + git-cache sync (~112s) + cold CUDA image pull (5-15 min for images up to 26.8 GB). Total time-to-ready (18-20 min) exceeded the 15-minute ACTIONS_RUNNER_PREPARE_JOB_TIMEOUT_SECONDS, causing pods to hit backoff timeout and fail.

How

  • Catch 401 at the reconciliation loop level and recreate the Client() to pick up the rotated SA token — matches lightkube's own ExecAuth retry pattern
  • Cap Karpenter disruption budget at 20% so at least 80% of nodes remain schedulable during drift replacement
  • Increase runner prepare-job timeout from 15 min to 25 min to cover worst-case cold-node startup
  • Add a PrometheusRule alert that fires after 15 minutes of continuous compactor reconciliation errors, so silent failures are detected early
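The 401-recovery pattern above can be sketched as follows. This is a minimal, self-contained illustration, not the PR's actual code: `ApiError`, `FakeClient`, and `reconcile_loop` here are stand-ins (the real code uses lightkube's `Client` and `ApiError`), and the fake client simulates the token-caching behavior so the recovery logic runs on its own.

```python
# Stand-in for lightkube.core.exceptions.ApiError (illustrative).
class ApiError(Exception):
    def __init__(self, code):
        super().__init__(f"HTTP {code}")
        self.code = code

class FakeClient:
    """Reads the SA token once at construction, like lightkube's Client."""
    current_token = "token-v1"          # what the kubelet has projected

    def __init__(self):
        self.token = FakeClient.current_token   # cached forever

    def list_nodes(self):
        if self.token != FakeClient.current_token:
            raise ApiError(401)         # stale token -> 401 Unauthorized
        return ["node-a", "node-b"]

def reconcile_loop(client, cycles):
    results = []
    for _ in range(cycles):
        try:
            results.append(client.list_nodes())
        except ApiError as e:
            if e.code == 401:
                # Token was rotated out from under us: rebuild the client
                # so it re-reads the projected token, then continue.
                client = FakeClient()
            # other API errors: log and continue (omitted here)
    return results

client = FakeClient()                    # built while token-v1 is current
FakeClient.current_token = "token-v2"    # simulate OIDC/token rotation
out = reconcile_loop(client, 3)          # cycle 1 hits 401 and recovers
```

The key point is that recovery happens inside the loop: one reconciliation cycle is lost to the 401, the client is rebuilt, and subsequent cycles succeed without restarting the pod.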

Changes

Node compactor — token rotation fix

  • compactor.py: Add ApiError catch before the generic Exception handler in the main loop; on 401, log a warning and recreate the Client(); on other API errors, log and continue
  • test_compactor.py: Add test_main_recreates_client_on_401 verifying the client is reconstructed and the next reconciliation uses the fresh client

Karpenter disruption budget

  • clusters.yaml: Add gpu_disruption_budget: "20%" and cpu_disruption_budget: "20%" to defaults (previous effective default was 100% from the deploy.sh fallback)
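For reference, a 20% budget on a Karpenter NodePool renders roughly as the fragment below. The exact output depends on how the templates consume `gpu_disruption_budget`/`cpu_disruption_budget`; this is only the shape of the resulting spec, not the actual rendered manifest.

```yaml
spec:
  disruption:
    budgets:
      - nodes: "20%"   # at most 20% of nodes disrupted at once;
                       # >= 80% stay schedulable during drift replacement
```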

Runner timeout

  • runner.yaml.tpl: Increase ACTIONS_RUNNER_PREPARE_JOB_TIMEOUT_SECONDS from 900 (15 min) to 1500 (25 min)
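A quick back-of-envelope check of the new value against the worst-case cold-node path described in the Why section (all numbers taken from the PR text):

```python
# Worst-case cold-node time-to-ready vs. the prepare-job timeout.
ec2_launch_s = 3 * 60        # EC2 launch: up to ~3 min
git_cache_s = 112            # git-cache sync: ~112 s
image_pull_s = 15 * 60       # cold CUDA image pull: up to ~15 min

worst_case_s = ec2_launch_s + git_cache_s + image_pull_s   # 1192 s, ~19.9 min
old_timeout_s = 900          # previous ACTIONS_RUNNER_PREPARE_JOB_TIMEOUT_SECONDS
new_timeout_s = 1500         # new value

assert worst_case_s > old_timeout_s   # why 15 min could fail
assert worst_case_s < new_timeout_s   # why 25 min covers it
```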

Compactor health alerting

  • node-compactor-alerts.yaml: New PrometheusRule — NodeCompactorReconcileErrors fires at severity: critical when rate(node_compactor_reconcile_cycles_total{status="error"}[5m]) > 0 persists for 15 minutes
  • kustomization.yaml: Register the new alert resource
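The alert rule described above would look roughly like this as a PrometheusRule. The expression, `for` duration, and severity are from the PR description; the metadata and annotation text are illustrative, not the file's actual contents.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-compactor-alerts   # illustrative name
spec:
  groups:
    - name: node-compactor
      rules:
        - alert: NodeCompactorReconcileErrors
          expr: rate(node_compactor_reconcile_cycles_total{status="error"}[5m]) > 0
          for: 15m
          labels:
            severity: critical
          annotations:
            summary: "node-compactor reconciliation has been failing for 15m"
```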

Notes

  • The 401 fix is a workaround for lightkube's token caching behavior, not a permanent solution. A proper ServiceAccountAuth class that re-reads the token file proactively (before expiry) would be more robust but requires upstream changes or a custom auth wrapper.
  • The disruption budget change applies to all clusters via defaults. Staging already inherits defaults and does not override disruption budgets.
  • The investigation also identified stale manual refresh taints (13 days, 7 nodes) as a contributing factor — that requires an operational just untaint-nodes run, not a code change.

Testing

 $  just smoke arc-staging
Updating kubeconfig for pytorch-arc-staging (us-west-1)...
Updated context pytorch-arc-staging in /Users/jschmidt/.kube/config
Running smoke tests for cluster: arc-staging
Test directories:
  - base/helm/harbor/tests/smoke
  - base/kubernetes/git-cache/tests/smoke
  - base/kubernetes/image-cache-janitor/tests/smoke
  - base/kubernetes/tests/smoke
  - base/node-compactor/tests/smoke
  - modules/eks/tests/smoke
  - modules/karpenter/tests/smoke
  - modules/arc/tests/smoke
  - modules/nodepools/tests/smoke
  - modules/arc-runners/tests/smoke
  - modules/buildkit/tests/smoke
  - modules/pypi-cache/tests/smoke
  - modules/cache-enforcer/tests/smoke
  - modules/monitoring/tests/smoke
  - modules/logging/tests/smoke

================================================================================================================================== test session starts ==================================================================================================================================
platform darwin -- Python 3.12.12, pytest-9.0.2, pluggy-1.6.0
rootdir: /Users/jschmidt/meta/ciforge/osdc/upstream/osdc
configfile: pyproject.toml
plugins: anyio-4.12.1, xdist-3.8.0, cov-7.0.0
16 workers [195 items]
................................................................................................................................................................s..................................                                                                               [100%]
================================================================================================================================ short test summary info ================================================================================================================================
SKIPPED [1] modules/monitoring/tests/smoke/test_monitoring.py:173: No dcgm-exporter pods found (no GPU nodes)
====================================================================================================================== 194 passed, 1 skipped in 105.52s (0:01:45) =======================================================================================================================

Smoke tests completed in 1m46s
 $  just integration-test arc-staging
Updating kubeconfig for pytorch-arc-staging (us-west-1)...
Updated context pytorch-arc-staging in /Users/jschmidt/.kube/config
20:50:04 [INFO] Integration test for cluster: arc-staging (pytorch-arc-staging)
20:50:04 [INFO]   Runner prefix: 'c-mt-'
20:50:04 [INFO]   B200 enabled: False
20:50:04 [INFO]   Release runners: True
20:50:04 [INFO]   Cache enforcer: True
20:50:04 [INFO]   PyPI cache slugs: cpu cu126 cu128 cu130
20:50:04 [INFO]   Smoke tests: skip
20:50:04 [INFO]   Compactor tests: skip
20:50:04 [INFO]   Branch: osdc-integration-test-arc-staging
20:50:04 [INFO] Phase 0: Cleaning up stale PRs...
20:50:07 [INFO] Phase 1: Checking for active runner pods (arc-staging only)...
20:50:10 [INFO]   No runner pods active. Skipping pool clear.
20:50:11 [INFO]   Canary repo already cloned at /Users/jschmidt/meta/ciforge/osdc/upstream/osdc/.scratch/pytorch-canary, fetching...
20:50:12 [INFO] Phase 2: Preparing PR...
20:50:18 [INFO]   PR #412 created: https://github.com/pytorch/pytorch-canary/pull/412
20:50:18 [INFO] Phase 3: Running parallel validation...
20:50:18 [INFO] Phase 4: Waiting for PR workflow runs (timeout: 50 min, buffer: 10 min)...
20:50:18 [INFO]   Filtering to runs created after 2026-04-13T03:50:12.346241+00:00
20:50:20 [INFO]   No runs found yet, waiting...
20:50:52 [INFO]   Run: OSDC Integration Test — https://github.com/pytorch/pytorch-canary/actions/runs/24324789847
20:50:52 [INFO]   1/1 runs still in progress...
20:51:25 [INFO]   1/1 runs still in progress...
20:51:58 [INFO]   1/1 runs still in progress...
20:52:29 [INFO]   1/1 runs still in progress...
20:53:02 [INFO]   1/1 runs still in progress...
20:53:33 [INFO]   1/1 runs still in progress...
20:54:06 [INFO]   1/1 runs still in progress...
20:54:37 [INFO]   1/1 runs still in progress...
20:55:09 [INFO]   1/1 runs still in progress...
20:55:40 [INFO]   1/1 runs still in progress...
20:56:13 [INFO]   All 1 run(s) completed.


============================================================
  OSDC Integration Test Results
============================================================
  Cluster: arc-staging (pytorch-arc-staging)
  Date:    2026-04-13 03:56 UTC

  PR Workflow Jobs:
    ✓ test-pypi-cache-action-cuda    success
    ✓ test-pypi-cache-action-cpu     success
    ✓ test-cpu-x86-avx512            success
    ✓ test-cpu-arm64                 success
    ✓ test-git-cache                 success
    ✓ test-cpu-x86-amx               success
    ✓ test-pypi-cache-defaults       success
    ✓ test-gpu-t4                    success
    ✓ test-gpu-t4-multi              success
    ✓ test-harbor                    success
    ✓ test-release-arm64             success
    ✓ test-cache-enforcer            success
    ✓ build-arm64 / build            success
    ✓ build-amd64 / build            success

  Smoke            ⊘ SKIPPED
  Compactor        ⊘ SKIPPED

  Overall: PASSED
============================================================

20:56:15 [INFO] Phase 5: Closing PR #412...
20:56:17 [INFO] Total integration test time: 6m13s

Signed-off-by: Jean Schmidt <contato@jschmidt.me>
