Fix cascading scheduling failures from stale compactor tokens, aggressive drift, and tight timeouts#435

Open
jeanschmidt wants to merge 1 commit into main from jeanschmidt/fix_compactor

Conversation


@jeanschmidt jeanschmidt commented Apr 13, 2026

Impact: All OSDC clusters — runner job scheduling, node compactor availability, Karpenter drift behavior
Risk: low

What

Fixes three independent issues that converged to cause recurring pod scheduling failures on arc-cbr-production (4 incidents in 4 days, ~95 pending jobs, trunk red). Adds alerting so the compactor going offline is detected before it causes capacity loss.

Why

Investigation of #1084 identified a cascading failure chain:

  1. Broken node-compactor — lightkube's Client() reads the projected SA token once at construction and caches it forever. After EKS rotated OIDC signing keys (~12 days into the pod's life), every API call returned 401 Unauthorized. The compactor's burst-absorption mechanism (untainting nodes when Pending pods accumulate) was silently offline for days.
  2. Unbounded Karpenter drift replacement — A disk size change across all 20 NodePools triggered simultaneous NodeClassDrift on every node. With the disruption budget effectively at 100%, Karpenter could cordon and replace all nodes of a given type at once, leaving zero schedulable capacity during demand bursts.
  3. Timeout too short for cold nodes — Fresh nodes require EC2 launch (1-3 min) + git-cache sync (~112s) + cold CUDA image pull (5-15 min for images up to 26.8 GB). Total time-to-ready (18-20 min) exceeded the 15-minute ACTIONS_RUNNER_PREPARE_JOB_TIMEOUT_SECONDS, causing pods to hit backoff timeout and fail.

How

  • Catch 401 at the reconciliation loop level and recreate the Client() to pick up the rotated SA token — matches lightkube's own ExecAuth retry pattern
  • Cap Karpenter disruption budget at 20% so at least 80% of nodes remain schedulable during drift replacement
  • Increase runner prepare-job timeout from 15 min to 25 min to cover worst-case cold-node startup
  • Add a PrometheusRule alert that fires after 15 minutes of continuous compactor reconciliation errors, so silent failures are detected early
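The 401-recovery pattern above can be sketched as follows. This is a minimal, self-contained illustration, not the PR's actual code: `ApiError`, `FakeClient`, and `reconcile_loop` here are stand-ins (the real code uses lightkube's `Client` and `ApiError`), and the fake client simulates the token-caching behavior so the recovery logic runs on its own.

```python
# Stand-in for lightkube.core.exceptions.ApiError (illustrative).
class ApiError(Exception):
    def __init__(self, code):
        super().__init__(f"HTTP {code}")
        self.code = code

class FakeClient:
    """Reads the SA token once at construction, like lightkube's Client."""
    current_token = "token-v1"          # what the kubelet has projected

    def __init__(self):
        self.token = FakeClient.current_token   # cached forever

    def list_nodes(self):
        if self.token != FakeClient.current_token:
            raise ApiError(401)         # stale token -> 401 Unauthorized
        return ["node-a", "node-b"]

def reconcile_loop(client, cycles):
    results = []
    for _ in range(cycles):
        try:
            results.append(client.list_nodes())
        except ApiError as e:
            if e.code == 401:
                # Token was rotated out from under us: rebuild the client
                # so it re-reads the projected token, then continue.
                client = FakeClient()
            # other API errors: log and continue (omitted here)
    return results

client = FakeClient()                    # built while token-v1 is current
FakeClient.current_token = "token-v2"    # simulate OIDC/token rotation
out = reconcile_loop(client, 3)          # cycle 1 hits 401 and recovers
```

The key point is that recovery happens inside the loop: one reconciliation cycle is lost to the 401, the client is rebuilt, and subsequent cycles succeed without restarting the pod.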

Changes

Node compactor — token rotation fix

  • compactor.py: Add ApiError catch before the generic Exception handler in the main loop; on 401, log a warning and recreate the Client(); on other API errors, log and continue
  • test_compactor.py: Add test_main_recreates_client_on_401 verifying the client is reconstructed and the next reconciliation uses the fresh client

Karpenter disruption budget

  • clusters.yaml: Add gpu_disruption_budget: "20%" and cpu_disruption_budget: "20%" to defaults (previous effective default was 100% from the deploy.sh fallback)
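For reference, a 20% budget on a Karpenter NodePool renders roughly as the fragment below. The exact output depends on how the templates consume `gpu_disruption_budget`/`cpu_disruption_budget`; this is only the shape of the resulting spec, not the actual rendered manifest.

```yaml
spec:
  disruption:
    budgets:
      - nodes: "20%"   # at most 20% of nodes disrupted at once;
                       # >= 80% stay schedulable during drift replacement
```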

Runner timeout

  • runner.yaml.tpl: Increase ACTIONS_RUNNER_PREPARE_JOB_TIMEOUT_SECONDS from 900 (15 min) to 1500 (25 min)
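A quick back-of-envelope check of the new value against the worst-case cold-node path described in the Why section (all numbers taken from the PR text):

```python
# Worst-case cold-node time-to-ready vs. the prepare-job timeout.
ec2_launch_s = 3 * 60        # EC2 launch: up to ~3 min
git_cache_s = 112            # git-cache sync: ~112 s
image_pull_s = 15 * 60       # cold CUDA image pull: up to ~15 min

worst_case_s = ec2_launch_s + git_cache_s + image_pull_s   # 1192 s, ~19.9 min
old_timeout_s = 900          # previous ACTIONS_RUNNER_PREPARE_JOB_TIMEOUT_SECONDS
new_timeout_s = 1500         # new value

assert worst_case_s > old_timeout_s   # why 15 min could fail
assert worst_case_s < new_timeout_s   # why 25 min covers it
```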

Compactor health alerting

  • node-compactor-alerts.yaml: New PrometheusRule — NodeCompactorReconcileErrors fires at severity: critical when rate(node_compactor_reconcile_cycles_total{status="error"}[5m]) > 0 persists for 15 minutes
  • kustomization.yaml: Register the new alert resource
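The alert rule described above would look roughly like this as a PrometheusRule. The expression, `for` duration, and severity are from the PR description; the metadata and annotation text are illustrative, not the file's actual contents.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-compactor-alerts   # illustrative name
spec:
  groups:
    - name: node-compactor
      rules:
        - alert: NodeCompactorReconcileErrors
          expr: rate(node_compactor_reconcile_cycles_total{status="error"}[5m]) > 0
          for: 15m
          labels:
            severity: critical
          annotations:
            summary: "node-compactor reconciliation has been failing for 15m"
```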

Notes

  • The 401 fix is a workaround for lightkube's token caching behavior, not a permanent solution. A proper ServiceAccountAuth class that re-reads the token file proactively (before expiry) would be more robust but requires upstream changes or a custom auth wrapper.
  • The disruption budget change applies to all clusters via defaults. Staging already inherits defaults and does not override disruption budgets.
  • The investigation also identified stale manual refresh taints (13 days, 7 nodes) as a contributing factor — that requires an operational just untaint-nodes run, not a code change.

Testing

 $  just smoke arc-staging
Updating kubeconfig for pytorch-arc-staging (us-west-1)...
Updated context pytorch-arc-staging in /Users/jschmidt/.kube/config
Running smoke tests for cluster: arc-staging
Test directories:
  - base/helm/harbor/tests/smoke
  - base/kubernetes/git-cache/tests/smoke
  - base/kubernetes/image-cache-janitor/tests/smoke
  - base/kubernetes/tests/smoke
  - base/node-compactor/tests/smoke
  - modules/eks/tests/smoke
  - modules/karpenter/tests/smoke
  - modules/arc/tests/smoke
  - modules/nodepools/tests/smoke
  - modules/arc-runners/tests/smoke
  - modules/buildkit/tests/smoke
  - modules/pypi-cache/tests/smoke
  - modules/cache-enforcer/tests/smoke
  - modules/monitoring/tests/smoke
  - modules/logging/tests/smoke

================================================================================================================================== test session starts ==================================================================================================================================
platform darwin -- Python 3.12.12, pytest-9.0.2, pluggy-1.6.0
rootdir: /Users/jschmidt/meta/ciforge/osdc/upstream/osdc
configfile: pyproject.toml
plugins: anyio-4.12.1, xdist-3.8.0, cov-7.0.0
16 workers [195 items]
................................................................................................................................................................s..................................                                                                               [100%]
================================================================================================================================ short test summary info ================================================================================================================================
SKIPPED [1] modules/monitoring/tests/smoke/test_monitoring.py:173: No dcgm-exporter pods found (no GPU nodes)
====================================================================================================================== 194 passed, 1 skipped in 105.52s (0:01:45) =======================================================================================================================

Smoke tests completed in 1m46s
 $  just integration-test arc-staging
Updating kubeconfig for pytorch-arc-staging (us-west-1)...
Updated context pytorch-arc-staging in /Users/jschmidt/.kube/config
20:50:04 [INFO] Integration test for cluster: arc-staging (pytorch-arc-staging)
20:50:04 [INFO]   Runner prefix: 'c-mt-'
20:50:04 [INFO]   B200 enabled: False
20:50:04 [INFO]   Release runners: True
20:50:04 [INFO]   Cache enforcer: True
20:50:04 [INFO]   PyPI cache slugs: cpu cu126 cu128 cu130
20:50:04 [INFO]   Smoke tests: skip
20:50:04 [INFO]   Compactor tests: skip
20:50:04 [INFO]   Branch: osdc-integration-test-arc-staging
20:50:04 [INFO] Phase 0: Cleaning up stale PRs...
20:50:07 [INFO] Phase 1: Checking for active runner pods (arc-staging only)...
20:50:10 [INFO]   No runner pods active. Skipping pool clear.
20:50:11 [INFO]   Canary repo already cloned at /Users/jschmidt/meta/ciforge/osdc/upstream/osdc/.scratch/pytorch-canary, fetching...
20:50:12 [INFO] Phase 2: Preparing PR...
20:50:18 [INFO]   PR #412 created: https://github.com/pytorch/pytorch-canary/pull/412
20:50:18 [INFO] Phase 3: Running parallel validation...
20:50:18 [INFO] Phase 4: Waiting for PR workflow runs (timeout: 50 min, buffer: 10 min)...
20:50:18 [INFO]   Filtering to runs created after 2026-04-13T03:50:12.346241+00:00
20:50:20 [INFO]   No runs found yet, waiting...
20:50:52 [INFO]   Run: OSDC Integration Test — https://github.com/pytorch/pytorch-canary/actions/runs/24324789847
20:50:52 [INFO]   1/1 runs still in progress...
20:51:25 [INFO]   1/1 runs still in progress...
20:51:58 [INFO]   1/1 runs still in progress...
20:52:29 [INFO]   1/1 runs still in progress...
20:53:02 [INFO]   1/1 runs still in progress...
20:53:33 [INFO]   1/1 runs still in progress...
20:54:06 [INFO]   1/1 runs still in progress...
20:54:37 [INFO]   1/1 runs still in progress...
20:55:09 [INFO]   1/1 runs still in progress...
20:55:40 [INFO]   1/1 runs still in progress...
20:56:13 [INFO]   All 1 run(s) completed.


============================================================
  OSDC Integration Test Results
============================================================
  Cluster: arc-staging (pytorch-arc-staging)
  Date:    2026-04-13 03:56 UTC

  PR Workflow Jobs:
    ✓ test-pypi-cache-action-cuda    success
    ✓ test-pypi-cache-action-cpu     success
    ✓ test-cpu-x86-avx512            success
    ✓ test-cpu-arm64                 success
    ✓ test-git-cache                 success
    ✓ test-cpu-x86-amx               success
    ✓ test-pypi-cache-defaults       success
    ✓ test-gpu-t4                    success
    ✓ test-gpu-t4-multi              success
    ✓ test-harbor                    success
    ✓ test-release-arm64             success
    ✓ test-cache-enforcer            success
    ✓ build-arm64 / build            success
    ✓ build-amd64 / build            success

  Smoke            ⊘ SKIPPED
  Compactor        ⊘ SKIPPED

  Overall: PASSED
============================================================

20:56:15 [INFO] Phase 5: Closing PR #412...
20:56:17 [INFO] Total integration test time: 6m13s

Signed-off-by: Jean Schmidt <contato@jschmidt.me>
