Fix cascading scheduling failures from stale compactor tokens, aggressive drift, and tight timeouts #435
Open
jeanschmidt wants to merge 1 commit into main from
Signed-off-by: Jean Schmidt <contato@jschmidt.me>
**Impact:** All OSDC clusters — runner job scheduling, node compactor availability, Karpenter drift behavior
**Risk:** Low
## What
Fixes three independent issues that converged to cause recurring pod scheduling failures on `arc-cbr-production` (4 incidents in 4 days, ~95 pending jobs, trunk red). Adds alerting so the compactor going offline is detected before it causes capacity loss.
## Why

Investigation of #1084 identified a cascading failure chain:
- `Client()` reads the projected SA token once at construction and caches it forever. After EKS rotated the OIDC signing keys (~12 days into the pod's life), every API call returned 401 Unauthorized. The compactor's burst-absorption mechanism (untainting nodes when Pending pods accumulate) was silently offline for days.
- `NodeClassDrift` on every node. With the disruption budget effectively at 100%, Karpenter could cordon and replace all nodes of a given type at once, leaving zero schedulable capacity during demand bursts.
- `ACTIONS_RUNNER_PREPARE_JOB_TIMEOUT_SECONDS`, causing pods to hit the backoff timeout and fail.
## How

- Recreate the `Client()` to pick up the rotated SA token — matches lightkube's own `ExecAuth` retry pattern
## Changes

### Node compactor — token rotation fix
- `compactor.py`: Add an `ApiError` catch before the generic `Exception` handler in the main loop; on 401, log a warning and recreate the `Client()`; on other API errors, log and continue
- `test_compactor.py`: Add `test_main_recreates_client_on_401`, verifying the client is reconstructed and the next reconciliation uses the fresh client
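The catch-and-recreate flow described above can be sketched as follows. This is an illustrative stand-in, not the actual `compactor.py` code: the `ApiError`/`Client` stubs mimic lightkube's classes, and the `run_once` helper name is an assumption.

```python
# Sketch of the 401-recovery pattern (assumed shape; the real loop uses
# lightkube.Client and lightkube.core.exceptions.ApiError).

class ApiError(Exception):
    """Stand-in for lightkube.core.exceptions.ApiError."""
    def __init__(self, status_code):
        super().__init__(f"API error {status_code}")
        self.status = type("Status", (), {"code": status_code})()

class Client:
    """Stand-in for lightkube.Client, which reads the projected SA token
    once at construction and caches it for the lifetime of the object."""

def run_once(client, reconcile):
    """Run one reconciliation cycle and return the client to use next cycle:
    a fresh Client after a 401 (stale cached token), the same one otherwise."""
    try:
        reconcile(client)
    except ApiError as err:
        if err.status.code == 401:
            # The SA token was rotated out from under us: rebuilding the
            # client re-reads the projected token file from disk.
            return Client()
        # Other API errors: the real loop logs and continues.
    return client
```

The key property — also what `test_main_recreates_client_on_401` checks in the real test suite — is that only a 401 triggers reconstruction; other errors keep the existing client.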
### Karpenter disruption budget

- `clusters.yaml`: Add `gpu_disruption_budget: "20%"` and `cpu_disruption_budget: "20%"` to the defaults (the previous effective default was `100%`, from the deploy.sh fallback)
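For context, a value like `gpu_disruption_budget: "20%"` presumably ends up in the NodePool's disruption block. In Karpenter's API a 20% cap looks roughly like this (field names are from Karpenter's NodePool schema; the mapping from `clusters.yaml` and the pool name are assumptions):

```yaml
# Sketch (assumption): how a 20% budget renders into a Karpenter NodePool.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu  # illustrative name
spec:
  disruption:
    budgets:
      - nodes: "20%"  # at most 20% of this pool's nodes may be disrupted at once
```

With this cap, drift on every node no longer lets Karpenter cordon and replace an entire node type simultaneously.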
### Runner timeout

- `runner.yaml.tpl`: Increase `ACTIONS_RUNNER_PREPARE_JOB_TIMEOUT_SECONDS` from `900` (15 min) to `1500` (25 min)
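The template change itself is just an env value bump; in the rendered runner pod spec it presumably looks like this (the surrounding structure of `runner.yaml.tpl` is an assumption):

```yaml
# Sketch (assumption): the rendered env entry after this change.
env:
  - name: ACTIONS_RUNNER_PREPARE_JOB_TIMEOUT_SECONDS
    value: "1500"  # was "900" (15 min); now 25 min
```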
### Compactor health alerting

- `node-compactor-alerts.yaml`: New PrometheusRule — `NodeCompactorReconcileErrors` fires at `severity: critical` when `rate(node_compactor_reconcile_cycles_total{status="error"}[5m]) > 0` persists for 15 minutes
- `kustomization.yaml`: Register the new alert resource
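Based on the alert name, expression, and duration above, the new rule presumably looks something like this (metadata and annotations beyond what the bullets state are assumptions):

```yaml
# Sketch (assumption): the PrometheusRule assembled from the details above.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-compactor-alerts  # assumed to match the filename
spec:
  groups:
    - name: node-compactor
      rules:
        - alert: NodeCompactorReconcileErrors
          expr: rate(node_compactor_reconcile_cycles_total{status="error"}[5m]) > 0
          for: 15m
          labels:
            severity: critical
          annotations:
            summary: Node compactor reconcile loop is erroring
```

Because the compactor failed silently for days, the rule alerts on any sustained reconcile-error rate rather than a threshold.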
## Notes

- A `ServiceAccountAuth` class that re-reads the token file proactively (before expiry) would be more robust, but requires upstream changes or a custom auth wrapper.
- `just untaint-nodes` run, not a code change.

## Testing