kubernetes: swallow log-acquisition errors to prevent false SYSTEM_ERROR #279
Closed
morgan-wowk wants to merge 1 commit into
When the Kubernetes API returns a malformed response (truncated body, broken UTF-8, broken JSON), the kubernetes client throws inside read_namespaced_pod_log before returning any data. That exception propagates into the orchestrator's _retry wrapper (max 5 attempts), which, after exhausting retries, marks the execution SYSTEM_ERROR and skips all downstream tasks, even though the training job itself completed successfully.

Catch all non-ApiException errors in:
- LaunchedKubernetesContainer.get_log (single-pod path)
- LaunchedKubernetesJob._get_log_by_pod_key (multi-pod path)

On failure, log a warning with the full traceback (observable in Observe) and return an empty string / None. The execution proceeds to its correct terminal state (SUCCEEDED or FAILED) with an empty log rather than being incorrectly promoted to SYSTEM_ERROR.

The existing ApiException / "Bad Request" handling for PodInitializing is preserved and unaffected.
We chose a different solution, as indicated on issue #281.

Problem
When the Kubernetes API returns a malformed response (truncated body, broken UTF-8, broken JSON), the Kubernetes client throws inside read_namespaced_pod_log before returning any data. That exception surfaces to the orchestrator's _retry wrapper (max_retries=5). After five consecutive failures the wrapper re-raises, and the orchestrator records a SYSTEM_ERROR and skips all downstream tasks, even though the training job itself completed successfully.

The 5-retry pattern has been observed in production: the K8s API served the same malformed response on all five attempts before the retry budget was exhausted.
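The failure mode can be reproduced in miniature. The sketch below is an illustration under assumptions, not the orchestrator's actual code: `call_with_retries`, `read_malformed_log`, and `final_status` are hypothetical names, and the real _retry wrapper may differ in details such as backoff.

```python
calls = {"count": 0}


def call_with_retries(func, max_retries=5):
    """Call func up to max_retries times and re-raise the last failure."""
    for attempt in range(1, max_retries + 1):
        try:
            return func()
        except Exception:
            if attempt == max_retries:
                raise


def read_malformed_log():
    # Simulates an API that serves the same broken response on every attempt.
    calls["count"] += 1
    raise ValueError("malformed response body")


def final_status(read_log):
    """Map the outcome of the log read onto a terminal status."""
    try:
        call_with_retries(read_log)
    except Exception:
        # The retry budget is exhausted, so the run is reported as a system
        # error regardless of how the job itself ended.
        return "SYSTEM_ERROR"
    return "SUCCEEDED"


status = final_status(read_malformed_log)
```

Here `status` ends up as `"SYSTEM_ERROR"` after exactly five calls, which mirrors the production pattern described above: the job's own outcome never enters the decision.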
Fix
Catch all non-ApiException errors in the two log-read entry points:
- LaunchedKubernetesContainer.get_log (single-pod path)
- LaunchedKubernetesJob._get_log_by_pod_key (multi-pod path)

On failure, emit a WARNING log with the full traceback (observable in Observe) and return an empty string / None. The execution reaches its correct terminal state (SUCCEEDED or FAILED) with an empty log, rather than being incorrectly promoted to SYSTEM_ERROR.

The existing ApiException / "Bad Request" handling for PodInitializing is preserved.

Trade-off
This is a deliberate lossy choice: a log-fetch failure produces a missing log rather than a run failure. The trade-off is:
[…] SYSTEM_ERROR on healthy runs; downstream tasks are no longer skipped; retry budget is preserved for real failures.

Context: three approaches under review
[…] errors="replace")?
[…] TQDM_DISABLE=1)

#277 and #278 address the specific torn-UTF-8 failure class. This PR addresses the broader class of any malformed K8s API response, including cases where the API repeatedly returns broken data that no decode strategy can recover from.
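The difference between the two failure classes can be shown with the standard library alone. This is an illustrative sketch with made-up payloads, not code from any of the PRs: a torn UTF-8 sequence is recoverable with lenient decoding, while a truncated JSON document stays unparseable however it is decoded.

```python
import json

# A UTF-8 body cut in the middle of a two-byte character (0xC3 0xA9).
torn_body = b"caf\xc3\xa9"[:-1]

try:
    torn_body.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# Lenient decoding substitutes U+FFFD for the incomplete sequence.
lenient = torn_body.decode("utf-8", errors="replace")

# A truncated JSON document decodes cleanly as text but still cannot be parsed.
truncated_json = b'{"log": "training fini'
try:
    json.loads(truncated_json.decode("utf-8", errors="replace"))
    json_failed = False
except json.JSONDecodeError:
    json_failed = True
```

Strict decoding fails on the torn body, the lenient decode yields `"caf\ufffd"`, and the truncated JSON still raises `json.JSONDecodeError`, which is the kind of case a decode-only fix would not cover.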