kubernetes: swallow log-acquisition errors to prevent false SYSTEM_ERROR #279

Closed
morgan-wowk wants to merge 1 commit into master from fix/swallow-log-acquisition-errors

@morgan-wowk (Collaborator)

Problem

When the Kubernetes API returns a malformed response — truncated body, broken UTF-8, broken JSON — the Kubernetes client throws inside read_namespaced_pod_log before returning any data. That exception surfaces to the orchestrator's _retry wrapper (max_retries=5). After five consecutive failures the wrapper re-raises, and the orchestrator records a SYSTEM_ERROR and skips all downstream tasks — even though the training job itself completed successfully.

The 5-retry pattern has been observed in production: the K8s API served the same malformed response on all five attempts before the retry budget was exhausted.
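The failure mode above can be sketched as follows. This is a minimal illustration, not the orchestrator's code: `retry`, `fetch_log`, and `run_once` are hypothetical stand-ins, and the torn UTF-8 payload is one assumed example of a malformed response. The point is that a deterministic decode failure burns every attempt and the healthy run is then reported as SYSTEM_ERROR.

```python
import functools


def retry(max_retries=5):
    """Hypothetical stand-in for the orchestrator's _retry wrapper."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            last_exc = None
            for _ in range(max_retries):
                try:
                    return fn(*args, **kwargs)
                except Exception as exc:
                    # Remember the failure and try again.
                    last_exc = exc
            # Retry budget exhausted: re-raise the last failure.
            raise last_exc
        return wrapper
    return decorator


attempts = []


@retry(max_retries=5)
def fetch_log():
    """Hypothetical log fetch that receives the same torn bytes every time."""
    attempts.append(1)
    # b"\xe2\x96" is the first two bytes of a three-byte UTF-8 character,
    # so strict decoding raises UnicodeDecodeError on every attempt.
    return b"step 1 \xe2\x96".decode("utf-8")


def run_once():
    """Hypothetical orchestrator step: any escaped exception becomes SYSTEM_ERROR."""
    try:
        fetch_log()
        return "SUCCEEDED"
    except Exception:
        return "SYSTEM_ERROR"


status = run_once()
print(status, len(attempts))
```

Because the response is identical on each attempt, retrying cannot help; the retries only delay the point at which the error escapes.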

Fix

Catch all non-ApiException errors in the two log-read entry points:

  • LaunchedKubernetesContainer.get_log — single-pod path
  • LaunchedKubernetesJob._get_log_by_pod_key — multi-pod path

On failure, emit a WARNING log with the full traceback (visible in Observe) and return an empty string or None. The execution reaches its correct terminal state (SUCCEEDED or FAILED) with an empty log, rather than being incorrectly promoted to SYSTEM_ERROR.

The existing ApiException / "Bad Request" handling for PodInitializing is preserved.
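A rough sketch of the pattern described above, under stated assumptions: `get_log_safely` and the `read_log` callable are hypothetical names, and `ApiException` here is a local stand-in class so the snippet runs without the Kubernetes client installed. The existing "Bad Request" handling is not reproduced; it is represented only by letting `ApiException` propagate untouched.

```python
import logging

logger = logging.getLogger("log_acquisition_sketch")


class ApiException(Exception):
    """Local stand-in for the Kubernetes client's ApiException (assumption)."""

    def __init__(self, status=None, reason=None):
        super().__init__(f"({status}) {reason}")
        self.status = status
        self.reason = reason


def get_log_safely(read_log):
    """Return the text from read_log(), or "" if log acquisition fails.

    read_log is a hypothetical zero-argument callable that wraps the real
    pod-log read.
    """
    try:
        return read_log()
    except ApiException:
        # Left to the existing ApiException handling, which is unchanged.
        raise
    except Exception:
        # Any other failure: warn with the full traceback, return an empty log.
        logger.warning("Log acquisition failed; returning empty log", exc_info=True)
        return ""


def torn_read():
    # Simulates a malformed response: a truncated multi-byte UTF-8 sequence.
    return b"epoch 1 \xe2\x96".decode("utf-8")


def api_error_read():
    raise ApiException(status=400, reason="Bad Request")


empty_log = get_log_safely(torn_read)

try:
    get_log_safely(api_error_read)
    api_propagated = False
except ApiException:
    api_propagated = True
```

Ordering the `except ApiException` clause first is what keeps the broad `except Exception` from also absorbing API errors.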

Trade-off

This is a deliberately lossy choice: a log-fetch failure produces a missing log rather than a run failure. The trade-off is:

  • Upside: eliminates false SYSTEM_ERROR on healthy runs; downstream tasks are no longer skipped; retry budget is preserved for real failures.
  • Downside: when logs are missing the user has no in-product record of the run's output. The warning in Observe is the only signal that a log was lost.

Context — three approaches under review

| PR | Approach | Logs preserved? | Run outcome |
|---|---|---|---|
| #277 | Defensive decode (errors="replace") | Yes (torn glyphs replaced with ?) | Run completes, logs uploaded |
| #278 | Suppress tqdm at source (TQDM_DISABLE=1) | Yes (no non-ASCII bytes emitted) | Run completes, logs uploaded |
| This PR | Swallow log-acquisition errors | No (empty log on fetch failure) | Run completes, logs empty |

#277 and #278 address the specific torn-UTF-8 failure class. This PR addresses the broader class of any malformed K8s API response, including cases where the API repeatedly returns broken data that no decode strategy can recover from.
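For comparison, the defensive-decode idea attributed to #277 can be illustrated with plain Python. This is a sketch of the general technique, not the code in that PR: the sample bytes are invented, and where exactly the decode happens in the real log path is not shown here. Python's `errors="replace"` handler substitutes U+FFFD for undecodable bytes instead of raising.

```python
# A progress-bar style line whose final multi-byte character is cut short,
# as might happen when a response body is truncated mid-character.
raw = "loss 0.12 \u2588".encode("utf-8")[:-1]  # drop the last byte

# Strict decoding (the default) raises on the torn sequence.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# Defensive decoding keeps the readable prefix and marks the damage.
recovered = raw.decode("utf-8", errors="replace")
print(strict_failed, repr(recovered))
```

This recovers the log text when the bytes arrive but are torn. It does not help when the client fails before any bytes are returned, which is the broader case this PR targets.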

When the Kubernetes API returns a malformed response (truncated body,
broken UTF-8, broken JSON), the kubernetes client throws inside
read_namespaced_pod_log before returning any data. That exception
propagates into the orchestrator's _retry wrapper (max 5 attempts),
which — after exhausting retries — marks the execution SYSTEM_ERROR and
skips all downstream tasks, even though the training job itself completed
successfully.

Catch all non-ApiException errors in:
  - LaunchedKubernetesContainer.get_log  (single-pod path)
  - LaunchedKubernetesJob._get_log_by_pod_key  (multi-pod path)

On failure, log a warning with the full traceback (observable in Observe)
and return an empty string / None. The execution proceeds to its correct
terminal state (SUCCEEDED or FAILED) with an empty log rather than being
incorrectly promoted to SYSTEM_ERROR.

The existing ApiException / "Bad Request" handling for PodInitializing
is preserved and unaffected.

@morgan-wowk (Collaborator, Author)

We chose a different solution, as indicated in issue #281.
