Skip to content

UnicodeDecodeError in pod log reading marks healthy GPU training runs as SYSTEM_ERROR #281

@morgan-wowk

Description

@morgan-wowk

Root cause

When a Kubernetes pod completes a GPU training run with tqdm progress bars, the pod log contains multi-byte UTF-8 block glyphs (█▉▊▋▌▍▎▏, each 3 bytes: E2 96 8x). Two distinct mechanisms produce torn byte sequences in the raw log bytes that the Kubernetes API returns:

  1. Concurrent writer interleaving — 64 worker processes share a single stderr fd. Python's TextIOWrapper can split one logical write() call across multiple write() syscalls. When two processes' writes interleave at byte granularity, the receiver sees an orphaned continuation byte (invalid continuation byte).

  2. Response truncation at a chunk boundary — the K8s API response is split at a size boundary (~5.1 MB observed), cutting a 3-byte glyph after its first byte (unexpected end of data at 0xe2).

The Kubernetes Python client reads the response with r.data.decode('utf8') (strict, no error handling). Both variants throw UnicodeDecodeError before the run has any real failure. The exception bubbles through the _retry wrapper in internal_process_one_running_execution and, after 5 retries, marks the execution SYSTEM_ERROR — causing downstream Upload steps to be skipped on an otherwise successful training run.

This is responsible for 3,960 Bugsnag occurrences in the past 7 days.

Neither the Linux kernel nor the Kubernetes client is at fault. The kernel's PIPE_BUF atomic write guarantee applies only to writes ≤ 4096 bytes, and Python's TextIOWrapper may issue multiple syscalls for one logical write. The corruption is baked into the bytes before the K8s client sees them, so the client cannot fix it in a lossless way.

Solutions compared

Three approaches were kept as separate PRs for team comparison:

PR #277 — Defensive decode (_preload_content=False + errors="replace")
Bypass the K8s client's strict decode by using _preload_content=False, then decode ourselves with errors="replace" so U+FFFD is substituted for any malformed byte sequence. A _logger.warning fires when substitution occurs.

  • Pro: handles both error classes; logs remain complete (only the torn glyph is replaced)
  • Con: lossy — the torn bytes (partial tqdm glyph) are replaced rather than recovered

PR #278 — Suppress tqdm output at source
Inject TQDM_DISABLE=1 (and related env vars) into the pod's environment so block glyphs are never written to stderr in the first place.

  • Pro: eliminates the root cause entirely for tqdm-based progress bars
  • Con: only covers tqdm; any other multi-byte output from training code or libraries would still tear

PR #279 / #280 — Swallow log-acquisition errors + Observe search link
Catch all exceptions from log reading, return an empty string / None, and embed a pre-built Observe log search URL in the placeholder message so the user can still find the pod logs.

  • Pro: execution is never incorrectly marked SYSTEM_ERROR; user has a direct link to raw logs
  • Con: the execution log stored in GCS is lost; relies on Observe retention

Fix landed

PR #277_preload_content=False + errors="replace" was selected. It addresses both UnicodeDecodeError variants, keeps the execution log intact (minus the torn glyph bytes), and does not require changes to the training environment.

Fixes #277

Metadata

Metadata

Assignees

Type

No fields configured for Bug.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions