UnicodeDecodeError in pod log reading marks healthy GPU training runs as SYSTEM_ERROR

## Root cause

When a Kubernetes pod completes a GPU training run with tqdm progress bars, the pod log contains multi-byte UTF-8 block glyphs (`█▉▊▋▌▍▎▏`, each 3 bytes: `E2 96 8x`). Two distinct mechanisms produce torn byte sequences in the raw log bytes that the Kubernetes API returns:

1. **Concurrent writer interleaving** — 64 worker processes share a single stderr fd. Python's `TextIOWrapper` can split one logical `write()` call across multiple `write()` syscalls. When two processes' writes interleave at byte granularity, the receiver sees an orphaned continuation byte (`invalid continuation byte`).

2. **Response truncation at a chunk boundary** — the K8s API response is split at a size boundary (~5.1 MB observed), cutting a 3-byte glyph after its first byte (`unexpected end of data` at `0xe2`).

The Kubernetes Python client reads the response with `r.data.decode('utf8')` (strict, no error handling). Both variants throw `UnicodeDecodeError` before the run has any real failure. The exception bubbles through the `_retry` wrapper in `internal_process_one_running_execution` and, after 5 retries, marks the execution `SYSTEM_ERROR` — causing downstream Upload steps to be skipped on an otherwise successful training run.

**This is responsible for 3,960 Bugsnag occurrences in the past 7 days.**

Neither the Linux kernel nor the Kubernetes client is at fault. The kernel's `PIPE_BUF` atomic write guarantee applies only to writes ≤ 4096 bytes, and Python's `TextIOWrapper` may issue multiple syscalls for one logical write. The corruption is baked into the bytes before the K8s client sees them, so the client cannot fix it in a lossless way.

## Solutions compared

Three approaches were kept as separate PRs for team comparison:

**PR #277 — Defensive decode (`_preload_content=False` + `errors="replace"`)**
Bypass the K8s client's strict decode by using `_preload_content=False`, then decode ourselves with `errors="replace"` so `U+FFFD` is substituted for any malformed byte sequence. A `_logger.warning` fires when substitution occurs.
- Pro: handles both error classes; logs remain complete (only the torn glyph is replaced)
- Con: lossy — the torn bytes (partial tqdm glyph) are replaced rather than recovered

**PR #278 — Suppress tqdm output at source**
Inject `TQDM_DISABLE=1` (and related env vars) into the pod's environment so block glyphs are never written to stderr in the first place.
- Pro: eliminates the root cause entirely for tqdm-based progress bars
- Con: only covers tqdm; any other multi-byte output from training code or libraries would still tear

**PR #279 / #280 — Swallow log-acquisition errors + Observe search link**
Catch all exceptions from log reading, return an empty string / `None`, and embed a pre-built Observe log search URL in the placeholder message so the user can still find the pod logs.
- Pro: execution is never incorrectly marked `SYSTEM_ERROR`; user has a direct link to raw logs
- Con: the execution log stored in GCS is lost; relies on Observe retention

## Fix landed

**PR #277** — `_preload_content=False` + `errors="replace"` was selected. It addresses both UnicodeDecodeError variants, keeps the execution log intact (minus the torn glyph bytes), and does not require changes to the training environment.

Fixes #277

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

UnicodeDecodeError in pod log reading marks healthy GPU training runs as SYSTEM_ERROR #281

Root cause

Solutions compared

Fix landed

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

UnicodeDecodeError in pod log reading marks healthy GPU training runs as SYSTEM_ERROR #281

Description

Root cause

Solutions compared

Fix landed

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions