Skip to content

fix: Launchers - Kubernetes - Fix getting logs when logs arte not valid UTF-8#277

Merged
morgan-wowk merged 1 commit into
masterfrom
fix/kubernetes-log-utf8-decode
Jun 18, 2026
Merged

fix: Launchers - Kubernetes - Fix getting logs when logs arte not valid UTF-8#277
morgan-wowk merged 1 commit into
masterfrom
fix/kubernetes-log-utf8-decode

Conversation

@morgan-wowk

@morgan-wowk morgan-wowk commented Jun 17, 2026

Copy link
Copy Markdown
Collaborator

The Kubernetes client's default read_namespaced_pod_log path does a strict
.decode('utf8') over the full log payload before checking HTTP status.
When a pod with high-volume tqdm progress bars (block glyphs █▉▊▋▌▍▎▏,
3-byte UTF-8) runs with num_proc>1, concurrent writes to the same fd can
split a multi-byte glyph across a chunk boundary, leaving an orphaned
continuation byte. The strict decode throws UnicodeDecodeError, which
bubbles through the log-upload retry wrapper and marks an otherwise-healthy
training run as SYSTEM_ERROR.

Fix: pass _preload_content=False to get the raw urllib3 response and decode
manually with errors="replace". This is applied to both the single-pod
(LaunchedKubernetesContainer.get_log) and multi-pod
(LaunchedKubernetesJob._get_log_by_pod_key) log-read paths.
A warning is logged whenever replacement characters are injected, so the
next occurrence is observable in Observe without requiring a separate
debug build.

The existing "Bad Request" catch for PodInitializing is unaffected:
the kubernetes client's status check runs outside the _preload_content
block and still raises ApiException with the correct reason phrase.

User experience: before and after

Orchestrator log-upload path (run lifecycle)

Before — the UnicodeDecodeError bubbles out of the retry wrapper. The run is marked SYSTEM_ERROR, no logs are uploaded, and all downstream tasks (e.g. Upload HF, Upload Training Summary) are skipped. The user sees a failed run with no log output and no indication that their training code was healthy.

After — the log is decoded successfully and uploaded. The run continues to completion. One or two progress-bar characters are substituted with ? (U+FFFD) at the point of corruption, but the rest of the log is intact and readable.

API log-read path (viewing logs for a running execution)

Before — the request throws before returning a response. The user gets a 500 error in the UI when trying to view logs mid-run.

After — the full log is returned. The substituted character appears inline exactly where the torn byte was, typically mid-progress-bar where it is visually unnoticeable.


Example log output

Before (UnicodeDecodeError thrown at byte 5,115,152 — nothing returned):

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 5115152: unexpected end of data

After (log returned; ? marks the single substituted byte at the corruption point):

2024-06-17T20:29:11Z Map (num_proc=64):  77%|██████████████████████████████████████████████████████████████████████████████████████▌         | 137412/178432 [00:34<00:10, 3991.03 examples/s]
2024-06-17T20:29:12Z Map (num_proc=64):  79%|███████████████████████████████████████████████████████████████?██████  | 140876/178432 [00:35<00:09, 3987.11 examples/s]
2024-06-17T20:29:13Z Map (num_proc=64):  81%|█████████████████████████████████████████████████████████████████████▏  | 144501/178432 [00:36<00:08, 3994.77 examples/s]
2024-06-17T20:29:55Z ***** Running training *****
2024-06-17T20:29:55Z   Num examples = 144,501
2024-06-17T20:29:55Z   Num Epochs = 3

The ? on line 2 is where one torn block glyph was replaced. All structured log lines above and below it — training config, loss values, eval metrics — are fully intact.

Fixes: #281

Copy link
Copy Markdown
Collaborator Author

This stack of pull requests is managed by Graphite. Learn more about stacking.

@morgan-wowk

morgan-wowk commented Jun 17, 2026

Copy link
Copy Markdown
Collaborator Author

🤖 Bugsnag impact — 3.96K occurrences in the last 7 days

This PR directly resolves the top recurring UnicodeDecodeError error group in Bugsnag (3,960 occurrences over the past 7 days). Two distinct variants are hitting the same strict .decode('utf8') call in the Kubernetes client:

  1. invalid continuation byte — 64 worker processes writing tqdm block glyphs (█▉▊▋▌▍▎▏, 3-byte UTF-8: E2 96 8x) to the same stderr fd interleave their byte streams, leaving orphaned continuation bytes in the raw K8s API response.

  2. unexpected end of data at 0xe2 — the K8s API response is truncated mid-sequence at a size boundary (observed around 5.1 MB), cutting a 3-byte glyph after its first byte.

Both variants throw before the run has any real failure, bubble through the retry wrapper in internal_process_one_running_execution, and mark otherwise-healthy training runs as SYSTEM_ERROR — causing the downstream Upload steps to be skipped.

How this fixes it: switching to _preload_content=False bypasses the client's strict decode entirely. We then call .decode("utf-8", errors="replace") ourselves, which substitutes U+FFFD for any malformed byte sequence — covering both error classes above. The change is applied to both the orchestrator log-upload path (_get_log_by_pod_key) and the live log-read path called by the API when a user views logs for a running execution (LaunchedKubernetesContainer.get_log). A _logger.warning fires whenever substitution occurs, so future occurrences remain observable without needing a repro.

)
try:
log = response.data.decode("utf-8", errors="replace")
if "�" in log:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's keep the source code mostly ASCII. Let's use "\N{REPLACEMENT CHARACTER}" instead of "�".

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice suggestion. Changed.

Thanks for reviewing.

log = response.data.decode("utf-8", errors="replace")
if "�" in log:
_logger.warning(
"Pod log for %s contained invalid UTF-8 bytes; substituted replacement characters.",

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

f-string, maybe?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

f-string sounds good.

Changed

Ark-kun commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

Thank you for the investigation and the fix. Approved. See the small comments.

@Ark-kun Ark-kun changed the title kubernetes: decode pod logs with errors=replace to survive torn UTF-8 bytes fix: Launchers - Kubernetes - Fix getting logs when logs arte not valid UTF-8 Jun 18, 2026

Ark-kun commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

Please add "Fixes:" line with links to the GitHub issues this PR closes

… bytes

The Kubernetes client's default read_namespaced_pod_log path does a strict
.decode('utf8') over the full log payload before checking HTTP status.
When a pod with high-volume tqdm progress bars (block glyphs █▉▊▋▌▍▎▏,
3-byte UTF-8) runs with num_proc>1, concurrent writes to the same fd can
split a multi-byte glyph across a chunk boundary, leaving an orphaned
continuation byte.  The strict decode throws UnicodeDecodeError, which
bubbles through the log-upload retry wrapper and marks an otherwise-healthy
training run as SYSTEM_ERROR.

Fix: pass _preload_content=False to get the raw urllib3 response and decode
manually with errors="replace".  This is applied to both the single-pod
(LaunchedKubernetesContainer.get_log) and multi-pod
(LaunchedKubernetesJob._get_log_by_pod_key) log-read paths.
A warning is logged whenever replacement characters are injected, so the
next occurrence is observable in Observe without requiring a separate
debug build.

The existing "Bad Request" catch for PodInitializing is unaffected:
the kubernetes client's status check runs outside the _preload_content
block and still raises ApiException with the correct reason phrase.
@morgan-wowk morgan-wowk force-pushed the fix/kubernetes-log-utf8-decode branch from 7a5ff1b to fe11a6b Compare June 18, 2026 22:49

Copy link
Copy Markdown
Collaborator Author

Please add "Fixes:" line with links to the GitHub issues this PR closes

​Thanks for the review. Added the reference to the GitHub issue.

Copy link
Copy Markdown
Collaborator Author

Smoke tested locally ✅
Smoke tested on known upstream services / wrappers ✅

@morgan-wowk morgan-wowk merged commit a57e9e7 into master Jun 18, 2026
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

UnicodeDecodeError in pod log reading marks healthy GPU training runs as SYSTEM_ERROR

2 participants