fix: Launchers - Kubernetes - Fix getting logs when logs are not valid UTF-8 by morgan-wowk · Pull Request #277 · TangleML/tangle

morgan-wowk · 2026-06-17T22:58:33Z

The Kubernetes client's default read_namespaced_pod_log path does a strict
.decode('utf8') over the full log payload before checking HTTP status.
When a pod with high-volume tqdm progress bars (block glyphs █▉▊▋▌▍▎▏,
3-byte UTF-8) runs with num_proc>1, concurrent writes to the same fd can
split a multi-byte glyph across a chunk boundary, leaving an orphaned
continuation byte. The strict decode throws UnicodeDecodeError, which
bubbles through the log-upload retry wrapper and marks an otherwise-healthy
training run as SYSTEM_ERROR.
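
The failure mode can be reproduced in isolation. The following is a minimal sketch, assuming a payload that is cut after the first byte of a block glyph; the sample bytes are illustrative and not taken from the incident.

```python
# A full block glyph occupies three bytes in UTF-8.
glyph = "█".encode("utf-8")
assert glyph == b"\xe2\x96\x88"

# Illustrative payload: a progress-bar line whose last glyph is cut
# after its first byte (0xe2).
torn = "Map: 50%|".encode("utf-8") + glyph + glyph[:1]

try:
    torn.decode("utf-8")  # strict error handling is the default
    strict_error = None
except UnicodeDecodeError as exc:
    strict_error = exc

# The exception records why and where decoding stopped.
print(strict_error.reason, strict_error.start)
```

A single stray byte is enough to make the whole payload undecodable under strict error handling, no matter how much valid text precedes it.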

Fix: pass _preload_content=False to get the raw urllib3 response and decode
manually with errors="replace". This is applied to both the single-pod
(LaunchedKubernetesContainer.get_log) and multi-pod
(LaunchedKubernetesJob._get_log_by_pod_key) log-read paths.
A warning is logged whenever replacement characters are injected, so the
next occurrence is observable in Observe without requiring a separate
debug build.
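
A rough sketch of the approach described above. The helper names `decode_pod_log` and `read_pod_log` are hypothetical and are not the methods changed in this PR; `core_api` is assumed to be a `kubernetes.client.CoreV1Api` instance (or anything exposing the same `read_namespaced_pod_log` method).

```python
import logging

_logger = logging.getLogger(__name__)


def decode_pod_log(raw: bytes, pod_name: str) -> str:
    """Decode raw pod log bytes, tolerating malformed UTF-8 (hypothetical helper)."""
    log = raw.decode("utf-8", errors="replace")
    if "\N{REPLACEMENT CHARACTER}" in log:
        # Note: this also fires if the pod itself printed U+FFFD.
        _logger.warning(
            f"Pod log for {pod_name} contained invalid UTF-8 bytes; "
            "substituted replacement characters."
        )
    return log


def read_pod_log(core_api, pod_name: str, namespace: str) -> str:
    """Fetch a pod log as raw bytes and decode it leniently (hypothetical wrapper)."""
    # _preload_content=False makes the client return the raw urllib3
    # response instead of a strictly decoded string.
    response = core_api.read_namespaced_pod_log(
        name=pod_name,
        namespace=namespace,
        _preload_content=False,
    )
    return decode_pod_log(response.data, pod_name)
```

Keeping the decoding in one small function means the single-pod and multi-pod paths can share the same behavior and the same warning.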

The existing "Bad Request" catch for PodInitializing is unaffected:
the kubernetes client's status check runs outside the _preload_content
block and still raises ApiException with the correct reason phrase.

User experience: before and after

Orchestrator log-upload path (run lifecycle)

Before — the UnicodeDecodeError bubbles out of the retry wrapper. The run is marked SYSTEM_ERROR, no logs are uploaded, and all downstream tasks (e.g. Upload HF, Upload Training Summary) are skipped. The user sees a failed run with no log output and no indication that their training code was healthy.

After — the log is decoded successfully and uploaded. The run continues to completion. One or two progress-bar characters are replaced with U+FFFD (the replacement character, shown as ? in the example output) at the point of corruption, but the rest of the log is intact and readable.

API log-read path (viewing logs for a running execution)

Before — the request throws before returning a response. The user gets a 500 error in the UI when trying to view logs mid-run.

After — the full log is returned. The substituted character appears inline exactly where the torn byte was, typically mid-progress-bar where it is visually unnoticeable.

Example log output

Before (UnicodeDecodeError thrown at byte 5,115,152 — nothing returned):

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 5115152: unexpected end of data

After (log returned; ? marks the single substituted byte at the corruption point):

2024-06-17T20:29:11Z Map (num_proc=64):  77%|██████████████████████████████████████████████████████████████████████████████████████▌         | 137412/178432 [00:34<00:10, 3991.03 examples/s]
2024-06-17T20:29:12Z Map (num_proc=64):  79%|███████████████████████████████████████████████████████████████?██████  | 140876/178432 [00:35<00:09, 3987.11 examples/s]
2024-06-17T20:29:13Z Map (num_proc=64):  81%|█████████████████████████████████████████████████████████████████████▏  | 144501/178432 [00:36<00:08, 3994.77 examples/s]
2024-06-17T20:29:55Z ***** Running training *****
2024-06-17T20:29:55Z   Num examples = 144,501
2024-06-17T20:29:55Z   Num Epochs = 3

The ? on line 2 is where one torn block glyph was replaced. All structured log lines above and below it — training config, loss values, eval metrics — are fully intact.

Fixes: #281


morgan-wowk · 2026-06-17T23:04:50Z

🤖 Bugsnag impact — 3.96K occurrences in the last 7 days

This PR directly resolves the top recurring UnicodeDecodeError error group in Bugsnag (3,960 occurrences over the past 7 days). Two distinct variants are hitting the same strict .decode('utf8') call in the Kubernetes client:

invalid continuation byte — 64 worker processes writing tqdm block glyphs (█▉▊▋▌▍▎▏, 3-byte UTF-8: E2 96 8x) to the same stderr fd interleave their byte streams, leaving orphaned continuation bytes in the raw K8s API response.
unexpected end of data at 0xe2 — the K8s API response is truncated mid-sequence at a size boundary (observed around 5.1 MB), cutting a 3-byte glyph after its first byte.

Both variants throw before the run has any real failure, bubble through the retry wrapper in internal_process_one_running_execution, and mark otherwise-healthy training runs as SYSTEM_ERROR — causing the downstream Upload steps to be skipped.
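
Both variants can be reproduced with a few bytes. The sketch below is illustrative only: the byte sequences are constructed by hand to mimic interleaved writers and a truncated payload, and `decode_failure_reason` is a hypothetical helper, not part of the PR.

```python
from typing import Optional


def decode_failure_reason(raw: bytes) -> Optional[str]:
    """Return the strict UTF-8 decode failure reason, or None if raw is valid."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.reason
    return None


glyph = "█".encode("utf-8")  # b"\xe2\x96\x88"

# Variant 1 (illustrative): writer A emits two bytes of a glyph, writer B
# emits a whole glyph, then writer A emits its remaining byte.
interleaved = glyph[:2] + glyph + glyph[2:]

# Variant 2 (illustrative): the payload ends right after a lead byte.
truncated = b"step 1 " + glyph[:1]

print(decode_failure_reason(interleaved))
print(decode_failure_reason(truncated))

# Lenient decoding keeps the intact glyph and substitutes the damaged bytes.
print(interleaved.decode("utf-8", errors="replace"))
```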

How this fixes it: switching to _preload_content=False bypasses the client's strict decode entirely. We then call .decode("utf-8", errors="replace") ourselves, which substitutes U+FFFD for any malformed byte sequence — covering both error classes above. The change is applied to both the orchestrator log-upload path (_get_log_by_pod_key) and the live log-read path called by the API when a user views logs for a running execution (LaunchedKubernetesContainer.get_log). A _logger.warning fires whenever substitution occurs, so future occurrences remain observable without needing a repro.

Ark-kun · 2026-06-18T22:12:32Z

        )
+        try:
+            log = response.data.decode("utf-8", errors="replace")
+            if "�" in log:


Let's keep the source code mostly ASCII. Let's use "\N{REPLACEMENT CHARACTER}" instead of "�".

Nice suggestion. Changed.

Thanks for reviewing.
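
For reference, a small sketch showing that the named escape suggested above denotes the same character as the literal, so the check behaves identically while the source stays ASCII. The sample bytes are illustrative.

```python
import unicodedata

# The named escape and the numeric escape denote the same code point.
replacement = "\N{REPLACEMENT CHARACTER}"
assert replacement == "\ufffd"
assert unicodedata.name(replacement) == "REPLACEMENT CHARACTER"

# Illustrative bytes: a lone continuation byte (0x96) is not valid UTF-8.
decoded = b"loss=0.1 \x96".decode("utf-8", errors="replace")
has_substitution = replacement in decoded
print(has_substitution)
```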

Ark-kun · 2026-06-18T22:12:45Z

+            log = response.data.decode("utf-8", errors="replace")
+            if "�" in log:
+                _logger.warning(
+                    "Pod log for %s contained invalid UTF-8 bytes; substituted replacement characters.",


f-string, maybe?

f-string sounds good.

Changed

Ark-kun · 2026-06-18T22:14:15Z

Thank you for the investigation and the fix. Approved. See the small comments.

Ark-kun · 2026-06-18T22:22:07Z

Please add "Fixes:" line with links to the GitHub issues this PR closes


morgan-wowk · 2026-06-18T22:50:29Z

Please add "Fixes:" line with links to the GitHub issues this PR closes

Thanks for the review. Added the reference to the GitHub issue.

morgan-wowk · 2026-06-18T23:11:12Z

Smoke tested locally ✅
Smoke tested on known upstream services / wrappers ✅

This was referenced Jun 18, 2026

kubernetes: suppress tqdm progress bars in all container pods #278

Closed

kubernetes: swallow log-acquisition errors to prevent false SYSTEM_ERROR #279

Closed

morgan-wowk marked this pull request as ready for review June 18, 2026 20:05

morgan-wowk requested a review from Ark-kun as a code owner June 18, 2026 20:05

Ark-kun approved these changes Jun 18, 2026

Ark-kun changed the title ~~kubernetes: decode pod logs with errors=replace to survive torn UTF-8 bytes~~ fix: Launchers - Kubernetes - Fix getting logs when logs are not valid UTF-8 Jun 18, 2026

morgan-wowk mentioned this pull request Jun 18, 2026

UnicodeDecodeError in pod log reading marks healthy GPU training runs as SYSTEM_ERROR #281

Closed

morgan-wowk force-pushed the fix/kubernetes-log-utf8-decode branch from 7a5ff1b to fe11a6b Compare June 18, 2026 22:49

morgan-wowk merged commit a57e9e7 into master Jun 18, 2026
7 checks passed

morgan-wowk merged 1 commit into master from fix/kubernetes-log-utf8-decode
