fix: Launchers - Kubernetes - Fix getting logs when logs arte not valid UTF-8#277
Conversation
This stack of pull requests is managed by Graphite. Learn more about stacking. |
|
🤖 Bugsnag impact — 3.96K occurrences in the last 7 days This PR directly resolves the top recurring
Both variants throw before the run has any real failure, bubble through the retry wrapper in How this fixes it: switching to |
| ) | ||
| try: | ||
| log = response.data.decode("utf-8", errors="replace") | ||
| if "�" in log: |
There was a problem hiding this comment.
Let's keep the source code mostly ASCII. Let's use "\N{REPLACEMENT CHARACTER}" instead of "�".
There was a problem hiding this comment.
Nice suggestion. Changed.
Thanks for reviewing.
| log = response.data.decode("utf-8", errors="replace") | ||
| if "�" in log: | ||
| _logger.warning( | ||
| "Pod log for %s contained invalid UTF-8 bytes; substituted replacement characters.", |
There was a problem hiding this comment.
f-string sounds good.
Changed
|
Thank you for the investigation and the fix. Approved. See the small comments. |
|
Please add "Fixes:" line with links to the GitHub issues this PR closes |
… bytes
The Kubernetes client's default read_namespaced_pod_log path does a strict
.decode('utf8') over the full log payload before checking HTTP status.
When a pod with high-volume tqdm progress bars (block glyphs █▉▊▋▌▍▎▏,
3-byte UTF-8) runs with num_proc>1, concurrent writes to the same fd can
split a multi-byte glyph across a chunk boundary, leaving an orphaned
continuation byte. The strict decode throws UnicodeDecodeError, which
bubbles through the log-upload retry wrapper and marks an otherwise-healthy
training run as SYSTEM_ERROR.
Fix: pass _preload_content=False to get the raw urllib3 response and decode
manually with errors="replace". This is applied to both the single-pod
(LaunchedKubernetesContainer.get_log) and multi-pod
(LaunchedKubernetesJob._get_log_by_pod_key) log-read paths.
A warning is logged whenever replacement characters are injected, so the
next occurrence is observable in Observe without requiring a separate
debug build.
The existing "Bad Request" catch for PodInitializing is unaffected:
the kubernetes client's status check runs outside the _preload_content
block and still raises ApiException with the correct reason phrase.
7a5ff1b to
fe11a6b
Compare
Thanks for the review. Added the reference to the GitHub issue. |
|
Smoke tested locally ✅ |

The Kubernetes client's default read_namespaced_pod_log path does a strict
.decode('utf8') over the full log payload before checking HTTP status.
When a pod with high-volume tqdm progress bars (block glyphs █▉▊▋▌▍▎▏,
3-byte UTF-8) runs with num_proc>1, concurrent writes to the same fd can
split a multi-byte glyph across a chunk boundary, leaving an orphaned
continuation byte. The strict decode throws UnicodeDecodeError, which
bubbles through the log-upload retry wrapper and marks an otherwise-healthy
training run as SYSTEM_ERROR.
Fix: pass _preload_content=False to get the raw urllib3 response and decode
manually with errors="replace". This is applied to both the single-pod
(LaunchedKubernetesContainer.get_log) and multi-pod
(LaunchedKubernetesJob._get_log_by_pod_key) log-read paths.
A warning is logged whenever replacement characters are injected, so the
next occurrence is observable in Observe without requiring a separate
debug build.
The existing "Bad Request" catch for PodInitializing is unaffected:
the kubernetes client's status check runs outside the _preload_content
block and still raises ApiException with the correct reason phrase.
User experience: before and after
Orchestrator log-upload path (run lifecycle)
Before — the UnicodeDecodeError bubbles out of the retry wrapper. The run is marked
SYSTEM_ERROR, no logs are uploaded, and all downstream tasks (e.g. Upload HF, Upload Training Summary) are skipped. The user sees a failed run with no log output and no indication that their training code was healthy.After — the log is decoded successfully and uploaded. The run continues to completion. One or two progress-bar characters are substituted with
?(U+FFFD) at the point of corruption, but the rest of the log is intact and readable.API log-read path (viewing logs for a running execution)
Before — the request throws before returning a response. The user gets a 500 error in the UI when trying to view logs mid-run.
After — the full log is returned. The substituted character appears inline exactly where the torn byte was, typically mid-progress-bar where it is visually unnoticeable.
Example log output
Before (UnicodeDecodeError thrown at byte 5,115,152 — nothing returned):
After (log returned;
?marks the single substituted byte at the corruption point):The
?on line 2 is where one torn block glyph was replaced. All structured log lines above and below it — training config, loss values, eval metrics — are fully intact.Fixes: #281