kubernetes: suppress tqdm progress bars in all container pods #278

Closed
morgan-wowk wants to merge 1 commit into master from fix/disable-tqdm-in-k8s-jobs

morgan-wowk commented Jun 18, 2026

Collaborator

Injects TQDM_DISABLE=1 and HF_DATASETS_DISABLE_PROGRESS_BARS=1 into every Kubernetes container environment at the launcher level. Components may override either variable by setting the same key in their own env; component values take precedence over the injected defaults.
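The precedence rule can be sketched as a plain dictionary merge. This is a minimal illustration, not the launcher code from this PR: the helper name `merge_container_env` and the constant `DEFAULT_PROGRESS_ENV` are hypothetical, and the sketch assumes the component env is available as a simple string-to-string mapping.

```python
from typing import Dict, Mapping

# The two defaults named in this PR description.
DEFAULT_PROGRESS_ENV: Dict[str, str] = {
    "TQDM_DISABLE": "1",
    "HF_DATASETS_DISABLE_PROGRESS_BARS": "1",
}


def merge_container_env(component_env: Mapping[str, str]) -> Dict[str, str]:
    """Hypothetical helper: launcher defaults overlaid with component values."""
    merged = dict(DEFAULT_PROGRESS_ENV)
    # Keys set by the component replace the injected defaults.
    merged.update(component_env)
    return merged


if __name__ == "__main__":
    # A component that wants its progress bars back overrides one key.
    print(merge_container_env({"TQDM_DISABLE": "0"}))
```

Applying the defaults first and the component env second means a component never has to know about the injection in order to opt out of it.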

Why

HF datasets map with num_proc>1 spawns child processes that each write tqdm block-glyph progress bars (█▉▊▋▌▍▎▏, 3-byte UTF-8: E2 96 8x) to a shared inherited stderr fd. POSIX guarantees atomicity only for pipe writes of at most PIPE_BUF bytes, so bytes from concurrent workers can interleave mid-glyph. This produces torn multi-byte sequences in the pod log stream, which cause UnicodeDecodeError in the Kubernetes client's strict .decode('utf8') and mark otherwise-healthy training runs as SYSTEM_ERROR.
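The failure mode can be reproduced without Kubernetes or tqdm. The sketch below hand-builds a torn byte sequence by splicing one glyph into the middle of another, which stands in for two workers interleaving on the shared fd, and then applies the same strict decode. The helper `strict_decode_fails` and the constant names are illustrative only.

```python
# Two of the tqdm block glyphs, each 3 bytes in UTF-8 (E2 96 8x).
FULL_BLOCK = "█".encode("utf-8")          # b'\xe2\x96\x88'
LEFT_SEVEN_EIGHTHS = "▉".encode("utf-8")  # b'\xe2\x96\x89'

# Simulated interleaving: worker B's glyph lands after the first two
# bytes of worker A's glyph, leaving A's final byte stranded.
torn = FULL_BLOCK[:2] + LEFT_SEVEN_EIGHTHS + FULL_BLOCK[2:]


def strict_decode_fails(data: bytes) -> bool:
    """Return True if a strict UTF-8 decode of `data` raises."""
    try:
        data.decode("utf8")
    except UnicodeDecodeError:
        return True
    return False


if __name__ == "__main__":
    print(strict_decode_fails(torn))                             # torn stream
    print(strict_decode_fails(FULL_BLOCK + LEFT_SEVEN_EIGHTHS))  # intact glyphs
    print(strict_decode_fails(b"step 100 loss 1.23\n"))          # pure ASCII
```

Only the torn sequence fails. Intact glyphs and pure ASCII both decode, which is why removing the glyphs at the source avoids the error no matter how writes interleave.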

With these env vars set, the tokenization and packing phases emit no non-ASCII bytes, so the log stream is pure ASCII and the torn-byte condition cannot occur regardless of num_proc.

Side effects

  • Log sizes for heavy tokenization jobs drop significantly. The ~6 MB blobs observed in incident runs (wall-to-wall tqdm bars) shrink to tens of KB of actual training output.
  • Tokenization/packing progress is no longer visible in logs. The HF datasets step count and throughput are still logged at completion; only the animated progress bar is suppressed.

Relationship to #277

This PR and #277 are independent solutions to the same root cause, kept separate for comparison. #277 fixes the Kubernetes client decode layer defensively; this PR eliminates the primary source of non-ASCII bytes. Either alone is an improvement.

Inject TQDM_DISABLE=1 and HF_DATASETS_DISABLE_PROGRESS_BARS=1 into every
Kubernetes container's environment unless the component has already set
those keys explicitly (user values take precedence).

High-volume tqdm block-glyph output (█▉▊▋▌▍▎▏, 3-byte UTF-8) from
concurrent HF datasets workers (num_proc>1) is the dominant source of
non-ASCII bytes in pod log streams. Eliminating the glyphs at the source
makes the log stream pure ASCII for tokenization/packing phases, removing
any possibility of torn multi-byte sequences reaching the Kubernetes API
read path regardless of the defensive decode added in the previous commit.

Side effect: log sizes for heavy tokenization jobs drop significantly
(observed ~6 MB → tens of KB), since tqdm progress bars account for the
bulk of the raw byte volume.