kubernetes: suppress tqdm progress bars in all container pods #278
Closed
morgan-wowk wants to merge 1 commit into
Inject TQDM_DISABLE=1 and HF_DATASETS_DISABLE_PROGRESS_BARS=1 into every Kubernetes container's environment unless the component has already set those keys explicitly (user values take precedence).

High-volume tqdm block-glyph output (█▉▊▋▌▍▎▏, 3-byte UTF-8) from concurrent HF datasets workers (num_proc>1) is the dominant source of non-ASCII bytes in pod log streams. Eliminating the glyphs at the source makes the log stream pure ASCII for tokenization/packing phases, removing any possibility of torn multi-byte sequences reaching the Kubernetes API read path regardless of the defensive decode added in the previous commit.

Side effect: log sizes for heavy tokenization jobs drop significantly (observed ~6 MB → tens of KB), since tqdm progress bars account for the bulk of the raw byte volume.
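The precedence rule described above (defaults injected, explicit component values win) can be sketched as a plain dictionary merge. This is a minimal illustration, not the launcher's actual code: the helper name `with_progress_bar_defaults` and the `DEFAULT_ENV` constant are hypothetical, and the real launcher may represent container env differently.

```python
from typing import Dict, Mapping, Optional

# Defaults named in this PR. The constant name itself is hypothetical.
DEFAULT_ENV: Dict[str, str] = {
    "TQDM_DISABLE": "1",
    "HF_DATASETS_DISABLE_PROGRESS_BARS": "1",
}


def with_progress_bar_defaults(
    component_env: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Return a new env mapping: defaults first, then component values on top.

    Hypothetical helper for illustration only. Keys the component sets
    explicitly overwrite the defaults, so user values take precedence.
    """
    merged = dict(DEFAULT_ENV)          # start from the injected defaults
    merged.update(component_env or {})  # explicit component values win
    return merged


# A component that wants tqdm output back keeps its own setting:
env = with_progress_bar_defaults({"TQDM_DISABLE": "0", "MY_FLAG": "x"})
print(env)
```

Starting from the defaults and updating with the component's mapping keeps the input unmodified and makes the precedence order explicit in one place.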
db2050f to 059b2f8
Injects TQDM_DISABLE=1 and HF_DATASETS_DISABLE_PROGRESS_BARS=1 into every Kubernetes container environment at the launcher level. Components may override by setting the same keys in their own env; those values take precedence.

Why
HF datasets map with num_proc>1 spawns child processes that each write tqdm block-glyph progress bars (█▉▊▋▌▍▎▏, 3-byte UTF-8: E2 96 8x) to a shared inherited stderr fd. The OS does not guarantee atomic writes across processes for sequences longer than PIPE_BUF, so bytes from concurrent workers interleave mid-glyph. This produces torn multi-byte sequences in the pod log stream which cause UnicodeDecodeError in the Kubernetes client's strict .decode('utf8'), marking otherwise-healthy training runs as SYSTEM_ERROR.

With these env vars set, the tokenization and packing phases emit no non-ASCII bytes: the log stream is pure ASCII and the torn-byte condition cannot occur regardless of num_proc.

Side effects
Relationship to #277
This PR and #277 are independent solutions to the same root cause, kept separate for comparison. #277 fixes the Kubernetes client decode layer defensively; this PR eliminates the primary source of non-ASCII bytes. Either alone is an improvement.
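The failure mode both PRs target can be reproduced in a few lines. The sketch below simulates, rather than captures, an interleaved write: it splits one block glyph around another worker's output and shows that a strict UTF-8 decode fails, while an ASCII-only stream split at the same point decodes cleanly. The helper `decodes_strictly` is hypothetical, and `errors="replace"` is shown only as one possible defensive decode; it is an assumption, not necessarily what #277 does.

```python
def decodes_strictly(data: bytes) -> bool:
    """Return True if `data` is valid UTF-8 under a strict decode."""
    try:
        data.decode("utf8")
        return True
    except UnicodeDecodeError:
        return False


glyph = "█".encode("utf-8")  # 3 bytes: E2 96 88

# Simulated interleaving: another worker's bytes land mid-glyph.
torn = glyph[:1] + b"worker-2 line\n" + glyph[1:]

# The same kind of interleaving with ASCII-only output is harmless,
# because every byte is a complete character on its own.
ascii_line = b"tokenizing: 50%\n"
ascii_interleaved = ascii_line[:5] + b"worker-2 line\n" + ascii_line[5:]

print(decodes_strictly(glyph))              # intact glyph
print(decodes_strictly(torn))               # torn glyph
print(decodes_strictly(ascii_interleaved))  # interleaved ASCII

# One possible defensive decode (an assumption, see above): invalid bytes
# become U+FFFD and the readable text survives.
print(repr(torn.decode("utf8", errors="replace")))
```

With progress bars disabled, the stream looks like the ASCII case, so the strict decode has nothing to trip over no matter where writes interleave.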