fix(download): retry transient per-file failures in downloadRepo #1
Closed
JulianPscheid wants to merge 1 commit into
A single TLS handshake abort (NSURLErrorSecureConnectionFailed) or request timeout on one file of a multi-file repo download aborted the entire download, forcing the caller to restart from zero. On flaky CDN paths (e.g. HuggingFace's Xet bridge) this meant several manual retries before a model finished downloading. Wrap each per-file download in a bounded retry with exponential backoff (4 attempts; 1s/2s/4s). Classify errors so only transient failures retry — URLSession timeout/TLS/connectivity and HTTP 429/503/5xx — while 404, other 4xx, and invalid responses fail fast without burning the backoff budget. Already-downloaded files are still skipped, so retries stay cheap.
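The bounded retry described above could look roughly like the following. This is a minimal sketch, not the code from this PR: `withRetry`, `backoffDelay`, and the `shouldRetry` parameter are hypothetical names, and only the attempt count (4) and the 1s/2s/4s delays come from the description.

```swift
import Foundation

// Hypothetical helper: delay before the next try, doubling after each failed attempt.
// With base = 1s this gives 1s, 2s, 4s after attempts 1, 2, 3.
func backoffDelay(afterAttempt attempt: Int, base: TimeInterval = 1.0) -> TimeInterval {
    return base * pow(2.0, Double(attempt - 1))
}

// Hypothetical generic retry wrapper (assumed shape, not the PR's downloadFileWithRetry).
// Runs `operation` up to `maxAttempts` times and sleeps between attempts.
// Errors that `shouldRetry` rejects are rethrown immediately without sleeping.
func withRetry<T>(
    maxAttempts: Int = 4,
    baseDelay: TimeInterval = 1.0,
    shouldRetry: (Error) -> Bool,
    operation: () async throws -> T
) async throws -> T {
    var attempt = 1
    while true {
        do {
            return try await operation()
        } catch {
            // Give up when the budget is spent or the error is not transient.
            if attempt >= maxAttempts || !shouldRetry(error) {
                throw error
            }
            let delay = backoffDelay(afterAttempt: attempt, base: baseDelay)
            try await Task.sleep(nanoseconds: UInt64(delay * 1_000_000_000))
            attempt += 1
        }
    }
}
```

With the defaults, a file that keeps failing transiently is tried four times and waits at most 7 seconds in total before the last error is rethrown to the caller.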
Superseded by a standalone upstream PR against FluidInference/FluidAudio based directly on upstream/main; this change is independent of the token-timings line it was stacked on here.
Problem

A single TLS handshake abort (NSURLErrorSecureConnectionFailed / SecureTransport -9816) or request timeout (-1001) on one file of a multi-file repo download aborts the entire downloadRepo call. On flaky CDN paths (observed on-device against HuggingFace's Xet bridge, us.aws.cdn.hf.co/xet-bridge-us), this forced several manual retries before a 22-file model (Nemotron streaming) finished downloading.

The Nemotron streaming path is especially exposed: StreamingNemotronAsrManager.loadModels calls DownloadUtils.downloadRepo directly and does not go through DownloadUtils.loadModels, so it never had even the coarse one-shot delete+redownload fallback.

Change

Wrap each per-file download in downloadRepo in a bounded retry with exponential backoff (4 attempts; 1s/2s/4s), via a new downloadFileWithRetry helper. Errors are classified so only transient failures retry: URLError timeout / TLS / connectivity, and HTTP 429 / 503 / 5xx. Everything else fails fast (404 / other 4xx, invalid responses, non-network errors), so a genuinely missing/misnamed file surfaces immediately.

Already-downloaded files are still skipped (atomic moveItem after a validated 2xx), so retries stay cheap. The HTTP-status validation is unchanged; it just moved into the helper.

Notes

swift build and swift test --filter DownloadUtils pass; swift format lint is clean.
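
The error classification described above might be sketched as follows. `isTransient` and `DownloadError` are hypothetical names, not taken from the PR, and the exact set of `URLError` codes counted as "connectivity" is an assumption; the description only names timeout, TLS, and connectivity failures plus HTTP 429/503/5xx.

```swift
import Foundation
#if canImport(FoundationNetworking)
import FoundationNetworking  // URLError lives here on non-Apple platforms
#endif

// Hypothetical error type for response validation; the PR's real type is not shown.
enum DownloadError: Error {
    case httpStatus(Int)
    case invalidResponse
}

// Returns true only for failures that are worth retrying.
func isTransient(_ error: Error) -> Bool {
    if let urlError = error as? URLError {
        switch urlError.code {
        // Timeout, TLS handshake failure, and an assumed set of connectivity codes.
        case .timedOut, .secureConnectionFailed, .networkConnectionLost,
             .notConnectedToInternet, .cannotConnectToHost, .cannotFindHost,
             .dnsLookupFailed:
            return true
        default:
            return false
        }
    }
    if let downloadError = error as? DownloadError {
        switch downloadError {
        case .httpStatus(let code):
            // 429 (rate limited) and any 5xx, which includes 503.
            return code == 429 || (500...599).contains(code)
        case .invalidResponse:
            return false
        }
    }
    // Non-network errors (file system, decoding, and so on) fail fast.
    return false
}
```

Defaulting to `false` keeps the policy conservative: an error type nobody anticipated surfaces at once instead of consuming the backoff budget.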