Skip to content

fix(download): retry transient per-file failures in downloadRepo#1

Closed
JulianPscheid wants to merge 1 commit into
hedy-nemotron-token-timingsfrom
hedy-download-retry-resilience
Closed

fix(download): retry transient per-file failures in downloadRepo#1
JulianPscheid wants to merge 1 commit into
hedy-nemotron-token-timingsfrom
hedy-download-retry-resilience

Conversation

@JulianPscheid

Copy link
Copy Markdown
Owner

Problem

A single TLS handshake abort (NSURLErrorSecureConnectionFailed / SecureTransport -9816) or request timeout (-1001) on one file of a multi-file repo download aborts the entire downloadRepo call. On flaky CDN paths — observed on-device against HuggingFace's Xet bridge (us.aws.cdn.hf.co/xet-bridge-us) — this forced several manual retries before a 22-file model (Nemotron streaming) finished downloading.

The Nemotron streaming path is especially exposed: StreamingNemotronAsrManager.loadModels calls DownloadUtils.downloadRepo directly and does not go through DownloadUtils.loadModels, so it never had even the coarse one-shot delete+redownload fallback.

Change

Wrap each per-file download in downloadRepo in a bounded retry with exponential backoff (4 attempts; 1s/2s/4s), via a new downloadFileWithRetry helper. Errors are classified so only transient failures retry:

  • Retry: URLError timeout / TLS / connectivity, and HTTP 429/503/5xx.
  • Fail fast (no backoff): 404 / other 4xx, invalid responses, non-network errors — a genuinely missing/misnamed file surfaces immediately.

Already-downloaded files are still skipped (atomic moveItem after a validated 2xx), so retries stay cheap and the HTTP-status validation is unchanged — it just moved into the helper.

Notes

  • Listing-phase calls are intentionally left as-is (no observed listing failures).
  • On a retry the per-file byte counter restarts, so reported progress may briefly dip — cosmetic, not addressed here.
  • swift build + swift test --filter DownloadUtils pass; swift format lint clean.

A single TLS handshake abort (NSURLErrorSecureConnectionFailed) or request
timeout on one file of a multi-file repo download aborted the entire
download, forcing the caller to restart from zero. On flaky CDN paths
(e.g. HuggingFace's Xet bridge) this meant several manual retries before a
model finished downloading.

Wrap each per-file download in a bounded retry with exponential backoff
(4 attempts; 1s/2s/4s). Classify errors so only transient failures retry —
URLSession timeout/TLS/connectivity and HTTP 429/503/5xx — while 404, other
4xx, and invalid responses fail fast without burning the backoff budget.
Already-downloaded files are still skipped, so retries stay cheap.
@JulianPscheid

Copy link
Copy Markdown
Owner Author

Superseded by a standalone upstream PR against FluidInference/FluidAudio based directly on upstream/main — this change is independent of the token-timings line it was stacked on here.

@JulianPscheid JulianPscheid deleted the hedy-download-retry-resilience branch June 15, 2026 22:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant