fix: add stall detection and faster retries to git fetch#33

Open
devin-ai-integration[bot] wants to merge 2 commits into main from devin/1781213783-fetch-stall-detection

devin-ai-integration[bot] commented Jun 11, 2026

Summary

Addresses checkout step timeouts (e.g. bnk-dev/bank run 27376021190 where 3 jobs failed and most "successful" checkouts took 100-178s). The root cause is a stalled git fetch consuming the entire step-level timeout with no chance for the retry mechanism to help.

Two changes to GitCommandManager.fetch():

  1. Stall detection via git's built-in HTTP low-speed abort:

    -c http.lowSpeedLimit=1000 -c http.lowSpeedTime=15
    

    If transfer speed drops below 1 KB/s for 15 consecutive seconds, git aborts the fetch attempt. Without this, a hung connection blocks until the step timeout (typically 3 minutes) kills everything — only one attempt ever runs.

  2. Faster retry cadence: RetryHelper(3, 1, 5) instead of the default (3, 10, 20). With a 3-minute step timeout, the old 10-20s inter-attempt sleeps wasted 20-40s. New 1-5s sleeps give all 3 attempts a fair chance within the budget:

    • Old worst case: attempt₁ stalls 15s + sleep 20s + attempt₂ stalls 15s + sleep 20s + attempt₃ 15s = 85s
    • New worst case: attempt₁ stalls 15s + sleep 5s + attempt₂ stalls 15s + sleep 5s + attempt₃ 15s = 55s
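The worst-case figures in the two bullets above follow from simple arithmetic: three stalled attempts plus the two sleeps between them. A minimal sketch, assuming each stalled attempt is cut off after exactly the 15s low-speed window (a real attempt may transfer data for a while before stalling) and each sleep hits its maximum. `worstCaseSeconds` is a hypothetical helper for illustration, not code from this PR:

```typescript
// Hypothetical helper, not part of the PR: total worst-case wall time when
// every attempt stalls for `stallSeconds` and every sleep hits its maximum.
function worstCaseSeconds(
  attempts: number,
  stallSeconds: number,
  maxSleepSeconds: number
): number {
  // N attempts are separated by N - 1 sleeps.
  return attempts * stallSeconds + (attempts - 1) * maxSleepSeconds
}

const oldWorstCase = worstCaseSeconds(3, 15, 20) // 15 + 20 + 15 + 20 + 15 = 85
const newWorstCase = worstCaseSeconds(3, 15, 5) // 15 + 5 + 15 + 5 + 15 = 55
console.log(`old: ${oldWorstCase}s, new: ${newWorstCase}s`)
```

Under these assumptions both totals fit within a 3-minute (180s) step timeout, and the new cadence frees about 30s of it for actual transfer.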

Data context (ClickHouse, last 6h for bnk-dev in us-west): 7187 successful checkouts avg 41s, 31 failures all at ~189s (= step timeout). Multiple customers affected across us-west, not region-specific.
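For the stall-detection change (item 1 above), the two settings are passed as `-c` overrides ahead of the `fetch` subcommand. A rough sketch of how such an argument list could be assembled follows. `withStallDetection` and the example fetch arguments are illustrative assumptions and do not reflect the actual `GitCommandManager.fetch()` code:

```typescript
// Hypothetical helper: prepend git's low-speed abort settings to a command.
// http.lowSpeedLimit is in bytes per second; http.lowSpeedTime is in seconds.
function withStallDetection(
  gitArgs: string[],
  lowSpeedLimitBytesPerSec = 1000,
  lowSpeedTimeSec = 15
): string[] {
  return [
    '-c',
    `http.lowSpeedLimit=${lowSpeedLimitBytesPerSec}`,
    '-c',
    `http.lowSpeedTime=${lowSpeedTimeSec}`,
    ...gitArgs // `-c` overrides must come before the subcommand
  ]
}

// Example fetch arguments are placeholders, not the ones the action uses.
const args = withStallDetection(['fetch', '--no-tags', 'origin', 'main'])
console.log(['git', ...args].join(' '))
```

These `http.*` settings apply only to HTTP(S) transports, so a remote reached over SSH would not be covered by this abort.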

Link to Devin session: https://app.devin.ai/sessions/88dbc85a54eb4f22b1bcf057a8509fcf


Add http.lowSpeedLimit (1000 bytes/sec) and http.lowSpeedTime (15s) git
configs to the fetch command. When the transfer speed drops below 1KB/s
for 15 consecutive seconds, git aborts the fetch attempt. This enables
the retry mechanism to actually work when connections stall, rather than
having a single hung fetch consume the entire step-level timeout.

Also switch fetch retries from the default 10-20s inter-attempt sleep to
1-5s. With a typical 3-minute step timeout, every second counts; the old
delays could waste up to 40 seconds on sleeps alone.

Co-Authored-By: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
@devin-ai-integration

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment, CI, and merge conflict monitoring

Co-Authored-By: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>