feat: retry on timeout + post fallbacks as PR reviews#19
Conversation
Change 1 — retry-on-timeout:
- Add retry-on-timeout input (default 0) to root action.yml
- exec.ts: when exit 124, break only if retryOnTimeout budget exhausted;
otherwise fall through to retry loop
- index.ts: read and pass retryOnTimeout to runAgent()
- review-pr/action.yml: set retry-on-timeout: "1" on Run PR Review step
so a timed-out review gets one automatic retry
- Add two new unit tests: retry succeeds on second attempt; stops after budget
Change 2 — fallback comments → PR reviews:
- Post all three fallback paths (timeout / failure / defense-in-depth LGTM)
via POST /pulls/{pr}/reviews with event:COMMENT instead of
POST /issues/{pr}/comments — enables GitHub's re-request review button
- Update dedup guard to query /pulls/{pr}/reviews instead of /issues/comments
docker-agent
left a comment
There was a problem hiding this comment.
Assessment: 🟡 NEEDS ATTENTION
| } | ||
|
|
||
| if (attempt > opts.maxRetries) { | ||
| if (opts.retryOnTimeout <= 0 || attempt > opts.retryOnTimeout) { |
There was a problem hiding this comment.
[MEDIUM] Shared attempt counter conflates timeout and non-timeout retry budgets
The timeout retry budget (retryOnTimeout) and the normal failure retry budget (maxRetries) both increment the same attempt counter. This causes under-retrying in mixed-failure sequences:
- Attempt 1: normal (non-timeout) failure →
attempt(1) > maxRetries(1)isfalse→ retries (correct so far) - Attempt 2: timeout →
attempt(2) > retryOnTimeout(1)istrue→ breaks with no timeout retry used
The intent is that retryOnTimeout=1 allows one extra retry when a timeout fires. But if a normal failure has already advanced attempt to 2, the timeout-retry budget is silently exhausted without ever being used.
The two budgets should be tracked with independent counters (e.g. timeoutRetryCount and failureRetryCount) rather than sharing a single attempt offset against different limits.
There was a problem hiding this comment.
Fixed — replaced the shared attempt counter with two independent counters: failureRetryCount (checked against maxRetries) and timeoutRetryCount (checked against retryOnTimeout). A mixed failure→timeout→success sequence now works correctly and is covered by a new test.
There was a problem hiding this comment.
Excellent fix! The independent counters now correctly track each budget separately. I verified that:
- failureRetryCount increments only for non-timeout failures (checked against maxRetries)
- timeoutRetryCount increments only for timeout failures (checked against retryOnTimeout)
- A mixed-failure sequence like failure → timeout → success now correctly uses 1 from each budget instead of exhausting the shared counter prematurely
The implementation is clean and the log messages also accurately reflect which budget is being used. Great work covering this with a test case!
| # Dedup guard: skip posting if an identical fallback comment already exists. | ||
| EXISTING=$(gh api "repos/$REPOSITORY/issues/$PR_NUMBER/comments" \ | ||
| # Dedup guard: skip posting if an identical fallback PR review already exists. | ||
| EXISTING=$(gh api "repos/$REPOSITORY/pulls/$PR_NUMBER/reviews" \ |
There was a problem hiding this comment.
[MEDIUM] Dedup guard for fallback LGTM is blind to old issue-comment style posts
The dedup guard now queries GET /pulls/{pr}/reviews for existing LGTM reviews, but the previous code posted fallbacks to POST /issues/{pr}/comments. These are separate GitHub API endpoints returning entirely different data.
On the first run of this new code against a PR that previously received an LGTM issue comment (from the old path), the guard will find zero matching reviews and post a duplicate LGTM PR review — leaving the PR with both an issue comment and a PR review saying the same thing.
Consider adding a one-time cross-check against the issue comments endpoint as well:
EXISTING_OLD=$(gh api "repos/$REPOSITORY/issues/$PR_NUMBER/comments" \
--jq '[.[] | select(.body | startswith("🟢 **No issues found**"))] | length' \
2>/dev/null || echo "0")Or simply accept the one-time duplicate as a known migration artefact (subsequent runs will dedup correctly via the reviews endpoint).
There was a problem hiding this comment.
Fixed with the cross-check approach — the guard now queries both /pulls/{pr}/reviews (new path) and /issues/{pr}/comments (old path), and skips posting if either returns a match. This avoids the one-time duplicate on the migration run without any manual cleanup.
There was a problem hiding this comment.
Perfect — the cross-check implementation looks solid. The guard now queries both /pulls/{pr}/reviews and /issues/{pr}/comments, and skips posting if either returns a match. This cleanly handles the migration without manual intervention.
The || echo \"0\" fallback on both queries is a nice touch — ensures the skip logic degrades gracefully if either API call fails (e.g., rate limiting).
exec.ts: replace shared attempt counter with separate failureRetryCount and timeoutRetryCount so mixed-failure sequences (e.g. a non-timeout failure followed by a timeout) consume only from the correct budget. The previous code used attempt > retryOnTimeout which silently exhausted the timeout budget when a prior failure had already advanced attempt. Add a test covering the mixed failure→timeout→success sequence. review-pr/action.yml: extend the LGTM dedup guard to also check the old issues/comments endpoint alongside the new pulls/reviews endpoint, so the first run of this code against a PR that previously received an LGTM issue comment does not post a duplicate review.
exec.ts: retry log now shows 'Timeout retry N of M' vs 'Retry N of M' so ops can distinguish which budget fired without reading code. review-pr/action.yml: guard jq selects with (.body // "") to avoid a type error when .body is null (draft reviews, deleted comments), which previously caused the || echo 0 fallback to fire and post a duplicate. action.yml: retry-on-timeout description notes that the budget is independent of max-retries so both can be consumed in the same run.
Summary
Two improvements to timeout handling in PR reviews.
Change 1: retry-on-timeout
retry-on-timeoutinput (default"0") to rootaction.ymlexec.ts: exit code 124 now only skips retry when theretryOnTimeoutbudget is exhausted; otherwise falls through to the normal retry pathindex.ts: reads and passesretryOnTimeouttorunAgent()review-pr/action.yml: setsretry-on-timeout: "1"on the Run PR Review step — one automatic retry when the 45-min timeout firesChange 2: fallback comments → PR reviews
All three fallback comment paths in
Post clean summarynow post viaPOST /pulls/{pr}/reviewswithevent: COMMENTinstead ofPOST /issues/{pr}/comments:/pulls/{pr}/reviewsaccordingly)This enables GitHub's re-request review button on all fallback scenarios, not just the happy path.
No permission changes needed —
GH_TOKENalready haspull-requests: write.Testing