Skip to content

feat: retry on timeout + post fallbacks as PR reviews#19

Merged
derekmisler merged 4 commits into
docker:mainfrom
derekmisler:improve-timeout-handling
Jun 25, 2026
Merged

feat: retry on timeout + post fallbacks as PR reviews#19
derekmisler merged 4 commits into
docker:mainfrom
derekmisler:improve-timeout-handling

Conversation

@derekmisler

Copy link
Copy Markdown
Collaborator

Summary

Two improvements to timeout handling in PR reviews.

Change 1: retry-on-timeout

  • Add retry-on-timeout input (default "0") to root action.yml
  • exec.ts: exit code 124 now only skips retry when the retryOnTimeout budget is exhausted; otherwise falls through to the normal retry path
  • index.ts: reads and passes retryOnTimeout to runAgent()
  • review-pr/action.yml: sets retry-on-timeout: "1" on the Run PR Review step — one automatic retry when the 45-min timeout fires

Change 2: fallback comments → PR reviews

All three fallback comment paths in Post clean summary now post via POST /pulls/{pr}/reviews with event: COMMENT instead of POST /issues/{pr}/comments:

  • Timeout fallback
  • Failure fallback
  • Defense-in-depth LGTM fallback (dedup guard updated to query /pulls/{pr}/reviews accordingly)

This enables GitHub's re-request review button on all fallback scenarios, not just the happy path.

No permission changes needed — GH_TOKEN already has pull-requests: write.

Testing

  • 472 unit tests pass (2 new tests for retry-on-timeout behavior)
  • TypeScript typecheck clean
  • Biome lint clean

Change 1 — retry-on-timeout:
- Add retry-on-timeout input (default 0) to root action.yml
- exec.ts: when exit 124, break only if retryOnTimeout budget exhausted;
  otherwise fall through to retry loop
- index.ts: read and pass retryOnTimeout to runAgent()
- review-pr/action.yml: set retry-on-timeout: "1" on Run PR Review step
  so a timed-out review gets one automatic retry
- Add two new unit tests: retry succeeds on second attempt; stops after budget

Change 2 — fallback comments → PR reviews:
- Post all three fallback paths (timeout / failure / defense-in-depth LGTM)
  via POST /pulls/{pr}/reviews with event:COMMENT instead of
  POST /issues/{pr}/comments — enables GitHub's re-request review button
- Update dedup guard to query /pulls/{pr}/reviews instead of /issues/comments
@derekmisler derekmisler self-assigned this Jun 25, 2026
@derekmisler derekmisler marked this pull request as ready for review June 25, 2026 12:13
@derekmisler derekmisler requested a review from docker-agent June 25, 2026 12:14

@docker-agent docker-agent left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assessment: 🟡 NEEDS ATTENTION

Comment thread src/main/exec.ts Outdated
}

if (attempt > opts.maxRetries) {
if (opts.retryOnTimeout <= 0 || attempt > opts.retryOnTimeout) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[MEDIUM] Shared attempt counter conflates timeout and non-timeout retry budgets

The timeout retry budget (retryOnTimeout) and the normal failure retry budget (maxRetries) both increment the same attempt counter. This causes under-retrying in mixed-failure sequences:

  • Attempt 1: normal (non-timeout) failure → attempt(1) > maxRetries(1) is false → retries (correct so far)
  • Attempt 2: timeout → attempt(2) > retryOnTimeout(1) is truebreaks with no timeout retry used

The intent is that retryOnTimeout=1 allows one extra retry when a timeout fires. But if a normal failure has already advanced attempt to 2, the timeout-retry budget is silently exhausted without ever being used.

The two budgets should be tracked with independent counters (e.g. timeoutRetryCount and failureRetryCount) rather than sharing a single attempt offset against different limits.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed — replaced the shared attempt counter with two independent counters: failureRetryCount (checked against maxRetries) and timeoutRetryCount (checked against retryOnTimeout). A mixed failure→timeout→success sequence now works correctly and is covered by a new test.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent fix! The independent counters now correctly track each budget separately. I verified that:

  • failureRetryCount increments only for non-timeout failures (checked against maxRetries)
  • timeoutRetryCount increments only for timeout failures (checked against retryOnTimeout)
  • A mixed-failure sequence like failure → timeout → success now correctly uses 1 from each budget instead of exhausting the shared counter prematurely

The implementation is clean and the log messages also accurately reflect which budget is being used. Great work covering this with a test case!

Comment thread review-pr/action.yml
# Dedup guard: skip posting if an identical fallback comment already exists.
EXISTING=$(gh api "repos/$REPOSITORY/issues/$PR_NUMBER/comments" \
# Dedup guard: skip posting if an identical fallback PR review already exists.
EXISTING=$(gh api "repos/$REPOSITORY/pulls/$PR_NUMBER/reviews" \

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[MEDIUM] Dedup guard for fallback LGTM is blind to old issue-comment style posts

The dedup guard now queries GET /pulls/{pr}/reviews for existing LGTM reviews, but the previous code posted fallbacks to POST /issues/{pr}/comments. These are separate GitHub API endpoints returning entirely different data.

On the first run of this new code against a PR that previously received an LGTM issue comment (from the old path), the guard will find zero matching reviews and post a duplicate LGTM PR review — leaving the PR with both an issue comment and a PR review saying the same thing.

Consider adding a one-time cross-check against the issue comments endpoint as well:

EXISTING_OLD=$(gh api "repos/$REPOSITORY/issues/$PR_NUMBER/comments" \
  --jq '[.[] | select(.body | startswith("🟢 **No issues found**"))] | length' \
  2>/dev/null || echo "0")

Or simply accept the one-time duplicate as a known migration artefact (subsequent runs will dedup correctly via the reviews endpoint).

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed with the cross-check approach — the guard now queries both /pulls/{pr}/reviews (new path) and /issues/{pr}/comments (old path), and skips posting if either returns a match. This avoids the one-time duplicate on the migration run without any manual cleanup.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perfect — the cross-check implementation looks solid. The guard now queries both /pulls/{pr}/reviews and /issues/{pr}/comments, and skips posting if either returns a match. This cleanly handles the migration without manual intervention.

The || echo \"0\" fallback on both queries is a nice touch — ensures the skip logic degrades gracefully if either API call fails (e.g., rate limiting).

exec.ts: replace shared attempt counter with separate failureRetryCount
and timeoutRetryCount so mixed-failure sequences (e.g. a non-timeout
failure followed by a timeout) consume only from the correct budget.
The previous code used attempt > retryOnTimeout which silently exhausted
the timeout budget when a prior failure had already advanced attempt.

Add a test covering the mixed failure→timeout→success sequence.

review-pr/action.yml: extend the LGTM dedup guard to also check the old
issues/comments endpoint alongside the new pulls/reviews endpoint, so the
first run of this code against a PR that previously received an LGTM issue
comment does not post a duplicate review.
exec.ts: retry log now shows 'Timeout retry N of M' vs 'Retry N of M'
so ops can distinguish which budget fired without reading code.

review-pr/action.yml: guard jq selects with (.body // "") to avoid a
type error when .body is null (draft reviews, deleted comments), which
previously caused the || echo 0 fallback to fire and post a duplicate.

action.yml: retry-on-timeout description notes that the budget is
independent of max-retries so both can be consumed in the same run.
@derekmisler derekmisler enabled auto-merge (squash) June 25, 2026 14:00
@derekmisler derekmisler merged commit e44355e into docker:main Jun 25, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants