What happened
The first code agent run for issue #644 (dispatch run 25549439711) failed on 2026-05-08 due to a corrupted sandbox gateway (k3s cluster). No failure notification was posted to issue #644. The issue sat idle for 3 days until rh-hemartin manually noticed and re-triggered with /code on 2026-05-11.
This is the same pattern documented in #934 (auto-retry on fast failures), #957 (failure notifications in post-run lifecycle), #976 (silent failure on iteration limit), and #872 (surface failure reasons). This failure (issue #644/PR #800) is now the third confirmed instance of the pattern, after the PR #920 example documented in #934 and the iteration-limit case in #976.
What could go better
The failure notification gap is the single highest-impact issue for time-to-resolution across the platform. Each silent failure creates a multi-day delay that only ends when a human happens to check. The fix is straightforward (post a comment on failure), but the work is spread across 4+ issues with overlapping scope.
Confidence: High. This is a well-understood problem with clear solutions. The pattern has been observed at least 3 times now (issue #644/PR #800, issue #910/PR #920, and the case in #976).
Proposed change
Consolidate issues #934, #957, #976, and #872 into a single tracking issue or epic that covers the full failure-notification lifecycle: auto-retry of fast, transient failures such as sandbox crashes (#934), a failure notification posted to the originating issue as part of the post-run lifecycle (#957), coverage of silent failures such as the iteration limit (#976), and surfacing the failure reason in that notification (#872).
Prioritize this as a single deliverable rather than four separate issues, since the implementation touches the same post-run scripts and dispatch workflows; a minimal sketch of the notification step follows below.
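The actual post-run scripts are not shown here; as a rough sketch, assuming the post-run step runs Python with a GITHUB_TOKEN available, posting the failure comment via the GitHub REST API could look like this (the repo slug, function name, and message wording are assumptions, not existing code):

```python
import os
import requests

GITHUB_API = "https://api.github.com"

def post_failure_comment(repo: str, issue_number: int, run_id: str, reason: str) -> None:
    """Comment on the originating issue when an agent run fails.

    `repo` is an "owner/name" slug; the token is assumed to be provided
    by the dispatch workflow environment.
    """
    token = os.environ["GITHUB_TOKEN"]
    body = (
        f"Code agent run {run_id} failed: {reason}. "
        "Re-trigger with `/code` once the underlying problem is resolved."
    )
    resp = requests.post(
        f"{GITHUB_API}/repos/{repo}/issues/{issue_number}/comments",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
        json={"body": body},
        timeout=30,
    )
    resp.raise_for_status()
```

Because this is a plain issue-comment API call, the same helper can serve the code agent, the fix agent, and any other dispatch workflow that knows its originating issue number.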
Validation criteria
After implementation: (1) every code/fix agent infrastructure failure should result in a comment on the originating issue within 5 minutes, (2) transient failures (sandbox crashes) should be auto-retried at least once before notifying, and (3) the average time-to-human-awareness of agent failures should drop from days to minutes. Measure over the next 30 days of agent runs.
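For criterion (2), the retry-before-notify policy could be a one-shot retry wrapper around the run. The sketch below is illustrative only: run_agent, notify_failure, and the transient-failure check are assumed callables, not the repository's actual dispatch code.

```python
import time
from typing import Callable

def run_with_retry(
    run_agent: Callable[[], None],
    notify_failure: Callable[[str], None],
    max_retries: int = 1,
) -> None:
    """Run the agent, retrying a transient failure once before notifying."""
    attempt = 0
    while True:
        try:
            run_agent()
            return
        except RuntimeError as exc:  # assumed failure type surfaced by the runner
            if attempt < max_retries and _looks_transient(exc):
                attempt += 1
                time.sleep(30)  # brief back-off before the single automatic retry
                continue
            # e.g. post_failure_comment(...) from the earlier sketch
            notify_failure(str(exc))
            raise

def _looks_transient(exc: Exception) -> bool:
    # Crude placeholder classification: treat sandbox/gateway crashes as retryable.
    message = str(exc).lower()
    return "sandbox" in message or "gateway" in message
```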
Generated by retro agent from #800