Consolidate and prioritize silent-failure notification issues #988

@fullsend-ai-retro

What happened

The first code agent run for issue #644 (dispatch run 25549439711) failed on 2026-05-08 due to a corrupted sandbox gateway (k3s cluster). No failure notification was posted to issue #644. The issue sat idle for 3 days until rh-hemartin manually noticed and re-triggered with /code on 2026-05-11.

This is the same pattern documented in #934 (auto-retry on fast failures), #957 (failure notifications in post-run lifecycle), #976 (silent failure on iteration limit), and #872 (surface failure reasons). Issue #644/PR #800 is now the third confirmed instance of this pattern, after #934's example on issue #910/PR #920 and the case in #976.

What could go better

The failure notification gap is the single highest-impact issue for time-to-resolution across the platform. Each silent failure creates a multi-day delay that only ends when a human happens to check. The fix is straightforward (post a comment on failure), but the work is spread across 4+ issues with overlapping scope.

Confidence: High. This is a well-understood problem with clear solutions. The pattern has been observed at least 3 times now (issue #644/PR #800, issue #910/PR #920, and the case in #976).

Proposed change

Consolidate issues #934, #957, #976, and #872 into a single tracking issue or epic that covers the full failure-notification lifecycle:

  1. Immediate: Post-run scripts should always comment on the issue when a code/fix agent run fails (covers #957, "Add agent failure notifications to the post-run lifecycle").
  2. Auto-retry: On infrastructure failures (exit code 1, no output, <3 min runtime), automatically retry once before notifying (covers #934, "Auto-retry code agent on fast failures (exit code 1, no output)").
  3. Diagnostics: Include available failure context in the notification comment -- exit code, duration, link to logs if accessible (covers #872, "Surface /code failure reasons when .fullsend repo logs are inaccessible", and #976, "Fix agent fails silently when it cannot produce output within iteration limit").

Prioritize this as a single deliverable rather than 4 separate issues, since the implementation touches the same post-run scripts and dispatch workflows.
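
To make the three steps concrete, here is a minimal sketch of how a post-run script might decide between retrying and notifying, and what the notification comment could contain. It is illustrative only: `RunResult`, `next_action`, and `failure_comment` are hypothetical names, not existing fullsend code, and the thresholds are taken directly from the list above (exit code 1, no output, under 3 minutes, one retry).

```python
from dataclasses import dataclass
from typing import Optional

# "<3 min runtime" threshold from the proposal above.
FAST_FAILURE_SECONDS = 3 * 60


@dataclass
class RunResult:
    """Hypothetical summary of one agent run (not an existing fullsend type)."""
    exit_code: int
    duration_seconds: float
    output: str
    logs_url: Optional[str] = None
    attempt: int = 1  # 1 for the first run, 2 for the first retry, ...


def is_fast_infra_failure(run: RunResult) -> bool:
    """Heuristic from the proposal: exit code 1, no output, under 3 minutes."""
    return (
        run.exit_code == 1
        and not run.output.strip()
        and run.duration_seconds < FAST_FAILURE_SECONDS
    )


def next_action(run: RunResult, max_retries: int = 1) -> str:
    """Return "none", "retry", or "notify" for a finished run."""
    if run.exit_code == 0:
        return "none"
    if is_fast_infra_failure(run) and run.attempt <= max_retries:
        return "retry"
    return "notify"


def failure_comment(run: RunResult) -> str:
    """Build the comment body with whatever failure context is available."""
    lines = [
        "Agent run failed.",
        f"- Exit code: {run.exit_code}",
        f"- Duration: {run.duration_seconds:.0f}s",
        f"- Attempts: {run.attempt}",
    ]
    if run.logs_url:
        lines.append(f"- Logs: {run.logs_url}")
    else:
        lines.append("- Logs: not accessible")
    return "\n".join(lines)
```

Under these assumptions, a run that dies after 45 seconds with no output is retried once; if the retry fails the same way, `next_action` returns "notify" and the comment is posted. Posting the comment itself (for example via the GitHub API or CLI) is left out, since it depends on how the post-run scripts authenticate.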

Validation criteria

After implementation: (1) every code/fix agent infrastructure failure should result in a comment on the originating issue within 5 minutes, (2) transient failures (sandbox crashes) should be auto-retried at least once before notifying, and (3) the average time-to-human-awareness of agent failures should drop from days to minutes. Measure over the next 30 days of agent runs.
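
One possible way to check criteria (1) and (3) over the 30-day window is sketched below. It assumes each failure can be exported as a pair of timestamps (when the run failed, when the comment was posted, or `None` if no comment was posted), and it uses time-to-comment as a proxy for time-to-human-awareness. The record format and the `summarize` function are hypothetical.

```python
from datetime import datetime, timedelta
from statistics import mean

# Criterion (1): a comment within 5 minutes of the failure.
NOTIFY_DEADLINE = timedelta(minutes=5)


def summarize(failures):
    """Summarize (failed_at, notified_at or None) pairs.

    Returns counts of total, silent, and on-time notifications, plus the
    mean delay in minutes across failures that were notified at all.
    """
    notified = [(f, n) for f, n in failures if n is not None]
    on_time = sum(1 for f, n in notified if n - f <= NOTIFY_DEADLINE)
    delays = [(n - f).total_seconds() / 60 for f, n in notified]
    return {
        "total": len(failures),
        "silent": len(failures) - len(notified),
        "on_time": on_time,
        "mean_delay_minutes": mean(delays) if delays else None,
    }
```

A non-zero `silent` count means criterion (1) is not yet met, regardless of the mean delay.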


Generated by retro agent from #800
