Consolidate and prioritize silent-failure notification issues

## What happened

The first code agent run for issue #644 ([dispatch run 25549439711](https://github.com/fullsend-ai/.fullsend/actions/runs/25549439711)) failed on 2026-05-08 due to a corrupted sandbox gateway (k3s cluster). No failure notification was posted to issue #644. The issue sat idle for 3 days until rh-hemartin manually noticed and re-triggered with `/code` on 2026-05-11.

This is the same pattern documented in [#934](https://github.com/fullsend-ai/fullsend/issues/934) (auto-retry on fast failures), [#957](https://github.com/fullsend-ai/fullsend/issues/957) (failure notifications in post-run lifecycle), [#976](https://github.com/fullsend-ai/fullsend/issues/976) (silent failure on iteration limit), and [#872](https://github.com/fullsend-ai/fullsend/issues/872) (surface failure reasons). PR #800 is now the third confirmed instance of this pattern (after #934's example on PR #920).

## What could go better

The failure notification gap is the single highest-impact issue for time-to-resolution across the platform. Each silent failure creates a multi-day delay that only ends when a human happens to check. The fix is straightforward (post a comment on failure), but the work is spread across 4+ issues with overlapping scope.

Confidence: **High.** This is a well-understood problem with clear solutions. The pattern has been observed at least 3 times now (issue #644/PR #800, issue #910/PR #920, and the case in #976).

## Proposed change

Consolidate issues #934, #957, #976, and #872 into a single tracking issue or epic that covers the full failure-notification lifecycle:

1. **Immediate:** Post-run scripts should always comment on the issue when a code/fix agent run fails (covers #957).
2. **Auto-retry:** On infrastructure failures (exit code 1, no output, <3 min runtime), automatically retry once before notifying (covers #934).
3. **Diagnostics:** Include available failure context in the notification comment -- exit code, duration, link to logs if accessible (covers #872, #976).

Prioritize this as a single deliverable rather than 4 separate issues, since the implementation touches the same post-run scripts and dispatch workflows.

## Validation criteria

After implementation: (1) every code/fix agent infrastructure failure should result in a comment on the originating issue within 5 minutes, (2) transient failures (sandbox crashes) should be auto-retried at least once before notifying, and (3) the average time-to-human-awareness of agent failures should drop from days to minutes. Measure over the next 30 days of agent runs.

---
_Generated by retro agent from https://github.com/fullsend-ai/fullsend/pull/800_

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Consolidate and prioritize silent-failure notification issues #988

What happened

What could go better

Proposed change

Validation criteria

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Consolidate and prioritize silent-failure notification issues #988

Description

What happened

What could go better

Proposed change

Validation criteria

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions