# Failure Recovery

> What to do when things break. Because they will.

Long-horizon collaboration doesn't fail dramatically. It fails through small, invisible drift that compounds until something snaps.

The goal isn't preventing failure. The goal is making failure visible, bounded, and repairable.

This document tells you how.

---

## The Core Protocol

When something goes wrong, follow this sequence:

```
STOP → DIAGNOSE → ROLLBACK → NOTE
```

**Do not skip steps. Do not rush.**

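The "do not skip steps" rule can be pictured as a small state machine. The following is only an illustrative sketch: the names `Step` and `RecoveryRun` are hypothetical and not part of any tooling this document describes. It refuses any step that arrives out of order.

```python
from enum import Enum


class Step(Enum):
    STOP = "STOP"
    DIAGNOSE = "DIAGNOSE"
    ROLLBACK = "ROLLBACK"
    NOTE = "NOTE"


# The only order the protocol allows.
ORDER = [Step.STOP, Step.DIAGNOSE, Step.ROLLBACK, Step.NOTE]


class RecoveryRun:
    """Tracks one recovery and rejects skipped or repeated steps."""

    def __init__(self) -> None:
        self.completed = []

    @property
    def done(self) -> bool:
        return len(self.completed) == len(ORDER)

    def complete(self, step: Step) -> None:
        if self.done:
            raise ValueError("recovery already finished")
        expected = ORDER[len(self.completed)]
        if step is not expected:
            raise ValueError(f"expected {expected.name}, got {step.name}")
        self.completed.append(step)
```
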
---

## Step 1: STOP

The moment you sense something is off, stop.

Do not:
- Ask Claude to explain itself
- Try to fix it quickly
- Push forward hoping it resolves

Do:
- Pause all work
- State clearly: "We're stopping. Something is wrong."

**Prompt:**

> "Stop. Something isn't right. Do not generate further output until we diagnose what happened."

Stopping prevents the error from spreading. The earlier you stop, the less you have to repair.

---

## Step 2: DIAGNOSE

Identify what went wrong. Not why, just what.

Ask:
- What was the last known good state?
- What changed since then?
- Which file(s) are affected?
- Is this a content error, a structure error, or a drift error?

**Prompt:**

> "Let's diagnose. What was the last stable state? What changed? Which files are affected? Don't explain why yet; just identify what."

Common failure types:

| Type | Symptom |
|------|---------|
| Context drift | Claude acts on outdated or forgotten rules |
| File divergence | Multiple versions exist; Claude referenced the wrong one |
| Numeric error | A number was reconstructed instead of taken from the canonical file |
| Boundary violation | Work from one domain leaks into another |
| Tone/trust drift | Responses feel off, over-soft, or misaligned |

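If you want the diagnosis captured in a consistent shape, the questions and failure types above can be written down as a small record. This is a sketch only: `FailureType` and `Diagnosis` are hypothetical names, and the fields simply mirror the "Ask" list.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class FailureType(Enum):
    CONTEXT_DRIFT = "Context drift"
    FILE_DIVERGENCE = "File divergence"
    NUMERIC_ERROR = "Numeric error"
    BOUNDARY_VIOLATION = "Boundary violation"
    TONE_TRUST_DRIFT = "Tone/trust drift"


@dataclass
class Diagnosis:
    """Records what went wrong. Deliberately has no 'why' field."""

    last_good_state: str
    what_changed: str
    affected_files: List[str]
    failure_type: FailureType

    def summary(self) -> str:
        files = ", ".join(self.affected_files) or "none identified"
        return (
            f"[{self.failure_type.value}] last good: {self.last_good_state}; "
            f"changed: {self.what_changed}; files: {files}"
        )
```
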
---

## Step 3: ROLLBACK

Return to the last known good state.

Do not try to fix forward. Go back to stable ground first.

Actions:
- Identify the last clean version of affected file(s)
- Restore from archive or revert changes
- Re-read `RUNNING-DOCUMENT.md` to reset context
- Confirm Claude is aligned before proceeding

**Prompt:**

> "We're rolling back to [last stable state]. Discard work since [point of failure]. Re-read RUNNING-DOCUMENT.md and confirm you're aligned with current rules."

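The "restore from archive" action could look like the sketch below. It assumes a layout this document does not specify: an archive directory holding copies named `<stem>.<timestamp><suffix>`, with timestamps that sort lexicographically. The functions `list_versions` and `restore` are hypothetical helpers. Choosing which version counts as the last clean one remains a human decision.

```python
import shutil
from pathlib import Path
from typing import List


def list_versions(target: Path, archive_dir: Path) -> List[Path]:
    """Return archived copies of `target`, oldest first (assumed naming scheme)."""
    pattern = f"{target.stem}.*{target.suffix}"
    return sorted(archive_dir.glob(pattern), key=lambda p: p.name)


def restore(target: Path, version: Path) -> None:
    """Overwrite `target` with the chosen archived version."""
    if not version.is_file():
        raise FileNotFoundError(f"archived version not found: {version}")
    shutil.copy2(version, target)
```
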
---

## Step 4: NOTE

Document what happened so it doesn't repeat.

Update these files:

1. **RUNNING-DOCUMENT.md**: add to the Corrections Log
2. **This file**: add to Failure History below (if the pattern is new)

Capture:
- What failed
- What triggered it
- How it was repaired
- What rule or practice prevents recurrence

**Prompt:**

> "Log this failure in the Corrections Log: what happened, what we fixed, what prevents it next time."

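To keep entries consistent, a log row for the Failure History table in this file can be generated rather than typed by hand. A minimal sketch, assuming ISO dates (the document does not prescribe a date format) and a hypothetical helper name `history_row`:

```python
from datetime import date


def history_row(when: date, failure_type: str, what: str,
                resolution: str, prevention: str) -> str:
    """Format one Markdown table row: Date | Type | What Happened | Resolution | Prevention."""
    cells = [when.isoformat(), failure_type, what, resolution, prevention]
    # Escape pipes and flatten newlines so free text cannot break the table.
    safe = [c.replace("|", "\\|").replace("\n", " ").strip() for c in cells]
    return "| " + " | ".join(safe) + " |"
```
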
---

## Quick Reference Card

```
┌─────────────────────────────────────────────┐
│              FAILURE RECOVERY               │
├─────────────────────────────────────────────┤
│  1. STOP     - Halt immediately             │
│  2. DIAGNOSE - What broke, not why          │
│  3. ROLLBACK - Return to last stable state  │
│  4. NOTE     - Document to prevent repeat   │
└─────────────────────────────────────────────┘
```

---

## Failure History

<!-- Log significant failures here for pattern recognition -->

| Date | Type | What Happened | Resolution | Prevention |
|------|------|---------------|------------|------------|
| | | | | |

---

## Warning Signs

Catch drift early. Watch for:

- Claude contradicting earlier decisions
- Numbers that look plausible but weren't referenced
- Tone shifts (over-soft, over-confident, defensive)
- Confusion about which file is authoritative
- Claude asking questions that were already answered
- Work from one project appearing in another

When you see these: **STOP**. Don't wait for full failure.

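One of these signs, numbers that were not referenced, can be partly checked mechanically. The sketch below assumes the canonical numbers are available as plain text (the actual layout of any canonical file is not specified here) and uses a hypothetical helper, `unreferenced_numbers`, to list numbers in a draft that never appear in that text. It only flags candidates for review; it cannot tell whether a number is wrong.

```python
import re
from typing import List

# Matches integers, thousands-separated integers, and decimals (e.g. 42, 1,200, 3.5).
NUMBER = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?")


def unreferenced_numbers(draft: str, canonical: str) -> List[str]:
    """Return numbers found in `draft` that do not occur in `canonical`."""
    known = {m.replace(",", "") for m in NUMBER.findall(canonical)}
    flagged = []
    for match in NUMBER.findall(draft):
        value = match.replace(",", "")
        if value not in known and value not in flagged:
            flagged.append(value)
    return flagged
```
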
---

## Prevention Habits

The best way to handle failure is to make it rare:

1. **Read RUNNING-DOCUMENT.md every session**: context resets
2. **Reference CANONICAL-NUMBERS.md for any numbers**: no reconstruction
3. **One canonical file per domain**: no version confusion
4. **Archive, don't delete**: rollback requires history
5. **Name files with status**: know what's DRAFT vs FINAL
6. **Trust your instincts**: if something feels off, stop

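The "archive, don't delete" habit can be sketched as a copy-before-change helper. Everything here is an assumption for illustration: the function name `archive_copy`, the archive directory, and the `<stem>.<UTC timestamp><suffix>` naming are not defined by this document. The original file is left in place, and the copy carries a timestamp so versions sort in order.

```python
import shutil
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def archive_copy(path: Path, archive_dir: Path,
                 now: Optional[datetime] = None) -> Path:
    """Copy `path` into `archive_dir` under a timestamped name and return the copy."""
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y%m%dT%H%M%SZ")  # sorts lexicographically by time
    archive_dir.mkdir(parents=True, exist_ok=True)
    dest = archive_dir / f"{path.stem}.{stamp}{path.suffix}"
    shutil.copy2(path, dest)  # copy, never move: the original stays put
    return dest
```
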
---

## Remember

Failure isn't the enemy. Invisible failure is.

A system that breaks visibly and repairs cleanly is more trustworthy than one that pretends to be perfect.

When in doubt: **STOP → DIAGNOSE → ROLLBACK → NOTE**