Commit e61485e

Update FAILURE-RECOVERY.md
1 parent d0b9e76 commit e61485e

1 file changed

Lines changed: 170 additions & 0 deletions

FAILURE-RECOVERY.md

@@ -1 +1,171 @@
# Failure Recovery

> What to do when things break. Because they will.

Long-horizon collaboration doesn't fail dramatically. It fails through small, invisible drift that compounds until something snaps.

The goal isn't preventing failure. The goal is making failure visible, bounded, and repairable.

This document tells you how.

---

## The Core Protocol

When something goes wrong, follow this sequence:

```
STOP → DIAGNOSE → ROLLBACK → NOTE
```

**Do not skip steps. Do not rush.**
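
If you track recovery in a script or notebook, the "do not skip steps" rule can be enforced mechanically. The sketch below is one possible way to do that, not part of the protocol itself; the `Step` and `Recovery` names are hypothetical and exist only for this illustration.

```python
from enum import IntEnum


class Step(IntEnum):
    """The four protocol steps, numbered in the order they must happen."""
    STOP = 1
    DIAGNOSE = 2
    ROLLBACK = 3
    NOTE = 4


class Recovery:
    """Tracks progress through the sequence and refuses to skip a step."""

    def __init__(self) -> None:
        self.completed: list[Step] = []

    @property
    def done(self) -> bool:
        return len(self.completed) == len(Step)

    def complete(self, step: Step) -> None:
        if self.done:
            raise ValueError("all four steps are already complete")
        # The next allowed step is always the one after the last completed.
        expected = Step(len(self.completed) + 1)
        if step is not expected:
            raise ValueError(f"expected {expected.name}, got {step.name}")
        self.completed.append(step)
```

Calling `complete(Step.ROLLBACK)` before `complete(Step.DIAGNOSE)` raises a `ValueError`, which mirrors the rule that rollback only happens after you know what broke.
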

---

## Step 1: STOP

The moment you sense something is off — stop.

Do not:
- Ask Claude to explain itself
- Try to fix it quickly
- Push forward hoping it resolves

Do:
- Pause all work
- State clearly: "We're stopping. Something is wrong."

**Prompt:**

> "Stop. Something isn't right. Do not generate further output until we diagnose what happened."

Stopping prevents the error from spreading. The earlier you stop, the less you have to repair.

---

## Step 2: DIAGNOSE

Identify what went wrong. Not why — just what.

Ask:
- What was the last known good state?
- What changed since then?
- Which file(s) are affected?
- Is this a content error, a structure error, or a drift error?

**Prompt:**

> "Let's diagnose. What was the last stable state? What changed? Which files are affected? Don't explain why yet — just identify what."

Common failure types:

| Type | Symptom |
|------|---------|
| Context drift | Claude acts on outdated rules or has forgotten current ones |
| File divergence | Multiple versions exist and Claude referenced the wrong one |
| Numeric error | A number was reconstructed instead of referenced from the canonical file |
| Boundary violation | Work from one domain leaks into another |
| Tone/trust drift | Responses feel off, over-soft, or misaligned |
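
To keep a diagnosis limited to "what" rather than "why", it can help to record it in a fixed shape. The following is a minimal sketch under that assumption: the `FailureType` values mirror the table above, while the `Diagnosis` class and its field names are hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum


class FailureType(Enum):
    """Failure types taken from the table of common failure types."""
    CONTEXT_DRIFT = "Context drift"
    FILE_DIVERGENCE = "File divergence"
    NUMERIC_ERROR = "Numeric error"
    BOUNDARY_VIOLATION = "Boundary violation"
    TONE_TRUST_DRIFT = "Tone/trust drift"


@dataclass
class Diagnosis:
    """Answers to the diagnostic questions. There is no 'why' field on purpose."""
    last_good_state: str
    what_changed: str
    failure_type: FailureType
    affected_files: list[str] = field(default_factory=list)

    def summary(self) -> str:
        files = ", ".join(self.affected_files) or "none identified"
        return (
            f"{self.failure_type.value}: {self.what_changed} "
            f"(last good: {self.last_good_state}; files: {files})"
        )
```

The one-line `summary()` is short enough to paste straight into the Corrections Log during the NOTE step.
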

---

## Step 3: ROLLBACK

Return to the last known good state.

Do not try to fix forward. Go back to stable ground first.

Actions:
- Identify the last clean version of affected file(s)
- Restore from archive or revert changes
- Re-read `RUNNING-DOCUMENT.md` to reset context
- Confirm Claude is aligned before proceeding

**Prompt:**

> "We're rolling back to [last stable state]. Discard work since [point of failure]. Re-read RUNNING-DOCUMENT.md and confirm you're aligned with current rules."
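
If your archive is a plain folder of file copies, the "restore from archive" action might look like the sketch below. It assumes a layout this document does not prescribe: archived copies named `<filename>.<timestamp>` with a sortable timestamp, kept in a single archive directory. The function name `restore_latest` and the `.broken` suffix are hypothetical.

```python
import shutil
from pathlib import Path


def restore_latest(target: Path, archive_dir: Path) -> Path:
    """Copy the newest archived version of `target` back into place.

    Assumes archived copies are named "<filename>.<timestamp>" with a
    sortable timestamp such as 20250101T120000, so the last name in sorted
    order is the newest. Returns the archived file that was restored.
    """
    candidates = sorted(archive_dir.glob(f"{target.name}.*"))
    if not candidates:
        raise FileNotFoundError(f"no archived version of {target.name}")
    if target.exists():
        # Keep the failed version next to the target instead of deleting it,
        # so it is still available when you write up the failure.
        shutil.copy2(target, target.with_name(target.name + ".broken"))
    shutil.copy2(candidates[-1], target)
    return candidates[-1]
```

If your files live in version control instead, reverting the affected files to the last good commit serves the same purpose.
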

---

## Step 4: NOTE

Document what happened so it doesn't repeat.

Update these files:

1. **RUNNING-DOCUMENT.md** — add to Corrections Log
2. **This file** — add to Failure History below (if pattern is new)

Capture:
- What failed
- What triggered it
- How it was repaired
- What rule/practice prevents recurrence

**Prompt:**

> "Log this failure in the Corrections Log: what happened, what we fixed, what prevents it next time."
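
When adding an entry to the Failure History table, a small helper can keep the rows well-formed. This is only a sketch: the column order follows the Failure History table in this file, and the `history_row` function is a hypothetical name.

```python
from datetime import date


def history_row(when: date, failure_type: str, what_happened: str,
                resolution: str, prevention: str) -> str:
    """Format one Markdown table row: Date, Type, What Happened, Resolution, Prevention."""
    cells = [when.isoformat(), failure_type, what_happened, resolution, prevention]
    # Escape pipes and flatten newlines so a cell cannot break the table.
    escaped = [c.replace("|", "\\|").replace("\n", " ").strip() for c in cells]
    return "| " + " | ".join(escaped) + " |"
```

The returned string can be appended directly below the table's header rows.
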

---

## Quick Reference Card

```
┌─────────────────────────────────────────────┐
│              FAILURE RECOVERY               │
├─────────────────────────────────────────────┤
│  1. STOP     — Halt immediately             │
│  2. DIAGNOSE — What broke, not why          │
│  3. ROLLBACK — Return to last stable state  │
│  4. NOTE     — Document to prevent repeat   │
└─────────────────────────────────────────────┘
```

---

## Failure History

<!-- Log significant failures here for pattern recognition -->

| Date | Type | What Happened | Resolution | Prevention |
|------|------|---------------|------------|------------|
| | | | | |

---

## Warning Signs

Catch drift early. Watch for:

- Claude contradicting earlier decisions
- Numbers that look plausible but weren't referenced
- Tone shifts (over-soft, over-confident, defensive)
- Confusion about which file is authoritative
- Claude asking questions that were already answered
- Work from one project appearing in another

When you see these: **STOP**. Don't wait for full failure.

---

## Prevention Habits

The best way to handle failure is to make it rare:

1. **Read RUNNING-DOCUMENT.md every session** — context resets
2. **Reference CANONICAL-NUMBERS.md for any numbers** — no reconstruction
3. **One canonical file per domain** — no version confusion
4. **Archive, don't delete** — rollback requires history
5. **Name files with status** — know what's DRAFT vs FINAL
6. **Trust your instincts** — if something feels off, stop
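
Habit 2 can be partly checked by machine. The sketch below flags numbers that appear in a draft but not in the canonical file. It makes no assumption about how CANONICAL-NUMBERS.md is laid out and simply compares numeric tokens, so list numbering, dates, and similar digits will show up as false positives. The function name `unreferenced_numbers` is hypothetical.

```python
import re

# Matches integers and decimals, with optional thousands separators.
_NUMBER = re.compile(r"\d[\d,]*(?:\.\d+)?")


def _numbers(text: str) -> list[str]:
    """Return numeric tokens in order of appearance, with commas removed."""
    return [m.group().replace(",", "") for m in _NUMBER.finditer(text)]


def unreferenced_numbers(draft: str, canonical: str) -> list[str]:
    """List numbers used in `draft` that never appear in `canonical`."""
    known = set(_numbers(canonical))
    flagged: list[str] = []
    for n in _numbers(draft):
        if n not in known and n not in flagged:
            flagged.append(n)
    return flagged
```

A non-empty result is a prompt to look closer, not proof of a numeric error. Treat it as one more warning sign.
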

---

## Remember

Failure isn't the enemy. Invisible failure is.

A system that breaks visibly and repairs cleanly is more trustworthy than one that pretends to be perfect.

When in doubt: **STOP → DIAGNOSE → ROLLBACK → NOTE**