|
| 1 | +--- |
| 2 | +name: diagnose |
| 3 | +description: Disciplined diagnosis loop for hard bugs and performance regressions. Reproduce → minimise → hypothesise → instrument → fix → regression-test. Use when user says "diagnose this" / "debug this", reports a bug, says something is broken/throwing/failing, or describes a performance regression. |
| 4 | +--- |
| 5 | + |
| 6 | +# Diagnose |
| 7 | + |
| 8 | +A discipline for hard bugs. Skip phases only when explicitly justified. |
| 9 | + |
| 10 | +When exploring the codebase, use the project's domain glossary to get a clear mental model of the relevant modules, and check ADRs in the area you're touching. |
| 11 | + |
| 12 | +## Phase 1 — Build a feedback loop |
| 13 | + |
| 14 | +**This is the skill.** Everything else is mechanical. If you have a fast, deterministic, agent-runnable pass/fail signal for the bug, you will find the cause — bisection, hypothesis-testing, and instrumentation all just consume that signal. If you don't have one, no amount of staring at code will save you. |
| 15 | + |
| 16 | +Spend disproportionate effort here. **Be aggressive. Be creative. Refuse to give up.** |
| 17 | + |
| 18 | +### Ways to construct one — try them in roughly this order |
| 19 | + |
| 20 | +1. **Failing test** at whatever seam reaches the bug — unit, integration, e2e. |
| 21 | +2. **Curl / HTTP script** against a running dev server. |
| 22 | +3. **CLI invocation** with a fixture input, diffing stdout against a known-good snapshot. |
| 23 | +4. **Headless browser script** (Playwright / Puppeteer) — drives the UI, asserts on DOM/console/network. |
| 24 | +5. **Replay a captured trace.** Save a real network request / payload / event log to disk; replay it through the code path in isolation. |
| 25 | +6. **Throwaway harness.** Spin up a minimal subset of the system (one service, mocked deps) that exercises the bug code path with a single function call. |
| 26 | +7. **Property / fuzz loop.** If the bug is "sometimes wrong output", run 1000 random inputs and look for the failure mode. |
| 27 | +8. **Bisection harness.** If the bug appeared between two known states (commit, dataset, version), automate "boot at state X, check, repeat" so you can `git bisect run` it. |
| 28 | +9. **Differential loop.** Run the same input through old-version vs new-version (or two configs) and diff outputs. |
| 29 | +10. **HITL bash script.** Last resort. If a human must click, drive _them_ with `scripts/hitl-loop.template.sh` so the loop is still structured. Captured output feeds back to you. |
| 30 | + |
| 31 | +Build the right feedback loop, and the bug is 90% fixed. |
| 32 | + |
| 33 | +### Iterate on the loop itself |
| 34 | + |
| 35 | +Treat the loop as a product. Once you have _a_ loop, ask: |
| 36 | + |
| 37 | +- Can I make it faster? (Cache setup, skip unrelated init, narrow the test scope.) |
| 38 | +- Can I make the signal sharper? (Assert on the specific symptom, not "didn't crash".) |
| 39 | +- Can I make it more deterministic? (Pin time, seed RNG, isolate filesystem, freeze network.) |
| 40 | + |
| 41 | +A 30-second flaky loop is barely better than no loop. A 2-second deterministic loop is a debugging superpower. |
| 42 | + |
| 43 | +### Non-deterministic bugs |
| 44 | + |
| 45 | +The goal is not a clean repro but a **higher reproduction rate**. Loop the trigger 100×, parallelise, add stress, narrow timing windows, inject sleeps. A 50%-flake bug is debuggable; 1% is not — keep raising the rate until it's debuggable. |
| 46 | + |
| 47 | +### When you genuinely cannot build a loop |
| 48 | + |
| 49 | +Stop and say so explicitly. List what you tried. Ask the user for: (a) access to whatever environment reproduces it, (b) a captured artifact (HAR file, log dump, core dump, screen recording with timestamps), or (c) permission to add temporary production instrumentation. Do **not** proceed to hypothesise without a loop. |
| 50 | + |
| 51 | +Do not proceed to Phase 2 until you have a loop you believe in. |
| 52 | + |
| 53 | +## Phase 2 — Reproduce |
| 54 | + |
| 55 | +Run the loop. Watch the bug appear. |
| 56 | + |
| 57 | +Confirm: |
| 58 | + |
| 59 | +- [ ] The loop produces the failure mode the **user** described — not a different failure that happens to be nearby. Wrong bug = wrong fix. |
| 60 | +- [ ] The failure is reproducible across multiple runs (or, for non-deterministic bugs, reproducible at a high enough rate to debug against). |
| 61 | +- [ ] You have captured the exact symptom (error message, wrong output, slow timing) so later phases can verify the fix actually addresses it. |
| 62 | + |
| 63 | +Do not proceed until you reproduce the bug. |
| 64 | + |
| 65 | +## Phase 3 — Hypothesise |
| 66 | + |
| 67 | +Generate **3–5 ranked hypotheses** before testing any of them. Single-hypothesis generation anchors on the first plausible idea. |
| 68 | + |
| 69 | +Each hypothesis must be **falsifiable**: state the prediction it makes. |
| 70 | + |
| 71 | +> Format: "If <X> is the cause, then <changing Y> will make the bug disappear / <changing Z> will make it worse." |
| 72 | +
|
| 73 | +If you cannot state the prediction, the hypothesis is a vibe — discard or sharpen it. |
| 74 | + |
| 75 | +**Show the ranked list to the user before testing.** They often have domain knowledge that re-ranks instantly ("we just deployed a change to #3"), or know hypotheses they've already ruled out. Cheap checkpoint, big time saver. Don't block on it — proceed with your ranking if the user is AFK. |
| 76 | + |
| 77 | +## Phase 4 — Instrument |
| 78 | + |
| 79 | +Each probe must map to a specific prediction from Phase 3. **Change one variable at a time.** |
| 80 | + |
| 81 | +Tool preference: |
| 82 | + |
| 83 | +1. **Debugger / REPL inspection** if the env supports it. One breakpoint beats ten logs. |
| 84 | +2. **Targeted logs** at the boundaries that distinguish hypotheses. |
| 85 | +3. Never "log everything and grep". |
| 86 | + |
| 87 | +**Tag every debug log** with a unique prefix, e.g. `[DEBUG-a4f2]`. Cleanup at the end becomes a single grep. Untagged logs survive; tagged logs die. |
| 88 | + |
| 89 | +**Perf branch.** For performance regressions, logs are usually wrong. Instead: establish a baseline measurement (timing harness, `performance.now()`, profiler, query plan), then bisect. Measure first, fix second. |
| 90 | + |
| 91 | +## Phase 5 — Fix + regression test |
| 92 | + |
| 93 | +Write the regression test **before the fix** — but only if there is a **correct seam** for it. |
| 94 | + |
| 95 | +A correct seam is one where the test exercises the **real bug pattern** as it occurs at the call site. If the only available seam is too shallow (single-caller test when the bug needs multiple callers, unit test that can't replicate the chain that triggered the bug), a regression test there gives false confidence. |
| 96 | + |
| 97 | +**If no correct seam exists, that itself is the finding.** Note it. The codebase architecture is preventing the bug from being locked down. Flag this for the next phase. |
| 98 | + |
| 99 | +If a correct seam exists: |
| 100 | + |
| 101 | +1. Turn the minimised repro into a failing test at that seam. |
| 102 | +2. Watch it fail. |
| 103 | +3. Apply the fix. |
| 104 | +4. Watch it pass. |
| 105 | +5. Re-run the Phase 1 feedback loop against the original (un-minimised) scenario. |
| 106 | + |
| 107 | +## Phase 6 — Cleanup + post-mortem |
| 108 | + |
| 109 | +Required before declaring done: |
| 110 | + |
| 111 | +- [ ] Original repro no longer reproduces (re-run the Phase 1 loop) |
| 112 | +- [ ] Regression test passes (or absence of seam is documented) |
| 113 | +- [ ] All `[DEBUG-...]` instrumentation removed (`grep` the prefix) |
| 114 | +- [ ] Throwaway prototypes deleted (or moved to a clearly-marked debug location) |
| 115 | +- [ ] The hypothesis that turned out correct is stated in the commit / PR message — so the next debugger learns |
| 116 | + |
| 117 | +**Then ask: what would have prevented this bug?** If the answer involves architectural change (no good test seam, tangled callers, hidden coupling) hand off to the `/improve-codebase-architecture` skill with the specifics. Make the recommendation **after** the fix is in, not before — you have more information now than when you started. |
0 commit comments