Summary
Add the ability to define and store runbooks for common incidents so both AI agents and SREs can reference structured resolution steps during incident response.
Problem
When incidents occur, resolution steps often live in scattered docs, Slack threads, or individual memory.
This leads to:
- Slower resolution times
- Inconsistent handling
- Knowledge silos
- Limited AI-assisted troubleshooting
We need a structured, queryable way to store and retrieve incident runbooks.
Proposed Solution
Introduce Runbooks as a first-class entity:
- Create / edit runbooks
- Tag by service, severity, category
- Structured steps (checklist format)
- Attach logs, queries, dashboards, or links
- Support markdown
Each runbook should include:
- Title
- Description
- Affected services
- Trigger conditions
- Step-by-step resolution instructions
- Escalation notes
- Post-incident checklist
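As a starting point for discussion, the fields above could map onto a record like the one below. This is only a sketch: the names `Runbook`, `Step`, `tags`, and `next_step` are hypothetical, and the sample steps are invented for illustration, not taken from an existing implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Step:
    """One checklist item in a runbook (hypothetical shape)."""
    text: str
    done: bool = False


@dataclass
class Runbook:
    """Hypothetical record mirroring the fields listed in this proposal."""
    title: str
    description: str
    affected_services: List[str]
    trigger_conditions: List[str]
    steps: List[Step]
    escalation_notes: str = ""
    post_incident_checklist: List[str] = field(default_factory=list)
    # Tags such as service, severity, category (assumed to be plain strings).
    tags: Dict[str, str] = field(default_factory=dict)

    def next_step(self) -> Optional[Step]:
        """Return the first unchecked step, or None when all are done."""
        for step in self.steps:
            if not step.done:
                return step
        return None


# Illustrative instance; the step text is made up for the example.
example = Runbook(
    title="High 5xx Errors – API Service",
    description="Resolution steps for elevated 5xx responses.",
    affected_services=["api-service"],
    trigger_conditions=["error rate spike"],
    steps=[Step("Check recent deploys"), Step("Inspect upstream dependencies")],
    tags={"service": "api-service", "severity": "high"},
)
```

Keeping steps as structured items rather than free-form markdown is what would make the checklist trackable by both SREs and agents; markdown could still live inside each step's text.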
AI Integration
Runbooks should be:
- Searchable via semantic search
- Automatically suggested during incidents
- Usable by AI agents to execute or recommend steps
- Context-aware based on error signals
Example:
If the error rate spikes on api-service, suggest the “High 5xx Errors – API Service” runbook.
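To make the example concrete, here is a rough sketch of how a suggestion could be ranked. It uses plain exact matching on service and signal names as a stand-in for semantic search, which is not implemented here. `RunbookRef`, `suggest_runbooks`, the signal names, and the second catalog entry are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RunbookRef:
    """Minimal runbook metadata used for matching (hypothetical shape)."""
    title: str
    services: List[str]
    signals: List[str]


def suggest_runbooks(runbooks: List[RunbookRef], service: str, signal: str) -> List[str]:
    """Rank runbook titles by how well they match an incoming error signal.

    A service match scores 2 and a signal match scores 1; runbooks with
    no match at all are left out. Ties are broken alphabetically by title.
    """
    scored = []
    for rb in runbooks:
        score = 0
        if service in rb.services:
            score += 2
        if signal in rb.signals:
            score += 1
        if score > 0:
            scored.append((score, rb.title))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [title for _, title in scored]


# Illustrative catalog; signal names and the second entry are assumptions.
catalog = [
    RunbookRef("High 5xx Errors – API Service", ["api-service"], ["high_5xx_rate"]),
    RunbookRef("Database Connection Saturation", ["orders-db"], ["connection_pool_exhausted"]),
]

suggestions = suggest_runbooks(catalog, "api-service", "high_5xx_rate")
```

A real implementation would likely combine this kind of tag filter with embedding-based search over the runbook text, but the tag filter alone is enough to cover the example above.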
Benefits
- Faster MTTR
- Consistent resolution
- Easier onboarding of new SREs
- Enables AI-assisted incident response
- Institutional knowledge capture
Future Extensions
- Link runbooks to specific alert rules
- Auto-trigger runbooks
- Execution logs tied to incidents
- Feedback loop to improve runbooks over time
Would love community feedback on:
- How you currently manage runbooks
- What fields are essential
- Whether AI-assisted execution would be useful
Open to refining the scope before implementation.