
Commit 8cbe13c

docs: add on-call escalation policy (doc 18)
- P1/P2/P3/P4 severity definitions with response targets
- Escalation path: Mattermost #ops-alerts -> primary -> secondary -> manager
- Incident response runbooks: service down, DB unreachable, identity down
- RTO/RPO targets table (matches playbooks/test-restore.yml)
- Incident report template
- Maintenance window schedule
- Zabbix silence API example
- Contact directory template
1 parent 68577fa commit 8cbe13c

1 file changed

Lines changed: 256 additions & 0 deletions

@@ -0,0 +1,256 @@
# IT-Stack On-Call Escalation Policy
**Document:** 18
**Location:** `docs/05-guides/18-on-call-policy.md`
**Last Updated:** March 2026
---
## Overview
This document defines the on-call rotation, alert escalation path, and incident response procedures for IT-Stack production environments. It is intended for organizations that have deployed IT-Stack and need a structured approach to handling after-hours incidents.
For alert configuration details see:
- [Admin Runbook](17-admin-runbook.md) — daily ops / Zabbix / Graylog
- [Ansible repo](../../../it-stack-ansible/) — `roles/zabbix/tasks/mattermost-alerts.yml`
---
## On-Call Rotation
### Minimum Staffing
| Role | Responsibility | Min. Required |
|------|---------------|---------------|
| Primary On-Call | First responder — triages all P1/P2 alerts | 1 person |
| Secondary On-Call | Backup if primary is unreachable for 15 min | 1 person |
| Escalation Manager | Declares major incidents, coordinates teams | 1 person |
### Rotation Schedule
- **Rotation length:** 1 week (Monday 08:00 → Monday 07:59)
- **Handover:** Monday standup — outgoing on-call reviews open incidents
- **Tool options:** PagerDuty · Opsgenie · Mattermost `#on-call` channel · phone/SMS
### Handover Checklist
Before ending your on-call week:
- [ ] All P1/P2 incidents resolved or formally handed over
- [ ] Outstanding Zabbix problems acknowledged or silenced with explanation
- [ ] Graylog WARN/ERROR backlog reviewed; false positives suppressed
- [ ] Backup verification (`make backup-verify`) run and clean
- [ ] Handover notes posted to `#ops-handover` Mattermost channel
---
## Alert Severity Levels
| Severity | Name | Examples | Response Target |
|----------|------|---------|----------------|
| **P1** | Critical | Identity server down · PostgreSQL unreachable · All email bouncing | 15 min |
| **P2** | High | Single service down · DB replication lag > 60s · Certificate expiry < 7 days | 1 hour |
| **P3** | Medium | High CPU > 90% for 15 min · Disk > 80% · Redis evictions | Next business day |
| **P4** | Low | Log volume spike · Backup older than 48h · Slow query | 3 business days |
---
## Escalation Path
```
Alert fires in Zabbix
        │
        ▼
Mattermost #ops-alerts (immediate, automated)
        │
        ├─ P3/P4 ──► Primary On-Call acknowledges within 4h (business hours)
        │
        ├─ P2 ──────► Primary On-Call responds within 1h (any time)
        │                 │
        │                 └─ No response in 30 min ──► Page Secondary On-Call
        │
        └─ P1 ──────► Page Primary On-Call immediately
                          │
                          ├─ No response in 15 min ──► Page Secondary On-Call
                          │
                          └─ No response in another 15 min ──► Page Escalation Manager
```
### Notification Channels
| Channel | Used For | Tool |
|---------|---------|------|
| Mattermost `#ops-alerts` | All automated Zabbix/Graylog alerts | Zabbix webhook |
| Mattermost `#incidents` | Active incident coordination | Manual |
| Mattermost `#on-call` | Rotation schedule, ack confirmations | Manual |
| Phone / SMS | P1 escalation when Mattermost is down | PagerDuty or manual call |
| Email | P3/P4 non-urgent notifications | Zabbix SMTP action |
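
For scripted posts into these channels, a Mattermost incoming webhook takes a single `curl`. A minimal sketch, assuming a webhook has already been created under Integrations → Incoming Webhooks and is allowed to post to `#incidents`; the URL and message below are placeholders:

```bash
# Post a manual escalation notice to Mattermost via an incoming webhook.
# The webhook URL below is a placeholder, not a real endpoint.
MM_WEBHOOK="https://mattermost.yourdomain.com/hooks/xxx-generated-key-xxx"

curl -s -X POST "$MM_WEBHOOK" \
  -H 'Content-Type: application/json' \
  -d '{
    "channel": "incidents",
    "text": "**P1 escalation:** primary on-call unreachable for 15 min, paging secondary."
  }'
```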
---
## Incident Response Procedures
### P1 — Critical Service Down
**Goal: Restore service within RTO target (see below)**
1. **Acknowledge** the Zabbix alert immediately (prevents repeat paging).
2. **Identify** affected service and server:
```bash
ansible all -i inventory/hosts.ini -m ping
ssh ansible@<affected-host> systemctl status <service>
```
3. **Attempt quick restart** (acceptable for non-data-loss scenarios; an Ansible alternative is sketched after this list):
```bash
ssh ansible@<host> sudo systemctl restart <service>
```
4. **If restart fails** — check logs:
```bash
ssh ansible@<host> journalctl -u <service> -n 50 --no-pager
# Or check Graylog stream for that host
```
5. **Escalate** if not resolved in 30 minutes. Post in `#incidents`.
6. **Post-incident:** File an incident report within 24 hours (see template below).
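
As an alternative to the per-host `ssh` in step 3, the same restart can be issued through Ansible ad-hoc mode, reusing the inventory from step 2 (`<affected-host>` and `<service>` are placeholders as above):

```bash
# Restart the affected unit on one host via Ansible; -b escalates to root.
ansible <affected-host> -i inventory/hosts.ini -b \
  -m ansible.builtin.systemd -a "name=<service> state=restarted"
```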
### P1 — Database Unreachable
1. Verify PostgreSQL is running on `lab-db1`:
```bash
ssh ansible@lab-db1 systemctl status postgresql
ssh ansible@lab-db1 "psql -U postgres -c '\l'"
```
2. Check disk space (common cause):
```bash
ssh ansible@lab-db1 df -h /var/lib/postgresql
```
3. Check for locks / stuck connections (a termination sketch follows this list):
```bash
ssh ansible@lab-db1 "psql -U postgres -c \"SELECT pid, state, query_start, query FROM pg_stat_activity WHERE state != 'idle';\""
```
4. If disk full — purge old backups or WAL segments:
```bash
ssh ansible@lab-db1 "find /var/backups/it-stack/postgres -name '*.dump' -mtime +3 -delete"
```
5. Backup restoration (break-glass):
```bash
ansible-playbook -i inventory/hosts.ini playbooks/test-restore.yml --tags postgres
```
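
If the step 3 query shows a single backend blocking everything, it can be terminated by PID; a sketch, where the PID `12345` is illustrative and comes from the `pg_stat_activity` output:

```bash
# Kill one stuck backend by PID; pg_cancel_backend(pid) is the gentler first try.
ssh ansible@lab-db1 "psql -U postgres -c 'SELECT pg_terminate_backend(12345);'"
```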
### P1 — Identity (FreeIPA / Keycloak) Down
All services authenticate through Keycloak → FreeIPA. Impact is **all users locked out**.
1. Check FreeIPA health:
```bash
ssh ansible@lab-id1 ipactl status
ssh ansible@lab-id1 curl -sf https://lab-id1/ipa/ui/ | head -5
```
2. Restart FreeIPA services:
```bash
ssh ansible@lab-id1 sudo ipactl restart
```
3. Check Keycloak:
```bash
ssh ansible@lab-id1 systemctl status keycloak
ssh ansible@lab-id1 curl -sf http://localhost:8080/health/ready
```
4. Emergency user access (break-glass local accounts):
- Each server has a local `ansible` sudo user (key-only)
- Use it for temporary direct logins during recovery to avoid the SSO dependency (sketch below)
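
A minimal sketch of that break-glass path; the key path is illustrative:

```bash
# SSO is down, so authenticate with the local key-only ansible account instead.
ssh -i ~/.ssh/ansible_ed25519 ansible@lab-id1 sudo ipactl status
```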
---
## RTO / RPO Targets
| Component | RPO | RTO |
|-----------|-----|-----|
| PostgreSQL (single DB) | 24 hours | 15 min (< 1 GB) / 60 min (< 10 GB) |
| PostgreSQL (full cluster) | 24 hours | 2–4 hours |
| Nextcloud files | 24 hours | 30 min (rsync) |
| Service configs | 24 hours | 5 min (tar extract) |
| FreeIPA (identity) | N/A (rebuilt via Ansible) | 45 min |
| Keycloak | N/A (state in PostgreSQL) | 15 min |
| Traefik | N/A (stateless) | 5 min |
> RTO = time from start of restore to service available.
> RPO = maximum acceptable data loss, i.e. the age of the newest backup at failure time.
> Run `make test-restore` quarterly to verify these targets remain achievable.
---
## Incident Report Template
Post to `#incidents` (or a ticket in GLPI) after every P1/P2:
```
## Incident Report — [SERVICE] [DATE]

**Severity:** P1 / P2
**Duration:** HH:MM – HH:MM UTC (X minutes)
**Impact:** [Which users / services were affected]
**Root Cause:** [What broke and why]
**Detection:** Zabbix alert / user report / monitoring gap
**Timeline:**
- HH:MM – Alert fired
- HH:MM – Acknowledged by [name]
- HH:MM – Root cause identified
- HH:MM – Fix applied
- HH:MM – Service restored
**Fix Applied:** [Exact commands or Ansible run]
**Prevention:** [What will stop this happening again]
**Follow-up Issues:** [GLPI ticket numbers]
```
---
## Scheduled Maintenance Windows
| Frequency | Time | Duration | Purpose |
|-----------|------|----------|---------|
| Weekly | Sunday 02:00–04:00 UTC | 2 hours | Security patches (`make harden`) |
| Monthly | First Sunday 01:00–05:00 UTC | 4 hours | OS upgrades, cert rotation |
| Quarterly | TBD | Half day | DR test, `make test-restore`, load test |
Announce maintenance in Mattermost `#maintenance` at least **24 hours** in advance.
---
## Silence / Maintenance Mode
To suppress Zabbix alerts during maintenance:
```bash
# Via Zabbix API (from ansible-playbook or manual).
# Note: "ALL" below is a placeholder, not a valid hostid. Substitute the numeric
# IDs of the hosts to silence, or pass "groups" with group IDs instead.
curl -s -X POST https://zabbix.yourdomain.com/api_jsonrpc.php \
  -H 'Content-Type: application/json' \
  -d '{
    "jsonrpc":"2.0","method":"maintenance.create",
    "params":{
      "name":"Maintenance window '"$(date +%F)"'",
      "active_since":'"$(date +%s)"',
      "active_till":'"$(($(date +%s) + 7200))"',
      "hosts":[{"hostid":"ALL"}],
      "timeperiods":[{"period":7200}]
    },
    "auth":"<ZABBIX_TOKEN>","id":1
  }'
```
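
To confirm the window took effect, or to fetch its ID for early removal via `maintenance.delete`, the same endpoint and token also serve a `maintenance.get` call:

```bash
# List maintenance windows (same endpoint and API token as above).
curl -s -X POST https://zabbix.yourdomain.com/api_jsonrpc.php \
  -H 'Content-Type: application/json' \
  -d '{"jsonrpc":"2.0","method":"maintenance.get",
       "params":{"output":["maintenanceid","name","active_till"]},
       "auth":"<ZABBIX_TOKEN>","id":2}'
```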
Or use the Zabbix web UI → Configuration → Maintenance.
---
## Contact Directory Template
> Replace with your organisation's actual contacts.
| Role | Name | Mattermost | Phone |
|------|------|-----------|-------|
| Primary On-Call | *See rotation* | `@primary-oncall` ||
| Secondary On-Call | *See rotation* | `@secondary-oncall` ||
| Escalation Manager | *TBD* | `@it-manager` ||
| PostgreSQL DBA | *TBD* | `@dba` ||
| Network Admin | *TBD* | `@netops` ||
---
*This document should be reviewed and contact information updated at least quarterly.*
*Run `make test-restore` quarterly to verify RTO/RPO targets remain achievable.*
