# IT-Stack On-Call Escalation Policy

**Document:** 18
**Location:** `docs/05-guides/18-on-call-policy.md`
**Last Updated:** March 2026

---

## Overview

This document defines the on-call rotation, alert escalation path, and incident response procedures for IT-Stack production environments. It is intended for organizations that have deployed IT-Stack and need a structured approach to handling after-hours incidents.

For alert configuration details see:
- [Admin Runbook](17-admin-runbook.md) — daily ops / Zabbix / Graylog
- [Ansible repo](../../../it-stack-ansible/) — `roles/zabbix/tasks/mattermost-alerts.yml`

---

## On-Call Rotation

### Minimum Staffing

| Role | Responsibility | Min. Required |
|------|----------------|---------------|
| Primary On-Call | First responder — triages all P1/P2 alerts | 1 person |
| Secondary On-Call | Backup if primary unreachable (15 min) | 1 person |
| Escalation Manager | Declares major incidents, coordinates teams | 1 person |

### Rotation Schedule

- **Rotation length:** 1 week (Monday 08:00 → Monday 07:59)
- **Handover:** Monday standup — outgoing on-call reviews open incidents
- **Tool options:** PagerDuty · Opsgenie · Mattermost `#on-call` channel · phone/SMS
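
Picking this week's primary can be automated from the rotation length above. A minimal sketch, assuming a hard-coded roster keyed off the ISO week number — the roster names and the `oncall_for_week` helper are placeholders, not part of IT-Stack:

```bash
#!/usr/bin/env bash
# Rotate a fixed roster by ISO week number (01-53).
# Roster names are placeholders — substitute your own team, in order.
oncall_for_week() {
  local roster=(alice bob carol)
  local week=$1                                      # e.g. "07"
  echo "${roster[$(( 10#$week % ${#roster[@]} ))]}"  # 10# strips leading zero
}

oncall_for_week "$(date +%V)"   # prints this week's primary on-call
```

The same index shifted by one gives the secondary, so a single roster file can drive both slots.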

### Handover Checklist

Before ending your on-call week:
- [ ] All P1/P2 incidents resolved or formally handed over
- [ ] Outstanding Zabbix problems acknowledged or silenced with an explanation
- [ ] Graylog WARN/ERROR backlog reviewed; false positives suppressed
- [ ] Backup verification (`make backup-verify`) run and clean
- [ ] Handover notes posted to the `#ops-handover` Mattermost channel
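
The final checklist item can be scripted. A hedged sketch, assuming you have created a Mattermost incoming webhook for `#ops-handover` — the `MM_WEBHOOK_URL` variable is an assumption, not something IT-Stack provisions:

```bash
#!/usr/bin/env bash
# Build the handover message; post it only if a webhook URL is configured.
payload=$(printf '{"channel":"ops-handover","text":"On-call handover %s — checklist complete, no open P1/P2."}' "$(date +%F)")

if [ -n "${MM_WEBHOOK_URL:-}" ]; then
  curl -s -X POST -H 'Content-Type: application/json' \
    -d "$payload" "$MM_WEBHOOK_URL"
fi
```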

---

## Alert Severity Levels

| Severity | Name | Examples | Response Target |
|----------|------|----------|-----------------|
| **P1** | Critical | Identity server down · PostgreSQL unreachable · All email bouncing | 15 min |
| **P2** | High | Single service down · DB replication lag > 60s · Certificate expiry < 7 days | 1 hour |
| **P3** | Medium | High CPU > 90% for 15 min · Disk > 80% · Redis evictions | Next business day |
| **P4** | Low | Log volume spike · Backup older than 48h · Slow query | 3 business days |

---

## Escalation Path

```
Alert fires in Zabbix
        │
        ▼
Mattermost #ops-alerts (immediate, automated)
        │
        ├─ P3/P4 ──► Primary On-Call acknowledges within 4h (business hours)
        │
        ├─ P2 ──────► Primary On-Call responds within 1h (any time)
        │                  │
        │                  └─ No response in 30 min ──► Page Secondary On-Call
        │
        └─ P1 ──────► Page Primary On-Call immediately
                           │
                           ├─ No response in 15 min ──► Page Secondary On-Call
                           │
                           └─ No response in another 15 min ──► Page Escalation Manager
```
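
The P1 branch of this diagram reduces to a time-based lookup that alerting glue can evaluate on each re-check. A sketch — the `p1_page_target` helper is illustrative, not an IT-Stack component:

```bash
#!/usr/bin/env bash
# Who should be paged, given minutes elapsed since the P1 page went out?
# Thresholds mirror the escalation path above: 15 min, then another 15 min.
p1_page_target() {
  local mins=$1
  if   [ "$mins" -lt 15 ]; then echo "primary"
  elif [ "$mins" -lt 30 ]; then echo "secondary"
  else                          echo "escalation-manager"
  fi
}
```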

### Notification Channels

| Channel | Used For | Tool |
|---------|----------|------|
| Mattermost `#ops-alerts` | All automated Zabbix/Graylog alerts | Zabbix webhook |
| Mattermost `#incidents` | Active incident coordination | Manual |
| Mattermost `#on-call` | Rotation schedule, ack confirmations | Manual |
| Phone / SMS | P1 escalation when Mattermost is down | PagerDuty or manual |
| Email | P3/P4 non-urgent notifications | Zabbix SMTP action |

---

## Incident Response Procedures

### P1 — Critical Service Down

**Goal: restore service within the RTO target (see below).**

1. **Acknowledge** the Zabbix alert immediately (prevents repeat paging).
2. **Identify** the affected service and server:
   ```bash
   ansible all -i inventory/hosts.ini -m ping
   ssh ansible@<affected-host> systemctl status <service>
   ```
3. **Attempt a quick restart** (acceptable for non-data-loss scenarios):
   ```bash
   ssh ansible@<host> sudo systemctl restart <service>
   ```
4. **If the restart fails** — check logs:
   ```bash
   ssh ansible@<host> journalctl -u <service> -n 50 --no-pager
   # Or check the Graylog stream for that host
   ```
5. **Escalate** if not resolved in 30 minutes. Post in `#incidents`.
6. **Post-incident:** File an incident report within 24 hours (see template below).
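
Steps 2–4 above can be rolled into one helper. A sketch only — the `triage` function is illustrative, host and service names are placeholders, and it assumes the same `ansible` SSH user used throughout this guide:

```bash
#!/usr/bin/env bash
# Triage a down service: check, restart once, re-check, dump logs on failure.
triage() {
  local host=$1 service=$2
  ssh "ansible@$host" systemctl is-active --quiet "$service" && return 0
  ssh "ansible@$host" sudo systemctl restart "$service"
  sleep 5
  ssh "ansible@$host" systemctl is-active --quiet "$service" || \
    ssh "ansible@$host" journalctl -u "$service" -n 50 --no-pager
}

# Usage: triage lab-db1 postgresql
```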

### P1 — Database Unreachable

1. Verify PostgreSQL is running on `lab-db1`:
   ```bash
   ssh ansible@lab-db1 systemctl status postgresql
   ssh ansible@lab-db1 psql -U postgres -c '\l'
   ```
2. Check disk space (a common cause):
   ```bash
   ssh ansible@lab-db1 df -h /var/lib/postgresql
   ```
3. Check for locks / stuck connections:
   ```bash
   ssh ansible@lab-db1 psql -U postgres -c "SELECT pid, state, query_start, query FROM pg_stat_activity WHERE state != 'idle';"
   ```
4. If the disk is full — purge old backups or WAL segments:
   ```bash
   ssh ansible@lab-db1 find /var/backups/it-stack/postgres -name '*.dump' -mtime +3 -delete
   ```
5. Restore from backup (break-glass):
   ```bash
   ansible-playbook -i inventory/hosts.ini playbooks/test-restore.yml --tags postgres
   ```

### P1 — Identity (FreeIPA / Keycloak) Down

All services authenticate through Keycloak → FreeIPA, so the impact is **all users locked out**.

1. Check FreeIPA health:
   ```bash
   ssh ansible@lab-id1 ipactl status
   ssh ansible@lab-id1 curl -sf https://lab-id1/ipa/ui/ | head -5
   ```
2. Restart FreeIPA services:
   ```bash
   ssh ansible@lab-id1 sudo ipactl restart
   ```
3. Check Keycloak:
   ```bash
   ssh ansible@lab-id1 systemctl status keycloak
   ssh ansible@lab-id1 curl -sf http://localhost:8080/health/ready
   ```
4. Emergency user access (break-glass local accounts):
   - Each server has a local `ansible` sudo user (key-only)
   - Log in directly as this user to avoid the SSO dependency during recovery
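
The two health probes from steps 1 and 3 can be looped from any admin host. A hedged sketch — hostnames follow the `lab-id1` convention used above, and `-k` assumes an internal CA that your workstation may not trust:

```bash
#!/usr/bin/env bash
# Probe both identity endpoints; print a one-line status for each.
probe_identity() {
  local url
  for url in https://lab-id1/ipa/ui/ http://lab-id1:8080/health/ready; do
    if curl -skf --max-time 5 "$url" >/dev/null 2>&1; then
      echo "OK   $url"
    else
      echo "FAIL $url"
    fi
  done
}

probe_identity
```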

---

## RTO / RPO Targets

| Component | RPO | RTO |
|-----------|-----|-----|
| PostgreSQL (single DB) | 24 hours | 15 min (< 1 GB) / 60 min (< 10 GB) |
| PostgreSQL (full cluster) | 24 hours | 2–4 hours |
| Nextcloud files | 24 hours | 30 min (rsync) |
| Service configs | 24 hours | 5 min (tar extract) |
| FreeIPA (identity) | N/A (rebuilt via Ansible) | 45 min |
| Keycloak | N/A (state in PostgreSQL) | 15 min |
| Traefik | N/A (stateless) | 5 min |

> RTO = time from start of restore to service available.
> Run `make test-restore` quarterly to verify these targets remain achievable.

---

## Incident Report Template

Post to `#incidents` (or a ticket in GLPI) after every P1/P2:

```
## Incident Report — [SERVICE] [DATE]

**Severity:** P1 / P2
**Duration:** HH:MM – HH:MM UTC (X minutes)
**Impact:** [Which users / services were affected]
**Root Cause:** [What broke and why]
**Detection:** Zabbix alert / user report / monitoring gap
**Timeline:**
  - HH:MM – Alert fired
  - HH:MM – Acknowledged by [name]
  - HH:MM – Root cause identified
  - HH:MM – Fix applied
  - HH:MM – Service restored
**Fix Applied:** [Exact commands or Ansible run]
**Prevention:** [What will stop this happening again]
**Follow-up Issues:** [GLPI ticket numbers]
```
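
To avoid typing the skeleton by hand mid-incident, it can be generated. A small sketch — the `new_report` helper is illustrative, not part of IT-Stack:

```bash
#!/usr/bin/env bash
# Emit a pre-dated incident-report skeleton for the given service.
new_report() {
  local service=$1
  cat <<EOF
## Incident Report — ${service} $(date +%F)

**Severity:** P1 / P2
**Duration:** HH:MM – HH:MM UTC (X minutes)
**Impact:**
**Root Cause:**
EOF
}

new_report postgresql   # paste the output into #incidents and fill it in
```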

---

## Scheduled Maintenance Windows

| Frequency | Time | Duration | Purpose |
|-----------|------|----------|---------|
| Weekly | Sunday 02:00–04:00 UTC | 2 hours | Security patches (`make harden`) |
| Monthly | First Sunday 01:00–05:00 | 4 hours | OS upgrades, cert rotation |
| Quarterly | TBD | Half day | DR test, `make test-restore`, load test |

Announce maintenance in the Mattermost `#maintenance` channel at least **24 hours** in advance.

---

## Silence / Maintenance Mode

To suppress Zabbix alerts during maintenance:

```bash
# Via the Zabbix API (from an Ansible playbook or manually).
# Note: maintenance.create requires real host or host-group IDs — replace
# <HOSTID> with an ID from host.get ("ALL" is not a valid hostid).
curl -s -X POST https://zabbix.yourdomain.com/api_jsonrpc.php \
  -H 'Content-Type: application/json' \
  -d '{
    "jsonrpc":"2.0","method":"maintenance.create",
    "params":{
      "name":"Maintenance window '"$(date +%F)"'",
      "active_since":'"$(date +%s)"',
      "active_till":'"$(($(date +%s) + 7200))"',
      "hosts":[{"hostid":"<HOSTID>"}],
      "timeperiods":[{"timeperiod_type":0,"period":7200}]
    },
    "auth":"<ZABBIX_TOKEN>","id":1
  }'
```
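
If the work finishes early, the window can be removed the same way. A sketch using `maintenance.delete`, which takes an array of maintenance IDs — `<MAINTENANCE_ID>` is a placeholder that must come from the `maintenance.create` response or `maintenance.get`:

```bash
curl -s -X POST https://zabbix.yourdomain.com/api_jsonrpc.php \
  -H 'Content-Type: application/json' \
  -d '{"jsonrpc":"2.0","method":"maintenance.delete",
       "params":["<MAINTENANCE_ID>"],
       "auth":"<ZABBIX_TOKEN>","id":2}'
```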

Or use the Zabbix web UI → Configuration → Maintenance.

---

## Contact Directory Template

> Replace with your organisation's actual contacts.

| Role | Name | Mattermost | Phone |
|------|------|------------|-------|
| Primary On-Call | *See rotation* | `@primary-oncall` | — |
| Secondary On-Call | *See rotation* | `@secondary-oncall` | — |
| Escalation Manager | *TBD* | `@it-manager` | — |
| PostgreSQL DBA | *TBD* | `@dba` | — |
| Network Admin | *TBD* | `@netops` | — |

---

*Review this document and update contact information at least quarterly.*
*Run `make test-restore` quarterly to verify RTO/RPO targets remain achievable.*