Set up monitoring, alerting, and observability for the application.
- Analyze the application to determine monitoring needs:
  - Web server: response times, error rates, request volume.
  - Database: query performance, connection pool, replication lag.
  - Queue: message throughput, consumer lag, dead letters.
  - Background jobs: execution time, failure rate, queue depth.
- Generate monitoring configuration for the detected stack:
  - Prometheus: scrape config, recording rules, alert rules.
  - Grafana: dashboard JSON with key panels.
  - Datadog: `datadog.yaml` or agent configuration.
  - Health endpoint: `/health` or `/healthz` implementation.
- Define alerts for critical metrics:
  - Error rate > 1% over 5 minutes.
  - P99 latency > 2 seconds.
  - Disk usage > 80%.
  - Memory usage > 90%.
  - Certificate expiry < 14 days.
- Add structured logging configuration:
  - JSON log format with timestamp, level, message, trace ID.
  - Log levels: ERROR for failures, WARN for degradation, INFO for operations.
- Set up distributed tracing if applicable:
  - OpenTelemetry SDK initialization.
  - Trace context propagation headers.
- Write all configuration files to `monitoring/` or `deploy/monitoring/`.
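
As a rough sketch of the structured logging step, the following uses only Python's standard `logging` module to emit one JSON object per line with timestamp, level, message, and trace ID. The names `JsonFormatter` and `build_logger`, and the `trace_id` attribute passed via `extra=`, are illustrative assumptions rather than part of any detected stack. A real setup would more likely read the trace ID from the active tracing context.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
            # trace_id is a hypothetical attribute supplied via `extra=`;
            # it is None when the caller does not provide one.
            "trace_id": getattr(record, "trace_id", None),
        }
        return json.dumps(payload)


def build_logger(name: str = "app") -> logging.Logger:
    """Return a logger that writes JSON lines to stderr at INFO and above."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger()
    trace = "4bf92f3577b34da6a3ce929d0e0e4736"  # made-up example trace ID
    log.info("order processed", extra={"trace_id": trace})
    log.warning("cache latency elevated", extra={"trace_id": trace})
    log.error("payment provider unreachable")
```

Keeping the formatter separate from the logger setup means the same JSON shape can be attached to any handler (file, stdout, or a log shipper) without changing call sites.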

Example Prometheus alert rule:

    groups:
      - name: <app-name>-alerts
        rules:
          - alert: HighErrorRate
            # Ratio of 5xx responses to all responses, matching the 1% threshold.
            expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.01
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "High error rate detected"

- Every production service must have health checks, error rate alerts, and latency monitoring.
- Use percentile-based latency metrics (P50, P95, P99), not averages.
- Set alert thresholds based on SLO targets, not arbitrary values.
- Include runbook links in alert annotations.
- Log at appropriate levels; never log sensitive data (passwords, tokens, PII).
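
To illustrate why the rules above prefer percentiles over averages, here is a small sketch using a nearest-rank percentile. The `percentile` helper and the latency samples are made up for illustration. A real deployment would normally derive percentiles from histogram metrics in the monitoring backend, not from raw samples in application code.

```python
import math
from statistics import mean


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical latencies in seconds: 97 fast requests and 3 slow ones.
latencies = [0.1] * 97 + [5.0] * 3

summary = {
    "mean": round(mean(latencies), 3),
    "p50": percentile(latencies, 50),
    "p95": percentile(latencies, 95),
    "p99": percentile(latencies, 99),
}

if __name__ == "__main__":
    for name, value in summary.items():
        print(f"{name}: {value}s")
```

With these samples the mean stays well under the 2 second threshold while P99 is 5.0 seconds, so an alert on the average would stay silent even though 3% of requests are very slow.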