Production readiness requires Prometheus metrics, OpenTelemetry (OTel) traces with sampling control, log correlation, and default Grafana dashboards with SLO-driven alerts (latency, error rate, agent health, AI cost).
Paths to start:
- Metrics: expose Prometheus endpoints in the API and agent; see any metrics.go, or the injection in main.go
- Traces: distributed tracing and sampling logic in the API, agent, and key flows (internal/diagnostics/; for the circuit breaker layer, see CLAUDE.md)
- Logging: add context IDs and correlate logs on them; see the logrus config
- Dashboards/alerts: publish default Grafana JSON and SLO recipes for latency, errors, agent heartbeat, and AI cost budget
Integrate with key flows (diagnostics, remediation, agent lifecycle).
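For the sampling control mentioned above, one common approach is a deterministic ratio decision derived from the trace ID, so every service that sees the same trace makes the same choice. The sketch below is a standard-library illustration of that idea only; `RatioSampler`, `NewRatioSampler`, and `ShouldSample` are hypothetical names, and this is not the OTel SDK's sampler or the project's implementation.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// RatioSampler keeps roughly `ratio` of all traces, decided from the trace ID.
type RatioSampler struct {
	threshold uint64 // IDs whose leading 8 bytes fall below this are kept
	all       bool   // true when ratio >= 1: keep everything
}

// NewRatioSampler clamps ratio to [0, 1] and precomputes the cut-off.
func NewRatioSampler(ratio float64) RatioSampler {
	switch {
	case ratio <= 0:
		return RatioSampler{}
	case ratio >= 1:
		return RatioSampler{all: true}
	}
	return RatioSampler{threshold: uint64(ratio * float64(math.MaxUint64))}
}

// ShouldSample reads the first 8 bytes of the trace ID as a big-endian
// integer and compares it to the threshold. The same ID always yields the
// same decision, with no shared state between services.
func (s RatioSampler) ShouldSample(traceID [16]byte) bool {
	if s.all {
		return true
	}
	return binary.BigEndian.Uint64(traceID[:8]) < s.threshold
}

func main() {
	s := NewRatioSampler(0.25)
	low := [16]byte{0x10}  // leading byte below the 25% cut-off (0x40)
	high := [16]byte{0xF0} // leading byte above the cut-off
	fmt.Println(s.ShouldSample(low), s.ShouldSample(high)) // prints "true false"
}
```

The ratio would typically come from configuration so it can be lowered for high-volume flows and raised for diagnostics or remediation paths; how that is wired in is left open here.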
References: internal/diagnostics/, CLAUDE.md, Grafana dashboards, main.go.