- Health Checks & Monitoring
- Alerting & Notifications
- Backup & Recovery
- Performance Testing
- Security Scanning
- Resource Management
- High Availability
- Observability
| Component | URL | Purpose | Status |
|---|---|---|---|
| Prometheus | http://localhost:9090 | Metrics collection & alerting | ✅ |
| Grafana | http://localhost:3000 | Dashboards & visualization | ✅ |
| Jaeger | http://localhost:16686 | Distributed tracing | ✅ |
| Alertmanager | http://localhost:9093 | Alert management | ✅ |
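A quick way to confirm the stack above is reachable is to probe each component's standard health endpoint (Jaeger has no dedicated health path, so its UI root is probed instead):

```bash
# Probe each monitoring component and report UP/DOWN.
check() {
  # $1 = component name, $2 = URL to probe
  if curl -sf -o /dev/null --max-time 3 "$2"; then
    echo "UP    $1"
  else
    echo "DOWN  $1"
  fi
}

check Prometheus   http://localhost:9090/-/healthy
check Grafana      http://localhost:3000/api/health
check Jaeger       http://localhost:16686/
check Alertmanager http://localhost:9093/-/healthy
```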
- Service availability: Up/down status for all components
- Response times: API endpoint latency measurements
- Throughput: Requests per second across services
- Error rates: 4xx/5xx response codes and application errors
- Health check status: Individual service health endpoints
- CPU utilization: Per-service and system-wide CPU usage
- Memory consumption: Heap usage, garbage collection metrics
- Disk I/O: Read/write operations and disk space usage
- Network I/O: Bandwidth utilization and connection counts
- Database performance: Query times, connection pool status
- Coupon redemptions: Rate and success of coupon usage
- User activity: Login rates, session duration, active users
- Event processing: Message queue throughput and latency
- Cache performance: Hit/miss ratios, cache size, eviction rates
- Authentication failures: Failed login attempts and patterns
- Suspicious activity: Unusual request patterns or access attempts
- Token validation: JWT validation success/failure rates
- Security alerts: Triggered security rules and violations
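These metrics are typically pulled from Prometheus with PromQL. The queries below are illustrative sketches only; the real metric names depend on the exporters each service ships:

```bash
# Illustrative PromQL for a few of the metrics above (metric names are assumptions):
#   p95 API latency:  histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
#   5xx error ratio:  sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
#   cache hit ratio:  redis_keyspace_hits_total / (redis_keyspace_hits_total + redis_keyspace_misses_total)

# Run an instant query through the Prometheus HTTP API; the fallback keeps the
# sketch usable when the stack is not running locally.
query='sum(rate(http_requests_total{status=~"5.."}[5m]))'
result=$(curl -s --max-time 3 http://localhost:9090/api/v1/query \
  --data-urlencode "query=$query" || echo '{"status":"unreachable"}')
echo "$result"
```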
All services include comprehensive health checks accessible via dedicated endpoints:
```bash
# Run comprehensive health check
./scripts/production-health-check.sh
```

- ✅ Service availability validation: Verify all services are running and responsive
- ✅ Database connectivity checks: Test connections to DynamoDB, Redis, PostgreSQL
- ✅ Authentication system validation: Verify Keycloak connectivity and JWT validation
- ✅ Resource usage monitoring: Check CPU, memory, and disk usage
- ✅ Security configuration verification: Validate security settings and configurations
| Service | Health Endpoint | Response Format |
|---|---|---|
| Write Layer | /actuator/health | Spring Boot Actuator JSON |
| Public API | /health | Custom JSON health status |
| Coupon Service | /actuator/health | Spring Boot Actuator JSON |
| Web App | /health | Simple status response |
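The endpoints in the table can be spot-checked with curl. The host ports below are placeholders, since the table lists only paths; substitute the ports your compose file actually maps:

```bash
# Probe each health endpoint and print the HTTP status code.
# Ports are placeholders -- adjust to the real service mappings.
for entry in \
  "Write Layer|http://localhost:8080/actuator/health" \
  "Public API|http://localhost:8081/health" \
  "Coupon Service|http://localhost:8082/actuator/health" \
  "Web App|http://localhost:8083/health"
do
  name=${entry%%|*}
  url=${entry#*|}
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 3 "$url" || true)
  echo "$name -> HTTP ${code:-000}"
done
```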
```bash
# Automated health monitoring
while true; do
  ./scripts/production-health-check.sh
  sleep 30
done
```

- Service downtime > 5 minutes
- Database connectivity issues
- Authentication system failures
- High error rates > 5%
- Resource exhaustion (CPU > 90%, Memory > 95%)
- High latency > 1 second (95th percentile)
- Memory usage > 90%
- Queue backlog growth
- SSL certificate expiration (< 30 days)
- Unusual traffic patterns
- Performance degradation
- Cache miss rate increases
- Background job failures
- Configuration changes
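The certificate-expiration condition above can be checked with openssl. This sketch generates a short-lived self-signed certificate as a stand-in for the production certificate and flags anything expiring within 30 days (date parsing assumes GNU `date -d`):

```bash
# Stand-in certificate (60 days) so the check is runnable without the real cert.
openssl req -x509 -newkey rsa:2048 -nodes -days 60 -subj "/CN=demo" \
  -keyout /tmp/demo.key -out /tmp/demo.crt 2>/dev/null

# Days until the certificate expires.
end=$(openssl x509 -noout -enddate -in /tmp/demo.crt | cut -d= -f2)
days_left=$(( ( $(date -d "$end" +%s) - $(date +%s) ) / 86400 ))

if [ "$days_left" -lt 30 ]; then
  echo "ALERT: certificate expires in $days_left days"
else
  echo "OK: certificate valid for $days_left days"
fi
```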
- Endpoint: http://localhost:5001/webhook
- Format: JSON payload with alert details
- Purpose: Integration with external systems
- Configuration: Endpoint and payload format are configurable for production deployment
- Recipients: Operations team, on-call engineers
- Escalation: Automatic escalation for critical alerts
- Critical → Warning → Info
- Immediate notification for critical issues
- Batched notifications for warnings and info
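For reference, Alertmanager delivers webhook notifications as a JSON POST. The sketch below builds a payload in that general shape (values are invented) and extracts the alert name the way a receiver behind the webhook endpoint might:

```bash
# Example payload in Alertmanager's webhook format (values are illustrative).
payload='{
  "status": "firing",
  "alerts": [
    {
      "labels": {"alertname": "HighErrorRate", "severity": "critical"},
      "annotations": {"summary": "5xx rate above 5% for 5 minutes"}
    }
  ]
}'

# A receiver would parse out fields such as the alert name:
alertname=$(echo "$payload" | grep -o '"alertname": "[^"]*"' | cut -d'"' -f4)
echo "received alert: $alertname"
```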
```bash
# Check current active alerts
curl http://localhost:9093/api/v1/alerts

# View alert history in Grafana
# Navigate to: http://localhost:3000

# Silence specific alerts (the v1 silences API requires the time window,
# creator, and comment fields as well as the matchers)
curl -X POST http://localhost:9093/api/v1/silences \
  -d '{
    "matchers": [{"name": "alertname", "value": "HighErrorRate", "isRegex": false}],
    "startsAt": "2024-01-01T12:00:00Z",
    "endsAt": "2024-01-01T14:00:00Z",
    "createdBy": "ops",
    "comment": "investigating elevated error rate"
  }'
```

- Receive Alert: Notification received via configured channels
- Initial Assessment: Check alert details and severity
- Service Status: Verify affected services using health checks
- Log Analysis: Review recent logs for error patterns
- Metrics Review: Check Grafana dashboards for trends
- Resolution: Apply fixes or escalate to development team
- Post-Incident: Update documentation and improve monitoring
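The first few steps of this workflow can be scripted. The sketch below pulls active alerts and then runs the health script, degrading gracefully when executed outside the deployed environment:

```bash
# Triage sketch for steps 2-3: fetch active alerts, then check service health.
alerts=$(curl -s --max-time 3 http://localhost:9093/api/v1/alerts || echo '{"data":[]}')
echo "active alerts payload: $alerts"

if [ -x ./scripts/production-health-check.sh ]; then
  ./scripts/production-health-check.sh
else
  echo "health-check script not found in this directory"
fi
```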
```bash
# Run comprehensive performance tests
./scripts/performance-test.sh

# Run specific test types
./scripts/performance-test.sh read        # Read performance only
./scripts/performance-test.sh write       # Write performance only
./scripts/performance-test.sh concurrent  # Load testing
./scripts/performance-test.sh stress      # Stress testing
```

- Response Time: < 500ms (95th percentile)
- Throughput: > 1000 req/sec
- Error Rate: < 1%
- Availability: > 99.9%
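As a concrete reading of the latency target: the 95th percentile is the value below which 95% of request times fall. A minimal nearest-rank computation over invented sample timings:

```bash
# Ten sample response times in milliseconds (illustrative data).
printf '%s\n' 120 80 95 300 110 90 450 85 100 75 > /tmp/latencies.txt

# Nearest-rank p95: sort, then take the value at ceil(0.95 * N).
p95=$(sort -n /tmp/latencies.txt | awk '
  { a[NR] = $1 }
  END {
    idx = int(NR * 0.95)
    if (idx * 100 < NR * 95) idx++   # integer ceil, avoiding float comparison
    print a[idx]
  }')
echo "p95 latency: ${p95}ms"
```

With ten samples the 95th-percentile rank is ceil(9.5) = 10, i.e. the largest observation, which is why a single slow outlier dominates this metric.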
Read Operations:
- Test Load: 1000 requests
- Concurrency: 10 concurrent users
- Duration: Sustained load testing
- Endpoints: All read APIs
Write Operations:
- Test Load: 100 requests
- Concurrency: 5 concurrent users
- Duration: Burst and sustained testing
- Endpoints: All write APIs
Concurrent Load:
- Users: 20 concurrent users
- Duration: 30-second sustained load
- Mix: 80% read, 20% write operations
- Ramp-up: Gradual increase to target load
Stress Testing:
- Load Pattern: Gradual increase to breaking point
- Monitoring: Resource utilization during stress
- Recovery: System recovery after stress removal
- Limits: Identify system capacity limits
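The read profile above (1000 requests, 10 concurrent users) can be approximated with a plain curl loop when a dedicated load tool is unavailable; the URL is a placeholder for one of the read APIs:

```bash
# Minimal load sketch: N requests spread over C parallel workers.
URL=${URL:-http://localhost:8081/health}   # placeholder read endpoint
N=1000
C=10

for worker in $(seq 1 $C); do
  (
    for i in $(seq 1 $((N / C))); do
      curl -s -o /dev/null --max-time 3 "$URL" || true
    done
  ) &
done
wait
echo "completed $N requests with $C workers"
```

A real run would also record per-request timings so the p95 and error-rate targets can be checked against the results.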
- Average Response Time: Real-time API response times
- Request Rate: Requests per second by endpoint
- Error Rate: Percentage of failed requests
- Active Users: Current number of authenticated users
- Queue Depth: Message queue backlog size
- Hourly Performance: Performance patterns throughout the day
- Daily Performance: Day-over-day performance comparison
- Weekly Trends: Long-term performance trend analysis
- Capacity Planning: Growth trends and capacity requirements
The comprehensive backup script covers all critical system components:
- ✅ PostgreSQL databases (Keycloak data)
- ✅ Redis cache snapshots
- ✅ RabbitMQ configurations
- ✅ DynamoDB tables
- ✅ Application configurations
- ✅ Docker volumes
```bash
# Create full system backup
./scripts/backup-restore.sh backup

# Restore from backup
./scripts/backup-restore.sh restore ./backups/2024-01-01_12-00-00
```

- Ready for cron integration
- Configurable backup intervals
- Retention policies
- Storage management
- Efficient storage with tar.gz compression
- Backup manifest with metadata
- Integrity verification
- Multiple storage backends support
- Timestamped backups for precise recovery
- Incremental backup support
- Cross-service consistency coordination
- Recovery validation procedures
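Assuming the scripts live under /opt/app (path illustrative), the cron integration mentioned above could look like:

```
# Example crontab entries (paths and retention window are illustrative).
# Full backup daily at 02:00; prune backups older than 14 days at 03:00.
0 2 * * * /opt/app/scripts/backup-restore.sh backup >> /var/log/app-backup.log 2>&1
0 3 * * * find /opt/app/backups -maxdepth 1 -mtime +14 -exec rm -rf {} +
```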
```bash
# Test backup integrity
tar -tzf backups/backup-2024-01-01_12-00-00.tar.gz

# Validate backup manifest
jq . backups/2024-01-01_12-00-00/manifest.json

# Test restoration process
./scripts/test-restore.sh ./backups/2024-01-01_12-00-00
```

- Assess Impact: Determine scope of data loss or corruption
- Select Recovery Point: Choose appropriate backup timestamp
- Stop Services: Safely shutdown affected services
- Restore Data: Execute restoration procedures
- Validate Integrity: Verify restored data consistency
- Restart Services: Bring services back online
- Monitor Recovery: Watch for issues post-recovery
- Document Incident: Update procedures based on experience
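Steps 2 and 5 can be partially automated: pick the newest timestamped archive and verify it is readable before restoring. The sketch fabricates a small archive so it runs outside the deployed environment:

```bash
# Stand-in backup so the sketch is self-contained.
mkdir -p backups demo-data
echo "sample" > demo-data/file.txt
tar -czf backups/backup-2024-01-01_12-00-00.tar.gz demo-data

# Select the newest timestamped backup (names sort chronologically).
latest=$(ls backups/backup-*.tar.gz | sort | tail -1)
echo "recovery point: $latest"

# Verify the archive is readable before restoring from it.
if tar -tzf "$latest" > /dev/null; then
  echo "archive verified"
else
  echo "archive corrupt -- pick an earlier recovery point"
fi
```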
```bash
# Run comprehensive security scan
./scripts/security-scan.sh

# Run specific security checks
./scripts/security-scan.sh containers  # Container security
./scripts/security-scan.sh network     # Network security
./scripts/security-scan.sh auth        # Authentication
./scripts/security-scan.sh secrets     # Secret exposure
```

- ✅ JWT-based authentication
- ✅ Keycloak identity management
- ✅ Role-based access control
- ✅ Session management
- ✅ Container network isolation
- ✅ Port exposure validation
- ✅ HAProxy load balancing
- ⚠️ HTTPS configuration (ready for production certificates)
- ✅ Database access controls
- ✅ Secret scanning
- ✅ Environment variable validation
- ✅ Log sanitization
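A naive version of the secret-scanning check is a pattern grep over configuration files; the real scanner in scripts/security-scan.sh is presumably more thorough. The file and credential below are fabricated for illustration:

```bash
# Fabricated config file containing a fake credential.
mkdir -p conf
printf 'DB_HOST=db\nAWS_SECRET_ACCESS_KEY=AKIAFAKEEXAMPLE\n' > conf/app.env

# Flag assignments whose names suggest credentials.
if grep -rnE '(SECRET|PASSWORD|TOKEN|API_KEY)[A-Z_]*=' conf/; then
  echo "potential secrets found -- review before committing"
else
  echo "no obvious secrets detected"
fi
```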
- Real-time threat detection
- Security event correlation
- Compliance reporting
- Vulnerability scanning
- Authentication success/failure rates
- Access control violations
- Security alert frequency
- Compliance status indicators
- Request IDs: Unique identifiers for request tracking
- Service Correlation: Track requests across service boundaries
- Error Correlation: Link errors to specific request flows
- Performance Analysis: Identify bottlenecks in request processing
- Service Dependencies: Visualize service interaction patterns
- Latency Analysis: Identify slow components in request chains
- Error Propagation: Understand how errors spread through the system
- Capacity Planning: Use trace data for scaling decisions
- JSON Format: Machine-readable log entries
- Consistent Fields: Standardized log fields across services
- Log Levels: Appropriate use of DEBUG, INFO, WARN, ERROR
- Context Information: Include relevant request/user context
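A log line following these conventions might look like the output of this small helper; the field names are illustrative, not a schema the services necessarily share:

```bash
# Emit one structured JSON log line with standardized fields.
log() {
  # $1 = level, $2 = request id, $3 = message
  printf '{"ts":"%s","level":"%s","service":"%s","request_id":"%s","msg":"%s"}\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$1" "coupon-service" "$2" "$3"
}

log INFO req-4f2a "coupon redeemed"
log ERROR req-4f2a "redemption limit exceeded"
```

Keeping the request id in every line is what makes the correlation-ID tracking described above work across services.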
- Centralized Collection: All logs in single searchable location
- Log Retention: Appropriate retention policies for different log types
- Log Analysis: Tools for searching and analyzing log patterns
- Alert Integration: Logs trigger appropriate alerts
- Business Metrics: Key performance indicators for business logic
- Technical Metrics: System performance and health indicators
- Custom Metrics: Application-specific measurements
- Metric Aggregation: Appropriate time-based aggregation
- Host Metrics: CPU, memory, disk, network for all hosts
- Container Metrics: Docker container resource usage
- Service Metrics: Per-service performance indicators
- Dependency Metrics: External service performance impact
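Host and container figures like these usually come from node_exporter and cAdvisor scraped by Prometheus. The query below uses a standard cAdvisor metric name but is still a sketch, and it degrades gracefully when Prometheus is not reachable:

```bash
# Per-container CPU usage over the last 5 minutes (cAdvisor metric name).
query='sum by (name) (rate(container_cpu_usage_seconds_total[5m]))'
response=$(curl -s --max-time 3 http://localhost:9090/api/v1/query \
  --data-urlencode "query=$query" || echo '{"status":"unreachable"}')
echo "$response"
```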