- Overall 4 years of experience across Site Reliability Engineering, observability platforms, IT infrastructure and application production support, and Java support engineering.
- Experienced Observability Monitoring Engineer with over 3 years in administrative roles, specializing in providing 24/7 support for global customers in production environments.
- Proficient in APM monitoring tools such as Datadog, Grafana, Kibana, Dynatrace, Splunk, OMI, Tidal, and SiteScope. Skilled in managing SLOs, SLIs, and SLAs, and well-versed in ITIL frameworks including incident, change, major incident, and problem management. Proven ability in Datadog administration, dashboard creation, and monitoring services in production environments.
- API Development: Engineered secure and robust API endpoints for CRUD operations, ensuring data integrity and correct performance.
- Debugging & Maintenance: Adept at bug fixing and debugging complex applications to maintain system health.
- Frameworks: Solid knowledge of developing and troubleshooting applications using Spring Boot and Spring MVC.
- Timely Resolution: Committed to diagnosing and resolving system issues to minimize downtime and impact.
Proficient in the end-to-end administration of a comprehensive APM and monitoring stack, including:
Tools:
Datadog | Grafana | Kibana | New Relic | Dynatrace | Splunk
- Datadog Administration: Onboarding services, configuring agents, and tuning metrics collection.
- Visualization: Designing and building insightful dashboards tailored to SLOs/SLIs and business KPIs.
- Alerting: Implementing and managing alert policies to reduce noise and improve MTTR.
- Service Management: Skilled in managing SLOs, SLIs, and SLAs to align IT services with business goals.
- ITIL Practices: Well-versed in ITIL frameworks for Incident, Change, Major Incident, and Problem Management.
- ITIL: Incident, Change, Major Incident, Problem Management; SLOs, SLIs, SLAs (metrics, traces, logs).
- Alerting: success/error/composite alerts, threshold tuning, refinement, noise/toil reduction.
- App monitoring: triage in production, dev collaboration via JIRA, runbooks, dashboards, reporting.
- Tooling: Grafana (error insights), Kibana (log analysis), Datadog admin (monitors, dashboards), PagerDuty (on-call).
- Process: onboarding services to monitoring, gap analysis, RCA participation, weekly/monthly reporting.
- Programming: Java, Python (custom metrics, light instrumentation).
Client: Qatar Airways - Payments Monitoring Group
- Provided 24/7 support to global customers for payments applications in production environments.
- Managed AWS infrastructure (EC2, VPC, IAM, S3) supporting 50+ microservices and high-volume payment transactions with auto-scaling and fault tolerance.
- Implemented SLOs, SLIs, SLAs to ensure performance and reliability goals were met and measured.
- Maintained 99.9%+ uptime for critical applications handling high-volume financial transactions.
- Designed and implemented Datadog dashboards, monitors, and SLO tracking, improving system visibility and proactive issue detection.
- Led P1/P2 incident war rooms and RCA, reducing MTTR by 25% and preventing recurring production issues.
- Integrated CI/CD pipelines (Azure DevOps, Git, Maven) to enable faster, more reliable deployments with reduced failure rates.
- Reduced MTTR by 30% by developing Python automation scripts for alerting, log analysis, and incident response.
- Configured and monitored alerts with PagerDuty to ensure timely incident response and on-call rotations.
- Supported distributed microservices architecture with end-to-end monitoring across services using logs, metrics, and traces.
- Implemented Infrastructure as Code (Terraform) for provisioning and managing AWS resources.
- Performed advanced observability tasks: custom dashboards, widgets, and panels in Datadog; threshold tuning; and alert noise reduction.
- Collaborated with development teams to debug Java/Spring Boot applications, analyzing logs, JVM metrics, and API performance.
- Improved monitoring efficiency by reducing alert noise and false positives, enhancing operational response time.
- Deployed and managed containerized applications on Kubernetes, handling pod failures (CrashLoopBackOff), auto-scaling, and resource optimization.
Client: Enterprise Solutions - Payments Monitoring
- Provided 24/7 L1/L2 support to global customers for critical payments applications in production environments.
- Managed and administered the APM/Monitoring stack: Datadog, Grafana, Kibana, OMI, Tidal, SiteScope.
- Configured and tuned alert thresholds, significantly reducing noise from ineffective alerts and improving signal clarity.
- Monitored and supported applications, services, and batch jobs across multiple platforms to ensure system health.
- Created and escalated JIRA tickets to development teams for faster incident resolution and tracking.
- Prepared structured incident checklists and runbooks, sharing clear documentation with clients and business teams.
- Defined and monitored SLA/SLI metrics for payment services using Datadog to uphold service quality agreements.
- Built and customized JIRA dashboards based on project requirements to streamline workflow and visibility.
- Configured PagerDuty for effective alerting and implemented escalation workflows to ensure on-call responsiveness.
- Performed detailed incident analysis and engaged with Root Cause Analysis (RCA) teams to drive long-term fixes.
- Generated and shared daily, weekly, and monthly status reports with business stakeholders to communicate system health and incidents.
- Conducted basic front-end troubleshooting of applications and engaged next-level support teams for complex issues.
- Provided front-line and second-level IT operations support, ensuring outstanding client service delivery.
- Supported weekend server patching activities, including comprehensive pre- and post-patching validation checks.