
Site Reliability Engineer • Observability • Production Support • Technical Support

🎯 Career Objective

Site Reliability Engineer with 4 years of overall experience across observability platforms, IT infrastructure and application production support, and Java application support.
Observability and monitoring engineer with over 3 years in administrative roles, specializing in 24/7 support for global customers in production environments.
Proficient in APM and monitoring tools such as Datadog, Grafana, Kibana, Dynatrace, Splunk, OMI, Tidal, and SiteScope. Skilled in managing SLOs, SLIs, and SLAs, and well-versed in ITIL practices including incident, change, major incident, and problem management. Proven ability in Datadog administration, dashboard creation, and monitoring services in production environments.
API Development: Engineered secure and robust API endpoints for CRUD operations, ensuring data integrity and correct behavior.
Debugging & Maintenance: Adept at bug fixing and debugging complex applications to maintain system health.
Frameworks: Good knowledge of developing and troubleshooting applications with Spring Boot and Spring MVC.
Timely Resolution: Committed to diagnosing and resolving system issues to minimize downtime and impact.

Monitoring & Observability

Proficient in the end-to-end administration of a comprehensive APM and monitoring stack, including:

Tools: Datadog | Grafana | Kibana | New Relic | Dynatrace | Splunk

Datadog Administration: Onboarding services, configuring agents, and tuning metrics collection.
Visualization: Designing and building insightful dashboards tailored to SLOs/SLIs and business KPIs.
Alerting: Implementing and managing alert policies to reduce noise and improve MTTR.

Process & Framework

Service Management: Skilled in managing SLOs, SLIs, and SLAs to align IT services with business goals.
ITIL Practices: Well-versed in ITIL frameworks for Incident, Change, Major Incident, and Problem Management.

🔑 Key Skills

ITIL: Incident, Change, Major Incident, Problem Management; SLOs, SLIs, SLAs (metrics, traces, logs).
Alerting: success/error/composite alerts, threshold tuning, refinement, noise/toil reduction.
App monitoring: triage in production, dev collaboration via JIRA, runbooks, dashboards, reporting.
Tooling: Grafana (error insights), Kibana (log analysis), Datadog admin (monitors, dashboards), PagerDuty (on-call).
Process: onboarding services to monitoring, gap analysis, RCA participation, weekly/monthly reporting.
Programming: Java, Python (custom metrics, light instrumentation).

Professional Experience

DXC Technology, Bangalore — Site Reliability Engineer (Dec 2022 – Present)

Client: Qatar Airways — Payments Monitoring Group

Provided 24/7 support to global customers for payments applications in production environments.
Managed AWS infrastructure (EC2, VPC, IAM, S3) supporting 50+ microservices and high-volume payment transactions with auto-scaling and fault tolerance.
Implemented SLOs, SLIs, and SLAs to ensure performance and reliability goals were met and measured.
Maintained 99.9%+ uptime for critical applications handling high-volume financial transactions.
Designed and implemented Datadog dashboards, monitors, and SLO tracking, improving system visibility and proactive issue detection.
Led P1/P2 incident war rooms and RCA, reducing MTTR by 25% and preventing recurring production issues.
Integrated CI/CD pipelines (Azure DevOps, Git, Maven) to enable faster, more reliable deployments with reduced failure rates.
Reduced MTTR by 30% by developing Python automation scripts for alerting, log analysis, and incident response.
Configured and monitored alerts with PagerDuty to ensure timely incident response and on-call rotations.
Supported distributed microservices architecture with end-to-end monitoring across services using logs, metrics, and traces.
Implemented Infrastructure as Code (Terraform) for provisioning and managing AWS resources.
Performed advanced observability tasks: custom dashboards, widgets, and panels in Datadog; threshold tuning; alert noise reduction.
Collaborated with development teams to debug Java/Spring Boot applications, analyzing logs, JVM metrics, and API performance.
Improved monitoring efficiency by reducing alert noise and false positives, enhancing operational response time.
Deployed and managed containerized applications on Kubernetes, handling pod failures (CrashLoopBackOff), auto-scaling, and resource optimization.

Wipro, Bangalore — Site Reliability Engineer (Apr 2022 – Nov 2022)

Client: Enterprise Solutions — Payments Monitoring

Provided 24/7 L1/L2 support to global customers for critical payments applications in production environments.
Managed and administered the APM/Monitoring stack: Datadog, Grafana, Kibana, OMI, Tidal, SiteScope.
Configured and tuned alert thresholds, significantly reducing noise from ineffective alerts and improving signal clarity.
Monitored and supported applications, services, and batch jobs across multiple platforms to ensure system health.
Created and escalated JIRA tickets to development teams for faster incident resolution and tracking.
Prepared structured incident checklists and runbooks, sharing clear documentation with clients and business teams.
Defined and monitored SLA/SLI metrics for payment services using Datadog to uphold service quality agreements.
Built and customized JIRA dashboards based on project requirements to streamline workflow and visibility.
Configured PagerDuty for effective alerting and implemented escalation workflows to ensure on-call responsiveness.
Performed detailed incident analysis and engaged with Root Cause Analysis (RCA) teams to drive long-term fixes.
Generated and shared daily, weekly, and monthly status reports with business stakeholders to communicate system health and incidents.
Conducted basic front-end troubleshooting of applications and engaged next-level support teams for complex issues.
Provided front-line and second-level IT operations support, ensuring outstanding client service delivery.
Supported weekend server patching activities, including comprehensive pre- and post-patching validation checks.