Skip to content

feat: add OTEL trace changes for fault-remediation#1171

Open
tanishagoyal2 wants to merge 3 commits intoNVIDIA:mainfrom
tanishagoyal2:otel-setp-fault-remediation
Open

feat: add OTEL trace changes for fault-remediation#1171
tanishagoyal2 wants to merge 3 commits intoNVIDIA:mainfrom
tanishagoyal2:otel-setp-fault-remediation

Conversation

@tanishagoyal2
Copy link
Copy Markdown
Contributor

@tanishagoyal2 tanishagoyal2 commented Apr 13, 2026

Summary

Type of Change

  • 🐛 Bug fix
  • ✨ New feature
  • 💥 Breaking change
  • 📚 Documentation
  • 🔧 Refactoring
  • 🔨 Build/CI

Component(s) Affected

  • Core Services
  • Documentation/CI
  • Fault Management
  • Health Monitors
  • Janitor
  • Other: ____________

Testing

  • Tests pass locally
  • Manual testing completed
  • No breaking changes (or documented)

Checklist

  • Self-review completed
  • Documentation updated (if needed)
  • Ready for review

Testing

Tested traces integration changes in tilt and dev cluster. We have tested various negative and positive cases and output of exported traces are mentioned below

  1. Remediation skipped because of unsupported action:
Screenshot 2026-04-13 at 4 00 25 PM Screenshot 2026-04-13 at 4 01 28 PM
  1. Received breakfix cancellation event:
Screenshot 2026-04-13 at 4 02 41 PM
  1. Failed to call janitor webhook as janitor pod was crashing:
Screenshot 2026-04-13 at 4 04 32 PM Screenshot 2026-04-13 at 4 05 02 PM
  1. Successfully created maintenance CR
Screenshot 2026-04-13 at 4 06 45 PM Screenshot 2026-04-13 at 4 07 34 PM

Summary by CodeRabbit

  • New Features

    • OpenTelemetry tracing added across the fault-remediation service for richer observability and error recording
    • Structured logs now include trace correlation so logs link to traces
    • Trace and span IDs are propagated into remediation resources as annotations for end-to-end traceability
    • Startup, shutdown, and runtime logs improved to include context-aware information
  • Configuration

    • Deployment now conditionally injects OpenTelemetry env vars when tracing is enabled

Signed-off-by: Tanisha goyal <tanishag@nvidia.com>
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Apr 13, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 8a324198-2081-439d-9d8d-13be78865a15

📥 Commits

Reviewing files that changed from the base of the PR and between df16fec and c214fdd.

📒 Files selected for processing (1)
  • fault-remediation/pkg/reconciler/reconciler.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • fault-remediation/pkg/reconciler/reconciler.go

📝 Walkthrough

Walkthrough

Add OpenTelemetry tracing and trace propagation across fault-remediation: new service-name constants, tracing init in main, per-event spans in the reconciler, trace-aware remediation flows and templates, context-aware slog usage, Helm OTLP env injection, go.mod dependency adjustments, and a reconciler pointer change.

Changes

Cohort / File(s) Summary
Tracing constants
commons/pkg/tracing/tracing.go
Added ServiceNodeDrainer and ServiceFaultRemediation constants.
Helm / Deployment
distros/kubernetes/nvsentinel/charts/fault-remediation/templates/deployment.yaml
Conditionally injects OTEL_EXPORTER_OTLP_ENDPOINT and OTEL_EXPORTER_OTLP_INSECURE env vars when global tracing is enabled.
Module deps
fault-remediation/go.mod
Promoted OpenTelemetry packages and github.com/go-logr/logr from indirect to direct requirements.
Application init & logging
fault-remediation/main.go
Initialize tracing, enable trace correlation in structured logs, integrate controller-runtime logger, and adjust shutdown logging.
Reconciler implementation & tests
fault-remediation/pkg/reconciler/reconciler.go, fault-remediation/pkg/reconciler/reconciler_e2e_test.go
Reconciler changed to pointer return/type; introduces per-event trace sessions, extensive spans/attributes/error recording, and context-aware slog updates; tests updated to use pointer instances.
Remediation client & interfaces
fault-remediation/pkg/remediation/remediation.go, fault-remediation/pkg/remediation/fault_remediation_client_interface.go
Instrumented remediation and log-collector flows with spans; propagated trace/span IDs into TemplateData; updated function signatures to accept ctx and include trace IDs.
CR templates
fault-remediation/pkg/reconciler/templates/..., fault-remediation/pkg/remediation/templates/rebootnode-template.yaml
Added metadata.annotations to inject nvsentinel.nvidia.com/trace-id and nvsentinel.nvidia.com/span-id into generated manifests.
Context-aware logging
fault-remediation/pkg/annotation/annotation.go, fault-remediation/pkg/crstatus/checker.go
Replaced slog.* calls with context-aware slog.*Context variants.
Initializer types
fault-remediation/pkg/initializer/init.go
Changed Components.FaultRemediationReconciler from value to pointer type.

Sequence Diagram(s)

sequenceDiagram
    participant Watcher as ChangeStreamWatcher
    participant Reconciler as FaultRemediationReconciler
    participant Tracer as OpenTelemetry
    participant Datastore as Datastore/TokenStore
    participant RemClient as FaultRemediationClient
    participant K8s as Kubernetes API
    participant OTLP as OTLP Exporter

    Watcher->>Reconciler: deliver EventWithToken
    Reconciler->>Tracer: Start/Reuse event span (fault_remediation.event_received)
    Reconciler->>Datastore: read event / mark token
    Reconciler->>Tracer: Start reconcile span (fault_remediation.reconcile)
    Reconciler->>RemClient: select action/template (pass trace/span IDs)
    RemClient->>Tracer: start remediation spans, set attributes/errors
    RemClient->>K8s: create CR / Jobs (annotations include trace/span)
    K8s-->>RemClient: ACK / failure
    RemClient->>Tracer: record errors / set attributes
    Reconciler->>Datastore: update status (persist span ID)
    Tracer->>OTLP: export spans to endpoint (env from Helm)
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 I hop through spans and stitch each broken thread,
Traces like carrots trail where watchers tread,
I whisper IDs into CRs and logs that sing,
Spans bloom, errors noted, observability's spring.
Hoppity-hop—I sprinkle traces as my thing.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 36.36% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'feat: add OTEL trace changes for fault-remediation' directly and accurately describes the main objective: adding OpenTelemetry tracing instrumentation to the fault-remediation service.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.11.4)

level=error msg="[linters_context] typechecking error: pattern ./...: directory prefix . does not contain main module or its selected dependencies"


Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (2)
fault-remediation/pkg/reconciler/reconciler.go (2)

575-584: Remove unnecessary whitespace before variable declaration.

Static analysis flags unnecessary whitespace. The blank line before crName declaration can be removed.

Proposed fix
 	if !result.IsZero() {
 		return result, nil
 	}
 
-	crName, performRemediationErr := r.performRemediation(ctx, healthEventWithStatus, groupConfig)
+	crName, performRemediationErr := r.performRemediation(ctx, healthEventWithStatus, groupConfig)
 	nodeRemediatedStatus := performRemediationErr == nil

Note: The linter appears to be flagging a formatting issue. Running gofmt should resolve this automatically.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@fault-remediation/pkg/reconciler/reconciler.go` around lines 575 - 584,
Remove the unnecessary blank line immediately before the declaration "crName,
performRemediationErr := r.performRemediation(ctx, healthEventWithStatus,
groupConfig)" so the variable declaration directly follows the preceding code;
you can simply delete the empty line or run gofmt to reformat the file. Ensure
the variables (crName, performRemediationErr, nodeRemediatedStatus) and
subsequent calls (span.SetAttributes, tracing.RecordError) remain unchanged and
properly aligned after formatting.

809-812: Consider using safeMarkProcessed for consistency.

When document ID extraction fails, the code directly calls watcherInstance.MarkProcessed without going through safeMarkProcessed. While this works, using the helper would provide consistent error handling and tracing. However, since safeMarkProcessed logs node-specific context and "unknown" is used here, the current approach may be intentional.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@fault-remediation/pkg/reconciler/reconciler.go` around lines 809 - 812,
Replace the direct call to watcherInstance.MarkProcessed with the
safeMarkProcessed helper to ensure consistent error handling and tracing:
instead of calling watcherInstance.MarkProcessed(context.Background(),
eventWithToken.ResumeToken) use safeMarkProcessed(context.Background(),
watcherInstance, eventWithToken.ResumeToken, "unknown") (or pass the appropriate
node identifier if available), and remove the manual metrics/logging around the
direct call so the helper handles errors and logs uniformly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@fault-remediation/pkg/reconciler/reconciler.go`:
- Around line 192-201: In completeEventSession (type FaultRemediationReconciler,
parameter session *eventTraceSession), remove the deprecated use of
result.Requeue from the early-return condition and only check reconcileErr or
result.RequeueAfter > 0; update the conditional from "if reconcileErr ||
result.Requeue || result.RequeueAfter > 0 { return }" to "if reconcileErr ||
result.RequeueAfter > 0 { return }" so static analysis no longer flags the
deprecated field.
- Around line 532-543: There are two identical calls to
metrics.EventsProcessed.WithLabelValues(metrics.CRStatusSkipped, nodeName).Inc()
causing double-counting; remove the duplicate so the skipped event is only
incremented once (keep the single call and delete the extra). Locate the block
where span is set with attributes including "fault_remediation.action.type",
"fault_remediation.existing_cr.name" (existingCR) and ensure
metrics.EventsProcessed is called exactly once in that skip/early-return branch
(the call that references metrics.EventsProcessed, metrics.CRStatusSkipped and
nodeName).

---

Nitpick comments:
In `@fault-remediation/pkg/reconciler/reconciler.go`:
- Around line 575-584: Remove the unnecessary blank line immediately before the
declaration "crName, performRemediationErr := r.performRemediation(ctx,
healthEventWithStatus, groupConfig)" so the variable declaration directly
follows the preceding code; you can simply delete the empty line or run gofmt to
reformat the file. Ensure the variables (crName, performRemediationErr,
nodeRemediatedStatus) and subsequent calls (span.SetAttributes,
tracing.RecordError) remain unchanged and properly aligned after formatting.
- Around line 809-812: Replace the direct call to watcherInstance.MarkProcessed
with the safeMarkProcessed helper to ensure consistent error handling and
tracing: instead of calling watcherInstance.MarkProcessed(context.Background(),
eventWithToken.ResumeToken) use safeMarkProcessed(context.Background(),
watcherInstance, eventWithToken.ResumeToken, "unknown") (or pass the appropriate
node identifier if available), and remove the manual metrics/logging around the
direct call so the helper handles errors and logs uniformly.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: c9cc01c4-3e05-4996-beb1-e9b6f6ab5713

📥 Commits

Reviewing files that changed from the base of the PR and between 9e73e8f and b46c863.

📒 Files selected for processing (14)
  • commons/pkg/tracing/tracing.go
  • distros/kubernetes/nvsentinel/charts/fault-remediation/templates/deployment.yaml
  • fault-remediation/go.mod
  • fault-remediation/main.go
  • fault-remediation/pkg/annotation/annotation.go
  • fault-remediation/pkg/crstatus/checker.go
  • fault-remediation/pkg/initializer/init.go
  • fault-remediation/pkg/reconciler/reconciler.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/reconciler/templates/gpureset-template.yaml
  • fault-remediation/pkg/reconciler/templates/rebootnode-template.yaml
  • fault-remediation/pkg/remediation/fault_remediation_client_interface.go
  • fault-remediation/pkg/remediation/remediation.go
  • fault-remediation/pkg/remediation/templates/rebootnode-template.yaml

@tanishagoyal2 tanishagoyal2 force-pushed the otel-setp-fault-remediation branch from d0e0202 to ea770b2 Compare April 13, 2026 10:42
Signed-off-by: Tanisha goyal <tanishag@nvidia.com>
@tanishagoyal2 tanishagoyal2 force-pushed the otel-setp-fault-remediation branch from ea770b2 to df16fec Compare April 13, 2026 14:01
Signed-off-by: Tanisha goyal <tanishag@nvidia.com>
@github-actions
Copy link
Copy Markdown

Merging this branch will decrease overall coverage

Impacted Packages Coverage Δ 🤖
github.com/nvidia/nvsentinel/fault-remediation/pkg/reconciler 22.05% (-0.67%) 👎

Coverage by file

Changed files (no unit tests)

Changed File Coverage Δ Total Covered Missed 🤖
github.com/nvidia/nvsentinel/fault-remediation/pkg/reconciler/reconciler.go 22.05% (-0.67%) 1950 (+555) 430 (+113) 1520 (+442) 👎

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant