feat: add OTEL trace changes for fault-remediation#1171
feat: add OTEL trace changes for fault-remediation#1171tanishagoyal2 wants to merge 3 commits intoNVIDIA:mainfrom
Conversation
Signed-off-by: Tanisha goyal <tanishag@nvidia.com>
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughAdd OpenTelemetry tracing and trace propagation across fault-remediation: new service-name constants, tracing init in main, per-event spans in the reconciler, trace-aware remediation flows and templates, context-aware slog usage, Helm OTLP env injection, go.mod dependency adjustments, and a reconciler pointer change. Changes
Sequence Diagram(s)sequenceDiagram
participant Watcher as ChangeStreamWatcher
participant Reconciler as FaultRemediationReconciler
participant Tracer as OpenTelemetry
participant Datastore as Datastore/TokenStore
participant RemClient as FaultRemediationClient
participant K8s as Kubernetes API
participant OTLP as OTLP Exporter
Watcher->>Reconciler: deliver EventWithToken
Reconciler->>Tracer: Start/Reuse event span (fault_remediation.event_received)
Reconciler->>Datastore: read event / mark token
Reconciler->>Tracer: Start reconcile span (fault_remediation.reconcile)
Reconciler->>RemClient: select action/template (pass trace/span IDs)
RemClient->>Tracer: start remediation spans, set attributes/errors
RemClient->>K8s: create CR / Jobs (annotations include trace/span)
K8s-->>RemClient: ACK / failure
RemClient->>Tracer: record errors / set attributes
Reconciler->>Datastore: update status (persist span ID)
Tracer->>OTLP: export spans to endpoint (env from Helm)
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 golangci-lint (2.11.4)level=error msg="[linters_context] typechecking error: pattern ./...: directory prefix . does not contain main module or its selected dependencies" Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (2)
fault-remediation/pkg/reconciler/reconciler.go (2)
575-584: Remove unnecessary whitespace before variable declaration.Static analysis flags unnecessary whitespace. The blank line before
crNamedeclaration can be removed.Proposed fix
if !result.IsZero() { return result, nil } - crName, performRemediationErr := r.performRemediation(ctx, healthEventWithStatus, groupConfig) + crName, performRemediationErr := r.performRemediation(ctx, healthEventWithStatus, groupConfig) nodeRemediatedStatus := performRemediationErr == nilNote: The linter appears to be flagging a formatting issue. Running
gofmtshould resolve this automatically.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@fault-remediation/pkg/reconciler/reconciler.go` around lines 575 - 584, Remove the unnecessary blank line immediately before the declaration "crName, performRemediationErr := r.performRemediation(ctx, healthEventWithStatus, groupConfig)" so the variable declaration directly follows the preceding code; you can simply delete the empty line or run gofmt to reformat the file. Ensure the variables (crName, performRemediationErr, nodeRemediatedStatus) and subsequent calls (span.SetAttributes, tracing.RecordError) remain unchanged and properly aligned after formatting.
809-812: Consider usingsafeMarkProcessedfor consistency.When document ID extraction fails, the code directly calls
watcherInstance.MarkProcessedwithout going throughsafeMarkProcessed. While this works, using the helper would provide consistent error handling and tracing. However, sincesafeMarkProcessedlogs node-specific context and "unknown" is used here, the current approach may be intentional.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@fault-remediation/pkg/reconciler/reconciler.go` around lines 809 - 812, Replace the direct call to watcherInstance.MarkProcessed with the safeMarkProcessed helper to ensure consistent error handling and tracing: instead of calling watcherInstance.MarkProcessed(context.Background(), eventWithToken.ResumeToken) use safeMarkProcessed(context.Background(), watcherInstance, eventWithToken.ResumeToken, "unknown") (or pass the appropriate node identifier if available), and remove the manual metrics/logging around the direct call so the helper handles errors and logs uniformly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@fault-remediation/pkg/reconciler/reconciler.go`:
- Around line 192-201: In completeEventSession (type FaultRemediationReconciler,
parameter session *eventTraceSession), remove the deprecated use of
result.Requeue from the early-return condition and only check reconcileErr or
result.RequeueAfter > 0; update the conditional from "if reconcileErr ||
result.Requeue || result.RequeueAfter > 0 { return }" to "if reconcileErr ||
result.RequeueAfter > 0 { return }" so static analysis no longer flags the
deprecated field.
- Around line 532-543: There are two identical calls to
metrics.EventsProcessed.WithLabelValues(metrics.CRStatusSkipped, nodeName).Inc()
causing double-counting; remove the duplicate so the skipped event is only
incremented once (keep the single call and delete the extra). Locate the block
where span is set with attributes including "fault_remediation.action.type",
"fault_remediation.existing_cr.name" (existingCR) and ensure
metrics.EventsProcessed is called exactly once in that skip/early-return branch
(the call that references metrics.EventsProcessed, metrics.CRStatusSkipped and
nodeName).
---
Nitpick comments:
In `@fault-remediation/pkg/reconciler/reconciler.go`:
- Around line 575-584: Remove the unnecessary blank line immediately before the
declaration "crName, performRemediationErr := r.performRemediation(ctx,
healthEventWithStatus, groupConfig)" so the variable declaration directly
follows the preceding code; you can simply delete the empty line or run gofmt to
reformat the file. Ensure the variables (crName, performRemediationErr,
nodeRemediatedStatus) and subsequent calls (span.SetAttributes,
tracing.RecordError) remain unchanged and properly aligned after formatting.
- Around line 809-812: Replace the direct call to watcherInstance.MarkProcessed
with the safeMarkProcessed helper to ensure consistent error handling and
tracing: instead of calling watcherInstance.MarkProcessed(context.Background(),
eventWithToken.ResumeToken) use safeMarkProcessed(context.Background(),
watcherInstance, eventWithToken.ResumeToken, "unknown") (or pass the appropriate
node identifier if available), and remove the manual metrics/logging around the
direct call so the helper handles errors and logs uniformly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: c9cc01c4-3e05-4996-beb1-e9b6f6ab5713
📒 Files selected for processing (14)
commons/pkg/tracing/tracing.godistros/kubernetes/nvsentinel/charts/fault-remediation/templates/deployment.yamlfault-remediation/go.modfault-remediation/main.gofault-remediation/pkg/annotation/annotation.gofault-remediation/pkg/crstatus/checker.gofault-remediation/pkg/initializer/init.gofault-remediation/pkg/reconciler/reconciler.gofault-remediation/pkg/reconciler/reconciler_e2e_test.gofault-remediation/pkg/reconciler/templates/gpureset-template.yamlfault-remediation/pkg/reconciler/templates/rebootnode-template.yamlfault-remediation/pkg/remediation/fault_remediation_client_interface.gofault-remediation/pkg/remediation/remediation.gofault-remediation/pkg/remediation/templates/rebootnode-template.yaml
d0e0202 to
ea770b2
Compare
Signed-off-by: Tanisha goyal <tanishag@nvidia.com>
ea770b2 to
df16fec
Compare
Signed-off-by: Tanisha goyal <tanishag@nvidia.com>
Merging this branch will decrease overall coverage
Coverage by fileChanged files (no unit tests)
Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code. |
Summary
Type of Change
Component(s) Affected
Testing
Checklist
Testing
Tested traces integration changes in tilt and dev cluster. We have tested various negative and positive cases and output of exported traces are mentioned below
Summary by CodeRabbit
New Features
Configuration