
feat: add node-drainer otel changes#1146

Open
tanishagoyal2 wants to merge 4 commits into NVIDIA:main from tanishagoyal2:otel-setup-node-drainer

Conversation

@tanishagoyal2
Contributor

@tanishagoyal2 tanishagoyal2 commented Apr 10, 2026

Summary

Type of Change

  • 🐛 Bug fix
  • ✨ New feature
  • 💥 Breaking change
  • 📚 Documentation
  • 🔧 Refactoring
  • 🔨 Build/CI

Component(s) Affected

  • Core Services
  • Documentation/CI
  • Fault Management
  • Health Monitors
  • Janitor
  • Other: ____________

Testing

  • Tests pass locally
  • Manual testing completed
  • No breaking changes (or documented)

Checklist

  • Self-review completed
  • Documentation updated (if needed)
  • Ready for review

Testing

Tested the traces integration changes in Tilt and a dev cluster. We tested various negative and positive cases; the output of the exported traces for each case is shown below.

  1. Node recovery: injected a fatal error and the node went into the draining state. A healthy event was later injected to simulate node recovery.
     (screenshot: node recovery during drain)
     A span is created for the status update, along with the final status that was written to the DB.
     (screenshot: node recovery along with status and cancel received)
  2. Healthy event injected to cancel the breakfix pipeline:
     (screenshot)
  3. Mixed modes of eviction: the node went into draining with pods present in each eviction mode (Immediate, Allow Completion, and Delete After Timeout). A span was created for each mode to mark its start and end.
     (screenshots: mixed-modes, mixed-modes 2)
  4. Node drainer crashed while draining the node: the node drainer was crashing for 5 minutes; when it came back to the running state it picked up the event and continued the drain, with spans added to the old trace. Two node_drainer.phase.allow_completion.start spans are visible, 5 minutes apart.
     (screenshot)
  5. Draining failed because no supported entities were present for a partial drain.
     (screenshots: drain-failed, drain-failed 2)
  6. Node is already in the draining state: the node drainer updated the status to AlreadyDrained.
     (screenshots: already drained, already drained 1)
  7. Partial drain in node-drainer: spans were created with the drain scope, entity type, and entity value info.
     (screenshot: partial-drain)

Summary by CodeRabbit

  • New Features

    • Added OpenTelemetry distributed tracing across node-drainer with per-operation spans, context-aware logs, and persisted drain-session trace info for cross-service correlation.
    • Exposed configurable OTLP exporter settings in Kubernetes deployments.
  • Bug Fixes

    • Corrected a log message spelling.
  • Chores

    • Promoted OpenTelemetry dependencies to v1.43.0 and added a new service identifier constant for tracing.

@tanishagoyal2 tanishagoyal2 marked this pull request as draft April 10, 2026 18:34
@coderabbitai
Contributor

coderabbitai bot commented Apr 10, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Adds OpenTelemetry tracing to node-drainer: new exported service constant, tracing init/shutdown and trace-correlated logger in main, per-event DrainSession tracing with spans in queue/worker, broad tracing and span-attribute instrumentation in reconciler/evaluator, Helm OTLP env injection, and module dependency adjustments.

Changes

  • Tracing constant (commons/pkg/tracing/tracing.go): Added exported ServiceNodeDrainer = "node-drainer".
  • Helm deployment (distros/kubernetes/nvsentinel/charts/node-drainer/templates/deployment.yaml): Conditionally injects OTEL_EXPORTER_OTLP_ENDPOINT and OTEL_EXPORTER_OTLP_INSECURE env vars when .Values.global.tracing.enabled is true.
  • Go module deps (node-drainer/go.mod): Promoted core OTel packages (go.opentelemetry.io/otel, go.opentelemetry.io/otel/trace v1.43.0) from indirect to direct requires.
  • Application init & logging (node-drainer/main.go): Initialize tracing with tracing.InitTracing(ServiceNodeDrainer), add a shutdown sequence, switch to a trace-correlated logger, and migrate to context-aware slog.*Context calls.
  • Queue session management (node-drainer/pkg/queue/queue.go, node-drainer/pkg/queue/types.go): Added exported DrainSession type, context helpers (ContextWithDrainSession, DrainSessionFromContext), and a sessions sync.Map to track per-event sessions.
  • Worker & processing (node-drainer/pkg/queue/worker.go): Parse stored trace metadata, load or create a per-event DrainSession, start/link spans, inject the session into the processing context, end spans on success, and preserve them on requeue.
  • Reconciler instrumentation & tests (node-drainer/pkg/reconciler/reconciler.go, node-drainer/pkg/reconciler/reconciler_integration_test.go): Extensive spans/attributes across enqueue/process/execute flows, persist span IDs into the datastore status, make cancellation ctx-aware; tests updated to pass ctx to HandleCancellation.
  • Evaluator instrumentation (node-drainer/pkg/evaluator/evaluator.go): Propagated ctx through checks, added spans/attributes and context-aware logging for eligibility and custom-drain flows.
  • Informers & initializer logging (node-drainer/pkg/informers/informers.go, node-drainer/pkg/initializer/init.go): Replaced slog.* calls with slog.*Context calls and a minor log text correction.
  • Other node-drainer files (node-drainer/...): Widespread migration to context-aware logging and error recording to support trace correlation and span attribute capture across components.

Sequence Diagrams

sequenceDiagram
    participant App as Application (main)
    participant OTel as OpenTelemetry SDK
    participant Logger as Logger (trace-corr)
    App->>OTel: InitTracing(ServiceNodeDrainer)
    activate OTel
    OTel-->>App: Tracer ready
    deactivate OTel
    App->>Logger: SetDefaultStructuredLoggerWithTraceCorrelation()
    Logger-->>App: Logger configured
    Note over App,OTel: Shutdown on exit
    App->>OTel: ShutdownTracing(ctx)
    activate OTel
    OTel-->>App: Flush complete
    deactivate OTel
sequenceDiagram
    participant Worker as Queue Worker
    participant Event as Datastore (stored event)
    participant Sessions as Sessions Map
    participant Span as Trace Span
    participant Reconciler as Processor / Reconciler

    Worker->>Event: Read event + trace metadata (traceID,parentSpanID)
    Worker->>Sessions: Load or create DrainSession
    alt existing session span
        Sessions-->>Worker: DrainSession with span
    else start new linked span
        Worker->>Span: Start linked drain-session span
        Span-->>Sessions: Store session with span
    end
    Worker->>Reconciler: Process event with context+session
    alt success
        Reconciler-->>Worker: Success
        Worker->>Span: End span
        Worker->>Sessions: Delete session
    else error (requeue)
        Reconciler-->>Worker: Error
        Note right of Sessions: Session preserved for retry
    end
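The load-or-create and end-on-success steps in the diagram above can be sketched with a plain sync.Map. The session struct and function names here are illustrative stand-ins, not the PR's actual code; the real DrainSession would also hold the linked span that gets ended or preserved.

```go
package main

import (
	"fmt"
	"sync"
)

// session is a stand-in for the DrainSession tracked per event ID.
type session struct{ eventID string }

// sessions mirrors the per-event sync.Map from the walkthrough.
var sessions sync.Map

// loadOrCreateSession returns the existing session for eventID or stores
// a fresh one; loaded reports whether an existing session was reused,
// matching the "load or create DrainSession" step in the diagram.
func loadOrCreateSession(eventID string) (*session, bool) {
	v, loaded := sessions.LoadOrStore(eventID, &session{eventID: eventID})
	return v.(*session), loaded
}

// finish deletes the session on success; on a requeue the caller skips
// the delete so the session (and its span) survives into the retry.
func finish(eventID string, succeeded bool) {
	if succeeded {
		sessions.Delete(eventID)
	}
}

func main() {
	s, loaded := loadOrCreateSession("evt-1")
	fmt.Println(s.eventID, loaded) // evt-1 false (newly created)
	_, loaded = loadOrCreateSession("evt-1")
	fmt.Println(loaded) // true (reused, e.g. on requeue)
	finish("evt-1", true)
	_, loaded = loadOrCreateSession("evt-1")
	fmt.Println(loaded) // false (recreated after successful processing)
}
```

LoadOrStore makes the reuse-on-retry behavior atomic even with concurrent workers, which is why a sync.Map fits this pattern better than a mutex-guarded plain map would in a sketch like this.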

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Poem

🐇 I hopped through spans with ears held high,

Linked traces like ribbons in the sky,
Sessions snug in a warm little den,
Logs and spans now dance as one—amen,
Hooray for drains that trace again!

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

  • Docstring Coverage — ⚠️ Warning: docstring coverage is 66.67%, which is below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.
  • Title check — ❓ Inconclusive: the title 'feat: add node-drainer otel changes' is vague and generic, using non-descriptive terms that don't clearly convey the scope or nature of the changes beyond mentioning 'otel changes'. Resolution: consider a more specific title, such as 'feat: add OpenTelemetry tracing instrumentation to node-drainer'.

✅ Passed checks (1 passed)
  • Description Check — ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
node-drainer/pkg/reconciler/reconciler.go (1)

1371-1378: ⚠️ Potential issue | 🟠 Major

Persist failed drain status in the same shape as the other status updates.

This path writes a generated protos.OperationStatus directly, while updateNodeUserPodsEvictedStatus explicitly flattens the same field to a plain {"status","message"} map. Using a different encoder only for failed drains risks a different document shape across stores and breaks consumers that read healtheventstatus.userpodsevictionstatus.status/message.

Suggested fix
+	evictionStatusMap := map[string]any{
+		"status":  string(model.StatusFailed),
+		"message": reason,
+	}
+
 	update := map[string]any{
 		"$set": map[string]any{
-			"healtheventstatus.userpodsevictionstatus": protos.OperationStatus{
-				Status:  string(model.StatusFailed),
-				Message: reason,
-			},
+			"healtheventstatus.userpodsevictionstatus":                evictionStatusMap,
 			"healtheventstatus.spanids." + tracing.ServiceNodeDrainer: tracing.SpanIDFromSpan(tracing.SpanFromContext(ctx)),
 		},
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@node-drainer/pkg/reconciler/reconciler.go` around lines 1371 - 1378, The
update currently writes a protos.OperationStatus object directly into the DB,
causing an inconsistent document shape versus updateNodeUserPodsEvictedStatus
which writes a plain map; change the "$set" value for
"healtheventstatus.userpodsevictionstatus" to a flattened map with keys "status"
(string(model.StatusFailed)) and "message" (reason) instead of
protos.OperationStatus, leaving the
"healtheventstatus.spanids."+tracing.ServiceNodeDrainer assignment as-is so
consumers reading healtheventstatus.userpodsevictionstatus.status/message see
the same structure as produced by updateNodeUserPodsEvictedStatus.
node-drainer/pkg/queue/queue.go (1)

64-74: ⚠️ Potential issue | 🟠 Major

Don’t propagate an empty EventID into the queue item.

The new drain-session store is keyed by nodeEvent.EventID, but this function still leaves EventID empty when utils.ExtractEventID(event) cannot find one. ProcessEventGeneric already has to backfill missing IDs from the document later, so this case is real. With the current code, unrelated legacy events without IDs will share one DrainSession and reuse span/phase state across retries.

Suggested fix
 	eventID := utils.ExtractEventID(event)
+	if eventID == "" {
+		eventID = fmt.Sprintf("%v", documentID)
+	}
 
 	nodeEvent := NodeEvent{
 		NodeName:         nodeName,
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@node-drainer/pkg/queue/queue.go` around lines 64 - 74, Extracted eventID may
be empty; before constructing the NodeEvent (and enqueuing), backfill it so
drain-session keys aren't empty: call utils.ExtractEventID(event) into eventID,
and if eventID == "" set eventID = documentID (or derive a stable non-empty id
such as documentID + suffix) then assign that to NodeEvent.EventID; update the
code around the NodeEvent creation (the nodeEvent variable) so
ProcessEventGeneric still sees the backfilled ID and no empty EventID is
enqueued.
🧹 Nitpick comments (1)
node-drainer/main.go (1)

372-377: Consider making tracing shutdown failure non-fatal.

Currently, if ShutdownTracing fails, the function returns early without calling QueueManager.Shutdown(). This could leave work items in the queue unprocessed. For consistency with the non-fatal tracing initialization (lines 74-76), consider logging the error and continuing:

Suggested change
-	if err := tracing.ShutdownTracing(ctx); err != nil {
-		return fmt.Errorf("failed to shutdown tracing: %w", err)
-	}
+	if err := tracing.ShutdownTracing(ctx); err != nil {
+		slog.WarnContext(ctx, "Failed to shutdown tracing", "error", err)
+	}

 	components.QueueManager.Shutdown()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@node-drainer/main.go` around lines 372 - 377, tracing.ShutdownTracing
currently returns an error which causes an early return and prevents
components.QueueManager.Shutdown() and slog.InfoContext(ctx, ...) from running;
change this to log the error (e.g., use slog.ErrorContext(ctx, "failed to
shutdown tracing", "err", err)) and continue execution so that
components.QueueManager.Shutdown() is always called and the final
slog.InfoContext(ctx, "Node drainer stopped") still runs; keep ctx,
tracing.ShutdownTracing, components.QueueManager.Shutdown and slog.InfoContext
as the referenced symbols.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@commons/pkg/statemanager/statemanager.go`:
- Line 318: The code now allows the transition QuarantinedLabelValue ->
DrainSucceededLabelValue (see QuarantinedLabelValue, DrainingLabelValue,
DrainSucceededLabelValue), but tests and package docs still treat it as invalid;
update the test cases in statemanager_test.go (the cases around the
quarantined->drain-succeeded assertion) so they expect a successful transition
instead of an error, and update the package-level documentation/comments in
statemanager.go to list quarantined -> drain-succeeded under valid transitions
(remove it from the invalid transitions list) so behavior, tests, and docs are
consistent.

In `@node-drainer/pkg/queue/queue.go`:
- Around line 120-129: The logging in the extraction branches currently calls
slog.Error with the full event document (variable event) when toMap fails for
healthevent and metadata; change these slog.Error calls in the toMap failure
paths (around the heMap/toMap("healthevent") and
metaMap/toMap(heMap["metadata"]) checks) to log only identifying keys (e.g.,
extract and log _id or eventID from doc or heMap if present) and include the
name of the missing field (e.g., "missing field: healthevent" or "missing field:
metadata") instead of dumping the whole event; apply the same change to the
similar failure branches referenced (the other toMap failures around lines
149-158) so all extraction errors log minimal identifying info rather than the
full document.

In `@node-drainer/pkg/queue/worker.go`:
- Around line 99-123: PreprocessAndEnqueueEvent uses
tracing.ServicePlatformConnector as a fallback when
tracing.ServiceFaultQuarantine is missing, but the drain session creation only
reads tracing.ServiceFaultQuarantine, causing node_drainer.drain_session to be
unlinked for events that only carry the platform-connector span; update the
parentSpanID selection in the DrainSession creation (where
DrainSession.DrainSessionSpan is set and node_drainer.drain_session is started)
to fall back to spanIDs[tracing.ServicePlatformConnector] when
spanIDs[tracing.ServiceFaultQuarantine] is empty so the new drain session span
links to the same upstream span as preprocess.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: afa6a9dd-b00f-4a74-a184-9ac6f3307ee3

📥 Commits

Reviewing files that changed from the base of the PR and between d79c6c7 and a3a6ce1.

📒 Files selected for processing (13)
  • commons/pkg/statemanager/statemanager.go
  • commons/pkg/tracing/tracing.go
  • distros/kubernetes/nvsentinel/charts/node-drainer/templates/deployment.yaml
  • node-drainer/go.mod
  • node-drainer/main.go
  • node-drainer/pkg/evaluator/evaluator.go
  • node-drainer/pkg/informers/informers.go
  • node-drainer/pkg/initializer/init.go
  • node-drainer/pkg/queue/queue.go
  • node-drainer/pkg/queue/types.go
  • node-drainer/pkg/queue/worker.go
  • node-drainer/pkg/reconciler/reconciler.go
  • node-drainer/pkg/reconciler/reconciler_integration_test.go

@tanishagoyal2 tanishagoyal2 force-pushed the otel-setup-node-drainer branch from 3f2dcc3 to 779f9f6 Compare April 11, 2026 11:30
@github-actions

Merging this branch will decrease overall coverage

Impacted packages (coverage Δ):
  • github.com/nvidia/nvsentinel/commons/pkg/statemanager: 15.84% (ø)
  • github.com/nvidia/nvsentinel/node-drainer/pkg/queue: 44.87% (-1.62%) 👎

Coverage by file

Changed files (no unit tests):
  • github.com/nvidia/nvsentinel/commons/pkg/statemanager/statemanager.go: 16.13% (ø); statements: 595 total, 96 covered, 499 missed
  • github.com/nvidia/nvsentinel/node-drainer/pkg/queue/queue.go: 66.18% (+0.11%); statements: 68 (+12) total, 45 (+8) covered, 23 (+4) missed 👍
  • github.com/nvidia/nvsentinel/node-drainer/pkg/queue/worker.go: 37.44% (-0.55%); statements: 195 (+66) total, 73 (+24) covered, 122 (+42) missed 👎

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

@tanishagoyal2 tanishagoyal2 marked this pull request as ready for review April 11, 2026 12:00
Contributor

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (2)
node-drainer/pkg/queue/worker.go (2)

97-103: Consider logging parse failures at debug level for troubleshooting.

If ParseHealthEventFromEvent fails, the error is silently discarded and processing continues without tracing. While this defensive approach is correct (tracing failures shouldn't break processing), adding a debug log would aid troubleshooting when traces are unexpectedly missing.

🔧 Proposed improvement
 	traceID := ""
 	parentSpanID := ""

-	if healthEvent, err := eventutil.ParseHealthEventFromEvent(event); err == nil {
+	healthEvent, err := eventutil.ParseHealthEventFromEvent(event)
+	if err != nil {
+		slog.DebugContext(ctx, "Could not parse health event for tracing",
+			"node", nodeEvent.NodeName, "eventID", nodeEvent.EventID, "error", err)
+	} else {
 		traceID = tracing.TraceIDFromMetadata(healthEvent.HealthEvent.GetMetadata())
 		parentSpanID = tracing.ParentSpanID(healthEvent.HealthEventStatus.SpanIds, tracing.ServiceFaultQuarantine)
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@node-drainer/pkg/queue/worker.go` around lines 97 - 103, When
ParseHealthEventFromEvent fails its error is ignored causing missing trace info
to be hard to debug; add a debug-level log when the error != nil inside the
conditional so failures are recorded for troubleshooting. Specifically, after
calling eventutil.ParseHealthEventFromEvent(event) check the err branch and emit
a debug message (including the event and err) before continuing, leaving the
success path that sets traceID via tracing.TraceIDFromMetadata and parentSpanID
via tracing.ParentSpanID unchanged.

105-106: Sessions may accumulate indefinitely for permanently failing events.

The sync.Map storing DrainSession entries is only cleaned up on successful processing (line 144). If an event fails during processing and is repeatedly requeued, its session persists in the map. While validation and database fetch failures don't create sessions, events that pass those checks but fail processing will accumulate sessions without cleanup, since there is no maximum retry limit—NumRequeues is logged but never checked. Over long-running instances, this can lead to unbounded session accumulation, each holding a reference to a trace span.

Consider one of these approaches:

  1. Delete the session when the event is dropped after exceeding a retry threshold.
  2. Add a TTL-based cleanup goroutine for stale sessions.
  3. Implement a maximum retry limit and delete the session when that limit is reached.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@node-drainer/pkg/queue/worker.go` around lines 105 - 106, The DrainSession
stored via m.sessions.LoadOrStore(nodeEvent.EventID, &DrainSession{}) can
accumulate for permanently failing events; update the failure/requeue handling
to enforce a max retry threshold (inspect DrainSession.NumRequeues) and when
that limit is exceeded call m.sessions.Delete(nodeEvent.EventID) and drop the
event (or mark it as failed), so sessions are removed instead of kept forever;
alternatively, add a TTL-based cleaner goroutine that scans m.sessions for stale
DrainSession entries and deletes ones older than the TTL—ensure references to
DrainSession, NumRequeues, and m.sessions.Delete(nodeEvent.EventID) are used to
locate and implement the change.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 1d09a282-db68-4477-a8df-902b3e3e8eb5

📥 Commits

Reviewing files that changed from the base of the PR and between a3a6ce1 and 779f9f6.

📒 Files selected for processing (2)
  • node-drainer/pkg/queue/queue.go
  • node-drainer/pkg/queue/worker.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • node-drainer/pkg/queue/queue.go

@github-actions

Merging this branch will decrease overall coverage

Impacted packages (coverage Δ):
  • github.com/nvidia/nvsentinel/metadata-collector/pkg/collector: 35.27% (ø)
  • github.com/nvidia/nvsentinel/store-client/pkg/config: 5.71% (ø)
  • github.com/nvidia/nvsentinel/store-client/pkg/datastore/providers/postgresql: 4.25% (ø)
  • github.com/nvidia/nvsentinel/store-client/pkg/utils: 6.64% (-0.23%) 👎

Coverage by file

Changed files (no unit tests):
  • github.com/nvidia/nvsentinel/metadata-collector/pkg/collector/collector.go: 35.27% (ø); statements: 241 total, 85 covered, 156 missed
  • github.com/nvidia/nvsentinel/store-client/pkg/config/env_loader.go: 5.71% (ø); statements: 1927 total, 110 covered, 1817 missed
  • github.com/nvidia/nvsentinel/store-client/pkg/datastore/providers/postgresql/datastore.go: 4.32% (ø); statements: 1411 total, 61 covered, 1350 missed
  • github.com/nvidia/nvsentinel/store-client/pkg/utils/pq.go: 27.12% (-3.39%); statements: 59 total, 16 (-2) covered, 43 (+2) missed 👎

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files

  • github.com/nvidia/nvsentinel/metadata-collector/pkg/collector/collector_test.go
  • github.com/nvidia/nvsentinel/store-client/pkg/config/env_loader_test.go
  • github.com/nvidia/nvsentinel/store-client/pkg/utils/pq_test.go

Signed-off-by: Tanisha goyal <tanishag@nvidia.com>
Signed-off-by: Tanisha goyal <tanishag@nvidia.com>
@tanishagoyal2 tanishagoyal2 force-pushed the otel-setup-node-drainer branch from c5921c1 to f952364 Compare April 13, 2026 07:28
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
node-drainer/pkg/reconciler/reconciler.go (1)

150-156: Use the same parent-span source as the worker.

This fallback makes node_drainer.enqueue_event resolve a different upstream span than node_drainer.drain_session when metadata is malformed. The worker path and the recorded invariant both expect the fault-quarantine span ID to be the only parent source here.

♻️ Suggested change
 traceID := tracing.TraceIDFromMetadata(healthEventWithStatus.HealthEvent.GetMetadata())
 parentSpanID := tracing.ParentSpanID(
 	healthEventWithStatus.HealthEventStatus.SpanIds, tracing.ServiceFaultQuarantine)
-
-if parentSpanID == "" {
-	parentSpanID = tracing.ParentSpanID(healthEventWithStatus.HealthEventStatus.SpanIds, tracing.ServicePlatformConnector)
-}

Based on learnings, "In node-drainer/pkg/queue/worker.go, events processed by the node-drainer are always expected to carry a tracing.ServiceFaultQuarantine span ID. No fallback to tracing.ServicePlatformConnector is needed when resolving the parent span ID for creating the node_drainer.drain_session span."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@node-drainer/pkg/reconciler/reconciler.go` around lines 150 - 156, The
parent-span fallback to tracing.ServicePlatformConnector should be removed so
the parentSpanID resolution always uses tracing.ServiceFaultQuarantine; update
the block that sets parentSpanID (currently calling tracing.ParentSpanID with
healthEventWithStatus.HealthEventStatus.SpanIds) to only call
tracing.ParentSpanID(..., tracing.ServiceFaultQuarantine) and drop the
conditional that substitutes tracing.ServicePlatformConnector—ensure traceID is
still obtained via tracing.TraceIDFromMetadata and downstream use in
node_drainer.drain_session / node_drainer.enqueue_event continues to read the
single fault-quarantine parent source.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@node-drainer/main.go`:
- Around line 74-77: InitTracing is being called in run() but ShutdownTracing is
only invoked later inside shutdownComponents() which can receive an
already-canceled run context (and run() can return early), so ensure tracing is
always shut down with a live context: immediately after successfully calling
tracing.InitTracing(tracing.ServiceNodeDrainer) register a deferred shutdown
that calls tracing.ShutdownTracing with a fresh non-canceled context (e.g.,
context.Background() or a short context.WithTimeout) so shutdown always runs
even on early returns; similarly apply the same pattern where InitTracing is
used around the code paths referenced at the 372-377 area so ShutdownTracing is
never passed the canceled run context.

In `@node-drainer/pkg/reconciler/reconciler.go`:
- Around line 1164-1173: The tracing attribute node_drainer.error.type should be
a stable, low-cardinality value instead of err.Error(); change the
span.SetAttributes call in the error path around
r.Config.StateManager.UpdateNVSentinelStateNodeLabel(...) so that
attribute.String("node_drainer.error.type", ...) uses a fixed category (e.g.
"state_manager.update_failed" or "update_nv_sentinel_state_node_label_failed")
and keep the full error text in attribute.String("node_drainer.error.message",
err.Error()); leave tracing.RecordError(span, err) as-is and ensure the
slog.ErrorContext log still includes the err for debugging.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 5315abeb-9473-4b7b-8406-942a1e30ebc8

📥 Commits

Reviewing files that changed from the base of the PR and between 779f9f6 and f952364.

📒 Files selected for processing (12)
  • commons/pkg/tracing/tracing.go
  • distros/kubernetes/nvsentinel/charts/node-drainer/templates/deployment.yaml
  • node-drainer/go.mod
  • node-drainer/main.go
  • node-drainer/pkg/evaluator/evaluator.go
  • node-drainer/pkg/informers/informers.go
  • node-drainer/pkg/initializer/init.go
  • node-drainer/pkg/queue/queue.go
  • node-drainer/pkg/queue/types.go
  • node-drainer/pkg/queue/worker.go
  • node-drainer/pkg/reconciler/reconciler.go
  • node-drainer/pkg/reconciler/reconciler_integration_test.go
✅ Files skipped from review due to trivial changes (3)
  • commons/pkg/tracing/tracing.go
  • node-drainer/pkg/initializer/init.go
  • node-drainer/go.mod
🚧 Files skipped from review as they are similar to previous changes (3)
  • node-drainer/pkg/queue/types.go
  • node-drainer/pkg/informers/informers.go
  • node-drainer/pkg/queue/queue.go

@github-actions

Merging this branch will decrease overall coverage

| Impacted Packages | Coverage Δ | 🤖 |
|---|---|---|
| github.com/nvidia/nvsentinel/commons/pkg/statemanager | 15.84% (ø) | |
| github.com/nvidia/nvsentinel/node-drainer/pkg/queue | 44.87% (-1.62%) | 👎 |

Coverage by file

Changed files (no unit tests)

| Changed File | Coverage Δ | Total | Covered | Missed | 🤖 |
|---|---|---|---|---|---|
| github.com/nvidia/nvsentinel/commons/pkg/statemanager/statemanager.go | 16.13% (ø) | 595 | 96 | 499 | |
| github.com/nvidia/nvsentinel/node-drainer/pkg/queue/queue.go | 66.18% (+0.11%) | 68 (+12) | 45 (+8) | 23 (+4) | 👍 |
| github.com/nvidia/nvsentinel/node-drainer/pkg/queue/worker.go | 37.44% (-0.55%) | 195 (+66) | 73 (+24) | 122 (+42) | 👎 |

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Signed-off-by: Tanisha goyal <tanishag@nvidia.com>
@tanishagoyal2 tanishagoyal2 force-pushed the otel-setup-node-drainer branch from 898a308 to d9b72b1 Compare April 13, 2026 08:00
Contributor

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
node-drainer/pkg/reconciler/reconciler.go (1)

329-337: Use typed sentinel errors instead of classifying errors by message text.

The isRequeueSignal function determines retry behavior based on substring matching ("requeu") and prefix matching ("waiting for") in error messages. This creates a fragile contract: any change to the wording of these error messages will silently break requeue behavior without compilation failures. Instead, define typed sentinel errors and use errors.Is or errors.As for classification.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@node-drainer/pkg/reconciler/reconciler.go` around lines 329 - 337, The
current isRequeueSignal(err error) implementation relies on fragile string
matching; replace that with typed sentinel errors (e.g., declare package-level
vars like ErrRequeue and ErrWaitingFor) and update callers to return or wrap
those sentinel errors using fmt.Errorf(...%w...) so callers can be identified;
then change isRequeueSignal to return errors.Is(err, ErrRequeue) ||
errors.Is(err, ErrWaitingFor). Ensure you add the sentinel error declarations
and update any places that currently construct `"requeu..."` or `"waiting
for..."` error messages to return or wrap the new sentinel errors instead.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@node-drainer/pkg/reconciler/reconciler.go`:
- Around line 329-337: The current isRequeueSignal(err error) implementation
relies on fragile string matching; replace that with typed sentinel errors
(e.g., declare package-level vars like ErrRequeue and ErrWaitingFor) and update
callers to return or wrap those sentinel errors using fmt.Errorf(...%w...) so
callers can be identified; then change isRequeueSignal to return errors.Is(err,
ErrRequeue) || errors.Is(err, ErrWaitingFor). Ensure you add the sentinel error
declarations and update any places that currently construct `"requeu..."` or
`"waiting for..."` error messages to return or wrap the new sentinel errors
instead.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 1cace18f-3e11-4dbe-b2d9-e3a092de1278

📥 Commits

Reviewing files that changed from the base of the PR and between f952364 and 898a308.

📒 Files selected for processing (2)
  • node-drainer/main.go
  • node-drainer/pkg/reconciler/reconciler.go

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
node-drainer/pkg/reconciler/reconciler.go (1)

1185-1193: ⚠️ Potential issue | 🟡 Minor

Record the early document-ID failure on the custom-drain span.

node_drainer.execute_custom_drain is already open here, but the ExtractDocumentID failure path returns without tracing.RecordError(...) or error attributes. In traces, this looks like a clean span for a failed operation.

💡 Proposed fix

```diff
 	eventID, err := utils.ExtractDocumentID(event)
 	if err != nil {
+		tracing.RecordError(span, err)
+		span.SetAttributes(
+			attribute.String("node_drainer.error.type", "extract_document_id_error"),
+			attribute.String("node_drainer.error.message", err.Error()),
+		)
+
 		return fmt.Errorf("failed to extract document ID for custom drain: %w", err)
 	}
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@node-drainer/pkg/reconciler/reconciler.go` around lines 1185 - 1193, When
utils.ExtractDocumentID(event) returns an error inside the
node_drainer.execute_custom_drain span, record that failure on the open span
before returning: call tracing.RecordError(ctx, span, err) (and optionally
span.SetAttributes with a short attribute like
attribute.String("event.extract_error", err.Error())) so the trace shows the
error; ensure this is done immediately after the ExtractDocumentID error and
before the function returns using the existing span variable.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@node-drainer/pkg/reconciler/reconciler.go`:
- Around line 505-509: The phase-marker calls (e.g., the call to
r.endPhaseMarkers using traceID := tracing.TraceIDFromMetadata(...) and
parentSpanID := tracing.ParentSpanID(healthEvent.HealthEventStatus.SpanIds,
tracing.ServiceFaultQuarantine)) must preserve the same upstream fallback used
on enqueue: if ParentSpanID with tracing.ServiceFaultQuarantine returns an
empty/zero value, call ParentSpanID again with tracing.ServicePlatformConnector
and use that result as parentSpanID; update the parentSpanID selection logic
before calling r.endPhaseMarkers and apply the identical fallback change to the
other occurrences that build parentSpanID (the other blocks that currently pass
tracing.ServiceFaultQuarantine).
- Around line 288-297: The error returned from the EnqueueEventGeneric call is
returned raw (return err) so callers lose that this failure occurred during
enqueue; wrap the error with local context using fmt.Errorf (for example
"enqueue event failed for node %s document %s: %w") before returning to preserve
where it happened. Update the block around r.queueManager.EnqueueEventGeneric /
tracing.RecordError / span.SetAttributes to return a wrapped error (including
nodeName and documentID) instead of the raw err.

---

Outside diff comments:
In `@node-drainer/pkg/reconciler/reconciler.go`:
- Around line 1185-1193: When utils.ExtractDocumentID(event) returns an error
inside the node_drainer.execute_custom_drain span, record that failure on the
open span before returning: call tracing.RecordError(ctx, span, err) (and
optionally span.SetAttributes with a short attribute like
attribute.String("event.extract_error", err.Error())) so the trace shows the
error; ensure this is done immediately after the ExtractDocumentID error and
before the function returns using the existing span variable.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 228c0d62-9e8b-407b-bfe0-bfcbf115f4a1

📥 Commits

Reviewing files that changed from the base of the PR and between 898a308 and d9b72b1.

📒 Files selected for processing (2)
  • node-drainer/main.go
  • node-drainer/pkg/reconciler/reconciler.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • node-drainer/main.go

@github-actions

Merging this branch will decrease overall coverage

| Impacted Packages | Coverage Δ | 🤖 |
|---|---|---|
| github.com/nvidia/nvsentinel/node-drainer | 0.00% (ø) | |
| github.com/nvidia/nvsentinel/node-drainer/pkg/reconciler | 33.25% (-4.54%) | 👎 |

Coverage by file

Changed files (no unit tests)

| Changed File | Coverage Δ | Total | Covered | Missed | 🤖 |
|---|---|---|---|---|---|
| github.com/nvidia/nvsentinel/node-drainer/main.go | 0.00% (ø) | 507 (+39) | 0 | 507 (+39) | |
| github.com/nvidia/nvsentinel/node-drainer/pkg/reconciler/reconciler.go | 33.25% (-4.54%) | 1591 (+506) | 529 (+119) | 1062 (+387) | 👎 |

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

@github-actions

Merging this branch will increase overall coverage

| Impacted Packages | Coverage Δ | 🤖 |
|---|---|---|
| github.com/nvidia/nvsentinel/commons/pkg/statemanager | 15.84% (ø) | |
| github.com/nvidia/nvsentinel/store-client/pkg/client | 5.26% (+0.01%) | 👍 |

Coverage by file

Changed files (no unit tests)

| Changed File | Coverage Δ | Total | Covered | Missed | 🤖 |
|---|---|---|---|---|---|
| github.com/nvidia/nvsentinel/commons/pkg/statemanager/statemanager.go | 16.13% (ø) | 595 | 96 | 499 | |
| github.com/nvidia/nvsentinel/store-client/pkg/client/mongodb_client.go | 2.82% (ø) | 3794 | 107 | 3687 | |

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files

  • github.com/nvidia/nvsentinel/commons/pkg/statemanager/statemanager_test.go


```go
span.SetAttributes(
	attribute.String("node_drainer.custom_cr.name", crName),
	attribute.String("node_drainer.custom_cr.status", "in_progress"),
```
Collaborator

Do we need the status? Won't that be evident from the trace not completing?

```go
defer span.End()

span.SetAttributes(
	attribute.String("node_drainer.user_pods_eviction_status.status", string(model.StatusInProgress)),
```
Collaborator

Same question as earlier: isn't the status evident from the span itself?

Comment on lines +520 to +521
```go
attribute.String("node_drainer.partial_drain.entity_type", action.PartialDrainEntity.EntityType),
attribute.String("node_drainer.partial_drain.entity_value", action.PartialDrainEntity.EntityValue),
```
Collaborator

Aren't these already in the health event? Do we want them in the trace too?

Comment on lines +627 to +629
```go
_, startSpan := tracing.StartSpanWithLinkFromTraceContext(
	ctx, traceID, parentSpanID, "node_drainer.phase.immediate_eviction.start")
startSpan.End()
```
Collaborator

I'm a little confused by this: shouldn't the span continue until immediate_eviction is completed? Why do we want short-lived spans like this?

Contributor Author

I added this intentionally to avoid span noise in the trace. We check the pods in each mode on every reconciliation, which runs at a 2-minute interval, and pods can run for days, so to avoid adding new spans on every reconcile I am emitting short-lived spans.
We add mode.start as soon as we enter an eviction mode for the first time and mode.end when we move to the next eviction mode (e.g. from immediate to delete_after_timeout).

Collaborator

@lalitadithya lalitadithya Apr 13, 2026

Can't we just start one span and leave it open until completion?

We do need to trace every GET API call made during that duration as well, irrespective of how long the reconcile cycle lasts.

Contributor Author

We can leave it open, but if node-drainer crashes we lose the previously created spans, since a long-running span that never finishes is never exported.

Collaborator

can't we save the id to mongodb like we do with the parent ones?

Contributor Author

If we go with the current approach, we still get all spans traced. For example, in the screenshot below there are two allow_completion.start spans because node-drainer was crashing for ~5 min.

Screenshot 2026-04-13 at 8 56 41 PM

```go
slog.InfoContext(ctx, "Deleted custom drain CR",
	"node", nodeName,
	"crName", crName)
span.SetAttributes(attribute.Bool("node_drainer.custom_cr.deleted", true))
```
Collaborator

do we need this attribute, isn't it understood due to the execution flow?
