feat: add node-drainer otel changes #1146
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.
📝 Walkthrough
Adds OpenTelemetry tracing to node-drainer: a new exported service constant, tracing init/shutdown and a trace-correlated logger in main, per-event DrainSession tracing with spans in queue/worker, broad tracing and span-attribute instrumentation in reconciler/evaluator, Helm OTLP env injection, and module dependency adjustments.
Sequence Diagrams

```mermaid
sequenceDiagram
    participant App as Application (main)
    participant OTel as OpenTelemetry SDK
    participant Logger as Logger (trace-corr)
    App->>OTel: InitTracing(ServiceNodeDrainer)
    activate OTel
    OTel-->>App: Tracer ready
    deactivate OTel
    App->>Logger: SetDefaultStructuredLoggerWithTraceCorrelation()
    Logger-->>App: Logger configured
    Note over App,OTel: Shutdown on exit
    App->>OTel: ShutdownTracing(ctx)
    activate OTel
    OTel-->>App: Flush complete
    deactivate OTel
```

```mermaid
sequenceDiagram
    participant Worker as Queue Worker
    participant Event as Datastore (stored event)
    participant Sessions as Sessions Map
    participant Span as Trace Span
    participant Reconciler as Processor / Reconciler
    Worker->>Event: Read event + trace metadata (traceID, parentSpanID)
    Worker->>Sessions: Load or create DrainSession
    alt existing session span
        Sessions-->>Worker: DrainSession with span
    else start new linked span
        Worker->>Span: Start linked drain-session span
        Span-->>Sessions: Store session with span
    end
    Worker->>Reconciler: Process event with context+session
    alt success
        Reconciler-->>Worker: Success
        Worker->>Span: End span
        Worker->>Sessions: Delete session
    else error (requeue)
        Reconciler-->>Worker: Error
        Note right of Sessions: Session preserved for retry
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
🚥 Pre-merge checks: ❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
node-drainer/pkg/reconciler/reconciler.go (1)
1371-1378: ⚠️ Potential issue | 🟠 Major: Persist failed drain status in the same shape as the other status updates.
This path writes a generated `protos.OperationStatus` directly, while `updateNodeUserPodsEvictedStatus` explicitly flattens the same field to a plain `{"status","message"}` map. Using a different encoder only for failed drains risks a different document shape across stores and breaks consumers that read `healtheventstatus.userpodsevictionstatus.status`/`message`.
Suggested fix
```diff
+	evictionStatusMap := map[string]any{
+		"status":  string(model.StatusFailed),
+		"message": reason,
+	}
+
 	update := map[string]any{
 		"$set": map[string]any{
-			"healtheventstatus.userpodsevictionstatus": protos.OperationStatus{
-				Status:  string(model.StatusFailed),
-				Message: reason,
-			},
+			"healtheventstatus.userpodsevictionstatus": evictionStatusMap,
 			"healtheventstatus.spanids." + tracing.ServiceNodeDrainer: tracing.SpanIDFromSpan(tracing.SpanFromContext(ctx)),
 		},
 	}
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@node-drainer/pkg/reconciler/reconciler.go` around lines 1371 - 1378: The update currently writes a protos.OperationStatus object directly into the DB, causing an inconsistent document shape versus updateNodeUserPodsEvictedStatus which writes a plain map; change the "$set" value for "healtheventstatus.userpodsevictionstatus" to a flattened map with keys "status" (string(model.StatusFailed)) and "message" (reason) instead of protos.OperationStatus, leaving the "healtheventstatus.spanids."+tracing.ServiceNodeDrainer assignment as-is so consumers reading healtheventstatus.userpodsevictionstatus.status/message see the same structure as produced by updateNodeUserPodsEvictedStatus.
node-drainer/pkg/queue/queue.go (1)
64-74: ⚠️ Potential issue | 🟠 Major: Don't propagate an empty `EventID` into the queue item.
The new drain-session store is keyed by `nodeEvent.EventID`, but this function still leaves `EventID` empty when `utils.ExtractEventID(event)` cannot find one. `ProcessEventGeneric` already has to backfill missing IDs from the document later, so this case is real. With the current code, unrelated legacy events without IDs will share one `DrainSession` and reuse span/phase state across retries.
Suggested fix
```diff
 eventID := utils.ExtractEventID(event)
+if eventID == "" {
+	eventID = fmt.Sprintf("%v", documentID)
+}
 nodeEvent := NodeEvent{
 	NodeName: nodeName,
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@node-drainer/pkg/queue/queue.go` around lines 64 - 74, Extracted eventID may be empty; before constructing the NodeEvent (and enqueuing), backfill it so drain-session keys aren't empty: call utils.ExtractEventID(event) into eventID, and if eventID == "" set eventID = documentID (or derive a stable non-empty id such as documentID + suffix) then assign that to NodeEvent.EventID; update the code around the NodeEvent creation (the nodeEvent variable) so ProcessEventGeneric still sees the backfilled ID and no empty EventID is enqueued.
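The backfill suggested above can be sketched standalone. This is a minimal illustration, not the project's code: `extractEventID` is a stand-in for `utils.ExtractEventID`, and the map-based event is a simplification of the stored document.

```go
package main

import "fmt"

// extractEventID is a stand-in for utils.ExtractEventID: it returns ""
// when the stored document carries no explicit event ID.
func extractEventID(event map[string]any) string {
	id, _ := event["eventID"].(string)
	return id
}

// backfillEventID falls back to the document ID, so legacy events without
// an ID never enqueue with an empty key (which would otherwise make them
// all share one DrainSession).
func backfillEventID(event map[string]any, documentID any) string {
	eventID := extractEventID(event)
	if eventID == "" {
		eventID = fmt.Sprintf("%v", documentID)
	}
	return eventID
}

func main() {
	legacy := map[string]any{} // no eventID field
	fmt.Println(backfillEventID(legacy, "doc-123"))
}
```

Because the fallback is derived from the (unique) document ID, each legacy event still gets a stable, distinct drain-session key across requeues.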
🧹 Nitpick comments (1)
node-drainer/main.go (1)
372-377: Consider making tracing shutdown failure non-fatal.
Currently, if `ShutdownTracing` fails, the function returns early without calling `QueueManager.Shutdown()`. This could leave work items in the queue unprocessed. For consistency with the non-fatal tracing initialization (lines 74-76), consider logging the error and continuing:
Suggested change
```diff
 if err := tracing.ShutdownTracing(ctx); err != nil {
-	return fmt.Errorf("failed to shutdown tracing: %w", err)
+	slog.WarnContext(ctx, "Failed to shutdown tracing", "error", err)
 }

 components.QueueManager.Shutdown()
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@node-drainer/main.go` around lines 372 - 377, tracing.ShutdownTracing currently returns an error which causes an early return and prevents components.QueueManager.Shutdown() and slog.InfoContext(ctx, ...) from running; change this to log the error (e.g., use slog.ErrorContext(ctx, "failed to shutdown tracing", "err", err)) and continue execution so that components.QueueManager.Shutdown() is always called and the final slog.InfoContext(ctx, "Node drainer stopped") still runs; keep ctx, tracing.ShutdownTracing, components.QueueManager.Shutdown and slog.InfoContext as the referenced symbols.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@commons/pkg/statemanager/statemanager.go`:
- Line 318: The code now allows the transition QuarantinedLabelValue ->
DrainSucceededLabelValue (see QuarantinedLabelValue, DrainingLabelValue,
DrainSucceededLabelValue), but tests and package docs still treat it as invalid;
update the test cases in statemanager_test.go (the cases around the
quarantined->drain-succeeded assertion) so they expect a successful transition
instead of an error, and update the package-level documentation/comments in
statemanager.go to list quarantined -> drain-succeeded under valid transitions
(remove it from the invalid transitions list) so behavior, tests, and docs are
consistent.
In `@node-drainer/pkg/queue/queue.go`:
- Around line 120-129: The logging in the extraction branches currently calls
slog.Error with the full event document (variable event) when toMap fails for
healthevent and metadata; change these slog.Error calls in the toMap failure
paths (around the heMap/toMap("healthevent") and
metaMap/toMap(heMap["metadata"]) checks) to log only identifying keys (e.g.,
extract and log _id or eventID from doc or heMap if present) and include the
name of the missing field (e.g., "missing field: healthevent" or "missing field:
metadata") instead of dumping the whole event; apply the same change to the
similar failure branches referenced (the other toMap failures around lines
149-158) so all extraction errors log minimal identifying info rather than the
full document.
In `@node-drainer/pkg/queue/worker.go`:
- Around line 99-123: PreprocessAndEnqueueEvent uses
tracing.ServicePlatformConnector as a fallback when
tracing.ServiceFaultQuarantine is missing, but the drain session creation only
reads tracing.ServiceFaultQuarantine, causing node_drainer.drain_session to be
unlinked for events that only carry the platform-connector span; update the
parentSpanID selection in the DrainSession creation (where
DrainSession.DrainSessionSpan is set and node_drainer.drain_session is started)
to fall back to spanIDs[tracing.ServicePlatformConnector] when
spanIDs[tracing.ServiceFaultQuarantine] is empty so the new drain session span
links to the same upstream span as preprocess.
---
Outside diff comments:
In `@node-drainer/pkg/queue/queue.go`:
- Around line 64-74: Extracted eventID may be empty; before constructing the
NodeEvent (and enqueuing), backfill it so drain-session keys aren't empty: call
utils.ExtractEventID(event) into eventID, and if eventID == "" set eventID =
documentID (or derive a stable non-empty id such as documentID + suffix) then
assign that to NodeEvent.EventID; update the code around the NodeEvent creation
(the nodeEvent variable) so ProcessEventGeneric still sees the backfilled ID and
no empty EventID is enqueued.
In `@node-drainer/pkg/reconciler/reconciler.go`:
- Around line 1371-1378: The update currently writes a protos.OperationStatus
object directly into the DB, causing an inconsistent document shape versus
updateNodeUserPodsEvictedStatus which writes a plain map; change the "$set"
value for "healtheventstatus.userpodsevictionstatus" to a flattened map with
keys "status" (string(model.StatusFailed)) and "message" (reason) instead of
protos.OperationStatus, leaving the
"healtheventstatus.spanids."+tracing.ServiceNodeDrainer assignment as-is so
consumers reading healtheventstatus.userpodsevictionstatus.status/message see
the same structure as produced by updateNodeUserPodsEvictedStatus.
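The document-shape mismatch described above is easy to reproduce with plain `encoding/json`. This is an illustrative sketch: `OperationStatus` is a stand-in for the generated protos type, and the store's actual encoder may differ in detail.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// OperationStatus is a stand-in for the generated protos.OperationStatus.
type OperationStatus struct {
	Status  string
	Message string
}

// encodeStruct writes the struct directly, keeping Go's exported field names.
func encodeStruct(status, msg string) string {
	b, _ := json.Marshal(OperationStatus{Status: status, Message: msg})
	return string(b)
}

// encodeFlattened writes the plain lowercase map the other status updates use.
func encodeFlattened(status, msg string) string {
	b, _ := json.Marshal(map[string]any{"status": status, "message": msg})
	return string(b)
}

func main() {
	fmt.Println(encodeStruct("Failed", "drain failed"))    // {"Status":"Failed","Message":"drain failed"}
	fmt.Println(encodeFlattened("Failed", "drain failed")) // {"message":"drain failed","status":"Failed"}
}
```

The two encodings produce different key casing, which is exactly why consumers reading `status`/`message` break when only the failed-drain path uses the struct form.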
---
Nitpick comments:
In `@node-drainer/main.go`:
- Around line 372-377: tracing.ShutdownTracing currently returns an error which
causes an early return and prevents components.QueueManager.Shutdown() and
slog.InfoContext(ctx, ...) from running; change this to log the error (e.g., use
slog.ErrorContext(ctx, "failed to shutdown tracing", "err", err)) and continue
execution so that components.QueueManager.Shutdown() is always called and the
final slog.InfoContext(ctx, "Node drainer stopped") still runs; keep ctx,
tracing.ShutdownTracing, components.QueueManager.Shutdown and slog.InfoContext
as the referenced symbols.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: afa6a9dd-b00f-4a74-a184-9ac6f3307ee3
📒 Files selected for processing (13)
- commons/pkg/statemanager/statemanager.go
- commons/pkg/tracing/tracing.go
- distros/kubernetes/nvsentinel/charts/node-drainer/templates/deployment.yaml
- node-drainer/go.mod
- node-drainer/main.go
- node-drainer/pkg/evaluator/evaluator.go
- node-drainer/pkg/informers/informers.go
- node-drainer/pkg/initializer/init.go
- node-drainer/pkg/queue/queue.go
- node-drainer/pkg/queue/types.go
- node-drainer/pkg/queue/worker.go
- node-drainer/pkg/reconciler/reconciler.go
- node-drainer/pkg/reconciler/reconciler_integration_test.go
Force-pushed: 3f2dcc3 to 779f9f6
Merging this branch will decrease overall coverage
Coverage by file: Changed files (no unit tests)
Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.
🧹 Nitpick comments (2)
node-drainer/pkg/queue/worker.go (2)
97-103: Consider logging parse failures at debug level for troubleshooting.
If `ParseHealthEventFromEvent` fails, the error is silently discarded and processing continues without tracing. While this defensive approach is correct (tracing failures shouldn't break processing), adding a debug log would aid troubleshooting when traces are unexpectedly missing.
🔧 Proposed improvement
traceID := "" parentSpanID := "" - if healthEvent, err := eventutil.ParseHealthEventFromEvent(event); err == nil { + healthEvent, err := eventutil.ParseHealthEventFromEvent(event) + if err != nil { + slog.DebugContext(ctx, "Could not parse health event for tracing", + "node", nodeEvent.NodeName, "eventID", nodeEvent.EventID, "error", err) + } else { traceID = tracing.TraceIDFromMetadata(healthEvent.HealthEvent.GetMetadata()) parentSpanID = tracing.ParentSpanID(healthEvent.HealthEventStatus.SpanIds, tracing.ServiceFaultQuarantine) }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@node-drainer/pkg/queue/worker.go` around lines 97 - 103, When ParseHealthEventFromEvent fails its error is ignored causing missing trace info to be hard to debug; add a debug-level log when the error != nil inside the conditional so failures are recorded for troubleshooting. Specifically, after calling eventutil.ParseHealthEventFromEvent(event) check the err branch and emit a debug message (including the event and err) before continuing, leaving the success path that sets traceID via tracing.TraceIDFromMetadata and parentSpanID via tracing.ParentSpanID unchanged.
105-106: Sessions may accumulate indefinitely for permanently failing events.
The `sync.Map` storing `DrainSession` entries is only cleaned up on successful processing (line 144). If an event fails during processing and is repeatedly requeued, its session persists in the map. While validation and database fetch failures don't create sessions, events that pass those checks but fail processing will accumulate sessions without cleanup, since there is no maximum retry limit: `NumRequeues` is logged but never checked. Over long-running instances, this can lead to unbounded session accumulation, each holding a reference to a trace span.
Consider one of these approaches:
- Delete the session when the event is dropped after exceeding a retry threshold.
- Add a TTL-based cleanup goroutine for stale sessions.
- Implement a maximum retry limit and delete the session when that limit is reached.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@node-drainer/pkg/queue/worker.go` around lines 105 - 106, The DrainSession stored via m.sessions.LoadOrStore(nodeEvent.EventID, &DrainSession{}) can accumulate for permanently failing events; update the failure/requeue handling to enforce a max retry threshold (inspect DrainSession.NumRequeues) and when that limit is exceeded call m.sessions.Delete(nodeEvent.EventID) and drop the event (or mark it as failed), so sessions are removed instead of kept forever; alternatively, add a TTL-based cleaner goroutine that scans m.sessions for stale DrainSession entries and deletes ones older than the TTL—ensure references to DrainSession, NumRequeues, and m.sessions.Delete(nodeEvent.EventID) are used to locate and implement the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@node-drainer/pkg/queue/worker.go`:
- Around line 97-103: When ParseHealthEventFromEvent fails its error is ignored
causing missing trace info to be hard to debug; add a debug-level log when the
error != nil inside the conditional so failures are recorded for
troubleshooting. Specifically, after calling
eventutil.ParseHealthEventFromEvent(event) check the err branch and emit a debug
message (including the event and err) before continuing, leaving the success
path that sets traceID via tracing.TraceIDFromMetadata and parentSpanID via
tracing.ParentSpanID unchanged.
- Around line 105-106: The DrainSession stored via
m.sessions.LoadOrStore(nodeEvent.EventID, &DrainSession{}) can accumulate for
permanently failing events; update the failure/requeue handling to enforce a max
retry threshold (inspect DrainSession.NumRequeues) and when that limit is
exceeded call m.sessions.Delete(nodeEvent.EventID) and drop the event (or mark
it as failed), so sessions are removed instead of kept forever; alternatively,
add a TTL-based cleaner goroutine that scans m.sessions for stale DrainSession
entries and deletes ones older than the TTL—ensure references to DrainSession,
NumRequeues, and m.sessions.Delete(nodeEvent.EventID) are used to locate and
implement the change.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 1d09a282-db68-4477-a8df-902b3e3e8eb5
📒 Files selected for processing (2)
- node-drainer/pkg/queue/queue.go
- node-drainer/pkg/queue/worker.go
🚧 Files skipped from review as they are similar to previous changes (1)
- node-drainer/pkg/queue/queue.go
Merging this branch will decrease overall coverage
Coverage by file: Changed files (no unit tests)
Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.
Changed unit test files
Signed-off-by: Tanisha goyal <tanishag@nvidia.com>
Force-pushed: c5921c1 to f952364
Actionable comments posted: 2
🧹 Nitpick comments (1)
node-drainer/pkg/reconciler/reconciler.go (1)
150-156: Use the same parent-span source as the worker.
This fallback makes `node_drainer.enqueue_event` resolve a different upstream span than `node_drainer.drain_session` when metadata is malformed. The worker path and the recorded invariant both expect the fault-quarantine span ID to be the only parent source here.
♻️ Suggested change
```diff
 traceID := tracing.TraceIDFromMetadata(healthEventWithStatus.HealthEvent.GetMetadata())
 parentSpanID := tracing.ParentSpanID(
 	healthEventWithStatus.HealthEventStatus.SpanIds, tracing.ServiceFaultQuarantine)
-
-if parentSpanID == "" {
-	parentSpanID = tracing.ParentSpanID(healthEventWithStatus.HealthEventStatus.SpanIds, tracing.ServicePlatformConnector)
-}
```
Based on learnings: "In `node-drainer/pkg/queue/worker.go`, events processed by the node-drainer are always expected to carry a `tracing.ServiceFaultQuarantine` span ID. No fallback to `tracing.ServicePlatformConnector` is needed when resolving the parent span ID for creating the `node_drainer.drain_session` span."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@node-drainer/pkg/reconciler/reconciler.go` around lines 150 - 156, The parent-span fallback to tracing.ServicePlatformConnector should be removed so the parentSpanID resolution always uses tracing.ServiceFaultQuarantine; update the block that sets parentSpanID (currently calling tracing.ParentSpanID with healthEventWithStatus.HealthEventStatus.SpanIds) to only call tracing.ParentSpanID(..., tracing.ServiceFaultQuarantine) and drop the conditional that substitutes tracing.ServicePlatformConnector—ensure traceID is still obtained via tracing.TraceIDFromMetadata and downstream use in node_drainer.drain_session / node_drainer.enqueue_event continues to read the single fault-quarantine parent source.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@node-drainer/main.go`:
- Around line 74-77: InitTracing is being called in run() but ShutdownTracing is
only invoked later inside shutdownComponents() which can receive an
already-canceled run context (and run() can return early), so ensure tracing is
always shut down with a live context: immediately after successfully calling
tracing.InitTracing(tracing.ServiceNodeDrainer) register a deferred shutdown
that calls tracing.ShutdownTracing with a fresh non-canceled context (e.g.,
context.Background() or a short context.WithTimeout) so shutdown always runs
even on early returns; similarly apply the same pattern where InitTracing is
used around the code paths referenced at the 372-377 area so ShutdownTracing is
never passed the canceled run context.
In `@node-drainer/pkg/reconciler/reconciler.go`:
- Around line 1164-1173: The tracing attribute node_drainer.error.type should be
a stable, low-cardinality value instead of err.Error(); change the
span.SetAttributes call in the error path around
r.Config.StateManager.UpdateNVSentinelStateNodeLabel(...) so that
attribute.String("node_drainer.error.type", ...) uses a fixed category (e.g.
"state_manager.update_failed" or "update_nv_sentinel_state_node_label_failed")
and keep the full error text in attribute.String("node_drainer.error.message",
err.Error()); leave tracing.RecordError(span, err) as-is and ensure the
slog.ErrorContext log still includes the err for debugging.
---
Nitpick comments:
In `@node-drainer/pkg/reconciler/reconciler.go`:
- Around line 150-156: The parent-span fallback to
tracing.ServicePlatformConnector should be removed so the parentSpanID
resolution always uses tracing.ServiceFaultQuarantine; update the block that
sets parentSpanID (currently calling tracing.ParentSpanID with
healthEventWithStatus.HealthEventStatus.SpanIds) to only call
tracing.ParentSpanID(..., tracing.ServiceFaultQuarantine) and drop the
conditional that substitutes tracing.ServicePlatformConnector—ensure traceID is
still obtained via tracing.TraceIDFromMetadata and downstream use in
node_drainer.drain_session / node_drainer.enqueue_event continues to read the
single fault-quarantine parent source.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 5315abeb-9473-4b7b-8406-942a1e30ebc8
📒 Files selected for processing (12)
- commons/pkg/tracing/tracing.go
- distros/kubernetes/nvsentinel/charts/node-drainer/templates/deployment.yaml
- node-drainer/go.mod
- node-drainer/main.go
- node-drainer/pkg/evaluator/evaluator.go
- node-drainer/pkg/informers/informers.go
- node-drainer/pkg/initializer/init.go
- node-drainer/pkg/queue/queue.go
- node-drainer/pkg/queue/types.go
- node-drainer/pkg/queue/worker.go
- node-drainer/pkg/reconciler/reconciler.go
- node-drainer/pkg/reconciler/reconciler_integration_test.go
✅ Files skipped from review due to trivial changes (3)
- commons/pkg/tracing/tracing.go
- node-drainer/pkg/initializer/init.go
- node-drainer/go.mod
🚧 Files skipped from review as they are similar to previous changes (3)
- node-drainer/pkg/queue/types.go
- node-drainer/pkg/informers/informers.go
- node-drainer/pkg/queue/queue.go
Merging this branch will decrease overall coverage
Coverage by file: Changed files (no unit tests)
Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.
Signed-off-by: Tanisha goyal <tanishag@nvidia.com>
Force-pushed: 898a308 to d9b72b1
🧹 Nitpick comments (1)
node-drainer/pkg/reconciler/reconciler.go (1)
329-337: Use typed sentinel errors instead of classifying errors by message text.
The `isRequeueSignal` function determines retry behavior based on substring matching ("requeu") and prefix matching ("waiting for") in error messages. This creates a fragile contract: any change to the wording of these error messages will silently break requeue behavior without compilation failures. Instead, define typed sentinel errors and use `errors.Is` or `errors.As` for classification.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@node-drainer/pkg/reconciler/reconciler.go` around lines 329 - 337, The current isRequeueSignal(err error) implementation relies on fragile string matching; replace that with typed sentinel errors (e.g., declare package-level vars like ErrRequeue and ErrWaitingFor) and update callers to return or wrap those sentinel errors using fmt.Errorf(...%w...) so callers can be identified; then change isRequeueSignal to return errors.Is(err, ErrRequeue) || errors.Is(err, ErrWaitingFor). Ensure you add the sentinel error declarations and update any places that currently construct `"requeu..."` or `"waiting for..."` error messages to return or wrap the new sentinel errors instead.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@node-drainer/pkg/reconciler/reconciler.go`:
- Around line 329-337: The current isRequeueSignal(err error) implementation
relies on fragile string matching; replace that with typed sentinel errors
(e.g., declare package-level vars like ErrRequeue and ErrWaitingFor) and update
callers to return or wrap those sentinel errors using fmt.Errorf(...%w...) so
callers can be identified; then change isRequeueSignal to return errors.Is(err,
ErrRequeue) || errors.Is(err, ErrWaitingFor). Ensure you add the sentinel error
declarations and update any places that currently construct `"requeu..."` or
`"waiting for..."` error messages to return or wrap the new sentinel errors
instead.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 1cace18f-3e11-4dbe-b2d9-e3a092de1278
📒 Files selected for processing (2)
- node-drainer/main.go
- node-drainer/pkg/reconciler/reconciler.go
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
node-drainer/pkg/reconciler/reconciler.go (1)
1185-1193: ⚠️ Potential issue | 🟡 Minor: Record the early document-ID failure on the custom-drain span.
`node_drainer.execute_custom_drain` is already open here, but the `ExtractDocumentID` failure path returns without `tracing.RecordError(...)` or error attributes. In traces, this looks like a clean span for a failed operation.
💡 Proposed fix
```diff
 eventID, err := utils.ExtractDocumentID(event)
 if err != nil {
+	tracing.RecordError(span, err)
+	span.SetAttributes(
+		attribute.String("node_drainer.error.type", "extract_document_id_error"),
+		attribute.String("node_drainer.error.message", err.Error()),
+	)
+
 	return fmt.Errorf("failed to extract document ID for custom drain: %w", err)
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@node-drainer/pkg/reconciler/reconciler.go` around lines 1185 - 1193, When utils.ExtractDocumentID(event) returns an error inside the node_drainer.execute_custom_drain span, record that failure on the open span before returning: call tracing.RecordError(ctx, span, err) (and optionally span.SetAttributes with a short attribute like attribute.String("event.extract_error", err.Error())) so the trace shows the error; ensure this is done immediately after the ExtractDocumentID error and before the function returns using the existing span variable.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@node-drainer/pkg/reconciler/reconciler.go`:
- Around line 505-509: The phase-marker calls (e.g., the call to
r.endPhaseMarkers using traceID := tracing.TraceIDFromMetadata(...) and
parentSpanID := tracing.ParentSpanID(healthEvent.HealthEventStatus.SpanIds,
tracing.ServiceFaultQuarantine)) must preserve the same upstream fallback used
on enqueue: if ParentSpanID with tracing.ServiceFaultQuarantine returns an
empty/zero value, call ParentSpanID again with tracing.ServicePlatformConnector
and use that result as parentSpanID; update the parentSpanID selection logic
before calling r.endPhaseMarkers and apply the identical fallback change to the
other occurrences that build parentSpanID (the other blocks that currently pass
tracing.ServiceFaultQuarantine).
- Around line 288-297: The error returned from the EnqueueEventGeneric call is
returned raw (return err) so callers lose that this failure occurred during
enqueue; wrap the error with local context using fmt.Errorf (for example
"enqueue event failed for node %s document %s: %w") before returning to preserve
where it happened. Update the block around r.queueManager.EnqueueEventGeneric /
tracing.RecordError / span.SetAttributes to return a wrapped error (including
nodeName and documentID) instead of the raw err.
---
Outside diff comments:
In `@node-drainer/pkg/reconciler/reconciler.go`:
- Around line 1185-1193: When utils.ExtractDocumentID(event) returns an error
inside the node_drainer.execute_custom_drain span, record that failure on the
open span before returning: call tracing.RecordError(ctx, span, err) (and
optionally span.SetAttributes with a short attribute like
attribute.String("event.extract_error", err.Error())) so the trace shows the
error; ensure this is done immediately after the ExtractDocumentID error and
before the function returns using the existing span variable.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 228c0d62-9e8b-407b-bfe0-bfcbf115f4a1
📒 Files selected for processing (2)
- node-drainer/main.go
- node-drainer/pkg/reconciler/reconciler.go
🚧 Files skipped from review as they are similar to previous changes (1)
- node-drainer/main.go
Merging this branch will decrease overall coverage
Coverage by file: Changed files (no unit tests)
Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.
Merging this branch will increase overall coverage
Coverage by file: Changed files (no unit tests)
Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.
Changed unit test files
```go
span.SetAttributes(
	attribute.String("node_drainer.custom_cr.name", crName),
	attribute.String("node_drainer.custom_cr.status", "in_progress"),
```
do we need the status? won't that be evident from the trace not completing?
```go
defer span.End()
```

```go
span.SetAttributes(
	attribute.String("node_drainer.user_pods_eviction_status.status", string(model.StatusInProgress)),
```
Same question as earlier, aren't the status evident from the span itself?
```go
attribute.String("node_drainer.partial_drain.entity_type", action.PartialDrainEntity.EntityType),
attribute.String("node_drainer.partial_drain.entity_value", action.PartialDrainEntity.EntityValue),
```
aren't these already in the health event, do we want these in the trace too?
```go
_, startSpan := tracing.StartSpanWithLinkFromTraceContext(
	ctx, traceID, parentSpanID, "node_drainer.phase.immediate_eviction.start")
startSpan.End()
```
I'm a little confused by this, shouldn't the span continue till the immediate_eviction is completed? why do we want short-lived spans like this?
I added this intentionally to avoid span noise in the trace. We check the pods in each mode on every reconciliation, which runs at a 2-minute interval, and pods can run for days. So, to avoid adding new spans for every reconcile, I am adding short-lived spans.
We add mode.start as soon as we enter that eviction mode for the first time, and mode.end when we move to the next eviction mode (e.g., immediate mode to deleteaftertimeout mode).
Can't we just start one span and leave it open until completion?
We do need to trace every GET API call made during that duration as well, irrespective of how long the reconcile cycle lasts.
We can leave it open, but if node-drainer crashes then we lose the previously created spans, since a long-running span never finishes during a crash.
Can't we save the ID to MongoDB like we do with the parent ones?
```go
slog.InfoContext(ctx, "Deleted custom drain CR",
	"node", nodeName,
	"crName", crName)
span.SetAttributes(attribute.Bool("node_drainer.custom_cr.deleted", true))
```
do we need this attribute, isn't it understood due to the execution flow?
Summary
Type of Change
Component(s) Affected
Testing
Checklist
Testing
Tested the traces integration changes in Tilt and a dev cluster. We have tested various negative and positive cases, and the output of the exported traces is shown below.
A span is created for the status update, along with the final status that was updated in the DB.
`node_drainer.phase.allow_completion.start` span, both at a gap of 5 min.
Summary by CodeRabbit
New Features
Bug Fixes
Chores