feat: Add ability to have custom remediation actions#1145
feat: Add ability to have custom remediation actions#1145XRFXLP wants to merge 11 commits intoNVIDIA:mainfrom
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughAdds explicit support for custom remediation actions: new Changes
Sequence DiagramsequenceDiagram
participant Client as Client
participant PlatformConn as Platform Connector
participant EventExporter as Event Exporter
participant Remediation as Fault Remediation
participant EqGroup as Equivalence Groups
participant Config as Remediation Config
Client->>PlatformConn: Send HealthEvent (RecommendedAction=CUSTOM, customRecommendedAction="my-action")
PlatformConn->>PlatformConn: Validate customRecommendedAction present
alt customRecommendedAction empty
PlatformConn->>Client: gRPC InvalidArgument error
else valid
PlatformConn->>EventExporter: Forward / Pipeline processing
EventExporter->>Remediation: Emit/Enqueue CloudEvent (includes customRecommendedAction)
Remediation->>Remediation: actionName = GetEffectiveActionName(HealthEvent)
Remediation->>EqGroup: GetGroupConfigForEvent(actionName)
EqGroup->>Config: Lookup actionName in remediationActions
Config->>EqGroup: Return action config/template or nil
EqGroup->>Remediation: Return equivalence group / impacted scope
Remediation->>Client: Create remediation CR / update node annotation
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 golangci-lint (2.11.4)level=error msg="[linters_context] typechecking error: pattern ./...: directory prefix . does not contain main module or its selected dependencies" Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
distros/kubernetes/nvsentinel/charts/k8s-datastore/crds/health_event.crd.yaml (1)
124-125: Consider adding CRD-level validation for CUSTOM action payloads.Right now the schema allows
recommendedAction: CUSTOMwith an emptycustomRecommendedAction. Adding a CEL guard would prevent invalid CRs created outside gRPC ingress.🧩 Suggested schema rule
customRecommendedAction: type: string + x-kubernetes-validations: + - rule: "!(has(self.recommendedAction) && (self.recommendedAction == 27 || self.recommendedAction == 'CUSTOM')) || (has(self.customRecommendedAction) && self.customRecommendedAction != '')" + message: "customRecommendedAction must be non-empty when recommendedAction is CUSTOM"🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@distros/kubernetes/nvsentinel/charts/k8s-datastore/crds/health_event.crd.yaml` around lines 124 - 125, Add a CRD-level CEL validation so that when the field recommendedAction equals "CUSTOM" the field customRecommendedAction must be non-empty; update the CRD's openAPIV3Schema/validation.rules (or the validation section where CEL expressions live) to enforce the constraint by referencing the fields recommendedAction and customRecommendedAction and using a CEL guard equivalent to: if recommendedAction == "CUSTOM" then customRecommendedAction must not be empty. Ensure the rule is attached to the health event CRD schema so CRs created outside the gRPC ingress cannot omit customRecommendedAction when choosing CUSTOM.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@data-models/protobufs/health_event.proto`:
- Line 113: The protobuf field customRecommendedAction lacks a descriptive
comment; add a clear single-line comment immediately above the field declaration
for customRecommendedAction describing its purpose, expected contents (e.g.,
free-text recommendation or structured action ID), examples or format
constraints, and any length/encoding limits so generated schemas and API docs
are meaningful; ensure the comment uses protobuf comment style (//) and update
any related documentation/generation steps if applicable.
In `@platform-connectors/pkg/server/platform_connector_server.go`:
- Around line 74-78: The check that rejects empty customRecommendedAction should
also reject whitespace-only values; update the conditional in the handler that
inspects event.RecommendedAction and event.CustomRecommendedAction to use
strings.TrimSpace(event.CustomRecommendedAction) (or equivalent) and treat a
trimmed empty string as invalid, returning the same InvalidArgument status (keep
context: event.NodeName, event.Agent) when RecommendedAction ==
pb.RecommendedAction_CUSTOM and the trimmed customRecommendedAction is empty.
---
Nitpick comments:
In
`@distros/kubernetes/nvsentinel/charts/k8s-datastore/crds/health_event.crd.yaml`:
- Around line 124-125: Add a CRD-level CEL validation so that when the field
recommendedAction equals "CUSTOM" the field customRecommendedAction must be
non-empty; update the CRD's openAPIV3Schema/validation.rules (or the validation
section where CEL expressions live) to enforce the constraint by referencing the
fields recommendedAction and customRecommendedAction and using a CEL guard
equivalent to: if recommendedAction == "CUSTOM" then customRecommendedAction
must not be empty. Ensure the rule is attached to the health event CRD schema so
CRs created outside the gRPC ingress cannot omit customRecommendedAction when
choosing CUSTOM.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: a32f72ad-9094-4eb0-9ab4-dd23ca1046d0
⛔ Files ignored due to path filters (2)
data-models/pkg/protos/health_event.pb.gois excluded by!**/*.pb.godata-models/pkg/protos/health_event_grpc.pb.gois excluded by!**/*.pb.go
📒 Files selected for processing (8)
data-models/pkg/model/health_event_extentions.godata-models/protobufs/health_event.protodistros/kubernetes/nvsentinel/charts/k8s-datastore/crds/health_event.crd.yamlevent-exporter/pkg/transformer/cloudevents.gofault-remediation/pkg/common/equivalence_groups.gofault-remediation/pkg/config/config.gofault-remediation/pkg/remediation/remediation.goplatform-connectors/pkg/server/platform_connector_server.go
There was a problem hiding this comment.
🧹 Nitpick comments (1)
fault-remediation/pkg/config/config_test.go (1)
256-276: Scenario intent and fixture are slightly misaligned.The case name says “Built-in actions other than COMPONENT_RESET…”, but only
RESTART_BMis configured withImpactedEntityScope. This mostly duplicates the previous case and does not directly validateRESTART_VMrejection.♻️ Suggested test fixture adjustment
{ name: "Built-in actions other than COMPONENT_RESET cannot have an ImpactedEntityScope", config: TomlConfig{ Template: Template{MountPath: tempDir}, RemediationActions: map[string]MaintenanceResource{ "RESTART_VM": { TemplateFileName: "template-a.yaml", Scope: "Cluster", EquivalenceGroup: "restart", + ImpactedEntityScope: "GPU_UUID", }, "RESTART_BM": { TemplateFileName: "template-b.yaml", Scope: "Namespaced", Namespace: "test-namespace", EquivalenceGroup: "reset", SupersedingEquivalenceGroups: []string{"restart"}, - ImpactedEntityScope: "GPU_UUID", }, }, }, expectError: true, - errorSubstr: "built-in action 'RESTART_BM' cannot have an ImpactedEntityScope", + errorSubstr: "built-in action 'RESTART_VM' cannot have an ImpactedEntityScope", },🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@fault-remediation/pkg/config/config_test.go` around lines 256 - 276, The test case name and fixture are misaligned: to validate that built-in actions other than COMPONENT_RESET cannot have an ImpactedEntityScope, add an ImpactedEntityScope to the "RESTART_VM" entry in the TomlConfig.RemediationActions (or alternatively rename the case to reflect that only "RESTART_BM" is tested); update the MaintenanceResource for "RESTART_VM" to include ImpactedEntityScope (e.g., "GPU_UUID") so the test actually asserts rejection for a non-COMPONENT_RESET built-in action (keep expectError and errorSubstr as-is to verify the expected failure).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@fault-remediation/pkg/config/config_test.go`:
- Around line 256-276: The test case name and fixture are misaligned: to
validate that built-in actions other than COMPONENT_RESET cannot have an
ImpactedEntityScope, add an ImpactedEntityScope to the "RESTART_VM" entry in the
TomlConfig.RemediationActions (or alternatively rename the case to reflect that
only "RESTART_BM" is tested); update the MaintenanceResource for "RESTART_VM" to
include ImpactedEntityScope (e.g., "GPU_UUID") so the test actually asserts
rejection for a non-COMPONENT_RESET built-in action (keep expectError and
errorSubstr as-is to verify the expected failure).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 20eb3a76-03e1-4371-b2d1-54bda89bac6e
⛔ Files ignored due to path filters (1)
data-models/pkg/protos/health_event.pb.gois excluded by!**/*.pb.go
📒 Files selected for processing (2)
fault-quarantine/pkg/evaluator/rule_evaluator_test.gofault-remediation/pkg/config/config_test.go
✅ Files skipped from review due to trivial changes (1)
- fault-quarantine/pkg/evaluator/rule_evaluator_test.go
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/configuration/fault-remediation.md`:
- Line 106: The doc incorrectly states that only COMPONENT_RESET and custom
actions support impactedEntityScope; instead reword to say impactedEntityScope
is honored whenever an action's config includes the impactedEntityScope field
(the implementation in equivalence_groups.go applies it universally), so remove
the action-type restriction and update the example text (mentions of
COMPONENT_RESET and RESTART_BM can remain as examples but not as exclusive
support). Reference the impactedEntityScope field and the equivalence-group
construction logic in equivalence_groups.go when editing the sentence.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 188225ed-3ac5-4381-90d0-2fec2ff7657a
📒 Files selected for processing (1)
docs/configuration/fault-remediation.md
There was a problem hiding this comment.
Actionable comments posted: 8
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (6)
platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go (1)
141-1528:⚠️ Potential issue | 🔴 CriticalFix custom recommended action handling and add test coverage.
The implementation has a bug:
constructHealthEventMessage()only callsGetRecommendedAction(), which returns the default enum valueNONEwhen the protobuf'soneof actionfield is set to thecustomRecommendedActionstring branch. Custom action strings are silently lost. Update the message formatting to checkGetCustomRecommendedAction()first, falling back toGetRecommendedAction()only if the custom value is empty.Add a test case with
HealthEvent_CustomRecommendedAction{CustomRecommendedAction: "CUSTOM_VALUE"}to verify the custom string appears correctly in the node condition message.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go` around lines 141 - 1528, constructHealthEventMessage() currently only uses GetRecommendedAction(), which drops protobuf oneof custom string values; change it to prefer GetCustomRecommendedAction() (use it when non-empty) and fall back to GetRecommendedAction() only if the custom string is empty so customRecommendedAction is preserved; update message formatting logic that appends "Recommended Action=" to use the chosen value, and add a unit test that constructs a HealthEvent with HealthEvent_CustomRecommendedAction{CustomRecommendedAction: "CUSTOM_VALUE"} to assert the resulting node condition message contains "Recommended Action=CUSTOM_VALUE".fault-remediation/pkg/reconciler/reconciler_test.go (1)
297-360:⚠️ Potential issue | 🟡 MinorThis test still bypasses the reconciler logic.
Both the existing cases and the new custom-action case call the mock
CreateMaintenanceResourcedirectly, so they don't verify any of the action handling inhandleRemediationEvent/performRemediation. That gives false confidence for the new feature path. Please invoke the reconciler method under test and assert on the mock inputs from there.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@fault-remediation/pkg/reconciler/reconciler_test.go` around lines 297 - 360, The test is calling the mock CreateMaintenanceResource directly instead of exercising the reconciler flow; change the test to invoke the reconciler method under test (e.g., call r.handleRemediationEvent or r.performRemediation with the constructed healthEventData and groupConfig on the reconciler returned by NewFaultRemediationReconciler) so the reconciler code executes its action handling, and keep the MockK8sClient.createMaintenanceResourceFn assertions to validate the inputs and return values; then assert the reconciler result (error or CR name) matches tt.expectedError / expected CR name instead of calling CreateMaintenanceResource directly.health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go (1)
335-356:⚠️ Potential issue | 🟠 MajorPreserve custom remediation names instead of downgrading them to
NONE.The
HealthEventprotobuf message defines acustomRecommendedActionfield (via oneof) to support custom actions, but the Go code only uses theRecommendedActionenum path and silently converts all unknown action strings toNONE. This loses custom action names before they reach the platform connector.Update the action assignment to check the enum first, then fall back to the custom action field for unknown non-empty strings:
Proposed fix
- actionEnum, ok := pb.RecommendedAction_value[event.RecommendedAction] - if !ok { - slog.Warn( - "Unknown recommended action; defaulting to NONE.", - "recommendedAction", - event.RecommendedAction, - "eventID", - event.EventID, - ) - - actionEnum = int32(pb.RecommendedAction_NONE) - } - healthEvent := &pb.HealthEvent{ Agent: "csp-health-monitor", // Consistent agent name ComponentClass: event.ResourceType, // e.g., "EC2", "gce_instance" CheckName: "CSPMaintenance", // Consistent check name IsFatal: isFatal, IsHealthy: isHealthy, ProcessingStrategy: e.processingStrategy, Message: message, - Action: &pb.HealthEvent_RecommendedAction{RecommendedAction: pb.RecommendedAction(actionEnum)}, EntitiesImpacted: []*pb.Entity{ { EntityType: event.ResourceType, EntityValue: event.ResourceID, // CSP's ID (e.g., instance-id, full gcp resource name) }, @@ GeneratedTimestamp: timestamppb.New(time.Now()), } + + if actionEnum, ok := pb.RecommendedAction_value[event.RecommendedAction]; ok { + healthEvent.Action = &pb.HealthEvent_RecommendedAction{ + RecommendedAction: pb.RecommendedAction(actionEnum), + } + } else if event.RecommendedAction != "" { + healthEvent.Action = &pb.HealthEvent_CustomRecommendedAction{ + CustomRecommendedAction: event.RecommendedAction, + } + } else { + healthEvent.Action = &pb.HealthEvent_RecommendedAction{ + RecommendedAction: pb.RecommendedAction_NONE, + } + }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go` around lines 335 - 356, The current code maps event.RecommendedAction only via pb.RecommendedAction_value and replaces unknowns with NONE, losing custom remediation names; change the Action assignment logic in the HealthEvent construction: first attempt to map event.RecommendedAction via pb.RecommendedAction_value (as you already do) and if found set Action = &pb.HealthEvent_RecommendedAction{RecommendedAction: pb.RecommendedAction(actionEnum)}; if not found and event.RecommendedAction is a non-empty string set Action = &pb.HealthEvent_CustomRecommendedAction{CustomRecommendedAction: event.RecommendedAction}; otherwise fall back to setting RecommendedAction to pb.RecommendedAction_NONE. Update references around pb.RecommendedAction_value, event.RecommendedAction, and the HealthEvent Action oneof to implement this fallback.event-exporter/pkg/transformer/cloudevents.go (1)
53-71:⚠️ Potential issue | 🟠 MajorDon't emit
recommendedAction: "NONE"for custom-action events.The
HealthEvent.actionfield is a protobuf oneof. When the active arm iscustomRecommendedAction, callingevent.GetRecommendedAction()returnsRecommendedAction_NONEbecause the type assertion fails. This causes custom-action events to be exported with both"recommendedAction": "NONE"and"customRecommendedAction": "actual_value", which mislabels the exported event for downstream consumers. Serialize the active oneof arm explicitly instead.Proposed fix
healthEventData := map[string]any{ "version": event.Version, "agent": event.Agent, "componentClass": event.ComponentClass, "checkName": event.CheckName, "isFatal": event.IsFatal, "isHealthy": event.IsHealthy, "message": event.Message, - "recommendedAction": event.GetRecommendedAction().String(), "errorCode": errorCodes, "entitiesImpacted": entities, "generatedTimestamp": timestamp, "nodeName": event.NodeName, "processingStrategy": event.ProcessingStrategy.String(), } - if event.GetCustomRecommendedAction() != "" { - healthEventData["customRecommendedAction"] = event.GetCustomRecommendedAction() + switch action := event.Action.(type) { + case *pb.HealthEvent_RecommendedAction: + healthEventData["recommendedAction"] = action.RecommendedAction.String() + case *pb.HealthEvent_CustomRecommendedAction: + healthEventData["customRecommendedAction"] = action.CustomRecommendedAction }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@event-exporter/pkg/transformer/cloudevents.go` around lines 53 - 71, The exported map healthEventData is always populated with event.GetRecommendedAction(), which yields RecommendedAction_NONE when the oneof active arm is customRecommendedAction; update the serialization to only set "recommendedAction" when the protobuf-arm for the enum is actually active (e.g., when event.GetCustomRecommendedAction() == "" or by switching on the oneof field), and keep the existing clause that sets "customRecommendedAction" from event.GetCustomRecommendedAction(); in short, only add healthEventData["recommendedAction"] = event.GetRecommendedAction().String() when the enum arm is the active one, otherwise omit it so custom-action events are not emitted with "NONE".distros/kubernetes/nvsentinel/charts/k8s-datastore/crds/health_event.crd.yaml (1)
49-74:⚠️ Potential issue | 🟠 MajorEnforce the oneOf contract in the CRD schema.
The proto definition (data-models/protobufs/health_event.proto, lines 101-104) explicitly declares a
oneof actionthat enforces mutual exclusivity betweenrecommendedActionandcustomRecommendedAction. However, the generated CRD schema fails to translate this constraint into Kubernetes validation rules, allowing invalid CRs to be admitted with both fields set simultaneously or with an emptycustomRecommendedActionstring. This violates the protobuf contract and will cause downstream rejection or misinterpretation.Add
x-kubernetes-validationsto the proto source to ensure the CRD generator includes:
- Mutual exclusivity: only one of
recommendedActionorcustomRecommendedActionmay be set- Non-empty constraint:
customRecommendedActionmust be a non-empty string when presentSuggested validation shape
+ x-kubernetes-validations: + - rule: "!(has(self.recommendedAction) && has(self.customRecommendedAction))" + message: "only one of recommendedAction or customRecommendedAction may be set" + - rule: "!has(self.customRecommendedAction) || self.customRecommendedAction != ''" + message: "customRecommendedAction must be non-empty"🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@distros/kubernetes/nvsentinel/charts/k8s-datastore/crds/health_event.crd.yaml` around lines 49 - 74, Add x-kubernetes-validations to the CRD schema for the oneof named "action" so the generator enforces mutual exclusivity and non-empty custom strings: add a validation rule using a CEL expression that enforces exactly one of recommendedAction or customRecommendedAction is present (e.g. has(self.recommendedAction) != has(self.customRecommendedAction)) and a second rule that, when customRecommendedAction is present, it is non-empty (e.g. !has(self.customRecommendedAction) || self.customRecommendedAction != ""), referencing the fields recommendedAction and customRecommendedAction so the generated CRD will reject objects that set both or an empty customRecommendedAction.platform-connectors/pkg/transformers/overrides/transformer.go (1)
42-47:⚠️ Potential issue | 🟠 MajorCustom actions are incorrectly flattened to
NONEin state capture and comparison.The
GetRecommendedAction()getter on line 46 returnsRecommendedAction_NONEas a default when theactiononeof is set tocustomRecommendedActioninstead ofrecommendedAction. This causes:
- Line 46: Original custom action is lost during state capture.
- Line 144: Change detection fails when comparing enum overrides against custom actions (both resolve to
NONE), potentially skipping valid override operations.- Line 165: Logging shows
NONEinstead of the actual custom action.The fix requires detecting which oneof branch is active and preserving the actual value:
Suggested fix
type originalState struct { isFatal bool isHealthy bool - recommendedAction pb.RecommendedAction + actionName string } +func getActionName(event *pb.HealthEvent) string { + switch a := event.Action.(type) { + case *pb.HealthEvent_RecommendedAction: + return a.RecommendedAction.String() + case *pb.HealthEvent_CustomRecommendedAction: + return a.CustomRecommendedAction + default: + return pb.RecommendedAction_NONE.String() + } +} + func captureOriginalState(event *pb.HealthEvent) originalState { return originalState{ - isFatal: event.IsFatal, - isHealthy: event.IsHealthy, - recommendedAction: event.GetRecommendedAction(), + isFatal: event.IsFatal, + isHealthy: event.IsHealthy, + actionName: getActionName(event), } } if rule.override.RecommendedAction != nil { newAction, err := rule.override.ParseRecommendedAction() @@ -142,10 +144,12 @@ - if newAction != original.recommendedAction { + currentRA, isRecommended := event.Action.(*pb.HealthEvent_RecommendedAction) + if !isRecommended || currentRA.RecommendedAction != newAction { event.Action = &pb.HealthEvent_RecommendedAction{RecommendedAction: newAction} slog.InfoContext(ctx, "Applied health event override", - "original_action", original.recommendedAction.String(), - "new_action", event.GetRecommendedAction().String()) + "original_action", original.actionName, + "new_action", getActionName(event)) }Also applies to: 144-146, 164-165
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@platform-connectors/pkg/transformers/overrides/transformer.go` around lines 42 - 47, captureOriginalState currently uses the protobuf-generated GetRecommendedAction() which returns RecommendedAction_NONE when the oneof is set to customRecommendedAction, causing custom actions to be lost; update captureOriginalState to detect which oneof branch is active (use the oneof-specific getters such as GetCustomRecommendedAction() vs GetRecommendedAction() or the generated ActionCase/XXXCase accessor on pb.HealthEvent), store both the enum recommendedAction and the customRecommendedAction (or a flag indicating which branch is set) into originalState, and then update the change-detection logic (the comparison around the earlier-mentioned enum checks) to compare the enum when recommendedAction is set and compare the custom string when customRecommendedAction is set; also update the logging to print the preserved customRecommendedAction when present instead of showing NONE.
🧹 Nitpick comments (4)
fault-quarantine/pkg/evaluator/rule_evaluator.go (1)
300-300: Consider adding a defensive check before double-dereference.The
v.Elem().Elem()pattern relies on protobuf-go's convention that oneof variants are always pointer types. While this holds for standard protobuf-generated code, a kind check would make the code more robust against unexpected inputs.🛡️ Optional defensive check
func handleOneofField(v reflect.Value, result map[string]interface{}) { if v.IsNil() { return } - concrete := v.Elem().Elem() + elem := v.Elem() + if elem.Kind() != reflect.Ptr || elem.IsNil() { + return + } + + concrete := elem.Elem() concreteType := concrete.Type()🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@fault-quarantine/pkg/evaluator/rule_evaluator.go` at line 300, The double-dereference using v.Elem().Elem() is unsafe for unexpected inputs; before assigning concrete, add defensive checks that v is valid and a pointer and that v.Elem() is valid and also a pointer (e.g., verify v.IsValid(), v.Kind()==reflect.Ptr, v.Elem().IsValid(), v.Elem().Kind()==reflect.Ptr) and handle the failure path (return an error or skip processing) so that the code around the concrete variable (the v and concrete usage in rule_evaluator.go) does not panic on non-conforming values.tilt/simple-health-client/main.go (1)
53-54: Prefer structured debug logs withsloghere.Line 53 and Line 78 still use formatted
log.Printf; please switch these updated debug logs to structuredslog.Debugfields for consistency and easier filtering.♻️ Suggested change
- log.Printf("[DEBUG] Received health event - Node: %s, CheckName: %s, Agent: %s, IsFatal: %v, RecommendedAction: %v", - healthEvent.NodeName, healthEvent.CheckName, healthEvent.Agent, healthEvent.IsFatal, healthEvent.GetRecommendedAction()) + slog.Debug("Received health event", + "node", healthEvent.NodeName, + "check_name", healthEvent.CheckName, + "agent", healthEvent.Agent, + "is_fatal", healthEvent.IsFatal, + "recommended_action", healthEvent.GetRecommendedAction().String(), + ) @@ - log.Printf("[DEBUG] Sending health event to platform-connector - Node: %s, CheckName: %s, RecommendedAction: %v", - healthEvent.NodeName, healthEvent.CheckName, healthEvent.GetRecommendedAction()) + slog.Debug("Sending health event to platform-connector", + "node", healthEvent.NodeName, + "check_name", healthEvent.CheckName, + "recommended_action", healthEvent.GetRecommendedAction().String(), + )As per coding guidelines, "Use structured logging via
log/slogin Go code".Also applies to: 78-79
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tilt/simple-health-client/main.go` around lines 53 - 54, Replace the formatted log.Printf debug calls that print healthEvent fields with structured slog logging: call slog.Debug (or use slog.With(...).Debug) and emit key/value fields for Node (healthEvent.NodeName), CheckName (healthEvent.CheckName), Agent (healthEvent.Agent), IsFatal (healthEvent.IsFatal) and RecommendedAction (healthEvent.GetRecommendedAction()); do the same for the other formatted log.Printf occurrence around the health event handling (the one using healthEvent.GetRecommendedAction()), ensuring you use consistent field keys and structured slog calls instead of printf formatting.event-exporter/pkg/transformer/cloudevents_test.go (1)
30-259: Add a custom-action CloudEvent case.This table only covers the enum arm of
HealthEvent.Action. Please add a case withCustomRecommendedActionand assert that the payload includescustomRecommendedActionwithout backfillingrecommendedActionto"NONE"; that is the failure mode most likely to slip through here.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@event-exporter/pkg/transformer/cloudevents_test.go` around lines 30 - 259, Add a new table test entry in the tests slice that exercises the non-enum arm of HealthEvent.Action by setting event.Action = &pb.HealthEvent_CustomRecommendedAction{CustomRecommendedAction: "do-thing"}; in its validateFunc, assert the resulting CloudEvent (CloudEvent ce returned by TransformHealthEvent or equivalent) contains healthEvent["customRecommendedAction"] == "do-thing" and that healthEvent does not incorrectly backfill healthEvent["recommendedAction"] to "NONE" (i.e., either recommendedAction is absent or not equal to "NONE"); reference the tests slice, pb.HealthEvent and the CloudEvent struct when adding this case and its validateFunc.fault-remediation/pkg/common/equivalence_groups_test.go (1)
51-101: Add a custom-action case to this table.
TestGetGroupConfigForEventnow exercises only the enum arm ofHealthEvent.Action. ACustomRecommendedActioncase here would lock the new lookup path directly and catch regressions before they surface in higher-level reconciler tests.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@fault-remediation/pkg/common/equivalence_groups_test.go` around lines 51 - 101, Add a table case that exercises the CustomRecommendedAction arm: add a test entry (e.g., name "Custom recommended action") where healthEvent.Action is set to &protos.HealthEvent_CustomRecommendedAction{CustomRecommendedAction: "my_custom_action"}; set expectError appropriately (usually false) and set expectedGroupConfig to the expected result for a custom action (nil if no mapping exists) so TestGetGroupConfigForEvent exercises the custom-action lookup path in the GetGroupConfigForEvent logic.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@commons/pkg/tracing/span_attributes.go`:
- Line 78: Replace the direct enum stringization used in the span attribute
(attribute.String("health_event.recommended_action",
event.GetRecommendedAction().String())) with the helper that preserves custom
names: call model.GetEffectiveActionName(event) and pass its result into
attribute.String so both enum and customRecommendedAction values are preserved
in tracing; update the reference in the span_attributes code where
attribute.String is invoked for "health_event.recommended_action".
In `@health-monitors/kubernetes-object-monitor/pkg/publisher/publisher.go`:
- Line 92: The current code always constructs the enum oneof branch using
Action: &pb.HealthEvent_RecommendedAction{RecommendedAction:
mapRecommendedAction(...)} which coerces unknown/custom strings to
CONTACT_SUPPORT and drops custom names; change the logic so mapRecommendedAction
returns a clear signal for “unknown/custom” (e.g., an optional/nullable enum or
an OK flag) and then, when the input policy.HealthEvent.RecommendedAction is not
a known enum, populate the HealthEvent_CustomRecommendedAction oneof with the
original string instead of the enum branch; update all occurrences (the existing
use at the Action construction and the similar block around lines 146-151) to
choose between pb.HealthEvent_RecommendedAction (for known enums) and
pb.HealthEvent_CustomRecommendedAction (for custom names) while preserving the
original string for the custom branch.
In `@health-monitors/slurm-drain-monitor/pkg/publisher/publisher.go`:
- Line 118: The code currently funnels every recommended action through
mapRecommendedAction(r.RecommendedAction) which maps unknown strings to
CONTACT_SUPPORT and thus strips custom action names; change the publisher to
preserve custom actions by sending the original string when mapRecommendedAction
would return the generic CONTACT_SUPPORT (i.e., call
mapRecommendedAction(r.RecommendedAction) but if the result equals
pb.HealthEvent_RecommendedAction_CONTACT_SUPPORT and r.RecommendedAction is a
non-empty, non-standard value, set Action to a RecommendedAction containing the
original r.RecommendedAction string instead of the CONTACT_SUPPORT enum); update
the logic around mapRecommendedAction and the Action assignment to prefer the
explicit custom string while still using the mapped enum for known values.
In `@node-drainer/pkg/evaluator/evaluator.go`:
- Line 647: The partial-drain branch currently only checks
healthEvent.GetRecommendedAction(), which misses customRecommendedAction oneof
values; update the condition (near the check using e.config.PartialDrainEnabled
and healthEvent.GetRecommendedAction()) to also detect when a
customRecommendedAction semantically maps to COMPONENT_RESET by inspecting the
customRecommendedAction oneof (e.g., via
healthEvent.GetCustomRecommendedAction() or the generated RecommendedActionCase
accessor) and comparing its semantic value to
protos.RecommendedAction_COMPONENT_RESET (or the equivalent enum/field on the
custom payload); ensure both the standard enum and customRecommendedAction paths
trigger the partial-drain branch when they represent COMPONENT_RESET.
In `@platform-connectors/pkg/connectors/kubernetes/process_node_events.go`:
- Line 484: The code appends healthEvent.GetRecommendedAction().String(), which
yields "NONE" when the oneof is set to customRecommendedAction and thus drops
the custom text; update the logic around message assembly (where message +=
fmt.Sprintf("Recommended Action=%s;", ... ) is built) to first check
healthEvent.GetCustomRecommendedAction() and if non-empty use that string,
otherwise fall back to healthEvent.GetRecommendedAction().String(); locate the
usage by looking for GetRecommendedAction, GetCustomRecommendedAction, and the
line that appends "Recommended Action=" and change it to prefer the custom
string before the enum.
In `@platform-connectors/pkg/transformers/overrides/cel.go`:
- Line 124: The CEL event map is using event.GetRecommendedAction().String()
which returns the enum value only and misses custom remediation strings; change
the map entry for "recommendedAction" to compute the effective name first (e.g.,
var action := event.GetRecommendedAction().String(); if action == "" { action =
event.GetCustomRecommendedAction() } ) and then set "recommendedAction": action
so customRecommendedAction is used when present; reference
event.GetRecommendedAction(), event.GetCustomRecommendedAction(), and the
"recommendedAction" map entry when making the change.
In `@store-client/pkg/datastore/providers/postgresql/database_client.go`:
- Line 213: Replace the call to GetRecommendedAction().String() with the helper
GetEffectiveActionName() when building the health event (the field currently set
as recommendedAction: modelEvent.HealthEvent.GetRecommendedAction().String());
use modelEvent.HealthEvent.GetEffectiveActionName() instead so custom
remediation actions are preserved; also add the import for
"github.com/NVIDIA/NVSentinel/data-models/pkg/model" if missing. Apply the
identical change for the analogous occurrence in health_events.go (around the
code that sets recommendedAction at line ~104).
In `@store-client/pkg/datastore/providers/postgresql/health_events.go`:
- Line 104: The code sets fields.recommendedAction using
protoEvent.GetRecommendedAction().String(), which drops customRecommendedAction
values; replace that assignment to call model.GetEffectiveActionName(protoEvent)
(from data-models/pkg/model) so the helper picks the correct oneof value and
preserves custom remediation actions; apply the same replacement at the other
occurrence referenced (database_client.go:213) to ensure consistent indexing and
lookup for custom actions.
---
Outside diff comments:
In
`@distros/kubernetes/nvsentinel/charts/k8s-datastore/crds/health_event.crd.yaml`:
- Around line 49-74: Add x-kubernetes-validations to the CRD schema for the
oneof named "action" so the generator enforces mutual exclusivity and non-empty
custom strings: add a validation rule using a CEL expression that enforces
exactly one of recommendedAction or customRecommendedAction is present (e.g.
has(self.recommendedAction) != has(self.customRecommendedAction)) and a second
rule that, when customRecommendedAction is present, it is non-empty (e.g.
!has(self.customRecommendedAction) || self.customRecommendedAction != ""),
referencing the fields recommendedAction and customRecommendedAction so the
generated CRD will reject objects that set both or an empty
customRecommendedAction.
In `@event-exporter/pkg/transformer/cloudevents.go`:
- Around line 53-71: The exported map healthEventData is always populated with
event.GetRecommendedAction(), which yields RecommendedAction_NONE when the oneof
active arm is customRecommendedAction; update the serialization to only set
"recommendedAction" when the protobuf-arm for the enum is actually active (e.g.,
when event.GetCustomRecommendedAction() == "" or by switching on the oneof
field), and keep the existing clause that sets "customRecommendedAction" from
event.GetCustomRecommendedAction(); in short, only add
healthEventData["recommendedAction"] = event.GetRecommendedAction().String()
when the enum arm is the active one, otherwise omit it so custom-action events
are not emitted with "NONE".
In `@fault-remediation/pkg/reconciler/reconciler_test.go`:
- Around line 297-360: The test is calling the mock CreateMaintenanceResource
directly instead of exercising the reconciler flow; change the test to invoke
the reconciler method under test (e.g., call r.handleRemediationEvent or
r.performRemediation with the constructed healthEventData and groupConfig on the
reconciler returned by NewFaultRemediationReconciler) so the reconciler code
executes its action handling, and keep the
MockK8sClient.createMaintenanceResourceFn assertions to validate the inputs and
return values; then assert the reconciler result (error or CR name) matches
tt.expectedError / expected CR name instead of calling CreateMaintenanceResource
directly.
In `@health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go`:
- Around line 335-356: The current code maps event.RecommendedAction only via
pb.RecommendedAction_value and replaces unknowns with NONE, losing custom
remediation names; change the Action assignment logic in the HealthEvent
construction: first attempt to map event.RecommendedAction via
pb.RecommendedAction_value (as you already do) and if found set Action =
&pb.HealthEvent_RecommendedAction{RecommendedAction:
pb.RecommendedAction(actionEnum)}; if not found and event.RecommendedAction is a
non-empty string set Action =
&pb.HealthEvent_CustomRecommendedAction{CustomRecommendedAction:
event.RecommendedAction}; otherwise fall back to setting RecommendedAction to
pb.RecommendedAction_NONE. Update references around pb.RecommendedAction_value,
event.RecommendedAction, and the HealthEvent Action oneof to implement this
fallback.
In
`@platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go`:
- Around line 141-1528: constructHealthEventMessage() currently only uses
GetRecommendedAction(), which drops protobuf oneof custom string values; change
it to prefer GetCustomRecommendedAction() (use it when non-empty) and fall back
to GetRecommendedAction() only if the custom string is empty so
customRecommendedAction is preserved; update message formatting logic that
appends "Recommended Action=" to use the chosen value, and add a unit test that
constructs a HealthEvent with
HealthEvent_CustomRecommendedAction{CustomRecommendedAction: "CUSTOM_VALUE"} to
assert the resulting node condition message contains "Recommended
Action=CUSTOM_VALUE".
In `@platform-connectors/pkg/transformers/overrides/transformer.go`:
- Around line 42-47: captureOriginalState currently uses the protobuf-generated
GetRecommendedAction() which returns RecommendedAction_NONE when the oneof is
set to customRecommendedAction, causing custom actions to be lost; update
captureOriginalState to detect which oneof branch is active (use the
oneof-specific getters such as GetCustomRecommendedAction() vs
GetRecommendedAction() or the generated ActionCase/XXXCase accessor on
pb.HealthEvent), store both the enum recommendedAction and the
customRecommendedAction (or a flag indicating which branch is set) into
originalState, and then update the change-detection logic (the comparison around
the earlier-mentioned enum checks) to compare the enum when recommendedAction is
set and compare the custom string when customRecommendedAction is set; also
update the logging to print the preserved customRecommendedAction when present
instead of showing NONE.
---
Nitpick comments:
In `@event-exporter/pkg/transformer/cloudevents_test.go`:
- Around line 30-259: Add a new table test entry in the tests slice that
exercises the non-enum arm of HealthEvent.Action by setting event.Action =
&pb.HealthEvent_CustomRecommendedAction{CustomRecommendedAction: "do-thing"}; in
its validateFunc, assert the resulting CloudEvent (CloudEvent ce returned by
TransformHealthEvent or equivalent) contains
healthEvent["customRecommendedAction"] == "do-thing" and that healthEvent does
not incorrectly backfill healthEvent["recommendedAction"] to "NONE" (i.e.,
either recommendedAction is absent or not equal to "NONE"); reference the tests
slice, pb.HealthEvent and the CloudEvent struct when adding this case and its
validateFunc.
In `@fault-quarantine/pkg/evaluator/rule_evaluator.go`:
- Line 300: The double-dereference using v.Elem().Elem() is unsafe for
unexpected inputs; before assigning concrete, add defensive checks that v is
valid and a pointer and that v.Elem() is valid and also a pointer (e.g., verify
v.IsValid(), v.Kind()==reflect.Ptr, v.Elem().IsValid(),
v.Elem().Kind()==reflect.Ptr) and handle the failure path (return an error or
skip processing) so that the code around the concrete variable (the v and
concrete usage in rule_evaluator.go) does not panic on non-conforming values.
In `@fault-remediation/pkg/common/equivalence_groups_test.go`:
- Around line 51-101: Add a table case that exercises the
CustomRecommendedAction arm: add a test entry (e.g., name "Custom recommended
action") where healthEvent.Action is set to
&protos.HealthEvent_CustomRecommendedAction{CustomRecommendedAction:
"my_custom_action"}; set expectError appropriately (usually false) and set
expectedGroupConfig to the expected result for a custom action (nil if no
mapping exists) so TestGetGroupConfigForEvent exercises the custom-action lookup
path in the GetGroupConfigForEvent logic.
In `@tilt/simple-health-client/main.go`:
- Around line 53-54: Replace the formatted log.Printf debug calls that print
healthEvent fields with structured slog logging: call slog.Debug (or use
slog.With(...).Debug) and emit key/value fields for Node (healthEvent.NodeName),
CheckName (healthEvent.CheckName), Agent (healthEvent.Agent), IsFatal
(healthEvent.IsFatal) and RecommendedAction
(healthEvent.GetRecommendedAction()); do the same for the other formatted
log.Printf occurrence around the health event handling (the one using
healthEvent.GetRecommendedAction()), ensuring you use consistent field keys and
structured slog calls instead of printf formatting.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 1afd73ad-5aa5-4c2c-a01b-6fb5193c0c04
⛔ Files ignored due to path filters (6)
api/gen/go/csp/v1alpha1/provider.pb.gois excluded by!**/*.pb.go,!**/gen/**api/gen/go/csp/v1alpha1/provider_grpc.pb.gois excluded by!**/*.pb.go,!**/gen/**api/gen/go/device/v1alpha1/gpu.pb.gois excluded by!**/*.pb.go,!**/gen/**api/gen/go/device/v1alpha1/gpu_grpc.pb.gois excluded by!**/*.pb.go,!**/gen/**data-models/pkg/protos/health_event.pb.gois excluded by!**/*.pb.godata-models/pkg/protos/health_event_grpc.pb.gois excluded by!**/*.pb.go
📒 Files selected for processing (47)
commons/pkg/tracing/span_attributes.godata-models/pkg/model/health_event_extentions.godata-models/protobufs/health_event.protodistros/kubernetes/nvsentinel/charts/k8s-datastore/crds/health_event.crd.yamlevent-exporter/pkg/transformer/cloudevents.goevent-exporter/pkg/transformer/cloudevents_test.gofault-quarantine/pkg/evaluator/rule_evaluator.gofault-quarantine/pkg/evaluator/rule_evaluator_test.gofault-remediation/pkg/common/equivalence_groups_test.gofault-remediation/pkg/reconciler/reconciler.gofault-remediation/pkg/reconciler/reconciler_e2e_test.gofault-remediation/pkg/reconciler/reconciler_test.gofault-remediation/pkg/remediation/remediation.gofault-remediation/pkg/remediation/remediation_test.gohealth-events-analyzer/pkg/publisher/publisher.gohealth-events-analyzer/pkg/reconciler/reconciler_test.gohealth-monitors/csp-health-monitor/pkg/triggerengine/trigger.gohealth-monitors/csp-health-monitor/pkg/triggerengine/trigger_test.gohealth-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyhealth-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyihealth-monitors/kubernetes-object-monitor/pkg/publisher/publisher.gohealth-monitors/slurm-drain-monitor/pkg/publisher/publisher.gohealth-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler.gohealth-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler_test.gohealth-monitors/syslog-health-monitor/pkg/sxid/sxid_handler.gohealth-monitors/syslog-health-monitor/pkg/sxid/sxid_handler_test.gohealth-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor.gohealth-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor_test.gohealth-monitors/syslog-health-monitor/pkg/xid/xid_handler.gohealth-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.gonode-drainer/pkg/evaluator/evaluator.goplatform-connectors/pkg/connectors/kubernetes/k8s_connector_envtest_test.goplatform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.goplatform-connectors/pkg/connectors/kubernetes/process_node_events.goplatform-connectors/pkg/server/platform_connector_server.goplatform-connectors/pkg/transformers/overrides/cel.goplatform-connectors/pkg/transformers/overrides/transformer.goplatform-connectors/pkg/transformers/overrides/transformer_test.gopreflight-checks/dcgm-diag/dcgm_diag/protos/health_event_pb2.pypreflight-checks/dcgm-diag/dcgm_diag/protos/health_event_pb2.pyipreflight-checks/nccl-allreduce/nccl_allreduce/protos/health_event_pb2.pypreflight-checks/nccl-allreduce/nccl_allreduce/protos/health_event_pb2.pyipreflight-checks/nccl-loopback/pkg/health/reporter.gostore-client/pkg/datastore/providers/postgresql/database_client.gostore-client/pkg/datastore/providers/postgresql/health_events.gotests/scale-tests/event-generator/main.gotilt/simple-health-client/main.go
✅ Files skipped from review due to trivial changes (2)
- fault-remediation/pkg/reconciler/reconciler.go
- preflight-checks/nccl-allreduce/nccl_allreduce/protos/health_event_pb2.pyi
🚧 Files skipped from review as they are similar to previous changes (10)
- fault-remediation/pkg/remediation/remediation.go
- data-models/pkg/model/health_event_extentions.go
- platform-connectors/pkg/server/platform_connector_server.go
- preflight-checks/nccl-allreduce/nccl_allreduce/protos/health_event_pb2.py
- preflight-checks/dcgm-diag/dcgm_diag/protos/health_event_pb2.pyi
- health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi
- data-models/protobufs/health_event.proto
- health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py
- preflight-checks/dcgm-diag/dcgm_diag/protos/health_event_pb2.py
- fault-quarantine/pkg/evaluator/rule_evaluator_test.go
health-monitors/kubernetes-object-monitor/pkg/publisher/publisher.go
Outdated
Show resolved
Hide resolved
platform-connectors/pkg/connectors/kubernetes/process_node_events.go
Outdated
Show resolved
Hide resolved
store-client/pkg/datastore/providers/postgresql/database_client.go
Outdated
Show resolved
Hide resolved
store-client/pkg/datastore/providers/postgresql/health_events.go
Outdated
Show resolved
Hide resolved
1ea0919 to
27ed466
Compare
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
platform-connectors/pkg/transformers/overrides/transformer.go (1)
43-48:⚠️ Potential issue | 🟠 MajorPreserve the original oneof action when deciding whether the override changed it.
GetRecommendedAction()returnsRecommendedAction_NONEwhencustomRecommendedActionis set, making it impossible to distinguish a custom action from an actualNONEvalue in the comparison at line 145. A rule attempting to override a custom action toNONEwill be treated as a no-op, and the logging on lines 165–166 will misreport the action. Store the original oneof branch state separately instead of capturing the flattened enum value.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@platform-connectors/pkg/transformers/overrides/transformer.go` around lines 43 - 48, The code flattens the oneof by storing only event.GetRecommendedAction(), which loses whether the oneof branch was custom and causes incorrect no-op detection and logging; update captureOriginalState and the originalState type to record the oneof branch (e.g., add a bool like hasCustomRecommendedAction or an enum originalActionOneof) by checking the oneof field (e.g., event.CustomRecommendedAction != nil or equivalent getter) in captureOriginalState, and then change the override comparison logic that currently uses GetRecommendedAction() (and the logging around that decision) to consider both the enum value and the stored oneof branch so a custom action vs explicit NONE is distinguished.
♻️ Duplicate comments (3)
store-client/pkg/datastore/providers/postgresql/database_client.go (1)
213-213:⚠️ Potential issue | 🔴 CriticalCustom remediation action is lost in index field extraction (Line 213).
Using
GetRecommendedAction().String()collapsescustomRecommendedActionto enum default (NONE) for oneof-backed events. This breaks custom-action preservation in the PostgreSQL indexed field.Suggested fix
- recommendedAction: modelEvent.HealthEvent.GetRecommendedAction().String(), + recommendedAction: model.GetEffectiveActionName(modelEvent.HealthEvent),#!/bin/bash set -euo pipefail # 1) Find enum-only extraction patterns that may drop custom oneof values rg -n 'recommendedAction:\s*.*GetRecommendedAction\(\)\.String\(\)' \ store-client/pkg/datastore/providers/postgresql/database_client.go \ store-client/pkg/datastore/providers/postgresql/health_events.go # 2) Verify effective-action helper exists in model package rg -n 'func\s+GetEffectiveActionName\s*\(' data-models/pkg/model --iglob '*.go'🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@store-client/pkg/datastore/providers/postgresql/database_client.go` at line 213, The indexed field extraction uses GetRecommendedAction().String(), which loses oneof-backed customRecommendedAction values (collapsing to NONE); update the mapping for recommendedAction to use the model-level effective-action helper (e.g., call GetEffectiveActionName or similar) that returns the actual action name including customRecommendedAction, ensuring the code paths around the recommendedAction assignment in database_client.go and any similar spots (where GetRecommendedAction().String() is used) are replaced to preserve custom actions.health-monitors/slurm-drain-monitor/pkg/publisher/publisher.go (1)
119-119:⚠️ Potential issue | 🟠 MajorCustom remediation actions are still being dropped in unhealthy events.
Line 119 still funnels
r.RecommendedActionthroughmapRecommendedAction, and unknown values are downgraded toCONTACT_SUPPORT(Line 196). That loses custom action names and conflicts with the feature goal.💡 Proposed fix
@@ } else { for _, r := range reasons { - events = append(events, &pb.HealthEvent{ + event := &pb.HealthEvent{ Version: 1, Agent: agentName, CheckName: r.CheckName, ComponentClass: r.ComponentClass, GeneratedTimestamp: timestamppb.New(time.Now()), Message: r.Message, IsFatal: r.IsFatal, IsHealthy: false, NodeName: nodeName, - Action: model.BuiltinAction(mapRecommendedAction(r.RecommendedAction)), ProcessingStrategy: p.processingStrategy, EntitiesImpacted: entitiesImpacted, - }) + } + applyAction(event, r.RecommendedAction) + events = append(events, event) } } @@ -func mapRecommendedAction(action string) pb.RecommendedAction { +func applyAction(event *pb.HealthEvent, action string) { if action == "" { - return pb.RecommendedAction_CONTACT_SUPPORT + event.Action = model.BuiltinAction(pb.RecommendedAction_CONTACT_SUPPORT) + return } if value, exists := pb.RecommendedAction_value[action]; exists { - return pb.RecommendedAction(value) + event.Action = model.BuiltinAction(pb.RecommendedAction(value)) + return } - return pb.RecommendedAction_CONTACT_SUPPORT + event.Action = model.CustomAction(action) }Also applies to: 187-197
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@health-monitors/slurm-drain-monitor/pkg/publisher/publisher.go` at line 119, The code currently always maps r.RecommendedAction through mapRecommendedAction and falls back to CONTACT_SUPPORT, losing custom remediation names; change the assignment for Action so that if mapRecommendedAction(r.RecommendedAction) returns a known builtin value you use model.BuiltinAction(...) but if it does not, preserve the original string as a custom action (e.g. set Action to model.CustomAction(r.RecommendedAction) or whatever the existing CustomAction constructor/type is) instead of coercing to CONTACT_SUPPORT; update both the instance at Action: model.BuiltinAction(mapRecommendedAction(r.RecommendedAction)) and the similar logic in mapRecommendedAction’s fallback path (lines ~187-197) to return/preserve custom actions.platform-connectors/pkg/server/platform_connector_server.go (1)
17-27:⚠️ Potential issue | 🟠 MajorTrim whitespace before accepting
customRecommendedAction.
" "still passes this check and later behaves like an invalid action name. Reject trimmed-empty values here too.Proposed fix
import ( "context" "log/slog" + "strings" "github.com/golang/protobuf/ptypes/empty" @@ - if custom, ok := event.Action.(*pb.HealthEvent_CustomRecommendedAction); ok && custom.CustomRecommendedAction == "" { + if custom, ok := event.Action.(*pb.HealthEvent_CustomRecommendedAction); ok && + strings.TrimSpace(custom.CustomRecommendedAction) == "" { return nil, status.Errorf(codes.InvalidArgument, "recommendedAction is CUSTOM but customRecommendedAction is empty (node=%s, agent=%s)", event.NodeName, event.Agent) }Also applies to: 74-77
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@platform-connectors/pkg/server/platform_connector_server.go` around lines 17 - 27, The validation currently accepts values like " " for customRecommendedAction; fix by trimming whitespace before checking emptiness: call strings.TrimSpace on the customRecommendedAction input where it's validated (the same check around lines 74-77) and treat a trimmed-empty string as invalid, returning an appropriate gRPC InvalidArgument error (using status.Error(codes.InvalidArgument, ...)); ensure you import strings and update all places that validate customRecommendedAction so whitespace-only values are rejected.
🧹 Nitpick comments (1)
event-exporter/pkg/transformer/cloudevents_test.go (1)
39-131: Add a custom-action CloudEvent test case.This table covers built-in actions, but it doesn’t assert the new custom-action branch. Please add a case that sets
Actiontomodel.CustomAction(...)and verifiescustomRecommendedActionis emitted correctly inhealthEvent.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@event-exporter/pkg/transformer/cloudevents_test.go` around lines 39 - 131, Add a new table test case in cloudevents_test.go that covers the custom-action branch by creating a pb.HealthEvent with Action set via model.CustomAction("MY_CUSTOM_ACTION"); in that case update the test's validateFunc to assert that ce.Data["healthEvent"].(map[string]any)["customRecommendedAction"] == "MY_CUSTOM_ACTION" (and that the built-in recommendedAction field is not used), referencing the HealthEvent struct's Action field and the model.CustomAction helper so the custom-action branch in the CloudEvent transformer is exercised.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@platform-connectors/pkg/transformers/overrides/transformer.go`:
- Around line 43-48: The code flattens the oneof by storing only
event.GetRecommendedAction(), which loses whether the oneof branch was custom
and causes incorrect no-op detection and logging; update captureOriginalState
and the originalState type to record the oneof branch (e.g., add a bool like
hasCustomRecommendedAction or an enum originalActionOneof) by checking the oneof
field (e.g., event.CustomRecommendedAction != nil or equivalent getter) in
captureOriginalState, and then change the override comparison logic that
currently uses GetRecommendedAction() (and the logging around that decision) to
consider both the enum value and the stored oneof branch so a custom action vs
explicit NONE is distinguished.
---
Duplicate comments:
In `@health-monitors/slurm-drain-monitor/pkg/publisher/publisher.go`:
- Line 119: The code currently always maps r.RecommendedAction through
mapRecommendedAction and falls back to CONTACT_SUPPORT, losing custom
remediation names; change the assignment for Action so that if
mapRecommendedAction(r.RecommendedAction) returns a known builtin value you use
model.BuiltinAction(...) but if it does not, preserve the original string as a
custom action (e.g. set Action to model.CustomAction(r.RecommendedAction) or
whatever the existing CustomAction constructor/type is) instead of coercing to
CONTACT_SUPPORT; update both the instance at Action:
model.BuiltinAction(mapRecommendedAction(r.RecommendedAction)) and the similar
logic in mapRecommendedAction’s fallback path (lines ~187-197) to
return/preserve custom actions.
In `@platform-connectors/pkg/server/platform_connector_server.go`:
- Around line 17-27: The validation currently accepts values like " " for
customRecommendedAction; fix by trimming whitespace before checking emptiness:
call strings.TrimSpace on the customRecommendedAction input where it's validated
(the same check around lines 74-77) and treat a trimmed-empty string as invalid,
returning an appropriate gRPC InvalidArgument error (using
status.Error(codes.InvalidArgument, ...)); ensure you import strings and update
all places that validate customRecommendedAction so whitespace-only values are
rejected.
In `@store-client/pkg/datastore/providers/postgresql/database_client.go`:
- Line 213: The indexed field extraction uses GetRecommendedAction().String(),
which loses oneof-backed customRecommendedAction values (collapsing to NONE);
update the mapping for recommendedAction to use the model-level effective-action
helper (e.g., call GetEffectiveActionName or similar) that returns the actual
action name including customRecommendedAction, ensuring the code paths around
the recommendedAction assignment in database_client.go and any similar spots
(where GetRecommendedAction().String() is used) are replaced to preserve custom
actions.
---
Nitpick comments:
In `@event-exporter/pkg/transformer/cloudevents_test.go`:
- Around line 39-131: Add a new table test case in cloudevents_test.go that
covers the custom-action branch by creating a pb.HealthEvent with Action set via
model.CustomAction("MY_CUSTOM_ACTION"); in that case update the test's
validateFunc to assert that
ce.Data["healthEvent"].(map[string]any)["customRecommendedAction"] ==
"MY_CUSTOM_ACTION" (and that the built-in recommendedAction field is not used),
referencing the HealthEvent struct's Action field and the model.CustomAction
helper so the custom-action branch in the CloudEvent transformer is exercised.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 8a3d30b0-036d-4f21-85f5-1463d9922ba0
⛔ Files ignored due to path filters (8)
api/gen/go/csp/v1alpha1/provider.pb.gois excluded by!**/*.pb.go,!**/gen/**api/gen/go/csp/v1alpha1/provider_grpc.pb.gois excluded by!**/*.pb.go,!**/gen/**api/gen/go/device/v1alpha1/gpu.pb.gois excluded by!**/*.pb.go,!**/gen/**api/gen/go/device/v1alpha1/gpu_grpc.pb.gois excluded by!**/*.pb.go,!**/gen/**data-models/pkg/protos/health_event.pb.gois excluded by!**/*.pb.godata-models/pkg/protos/health_event_grpc.pb.gois excluded by!**/*.pb.gopreflight-checks/nccl-loopback/go.sumis excluded by!**/*.sumtests/scale-tests/event-generator/go.sumis excluded by!**/*.sum
📒 Files selected for processing (55)
commons/pkg/tracing/span_attributes.godata-models/pkg/model/health_event_extentions.godata-models/protobufs/health_event.protodistros/kubernetes/nvsentinel/charts/k8s-datastore/crds/health_event.crd.yamldocs/configuration/fault-remediation.mdevent-exporter/pkg/transformer/cloudevents.goevent-exporter/pkg/transformer/cloudevents_test.gofault-quarantine/pkg/evaluator/rule_evaluator.gofault-quarantine/pkg/evaluator/rule_evaluator_test.gofault-remediation/pkg/common/equivalence_groups.gofault-remediation/pkg/common/equivalence_groups_test.gofault-remediation/pkg/config/config.gofault-remediation/pkg/config/config_test.gofault-remediation/pkg/reconciler/reconciler.gofault-remediation/pkg/reconciler/reconciler_e2e_test.gofault-remediation/pkg/reconciler/reconciler_test.gofault-remediation/pkg/remediation/remediation.gofault-remediation/pkg/remediation/remediation_test.gohealth-events-analyzer/pkg/publisher/publisher.gohealth-events-analyzer/pkg/reconciler/reconciler_test.gohealth-monitors/csp-health-monitor/pkg/triggerengine/trigger.gohealth-monitors/csp-health-monitor/pkg/triggerengine/trigger_test.gohealth-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyhealth-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyihealth-monitors/kubernetes-object-monitor/pkg/publisher/publisher.gohealth-monitors/slurm-drain-monitor/pkg/publisher/publisher.gohealth-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler.gohealth-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler_test.gohealth-monitors/syslog-health-monitor/pkg/sxid/sxid_handler.gohealth-monitors/syslog-health-monitor/pkg/sxid/sxid_handler_test.gohealth-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor.gohealth-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor_test.gohealth-monitors/syslog-health-monitor/pkg/xid/xid_handler.gohealth-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.gonode-drainer/pkg/evaluator/evaluator.gonode-drainer/pkg/evaluator/evaluator_integration_test.gonode-drainer/pkg/reconciler/reconciler_integration_test.goplatform-connectors/pkg/connectors/kubernetes/k8s_connector_envtest_test.goplatform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.goplatform-connectors/pkg/connectors/kubernetes/process_node_events.goplatform-connectors/pkg/server/platform_connector_server.goplatform-connectors/pkg/transformers/overrides/cel.goplatform-connectors/pkg/transformers/overrides/transformer.goplatform-connectors/pkg/transformers/overrides/transformer_test.gopreflight-checks/dcgm-diag/dcgm_diag/protos/health_event_pb2.pypreflight-checks/dcgm-diag/dcgm_diag/protos/health_event_pb2.pyipreflight-checks/nccl-allreduce/nccl_allreduce/protos/health_event_pb2.pypreflight-checks/nccl-allreduce/nccl_allreduce/protos/health_event_pb2.pyipreflight-checks/nccl-loopback/go.modpreflight-checks/nccl-loopback/pkg/health/reporter.gostore-client/pkg/datastore/providers/postgresql/database_client.gostore-client/pkg/datastore/providers/postgresql/health_events.gotests/scale-tests/event-generator/go.modtests/scale-tests/event-generator/main.gotilt/simple-health-client/main.go
✅ Files skipped from review due to trivial changes (10)
- health-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler_test.go
- tilt/simple-health-client/main.go
- platform-connectors/pkg/connectors/kubernetes/process_node_events.go
- health-events-analyzer/pkg/reconciler/reconciler_test.go
- platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
- fault-remediation/pkg/config/config_test.go
- fault-remediation/pkg/reconciler/reconciler_e2e_test.go
- preflight-checks/nccl-loopback/go.mod
- fault-remediation/pkg/reconciler/reconciler.go
- preflight-checks/dcgm-diag/dcgm_diag/protos/health_event_pb2.pyi
🚧 Files skipped from review as they are similar to previous changes (29)
- health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler_test.go
- platform-connectors/pkg/transformers/overrides/cel.go
- event-exporter/pkg/transformer/cloudevents.go
- commons/pkg/tracing/span_attributes.go
- health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler.go
- store-client/pkg/datastore/providers/postgresql/health_events.go
- fault-remediation/pkg/common/equivalence_groups.go
- docs/configuration/fault-remediation.md
- health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go
- health-monitors/kubernetes-object-monitor/pkg/publisher/publisher.go
- data-models/protobufs/health_event.proto
- node-drainer/pkg/evaluator/evaluator.go
- platform-connectors/pkg/transformers/overrides/transformer_test.go
- preflight-checks/nccl-loopback/pkg/health/reporter.go
- health-monitors/csp-health-monitor/pkg/triggerengine/trigger_test.go
- distros/kubernetes/nvsentinel/charts/k8s-datastore/crds/health_event.crd.yaml
- health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
- fault-remediation/pkg/common/equivalence_groups_test.go
- fault-remediation/pkg/remediation/remediation_test.go
- health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go
- platform-connectors/pkg/connectors/kubernetes/k8s_connector_envtest_test.go
- health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi
- preflight-checks/nccl-allreduce/nccl_allreduce/protos/health_event_pb2.pyi
- preflight-checks/nccl-allreduce/nccl_allreduce/protos/health_event_pb2.py
- health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py
- data-models/pkg/model/health_event_extentions.go
- fault-remediation/pkg/reconciler/reconciler_test.go
- fault-remediation/pkg/config/config.go
- preflight-checks/dcgm-diag/dcgm_diag/protos/health_event_pb2.py
Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
22f82a6 to
1a9b93d
Compare
There was a problem hiding this comment.
♻️ Duplicate comments (1)
platform-connectors/pkg/server/platform_connector_server.go (1)
74-78:⚠️ Potential issue | 🟠 MajorReject whitespace-only
customRecommendedActionvalues.Line 74 only rejects
""; values like" "still pass validation.Proposed fix
import ( "context" "log/slog" + "strings" "github.com/golang/protobuf/ptypes/empty" @@ - if event.RecommendedAction == pb.RecommendedAction_CUSTOM && event.CustomRecommendedAction == "" { + if event.RecommendedAction == pb.RecommendedAction_CUSTOM && + strings.TrimSpace(event.CustomRecommendedAction) == "" { return nil, status.Errorf(codes.InvalidArgument, "recommendedAction is CUSTOM but customRecommendedAction is empty (node=%s, agent=%s)", event.NodeName, event.Agent) }#!/bin/bash # Verify whether CUSTOM validation still uses raw empty-string check (no TrimSpace). rg -n -C2 'RecommendedAction_CUSTOM' platform-connectors/pkg/server/platform_connector_server.go rg -n -C2 'CustomRecommendedAction == ""' platform-connectors/pkg/server/platform_connector_server.goExpected result: the second query matches Line 74 in current code, confirming whitespace-only values are not rejected.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@platform-connectors/pkg/server/platform_connector_server.go` around lines 74 - 78, The validation currently checks event.CustomRecommendedAction == "" when event.RecommendedAction == pb.RecommendedAction_CUSTOM, which allows whitespace-only strings; update the check to reject strings that are empty after trimming whitespace (use strings.TrimSpace on event.CustomRecommendedAction) and return the same InvalidArgument error if the trimmed result is empty; adjust imports to include the strings package if needed and keep the error message and referenced symbols (event.RecommendedAction, pb.RecommendedAction_CUSTOM, event.CustomRecommendedAction) unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@platform-connectors/pkg/server/platform_connector_server.go`:
- Around line 74-78: The validation currently checks
event.CustomRecommendedAction == "" when event.RecommendedAction ==
pb.RecommendedAction_CUSTOM, which allows whitespace-only strings; update the
check to reject strings that are empty after trimming whitespace (use
strings.TrimSpace on event.CustomRecommendedAction) and return the same
InvalidArgument error if the trimmed result is empty; adjust imports to include
the strings package if needed and keep the error message and referenced symbols
(event.RecommendedAction, pb.RecommendedAction_CUSTOM,
event.CustomRecommendedAction) unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 86733330-a8a4-4374-aa1b-0b099e2e6f5b
⛔ Files ignored due to path filters (1)
data-models/pkg/protos/health_event.pb.gois excluded by!**/*.pb.go
📒 Files selected for processing (15)
data-models/pkg/model/health_event_extentions.godata-models/protobufs/health_event.protodistros/kubernetes/nvsentinel/charts/k8s-datastore/crds/health_event.crd.yamlevent-exporter/pkg/transformer/cloudevents.gofault-remediation/pkg/common/equivalence_groups.gofault-remediation/pkg/config/config.gofault-remediation/pkg/config/config_test.gofault-remediation/pkg/remediation/remediation.gohealth-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyhealth-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyiplatform-connectors/pkg/server/platform_connector_server.gopreflight-checks/dcgm-diag/dcgm_diag/protos/health_event_pb2.pypreflight-checks/dcgm-diag/dcgm_diag/protos/health_event_pb2.pyipreflight-checks/nccl-allreduce/nccl_allreduce/protos/health_event_pb2.pypreflight-checks/nccl-allreduce/nccl_allreduce/protos/health_event_pb2.pyi
✅ Files skipped from review due to trivial changes (1)
- health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py
🚧 Files skipped from review as they are similar to previous changes (11)
- event-exporter/pkg/transformer/cloudevents.go
- fault-remediation/pkg/remediation/remediation.go
- fault-remediation/pkg/common/equivalence_groups.go
- distros/kubernetes/nvsentinel/charts/k8s-datastore/crds/health_event.crd.yaml
- data-models/protobufs/health_event.proto
- fault-remediation/pkg/config/config_test.go
- preflight-checks/nccl-allreduce/nccl_allreduce/protos/health_event_pb2.py
- preflight-checks/nccl-allreduce/nccl_allreduce/protos/health_event_pb2.pyi
- preflight-checks/dcgm-diag/dcgm_diag/protos/health_event_pb2.pyi
- preflight-checks/dcgm-diag/dcgm_diag/protos/health_event_pb2.py
- data-models/pkg/model/health_event_extentions.go
Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
Merging this branch will increase overall coverage
Coverage by fileChanged unit test files
|
Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
There was a problem hiding this comment.
🧹 Nitpick comments (1)
fault-remediation/pkg/reconciler/reconciler_test.go (1)
648-675: Exercise the real custom-action lookup path inshouldSkiptests.Line 657 injects
groupConfigdirectly, so this can pass even if CUSTOM action resolution/regression happens in config lookup. Consider derivinggroupConfigviacommon.GetGroupConfigForEvent(...)from a remediation-actions map keyed by"REPLACE_DISK".♻️ Suggested test adjustment
t.Run("Process custom action with group config", func(t *testing.T) { healthEvent := &protos.HealthEvent{ NodeName: "test-node-custom", RecommendedAction: protos.RecommendedAction_CUSTOM, CustomRecommendedAction: "REPLACE_DISK", } healthEventWithStatus := model.HealthEventWithStatus{ HealthEvent: healthEvent, } - groupConfig := getGroupConfig("disk-replace", nil) + remediationActions := map[string]config.MaintenanceResource{ + "REPLACE_DISK": {EquivalenceGroup: "disk-replace"}, + } + groupConfig, err := common.GetGroupConfigForEvent(remediationActions, healthEvent) + assert.NoError(t, err) + assert.NotNil(t, groupConfig) - result := r.shouldSkipEvent(t.Context(), healthEventWithStatus, groupConfig) + result := r.shouldSkipEvent(t.Context(), healthEventWithStatus, groupConfig) assert.False(t, result, "Custom actions with a matching group config should not be skipped") })🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@fault-remediation/pkg/reconciler/reconciler_test.go` around lines 648 - 675, The test is bypassing the CUSTOM action lookup by passing groupConfig directly to shouldSkipEvent; change the test to build a remediation-actions map with an entry keyed by "REPLACE_DISK" and call common.GetGroupConfigForEvent(...) to derive the groupConfig used by shouldSkipEvent so the real custom-action resolution path is exercised; specifically, replace the direct use of getGroupConfig("disk-replace", ...) with creating the map (e.g., remediationActions := map[string]RemediationAction{ "REPLACE_DISK": ... }) and call common.GetGroupConfigForEvent(event.HealthEvent, remediationActions) before invoking r.shouldSkipEvent(...) so the CUSTOM lookup logic in shouldSkipEvent is actually tested.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@fault-remediation/pkg/reconciler/reconciler_test.go`:
- Around line 648-675: The test is bypassing the CUSTOM action lookup by passing
groupConfig directly to shouldSkipEvent; change the test to build a
remediation-actions map with an entry keyed by "REPLACE_DISK" and call
common.GetGroupConfigForEvent(...) to derive the groupConfig used by
shouldSkipEvent so the real custom-action resolution path is exercised;
specifically, replace the direct use of getGroupConfig("disk-replace", ...) with
creating the map (e.g., remediationActions := map[string]RemediationAction{
"REPLACE_DISK": ... }) and call common.GetGroupConfigForEvent(event.HealthEvent,
remediationActions) before invoking r.shouldSkipEvent(...) so the CUSTOM lookup
logic in shouldSkipEvent is actually tested.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 3c16239d-2a35-42e1-a041-d7d527f1d2bf
📒 Files selected for processing (2)
fault-remediation/pkg/reconciler/reconciler.gofault-remediation/pkg/reconciler/reconciler_test.go
Merging this branch will increase overall coverage
Coverage by fileChanged files (no unit tests)
Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code. Changed unit test files
|
| if event.CustomRecommendedAction != "" { | ||
| healthEventData["customRecommendedAction"] = event.CustomRecommendedAction | ||
| } |
There was a problem hiding this comment.
can we add a test for this?
Merging this branch will increase overall coverage
Coverage by fileChanged files (no unit tests)
Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code. Changed unit test files
|
|
Can we add a e2e test in reconciler_e2e_test.go for FR as well? to test the e2e flow with the custom action? |
Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@fault-remediation/pkg/reconciler/reconciler_e2e_test.go`:
- Around line 1327-1403: The test CustomAction_CreatesCorrectCR currently calls
performRemediation directly and should instead exercise the ingestion/reconcile
path: replace the direct performRemediation invocation with sending a
datastore-style event created by createCustomActionQuarantineEvent into
mockWatcher.EventsChan (same pattern as CompleteFlow_WithEventLoop), then
wait/assert eventually for the CR to exist (testDynamic.Resource(gvr).Get) and
the annotation/RemediationState via
reconcilerInstance.annotationManager.GetRemediationState; keep the same
assertions about CR spec/labels and equivalence group but use eventual
assertions/timeouts to allow the reconciler loop to process the event.
In `@fault-remediation/pkg/reconciler/templates/custom-action-template.yaml`:
- Around line 15-16: The template hardcodes the resource kind as "RebootNode"
and ignores the configured MaintenanceResource.Kind; update the
custom-action-template.yaml to render the kind using the template variable (use
{{.Kind}}) so the resource kind comes from the configuration alongside the
existing {{.ApiGroup}} and {{.Version}}; ensure any references that expect the
hardcoded name (if any) are updated to use the templated kind.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: d9b807e0-13c6-427d-8b87-f81f2c707f38
📒 Files selected for processing (4)
event-exporter/pkg/transformer/cloudevents_test.gofault-remediation/pkg/reconciler/reconciler.gofault-remediation/pkg/reconciler/reconciler_e2e_test.gofault-remediation/pkg/reconciler/templates/custom-action-template.yaml
🚧 Files skipped from review as they are similar to previous changes (1)
- fault-remediation/pkg/reconciler/reconciler.go
fault-remediation/pkg/reconciler/templates/custom-action-template.yaml
Outdated
Show resolved
Hide resolved
Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
Merging this branch will increase overall coverage
Coverage by fileChanged unit test files
|
added |
Merging this branch will increase overall coverage
Coverage by fileChanged files (no unit tests)
Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code. Changed unit test files
|
Summary
Design in ADR-036: Data Model — Custom Remediation Actions
PR for design: #1144
Testing
Test Environment
REPLACE_DISKcustom actionConfiguration
Fault-Remediation Helm Values (values-tilt.yaml)
Deployed ConfigMap Verification
Test 1: Custom Action E2E — Full Pipeline (PASS)
Request
Response
{"message":"Health event sent","status":"success"}Pipeline Verification
Event accepted without error. Stored in MongoDB with
recommendedAction: 27andcustomRecommendedAction: "REPLACE_DISK".CEL ruleset "GPU fatal error ruleset" matched (
agent == 'gpu-health-monitor' && componentClass == 'GPU' && isFatal == true). Node cordoned and quarantined. BothrecommendedActionandcustomRecommendedActionfields visible in the event:GetEffectiveActionNameresolved the action to"REPLACE_DISK", found it in config, and created the maintenance CR:Test 2: Empty Custom Action Rejected at gRPC Boundary (PASS)
Request
Response
Platform Connector correctly rejected the malformed event at the gRPC boundary with
InvalidArgument. The event was never stored or processed by downstream modules.Test 3: Custom Drain CR Created with CUSTOM Action (PASS)
Configuration
Custom drain enabled in node-drainer with
userNamespaces: [](mutually exclusive with custom drain):Request
Result
DrainRequestCR created by node-drainer:The node-drainer created the
DrainRequestCR with the correct node, pods to drain, andrecommendedAction: CUSTOM. The custom drain path does not depend on theRecommendedActionvalue — it is triggered purely bycustomDrain.enabledin config.Note: Partial Drain
Partial drain does not work with CUSTOM actions. The node-drainer's
shouldExecutePartialDrainis hardcoded toprotos.RecommendedAction_COMPONENT_RESETonly. A CUSTOM action will always trigger a full drain. Supporting partial drain for custom actions would require extending the evaluator with a configurable allowlist — this is a known limitation and out of scope for this change.Type of Change
Component(s) Affected
Testing
Checklist
Summary by CodeRabbit
New Features
Validation
Tests