Skip to content

feat: Add ability to have custom remediation actions#1145

Open
XRFXLP wants to merge 11 commits intoNVIDIA:mainfrom
XRFXLP:xrfxlp/1072/implementation
Open

feat: Add ability to have custom remediation actions#1145
XRFXLP wants to merge 11 commits intoNVIDIA:mainfrom
XRFXLP:xrfxlp/1072/implementation

Conversation

@XRFXLP
Copy link
Copy Markdown
Member

@XRFXLP XRFXLP commented Apr 10, 2026

Summary

Design in ADR-036: Data Model — Custom Remediation Actions

PR for design: #1144

Testing

Test Environment

  • Kind cluster with 3 control-plane nodes, 2 worker nodes, 50+ KWOK simulated nodes
  • NVSentinel deployed via Tilt (auto-refresh from source branch)
  • Fault-remediation configured with REPLACE_DISK custom action

Configuration

Fault-Remediation Helm Values (values-tilt.yaml)

fault-remediation:
  maintenance:
    actions:
      "REPLACE_DISK":
        apiGroup: "janitor.dgxc.nvidia.com"
        version: "v1alpha1"
        kind: "RebootNode"
        scope: "Cluster"
        completeConditionType: "NodeReady"
        templateFileName: "replace-disk-template.yaml"
        equivalenceGroup: "disk-replace"

    templates:
      "replace-disk-template.yaml": |
        apiVersion: {{ .ApiGroup }}/{{ .Version }}
        kind: RebootNode
        metadata:
          name: disk-maintenance-{{ .HealthEvent.NodeName }}-{{ .HealthEventID }}
          labels:
            app.kubernetes.io/managed-by: nvsentinel
            nvsentinel.nvidia.com/custom-action: "REPLACE_DISK"
        spec:
          nodeName: {{ .HealthEvent.NodeName }}

Deployed ConfigMap Verification

$ kubectl get configmap fault-remediation -n nvsentinel -o jsonpath='{.data.config\.toml}' | grep -A5 "REPLACE_DISK"

[remediationActions."REPLACE_DISK"]
apiGroup = "janitor.dgxc.nvidia.com"
version = "v1alpha1"
kind = "RebootNode"
scope = "Cluster"
completeConditionType = "NodeReady"

Test 1: Custom Action E2E — Full Pipeline (PASS)

Request

curl -X POST http://localhost:8080/health-event \
  -H "Content-Type: application/json" \
  -d '{
    "nodeName": "kwok-node-10",
    "agent": "gpu-health-monitor",
    "componentClass": "GPU",
    "checkName": "test-custom-action-e2e",
    "isFatal": true,
    "isHealthy": false,
    "message": "E2E custom action test",
    "recommendedAction": 27,
    "customRecommendedAction": "REPLACE_DISK",
    "errorCode": ["DISK-001"]
  }'

Response

{"message":"Health event sent","status":"success"}

Pipeline Verification

  1. Platform Connector — Accepted

Event accepted without error. Stored in MongoDB with recommendedAction: 27 and customRecommendedAction: "REPLACE_DISK".

  1. Fault-Quarantine — Quarantined

CEL ruleset "GPU fatal error ruleset" matched (agent == 'gpu-health-monitor' && componentClass == 'GPU' && isFatal == true). Node cordoned and quarantined. Both recommendedAction and customRecommendedAction fields visible in the event:

"HealthEvent":{
  "agent":"gpu-health-monitor",
  "componentClass":"GPU",
  "isFatal":true,
  "recommendedAction":27,
  "customRecommendedAction":"REPLACE_DISK",
  "nodeName":"kwok-node-10"
}
  1. Node-Drainer — Drained
"msg":"Set initial eviction status to InProgress","node":"kwok-node-10"
  1. Fault-Remediation — CR Created

GetEffectiveActionName resolved the action to "REPLACE_DISK", found it in config, and created the maintenance CR:

"msg":"Creating maintenance CR","node":"kwok-node-10","template":"REPLACE_DISK"
"msg":"Generated YAML from template","template":"REPLACE_DISK","yaml":"apiVersion: janitor.dgxc.nvidia.com/v1alpha1\nkind: RebootNode\nmetadata:\n  name: disk-maintenance-kwok-node-10-69dc87315d5d0968608d82c0\n  labels:\n    app.kubernetes.io/managed-by: nvsentinel\n    nvsentinel.nvidia.com/custom-action: \"REPLACE_DISK\"\nspec:\n  nodeName: kwok-node-10\n"
"msg":"Created Maintenance CR successfully","crName":"disk-maintenance-kwok-node-10-69dc87315d5d0968608d82c0","node":"kwok-node-10","template":"REPLACE_DISK"
"msg":"Updated remediation state annotation for node","node":"kwok-node-10","group":"disk-replace","crName":"disk-maintenance-kwok-node-10-69dc87315d5d0968608d82c0"
  1. CR Verified in Cluster
$ kubectl get rebootnodes -A | grep disk-maintenance

disk-maintenance-kwok-node-10-69dc87315d5d0968608d82c0   kwok-node-10   false   True   114s

Test 2: Empty Custom Action Rejected at gRPC Boundary (PASS)

Request

curl -X POST http://localhost:8080/health-event \
  -H "Content-Type: application/json" \
  -d '{
    "nodeName": "kwok-node-1",
    "agent": "test-agent",
    "componentClass": "Disk",
    "checkName": "negative-test",
    "isFatal": true,
    "isHealthy": false,
    "message": "should be rejected",
    "recommendedAction": 27,
    "customRecommendedAction": ""
  }'

Response

Failed to send health event: rpc error: code = InvalidArgument desc = recommendedAction is CUSTOM but customRecommendedAction is empty (node=kwok-node-1, agent=test-agent)

Platform Connector correctly rejected the malformed event at the gRPC boundary with InvalidArgument. The event was never stored or processed by downstream modules.

Test 3: Custom Drain CR Created with CUSTOM Action (PASS)

Configuration

Custom drain enabled in node-drainer with userNamespaces: [] (mutually exclusive with custom drain):

node-drainer:
  userNamespaces: []
  customDrain:
    enabled: true
    templateConfigMapName: "slinky-drain-template"
    templateMountPath: "/etc/drain-template"
    templateFileName: "drain-template.yaml"
    namespace: "nvsentinel"
    apiGroup: "nvsentinel.nvidia.com"
    version: "v1alpha1"
    kind: "DrainRequest"
    resource: "drainrequests"
    statusConditionType: "Complete"
    statusConditionStatus: "True"

Request

curl -X POST http://localhost:8080/health-event \
  -H "Content-Type: application/json" \
  -d '{
    "nodeName": "nvsentinel-worker2",
    "agent": "gpu-health-monitor",
    "componentClass": "GPU",
    "checkName": "custom-drain-final-v2",
    "isFatal": true,
    "isHealthy": false,
    "message": "Final custom drain CR test",
    "recommendedAction": 27,
    "customRecommendedAction": "REPLACE_DISK",
    "errorCode": ["FINAL-001"]
  }'

Result

DrainRequest CR created by node-drainer:

$ kubectl get drainrequests -A
NAMESPACE    NAME                                                AGE
nvsentinel   drain-nvsentinel-worker2-69dca9f45d5d0968608d82c5   88s
apiVersion: nvsentinel.nvidia.com/v1alpha1
kind: DrainRequest
metadata:
  labels:
    nvsentinel.nvidia.com/node-name: nvsentinel-worker2
  name: drain-nvsentinel-worker2-69dca9f45d5d0968608d82c5
  namespace: nvsentinel
spec:
  checkName: custom-drain-final-v2
  errorCode:
  - FINAL-001
  healthEventID: 69dca9f45d5d0968608d82c5
  nodeName: nvsentinel-worker2
  podsToDrain:
    default:
    - test-drain-final-v2
  reason: Final custom drain CR test
  recommendedAction: CUSTOM

The node-drainer created the DrainRequest CR with the correct node, pods to drain, and recommendedAction: CUSTOM. The custom drain path does not depend on the RecommendedAction value — it is triggered purely by customDrain.enabled in config.

Note: Partial Drain

Partial drain does not work with CUSTOM actions. The node-drainer's shouldExecutePartialDrain is hardcoded to protos.RecommendedAction_COMPONENT_RESET only. A CUSTOM action will always trigger a full drain. Supporting partial drain for custom actions would require extending the evaluator with a configurable allowlist — this is a known limitation and out of scope for this change.

Type of Change

  • 🐛 Bug fix
  • ✨ New feature
  • 💥 Breaking change
  • 📚 Documentation
  • 🔧 Refactoring
  • 🔨 Build/CI

Component(s) Affected

  • Core Services
  • Documentation/CI
  • Fault Management
  • Health Monitors
  • Janitor
  • Other: ____________

Testing

  • Tests pass locally
  • Manual testing completed
  • No breaking changes (or documented)

Checklist

  • Self-review completed
  • Documentation updated (if needed)
  • Ready for review

Summary by CodeRabbit

  • New Features

    • Added support for a CUSTOM recommended action and propagation of a custom action name end-to-end (events, export, remediation selection, and annotations).
    • Added templated custom remediation resource generation.
  • Validation

    • Requests using CUSTOM must include a non-empty custom action name.
    • Config validation now disallows impacted-entity scope for built-in actions except the reset action; custom actions may define it.
  • Tests

    • Added unit and end-to-end tests covering custom-action handling, export, validation, and remediation routing.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Apr 10, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: a034748a-43d8-4b42-9e20-694714646aad

📥 Commits

Reviewing files that changed from the base of the PR and between 6fa230d and 9b4432c.

📒 Files selected for processing (1)
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go

📝 Walkthrough

Walkthrough

Adds explicit support for custom remediation actions: new CUSTOM enum and customRecommendedAction field in HealthEvent, helper GetEffectiveActionName, propagation/validation of custom actions through platform connector, event exporter, remediation routing, config validation, CRD/protobufs, templates, and tests.

Changes

Cohort / File(s) Summary
Protobuf Schema & CRD
data-models/protobufs/health_event.proto, distros/kubernetes/nvsentinel/charts/k8s-datastore/crds/health_event.crd.yaml
Added RecommendedAction.CUSTOM = 27 and string customRecommendedAction = 18 to HealthEvent; updated CRD schema to accept 27/"CUSTOM" and added spec.customRecommendedAction.
Model helper
data-models/pkg/model/health_event_extentions.go
Added exported GetEffectiveActionName(he *protos.HealthEvent) string returning customRecommendedAction when RecommendedAction == CUSTOM, otherwise the enum string.
Remediation routing & templates
fault-remediation/pkg/remediation/remediation.go, fault-remediation/pkg/common/equivalence_groups.go, fault-remediation/pkg/reconciler/reconciler.go, fault-remediation/pkg/reconciler/templates/custom-action-template.yaml
Switched action-name resolution to GetEffectiveActionName; applied it to action/template selection, metrics/log labels, and node annotation updates. Added templated custom-action CR manifest.
Validation rules & tests
fault-remediation/pkg/config/config.go, fault-remediation/pkg/config/config_test.go
Validation now disallows ImpactedEntityScope for built-in actions except COMPONENT_RESET; allows custom actions to define ImpactedEntityScope. Tests updated and added.
Platform Connector
platform-connectors/pkg/server/platform_connector_server.go
Added request validation and gRPC error returns: reject RecommendedAction == CUSTOM events when CustomRecommendedAction is empty; preserves pipeline/enqueue flow otherwise.
Event Exporter
event-exporter/pkg/transformer/cloudevents.go, event-exporter/pkg/transformer/cloudevents_test.go
CloudEvent payload now includes customRecommendedAction when non-empty; tests added to assert inclusion.
Generated Python bindings & stubs
**/protos/health_event_pb2.py, **/protos/health_event_pb2.pyi (gpu-health-monitor, preflight-checks, nccl-allreduce)
Regenerated descriptors and stubs to include CUSTOM enum member and customRecommendedAction field; updated serialized descriptor bytes and offsets; .pyi stubs expose the new enum/field.
Quarantine & Reconciler tests
fault-quarantine/pkg/evaluator/rule_evaluator_test.go, fault-remediation/pkg/reconciler/reconciler_test.go, fault-remediation/pkg/reconciler/reconciler_e2e_test.go
Updated expected serialized output to include customRecommendedAction; added unit and E2E tests covering CUSTOM actions, group-config presence, CR creation, and labels/annotations.
Miscellaneous
Various regenerated proto descriptor files under health-monitors/..., preflight-checks/..., nccl-allreduce/...
Regenerated protobuf descriptors and updated offsets across several subprojects to reflect the new field and enum value.

Sequence Diagram

sequenceDiagram
    participant Client as Client
    participant PlatformConn as Platform Connector
    participant EventExporter as Event Exporter
    participant Remediation as Fault Remediation
    participant EqGroup as Equivalence Groups
    participant Config as Remediation Config

    Client->>PlatformConn: Send HealthEvent (RecommendedAction=CUSTOM, customRecommendedAction="my-action")
    PlatformConn->>PlatformConn: Validate customRecommendedAction present
    alt customRecommendedAction empty
        PlatformConn->>Client: gRPC InvalidArgument error
    else valid
        PlatformConn->>EventExporter: Forward / Pipeline processing
        EventExporter->>Remediation: Emit/Enqueue CloudEvent (includes customRecommendedAction)
        Remediation->>Remediation: actionName = GetEffectiveActionName(HealthEvent)
        Remediation->>EqGroup: GetGroupConfigForEvent(actionName)
        EqGroup->>Config: Lookup actionName in remediationActions
        Config->>EqGroup: Return action config/template or nil
        EqGroup->>Remediation: Return equivalence group / impacted scope
        Remediation->>Client: Create remediation CR / update node annotation
    end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐇 I hopped through proto fields so bright,
CUSTOM now guides the remedial light,
A string that tells where fixes roam,
GetEffectiveActionName brings it home,
Remediation hops — full speed tonight!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 30.88% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'feat: Add ability to have custom remediation actions' directly and clearly summarizes the main change: introducing support for custom remediation actions as a new feature.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.11.4)

level=error msg="[linters_context] typechecking error: pattern ./...: directory prefix . does not contain main module or its selected dependencies"


Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
distros/kubernetes/nvsentinel/charts/k8s-datastore/crds/health_event.crd.yaml (1)

124-125: Consider adding CRD-level validation for CUSTOM action payloads.

Right now the schema allows recommendedAction: CUSTOM with an empty customRecommendedAction. Adding a CEL guard would prevent invalid CRs created outside gRPC ingress.

🧩 Suggested schema rule
                 customRecommendedAction:
                   type: string
+              x-kubernetes-validations:
+                - rule: "!(has(self.recommendedAction) && (self.recommendedAction == 27 || self.recommendedAction == 'CUSTOM')) || (has(self.customRecommendedAction) && self.customRecommendedAction != '')"
+                  message: "customRecommendedAction must be non-empty when recommendedAction is CUSTOM"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@distros/kubernetes/nvsentinel/charts/k8s-datastore/crds/health_event.crd.yaml`
around lines 124 - 125, Add a CRD-level CEL validation so that when the field
recommendedAction equals "CUSTOM" the field customRecommendedAction must be
non-empty; update the CRD's openAPIV3Schema/validation.rules (or the validation
section where CEL expressions live) to enforce the constraint by referencing the
fields recommendedAction and customRecommendedAction and using a CEL guard
equivalent to: if recommendedAction == "CUSTOM" then customRecommendedAction
must not be empty. Ensure the rule is attached to the health event CRD schema so
CRs created outside the gRPC ingress cannot omit customRecommendedAction when
choosing CUSTOM.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@data-models/protobufs/health_event.proto`:
- Line 113: The protobuf field customRecommendedAction lacks a descriptive
comment; add a clear single-line comment immediately above the field declaration
for customRecommendedAction describing its purpose, expected contents (e.g.,
free-text recommendation or structured action ID), examples or format
constraints, and any length/encoding limits so generated schemas and API docs
are meaningful; ensure the comment uses protobuf comment style (//) and update
any related documentation/generation steps if applicable.

In `@platform-connectors/pkg/server/platform_connector_server.go`:
- Around line 74-78: The check that rejects empty customRecommendedAction should
also reject whitespace-only values; update the conditional in the handler that
inspects event.RecommendedAction and event.CustomRecommendedAction to use
strings.TrimSpace(event.CustomRecommendedAction) (or equivalent) and treat a
trimmed empty string as invalid, returning the same InvalidArgument status (keep
context: event.NodeName, event.Agent) when RecommendedAction ==
pb.RecommendedAction_CUSTOM and the trimmed customRecommendedAction is empty.

---

Nitpick comments:
In
`@distros/kubernetes/nvsentinel/charts/k8s-datastore/crds/health_event.crd.yaml`:
- Around line 124-125: Add a CRD-level CEL validation so that when the field
recommendedAction equals "CUSTOM" the field customRecommendedAction must be
non-empty; update the CRD's openAPIV3Schema/validation.rules (or the validation
section where CEL expressions live) to enforce the constraint by referencing the
fields recommendedAction and customRecommendedAction and using a CEL guard
equivalent to: if recommendedAction == "CUSTOM" then customRecommendedAction
must not be empty. Ensure the rule is attached to the health event CRD schema so
CRs created outside the gRPC ingress cannot omit customRecommendedAction when
choosing CUSTOM.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: a32f72ad-9094-4eb0-9ab4-dd23ca1046d0

📥 Commits

Reviewing files that changed from the base of the PR and between 8101fc8 and b008319.

⛔ Files ignored due to path filters (2)
  • data-models/pkg/protos/health_event.pb.go is excluded by !**/*.pb.go
  • data-models/pkg/protos/health_event_grpc.pb.go is excluded by !**/*.pb.go
📒 Files selected for processing (8)
  • data-models/pkg/model/health_event_extentions.go
  • data-models/protobufs/health_event.proto
  • distros/kubernetes/nvsentinel/charts/k8s-datastore/crds/health_event.crd.yaml
  • event-exporter/pkg/transformer/cloudevents.go
  • fault-remediation/pkg/common/equivalence_groups.go
  • fault-remediation/pkg/config/config.go
  • fault-remediation/pkg/remediation/remediation.go
  • platform-connectors/pkg/server/platform_connector_server.go

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
fault-remediation/pkg/config/config_test.go (1)

256-276: Scenario intent and fixture are slightly misaligned.

The case name says “Built-in actions other than COMPONENT_RESET…”, but only RESTART_BM is configured with ImpactedEntityScope. This mostly duplicates the previous case and does not directly validate RESTART_VM rejection.

♻️ Suggested test fixture adjustment
 		{
 			name: "Built-in actions other than COMPONENT_RESET cannot have an ImpactedEntityScope",
 			config: TomlConfig{
 				Template: Template{MountPath: tempDir},
 				RemediationActions: map[string]MaintenanceResource{
 					"RESTART_VM": {
 						TemplateFileName: "template-a.yaml",
 						Scope:            "Cluster",
 						EquivalenceGroup: "restart",
+						ImpactedEntityScope: "GPU_UUID",
 					},
 					"RESTART_BM": {
 						TemplateFileName:             "template-b.yaml",
 						Scope:                        "Namespaced",
 						Namespace:                    "test-namespace",
 						EquivalenceGroup:             "reset",
 						SupersedingEquivalenceGroups: []string{"restart"},
-						ImpactedEntityScope:          "GPU_UUID",
 					},
 				},
 			},
 			expectError: true,
-			errorSubstr: "built-in action 'RESTART_BM' cannot have an ImpactedEntityScope",
+			errorSubstr: "built-in action 'RESTART_VM' cannot have an ImpactedEntityScope",
 		},
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@fault-remediation/pkg/config/config_test.go` around lines 256 - 276, The test
case name and fixture are misaligned: to validate that built-in actions other
than COMPONENT_RESET cannot have an ImpactedEntityScope, add an
ImpactedEntityScope to the "RESTART_VM" entry in the
TomlConfig.RemediationActions (or alternatively rename the case to reflect that
only "RESTART_BM" is tested); update the MaintenanceResource for "RESTART_VM" to
include ImpactedEntityScope (e.g., "GPU_UUID") so the test actually asserts
rejection for a non-COMPONENT_RESET built-in action (keep expectError and
errorSubstr as-is to verify the expected failure).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@fault-remediation/pkg/config/config_test.go`:
- Around line 256-276: The test case name and fixture are misaligned: to
validate that built-in actions other than COMPONENT_RESET cannot have an
ImpactedEntityScope, add an ImpactedEntityScope to the "RESTART_VM" entry in the
TomlConfig.RemediationActions (or alternatively rename the case to reflect that
only "RESTART_BM" is tested); update the MaintenanceResource for "RESTART_VM" to
include ImpactedEntityScope (e.g., "GPU_UUID") so the test actually asserts
rejection for a non-COMPONENT_RESET built-in action (keep expectError and
errorSubstr as-is to verify the expected failure).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 20eb3a76-03e1-4371-b2d1-54bda89bac6e

📥 Commits

Reviewing files that changed from the base of the PR and between b99d35b and 04071e8.

⛔ Files ignored due to path filters (1)
  • data-models/pkg/protos/health_event.pb.go is excluded by !**/*.pb.go
📒 Files selected for processing (2)
  • fault-quarantine/pkg/evaluator/rule_evaluator_test.go
  • fault-remediation/pkg/config/config_test.go
✅ Files skipped from review due to trivial changes (1)
  • fault-quarantine/pkg/evaluator/rule_evaluator_test.go

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/configuration/fault-remediation.md`:
- Line 106: The doc incorrectly states that only COMPONENT_RESET and custom
actions support impactedEntityScope; instead reword to say impactedEntityScope
is honored whenever an action's config includes the impactedEntityScope field
(the implementation in equivalence_groups.go applies it universally), so remove
the action-type restriction and update the example text (mentions of
COMPONENT_RESET and RESTART_BM can remain as examples but not as exclusive
support). Reference the impactedEntityScope field and the equivalence-group
construction logic in equivalence_groups.go when editing the sentence.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 188225ed-3ac5-4381-90d0-2fec2ff7657a

📥 Commits

Reviewing files that changed from the base of the PR and between 04071e8 and 1130672.

📒 Files selected for processing (1)
  • docs/configuration/fault-remediation.md

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 8

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (6)
platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go (1)

141-1528: ⚠️ Potential issue | 🔴 Critical

Fix custom recommended action handling and add test coverage.

The implementation has a bug: constructHealthEventMessage() only calls GetRecommendedAction(), which returns the default enum value NONE when the protobuf's oneof action field is set to the customRecommendedAction string branch. Custom action strings are silently lost. Update the message formatting to check GetCustomRecommendedAction() first, falling back to GetRecommendedAction() only if the custom value is empty.

Add a test case with HealthEvent_CustomRecommendedAction{CustomRecommendedAction: "CUSTOM_VALUE"} to verify the custom string appears correctly in the node condition message.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go`
around lines 141 - 1528, constructHealthEventMessage() currently only uses
GetRecommendedAction(), which drops protobuf oneof custom string values; change
it to prefer GetCustomRecommendedAction() (use it when non-empty) and fall back
to GetRecommendedAction() only if the custom string is empty so
customRecommendedAction is preserved; update message formatting logic that
appends "Recommended Action=" to use the chosen value, and add a unit test that
constructs a HealthEvent with
HealthEvent_CustomRecommendedAction{CustomRecommendedAction: "CUSTOM_VALUE"} to
assert the resulting node condition message contains "Recommended
Action=CUSTOM_VALUE".
fault-remediation/pkg/reconciler/reconciler_test.go (1)

297-360: ⚠️ Potential issue | 🟡 Minor

This test still bypasses the reconciler logic.

Both the existing cases and the new custom-action case call the mock CreateMaintenanceResource directly, so they don't verify any of the action handling in handleRemediationEvent / performRemediation. That gives false confidence for the new feature path. Please invoke the reconciler method under test and assert on the mock inputs from there.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@fault-remediation/pkg/reconciler/reconciler_test.go` around lines 297 - 360,
The test is calling the mock CreateMaintenanceResource directly instead of
exercising the reconciler flow; change the test to invoke the reconciler method
under test (e.g., call r.handleRemediationEvent or r.performRemediation with the
constructed healthEventData and groupConfig on the reconciler returned by
NewFaultRemediationReconciler) so the reconciler code executes its action
handling, and keep the MockK8sClient.createMaintenanceResourceFn assertions to
validate the inputs and return values; then assert the reconciler result (error
or CR name) matches tt.expectedError / expected CR name instead of calling
CreateMaintenanceResource directly.
health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go (1)

335-356: ⚠️ Potential issue | 🟠 Major

Preserve custom remediation names instead of downgrading them to NONE.

The HealthEvent protobuf message defines a customRecommendedAction field (via oneof) to support custom actions, but the Go code only uses the RecommendedAction enum path and silently converts all unknown action strings to NONE. This loses custom action names before they reach the platform connector.

Update the action assignment to check the enum first, then fall back to the custom action field for unknown non-empty strings:

Proposed fix
-	actionEnum, ok := pb.RecommendedAction_value[event.RecommendedAction]
-	if !ok {
-		slog.Warn(
-			"Unknown recommended action; defaulting to NONE.",
-			"recommendedAction",
-			event.RecommendedAction,
-			"eventID",
-			event.EventID,
-		)
-
-		actionEnum = int32(pb.RecommendedAction_NONE)
-	}
-
 	healthEvent := &pb.HealthEvent{
 		Agent:              "csp-health-monitor", // Consistent agent name
 		ComponentClass:     event.ResourceType,   // e.g., "EC2", "gce_instance"
 		CheckName:          "CSPMaintenance",     // Consistent check name
 		IsFatal:            isFatal,
 		IsHealthy:          isHealthy,
 		ProcessingStrategy: e.processingStrategy,
 		Message:            message,
-		Action:             &pb.HealthEvent_RecommendedAction{RecommendedAction: pb.RecommendedAction(actionEnum)},
 		EntitiesImpacted: []*pb.Entity{
 			{
 				EntityType:  event.ResourceType,
 				EntityValue: event.ResourceID, // CSP's ID (e.g., instance-id, full gcp resource name)
 			},
@@
 		GeneratedTimestamp: timestamppb.New(time.Now()),
 	}
+
+	if actionEnum, ok := pb.RecommendedAction_value[event.RecommendedAction]; ok {
+		healthEvent.Action = &pb.HealthEvent_RecommendedAction{
+			RecommendedAction: pb.RecommendedAction(actionEnum),
+		}
+	} else if event.RecommendedAction != "" {
+		healthEvent.Action = &pb.HealthEvent_CustomRecommendedAction{
+			CustomRecommendedAction: event.RecommendedAction,
+		}
+	} else {
+		healthEvent.Action = &pb.HealthEvent_RecommendedAction{
+			RecommendedAction: pb.RecommendedAction_NONE,
+		}
+	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go` around lines
335 - 356, The current code maps event.RecommendedAction only via
pb.RecommendedAction_value and replaces unknowns with NONE, losing custom
remediation names; change the Action assignment logic in the HealthEvent
construction: first attempt to map event.RecommendedAction via
pb.RecommendedAction_value (as you already do) and if found set Action =
&pb.HealthEvent_RecommendedAction{RecommendedAction:
pb.RecommendedAction(actionEnum)}; if not found and event.RecommendedAction is a
non-empty string set Action =
&pb.HealthEvent_CustomRecommendedAction{CustomRecommendedAction:
event.RecommendedAction}; otherwise fall back to setting RecommendedAction to
pb.RecommendedAction_NONE. Update references around pb.RecommendedAction_value,
event.RecommendedAction, and the HealthEvent Action oneof to implement this
fallback.
event-exporter/pkg/transformer/cloudevents.go (1)

53-71: ⚠️ Potential issue | 🟠 Major

Don't emit recommendedAction: "NONE" for custom-action events.

The HealthEvent.action field is a protobuf oneof. When the active arm is customRecommendedAction, calling event.GetRecommendedAction() returns RecommendedAction_NONE because the type assertion fails. This causes custom-action events to be exported with both "recommendedAction": "NONE" and "customRecommendedAction": "actual_value", which mislabels the exported event for downstream consumers. Serialize the active oneof arm explicitly instead.

Proposed fix
 	healthEventData := map[string]any{
 		"version":            event.Version,
 		"agent":              event.Agent,
 		"componentClass":     event.ComponentClass,
 		"checkName":          event.CheckName,
 		"isFatal":            event.IsFatal,
 		"isHealthy":          event.IsHealthy,
 		"message":            event.Message,
-		"recommendedAction":  event.GetRecommendedAction().String(),
 		"errorCode":          errorCodes,
 		"entitiesImpacted":   entities,
 		"generatedTimestamp": timestamp,
 		"nodeName":           event.NodeName,
 		"processingStrategy": event.ProcessingStrategy.String(),
 	}
 
-	if event.GetCustomRecommendedAction() != "" {
-		healthEventData["customRecommendedAction"] = event.GetCustomRecommendedAction()
+	switch action := event.Action.(type) {
+	case *pb.HealthEvent_RecommendedAction:
+		healthEventData["recommendedAction"] = action.RecommendedAction.String()
+	case *pb.HealthEvent_CustomRecommendedAction:
+		healthEventData["customRecommendedAction"] = action.CustomRecommendedAction
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@event-exporter/pkg/transformer/cloudevents.go` around lines 53 - 71, The
exported map healthEventData is always populated with
event.GetRecommendedAction(), which yields RecommendedAction_NONE when the oneof
active arm is customRecommendedAction; update the serialization to only set
"recommendedAction" when the protobuf-arm for the enum is actually active (e.g.,
when event.GetCustomRecommendedAction() == "" or by switching on the oneof
field), and keep the existing clause that sets "customRecommendedAction" from
event.GetCustomRecommendedAction(); in short, only add
healthEventData["recommendedAction"] = event.GetRecommendedAction().String()
when the enum arm is the active one, otherwise omit it so custom-action events
are not emitted with "NONE".
distros/kubernetes/nvsentinel/charts/k8s-datastore/crds/health_event.crd.yaml (1)

49-74: ⚠️ Potential issue | 🟠 Major

Enforce the oneOf contract in the CRD schema.

The proto definition (data-models/protobufs/health_event.proto, lines 101-104) explicitly declares a oneof action that enforces mutual exclusivity between recommendedAction and customRecommendedAction. However, the generated CRD schema fails to translate this constraint into Kubernetes validation rules, allowing invalid CRs to be admitted with both fields set simultaneously or with an empty customRecommendedAction string. This violates the protobuf contract and will cause downstream rejection or misinterpretation.

Add x-kubernetes-validations to the proto source to ensure the CRD generator includes:

  • Mutual exclusivity: only one of recommendedAction or customRecommendedAction may be set
  • Non-empty constraint: customRecommendedAction must be a non-empty string when present
Suggested validation shape
+ x-kubernetes-validations:
+   - rule: "!(has(self.recommendedAction) && has(self.customRecommendedAction))"
+     message: "only one of recommendedAction or customRecommendedAction may be set"
+   - rule: "!has(self.customRecommendedAction) || self.customRecommendedAction != ''"
+     message: "customRecommendedAction must be non-empty"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@distros/kubernetes/nvsentinel/charts/k8s-datastore/crds/health_event.crd.yaml`
around lines 49 - 74, Add x-kubernetes-validations to the CRD schema for the
oneof named "action" so the generator enforces mutual exclusivity and non-empty
custom strings: add a validation rule using a CEL expression that enforces
exactly one of recommendedAction or customRecommendedAction is present (e.g.
has(self.recommendedAction) != has(self.customRecommendedAction)) and a second
rule that, when customRecommendedAction is present, it is non-empty (e.g.
!has(self.customRecommendedAction) || self.customRecommendedAction != ""),
referencing the fields recommendedAction and customRecommendedAction so the
generated CRD will reject objects that set both or an empty
customRecommendedAction.
platform-connectors/pkg/transformers/overrides/transformer.go (1)

42-47: ⚠️ Potential issue | 🟠 Major

Custom actions are incorrectly flattened to NONE in state capture and comparison.

The GetRecommendedAction() getter on line 46 returns RecommendedAction_NONE as a default when the action oneof is set to customRecommendedAction instead of recommendedAction. This causes:

  1. Line 46: Original custom action is lost during state capture.
  2. Line 144: Change detection fails when comparing enum overrides against custom actions (both resolve to NONE), potentially skipping valid override operations.
  3. Line 165: Logging shows NONE instead of the actual custom action.

The fix requires detecting which oneof branch is active and preserving the actual value:

Suggested fix
 type originalState struct {
 	isFatal           bool
 	isHealthy         bool
-	recommendedAction pb.RecommendedAction
+	actionName        string
 }

+func getActionName(event *pb.HealthEvent) string {
+	switch a := event.Action.(type) {
+	case *pb.HealthEvent_RecommendedAction:
+		return a.RecommendedAction.String()
+	case *pb.HealthEvent_CustomRecommendedAction:
+		return a.CustomRecommendedAction
+	default:
+		return pb.RecommendedAction_NONE.String()
+	}
+}
+
 func captureOriginalState(event *pb.HealthEvent) originalState {
 	return originalState{
-		isFatal:           event.IsFatal,
-		isHealthy:         event.IsHealthy,
-		recommendedAction: event.GetRecommendedAction(),
+		isFatal:    event.IsFatal,
+		isHealthy:  event.IsHealthy,
+		actionName: getActionName(event),
 	}
 }

 	if rule.override.RecommendedAction != nil {
 		newAction, err := rule.override.ParseRecommendedAction()
@@ -142,10 +144,12 @@
-		if newAction != original.recommendedAction {
+		currentRA, isRecommended := event.Action.(*pb.HealthEvent_RecommendedAction)
+		if !isRecommended || currentRA.RecommendedAction != newAction {
 			event.Action = &pb.HealthEvent_RecommendedAction{RecommendedAction: newAction}

 			slog.InfoContext(ctx, "Applied health event override",
-			"original_action", original.recommendedAction.String(),
-			"new_action", event.GetRecommendedAction().String())
+			"original_action", original.actionName,
+			"new_action", getActionName(event))
 	}

Also applies to: 144-146, 164-165

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@platform-connectors/pkg/transformers/overrides/transformer.go` around lines
42 - 47, captureOriginalState currently uses the protobuf-generated
GetRecommendedAction() which returns RecommendedAction_NONE when the oneof is
set to customRecommendedAction, causing custom actions to be lost; update
captureOriginalState to detect which oneof branch is active (use the
oneof-specific getters such as GetCustomRecommendedAction() vs
GetRecommendedAction() or the generated ActionCase/XXXCase accessor on
pb.HealthEvent), store both the enum recommendedAction and the
customRecommendedAction (or a flag indicating which branch is set) into
originalState, and then update the change-detection logic (the comparison around
the earlier-mentioned enum checks) to compare the enum when recommendedAction is
set and compare the custom string when customRecommendedAction is set; also
update the logging to print the preserved customRecommendedAction when present
instead of showing NONE.
🧹 Nitpick comments (4)
fault-quarantine/pkg/evaluator/rule_evaluator.go (1)

300-300: Consider adding a defensive check before double-dereference.

The v.Elem().Elem() pattern relies on protobuf-go's convention that oneof variants are always pointer types. While this holds for standard protobuf-generated code, a kind check would make the code more robust against unexpected inputs.

🛡️ Optional defensive check
 func handleOneofField(v reflect.Value, result map[string]interface{}) {
 	if v.IsNil() {
 		return
 	}

-	concrete := v.Elem().Elem()
+	elem := v.Elem()
+	if elem.Kind() != reflect.Ptr || elem.IsNil() {
+		return
+	}
+
+	concrete := elem.Elem()
 	concreteType := concrete.Type()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@fault-quarantine/pkg/evaluator/rule_evaluator.go` at line 300, The
double-dereference using v.Elem().Elem() is unsafe for unexpected inputs; before
assigning concrete, add defensive checks that v is valid and a pointer and that
v.Elem() is valid and also a pointer (e.g., verify v.IsValid(),
v.Kind()==reflect.Ptr, v.Elem().IsValid(), v.Elem().Kind()==reflect.Ptr) and
handle the failure path (return an error or skip processing) so that the code
around the concrete variable (the v and concrete usage in rule_evaluator.go)
does not panic on non-conforming values.
tilt/simple-health-client/main.go (1)

53-54: Prefer structured debug logs with slog here.

Line 53 and Line 78 still use formatted log.Printf; please switch these updated debug logs to structured slog.Debug fields for consistency and easier filtering.

♻️ Suggested change
-	log.Printf("[DEBUG] Received health event - Node: %s, CheckName: %s, Agent: %s, IsFatal: %v, RecommendedAction: %v",
-		healthEvent.NodeName, healthEvent.CheckName, healthEvent.Agent, healthEvent.IsFatal, healthEvent.GetRecommendedAction())
+	slog.Debug("Received health event",
+		"node", healthEvent.NodeName,
+		"check_name", healthEvent.CheckName,
+		"agent", healthEvent.Agent,
+		"is_fatal", healthEvent.IsFatal,
+		"recommended_action", healthEvent.GetRecommendedAction().String(),
+	)
@@
-	log.Printf("[DEBUG] Sending health event to platform-connector - Node: %s, CheckName: %s, RecommendedAction: %v",
-		healthEvent.NodeName, healthEvent.CheckName, healthEvent.GetRecommendedAction())
+	slog.Debug("Sending health event to platform-connector",
+		"node", healthEvent.NodeName,
+		"check_name", healthEvent.CheckName,
+		"recommended_action", healthEvent.GetRecommendedAction().String(),
+	)

As per coding guidelines, "Use structured logging via log/slog in Go code".

Also applies to: 78-79

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tilt/simple-health-client/main.go` around lines 53 - 54, Replace the
formatted log.Printf debug calls that print healthEvent fields with structured
slog logging: call slog.Debug (or use slog.With(...).Debug) and emit key/value
fields for Node (healthEvent.NodeName), CheckName (healthEvent.CheckName), Agent
(healthEvent.Agent), IsFatal (healthEvent.IsFatal) and RecommendedAction
(healthEvent.GetRecommendedAction()); do the same for the other formatted
log.Printf occurrence around the health event handling (the one using
healthEvent.GetRecommendedAction()), ensuring you use consistent field keys and
structured slog calls instead of printf formatting.
event-exporter/pkg/transformer/cloudevents_test.go (1)

30-259: Add a custom-action CloudEvent case.

This table only covers the enum arm of HealthEvent.Action. Please add a case with CustomRecommendedAction and assert that the payload includes customRecommendedAction without backfilling recommendedAction to "NONE"; that is the failure mode most likely to slip through here.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@event-exporter/pkg/transformer/cloudevents_test.go` around lines 30 - 259,
Add a new table test entry in the tests slice that exercises the non-enum arm of
HealthEvent.Action by setting event.Action =
&pb.HealthEvent_CustomRecommendedAction{CustomRecommendedAction: "do-thing"}; in
its validateFunc, assert the resulting CloudEvent (CloudEvent ce returned by
TransformHealthEvent or equivalent) contains
healthEvent["customRecommendedAction"] == "do-thing" and that healthEvent does
not incorrectly backfill healthEvent["recommendedAction"] to "NONE" (i.e.,
either recommendedAction is absent or not equal to "NONE"); reference the tests
slice, pb.HealthEvent and the CloudEvent struct when adding this case and its
validateFunc.
fault-remediation/pkg/common/equivalence_groups_test.go (1)

51-101: Add a custom-action case to this table.

TestGetGroupConfigForEvent now exercises only the enum arm of HealthEvent.Action. A CustomRecommendedAction case here would lock the new lookup path directly and catch regressions before they surface in higher-level reconciler tests.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@fault-remediation/pkg/common/equivalence_groups_test.go` around lines 51 -
101, Add a table case that exercises the CustomRecommendedAction arm: add a test
entry (e.g., name "Custom recommended action") where healthEvent.Action is set
to &protos.HealthEvent_CustomRecommendedAction{CustomRecommendedAction:
"my_custom_action"}; set expectError appropriately (usually false) and set
expectedGroupConfig to the expected result for a custom action (nil if no
mapping exists) so TestGetGroupConfigForEvent exercises the custom-action lookup
path in the GetGroupConfigForEvent logic.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@commons/pkg/tracing/span_attributes.go`:
- Line 78: Replace the direct enum stringization used in the span attribute
(attribute.String("health_event.recommended_action",
event.GetRecommendedAction().String())) with the helper that preserves custom
names: call model.GetEffectiveActionName(event) and pass its result into
attribute.String so both enum and customRecommendedAction values are preserved
in tracing; update the reference in the span_attributes code where
attribute.String is invoked for "health_event.recommended_action".

In `@health-monitors/kubernetes-object-monitor/pkg/publisher/publisher.go`:
- Line 92: The current code always constructs the enum oneof branch using
Action: &pb.HealthEvent_RecommendedAction{RecommendedAction:
mapRecommendedAction(...)} which coerces unknown/custom strings to
CONTACT_SUPPORT and drops custom names; change the logic so mapRecommendedAction
returns a clear signal for “unknown/custom” (e.g., an optional/nullable enum or
an OK flag) and then, when the input policy.HealthEvent.RecommendedAction is not
a known enum, populate the HealthEvent_CustomRecommendedAction oneof with the
original string instead of the enum branch; update all occurrences (the existing
use at the Action construction and the similar block around lines 146-151) to
choose between pb.HealthEvent_RecommendedAction (for known enums) and
pb.HealthEvent_CustomRecommendedAction (for custom names) while preserving the
original string for the custom branch.

In `@health-monitors/slurm-drain-monitor/pkg/publisher/publisher.go`:
- Line 118: The code currently funnels every recommended action through
mapRecommendedAction(r.RecommendedAction) which maps unknown strings to
CONTACT_SUPPORT and thus strips custom action names; change the publisher to
preserve custom actions by sending the original string when mapRecommendedAction
would return the generic CONTACT_SUPPORT (i.e., call
mapRecommendedAction(r.RecommendedAction) but if the result equals
pb.HealthEvent_RecommendedAction_CONTACT_SUPPORT and r.RecommendedAction is a
non-empty, non-standard value, set Action to a RecommendedAction containing the
original r.RecommendedAction string instead of the CONTACT_SUPPORT enum); update
the logic around mapRecommendedAction and the Action assignment to prefer the
explicit custom string while still using the mapped enum for known values.

In `@node-drainer/pkg/evaluator/evaluator.go`:
- Line 647: The partial-drain branch currently only checks
healthEvent.GetRecommendedAction(), which misses customRecommendedAction oneof
values; update the condition (near the check using e.config.PartialDrainEnabled
and healthEvent.GetRecommendedAction()) to also detect when a
customRecommendedAction semantically maps to COMPONENT_RESET by inspecting the
customRecommendedAction oneof (e.g., via
healthEvent.GetCustomRecommendedAction() or the generated RecommendedActionCase
accessor) and comparing its semantic value to
protos.RecommendedAction_COMPONENT_RESET (or the equivalent enum/field on the
custom payload); ensure both the standard enum and customRecommendedAction paths
trigger the partial-drain branch when they represent COMPONENT_RESET.

In `@platform-connectors/pkg/connectors/kubernetes/process_node_events.go`:
- Line 484: The code appends healthEvent.GetRecommendedAction().String(), which
yields "NONE" when the oneof is set to customRecommendedAction and thus drops
the custom text; update the logic around message assembly (where message +=
fmt.Sprintf("Recommended Action=%s;", ... ) is built) to first check
healthEvent.GetCustomRecommendedAction() and if non-empty use that string,
otherwise fall back to healthEvent.GetRecommendedAction().String(); locate the
usage by looking for GetRecommendedAction, GetCustomRecommendedAction, and the
line that appends "Recommended Action=" and change it to prefer the custom
string before the enum.

In `@platform-connectors/pkg/transformers/overrides/cel.go`:
- Line 124: The CEL event map is using event.GetRecommendedAction().String()
which returns the enum value only and misses custom remediation strings; change
the map entry for "recommendedAction" to compute the effective name first (e.g.,
var action := event.GetRecommendedAction().String(); if action == "" { action =
event.GetCustomRecommendedAction() } ) and then set "recommendedAction": action
so customRecommendedAction is used when present; reference
event.GetRecommendedAction(), event.GetCustomRecommendedAction(), and the
"recommendedAction" map entry when making the change.

In `@store-client/pkg/datastore/providers/postgresql/database_client.go`:
- Line 213: Replace the call to GetRecommendedAction().String() with the helper
GetEffectiveActionName() when building the health event (the field currently set
as recommendedAction: modelEvent.HealthEvent.GetRecommendedAction().String());
use modelEvent.HealthEvent.GetEffectiveActionName() instead so custom
remediation actions are preserved; also add the import for
"github.com/NVIDIA/NVSentinel/data-models/pkg/model" if missing. Apply the
identical change for the analogous occurrence in health_events.go (around the
code that sets recommendedAction at line ~104).

In `@store-client/pkg/datastore/providers/postgresql/health_events.go`:
- Line 104: The code sets fields.recommendedAction using
protoEvent.GetRecommendedAction().String(), which drops customRecommendedAction
values; replace that assignment to call model.GetEffectiveActionName(protoEvent)
(from data-models/pkg/model) so the helper picks the correct oneof value and
preserves custom remediation actions; apply the same replacement at the other
occurrence referenced (database_client.go:213) to ensure consistent indexing and
lookup for custom actions.

---

Outside diff comments:
In
`@distros/kubernetes/nvsentinel/charts/k8s-datastore/crds/health_event.crd.yaml`:
- Around line 49-74: Add x-kubernetes-validations to the CRD schema for the
oneof named "action" so the generator enforces mutual exclusivity and non-empty
custom strings: add a validation rule using a CEL expression that enforces
exactly one of recommendedAction or customRecommendedAction is present (e.g.
has(self.recommendedAction) != has(self.customRecommendedAction)) and a second
rule that, when customRecommendedAction is present, it is non-empty (e.g.
!has(self.customRecommendedAction) || self.customRecommendedAction != ""),
referencing the fields recommendedAction and customRecommendedAction so the
generated CRD will reject objects that set both or an empty
customRecommendedAction.

In `@event-exporter/pkg/transformer/cloudevents.go`:
- Around line 53-71: The exported map healthEventData is always populated with
event.GetRecommendedAction(), which yields RecommendedAction_NONE when the oneof
active arm is customRecommendedAction; update the serialization to only set
"recommendedAction" when the protobuf-arm for the enum is actually active (e.g.,
when event.GetCustomRecommendedAction() == "" or by switching on the oneof
field), and keep the existing clause that sets "customRecommendedAction" from
event.GetCustomRecommendedAction(); in short, only add
healthEventData["recommendedAction"] = event.GetRecommendedAction().String()
when the enum arm is the active one, otherwise omit it so custom-action events
are not emitted with "NONE".

In `@fault-remediation/pkg/reconciler/reconciler_test.go`:
- Around line 297-360: The test is calling the mock CreateMaintenanceResource
directly instead of exercising the reconciler flow; change the test to invoke
the reconciler method under test (e.g., call r.handleRemediationEvent or
r.performRemediation with the constructed healthEventData and groupConfig on the
reconciler returned by NewFaultRemediationReconciler) so the reconciler code
executes its action handling, and keep the
MockK8sClient.createMaintenanceResourceFn assertions to validate the inputs and
return values; then assert the reconciler result (error or CR name) matches
tt.expectedError / expected CR name instead of calling CreateMaintenanceResource
directly.

In `@health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go`:
- Around line 335-356: The current code maps event.RecommendedAction only via
pb.RecommendedAction_value and replaces unknowns with NONE, losing custom
remediation names; change the Action assignment logic in the HealthEvent
construction: first attempt to map event.RecommendedAction via
pb.RecommendedAction_value (as you already do) and if found set Action =
&pb.HealthEvent_RecommendedAction{RecommendedAction:
pb.RecommendedAction(actionEnum)}; if not found and event.RecommendedAction is a
non-empty string set Action =
&pb.HealthEvent_CustomRecommendedAction{CustomRecommendedAction:
event.RecommendedAction}; otherwise fall back to setting RecommendedAction to
pb.RecommendedAction_NONE. Update references around pb.RecommendedAction_value,
event.RecommendedAction, and the HealthEvent Action oneof to implement this
fallback.

In
`@platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go`:
- Around line 141-1528: constructHealthEventMessage() currently only uses
GetRecommendedAction(), which drops protobuf oneof custom string values; change
it to prefer GetCustomRecommendedAction() (use it when non-empty) and fall back
to GetRecommendedAction() only if the custom string is empty so
customRecommendedAction is preserved; update message formatting logic that
appends "Recommended Action=" to use the chosen value, and add a unit test that
constructs a HealthEvent with
HealthEvent_CustomRecommendedAction{CustomRecommendedAction: "CUSTOM_VALUE"} to
assert the resulting node condition message contains "Recommended
Action=CUSTOM_VALUE".

In `@platform-connectors/pkg/transformers/overrides/transformer.go`:
- Around line 42-47: captureOriginalState currently uses the protobuf-generated
GetRecommendedAction() which returns RecommendedAction_NONE when the oneof is
set to customRecommendedAction, causing custom actions to be lost; update
captureOriginalState to detect which oneof branch is active (use the
oneof-specific getters such as GetCustomRecommendedAction() vs
GetRecommendedAction() or the generated ActionCase/XXXCase accessor on
pb.HealthEvent), store both the enum recommendedAction and the
customRecommendedAction (or a flag indicating which branch is set) into
originalState, and then update the change-detection logic (the comparison around
the earlier-mentioned enum checks) to compare the enum when recommendedAction is
set and compare the custom string when customRecommendedAction is set; also
update the logging to print the preserved customRecommendedAction when present
instead of showing NONE.

---

Nitpick comments:
In `@event-exporter/pkg/transformer/cloudevents_test.go`:
- Around line 30-259: Add a new table test entry in the tests slice that
exercises the non-enum arm of HealthEvent.Action by setting event.Action =
&pb.HealthEvent_CustomRecommendedAction{CustomRecommendedAction: "do-thing"}; in
its validateFunc, assert the resulting CloudEvent (CloudEvent ce returned by
TransformHealthEvent or equivalent) contains
healthEvent["customRecommendedAction"] == "do-thing" and that healthEvent does
not incorrectly backfill healthEvent["recommendedAction"] to "NONE" (i.e.,
either recommendedAction is absent or not equal to "NONE"); reference the tests
slice, pb.HealthEvent and the CloudEvent struct when adding this case and its
validateFunc.

In `@fault-quarantine/pkg/evaluator/rule_evaluator.go`:
- Line 300: The double-dereference using v.Elem().Elem() is unsafe for
unexpected inputs; before assigning concrete, add defensive checks that v is
valid and a pointer and that v.Elem() is valid and also a pointer (e.g., verify
v.IsValid(), v.Kind()==reflect.Ptr, v.Elem().IsValid(),
v.Elem().Kind()==reflect.Ptr) and handle the failure path (return an error or
skip processing) so that the code around the concrete variable (the v and
concrete usage in rule_evaluator.go) does not panic on non-conforming values.

In `@fault-remediation/pkg/common/equivalence_groups_test.go`:
- Around line 51-101: Add a table case that exercises the
CustomRecommendedAction arm: add a test entry (e.g., name "Custom recommended
action") where healthEvent.Action is set to
&protos.HealthEvent_CustomRecommendedAction{CustomRecommendedAction:
"my_custom_action"}; set expectError appropriately (usually false) and set
expectedGroupConfig to the expected result for a custom action (nil if no
mapping exists) so TestGetGroupConfigForEvent exercises the custom-action lookup
path in the GetGroupConfigForEvent logic.

In `@tilt/simple-health-client/main.go`:
- Around line 53-54: Replace the formatted log.Printf debug calls that print
healthEvent fields with structured slog logging: call slog.Debug (or use
slog.With(...).Debug) and emit key/value fields for Node (healthEvent.NodeName),
CheckName (healthEvent.CheckName), Agent (healthEvent.Agent), IsFatal
(healthEvent.IsFatal) and RecommendedAction
(healthEvent.GetRecommendedAction()); do the same for the other formatted
log.Printf occurrence around the health event handling (the one using
healthEvent.GetRecommendedAction()), ensuring you use consistent field keys and
structured slog calls instead of printf formatting.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 1afd73ad-5aa5-4c2c-a01b-6fb5193c0c04

📥 Commits

Reviewing files that changed from the base of the PR and between 1130672 and a45866f.

⛔ Files ignored due to path filters (6)
  • api/gen/go/csp/v1alpha1/provider.pb.go is excluded by !**/*.pb.go, !**/gen/**
  • api/gen/go/csp/v1alpha1/provider_grpc.pb.go is excluded by !**/*.pb.go, !**/gen/**
  • api/gen/go/device/v1alpha1/gpu.pb.go is excluded by !**/*.pb.go, !**/gen/**
  • api/gen/go/device/v1alpha1/gpu_grpc.pb.go is excluded by !**/*.pb.go, !**/gen/**
  • data-models/pkg/protos/health_event.pb.go is excluded by !**/*.pb.go
  • data-models/pkg/protos/health_event_grpc.pb.go is excluded by !**/*.pb.go
📒 Files selected for processing (47)
  • commons/pkg/tracing/span_attributes.go
  • data-models/pkg/model/health_event_extentions.go
  • data-models/protobufs/health_event.proto
  • distros/kubernetes/nvsentinel/charts/k8s-datastore/crds/health_event.crd.yaml
  • event-exporter/pkg/transformer/cloudevents.go
  • event-exporter/pkg/transformer/cloudevents_test.go
  • fault-quarantine/pkg/evaluator/rule_evaluator.go
  • fault-quarantine/pkg/evaluator/rule_evaluator_test.go
  • fault-remediation/pkg/common/equivalence_groups_test.go
  • fault-remediation/pkg/reconciler/reconciler.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/remediation/remediation.go
  • fault-remediation/pkg/remediation/remediation_test.go
  • health-events-analyzer/pkg/publisher/publisher.go
  • health-events-analyzer/pkg/reconciler/reconciler_test.go
  • health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go
  • health-monitors/csp-health-monitor/pkg/triggerengine/trigger_test.go
  • health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py
  • health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi
  • health-monitors/kubernetes-object-monitor/pkg/publisher/publisher.go
  • health-monitors/slurm-drain-monitor/pkg/publisher/publisher.go
  • health-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler.go
  • health-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler_test.go
  • health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler.go
  • health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler_test.go
  • health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor.go
  • health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor_test.go
  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go
  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
  • node-drainer/pkg/evaluator/evaluator.go
  • platform-connectors/pkg/connectors/kubernetes/k8s_connector_envtest_test.go
  • platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
  • platform-connectors/pkg/connectors/kubernetes/process_node_events.go
  • platform-connectors/pkg/server/platform_connector_server.go
  • platform-connectors/pkg/transformers/overrides/cel.go
  • platform-connectors/pkg/transformers/overrides/transformer.go
  • platform-connectors/pkg/transformers/overrides/transformer_test.go
  • preflight-checks/dcgm-diag/dcgm_diag/protos/health_event_pb2.py
  • preflight-checks/dcgm-diag/dcgm_diag/protos/health_event_pb2.pyi
  • preflight-checks/nccl-allreduce/nccl_allreduce/protos/health_event_pb2.py
  • preflight-checks/nccl-allreduce/nccl_allreduce/protos/health_event_pb2.pyi
  • preflight-checks/nccl-loopback/pkg/health/reporter.go
  • store-client/pkg/datastore/providers/postgresql/database_client.go
  • store-client/pkg/datastore/providers/postgresql/health_events.go
  • tests/scale-tests/event-generator/main.go
  • tilt/simple-health-client/main.go
✅ Files skipped from review due to trivial changes (2)
  • fault-remediation/pkg/reconciler/reconciler.go
  • preflight-checks/nccl-allreduce/nccl_allreduce/protos/health_event_pb2.pyi
🚧 Files skipped from review as they are similar to previous changes (10)
  • fault-remediation/pkg/remediation/remediation.go
  • data-models/pkg/model/health_event_extentions.go
  • platform-connectors/pkg/server/platform_connector_server.go
  • preflight-checks/nccl-allreduce/nccl_allreduce/protos/health_event_pb2.py
  • preflight-checks/dcgm-diag/dcgm_diag/protos/health_event_pb2.pyi
  • health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi
  • data-models/protobufs/health_event.proto
  • health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py
  • preflight-checks/dcgm-diag/dcgm_diag/protos/health_event_pb2.py
  • fault-quarantine/pkg/evaluator/rule_evaluator_test.go

@XRFXLP XRFXLP force-pushed the xrfxlp/1072/implementation branch from 1ea0919 to 27ed466 Compare April 13, 2026 03:12
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
platform-connectors/pkg/transformers/overrides/transformer.go (1)

43-48: ⚠️ Potential issue | 🟠 Major

Preserve the original oneof action when deciding whether the override changed it.

GetRecommendedAction() returns RecommendedAction_NONE when customRecommendedAction is set, making it impossible to distinguish a custom action from an actual NONE value in the comparison at line 145. A rule attempting to override a custom action to NONE will be treated as a no-op, and the logging on lines 165–166 will misreport the action. Store the original oneof branch state separately instead of capturing the flattened enum value.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@platform-connectors/pkg/transformers/overrides/transformer.go` around lines
43 - 48, The code flattens the oneof by storing only
event.GetRecommendedAction(), which loses whether the oneof branch was custom
and causes incorrect no-op detection and logging; update captureOriginalState
and the originalState type to record the oneof branch (e.g., add a bool like
hasCustomRecommendedAction or an enum originalActionOneof) by checking the oneof
field (e.g., event.CustomRecommendedAction != nil or equivalent getter) in
captureOriginalState, and then change the override comparison logic that
currently uses GetRecommendedAction() (and the logging around that decision) to
consider both the enum value and the stored oneof branch so a custom action vs
explicit NONE is distinguished.
♻️ Duplicate comments (3)
store-client/pkg/datastore/providers/postgresql/database_client.go (1)

213-213: ⚠️ Potential issue | 🔴 Critical

Custom remediation action is lost in index field extraction (Line 213).

Using GetRecommendedAction().String() collapses customRecommendedAction to enum default (NONE) for oneof-backed events. This breaks custom-action preservation in the PostgreSQL indexed field.

Suggested fix
-				recommendedAction: modelEvent.HealthEvent.GetRecommendedAction().String(),
+				recommendedAction: model.GetEffectiveActionName(modelEvent.HealthEvent),
#!/bin/bash
set -euo pipefail

# 1) Find enum-only extraction patterns that may drop custom oneof values
rg -n 'recommendedAction:\s*.*GetRecommendedAction\(\)\.String\(\)' \
  store-client/pkg/datastore/providers/postgresql/database_client.go \
  store-client/pkg/datastore/providers/postgresql/health_events.go

# 2) Verify effective-action helper exists in model package
rg -n 'func\s+GetEffectiveActionName\s*\(' data-models/pkg/model --iglob '*.go'
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@store-client/pkg/datastore/providers/postgresql/database_client.go` at line
213, The indexed field extraction uses GetRecommendedAction().String(), which
loses oneof-backed customRecommendedAction values (collapsing to NONE); update
the mapping for recommendedAction to use the model-level effective-action helper
(e.g., call GetEffectiveActionName or similar) that returns the actual action
name including customRecommendedAction, ensuring the code paths around the
recommendedAction assignment in database_client.go and any similar spots (where
GetRecommendedAction().String() is used) are replaced to preserve custom
actions.
health-monitors/slurm-drain-monitor/pkg/publisher/publisher.go (1)

119-119: ⚠️ Potential issue | 🟠 Major

Custom remediation actions are still being dropped in unhealthy events.

Line 119 still funnels r.RecommendedAction through mapRecommendedAction, and unknown values are downgraded to CONTACT_SUPPORT (Line 196). That loses custom action names and conflicts with the feature goal.

💡 Proposed fix
@@
 	} else {
 		for _, r := range reasons {
-			events = append(events, &pb.HealthEvent{
+			event := &pb.HealthEvent{
 				Version:            1,
 				Agent:              agentName,
 				CheckName:          r.CheckName,
 				ComponentClass:     r.ComponentClass,
 				GeneratedTimestamp: timestamppb.New(time.Now()),
 				Message:            r.Message,
 				IsFatal:            r.IsFatal,
 				IsHealthy:          false,
 				NodeName:           nodeName,
-				Action:             model.BuiltinAction(mapRecommendedAction(r.RecommendedAction)),
 				ProcessingStrategy: p.processingStrategy,
 				EntitiesImpacted:   entitiesImpacted,
-			})
+			}
+			applyAction(event, r.RecommendedAction)
+			events = append(events, event)
 		}
 	}
@@
-func mapRecommendedAction(action string) pb.RecommendedAction {
+func applyAction(event *pb.HealthEvent, action string) {
 	if action == "" {
-		return pb.RecommendedAction_CONTACT_SUPPORT
+		event.Action = model.BuiltinAction(pb.RecommendedAction_CONTACT_SUPPORT)
+		return
 	}
 
 	if value, exists := pb.RecommendedAction_value[action]; exists {
-		return pb.RecommendedAction(value)
+		event.Action = model.BuiltinAction(pb.RecommendedAction(value))
+		return
 	}
 
-	return pb.RecommendedAction_CONTACT_SUPPORT
+	event.Action = model.CustomAction(action)
 }

Also applies to: 187-197

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@health-monitors/slurm-drain-monitor/pkg/publisher/publisher.go` at line 119,
The code currently always maps r.RecommendedAction through mapRecommendedAction
and falls back to CONTACT_SUPPORT, losing custom remediation names; change the
assignment for Action so that if mapRecommendedAction(r.RecommendedAction)
returns a known builtin value you use model.BuiltinAction(...) but if it does
not, preserve the original string as a custom action (e.g. set Action to
model.CustomAction(r.RecommendedAction) or whatever the existing CustomAction
constructor/type is) instead of coercing to CONTACT_SUPPORT; update both the
instance at Action:
model.BuiltinAction(mapRecommendedAction(r.RecommendedAction)) and the similar
logic in mapRecommendedAction’s fallback path (lines ~187-197) to
return/preserve custom actions.
platform-connectors/pkg/server/platform_connector_server.go (1)

17-27: ⚠️ Potential issue | 🟠 Major

Trim whitespace before accepting customRecommendedAction.

" " still passes this check and later behaves like an invalid action name. Reject trimmed-empty values here too.

Proposed fix
 import (
 	"context"
 	"log/slog"
+	"strings"
 
 	"github.com/golang/protobuf/ptypes/empty"
@@
-		if custom, ok := event.Action.(*pb.HealthEvent_CustomRecommendedAction); ok && custom.CustomRecommendedAction == "" {
+		if custom, ok := event.Action.(*pb.HealthEvent_CustomRecommendedAction); ok &&
+			strings.TrimSpace(custom.CustomRecommendedAction) == "" {
 			return nil, status.Errorf(codes.InvalidArgument,
 				"recommendedAction is CUSTOM but customRecommendedAction is empty (node=%s, agent=%s)",
 				event.NodeName, event.Agent)
 		}

Also applies to: 74-77

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@platform-connectors/pkg/server/platform_connector_server.go` around lines 17
- 27, The validation currently accepts values like "   " for
customRecommendedAction; fix by trimming whitespace before checking emptiness:
call strings.TrimSpace on the customRecommendedAction input where it's validated
(the same check around lines 74-77) and treat a trimmed-empty string as invalid,
returning an appropriate gRPC InvalidArgument error (using
status.Error(codes.InvalidArgument, ...)); ensure you import strings and update
all places that validate customRecommendedAction so whitespace-only values are
rejected.
🧹 Nitpick comments (1)
event-exporter/pkg/transformer/cloudevents_test.go (1)

39-131: Add a custom-action CloudEvent test case.

This table covers built-in actions, but it doesn’t assert the new custom-action branch. Please add a case that sets Action to model.CustomAction(...) and verifies customRecommendedAction is emitted correctly in healthEvent.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@event-exporter/pkg/transformer/cloudevents_test.go` around lines 39 - 131,
Add a new table test case in cloudevents_test.go that covers the custom-action
branch by creating a pb.HealthEvent with Action set via
model.CustomAction("MY_CUSTOM_ACTION"); in that case update the test's
validateFunc to assert that
ce.Data["healthEvent"].(map[string]any)["customRecommendedAction"] ==
"MY_CUSTOM_ACTION" (and that the built-in recommendedAction field is not used),
referencing the HealthEvent struct's Action field and the model.CustomAction
helper so the custom-action branch in the CloudEvent transformer is exercised.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@platform-connectors/pkg/transformers/overrides/transformer.go`:
- Around line 43-48: The code flattens the oneof by storing only
event.GetRecommendedAction(), which loses whether the oneof branch was custom
and causes incorrect no-op detection and logging; update captureOriginalState
and the originalState type to record the oneof branch (e.g., add a bool like
hasCustomRecommendedAction or an enum originalActionOneof) by checking the oneof
field (e.g., event.CustomRecommendedAction != nil or equivalent getter) in
captureOriginalState, and then change the override comparison logic that
currently uses GetRecommendedAction() (and the logging around that decision) to
consider both the enum value and the stored oneof branch so a custom action vs
explicit NONE is distinguished.

---

Duplicate comments:
In `@health-monitors/slurm-drain-monitor/pkg/publisher/publisher.go`:
- Line 119: The code currently always maps r.RecommendedAction through
mapRecommendedAction and falls back to CONTACT_SUPPORT, losing custom
remediation names; change the assignment for Action so that if
mapRecommendedAction(r.RecommendedAction) returns a known builtin value you use
model.BuiltinAction(...) but if it does not, preserve the original string as a
custom action (e.g. set Action to model.CustomAction(r.RecommendedAction) or
whatever the existing CustomAction constructor/type is) instead of coercing to
CONTACT_SUPPORT; update both the instance at Action:
model.BuiltinAction(mapRecommendedAction(r.RecommendedAction)) and the similar
logic in mapRecommendedAction’s fallback path (lines ~187-197) to
return/preserve custom actions.

In `@platform-connectors/pkg/server/platform_connector_server.go`:
- Around line 17-27: The validation currently accepts values like "   " for
customRecommendedAction; fix by trimming whitespace before checking emptiness:
call strings.TrimSpace on the customRecommendedAction input where it's validated
(the same check around lines 74-77) and treat a trimmed-empty string as invalid,
returning an appropriate gRPC InvalidArgument error (using
status.Error(codes.InvalidArgument, ...)); ensure you import strings and update
all places that validate customRecommendedAction so whitespace-only values are
rejected.

In `@store-client/pkg/datastore/providers/postgresql/database_client.go`:
- Line 213: The indexed field extraction uses GetRecommendedAction().String(),
which loses oneof-backed customRecommendedAction values (collapsing to NONE);
update the mapping for recommendedAction to use the model-level effective-action
helper (e.g., call GetEffectiveActionName or similar) that returns the actual
action name including customRecommendedAction, ensuring the code paths around
the recommendedAction assignment in database_client.go and any similar spots
(where GetRecommendedAction().String() is used) are replaced to preserve custom
actions.

---

Nitpick comments:
In `@event-exporter/pkg/transformer/cloudevents_test.go`:
- Around line 39-131: Add a new table test case in cloudevents_test.go that
covers the custom-action branch by creating a pb.HealthEvent with Action set via
model.CustomAction("MY_CUSTOM_ACTION"); in that case update the test's
validateFunc to assert that
ce.Data["healthEvent"].(map[string]any)["customRecommendedAction"] ==
"MY_CUSTOM_ACTION" (and that the built-in recommendedAction field is not used),
referencing the HealthEvent struct's Action field and the model.CustomAction
helper so the custom-action branch in the CloudEvent transformer is exercised.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 8a3d30b0-036d-4f21-85f5-1463d9922ba0

📥 Commits

Reviewing files that changed from the base of the PR and between 1130672 and c86f476.

⛔ Files ignored due to path filters (8)
  • api/gen/go/csp/v1alpha1/provider.pb.go is excluded by !**/*.pb.go, !**/gen/**
  • api/gen/go/csp/v1alpha1/provider_grpc.pb.go is excluded by !**/*.pb.go, !**/gen/**
  • api/gen/go/device/v1alpha1/gpu.pb.go is excluded by !**/*.pb.go, !**/gen/**
  • api/gen/go/device/v1alpha1/gpu_grpc.pb.go is excluded by !**/*.pb.go, !**/gen/**
  • data-models/pkg/protos/health_event.pb.go is excluded by !**/*.pb.go
  • data-models/pkg/protos/health_event_grpc.pb.go is excluded by !**/*.pb.go
  • preflight-checks/nccl-loopback/go.sum is excluded by !**/*.sum
  • tests/scale-tests/event-generator/go.sum is excluded by !**/*.sum
📒 Files selected for processing (55)
  • commons/pkg/tracing/span_attributes.go
  • data-models/pkg/model/health_event_extentions.go
  • data-models/protobufs/health_event.proto
  • distros/kubernetes/nvsentinel/charts/k8s-datastore/crds/health_event.crd.yaml
  • docs/configuration/fault-remediation.md
  • event-exporter/pkg/transformer/cloudevents.go
  • event-exporter/pkg/transformer/cloudevents_test.go
  • fault-quarantine/pkg/evaluator/rule_evaluator.go
  • fault-quarantine/pkg/evaluator/rule_evaluator_test.go
  • fault-remediation/pkg/common/equivalence_groups.go
  • fault-remediation/pkg/common/equivalence_groups_test.go
  • fault-remediation/pkg/config/config.go
  • fault-remediation/pkg/config/config_test.go
  • fault-remediation/pkg/reconciler/reconciler.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/remediation/remediation.go
  • fault-remediation/pkg/remediation/remediation_test.go
  • health-events-analyzer/pkg/publisher/publisher.go
  • health-events-analyzer/pkg/reconciler/reconciler_test.go
  • health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go
  • health-monitors/csp-health-monitor/pkg/triggerengine/trigger_test.go
  • health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py
  • health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi
  • health-monitors/kubernetes-object-monitor/pkg/publisher/publisher.go
  • health-monitors/slurm-drain-monitor/pkg/publisher/publisher.go
  • health-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler.go
  • health-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler_test.go
  • health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler.go
  • health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler_test.go
  • health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor.go
  • health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor_test.go
  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go
  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
  • node-drainer/pkg/evaluator/evaluator.go
  • node-drainer/pkg/evaluator/evaluator_integration_test.go
  • node-drainer/pkg/reconciler/reconciler_integration_test.go
  • platform-connectors/pkg/connectors/kubernetes/k8s_connector_envtest_test.go
  • platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
  • platform-connectors/pkg/connectors/kubernetes/process_node_events.go
  • platform-connectors/pkg/server/platform_connector_server.go
  • platform-connectors/pkg/transformers/overrides/cel.go
  • platform-connectors/pkg/transformers/overrides/transformer.go
  • platform-connectors/pkg/transformers/overrides/transformer_test.go
  • preflight-checks/dcgm-diag/dcgm_diag/protos/health_event_pb2.py
  • preflight-checks/dcgm-diag/dcgm_diag/protos/health_event_pb2.pyi
  • preflight-checks/nccl-allreduce/nccl_allreduce/protos/health_event_pb2.py
  • preflight-checks/nccl-allreduce/nccl_allreduce/protos/health_event_pb2.pyi
  • preflight-checks/nccl-loopback/go.mod
  • preflight-checks/nccl-loopback/pkg/health/reporter.go
  • store-client/pkg/datastore/providers/postgresql/database_client.go
  • store-client/pkg/datastore/providers/postgresql/health_events.go
  • tests/scale-tests/event-generator/go.mod
  • tests/scale-tests/event-generator/main.go
  • tilt/simple-health-client/main.go
✅ Files skipped from review due to trivial changes (10)
  • health-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler_test.go
  • tilt/simple-health-client/main.go
  • platform-connectors/pkg/connectors/kubernetes/process_node_events.go
  • health-events-analyzer/pkg/reconciler/reconciler_test.go
  • platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
  • fault-remediation/pkg/config/config_test.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • preflight-checks/nccl-loopback/go.mod
  • fault-remediation/pkg/reconciler/reconciler.go
  • preflight-checks/dcgm-diag/dcgm_diag/protos/health_event_pb2.pyi
🚧 Files skipped from review as they are similar to previous changes (29)
  • health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler_test.go
  • platform-connectors/pkg/transformers/overrides/cel.go
  • event-exporter/pkg/transformer/cloudevents.go
  • commons/pkg/tracing/span_attributes.go
  • health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler.go
  • store-client/pkg/datastore/providers/postgresql/health_events.go
  • fault-remediation/pkg/common/equivalence_groups.go
  • docs/configuration/fault-remediation.md
  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go
  • health-monitors/kubernetes-object-monitor/pkg/publisher/publisher.go
  • data-models/protobufs/health_event.proto
  • node-drainer/pkg/evaluator/evaluator.go
  • platform-connectors/pkg/transformers/overrides/transformer_test.go
  • preflight-checks/nccl-loopback/pkg/health/reporter.go
  • health-monitors/csp-health-monitor/pkg/triggerengine/trigger_test.go
  • distros/kubernetes/nvsentinel/charts/k8s-datastore/crds/health_event.crd.yaml
  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
  • fault-remediation/pkg/common/equivalence_groups_test.go
  • fault-remediation/pkg/remediation/remediation_test.go
  • health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go
  • platform-connectors/pkg/connectors/kubernetes/k8s_connector_envtest_test.go
  • health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi
  • preflight-checks/nccl-allreduce/nccl_allreduce/protos/health_event_pb2.pyi
  • preflight-checks/nccl-allreduce/nccl_allreduce/protos/health_event_pb2.py
  • health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py
  • data-models/pkg/model/health_event_extentions.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/config/config.go
  • preflight-checks/dcgm-diag/dcgm_diag/protos/health_event_pb2.py

Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
@XRFXLP XRFXLP force-pushed the xrfxlp/1072/implementation branch from 22f82a6 to 1a9b93d Compare April 13, 2026 04:41
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
platform-connectors/pkg/server/platform_connector_server.go (1)

74-78: ⚠️ Potential issue | 🟠 Major

Reject whitespace-only customRecommendedAction values.

Line 74 only rejects ""; values like " " still pass validation.

Proposed fix
 import (
 	"context"
 	"log/slog"
+	"strings"

 	"github.com/golang/protobuf/ptypes/empty"
@@
-		if event.RecommendedAction == pb.RecommendedAction_CUSTOM && event.CustomRecommendedAction == "" {
+		if event.RecommendedAction == pb.RecommendedAction_CUSTOM &&
+			strings.TrimSpace(event.CustomRecommendedAction) == "" {
 			return nil, status.Errorf(codes.InvalidArgument,
 				"recommendedAction is CUSTOM but customRecommendedAction is empty (node=%s, agent=%s)",
 				event.NodeName, event.Agent)
 		}
#!/bin/bash
# Verify whether CUSTOM validation still uses raw empty-string check (no TrimSpace).
rg -n -C2 'RecommendedAction_CUSTOM' platform-connectors/pkg/server/platform_connector_server.go
rg -n -C2 'CustomRecommendedAction == ""' platform-connectors/pkg/server/platform_connector_server.go

Expected result: the second query matches Line 74 in current code, confirming whitespace-only values are not rejected.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@platform-connectors/pkg/server/platform_connector_server.go` around lines 74
- 78, The validation currently checks event.CustomRecommendedAction == "" when
event.RecommendedAction == pb.RecommendedAction_CUSTOM, which allows
whitespace-only strings; update the check to reject strings that are empty after
trimming whitespace (use strings.TrimSpace on event.CustomRecommendedAction) and
return the same InvalidArgument error if the trimmed result is empty; adjust
imports to include the strings package if needed and keep the error message and
referenced symbols (event.RecommendedAction, pb.RecommendedAction_CUSTOM,
event.CustomRecommendedAction) unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@platform-connectors/pkg/server/platform_connector_server.go`:
- Around line 74-78: The validation currently checks
event.CustomRecommendedAction == "" when event.RecommendedAction ==
pb.RecommendedAction_CUSTOM, which allows whitespace-only strings; update the
check to reject strings that are empty after trimming whitespace (use
strings.TrimSpace on event.CustomRecommendedAction) and return the same
InvalidArgument error if the trimmed result is empty; adjust imports to include
the strings package if needed and keep the error message and referenced symbols
(event.RecommendedAction, pb.RecommendedAction_CUSTOM,
event.CustomRecommendedAction) unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 86733330-a8a4-4374-aa1b-0b099e2e6f5b

📥 Commits

Reviewing files that changed from the base of the PR and between 22f82a6 and 1a9b93d.

⛔ Files ignored due to path filters (1)
  • data-models/pkg/protos/health_event.pb.go is excluded by !**/*.pb.go
📒 Files selected for processing (15)
  • data-models/pkg/model/health_event_extentions.go
  • data-models/protobufs/health_event.proto
  • distros/kubernetes/nvsentinel/charts/k8s-datastore/crds/health_event.crd.yaml
  • event-exporter/pkg/transformer/cloudevents.go
  • fault-remediation/pkg/common/equivalence_groups.go
  • fault-remediation/pkg/config/config.go
  • fault-remediation/pkg/config/config_test.go
  • fault-remediation/pkg/remediation/remediation.go
  • health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py
  • health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi
  • platform-connectors/pkg/server/platform_connector_server.go
  • preflight-checks/dcgm-diag/dcgm_diag/protos/health_event_pb2.py
  • preflight-checks/dcgm-diag/dcgm_diag/protos/health_event_pb2.pyi
  • preflight-checks/nccl-allreduce/nccl_allreduce/protos/health_event_pb2.py
  • preflight-checks/nccl-allreduce/nccl_allreduce/protos/health_event_pb2.pyi
✅ Files skipped from review due to trivial changes (1)
  • health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py
🚧 Files skipped from review as they are similar to previous changes (11)
  • event-exporter/pkg/transformer/cloudevents.go
  • fault-remediation/pkg/remediation/remediation.go
  • fault-remediation/pkg/common/equivalence_groups.go
  • distros/kubernetes/nvsentinel/charts/k8s-datastore/crds/health_event.crd.yaml
  • data-models/protobufs/health_event.proto
  • fault-remediation/pkg/config/config_test.go
  • preflight-checks/nccl-allreduce/nccl_allreduce/protos/health_event_pb2.py
  • preflight-checks/nccl-allreduce/nccl_allreduce/protos/health_event_pb2.pyi
  • preflight-checks/dcgm-diag/dcgm_diag/protos/health_event_pb2.pyi
  • preflight-checks/dcgm-diag/dcgm_diag/protos/health_event_pb2.py
  • data-models/pkg/model/health_event_extentions.go

Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
@github-actions
Copy link
Copy Markdown

Merging this branch will increase overall coverage

Impacted Packages Coverage Δ 🤖
github.com/nvidia/nvsentinel/fault-quarantine/pkg/evaluator 42.70% (+0.45%) 👍

Coverage by file

Changed unit test files

  • github.com/nvidia/nvsentinel/fault-quarantine/pkg/evaluator/rule_evaluator_test.go

@XRFXLP XRFXLP self-assigned this Apr 13, 2026
XRFXLP added 2 commits April 13, 2026 11:40
Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
fault-remediation/pkg/reconciler/reconciler_test.go (1)

648-675: Exercise the real custom-action lookup path in shouldSkip tests.

Line 657 injects groupConfig directly, so this can pass even if CUSTOM action resolution/regression happens in config lookup. Consider deriving groupConfig via common.GetGroupConfigForEvent(...) from a remediation-actions map keyed by "REPLACE_DISK".

♻️ Suggested test adjustment
  t.Run("Process custom action with group config", func(t *testing.T) {
      healthEvent := &protos.HealthEvent{
          NodeName:                "test-node-custom",
          RecommendedAction:       protos.RecommendedAction_CUSTOM,
          CustomRecommendedAction: "REPLACE_DISK",
      }
      healthEventWithStatus := model.HealthEventWithStatus{
          HealthEvent: healthEvent,
      }
-     groupConfig := getGroupConfig("disk-replace", nil)
+     remediationActions := map[string]config.MaintenanceResource{
+         "REPLACE_DISK": {EquivalenceGroup: "disk-replace"},
+     }
+     groupConfig, err := common.GetGroupConfigForEvent(remediationActions, healthEvent)
+     assert.NoError(t, err)
+     assert.NotNil(t, groupConfig)

-     result := r.shouldSkipEvent(t.Context(), healthEventWithStatus, groupConfig)
+     result := r.shouldSkipEvent(t.Context(), healthEventWithStatus, groupConfig)
      assert.False(t, result, "Custom actions with a matching group config should not be skipped")
  })
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@fault-remediation/pkg/reconciler/reconciler_test.go` around lines 648 - 675,
The test is bypassing the CUSTOM action lookup by passing groupConfig directly
to shouldSkipEvent; change the test to build a remediation-actions map with an
entry keyed by "REPLACE_DISK" and call common.GetGroupConfigForEvent(...) to
derive the groupConfig used by shouldSkipEvent so the real custom-action
resolution path is exercised; specifically, replace the direct use of
getGroupConfig("disk-replace", ...) with creating the map (e.g.,
remediationActions := map[string]RemediationAction{ "REPLACE_DISK": ... }) and
call common.GetGroupConfigForEvent(event.HealthEvent, remediationActions) before
invoking r.shouldSkipEvent(...) so the CUSTOM lookup logic in shouldSkipEvent is
actually tested.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@fault-remediation/pkg/reconciler/reconciler_test.go`:
- Around line 648-675: The test is bypassing the CUSTOM action lookup by passing
groupConfig directly to shouldSkipEvent; change the test to build a
remediation-actions map with an entry keyed by "REPLACE_DISK" and call
common.GetGroupConfigForEvent(...) to derive the groupConfig used by
shouldSkipEvent so the real custom-action resolution path is exercised;
specifically, replace the direct use of getGroupConfig("disk-replace", ...) with
creating the map (e.g., remediationActions := map[string]RemediationAction{
"REPLACE_DISK": ... }) and call common.GetGroupConfigForEvent(event.HealthEvent,
remediationActions) before invoking r.shouldSkipEvent(...) so the CUSTOM lookup
logic in shouldSkipEvent is actually tested.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 3c16239d-2a35-42e1-a041-d7d527f1d2bf

📥 Commits

Reviewing files that changed from the base of the PR and between b83a86b and 0afc695.

📒 Files selected for processing (2)
  • fault-remediation/pkg/reconciler/reconciler.go
  • fault-remediation/pkg/reconciler/reconciler_test.go

@github-actions
Copy link
Copy Markdown

Merging this branch will increase overall coverage

Impacted Packages Coverage Δ 🤖
github.com/nvidia/nvsentinel/fault-remediation/pkg/reconciler 21.53% (+0.06%) 👍

Coverage by file

Changed files (no unit tests)

Changed File Coverage Δ Total Covered Missed 🤖
github.com/nvidia/nvsentinel/fault-remediation/pkg/reconciler/reconciler.go 21.53% (+0.06%) 1347 (+1) 290 (+1) 1057 👍

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files

  • github.com/nvidia/nvsentinel/fault-remediation/pkg/reconciler/reconciler_test.go

Comment on lines +69 to +71
if event.CustomRecommendedAction != "" {
healthEventData["customRecommendedAction"] = event.CustomRecommendedAction
}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we add a test for this?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added

@github-actions
Copy link
Copy Markdown

Merging this branch will increase overall coverage

Impacted Packages Coverage Δ 🤖
github.com/nvidia/nvsentinel/fault-quarantine/pkg/eventwatcher 2.96% (ø)
github.com/nvidia/nvsentinel/fault-remediation 0.00% (ø)
github.com/nvidia/nvsentinel/fault-remediation/pkg/reconciler 22.78% (+0.06%) 👍
github.com/nvidia/nvsentinel/node-drainer 0.00% (ø)
github.com/nvidia/nvsentinel/node-drainer/pkg/evaluator 55.47% (ø)
github.com/nvidia/nvsentinel/node-drainer/pkg/queue 46.49% (ø)
github.com/nvidia/nvsentinel/node-drainer/pkg/reconciler 37.79% (ø)
github.com/nvidia/nvsentinel/store-client/pkg/datastore 3.07% (ø)
github.com/nvidia/nvsentinel/store-client/pkg/datastore/providers/mongodb 6.32% (ø)
github.com/nvidia/nvsentinel/store-client/pkg/datastore/providers/postgresql 4.21% (ø)
github.com/nvidia/nvsentinel/tests 0.00% (ø)

Coverage by file

Changed files (no unit tests)

Changed File Coverage Δ Total Covered Missed 🤖
github.com/nvidia/nvsentinel/fault-quarantine/pkg/eventwatcher/event_watcher.go 2.96% (ø) 1282 38 1244
github.com/nvidia/nvsentinel/fault-remediation/main.go 0.00% (ø) 639 0 639
github.com/nvidia/nvsentinel/fault-remediation/pkg/reconciler/reconciler.go 22.78% (+0.06%) 1396 (+1) 318 (+1) 1078 👍
github.com/nvidia/nvsentinel/node-drainer/main.go 0.00% (ø) 468 0 468
github.com/nvidia/nvsentinel/store-client/pkg/datastore/interfaces.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/store-client/pkg/datastore/providers/mongodb/health_store.go 8.88% (ø) 1870 166 1704
github.com/nvidia/nvsentinel/store-client/pkg/datastore/providers/postgresql/health_events.go 0.03% (ø) 3659 1 3658

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files

  • github.com/nvidia/nvsentinel/fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • github.com/nvidia/nvsentinel/node-drainer/pkg/evaluator/evaluator_integration_test.go
  • github.com/nvidia/nvsentinel/node-drainer/pkg/queue/queue_test.go
  • github.com/nvidia/nvsentinel/node-drainer/pkg/reconciler/reconciler_integration_test.go
  • github.com/nvidia/nvsentinel/store-client/pkg/datastore/behavioral_contract_test.go
  • github.com/nvidia/nvsentinel/store-client/pkg/datastore/interface_compliance_test.go
  • github.com/nvidia/nvsentinel/store-client/pkg/datastore/providers/mongodb/health_store_test.go
  • github.com/nvidia/nvsentinel/tests/cold_start_test.go

@KaivalyaMDabhadkar
Copy link
Copy Markdown
Contributor

Can we add a e2e test in reconciler_e2e_test.go for FR as well? to test the e2e flow with the custom action?

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@fault-remediation/pkg/reconciler/reconciler_e2e_test.go`:
- Around line 1327-1403: The test CustomAction_CreatesCorrectCR currently calls
performRemediation directly and should instead exercise the ingestion/reconcile
path: replace the direct performRemediation invocation with sending a
datastore-style event created by createCustomActionQuarantineEvent into
mockWatcher.EventsChan (same pattern as CompleteFlow_WithEventLoop), then
wait/assert eventually for the CR to exist (testDynamic.Resource(gvr).Get) and
the annotation/RemediationState via
reconcilerInstance.annotationManager.GetRemediationState; keep the same
assertions about CR spec/labels and equivalence group but use eventual
assertions/timeouts to allow the reconciler loop to process the event.

In `@fault-remediation/pkg/reconciler/templates/custom-action-template.yaml`:
- Around line 15-16: The template hardcodes the resource kind as "RebootNode"
and ignores the configured MaintenanceResource.Kind; update the
custom-action-template.yaml to render the kind using the template variable (use
{{.Kind}}) so the resource kind comes from the configuration alongside the
existing {{.ApiGroup}} and {{.Version}}; ensure any references that expect the
hardcoded name (if any) are updated to use the templated kind.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: d9b807e0-13c6-427d-8b87-f81f2c707f38

📥 Commits

Reviewing files that changed from the base of the PR and between 0afc695 and cac875d.

📒 Files selected for processing (4)
  • event-exporter/pkg/transformer/cloudevents_test.go
  • fault-remediation/pkg/reconciler/reconciler.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/reconciler/templates/custom-action-template.yaml
🚧 Files skipped from review as they are similar to previous changes (1)
  • fault-remediation/pkg/reconciler/reconciler.go

XRFXLP added 3 commits April 13, 2026 14:28
Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
@github-actions
Copy link
Copy Markdown

Merging this branch will increase overall coverage

Impacted Packages Coverage Δ 🤖
github.com/nvidia/nvsentinel/fault-remediation/pkg/reconciler 22.78% (+0.06%) 👍

Coverage by file

Changed unit test files

  • github.com/nvidia/nvsentinel/fault-remediation/pkg/reconciler/reconciler_e2e_test.go

@XRFXLP
Copy link
Copy Markdown
Member Author

XRFXLP commented Apr 13, 2026

an we add a e2e test in reconciler_e2e_test.go for FR as well? to test the e2e flow with the custom action?

added

@github-actions
Copy link
Copy Markdown

Merging this branch will increase overall coverage

Impacted Packages Coverage Δ 🤖
github.com/nvidia/nvsentinel/commons/pkg/statemanager 15.84% (ø)
github.com/nvidia/nvsentinel/store-client/pkg/client 5.26% (+0.01%) 👍

Coverage by file

Changed files (no unit tests)

Changed File Coverage Δ Total Covered Missed 🤖
github.com/nvidia/nvsentinel/commons/pkg/statemanager/statemanager.go 16.13% (ø) 595 96 499
github.com/nvidia/nvsentinel/store-client/pkg/client/mongodb_client.go 2.82% (ø) 3794 107 3687

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files

  • github.com/nvidia/nvsentinel/commons/pkg/statemanager/statemanager_test.go

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants