NETOBSERV-2674: report Degraded when eBPF DaemonSet has no pods scheduled #2704

memodi wants to merge 1 commit into netobserv:main
Conversation
When nodeSelector on the eBPF agent matches no nodes, `DesiredNumberScheduled` is 0. The previous check `NumberReady < DesiredNumberScheduled` (0 < 0 = false) incorrectly reported the component as Ready.

Add an explicit check for `DesiredNumberScheduled == 0`, reporting the component as Degraded with reason `NoPodsScheduled` and a message suggesting the user check nodeSelector or node availability. The same root cause also prevented `CheckDaemonSetHealth` from inspecting pod health in this scenario, since it relied on the same condition.

Using Degraded (rather than Failure/InProgress) is appropriate because the DaemonSet is valid; the issue is a scheduling or configuration mismatch.

Fixes: NETOBSERV-2674

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
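The guard can be sketched like this. This is a minimal, standalone sketch with simplified types: the operator's real `CheckDaemonSetProgress` uses its own component receiver and the `setDegraded`/`setInProgress`/`setReady` helpers mentioned in the review below, not these stand-in types.

```go
// Standalone sketch of the zero-scheduled guard; types are simplified
// stand-ins, not the operator's actual status-manager API.
package status

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
)

type ComponentState string

const (
	StateReady      ComponentState = "Ready"
	StateInProgress ComponentState = "InProgress"
	StateDegraded   ComponentState = "Degraded"
)

type ComponentStatus struct {
	State   ComponentState
	Reason  string
	Message string
}

func checkDaemonSetProgress(ds *appsv1.DaemonSet) ComponentStatus {
	desired := ds.Status.DesiredNumberScheduled
	ready := ds.Status.NumberReady
	switch {
	case desired == 0:
		// Old behavior: ready < desired is 0 < 0 == false, so this case
		// fell through to Ready. The fix reports Degraded instead.
		return ComponentStatus{
			State:   StateDegraded,
			Reason:  "NoPodsScheduled",
			Message: fmt.Sprintf("DaemonSet %s has no pods scheduled; check nodeSelector or node availability", ds.Name),
		}
	case ready < desired:
		return ComponentStatus{
			State:   StateInProgress,
			Reason:  "PodsNotReady",
			Message: fmt.Sprintf("%d of %d pods ready", ready, desired),
		}
	default:
		return ComponentStatus{State: StateReady}
	}
}
```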
@memodi: This pull request references NETOBSERV-2674, which is a valid Jira issue. Warning: the referenced Jira issue has an invalid target version for the target branch this PR targets: expected the bug to target either version "5.0." or "openshift-5.0.", but it targets "netobserv-1.12" instead.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Skipping CI for Draft Pull Request.
📝 Walkthrough

Three files are modified: test data formatting alignment, DaemonSet status check logic to treat zero desired pods as Degraded instead of In Progress, and two new test cases validating the degraded behavior.

Changes:
- DaemonSet zero-pods status handling
- FLP pipeline test formatting

Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
🚥 Pre-merge checks: ❌ 1 failed (warning), ✅ 4 passed
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: … The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

@@           Coverage Diff            @@
## main #2704 +/- ##
==========================================
+ Coverage 72.32% 72.34% +0.02%
==========================================
Files 107 107
Lines 11482 11484 +2
==========================================
+ Hits 8304 8308 +4
+ Misses 2675 2674 -1
+ Partials 503 502 -1
/ok-to-test
New images:

quay.io/netobserv/network-observability-operator:eba545b
quay.io/netobserv/network-observability-operator-bundle:v0.0.0-sha-eba545b
quay.io/netobserv/network-observability-operator-catalog:v0.0.0-sha-eba545b

They will expire in two weeks. To deploy this build:

# Direct deployment, from operator repo
IMAGE=quay.io/netobserv/network-observability-operator:eba545b make deploy

# Or using operator-sdk
operator-sdk run bundle quay.io/netobserv/network-observability-operator-bundle:v0.0.0-sha-eba545b

Or as a Catalog Source:

apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: netobserv-dev
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/netobserv/network-observability-operator-catalog:v0.0.0-sha-eba545b
  displayName: NetObserv development catalog
  publisher: Me
  updateStrategy:
    registryPoll:
      interval: 1m
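Assuming the manifest above is saved locally, applying it and checking that the catalog pod comes up could look like this; the `olm.catalogSource` pod label is the one OLM normally sets on catalog-source pods, and is an assumption here:

```sh
oc apply -f catalog-source.yaml

# The catalog should then appear in OperatorHub as "NetObserv development
# catalog"; this checks that its registry pod is running:
oc get pods -n openshift-marketplace -l olm.catalogSource=netobserv-dev
```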
Verified this:

$ oc get flowcollector
NAME      AGENT      PROCESSOR   PLUGIN   SAMPLING   DEPLOYMENT MODEL   STATUS           WARNINGS
cluster   Degraded   Ready       Ready    1          Service            Ready,Degraded   InstallDemoLoki option is enabled. This is useful for development and demo purposes but should not be used in production.

status:
components:
agent:
desiredReplicas: 0
message: DaemonSet netobserv-ebpf-agent has no pods scheduled; check nodeSelector
or node availability
readyReplicas: 0
reason: NoPodsScheduled
state: Degraded
plugin:
desiredReplicas: 1
readyReplicas: 1
state: Ready
processor:
desiredReplicas: 3
readyReplicas: 3
state: Ready
conditions:
- lastTransitionTime: "2026-05-04T19:58:02Z"
message: 8 ready components, 0 with failure, 0 pending, 1 degraded
reason: Ready,Degraded
status: "True"
type: Ready
- lastTransitionTime: "2026-05-04T19:58:02Z"
message: ""
reason: Ready
status: "False"
type: WaitingDemoLoki
- lastTransitionTime: "2026-05-04T19:58:02Z"
message: DaemonSet netobserv-ebpf-agent has no pods scheduled; check nodeSelector
or node availability
reason: NoPodsScheduled
status: "True"
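For reference, one way to reproduce this degraded state is to patch in a selector that no node satisfies. The `spec.agent.ebpf.advanced.scheduling.nodeSelector` path below is an assumption and may differ across FlowCollector API versions:

```sh
# With this selector, DesiredNumberScheduled on the eBPF DaemonSet drops to 0
# and the agent component should flip to Degraded/NoPodsScheduled.
oc patch flowcollector cluster --type=merge -p '
spec:
  agent:
    ebpf:
      advanced:
        scheduling:
          nodeSelector:
            kubernetes.io/hostname: no-such-node
'
```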
🧹 Nitpick comments (2)
internal/pkg/manager/status/status_manager.go (1)
542-557: 💤 Low value. LGTM: core change is correct.

The `if/else if/else` chain is clean. `defer setDaemonSetReplicas(ds)` still fires and correctly writes `DesiredReplicas=0`/`ReadyReplicas=0` after `setDegraded`. The previous replicas preserved by `preserveReplicas` are properly overwritten by the defer.

Minor: "nodeSelector or node availability" is the most common cause, but taints/tolerations are another frequent scheduling barrier. Worth considering extending the message:

💬 Optional message expansion

- i.s.setDegraded(i.cpnt, "NoPodsScheduled", fmt.Sprintf("DaemonSet %s has no pods scheduled; check nodeSelector or node availability", ds.Name))
+ i.s.setDegraded(i.cpnt, "NoPodsScheduled", fmt.Sprintf("DaemonSet %s has no pods scheduled; check nodeSelector, taints/tolerations, or node availability", ds.Name))

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `internal/pkg/manager/status/status_manager.go` around lines 542-557, update the degraded message in CheckDaemonSetProgress so it mentions taints/tolerations as a common scheduling blocker; specifically, change the string passed to i.s.setDegraded (inside CheckDaemonSetProgress) from "DaemonSet %s has no pods scheduled; check nodeSelector or node availability" to include "taints/tolerations" (e.g., "check nodeSelector, taints/tolerations, or node availability") so operators get a more complete hint; leave the rest of the function (including defer i.setDaemonSetReplicas and other setInProgress/setReady calls) unchanged.

internal/pkg/manager/status/status_manager_test.go (1)
291-308: 💤 Low value. `nil` client is safe here, but worth an inline comment.

`CheckDaemonSetHealth` only calls `CheckPodHealth(ctx, c, ...)` when `NumberReady < DesiredNumberScheduled || NumberUnavailable > 0`. With `DesiredNumberScheduled: 0` both conditions are false (`0 < 0`, and `NumberUnavailable` defaults to `0`), so the nil client is never dereferenced. A one-line comment would make the intent explicit and prevent a future reader from wondering if this is a mistake.

💬 Suggested inline comment

  agent.CheckDaemonSetHealth(context.Background(), nil, &appsv1.DaemonSet{
+   // nil client is safe: DesiredNumberScheduled==0 means CheckPodHealth is never invoked
    ObjectMeta: metav1.ObjectMeta{Name: "netobserv-ebpf-agent"},

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `internal/pkg/manager/status/status_manager_test.go` around lines 291-308, the test TestDaemonSetZeroDesiredScheduledHealth passes a nil client to agent.CheckDaemonSetHealth, which is safe because CheckDaemonSetHealth only calls CheckPodHealth(ctx, c, ...) when NumberReady < DesiredNumberScheduled or NumberUnavailable > 0; with DesiredNumberScheduled set to 0 neither condition is true, so the client is never dereferenced. Add a one-line inline comment inside TestDaemonSetZeroDesiredScheduledHealth explaining this safety and reference CheckDaemonSetHealth and CheckPodHealth so future readers know the nil client is intentional.
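Putting the suggestion together, the test with the comment could look roughly like this. This is only a sketch: `newTestAgentComponent` is a hypothetical stand-in for however status_manager_test.go actually builds the agent component.

```go
package status_test

import (
	"context"
	"testing"

	appsv1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func TestDaemonSetZeroDesiredScheduledHealth(t *testing.T) {
	agent := newTestAgentComponent(t) // hypothetical test fixture helper

	// nil client is safe: with DesiredNumberScheduled == 0, neither
	// NumberReady < DesiredNumberScheduled (0 < 0) nor NumberUnavailable > 0
	// holds, so CheckDaemonSetHealth never reaches CheckPodHealth and the
	// client is never dereferenced.
	agent.CheckDaemonSetHealth(context.Background(), nil, &appsv1.DaemonSet{
		ObjectMeta: metav1.ObjectMeta{Name: "netobserv-ebpf-agent"},
		Status:     appsv1.DaemonSetStatus{DesiredNumberScheduled: 0},
	})
}
```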
📒 Files selected for processing (3)

- internal/controller/flp/flp_pipeline_builder_test.go
- internal/pkg/manager/status/status_manager.go
- internal/pkg/manager/status/status_manager_test.go
@memodi Should we extend this to other netobserv components like FLP and consolePlugin as well?
For FLP and the plugin it's a bit different, since they're Deployments and not a DaemonSet; from what I see: netobserv-operator/internal/pkg/manager/status/status_manager.go, lines 522 to 525 in 029f89d
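For context, the Deployment path compares ready replicas against `spec.replicas`, so the equivalent situation looks different there. A rough sketch, reusing the simplified `ComponentStatus` types from the earlier sketch rather than the operator's actual signature:

```go
// For Deployments, desired replicas come from spec.replicas (defaulting
// to 1 when unset), and replicas: 0 is usually an intentional
// scale-to-zero rather than a scheduling mismatch, so a blanket Degraded
// report would be misleading here.
func checkDeploymentProgress(d *appsv1.Deployment) ComponentStatus {
	desired := int32(1)
	if d.Spec.Replicas != nil {
		desired = *d.Spec.Replicas
	}
	if d.Status.ReadyReplicas < desired {
		return ComponentStatus{
			State:   StateInProgress,
			Reason:  "PodsNotReady",
			Message: fmt.Sprintf("%d of %d pods ready", d.Status.ReadyReplicas, desired),
		}
	}
	return ComponentStatus{State: StateReady}
}
```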
/label qe-approved
Description
FlowCollector was incorrectly showing as `Ready` even when no eBPF pods were deployed due to a non-matching `nodeSelector`. The root cause was in `CheckDaemonSetProgress()`: when `DesiredNumberScheduled == 0`, the check `NumberReady < DesiredNumberScheduled` (0 < 0) is false, causing the component to be reported as Ready.

This fix adds an explicit guard: if `DesiredNumberScheduled == 0`, the eBPF agent component is reported as `Degraded` with reason `NoPodsScheduled` and a message pointing users toward checking their `nodeSelector` or node availability.

The same fix also covers the case where eBPF pods are in CrashLoopBackOff and `DesiredNumberScheduled` happens to be 0, as `CheckDaemonSetHealth` delegates to `CheckDaemonSetProgress`.

Changed files:

- `internal/pkg/manager/status/status_manager.go`: add zero-scheduled guard in `CheckDaemonSetProgress`
- `internal/pkg/manager/status/status_manager_test.go`: add tests for zero-scheduled DaemonSet scenario

Dependencies
n/a
Checklist
Manual verification: set a non-matching `nodeSelector` for eBPF pods and observe the `Status.Components.Agent` field on the FlowCollector CR (a couple of example commands follow below).

🤖 Generated with Claude Code via /jira:solve [NETOBSERV-2674](https://redhat.atlassian.net/browse/NETOBSERV-2674)
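Example commands for that verification step; the `netobserv-privileged` namespace for the agent DaemonSet is the usual default and is an assumption here:

```sh
# Agent component status as reported on the FlowCollector CR
oc get flowcollector cluster -o jsonpath='{.status.components.agent}{"\n"}'

# Scheduling counters on the eBPF agent DaemonSet itself
oc get daemonset netobserv-ebpf-agent -n netobserv-privileged \
  -o jsonpath='{.status.desiredNumberScheduled} desired, {.status.numberReady} ready{"\n"}'
```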