
NETOBSERV-2674: report Degraded when eBPF DaemonSet has no pods scheduled #2704

Open

memodi wants to merge 1 commit into netobserv:main from memodi:fix-NETOBSERV-2674

Conversation

@memodi
Member

@memodi memodi commented May 4, 2026

Description

FlowCollector was incorrectly showing as Ready even when no eBPF pods were deployed due to a non-matching nodeSelector. The root cause was in CheckDaemonSetProgress(): when DesiredNumberScheduled == 0, the check NumberReady < DesiredNumberScheduled (0 < 0) is false, causing the component to be reported as Ready.

This fix adds an explicit guard: if DesiredNumberScheduled == 0, the eBPF agent component is reported as Degraded with reason NoPodsScheduled and a message pointing users toward checking their nodeSelector or node availability.

The same guard also takes effect through CheckDaemonSetHealth, the path that inspects pod health (for example, CrashLoopBackOff): it relied on the same condition, so it previously never flagged the zero-scheduled case either.
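
For illustration, here is a minimal standalone sketch of the decision the guard introduces, not the merged method itself. The Degraded reason and message match this change; the function name, the "DaemonSetNotReady" reason on the in-progress branch, and the return shape are assumptions made only to keep the example self-contained.

package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
)

// daemonSetState mirrors the updated CheckDaemonSetProgress decision (illustrative only).
func daemonSetState(ds *appsv1.DaemonSet) (state, reason, message string) {
	switch {
	case ds.Status.DesiredNumberScheduled == 0:
		// New guard: previously 0 < 0 was false, so this case fell through to Ready.
		return "Degraded", "NoPodsScheduled",
			fmt.Sprintf("DaemonSet %s has no pods scheduled; check nodeSelector or node availability", ds.Name)
	case ds.Status.NumberReady < ds.Status.DesiredNumberScheduled:
		// Reason string here is illustrative, not necessarily the operator's actual constant.
		return "InProgress", "DaemonSetNotReady",
			fmt.Sprintf("DaemonSet %s not ready: %d/%d", ds.Name, ds.Status.NumberReady, ds.Status.DesiredNumberScheduled)
	default:
		return "Ready", "", ""
	}
}

func main() {
	ds := &appsv1.DaemonSet{}
	ds.Name = "netobserv-ebpf-agent"
	ds.Status.DesiredNumberScheduled = 0
	state, reason, msg := daemonSetState(ds)
	fmt.Println(state, reason, msg) // Degraded NoPodsScheduled DaemonSet netobserv-ebpf-agent has no pods scheduled; ...
}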

Changed files:

  • internal/pkg/manager/status/status_manager.go — add zero-scheduled guard in CheckDaemonSetProgress
  • internal/pkg/manager/status/status_manager_test.go — add tests for zero-scheduled DaemonSet scenario

Dependencies

n/a

Checklist

  • Do the changes in this PR need specific configuration or environment setup for testing?
    • To reproduce: patch FlowCollector with a non-matching nodeSelector for eBPF pods and observe the Status.Components.Agent field on the FlowCollector CR (a hypothetical oc patch sketch follows this checklist).
  • I have added thorough unit tests for the change.
  • QE requirements (check 1 from the list):
    • Standard QE validation, with pre-merge tests unless stated otherwise.
    • Regression tests only (e.g. refactoring with no user-facing change).
    • No QE (e.g. trivial change with high reviewer's confidence, or per agreement with the QE team).
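
A reproduction sketch for the checklist item above: the FlowCollector field path (spec.agent.ebpf.advanced.scheduling.nodeSelector) and the label key are assumptions used for illustration; any selector that matches no node works. The resource name cluster matches the verification output later in this PR.

# Hypothetical repro: point the eBPF agent at a label no node carries.
oc patch flowcollector cluster --type=merge -p '
spec:
  agent:
    ebpf:
      advanced:
        scheduling:
          nodeSelector:
            does-not-exist: "true"'

# Then inspect the agent component status on the CR.
oc get flowcollector cluster -o jsonpath='{.status.components.agent}'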

🤖 Generated with Claude Code via /jira:solve [NETOBSERV-2674](https://redhat.atlassian.net/browse/NETOBSERV-2674)

Summary by CodeRabbit

  • Bug Fixes
    • DaemonSet status reporting now correctly identifies components with zero scheduled pods as degraded instead of in-progress, improving status accuracy and visibility.

When nodeSelector on the eBPF agent matches no nodes,
DesiredNumberScheduled is 0. The previous check `NumberReady <
DesiredNumberScheduled` (0 < 0 = false) incorrectly reported the
component as Ready.

Add an explicit check for DesiredNumberScheduled == 0, reporting the
component as Degraded with reason NoPodsScheduled and a message
suggesting the user check nodeSelector or node availability.

The same root cause also prevented CheckDaemonSetHealth from inspecting
pod health in this scenario, since it relied on the same condition.
Using Degraded (rather than Failure/InProgress) is appropriate because
the DaemonSet is valid — the issue is a scheduling or configuration
mismatch.

Fixes: NETOBSERV-2674

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@openshift-ci-robot
Collaborator

openshift-ci-robot commented May 4, 2026

@memodi: This pull request references NETOBSERV-2674 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target either version "5.0." or "openshift-5.0.", but it targets "netobserv-1.12" instead.

Details

In response to this:

(The PR description is quoted here verbatim; see above.)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci

openshift-ci Bot commented May 4, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@coderabbitai

coderabbitai Bot commented May 4, 2026

📝 Walkthrough

Three files are modified: test data formatting alignment, DaemonSet status check logic to treat zero desired pods as Degraded instead of In Progress, and two new test cases validating the degraded behavior.

Changes

DaemonSet Zero-Pods Status Handling

  • Core Logic — internal/pkg/manager/status/status_manager.go: CheckDaemonSetProgress adds a branch treating DesiredNumberScheduled == 0 as StatusDegraded with reason NoPodsScheduled; the method comment is updated to document the Degraded outcome.
  • Test Coverage — internal/pkg/manager/status/status_manager_test.go: two new test functions assert that zero-desired-scheduled DaemonSets produce StatusDegraded with the NoPodsScheduled reason via both CheckDaemonSetProgress and CheckDaemonSetHealth.

FLP Pipeline Test Formatting

  • Test Data Alignment — internal/controller/flp/flp_pipeline_builder_test.go: comment spacing adjusted in the AdditionalIncludeList slice literal; test logic and assertions unchanged.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)
  • Title check — ✅ Passed: the title clearly and specifically describes the main change (reporting the DaemonSet as Degraded when no pods are scheduled), with the ticket reference for context.
  • Description check — ✅ Passed: the description covers the issue, root cause, solution, changed files, dependencies, testing approach, and QE requirements. All required sections are present and well populated.
  • Linked Issues check — ✅ Passed: check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check — ✅ Passed: check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci

openshift-ci Bot commented May 4, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign mffiedler for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@codecov

codecov Bot commented May 4, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 72.34%. Comparing base (029f89d) to head (15945db).
⚠️ Report is 9 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2704      +/-   ##
==========================================
+ Coverage   72.32%   72.34%   +0.02%     
==========================================
  Files         107      107              
  Lines       11482    11484       +2     
==========================================
+ Hits         8304     8308       +4     
+ Misses       2675     2674       -1     
+ Partials      503      502       -1     
Flag Coverage Δ
unittests 72.34% <100.00%> (+0.02%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
internal/pkg/manager/status/status_manager.go 93.91% <100.00%> (+0.03%) ⬆️

... and 1 file with indirect coverage changes


@memodi
Member Author

memodi commented May 4, 2026

/ok-to-test

@openshift-ci openshift-ci Bot added the ok-to-test label ("To set manually when a PR is safe to test. Triggers image build on PR.") on May 4, 2026
@github-actions

github-actions Bot commented May 4, 2026

New images:

quay.io/netobserv/network-observability-operator:eba545b
quay.io/netobserv/network-observability-operator-bundle:v0.0.0-sha-eba545b
quay.io/netobserv/network-observability-operator-catalog:v0.0.0-sha-eba545b

They will expire in two weeks.

To deploy this build:

# Direct deployment, from operator repo
IMAGE=quay.io/netobserv/network-observability-operator:eba545b make deploy

# Or using operator-sdk
operator-sdk run bundle quay.io/netobserv/network-observability-operator-bundle:v0.0.0-sha-eba545b

Or as a Catalog Source:

apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: netobserv-dev
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/netobserv/network-observability-operator-catalog:v0.0.0-sha-eba545b
  displayName: NetObserv development catalog
  publisher: Me
  updateStrategy:
    registryPoll:
      interval: 1m

@memodi
Member Author

memodi commented May 4, 2026

Verified this:

$ oc get flowcollector
NAME      AGENT      PROCESSOR   PLUGIN   SAMPLING   DEPLOYMENT MODEL   STATUS           WARNINGS
cluster   Degraded   Ready       Ready    1          Service            Ready,Degraded   InstallDemoLoki option is enabled. This is useful for development and demo purposes but should not be used in production.
status:
  components:
    agent:
      desiredReplicas: 0
      message: DaemonSet netobserv-ebpf-agent has no pods scheduled; check nodeSelector
        or node availability
      readyReplicas: 0
      reason: NoPodsScheduled
      state: Degraded
    plugin:
      desiredReplicas: 1
      readyReplicas: 1
      state: Ready
    processor:
      desiredReplicas: 3
      readyReplicas: 3
      state: Ready
  conditions:
  - lastTransitionTime: "2026-05-04T19:58:02Z"
    message: 8 ready components, 0 with failure, 0 pending, 1 degraded
    reason: Ready,Degraded
    status: "True"
    type: Ready
  - lastTransitionTime: "2026-05-04T19:58:02Z"
    message: ""
    reason: Ready
    status: "False"
    type: WaitingDemoLoki
  - lastTransitionTime: "2026-05-04T19:58:02Z"
    message: DaemonSet netobserv-ebpf-agent has no pods scheduled; check nodeSelector
      or node availability
    reason: NoPodsScheduled
    status: "True"

@memodi memodi marked this pull request as ready for review May 4, 2026 20:04

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (2)
internal/pkg/manager/status/status_manager.go (1)

542-557: 💤 Low value

LGTM — core change is correct.

The if/else if/else chain is clean. defer setDaemonSetReplicas(ds) still fires and correctly writes DesiredReplicas=0/ReadyReplicas=0 after setDegraded. The previous replicas preserved by preserveReplicas are properly overwritten by the defer.

Minor: "nodeSelector or node availability" is the most common cause, but taints/tolerations are another frequent scheduling barrier. Worth considering extending the message:

💬 Optional message expansion
-		i.s.setDegraded(i.cpnt, "NoPodsScheduled", fmt.Sprintf("DaemonSet %s has no pods scheduled; check nodeSelector or node availability", ds.Name))
+		i.s.setDegraded(i.cpnt, "NoPodsScheduled", fmt.Sprintf("DaemonSet %s has no pods scheduled; check nodeSelector, taints/tolerations, or node availability", ds.Name))
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/pkg/manager/status/status_manager.go` around lines 542 - 557, Update
the degraded message in CheckDaemonSetProgress so it mentions taints/tolerations
as a common scheduling blocker; specifically, change the string passed to
i.s.setDegraded (inside CheckDaemonSetProgress) from "DaemonSet %s has no pods
scheduled; check nodeSelector or node availability" to include
"taints/tolerations" (e.g., "check nodeSelector, taints/tolerations, or node
availability") so operators get a more complete hint; leave the rest of the
function (including defer i.setDaemonSetReplicas and other
setInProgress/setReady calls) unchanged.
internal/pkg/manager/status/status_manager_test.go (1)

291-308: 💤 Low value

nil client is safe here — but worth an inline comment.

CheckDaemonSetHealth only calls CheckPodHealth(ctx, c, ...) when NumberReady < DesiredNumberScheduled || NumberUnavailable > 0. With DesiredNumberScheduled: 0 both conditions are false (0 < 0 and NumberUnavailable defaults to 0), so the nil client is never dereferenced. A one-line comment would make the intent explicit and prevent a future reader from wondering if this is a mistake.

💬 Suggested inline comment
 	agent.CheckDaemonSetHealth(context.Background(), nil, &appsv1.DaemonSet{
+		// nil client is safe: DesiredNumberScheduled==0 means CheckPodHealth is never invoked
 		ObjectMeta: metav1.ObjectMeta{Name: "netobserv-ebpf-agent"},
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/pkg/manager/status/status_manager_test.go` around lines 291 - 308,
The test TestDaemonSetZeroDesiredScheduledHealth passes a nil client to
agent.CheckDaemonSetHealth which is safe because CheckDaemonSetHealth only calls
CheckPodHealth(ctx, c, ...) when NumberReady < DesiredNumberScheduled or
NumberUnavailable > 0; with DesiredNumberScheduled set to 0 neither condition is
true so the client is never dereferenced—add a one-line inline comment inside
TestDaemonSetZeroDesiredScheduledHealth explaining this safety and reference
CheckDaemonSetHealth and CheckPodHealth so future readers know the nil client is
intentional.

ℹ️ Review info
⚙️ Run configuration

Configuration used: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 5b960869-022b-44c6-a47b-7854436c20c2

📥 Commits

Reviewing files that changed from the base of the PR and between 029f89d and 15945db.

📒 Files selected for processing (3)
  • internal/controller/flp/flp_pipeline_builder_test.go
  • internal/pkg/manager/status/status_manager.go
  • internal/pkg/manager/status/status_manager_test.go

@memodi memodi requested a review from Amoghrd May 4, 2026 20:12
@memodi
Member Author

memodi commented May 4, 2026

/cc @leandroberetta @OlivierCazade

@Amoghrd
Member

Amoghrd commented May 4, 2026

@memodi Should we extend this to other netobserv components like FLP and consolePlugin as well?

@memodi
Member Author

memodi commented May 4, 2026

@memodi Should we extend this to other netobserv components like FLP and consolePlugin as well?

For FLP and the plugin it's a bit different, since they're Deployments rather than a DaemonSet. From what I see:

if d.Status.ReadyReplicas == d.Status.Replicas && d.Status.Replicas > 0 {
	i.s.setReady(i.cpnt)
} else {
	i.s.setInProgress(i.cpnt, "DeploymentNotReady", fmt.Sprintf("Deployment %s not ready: %d/%d (missing condition)", d.Name, d.Status.ReadyReplicas, d.Status.Replicas))
}

it's already handled correctly there: the d.Status.Replicas > 0 requirement prevents a zero-replica Deployment from being reported Ready.

@Amoghrd
Member

Amoghrd commented May 5, 2026

/label qe-approved

@openshift-ci openshift-ci Bot added the qe-approved label ("QE has approved this pull request") on May 5, 2026
@memodi memodi added the needs-review label ("Tells that the PR needs a review") on May 8, 2026