
NETOBSERV-2674: report Degraded when eBPF DaemonSet has no pods scheduled #2704

Open

memodi wants to merge 1 commit into netobserv:main from memodi:fix-NETOBSERV-2674

Conversation

@memodi
Member

@memodi memodi commented May 4, 2026

Description

FlowCollector was incorrectly showing as Ready even when no eBPF pods were deployed due to a non-matching nodeSelector. The root cause was in CheckDaemonSetProgress(): when DesiredNumberScheduled == 0, the check NumberReady < DesiredNumberScheduled (0 < 0) is false, causing the component to be reported as Ready.

This fix adds an explicit guard: if DesiredNumberScheduled == 0, the eBPF agent component is reported as Degraded with reason NoPodsScheduled and a message pointing users toward checking their nodeSelector or node availability.

The same guard also takes effect through CheckDaemonSetHealth, the path that inspects pod health (for example, CrashLoopBackOff): it relied on the same condition, so it previously never flagged the zero-scheduled case either.
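
For illustration, here is a minimal standalone sketch of the decision the guard introduces, not the merged method itself. The Degraded reason and message match this change; the function name, the "DaemonSetNotReady" reason on the in-progress branch, and the return shape are assumptions made only to keep the example self-contained.

package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
)

// daemonSetState mirrors the updated CheckDaemonSetProgress decision (illustrative only).
func daemonSetState(ds *appsv1.DaemonSet) (state, reason, message string) {
	switch {
	case ds.Status.DesiredNumberScheduled == 0:
		// New guard: previously 0 < 0 was false, so this case fell through to Ready.
		return "Degraded", "NoPodsScheduled",
			fmt.Sprintf("DaemonSet %s has no pods scheduled; check nodeSelector or node availability", ds.Name)
	case ds.Status.NumberReady < ds.Status.DesiredNumberScheduled:
		// Reason string here is illustrative, not necessarily the operator's actual constant.
		return "InProgress", "DaemonSetNotReady",
			fmt.Sprintf("DaemonSet %s not ready: %d/%d", ds.Name, ds.Status.NumberReady, ds.Status.DesiredNumberScheduled)
	default:
		return "Ready", "", ""
	}
}

func main() {
	ds := &appsv1.DaemonSet{}
	ds.Name = "netobserv-ebpf-agent"
	ds.Status.DesiredNumberScheduled = 0
	state, reason, msg := daemonSetState(ds)
	fmt.Println(state, reason, msg) // Degraded NoPodsScheduled DaemonSet netobserv-ebpf-agent has no pods scheduled; ...
}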

Changed files:

  • internal/pkg/manager/status/status_manager.go — add zero-scheduled guard in CheckDaemonSetProgress
  • internal/pkg/manager/status/status_manager_test.go — add tests for zero-scheduled DaemonSet scenario

Dependencies

n/a

Checklist

  • Do the changes in this PR need specific configuration or environment setup for testing?
    • To reproduce: patch FlowCollector with a non-matching nodeSelector for eBPF pods and observe the Status.Components.Agent field on the FlowCollector CR (a hypothetical oc patch sketch follows this checklist).
  • I have added thorough unit tests for the change.
  • QE requirements (check 1 from the list):
    • Standard QE validation, with pre-merge tests unless stated otherwise.
    • Regression tests only (e.g. refactoring with no user-facing change).
    • No QE (e.g. trivial change with high reviewer's confidence, or per agreement with the QE team).
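
A reproduction sketch for the checklist item above: the FlowCollector field path (spec.agent.ebpf.advanced.scheduling.nodeSelector) and the label key are assumptions used for illustration; any selector that matches no node works. The resource name cluster matches the verification output later in this PR.

# Hypothetical repro: point the eBPF agent at a label no node carries.
oc patch flowcollector cluster --type=merge -p '
spec:
  agent:
    ebpf:
      advanced:
        scheduling:
          nodeSelector:
            does-not-exist: "true"'

# Then inspect the agent component status on the CR.
oc get flowcollector cluster -o jsonpath='{.status.components.agent}'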

🤖 Generated with Claude Code via /jira:solve [NETOBSERV-2674](https://redhat.atlassian.net/browse/NETOBSERV-2674)

Summary by CodeRabbit

  • Bug Fixes
    • DaemonSet status reporting now correctly identifies components with zero scheduled pods as degraded instead of in-progress, improving status accuracy and visibility.

When nodeSelector on the eBPF agent matches no nodes,
DesiredNumberScheduled is 0. The previous check `NumberReady <
DesiredNumberScheduled` (0 < 0 = false) incorrectly reported the
component as Ready.

Add an explicit check for DesiredNumberScheduled == 0, reporting the
component as Degraded with reason NoPodsScheduled and a message
suggesting the user check nodeSelector or node availability.

The same root cause also prevented CheckDaemonSetHealth from inspecting
pod health in this scenario, since it relied on the same condition.
Using Degraded (rather than Failure/InProgress) is appropriate because
the DaemonSet is valid — the issue is a scheduling or configuration
mismatch.

Fixes: NETOBSERV-2674

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@openshift-ci-robot
Collaborator

openshift-ci-robot commented May 4, 2026

@memodi: This pull request references NETOBSERV-2674 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target either version "5.0." or "openshift-5.0.", but it targets "netobserv-1.12" instead.

Details

In response to this:

(The PR description is quoted here verbatim; see above.)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci

openshift-ci Bot commented May 4, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@coderabbitai

coderabbitai Bot commented May 4, 2026

📝 Walkthrough

Three files are modified: test data formatting alignment, DaemonSet status check logic to treat zero desired pods as Degraded instead of In Progress, and two new test cases validating the degraded behavior.

Changes

DaemonSet Zero-Pods Status Handling

  • Core Logic — internal/pkg/manager/status/status_manager.go: CheckDaemonSetProgress adds a branch treating DesiredNumberScheduled == 0 as StatusDegraded with reason NoPodsScheduled; the method comment is updated to document the Degraded outcome.
  • Test Coverage — internal/pkg/manager/status/status_manager_test.go: two new test functions assert that zero-desired-scheduled DaemonSets produce StatusDegraded with the NoPodsScheduled reason via both CheckDaemonSetProgress and CheckDaemonSetHealth.

FLP Pipeline Test Formatting

  • Test Data Alignment — internal/controller/flp/flp_pipeline_builder_test.go: comment spacing adjusted in the AdditionalIncludeList slice literal; test logic and assertions unchanged.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)
  • Title check — ✅ Passed: the title clearly and specifically describes the main change (reporting the DaemonSet as Degraded when no pods are scheduled), with the ticket reference for context.
  • Description check — ✅ Passed: the description covers the issue, root cause, solution, changed files, dependencies, testing approach, and QE requirements. All required sections are present and well populated.
  • Linked Issues check — ✅ Passed: check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check — ✅ Passed: check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci

openshift-ci Bot commented May 4, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign mffiedler for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@codecov

codecov Bot commented May 4, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 72.34%. Comparing base (029f89d) to head (15945db).
⚠️ Report is 9 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2704      +/-   ##
==========================================
+ Coverage   72.32%   72.34%   +0.02%     
==========================================
  Files         107      107              
  Lines       11482    11484       +2     
==========================================
+ Hits         8304     8308       +4     
+ Misses       2675     2674       -1     
+ Partials      503      502       -1     
Flag Coverage Δ
unittests 72.34% <100.00%> (+0.02%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
internal/pkg/manager/status/status_manager.go 93.91% <100.00%> (+0.03%) ⬆️

... and 1 file with indirect coverage changes


@memodi
Member Author

memodi commented May 4, 2026

/ok-to-test

@openshift-ci openshift-ci Bot added the ok-to-test label ("To set manually when a PR is safe to test. Triggers image build on PR.") on May 4, 2026
@github-actions

github-actions Bot commented May 4, 2026

New images:

quay.io/netobserv/network-observability-operator:eba545b
quay.io/netobserv/network-observability-operator-bundle:v0.0.0-sha-eba545b
quay.io/netobserv/network-observability-operator-catalog:v0.0.0-sha-eba545b

They will expire in two weeks.

To deploy this build:

# Direct deployment, from operator repo
IMAGE=quay.io/netobserv/network-observability-operator:eba545b make deploy

# Or using operator-sdk
operator-sdk run bundle quay.io/netobserv/network-observability-operator-bundle:v0.0.0-sha-eba545b

Or as a Catalog Source:

apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: netobserv-dev
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/netobserv/network-observability-operator-catalog:v0.0.0-sha-eba545b
  displayName: NetObserv development catalog
  publisher: Me
  updateStrategy:
    registryPoll:
      interval: 1m

@memodi
Member Author

memodi commented May 4, 2026

Verified this:

$ oc get flowcollector
NAME      AGENT      PROCESSOR   PLUGIN   SAMPLING   DEPLOYMENT MODEL   STATUS           WARNINGS
cluster   Degraded   Ready       Ready    1          Service            Ready,Degraded   InstallDemoLoki option is enabled. This is useful for development and demo purposes but should not be used in production.
status:
  components:
    agent:
      desiredReplicas: 0
      message: DaemonSet netobserv-ebpf-agent has no pods scheduled; check nodeSelector
        or node availability
      readyReplicas: 0
      reason: NoPodsScheduled
      state: Degraded
    plugin:
      desiredReplicas: 1
      readyReplicas: 1
      state: Ready
    processor:
      desiredReplicas: 3
      readyReplicas: 3
      state: Ready
  conditions:
  - lastTransitionTime: "2026-05-04T19:58:02Z"
    message: 8 ready components, 0 with failure, 0 pending, 1 degraded
    reason: Ready,Degraded
    status: "True"
    type: Ready
  - lastTransitionTime: "2026-05-04T19:58:02Z"
    message: ""
    reason: Ready
    status: "False"
    type: WaitingDemoLoki
  - lastTransitionTime: "2026-05-04T19:58:02Z"
    message: DaemonSet netobserv-ebpf-agent has no pods scheduled; check nodeSelector
      or node availability
    reason: NoPodsScheduled
    status: "True"

@memodi memodi marked this pull request as ready for review May 4, 2026 20:04

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (2)
internal/pkg/manager/status/status_manager.go (1)

542-557: 💤 Low value

LGTM — core change is correct.

The if/else if/else chain is clean. defer setDaemonSetReplicas(ds) still fires and correctly writes DesiredReplicas=0/ReadyReplicas=0 after setDegraded. The previous replicas preserved by preserveReplicas are properly overwritten by the defer.

Minor: "nodeSelector or node availability" is the most common cause, but taints/tolerations are another frequent scheduling barrier. Worth considering extending the message:

💬 Optional message expansion
-		i.s.setDegraded(i.cpnt, "NoPodsScheduled", fmt.Sprintf("DaemonSet %s has no pods scheduled; check nodeSelector or node availability", ds.Name))
+		i.s.setDegraded(i.cpnt, "NoPodsScheduled", fmt.Sprintf("DaemonSet %s has no pods scheduled; check nodeSelector, taints/tolerations, or node availability", ds.Name))
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/pkg/manager/status/status_manager.go` around lines 542 - 557, Update
the degraded message in CheckDaemonSetProgress so it mentions taints/tolerations
as a common scheduling blocker; specifically, change the string passed to
i.s.setDegraded (inside CheckDaemonSetProgress) from "DaemonSet %s has no pods
scheduled; check nodeSelector or node availability" to include
"taints/tolerations" (e.g., "check nodeSelector, taints/tolerations, or node
availability") so operators get a more complete hint; leave the rest of the
function (including defer i.setDaemonSetReplicas and other
setInProgress/setReady calls) unchanged.
internal/pkg/manager/status/status_manager_test.go (1)

291-308: 💤 Low value

nil client is safe here — but worth an inline comment.

CheckDaemonSetHealth only calls CheckPodHealth(ctx, c, ...) when NumberReady < DesiredNumberScheduled || NumberUnavailable > 0. With DesiredNumberScheduled: 0 both conditions are false (0 < 0 and NumberUnavailable defaults to 0), so the nil client is never dereferenced. A one-line comment would make the intent explicit and prevent a future reader from wondering if this is a mistake.

💬 Suggested inline comment
 	agent.CheckDaemonSetHealth(context.Background(), nil, &appsv1.DaemonSet{
+		// nil client is safe: DesiredNumberScheduled==0 means CheckPodHealth is never invoked
 		ObjectMeta: metav1.ObjectMeta{Name: "netobserv-ebpf-agent"},
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/pkg/manager/status/status_manager_test.go` around lines 291 - 308,
The test TestDaemonSetZeroDesiredScheduledHealth passes a nil client to
agent.CheckDaemonSetHealth which is safe because CheckDaemonSetHealth only calls
CheckPodHealth(ctx, c, ...) when NumberReady < DesiredNumberScheduled or
NumberUnavailable > 0; with DesiredNumberScheduled set to 0 neither condition is
true so the client is never dereferenced—add a one-line inline comment inside
TestDaemonSetZeroDesiredScheduledHealth explaining this safety and reference
CheckDaemonSetHealth and CheckPodHealth so future readers know the nil client is
intentional.

ℹ️ Review info
⚙️ Run configuration

Configuration used: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 5b960869-022b-44c6-a47b-7854436c20c2

📥 Commits

Reviewing files that changed from the base of the PR and between 029f89d and 15945db.

📒 Files selected for processing (3)
  • internal/controller/flp/flp_pipeline_builder_test.go
  • internal/pkg/manager/status/status_manager.go
  • internal/pkg/manager/status/status_manager_test.go

@memodi memodi requested a review from Amoghrd May 4, 2026 20:12
@memodi
Member Author

memodi commented May 4, 2026

/cc @leandroberetta @OlivierCazade

@Amoghrd
Member

Amoghrd commented May 4, 2026

@memodi Should we extend this to other netobserv components like FLP and consolePlugin as well?

@memodi
Member Author

memodi commented May 4, 2026

@memodi Should we extend this to other netobserv components like FLP and consolePlugin as well?

For FLP and the plugin it's a bit different, since they're Deployments rather than a DaemonSet. From what I see:

if d.Status.ReadyReplicas == d.Status.Replicas && d.Status.Replicas > 0 {
	i.s.setReady(i.cpnt)
} else {
	i.s.setInProgress(i.cpnt, "DeploymentNotReady", fmt.Sprintf("Deployment %s not ready: %d/%d (missing condition)", d.Name, d.Status.ReadyReplicas, d.Status.Replicas))
}

it's already handled correctly there: the d.Status.Replicas > 0 requirement prevents a zero-replica Deployment from being reported Ready.

@Amoghrd
Member

Amoghrd commented May 5, 2026

/label qe-approved

@openshift-ci openshift-ci Bot added the qe-approved label ("QE has approved this pull request") on May 5, 2026
@memodi memodi added the needs-review label ("Tells that the PR needs a review") on May 8, 2026