Skip to content

e2e(agentscheduler): fix flaky batch scheduling wait#5425

Open
agrawalcodes wants to merge 1 commit into
volcano-sh:masterfrom
agrawalcodes:fix-agentscheduler-batch-bind-serialization
Open

e2e(agentscheduler): fix flaky batch scheduling wait#5425
agrawalcodes wants to merge 1 commit into
volcano-sh:masterfrom
agrawalcodes:fix-agentscheduler-batch-bind-serialization

Conversation

@agrawalcodes

Copy link
Copy Markdown
Contributor

What this PR does

Fixes the intermittently failing Agent Scheduler E2E → Batch Scheduling → should schedule a batch of pods spec (seen failing in both hard and soft runs as 9/10 scheduled).

Root cause

The spec created 10 pods and then waited on each pod with its own fixed deadline (WaitPodScheduled, 2 min each).

The agent scheduler's conflict-aware binder serializes binds per node: a candidate node is only accepted once its BindGeneration has advanced past the previous bind to that node (FindNonConflictingNode), which requires the prior bind to be reflected through the node informer. The CI scheduling config uses binpack, so a burst of small pods all score onto the same node and are therefore bound sequentially, one informer round-trip apart.

Under slow CI, draining the batch this way occasionally pushed one straggler past the per-pod deadline while the other 9 were already bound — 9/10. This is timing, not capacity (10 × 30m easily fits a single node).

Change

Wait for the whole batch with a single Eventually over all pods, asserting each lands on a node, instead of imposing a per-pod deadline. This matches how the scheduler actually drains a batch and removes the per-pod timing assumption.

Evidence

Scheduled 9/10 pods
[FAILED] all pods should be scheduled
Expected <int>: 9 to equal <int>: 10
NodeShard: agent-scheduler, NodesDesired: 1, NodesInUse: 1

🤖 Generated with Claude Code

Copilot AI review requested due to automatic review settings June 10, 2026 10:37
@volcano-sh-bot

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign william-wang for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@volcano-sh-bot volcano-sh-bot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Jun 10, 2026

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Updates the Batch Scheduling E2E test to wait for the whole pod batch to be scheduled together, better matching scheduling behavior under a conflict-aware binder and reducing per-pod deadline flakiness.

Changes:

  • Adjusted test logging to clarify expected scheduler usage for the created pods.
  • Replaced per-pod WaitPodScheduled loop with a single Eventually that checks all pods’ Spec.NodeName.
  • Added inline rationale about serialized binds and batch-level waiting.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +471 to +472
pod, err := ctx.Kubeclient.CoreV1().Pods(ctx.Namespace).Get(
context.TODO(), podName, metav1.GetOptions{})

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fine, aligned with the rest of the test.

GinkgoWriter.Printf("Scheduled %d/%d pods\n", scheduledCount, podCount)
Expect(scheduledCount).To(Equal(podCount),
"all pods should be scheduled")
}, 3*time.Minute, 2*time.Second).Should(Succeed(), "all pods should be scheduled")

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no problem with Eventually.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the batch scheduling E2E test to wait for the entire batch of pods to be scheduled together using an Eventually block, accommodating the conflict-aware binder's serialized binding behavior. The reviewer suggested optimizing the polling logic inside the Eventually block by listing all pods in the namespace once per poll instead of making individual Get requests for each pod, which reduces API server load and potential test flakiness.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +469 to +477
Eventually(func(g Gomega) {
for _, podName := range podNames {
pod, err := ctx.Kubeclient.CoreV1().Pods(ctx.Namespace).Get(
context.TODO(), podName, metav1.GetOptions{})
g.Expect(err).NotTo(HaveOccurred(), "failed to get pod %s", podName)
g.Expect(pod.Spec.NodeName).NotTo(BeEmpty(),
"pod %s should be scheduled onto a node", podName)
}
}

GinkgoWriter.Printf("Scheduled %d/%d pods\n", scheduledCount, podCount)
Expect(scheduledCount).To(Equal(podCount),
"all pods should be scheduled")
}, 3*time.Minute, 2*time.Second).Should(Succeed(), "all pods should be scheduled")

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Instead of making podCount (10) individual Get API requests to the Kubernetes API server on every poll interval inside the Eventually block, you can list all pods in the namespace once per poll. This reduces the API call volume by 90% (from 10 calls to 1 call per poll), which significantly reduces API server load and helps prevent potential test flakiness or throttling in busy CI environments.

			podSet := make(map[string]bool)
			for _, name := range podNames {
				podSet[name] = true
			}

			Eventually(func(g Gomega) {
				pods, err := ctx.Kubeclient.CoreV1().Pods(ctx.Namespace).List(
					context.TODO(), metav1.ListOptions{})
				g.Expect(err).NotTo(HaveOccurred(), "failed to list pods")

				scheduled := 0
				for _, pod := range pods.Items {
					if podSet[pod.Name] && pod.Spec.NodeName != "" {
						scheduled++
					}
				}
				g.Expect(scheduled).To(Equal(podCount), "not all batch pods are scheduled yet")
			}, 3*time.Minute, 2*time.Second).Should(Succeed(), "all pods should be scheduled")

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems legit can you consider it?

…dline

The batch scheduling spec waited on each pod with its own fixed deadline.
The agent scheduler's conflict-aware binder serializes binds per node, so
a burst of pods that score onto the same node (e.g. via binpack) is placed
sequentially; under slow CI a single straggler could miss the per-pod
deadline while the rest were already bound, yielding 9/10.

Wait for the whole batch to be scheduled with one Eventually over all
pods, asserting each lands on a node. This matches how the scheduler
actually drains a batch and removes the per-pod timing assumption.

Signed-off-by: Rajesh Agrawal <rajeshagrawal38000@gmail.com>
@agrawalcodes agrawalcodes force-pushed the fix-agentscheduler-batch-bind-serialization branch from 28e5a50 to 4feab08 Compare June 10, 2026 10:40

@hajnalmt hajnalmt left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/area test
/label tide/merge-method-squash

Thanks for the flake fix. I haven't seen this test failing yet though, but this seems unharmful. May I ask how many times have you experienced this issue, and if there is an issue created for this somewhere?

The bot comments look non-blocking to me. The single-List polling suggestion would be a nice cleanup to reduce API-server calls, but I do not think it needs to block this small test-only fix.

Comment on lines +469 to +477
Eventually(func(g Gomega) {
for _, podName := range podNames {
pod, err := ctx.Kubeclient.CoreV1().Pods(ctx.Namespace).Get(
context.TODO(), podName, metav1.GetOptions{})
g.Expect(err).NotTo(HaveOccurred(), "failed to get pod %s", podName)
g.Expect(pod.Spec.NodeName).NotTo(BeEmpty(),
"pod %s should be scheduled onto a node", podName)
}
}

GinkgoWriter.Printf("Scheduled %d/%d pods\n", scheduledCount, podCount)
Expect(scheduledCount).To(Equal(podCount),
"all pods should be scheduled")
}, 3*time.Minute, 2*time.Second).Should(Succeed(), "all pods should be scheduled")

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems legit can you consider it?

Comment on lines +471 to +472
pod, err := ctx.Kubeclient.CoreV1().Pods(ctx.Namespace).Get(
context.TODO(), podName, metav1.GetOptions{})

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fine, aligned with the rest of the test.

GinkgoWriter.Printf("Scheduled %d/%d pods\n", scheduledCount, podCount)
Expect(scheduledCount).To(Equal(podCount),
"all pods should be scheduled")
}, 3*time.Minute, 2*time.Second).Should(Succeed(), "all pods should be scheduled")

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no problem with Eventually.

@volcano-sh-bot volcano-sh-bot added area/test CI and test related Issues or PRs tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. labels Jun 11, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/test CI and test related Issues or PRs size/S Denotes a PR that changes 10-29 lines, ignoring generated files. tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants