e2e(agentscheduler): fix flaky batch scheduling wait by agrawalcodes · Pull Request #5425 · volcano-sh/volcano

agrawalcodes · 2026-06-10T10:37:55Z

What this PR does

Fixes the intermittently failing Agent Scheduler E2E → Batch Scheduling → should schedule a batch of pods spec (seen failing in both hard and soft runs as 9/10 scheduled).

Root cause

The spec created 10 pods and then waited on each pod with its own fixed deadline (WaitPodScheduled, 2 min each).

The agent scheduler's conflict-aware binder serializes binds per node: a candidate node is only accepted once its BindGeneration has advanced past the previous bind to that node (FindNonConflictingNode), which requires the prior bind to be reflected through the node informer. The CI scheduling config uses binpack, so a burst of small pods all score onto the same node and are therefore bound sequentially, one informer round-trip apart.

Under slow CI, draining the batch this way occasionally pushed one straggler past the per-pod deadline while the other 9 were already bound — 9/10. This is timing, not capacity (10 × 30m easily fits a single node).

Change

Wait for the whole batch with a single Eventually over all pods, asserting each lands on a node, instead of imposing a per-pod deadline. This matches how the scheduler actually drains a batch and removes the per-pod timing assumption.

Evidence

Scheduled 9/10 pods
[FAILED] all pods should be scheduled
Expected <int>: 9 to equal <int>: 10
NodeShard: agent-scheduler, NodesDesired: 1, NodesInUse: 1

🤖 Generated with Claude Code

volcano-sh-bot · 2026-06-10T10:38:01Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign william-wang for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Updates the Batch Scheduling E2E test to wait for the whole pod batch to be scheduled together, better matching scheduling behavior under a conflict-aware binder and reducing per-pod deadline flakiness.

Changes:

Adjusted test logging to clarify expected scheduler usage for the created pods.
Replaced per-pod WaitPodScheduled loop with a single Eventually that checks all pods’ Spec.NodeName.
Added inline rationale about serialized binds and batch-level waiting.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

hajnalmt · 2026-06-11T20:20:04Z

+					pod, err := ctx.Kubeclient.CoreV1().Pods(ctx.Namespace).Get(
+						context.TODO(), podName, metav1.GetOptions{})


This is fine, aligned with the rest of the test.

hajnalmt · 2026-06-11T20:20:48Z

-			GinkgoWriter.Printf("Scheduled %d/%d pods\n", scheduledCount, podCount)
-			Expect(scheduledCount).To(Equal(podCount),
-				"all pods should be scheduled")
+			}, 3*time.Minute, 2*time.Second).Should(Succeed(), "all pods should be scheduled")


I have no problem with Eventually.

gemini-code-assist

Code Review

This pull request refactors the batch scheduling E2E test to wait for the entire batch of pods to be scheduled together using an Eventually block, accommodating the conflict-aware binder's serialized binding behavior. The reviewer suggested optimizing the polling logic inside the Eventually block by listing all pods in the namespace once per poll instead of making individual Get requests for each pod, which reduces API server load and potential test flakiness.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-10T10:38:58Z

+			Eventually(func(g Gomega) {
+				for _, podName := range podNames {
+					pod, err := ctx.Kubeclient.CoreV1().Pods(ctx.Namespace).Get(
+						context.TODO(), podName, metav1.GetOptions{})
+					g.Expect(err).NotTo(HaveOccurred(), "failed to get pod %s", podName)
+					g.Expect(pod.Spec.NodeName).NotTo(BeEmpty(),
+						"pod %s should be scheduled onto a node", podName)
 				}
-			}
-
-			GinkgoWriter.Printf("Scheduled %d/%d pods\n", scheduledCount, podCount)
-			Expect(scheduledCount).To(Equal(podCount),
-				"all pods should be scheduled")
+			}, 3*time.Minute, 2*time.Second).Should(Succeed(), "all pods should be scheduled")


Instead of making podCount (10) individual Get API requests to the Kubernetes API server on every poll interval inside the Eventually block, you can list all pods in the namespace once per poll. This reduces the API call volume by 90% (from 10 calls to 1 call per poll), which significantly reduces API server load and helps prevent potential test flakiness or throttling in busy CI environments.

podSet := make(map[string]bool) for _, name := range podNames { podSet[name] = true } Eventually(func(g Gomega) { pods, err := ctx.Kubeclient.CoreV1().Pods(ctx.Namespace).List( context.TODO(), metav1.ListOptions{}) g.Expect(err).NotTo(HaveOccurred(), "failed to list pods") scheduled := 0 for _, pod := range pods.Items { if podSet[pod.Name] && pod.Spec.NodeName != "" { scheduled++ } } g.Expect(scheduled).To(Equal(podCount), "not all batch pods are scheduled yet") }, 3*time.Minute, 2*time.Second).Should(Succeed(), "all pods should be scheduled")

This seems legit can you consider it?

…dline The batch scheduling spec waited on each pod with its own fixed deadline. The agent scheduler's conflict-aware binder serializes binds per node, so a burst of pods that score onto the same node (e.g. via binpack) is placed sequentially; under slow CI a single straggler could miss the per-pod deadline while the rest were already bound, yielding 9/10. Wait for the whole batch to be scheduled with one Eventually over all pods, asserting each lands on a node. This matches how the scheduler actually drains a batch and removes the per-pod timing assumption. Signed-off-by: Rajesh Agrawal <rajeshagrawal38000@gmail.com>

hajnalmt

/area test
/label tide/merge-method-squash

Thanks for the flake fix. I haven't seen this test failing yet though, but this seems unharmful. May I ask how many times have you experienced this issue, and if there is an issue created for this somewhere?

The bot comments look non-blocking to me. The single-List polling suggestion would be a nice cleanup to reduce API-server calls, but I do not think it needs to block this small test-only fix.

hajnalmt · 2026-06-11T20:19:30Z

+			Eventually(func(g Gomega) {
+				for _, podName := range podNames {
+					pod, err := ctx.Kubeclient.CoreV1().Pods(ctx.Namespace).Get(
+						context.TODO(), podName, metav1.GetOptions{})
+					g.Expect(err).NotTo(HaveOccurred(), "failed to get pod %s", podName)
+					g.Expect(pod.Spec.NodeName).NotTo(BeEmpty(),
+						"pod %s should be scheduled onto a node", podName)
 				}
-			}
-
-			GinkgoWriter.Printf("Scheduled %d/%d pods\n", scheduledCount, podCount)
-			Expect(scheduledCount).To(Equal(podCount),
-				"all pods should be scheduled")
+			}, 3*time.Minute, 2*time.Second).Should(Succeed(), "all pods should be scheduled")


This seems legit can you consider it?

hajnalmt · 2026-06-11T20:20:04Z

+					pod, err := ctx.Kubeclient.CoreV1().Pods(ctx.Namespace).Get(
+						context.TODO(), podName, metav1.GetOptions{})


This is fine, aligned with the rest of the test.

hajnalmt · 2026-06-11T20:20:48Z

-			GinkgoWriter.Printf("Scheduled %d/%d pods\n", scheduledCount, podCount)
-			Expect(scheduledCount).To(Equal(podCount),
-				"all pods should be scheduled")
+			}, 3*time.Minute, 2*time.Second).Should(Succeed(), "all pods should be scheduled")


I have no problem with Eventually.

Copilot AI review requested due to automatic review settings June 10, 2026 10:37

volcano-sh-bot requested review from archlitchi and wangyang0616 June 10, 2026 10:37

volcano-sh-bot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Jun 10, 2026

Copilot AI reviewed Jun 10, 2026

View reviewed changes

gemini-code-assist Bot reviewed Jun 10, 2026

View reviewed changes

agrawalcodes force-pushed the fix-agentscheduler-batch-bind-serialization branch from 28e5a50 to 4feab08 Compare June 10, 2026 10:40

hajnalmt reviewed Jun 11, 2026

View reviewed changes

volcano-sh-bot added area/test CI and test related Issues or PRs tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. labels Jun 11, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

e2e(agentscheduler): fix flaky batch scheduling wait#5425

e2e(agentscheduler): fix flaky batch scheduling wait#5425
agrawalcodes wants to merge 1 commit into
volcano-sh:masterfrom
agrawalcodes:fix-agentscheduler-batch-bind-serialization

agrawalcodes commented Jun 10, 2026

Uh oh!

volcano-sh-bot commented Jun 10, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

hajnalmt Jun 11, 2026

Uh oh!

hajnalmt Jun 11, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Jun 10, 2026

Uh oh!

hajnalmt Jun 11, 2026

Uh oh!

hajnalmt left a comment

Uh oh!

hajnalmt Jun 11, 2026

Uh oh!

hajnalmt Jun 11, 2026

Uh oh!

hajnalmt Jun 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

		pod, err := ctx.Kubeclient.CoreV1().Pods(ctx.Namespace).Get(
		context.TODO(), podName, metav1.GetOptions{})

Conversation

agrawalcodes commented Jun 10, 2026

What this PR does

Root cause

Change

Evidence

Uh oh!

volcano-sh-bot commented Jun 10, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

hajnalmt Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

hajnalmt Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 10, 2026

Choose a reason for hiding this comment

Uh oh!

hajnalmt Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

hajnalmt left a comment

Choose a reason for hiding this comment

Uh oh!

hajnalmt Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

hajnalmt Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

hajnalmt Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants