e2e(agentscheduler): fix flaky batch scheduling wait#5425
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Updates the Batch Scheduling E2E test to wait for the whole pod batch to be scheduled together, better matching scheduling behavior under a conflict-aware binder and reducing per-pod deadline flakiness.
Changes:
- Adjusted test logging to clarify expected scheduler usage for the created pods.
- Replaced per-pod
WaitPodScheduledloop with a singleEventuallythat checks all pods’Spec.NodeName. - Added inline rationale about serialized binds and batch-level waiting.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| pod, err := ctx.Kubeclient.CoreV1().Pods(ctx.Namespace).Get( | ||
| context.TODO(), podName, metav1.GetOptions{}) |
There was a problem hiding this comment.
This is fine, aligned with the rest of the test.
| GinkgoWriter.Printf("Scheduled %d/%d pods\n", scheduledCount, podCount) | ||
| Expect(scheduledCount).To(Equal(podCount), | ||
| "all pods should be scheduled") | ||
| }, 3*time.Minute, 2*time.Second).Should(Succeed(), "all pods should be scheduled") |
There was a problem hiding this comment.
I have no problem with Eventually.
There was a problem hiding this comment.
Code Review
This pull request refactors the batch scheduling E2E test to wait for the entire batch of pods to be scheduled together using an Eventually block, accommodating the conflict-aware binder's serialized binding behavior. The reviewer suggested optimizing the polling logic inside the Eventually block by listing all pods in the namespace once per poll instead of making individual Get requests for each pod, which reduces API server load and potential test flakiness.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| Eventually(func(g Gomega) { | ||
| for _, podName := range podNames { | ||
| pod, err := ctx.Kubeclient.CoreV1().Pods(ctx.Namespace).Get( | ||
| context.TODO(), podName, metav1.GetOptions{}) | ||
| g.Expect(err).NotTo(HaveOccurred(), "failed to get pod %s", podName) | ||
| g.Expect(pod.Spec.NodeName).NotTo(BeEmpty(), | ||
| "pod %s should be scheduled onto a node", podName) | ||
| } | ||
| } | ||
|
|
||
| GinkgoWriter.Printf("Scheduled %d/%d pods\n", scheduledCount, podCount) | ||
| Expect(scheduledCount).To(Equal(podCount), | ||
| "all pods should be scheduled") | ||
| }, 3*time.Minute, 2*time.Second).Should(Succeed(), "all pods should be scheduled") |
There was a problem hiding this comment.
Instead of making podCount (10) individual Get API requests to the Kubernetes API server on every poll interval inside the Eventually block, you can list all pods in the namespace once per poll. This reduces the API call volume by 90% (from 10 calls to 1 call per poll), which significantly reduces API server load and helps prevent potential test flakiness or throttling in busy CI environments.
podSet := make(map[string]bool)
for _, name := range podNames {
podSet[name] = true
}
Eventually(func(g Gomega) {
pods, err := ctx.Kubeclient.CoreV1().Pods(ctx.Namespace).List(
context.TODO(), metav1.ListOptions{})
g.Expect(err).NotTo(HaveOccurred(), "failed to list pods")
scheduled := 0
for _, pod := range pods.Items {
if podSet[pod.Name] && pod.Spec.NodeName != "" {
scheduled++
}
}
g.Expect(scheduled).To(Equal(podCount), "not all batch pods are scheduled yet")
}, 3*time.Minute, 2*time.Second).Should(Succeed(), "all pods should be scheduled")There was a problem hiding this comment.
This seems legit can you consider it?
…dline The batch scheduling spec waited on each pod with its own fixed deadline. The agent scheduler's conflict-aware binder serializes binds per node, so a burst of pods that score onto the same node (e.g. via binpack) is placed sequentially; under slow CI a single straggler could miss the per-pod deadline while the rest were already bound, yielding 9/10. Wait for the whole batch to be scheduled with one Eventually over all pods, asserting each lands on a node. This matches how the scheduler actually drains a batch and removes the per-pod timing assumption. Signed-off-by: Rajesh Agrawal <rajeshagrawal38000@gmail.com>
28e5a50 to
4feab08
Compare
hajnalmt
left a comment
There was a problem hiding this comment.
/area test
/label tide/merge-method-squash
Thanks for the flake fix. I haven't seen this test failing yet though, but this seems unharmful. May I ask how many times have you experienced this issue, and if there is an issue created for this somewhere?
The bot comments look non-blocking to me. The single-List polling suggestion would be a nice cleanup to reduce API-server calls, but I do not think it needs to block this small test-only fix.
| Eventually(func(g Gomega) { | ||
| for _, podName := range podNames { | ||
| pod, err := ctx.Kubeclient.CoreV1().Pods(ctx.Namespace).Get( | ||
| context.TODO(), podName, metav1.GetOptions{}) | ||
| g.Expect(err).NotTo(HaveOccurred(), "failed to get pod %s", podName) | ||
| g.Expect(pod.Spec.NodeName).NotTo(BeEmpty(), | ||
| "pod %s should be scheduled onto a node", podName) | ||
| } | ||
| } | ||
|
|
||
| GinkgoWriter.Printf("Scheduled %d/%d pods\n", scheduledCount, podCount) | ||
| Expect(scheduledCount).To(Equal(podCount), | ||
| "all pods should be scheduled") | ||
| }, 3*time.Minute, 2*time.Second).Should(Succeed(), "all pods should be scheduled") |
There was a problem hiding this comment.
This seems legit can you consider it?
| pod, err := ctx.Kubeclient.CoreV1().Pods(ctx.Namespace).Get( | ||
| context.TODO(), podName, metav1.GetOptions{}) |
There was a problem hiding this comment.
This is fine, aligned with the rest of the test.
| GinkgoWriter.Printf("Scheduled %d/%d pods\n", scheduledCount, podCount) | ||
| Expect(scheduledCount).To(Equal(podCount), | ||
| "all pods should be scheduled") | ||
| }, 3*time.Minute, 2*time.Second).Should(Succeed(), "all pods should be scheduled") |
There was a problem hiding this comment.
I have no problem with Eventually.
What this PR does
Fixes the intermittently failing Agent Scheduler E2E → Batch Scheduling → should schedule a batch of pods spec (seen failing in both
hardandsoftruns as9/10scheduled).Root cause
The spec created 10 pods and then waited on each pod with its own fixed deadline (
WaitPodScheduled, 2 min each).The agent scheduler's conflict-aware binder serializes binds per node: a candidate node is only accepted once its
BindGenerationhas advanced past the previous bind to that node (FindNonConflictingNode), which requires the prior bind to be reflected through the node informer. The CI scheduling config usesbinpack, so a burst of small pods all score onto the same node and are therefore bound sequentially, one informer round-trip apart.Under slow CI, draining the batch this way occasionally pushed one straggler past the per-pod deadline while the other 9 were already bound —
9/10. This is timing, not capacity (10 × 30m easily fits a single node).Change
Wait for the whole batch with a single
Eventuallyover all pods, asserting each lands on a node, instead of imposing a per-pod deadline. This matches how the scheduler actually drains a batch and removes the per-pod timing assumption.Evidence
🤖 Generated with Claude Code