fix(jobset): omit container images from SSA apply config when resuming suspended TrainJob#3388
Conversation
When the TrainJob controller resumes a suspended JobSet, it issues a Server-Side Apply (SSA) patch that includes the full desired spec — including container images. The JobSet admission webhook rejects any update to spec.replicatedJobs container images because that field is immutable, even for suspended JobSets. Fix: when Build() detects a resume (oldJobSet.Suspend=true, trainJob.Suspend=false), nil out all container and initContainer images in the apply configuration before the SSA call. SSA then omits those fields from the patch, the webhook sees no image change, and the JobSet resumes successfully with the image it was originally created with. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Pull request overview
This PR addresses a JobSet admission webhook immutability constraint by preventing Server-Side Apply (SSA) from attempting to patch spec.replicatedJobs[*] container images when resuming a previously suspended TrainJob.
Changes:
- Detect the “resume from suspended JobSet” scenario in
JobSet.Build()and nil out container/initContainer image fields in the SSA apply configuration so those fields are omitted from the apply patch. - Add a new unit test covering resume vs. non-resume behavior and the existing “both running” early-exit guard.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
pkg/runtime/framework/plugins/jobset/jobset.go |
Omits container/initContainer images from SSA apply config on resume to avoid immutable-field webhook rejections. |
pkg/runtime/framework/plugins/jobset/jobset_test.go |
Adds unit coverage for the resume scenario and related Build() paths. |
| for j := range podSpec.Containers { | ||
| podSpec.Containers[j].Image = nil | ||
| } | ||
| for j := range podSpec.InitContainers { | ||
| podSpec.InitContainers[j].Image = nil |
There was a problem hiding this comment.
The inner loops use for j := range ... which shadows the method receiver j and makes this block harder to read/grep; rename the loop indices (e.g., cIdx, icIdx) to avoid shadowing.
| for j := range podSpec.Containers { | |
| podSpec.Containers[j].Image = nil | |
| } | |
| for j := range podSpec.InitContainers { | |
| podSpec.InitContainers[j].Image = nil | |
| for cIdx := range podSpec.Containers { | |
| podSpec.Containers[cIdx].Image = nil | |
| } | |
| for icIdx := range podSpec.InitContainers { | |
| podSpec.InitContainers[icIdx].Image = nil |
| WithTemplate(corev1ac.PodTemplateSpec(). | ||
| WithSpec(corev1ac.PodSpec(). | ||
| WithContainers( | ||
| corev1ac.Container(). | ||
| WithName(constants.Node). | ||
| WithImage(upgradedImage), | ||
| ), | ||
| ), |
There was a problem hiding this comment.
This test intends to verify both containers and initContainers have images omitted on resume, but the apply template only defines WithContainers(...) and never sets any InitContainers, so the initContainer assertions never exercise the new behavior; add an initContainer with an image to the template (and possibly the PodSets) so the resume/non-resume cases actually validate initContainer image handling.
What this does
Fixes #3387
When a suspended TrainJob is resumed after a TrainingRuntime image update, the Trainer controller includes the updated image in its SSA patch. The JobSet webhook rejects this because
spec.replicatedJobscontainer images are immutable — even on suspended JobSets.Fix: In
Build(), when a resume is detected (oldJobSet.Suspend=true,trainJob.Suspend=false), nil out all container and initContainer images in the apply configuration before the SSA call. SSA then omits those fields from the patch, the webhook sees no image change, and the JobSet resumes with the image it was originally created with.How it works
The JobSet webhook (
vjobset.kb.io) allows onlyannotations,labels,nodeSelector,tolerations, andschedulingGatesto change on a suspended JobSet. Container images are unconditionally immutable. By omitting images from the SSA apply config, we respect this constraint: the field manager releases ownership of the image field and the value in etcd is unchanged.Tests
TestBuild_ResumeOmitsContainerImageswith 4 cases:pkg/...)