Skip to content

fix(jobset): omit container images from SSA apply config when resuming suspended TrainJob#3388

Draft
hrathina wants to merge 1 commit intokubeflow:masterfrom
hrathina:fix/immutable-jobset-recreate-on-resume
Draft

fix(jobset): omit container images from SSA apply config when resuming suspended TrainJob#3388
hrathina wants to merge 1 commit intokubeflow:masterfrom
hrathina:fix/immutable-jobset-recreate-on-resume

Conversation

@hrathina
Copy link
Copy Markdown

@hrathina hrathina commented Mar 25, 2026

What this does

Fixes #3387

When a suspended TrainJob is resumed after a TrainingRuntime image update, the Trainer controller includes the updated image in its SSA patch. The JobSet webhook rejects this because spec.replicatedJobs container images are immutable — even on suspended JobSets.

Fix: In Build(), when a resume is detected (oldJobSet.Suspend=true, trainJob.Suspend=false), nil out all container and initContainer images in the apply configuration before the SSA call. SSA then omits those fields from the patch, the webhook sees no image change, and the JobSet resumes with the image it was originally created with.

Note: This is an interim fix addressing the specific symptom. The approach is still under discussion.

How it works

The JobSet webhook (vjobset.kb.io) allows only annotations, labels, nodeSelector, tolerations, and schedulingGates to change on a suspended JobSet. Container images are unconditionally immutable. By omitting images from the SSA apply config, we respect this constraint: the field manager releases ownership of the image field and the value in etcd is unchanged.

Tests

  • Added TestBuild_ResumeOmitsContainerImages with 4 cases:
    • Suspended JobSet being resumed → images are nil in apply config ✅
    • No existing JobSet (creation path) → images present ✅
    • Both running (existing early-exit guard) → nil result ✅
    • TrainJob suspended (not a resume) → images present ✅
  • All existing unit tests pass (pkg/...)
  • Integration tests pass

When the TrainJob controller resumes a suspended JobSet, it issues a
Server-Side Apply (SSA) patch that includes the full desired spec —
including container images. The JobSet admission webhook rejects any
update to spec.replicatedJobs container images because that field is
immutable, even for suspended JobSets.

Fix: when Build() detects a resume (oldJobSet.Suspend=true,
trainJob.Suspend=false), nil out all container and initContainer images
in the apply configuration before the SSA call. SSA then omits those
fields from the patch, the webhook sees no image change, and the JobSet
resumes successfully with the image it was originally created with.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings March 25, 2026 15:36
@google-oss-prow
Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign gaocegege for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses a JobSet admission webhook immutability constraint by preventing Server-Side Apply (SSA) from attempting to patch spec.replicatedJobs[*] container images when resuming a previously suspended TrainJob.

Changes:

  • Detect the “resume from suspended JobSet” scenario in JobSet.Build() and nil out container/initContainer image fields in the SSA apply configuration so those fields are omitted from the apply patch.
  • Add a new unit test covering resume vs. non-resume behavior and the existing “both running” early-exit guard.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
pkg/runtime/framework/plugins/jobset/jobset.go Omits container/initContainer images from SSA apply config on resume to avoid immutable-field webhook rejections.
pkg/runtime/framework/plugins/jobset/jobset_test.go Adds unit coverage for the resume scenario and related Build() paths.

Comment on lines +339 to +343
for j := range podSpec.Containers {
podSpec.Containers[j].Image = nil
}
for j := range podSpec.InitContainers {
podSpec.InitContainers[j].Image = nil
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The inner loops use for j := range ... which shadows the method receiver j and makes this block harder to read/grep; rename the loop indices (e.g., cIdx, icIdx) to avoid shadowing.

Suggested change
for j := range podSpec.Containers {
podSpec.Containers[j].Image = nil
}
for j := range podSpec.InitContainers {
podSpec.InitContainers[j].Image = nil
for cIdx := range podSpec.Containers {
podSpec.Containers[cIdx].Image = nil
}
for icIdx := range podSpec.InitContainers {
podSpec.InitContainers[icIdx].Image = nil

Copilot uses AI. Check for mistakes.
Comment on lines +1673 to +1680
WithTemplate(corev1ac.PodTemplateSpec().
WithSpec(corev1ac.PodSpec().
WithContainers(
corev1ac.Container().
WithName(constants.Node).
WithImage(upgradedImage),
),
),
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test intends to verify both containers and initContainers have images omitted on resume, but the apply template only defines WithContainers(...) and never sets any InitContainers, so the initContainer assertions never exercise the new behavior; add an initContainer with an image to the template (and possibly the PodSets) so the resume/non-resume cases actually validate initContainer image handling.

Copilot generated this review using guidance from repository custom instructions.
@hrathina hrathina marked this pull request as draft March 25, 2026 15:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug: suspended TrainJob stuck in broken state when TrainingRuntime image is updated

2 participants