Skip to content

Fix unbounded job.Status.Conditions growth#5422

Open
avinxshKD wants to merge 4 commits into
volcano-sh:masterfrom
avinxshKD:fix/job-conditions-unbounded-growth
Open

Fix unbounded job.Status.Conditions growth#5422
avinxshKD wants to merge 4 commits into
volcano-sh:masterfrom
avinxshKD:fix/job-conditions-unbounded-growth

Conversation

@avinxshKD

Copy link
Copy Markdown

What type of PR is this?

/kind bug

What this PR does / why we need it:

This PR fixes an issue where job.Status.Conditions grows unbounded during frequent job phase transitions (especially for jobs with RestartJob policies that retry often). Previously, every phase transition unconditionally appended a new condition without deduplication or a maximum length check.

For heavily retried jobs, this slice could easily accumulate thousands of entries, eventually causing the UpdateStatus call to permanently fail with etcdserver: request is too large. At this point, the job becomes unrecoverable without manual deletion.

This fix introduces an appendJobCondition helper that:

  1. Skips appending if the new condition's phase is identical to the most recent one (deduplicating redundant consecutive states).
  2. Enforces a hard cap of maxJobConditions = 32, truncating the oldest entries to ensure the etcd object size remains bounded and safe.

Which issue(s) this PR fixes:

Fixes #5421

Special notes for your reviewer:

The maxJobConditions is set to 32, which is more than enough to capture the recent historical context of a job's state transitions without risking etcd size limits.

Does this PR introduce a user-facing change?

Fixed an issue where frequently restarting jobs could exceed etcd object size limits by capping `job.Status.Conditions` to a maximum of 32 entries and deduplicating consecutive identical phases.

@volcano-sh-bot volcano-sh-bot added the kind/bug Categorizes issue or PR as related to a bug. label Jun 10, 2026
@volcano-sh-bot

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign jessestutler for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@volcano-sh-bot

Copy link
Copy Markdown
Contributor

Welcome @avinxshKD! It looks like this is your first PR to volcano-sh/volcano 🎉

@volcano-sh-bot volcano-sh-bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jun 10, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request limits the number of job conditions stored in job.Status.Conditions to a maximum of 32 to prevent unbounded growth in etcd. It introduces the appendJobCondition helper function to handle deduplication of consecutive identical phases and truncation of older entries, accompanied by unit tests. The review feedback points out a potential memory leak in the truncation logic, where slicing the slice from the front keeps discarded pointers pinned in memory, and suggests copying the remaining elements to a new slice instead.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread pkg/controllers/job/job_controller_actions.go Outdated
@avinxshKD

Copy link
Copy Markdown
Author

@JesseStutler Just a quick note on this one: the core fix is very minimal. The bulk of the lines added here are just the new unit tests in job_controller_actions_test.go to ensure the truncation and deduplication work exactly as expected.

Happy to make any adjustments to the cap size or logic if needed, thanks

This adds a maxJobConditions cap (32) and deduplicates consecutive identical condition phases, preventing etcd object-size exceeded errors when frequently retrying jobs.

Signed-off-by: Avinash Kumar Deepak <avinash8655279@gmail.com>
Allocates a new slice and copies elements instead of slicing the existing array, ensuring old pointers are safely garbage collected.

Signed-off-by: Avinash Kumar Deepak <avinash8655279@gmail.com>
Comment thread pkg/controllers/job/job_controller_actions.go Outdated
Comment thread pkg/controllers/job/job_controller_actions.go Outdated
Signed-off-by: Avinash Kumar Deepak <avinash8655279@gmail.com>
@avinxshKD avinxshKD requested a review from hzxuzhonghu June 11, 2026 07:10
@avinxshKD

Copy link
Copy Markdown
Author

@hzxuzhonghu lgtm, happy to make any further changes if needed thanks.

// slice would exceed maxJobConditions the oldest entries are discarded so that
// only the most recent maxJobConditions entries are kept.
func appendJobCondition(conditions []batch.JobCondition, c batch.JobCondition) []batch.JobCondition {
if n := len(conditions); n > 0 && conditions[n-1].Status == c.Status {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

THis is not right. You should use a loop and match delete the original condition and append the new condition at last

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've rewritten the logic exactly as requested, it now loops through to filter out any existing condition with the matching phase, and appends the new one at the end.

Tests are updated and passingg

Signed-off-by: Avinash Kumar Deepak <avinash8655279@gmail.com>
@volcano-sh-bot volcano-sh-bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jun 11, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

kind/bug Categorizes issue or PR as related to a bug. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

job.Status.Conditions grows unbounded, eventually causing UpdateStatus to fail with etcd object-size exceeded

3 participants