Fix unbounded job.Status.Conditions growth#5422
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
Welcome @avinxshKD! It looks like this is your first PR to volcano-sh/volcano 🎉 |
There was a problem hiding this comment.
Code Review
This pull request limits the number of job conditions stored in job.Status.Conditions to a maximum of 32 to prevent unbounded growth in etcd. It introduces the appendJobCondition helper function to handle deduplication of consecutive identical phases and truncation of older entries, accompanied by unit tests. The review feedback points out a potential memory leak in the truncation logic, where slicing the slice from the front keeps discarded pointers pinned in memory, and suggests copying the remaining elements to a new slice instead.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
|
@JesseStutler Just a quick note on this one: the core fix is very minimal. The bulk of the lines added here are just the new unit tests in job_controller_actions_test.go to ensure the truncation and deduplication work exactly as expected. Happy to make any adjustments to the cap size or logic if needed, thanks |
35eb874 to
f64afad
Compare
21d95d6 to
9851de1
Compare
This adds a maxJobConditions cap (32) and deduplicates consecutive identical condition phases, preventing etcd object-size exceeded errors when frequently retrying jobs. Signed-off-by: Avinash Kumar Deepak <avinash8655279@gmail.com>
Allocates a new slice and copies elements instead of slicing the existing array, ensuring old pointers are safely garbage collected. Signed-off-by: Avinash Kumar Deepak <avinash8655279@gmail.com>
744397b to
c214fe1
Compare
Signed-off-by: Avinash Kumar Deepak <avinash8655279@gmail.com>
|
@hzxuzhonghu lgtm, happy to make any further changes if needed thanks. |
| // slice would exceed maxJobConditions the oldest entries are discarded so that | ||
| // only the most recent maxJobConditions entries are kept. | ||
| func appendJobCondition(conditions []batch.JobCondition, c batch.JobCondition) []batch.JobCondition { | ||
| if n := len(conditions); n > 0 && conditions[n-1].Status == c.Status { |
There was a problem hiding this comment.
THis is not right. You should use a loop and match delete the original condition and append the new condition at last
There was a problem hiding this comment.
I've rewritten the logic exactly as requested, it now loops through to filter out any existing condition with the matching phase, and appends the new one at the end.
Tests are updated and passingg
Signed-off-by: Avinash Kumar Deepak <avinash8655279@gmail.com>
What type of PR is this?
/kind bug
What this PR does / why we need it:
This PR fixes an issue where
job.Status.Conditionsgrows unbounded during frequent job phase transitions (especially for jobs withRestartJobpolicies that retry often). Previously, every phase transition unconditionally appended a new condition without deduplication or a maximum length check.For heavily retried jobs, this slice could easily accumulate thousands of entries, eventually causing the
UpdateStatuscall to permanently fail withetcdserver: request is too large. At this point, the job becomes unrecoverable without manual deletion.This fix introduces an
appendJobConditionhelper that:maxJobConditions = 32, truncating the oldest entries to ensure the etcd object size remains bounded and safe.Which issue(s) this PR fixes:
Fixes #5421
Special notes for your reviewer:
The
maxJobConditionsis set to 32, which is more than enough to capture the recent historical context of a job's state transitions without risking etcd size limits.Does this PR introduce a user-facing change?