fix: update armada to respect node pod limits #4517
jparraga-stackav wants to merge 4 commits into armadaproject:master
Conversation
Signed-off-by: Jason Parraga <jparraga@stackav.com>
FailedPodChecks          podchecks.FailedChecks
PendingPodChecks         *podchecks.Checks
FatalPodSubmissionErrors []string
ResourcesToSanitize      []string
}
Can you please add docs for the new field?
In our docs, we often link to config on https://pkg.go.dev; for the executor that would be https://pkg.go.dev/github.com/armadaproject/armada/internal/executor/configuration#ApplicationConfiguration. It is easier for end-users if they can see the config field docs there.
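For example, a doc comment like the following would surface on pkg.go.dev automatically (a sketch only: the struct name, doc wording, and import path are assumptions; the fields are from the diff above):

```go
package configuration

// Import path is an assumption; the podchecks types are from the diff above.
import "github.com/armadaproject/armada/internal/executor/configuration/podchecks"

// SubmitConfiguration is a hypothetical stand-in for the config struct
// touched in this diff.
type SubmitConfiguration struct {
	FailedPodChecks          podchecks.FailedChecks
	PendingPodChecks         *podchecks.Checks
	FatalPodSubmissionErrors []string
	// ResourcesToSanitize lists resource names (e.g. "pods") that may be
	// used during scheduling but are invalid at runtime and so must be
	// stripped from pod specs before submission to Kubernetes.
	ResourcesToSanitize []string
}
```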
// Sanitizes pod resources that may be used during scheduling but are invalid at runtime.
func (submitService *SubmitService) sanitizePodResources(pod *v1.Pod) {
nit: this can be simplified to a plain sanitizePodResources function; there is no need for it to be a receiver method on SubmitService. The same applies to sanitizeResourceList.
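A minimal sketch of the plain-function shape being suggested (the resourcesToSanitize parameter is an assumption about how the config would be threaded through; the real change is in the PR diff):

```go
import v1 "k8s.io/api/core/v1"

// sanitizePodResources strips resource names that are valid during
// scheduling but invalid at runtime (e.g. "pods") from every
// container's requests and limits.
func sanitizePodResources(pod *v1.Pod, resourcesToSanitize []string) {
	for i := range pod.Spec.Containers {
		sanitizeResourceList(pod.Spec.Containers[i].Resources.Requests, resourcesToSanitize)
		sanitizeResourceList(pod.Spec.Containers[i].Resources.Limits, resourcesToSanitize)
	}
}

// sanitizeResourceList deletes the named resources from a resource
// list; deleting from a nil map is a safe no-op.
func sanitizeResourceList(list v1.ResourceList, resourcesToSanitize []string) {
	for _, name := range resourcesToSanitize {
		delete(list, v1.ResourceName(name))
	}
}
```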
Hey, so generally a good direction, but there are a few things I think we'd like to consider / possibly change.
@dejanzele @nikola-jokic to review
Closing in favour of #4841
…ity (#4841)

## Summary

- Adds a `scheduling.respectNodePodLimits` feature flag (default `false`) that enables the scheduler to track `pods` as a resource and reject scheduling to nodes that have exhausted their pod limit (`node.Status.Allocatable["pods"]`)
- When enabled, the scheduler programmatically registers `pods` in `supportedResourceTypes` and `indexedResources` at startup, and injects `pods: 1` into every job's internal resource requirements
- The executor now always reports the non-Armada pod count in `NonArmadaAllocatedResources` so the scheduler can subtract system/DaemonSet pods from available capacity

Fixes #4515

This PR builds on top of #4517; big thanks to @Sovietaced for the initial work.

## Operator upgrade notes

- **Executor change is unconditional.** After the executor upgrade, `NonArmadaAllocatedResources` gains a `pods` key in every report, regardless of whether any scheduler has the flag enabled. Dashboards, metrics, or custom consumers that iterate this map generically (e.g. summing over all keys) will start including pod counts. Audit Prometheus / Grafana panels before rollout.
- **Rollback is clean.** Reverting the scheduler flag to `false` stops the scheduler from tracking pods; reverting the executor binary removes the `pods` key from its reports. Neither requires data migration.
- **Rolling upgrade order is flexible.** Old scheduler + new executor is safe (the scheduler's `FromNodeProto` silently drops unknown resources). New scheduler + old executor briefly overestimates free pod capacity by the count of non-Armada pods per node (typically ~10-30: DaemonSets plus system pods), since the old executor does not report them. The overestimate resolves as executors are upgraded.

## Known limitations

- `pods` is **not** added to `dominantResourceFairnessResourcesToConsider`. On dense-pod nodes (e.g. GKE's 110-pod limit) a queue running many small pods can monopolize pod slots without a fair-share penalty. Deferred per reviewer request; follow up if this becomes a problem in practice.

Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
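As a rough illustration of the accounting the flag enables (names and structure here are illustrative assumptions, not the scheduler's actual internals):

```go
// nodePodCapacity sketches the per-node pod accounting described above.
type nodePodCapacity struct {
	allocatablePods int64 // node.Status.Allocatable["pods"]
	nonArmadaPods   int64 // system/DaemonSet pods, from NonArmadaAllocatedResources
	armadaPods      int64 // Armada pods already bound to this node
}

// canScheduleOneMore reports whether one more pod fits. Because every
// job implicitly requests pods: 1 when the flag is on, a node whose
// pod limit is exhausted is rejected instead of accumulating pods
// stuck in Pending.
func (c nodePodCapacity) canScheduleOneMore() bool {
	return c.armadaPods+c.nonArmadaPods+1 <= c.allocatablePods
}
```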
What type of PR is this?
Bug fix
What this PR does / why we need it:
This pull request updates Armada to respect node pod limits. Without this change it is possible for Armada to try to schedule pods onto nodes that have no remaining capacity for concurrently running pods. When this happens, the pods get stuck in a Pending state and eventually have their leases returned. Presumably Armada also counts these pods towards fair share even though they cannot run.
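For context, the limit in question is the node's allocatable pod count. A minimal client-go sketch of reading it (the function and variable names here are illustrative):

```go
import (
	"context"
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// printPodLimit prints the per-node pod limit that Armada previously
// ignored when scheduling.
func printPodLimit(ctx context.Context, clientset kubernetes.Interface, nodeName string) error {
	node, err := clientset.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	podLimit := node.Status.Allocatable[v1.ResourcePods] // e.g. 110 on many clusters
	fmt.Printf("%s allows at most %s pods\n", nodeName, podLimit.String())
	return nil
}
```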
Which issue(s) this PR fixes:
Fixes #4515
Special notes for your reviewer:
In order to roll out safely, the executors must be updated first. After that, the scheduler/server can be rolled out in any order.