feat(metrics): expose queue inqueue resource metrics in proportion and capacity plugins#5243
Conversation
250d1c7 to
d72c70e
Compare
There was a problem hiding this comment.
Code Review
This pull request introduces new Prometheus metrics to track "inqueue" resources—specifically CPU, memory, and scalar resources—for jobs that have been admitted but are not yet running within a queue. The changes include the definition of these metrics, a new update function, and cleanup logic in the metrics package, along with integration into the capacity and proportion scheduler plugins. A review comment identifies a potential issue in the proportion plugin where updating the inqueue metric mid-session could lead to over-reporting; it is recommended to rely on the updates performed at the start of the session for consistency.
There was a problem hiding this comment.
Pull request overview
Adds missing Prometheus visibility into per-queue “inqueue” (admitted-but-not-yet-running) resource reservations used by the proportion and capacity admission gates, so dashboards can explain “queue full” pending behavior.
Changes:
- Add new queue metrics:
volcano_queue_inqueue_milli_cpu,volcano_queue_inqueue_memory_bytes,volcano_queue_inqueue_scalar_resources. - Emit
inqueuemetrics from proportion and capacity plugins (session-open for both; additionally on each admission decision for proportion). - Add unit test coverage for
UpdateQueueInqueueand metric cleanup behavior.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| pkg/scheduler/plugins/proportion/proportion.go | Emit per-queue inqueue metrics on session open and update after job admission mutates inqueue. |
| pkg/scheduler/plugins/capacity/capacity.go | Emit per-queue inqueue metrics when building queue attributes (including hierarchical mode). |
| pkg/scheduler/metrics/queue.go | Define inqueue GaugeVecs, implement UpdateQueueInqueue, and include deletion/cleanup in DeleteQueueMetrics. |
| pkg/scheduler/metrics/queue_scalar_test.go | Add test validating inqueue metric updates, scalar zeroing on removal, and series cleanup on delete. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…d capacity plugins Signed-off-by: Aman-Cool <aman017102007@gmail.com>
d72c70e to
2101cfc
Compare
Signed-off-by: Aman-Cool <aman017102007@gmail.com>
…in TestUpdateQueueInqueue Signed-off-by: Aman-Cool <aman017102007@gmail.com>
|
/assign @hzxuzhonghu |
|
/assign @hajnalmt |
|
/assign @JesseStutler |
Signed-off-by: Aman-Cool <aman017102007@gmail.com>
a289843 to
3a35978
Compare
hajnalmt
left a comment
There was a problem hiding this comment.
Thanks for the contribution! This is a good idea.
Please find my remarks below.
- remove "reserved" from inqueue metric Help strings - use "admitted but not yet running" (no hyphens) consistently - move TestUpdateQueueInqueue into TestQueueResourceMetric in queue_test.go - remove TestUpdateQueueInqueue from queue_scalar_test.go Signed-off-by: Aman-Cool <aman017102007@gmail.com>
|
@hajnalmt, removed "reserved", fixed the hyphenation, and moved the inqueue test into |
|
Hey @hajnalmt, just checking in.., are there any other changes needed from my side before this can move forward? Happy to address anything! |
|
@hzxuzhonghu, could you have a look into this when you get some time🙏 |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: hajnalmt, hzxuzhonghu The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
What type of PR is this?
/kind feature
What this PR does / why we need it:
Both proportion and capacity plugins compute inqueue resources (what's reserved for admitted-but-not-yet-running jobs) and use it as the admission gate, but never expose it as a Prometheus metric. This means when a job is stuck at Pending because the queue is full, there's nothing in Grafana to explain why;
allocatedlooks low,real_capacitylooks fine, butinqueueis silently consuming the headroom.This PR adds
volcano_queue_inqueue_milli_cpu,volcano_queue_inqueue_memory_bytes, andvolcano_queue_inqueue_scalar_resources; same pattern as the existingallocated/request/deservedgauges. Both plugins emit the metric at session-open only, which correctly covers all queues (including ancestor queues in hierarchical mode) with a lag of at most one scheduling cycle (~a few seconds).Which issue(s) this PR fixes:
Fixes #5242
Special notes for your reviewer:
Mid-session metric updates were intentionally skipped for both plugins. For capacity, admitting a job mutates
inqueueon the leaf and all ancestor queues; emitting only the leaf mid-session would give an inconsistent picture. For proportion,inqueueonly increases mid-session (on job admission) with no corresponding decrease when tasks get allocated, which would cause over-reporting within a session. Session-open recomputes everything correctly each cycle.AI Disclosure: This change was developed with AI assistance (Claude). The author has reviewed and understands all changes.
Does this PR introduce a user-facing change?
Added Prometheus metrics
volcano_queue_inqueue_milli_cpu,volcano_queue_inqueue_memory_bytes, andvolcano_queue_inqueue_scalar_resourcesto expose per-queue inqueue resource reservations from the proportion and capacity plugins.