feat(metrics): expose queue inqueue resource metrics in proportion and capacity plugins #5243

Merged
volcano-sh-bot merged 5 commits into volcano-sh:master from Aman-Cool:feat/queue-inqueue-metrics on Jun 11, 2026

@Aman-Cool Aman-Cool commented Apr 24, 2026

Contributor

What type of PR is this?
/kind feature

What this PR does / why we need it:

Both the proportion and capacity plugins compute inqueue resources (what's reserved for admitted-but-not-yet-running jobs) and use them as the admission gate, but never expose them as a Prometheus metric. So when a job is stuck at Pending because the queue is full, nothing in Grafana explains why: allocated looks low, real_capacity looks fine, but inqueue is silently consuming the headroom.
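
To make the "silently consuming the headroom" case concrete, here is a minimal Go sketch. It assumes a simplified gate (allocated + inqueue + request must fit within capacity) and tracks CPU only; `queueState`, `canAdmit`, and `headroom` are hypothetical names for illustration, not the plugins' actual code.

```go
package main

import "fmt"

// queueState is a hypothetical, simplified view of one queue's CPU accounting
// in millicores. Real Volcano queues track full resource vectors.
type queueState struct {
	capacityMilliCPU  int64 // what the queue may use in total
	allocatedMilliCPU int64 // held by running tasks
	inqueueMilliCPU   int64 // held by admitted but not yet running jobs
}

// canAdmit is an assumed, simplified gate: a job is admitted only if its
// request fits in the capacity left after both allocated and inqueue.
func canAdmit(q queueState, requestMilliCPU int64) bool {
	return q.allocatedMilliCPU+q.inqueueMilliCPU+requestMilliCPU <= q.capacityMilliCPU
}

// headroom reports the capacity still available for new admissions.
func headroom(q queueState) int64 {
	return q.capacityMilliCPU - q.allocatedMilliCPU - q.inqueueMilliCPU
}

func main() {
	// Allocated is only 20% of capacity, yet inqueue holds another 70%.
	q := queueState{capacityMilliCPU: 10000, allocatedMilliCPU: 2000, inqueueMilliCPU: 7000}
	fmt.Println("headroom:", headroom(q))          // 1000
	fmt.Println("admit 2000m:", canAdmit(q, 2000)) // false
}
```

With only allocated and capacity on a dashboard, this queue appears to have 8000m free; the inqueue value is what accounts for the rejected 2000m job.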

This PR adds volcano_queue_inqueue_milli_cpu, volcano_queue_inqueue_memory_bytes, and volcano_queue_inqueue_scalar_resources, following the same pattern as the existing allocated/request/deserved gauges. Both plugins emit the metrics at session-open only, which covers all queues (including ancestor queues in hierarchical mode) with a lag of at most one scheduling cycle (roughly a few seconds).

Which issue(s) this PR fixes:
Fixes #5242

Special notes for your reviewer:

Mid-session metric updates were intentionally skipped for both plugins. For capacity, admitting a job mutates inqueue on the leaf and all ancestor queues; emitting only the leaf mid-session would give an inconsistent picture. For proportion, inqueue only increases mid-session (on job admission) with no corresponding decrease when tasks get allocated, which would cause over-reporting within a session. Session-open recomputes everything correctly each cycle.
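
The hierarchical inconsistency can be sketched in a few lines of Go. This is an illustration under assumptions, not the capacity plugin's code: `queue`, `admit`, and `publish` are hypothetical, only CPU is modelled, and a plain map stands in for the exported gauges.

```go
package main

import "fmt"

// queue is a hypothetical node in a queue hierarchy; only CPU is modelled.
type queue struct {
	name    string
	parent  *queue
	inqueue int64 // millicores for admitted but not yet running jobs
}

// admit adds the job's request to the leaf and to every ancestor,
// mirroring the propagation described for the capacity plugin.
func admit(leaf *queue, milliCPU int64) {
	for q := leaf; q != nil; q = q.parent {
		q.inqueue += milliCPU
	}
}

// publish copies the in-memory value of the given queues into the exported
// gauge map (a stand-in for a Prometheus gauge vector).
func publish(gauges map[string]int64, queues ...*queue) {
	for _, q := range queues {
		gauges[q.name] = q.inqueue
	}
}

func main() {
	root := &queue{name: "root"}
	team := &queue{name: "team", parent: root}
	leaf := &queue{name: "leaf", parent: team}
	all := []*queue{root, team, leaf}

	gauges := map[string]int64{}
	publish(gauges, all...) // session open: every queue reports 0

	admit(leaf, 500)
	publish(gauges, leaf) // leaf-only mid-session emission
	fmt.Println(gauges)   // leaf shows 500 while team and root still show 0

	publish(gauges, all...) // next session open: consistent again
	fmt.Println(gauges)
}
```

Publishing every ancestor on each admission would also avoid the stale parents, but publishing everything once at session-open is simpler and gives the same picture one cycle later.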

AI Disclosure: This change was developed with AI assistance (Claude). The author has reviewed and understands all changes.

Does this PR introduce a user-facing change?

Added Prometheus metrics volcano_queue_inqueue_milli_cpu, volcano_queue_inqueue_memory_bytes, and volcano_queue_inqueue_scalar_resources to expose per-queue inqueue resource reservations from the proportion and capacity plugins.

Copilot AI review requested due to automatic review settings April 24, 2026 21:11
@volcano-sh-bot volcano-sh-bot added the kind/feature Categorizes issue or PR as related to a new feature. label Apr 24, 2026
@volcano-sh-bot volcano-sh-bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Apr 24, 2026
@Aman-Cool Aman-Cool force-pushed the feat/queue-inqueue-metrics branch from 250d1c7 to d72c70e Compare April 24, 2026 21:12

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces new Prometheus metrics to track "inqueue" resources—specifically CPU, memory, and scalar resources—for jobs that have been admitted but are not yet running within a queue. The changes include the definition of these metrics, a new update function, and cleanup logic in the metrics package, along with integration into the capacity and proportion scheduler plugins. A review comment identifies a potential issue in the proportion plugin where updating the inqueue metric mid-session could lead to over-reporting; it is recommended to rely on the updates performed at the start of the session for consistency.

Comment thread pkg/scheduler/plugins/proportion/proportion.go Outdated

Copilot AI left a comment

Contributor

Pull request overview

Adds missing Prometheus visibility into per-queue “inqueue” (admitted-but-not-yet-running) resource reservations used by the proportion and capacity admission gates, so dashboards can explain “queue full” pending behavior.

Changes:

  • Add new queue metrics: volcano_queue_inqueue_milli_cpu, volcano_queue_inqueue_memory_bytes, volcano_queue_inqueue_scalar_resources.
  • Emit inqueue metrics from proportion and capacity plugins (session-open for both; additionally on each admission decision for proportion).
  • Add unit test coverage for UpdateQueueInqueue and metric cleanup behavior.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| pkg/scheduler/plugins/proportion/proportion.go | Emit per-queue inqueue metrics on session open and update after job admission mutates inqueue. |
| pkg/scheduler/plugins/capacity/capacity.go | Emit per-queue inqueue metrics when building queue attributes (including hierarchical mode). |
| pkg/scheduler/metrics/queue.go | Define inqueue GaugeVecs, implement UpdateQueueInqueue, and include deletion/cleanup in DeleteQueueMetrics. |
| pkg/scheduler/metrics/queue_scalar_test.go | Add test validating inqueue metric updates, scalar zeroing on removal, and series cleanup on delete. |
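
The behaviors named for the test (update, scalar zeroing on removal, series cleanup on delete) can be sketched with the standard library alone. This is a rough model, not the code in pkg/scheduler/metrics/queue.go: a map stands in for the Prometheus GaugeVecs, and `store`, `series`, `updateQueueInqueue`, and `deleteQueueMetrics` are hypothetical analogues whose signatures are assumptions.

```go
package main

import "fmt"

// series identifies one exported time series: metric name, queue, and an
// optional scalar resource name (empty for CPU and memory).
type series struct {
	metric, queue, resource string
}

// store is a stdlib stand-in for a set of Prometheus gauge vectors.
type store map[series]float64

const scalarMetric = "volcano_queue_inqueue_scalar_resources"

// updateQueueInqueue is a hypothetical analogue of UpdateQueueInqueue: it sets
// CPU, memory, and each scalar, and zeroes scalars the queue no longer reports.
func (s store) updateQueueInqueue(queue string, milliCPU, memBytes float64, scalars map[string]float64) {
	s[series{"volcano_queue_inqueue_milli_cpu", queue, ""}] = milliCPU
	s[series{"volcano_queue_inqueue_memory_bytes", queue, ""}] = memBytes
	for k := range s {
		if k.metric == scalarMetric && k.queue == queue {
			if _, ok := scalars[k.resource]; !ok {
				s[k] = 0 // resource disappeared: report zero, not a stale value
			}
		}
	}
	for name, v := range scalars {
		s[series{scalarMetric, queue, name}] = v
	}
}

// deleteQueueMetrics removes every series that belongs to the queue.
func (s store) deleteQueueMetrics(queue string) {
	for k := range s {
		if k.queue == queue {
			delete(s, k)
		}
	}
}

func main() {
	s := store{}
	s.updateQueueInqueue("q1", 4000, 8<<30, map[string]float64{"nvidia.com/gpu": 2})
	s.updateQueueInqueue("q1", 1000, 2<<30, nil) // the GPU job is no longer inqueue
	fmt.Println(s[series{scalarMetric, "q1", "nvidia.com/gpu"}]) // 0
	s.deleteQueueMetrics("q1")
	fmt.Println(len(s)) // 0
}
```

Zeroing a vanished scalar rather than leaving the last value in place keeps a dashboard from showing a reservation that no longer exists, while deletion on queue removal keeps series for dead queues from lingering.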

…d capacity plugins

Signed-off-by: Aman-Cool <aman017102007@gmail.com>
@Aman-Cool Aman-Cool force-pushed the feat/queue-inqueue-metrics branch from d72c70e to 2101cfc Compare April 24, 2026 21:13
Signed-off-by: Aman-Cool <aman017102007@gmail.com>
@Aman-Cool Aman-Cool changed the title feat(metrics): expose queue inqueue resource metrics in proportion an… feat(metrics): expose queue inqueue resource metrics in proportion and capacity plugins Apr 24, 2026
…in TestUpdateQueueInqueue

Signed-off-by: Aman-Cool <aman017102007@gmail.com>
@Aman-Cool

Contributor Author

/assign @hzxuzhonghu

@Aman-Cool

Contributor Author

/assign @hajnalmt

@Aman-Cool

Contributor Author

/assign @JesseStutler

Signed-off-by: Aman-Cool <aman017102007@gmail.com>
@Aman-Cool Aman-Cool force-pushed the feat/queue-inqueue-metrics branch from a289843 to 3a35978 Compare May 3, 2026 17:28

@hajnalmt hajnalmt left a comment

Member

Thanks for the contribution! This is a good idea.

Please find my remarks below.

Comment thread docs/design/metrics.md Outdated
Comment thread pkg/scheduler/metrics/queue.go Outdated
Comment thread pkg/scheduler/metrics/queue_scalar_test.go Outdated
- remove "reserved" from inqueue metric Help strings
- use "admitted but not yet running" (no hyphens) consistently
- move TestUpdateQueueInqueue into TestQueueResourceMetric in queue_test.go
- remove TestUpdateQueueInqueue from queue_scalar_test.go

Signed-off-by: Aman-Cool <aman017102007@gmail.com>
@Aman-Cool

Contributor Author

@hajnalmt, removed "reserved", fixed the hyphenation, and moved the inqueue test into TestQueueResourceMetric in queue_test.go.
lmk if anything else needs changing...

@Aman-Cool Aman-Cool requested a review from hajnalmt May 4, 2026 14:00
@Aman-Cool

Contributor Author

Hey @hajnalmt, just checking in: are there any other changes needed from my side before this can move forward? Happy to address anything!

Aman-Cool commented Jun 2, 2026

Contributor Author

@hzxuzhonghu, could you have a look at this when you get some time? 🙏

@hzxuzhonghu hzxuzhonghu left a comment

Member

/lgtm

@volcano-sh-bot volcano-sh-bot added the lgtm Indicates that a PR is ready to be merged. label Jun 4, 2026

@hajnalmt hajnalmt left a comment

Member

/lgtm

@volcano-sh-bot

Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hajnalmt, hzxuzhonghu


@volcano-sh-bot volcano-sh-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 11, 2026
@volcano-sh-bot volcano-sh-bot merged commit a4830bf into volcano-sh:master Jun 11, 2026
25 checks passed
Linked issue: enhancement(scheduler): expose queue inqueue resource metrics in proportion and capacity plugins