While working on energy awareness, the following idea came up.
Description
Summary
Add support for a "Generic Metric" type that allows users to define complex Prometheus queries directly in the Datasource CR and use the results for weighing or filtering hosts during scheduling decisions. This would enable building weigher/filter logic purely in Prometheus queries without requiring custom Go code for each new metric.
Motivation
Currently, adding a new Prometheus-based weigher or filter requires:
- Creating a new typed metric struct in internal/knowledge/datasources/plugins/prometheus/types.go
- Implementing a new extractor in internal/knowledge/extractor/plugins/
- Implementing a new weigher/filter in internal/scheduling/<domain>/plugins/weighers/
- Registering the new plugin in the index
This is a significant amount of boilerplate for what is sometimes a simple "query Prometheus -> map to hosts -> apply weight/filter" pattern. A generic implementation would allow operators to:
- Rapidly prototype new scheduling heuristics
- Use complex PromQL aggregations without code changes
- Experiment with different metrics without redeploying Cortex
Use Case Examples
Example 1: Weigher - Prefer Hosts with Low CPU Usage
Goal: Prefer hosts where the CPU is mostly idle (high idle ratio).
Prometheus Query:
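1 - (
  sum by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))
  /
  sum by (instance) (rate(node_cpu_seconds_total[5m]))
)
This returns a value between 0 and 1, where 1 = fully idle, 0 = fully utilized. Summing rather than averaging keeps the numerator and denominator on the same scale, since they cover different numbers of mode series per instance.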
Desired Behavior:
- Higher values (more idle) -> higher weight
- This would be configured as a weigher in the pipeline
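The desired weigher behavior can be sketched in Go as follows. This is only an illustration under stated assumptions: `weighByMetric` is a hypothetical helper, not an existing Cortex function, the metric values are assumed to be keyed by Nova compute host already, and giving hosts without a sample a weight of 0 is one possible policy among several.

```go
package main

import (
	"fmt"
	"math"
)

// weighByMetric turns per-host metric values into weights. It assumes the
// query result is meant to lie in [0, 1], with higher meaning "prefer this
// host", and clamps anything outside that range.
// Assumption: hosts without a sample (or with NaN) get weight 0.
func weighByMetric(hosts []string, values map[string]float64) map[string]float64 {
	weights := make(map[string]float64, len(hosts))
	for _, h := range hosts {
		v, ok := values[h]
		if !ok || math.IsNaN(v) {
			weights[h] = 0
			continue
		}
		weights[h] = math.Max(0, math.Min(1, v))
	}
	return weights
}

func main() {
	hosts := []string{"compute-node-01", "compute-node-02", "compute-node-03"}
	// Hypothetical idle ratios, already mapped to compute host names.
	values := map[string]float64{
		"compute-node-01": 0.92,
		"compute-node-02": 0.15,
	}
	weights := weighByMetric(hosts, values)
	for _, h := range hosts {
		fmt.Printf("%s -> %.2f\n", h, weights[h])
	}
}
```

Whether a missing sample should count as 0, as neutral, or cause the host to be skipped is a design choice the generic weigher would need to make explicit.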
Example Datasource CR:
apiVersion: cortex.cloud/v1alpha1
kind: Datasource
metadata:
  name: node-cpu-idle-ratio
spec:
  schedulingDomain: nova
  type: prometheus
  prometheus:
    query: |
      1 - (
        sum by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))
        /
        sum by (instance) (rate(node_cpu_seconds_total[5m]))
      )
    alias: node_cpu_idle_ratio
    type: generic # new

Example 2: Filter - Exclude Hosts Under IRQ Pressure
Goal: Filter out hosts experiencing significant IRQ stall pressure.
Prometheus Query:
rate(node_pressure_irq_stalled_seconds_total[5m]) > bool 0.005
This returns 1 (true) for hosts under pressure, 0 (false) for healthy hosts.
Desired Behavior:
- Hosts returning 1 -> filtered out (excluded from scheduling)
- Hosts returning 0 -> kept as valid candidates
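A minimal Go sketch of this filter behavior is shown here. `filterByMetric` is a hypothetical helper, not part of Cortex, and it assumes the values are already keyed by compute host. Keeping hosts that have no sample (fail-open) is an assumption; a stricter filter could drop them instead.

```go
package main

import "fmt"

// filterByMetric keeps every host whose metric value is not 1.
// Assumption: hosts without a sample are kept (fail-open).
func filterByMetric(hosts []string, values map[string]float64) []string {
	kept := make([]string, 0, len(hosts))
	for _, h := range hosts {
		if v, ok := values[h]; ok && v == 1 {
			continue // under pressure: exclude from scheduling
		}
		kept = append(kept, h)
	}
	return kept
}

func main() {
	hosts := []string{"compute-node-01", "compute-node-02", "compute-node-03"}
	// Hypothetical results of the "> bool" query, mapped to compute hosts.
	values := map[string]float64{
		"compute-node-01": 1,
		"compute-node-02": 0,
	}
	fmt.Println(filterByMetric(hosts, values))
}
```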
Example Datasource CR:
apiVersion: cortex.cloud/v1alpha1
kind: Datasource
metadata:
  name: node-irq-pressure-filter
spec:
  schedulingDomain: nova
  type: prometheus
  prometheus:
    query: |
      rate(node_pressure_irq_stalled_seconds_total[5m]) > bool 0.005
    alias: node_irq_pressure
    type: generic # new

Key Challenge: Label-to-Subject Mapping
Prometheus metrics use labels such as instance, node, host, or custom labels to identify targets. However, Nova (and other OpenStack services) use their own naming conventions for compute hosts (e.g., compute_host from the OpenStack API).
Example Mismatch:
| Prometheus Label | Nova Compute Host |
|---|---|
| instance="10.0.0.5:9100" | compute-node-01 |
| node="worker-1.internal" | nova-compute-worker-1 |
| hostsystem="esxi-host-42.vcenter.local" | vc-a-0-runq42 |
Current State:
Cortex already solves this for vROps metrics using mapping Knowledges like vmware-resolved-hostsystems that translate vROps hostsystem names to Nova compute hosts.
Needed Solution:
The generic metric implementation needs a flexible way to map Prometheus labels to scheduling subjects. Some ideas:
- Direct Mapping: Label value directly matches subject name (simplest case)
- Mapping Knowledge Reference: Reference an existing Knowledge CR that contains the label-to-subject mapping
- Label Transformation Template: Apply a transformation (e.g., strip suffix, regex extract) expressed as a Go text/template
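The three ideas above could look roughly like this in Go. All function names are hypothetical, and the lookup table merely stands in for whatever a mapping Knowledge CR would contain. Note that `trimSuffix` is not a built-in of Go's `text/template`; the sketch registers it by hand with the suffix-first argument order that the proposed API example assumes.

```go
package main

import (
	"bytes"
	"fmt"
	"strings"
	"text/template"
)

// directMapping uses the label value itself as the subject name.
func directMapping(labels map[string]string, label string) (string, bool) {
	v, ok := labels[label]
	return v, ok
}

// tableMapping looks the label value up in a table. The table is a stand-in
// for the contents of a mapping Knowledge CR (an assumption of this sketch).
func tableMapping(labels map[string]string, label string, table map[string]string) (string, bool) {
	v, ok := labels[label]
	if !ok {
		return "", false
	}
	subject, ok := table[v]
	return subject, ok
}

// templateMapping renders a text/template against the label set.
// trimSuffix is registered manually here; it is not a text/template built-in.
func templateMapping(labels map[string]string, tmpl string) (string, error) {
	funcs := template.FuncMap{
		"trimSuffix": func(suffix, s string) string { return strings.TrimSuffix(s, suffix) },
	}
	t, err := template.New("mapping").Funcs(funcs).Option("missingkey=error").Parse(tmpl)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := t.Execute(&buf, labels); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	labels := map[string]string{"instance": "10.0.0.5:9100"}

	if s, ok := directMapping(labels, "instance"); ok {
		fmt.Println("direct:", s)
	}

	table := map[string]string{"10.0.0.5:9100": "compute-node-01"}
	if s, ok := tableMapping(labels, "instance", table); ok {
		fmt.Println("table:", s)
	}

	s, err := templateMapping(labels, `{{ trimSuffix ":9100" .instance }}`)
	if err != nil {
		fmt.Println("template error:", err)
		return
	}
	fmt.Println("template:", s)
}
```

The template variant only rewrites the label value, so it helps when the subject name can be derived from the label. Cases like the IP-to-hostname row in the table above would still need a lookup.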
Proposed API (Ideas Welcome!)
apiVersion: cortex.cloud/v1alpha1
kind: Pipeline
metadata:
  name: nova-external-scheduler
spec:
  schedulingDomain: nova
  type: filter-weigher
  filters:
    - name: generic
      params:
        datasource: node-irq-pressure-filter
        mapping:
          # Option A: Label value directly matches subject name
          label: "instance"
          # Option B: Reference an existing mapping Knowledge CR
          knowledgeRef: "prometheus-to-nova-mapping"
          # Option C: Apply a transformation template to the label
          transform: "{{ trimSuffix \":9100\" .instance }}"
  weighers:
    - name: generic
      weight: 1.0
      params:
        datasource: node-cpu-idle-ratio
        mapping:
          knowledgeRef: "prometheus-to-nova-mapping"

Questions for Discussion
- General Interest: Do you consider this feature broadly useful? In my case, it would simplify energy-aware weighing.
- Mapping Strategy: Which approach (or combination) for label-to-subject mapping makes the most sense?
- Query Execution: Should the generic metric always be evaluated as an instant query (single point in time), with all temporal logic expressed in PromQL itself?
- Knowledge Integration: The current workflow is built around Knowledge CRs. Should we use one Knowledge CR per Datasource, or a single shared one?
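On the query execution question, the sketch below shows what consuming an instant query result might involve. It assumes the standard Prometheus HTTP API response shape for instant queries (a `vector` result whose samples carry a `[timestamp, "value"]` pair). `parseInstantVector` and its `label` parameter are hypothetical names, and the sample JSON is made up for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// instantResponse mirrors the parts of an instant query response that a
// generic metric would need.
type instantResponse struct {
	Status string `json:"status"`
	Data   struct {
		ResultType string `json:"resultType"`
		Result     []struct {
			Metric map[string]string `json:"metric"`
			Value  []interface{}     `json:"value"` // [unix timestamp, "value as string"]
		} `json:"result"`
	} `json:"data"`
}

// parseInstantVector returns one value per distinct value of the given label.
// Samples that lack the label are skipped (an assumption of this sketch).
func parseInstantVector(body []byte, label string) (map[string]float64, error) {
	var resp instantResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, err
	}
	if resp.Status != "success" || resp.Data.ResultType != "vector" {
		return nil, fmt.Errorf("unexpected response: status=%q resultType=%q",
			resp.Status, resp.Data.ResultType)
	}
	out := make(map[string]float64, len(resp.Data.Result))
	for _, sample := range resp.Data.Result {
		key, ok := sample.Metric[label]
		if !ok || len(sample.Value) != 2 {
			continue
		}
		raw, ok := sample.Value[1].(string)
		if !ok {
			continue
		}
		v, err := strconv.ParseFloat(raw, 64)
		if err != nil {
			return nil, err
		}
		out[key] = v
	}
	return out, nil
}

func main() {
	// Made-up response body in the instant query format.
	body := []byte(`{"status":"success","data":{"resultType":"vector","result":[
		{"metric":{"instance":"10.0.0.5:9100"},"value":[1700000000,"0.92"]}]}}`)
	values, err := parseInstantVector(body, "instance")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(values)
}
```

With an instant query, each label set yields exactly one number, which fits the "map to hosts, then weigh or filter" pattern without any extra aggregation in Go.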