[Feature Idea] Generic Prometheus Metric Plugin for Dynamic Weigher/Filter Logic #497

@henrichter

Description

While working on energy awareness, the following idea came up.

Summary

Add support for a "Generic Metric" type that allows users to define complex Prometheus queries directly in the Datasource CR and use the results for weighing or filtering hosts during scheduling decisions. This would enable building weigher/filter logic purely in Prometheus queries without requiring custom Go code for each new metric.


Motivation

Currently, adding a new Prometheus-based weigher or filter requires:

  1. Creating a new typed metric struct in internal/knowledge/datasources/plugins/prometheus/types.go
  2. Implementing a new extractor in internal/knowledge/extractor/plugins/
  3. Implementing a new weigher/filter in internal/scheduling/<domain>/plugins/weighers/
  4. Registering the new plugin in the index

This is a significant amount of boilerplate for what is sometimes a simple "query Prometheus -> map to hosts -> apply weight/filter" pattern. A generic implementation would allow operators to:

  • Rapidly prototype new scheduling heuristics
  • Use complex PromQL aggregations without code changes
  • Experiment with different metrics without redeploying Cortex

Use Case Examples

Example 1: Weigher - Prefer Hosts with Low CPU Usage

Goal: Prefer hosts where the CPU is mostly idle (high idle ratio).

Prometheus Query:

1 - (
  sum by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))
  /
  sum by (instance) (rate(node_cpu_seconds_total[5m]))
)

Summing over all modes and CPUs per instance gives the non-idle share of CPU time, so the result lies between 0 and 1, where 1 = fully idle and 0 = fully utilized.

Desired Behavior:

  • Higher values (more idle) -> higher weight
  • This would be configured as a weigher in the pipeline

Example Datasource CR:

apiVersion: cortex.cloud/v1alpha1
kind: Datasource
metadata:
  name: node-cpu-idle-ratio
spec:
  schedulingDomain: nova
  type: prometheus
  prometheus:
    query: |
      1 - (
        sum by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))
        /
        sum by (instance) (rate(node_cpu_seconds_total[5m]))
      )
    alias: node_cpu_idle_ratio
    type: generic  # new
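
The weigher behaviour described above (higher idle ratio -> higher weight) could be sketched roughly as follows. This is a minimal Python illustration only (Cortex itself is written in Go) and not an existing Cortex API: the helper name weigh_hosts, the clamping to [0, 1], and the default for hosts without a sample are all assumptions.

```python
def weigh_hosts(values, candidates, default=0.0):
    """Map metric values (keyed by host name) to a weight per candidate host.

    Hypothetical helper: values are clamped to [0, 1]; hosts without a
    sample get `default` (an assumption, the issue leaves this open).
    """
    weights = {}
    for host in candidates:
        value = values.get(host, default)
        weights[host] = min(1.0, max(0.0, value))
    return weights


# Idle ratios as they might come back from the node-cpu-idle-ratio datasource,
# already mapped to Nova compute host names (made-up numbers).
idle_ratio = {"compute-node-01": 0.9, "compute-node-02": 0.25}
candidates = ["compute-node-01", "compute-node-02", "compute-node-03"]

weights = weigh_hosts(idle_ratio, candidates)
ranked = sorted(candidates, key=lambda h: weights[h], reverse=True)
print(ranked)
```

How to treat hosts that are missing from the query result (neutral weight, penalty, or skipping the weigher entirely) is a design choice the generic weigher would need to make explicit.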

Example 2: Filter - Exclude Hosts Under IRQ Pressure

Goal: Filter out hosts experiencing significant IRQ stall pressure.

Prometheus Query:

rate(node_pressure_irq_stalled_seconds_total[5m]) > bool 0.005

The bool modifier makes the comparison return 1 (true) for hosts under pressure and 0 (false) for healthy hosts, instead of dropping the series that do not match.

Desired Behavior:

  • Hosts returning 1 -> filtered out (excluded from scheduling)
  • Hosts returning 0 -> kept as valid candidates

Example Datasource CR:

apiVersion: cortex.cloud/v1alpha1
kind: Datasource
metadata:
  name: node-irq-pressure-filter
spec:
  schedulingDomain: nova
  type: prometheus
  prometheus:
    query: |
      rate(node_pressure_irq_stalled_seconds_total[5m]) > bool 0.005
    alias: node_irq_pressure
    type: generic  # new
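
The filter behaviour described above (1 -> excluded, 0 -> kept) could look roughly like this. Again a minimal Python sketch rather than Cortex code: the helper name filter_hosts and the keep_missing switch are hypothetical, and keeping hosts that have no sample is an assumption.

```python
def filter_hosts(flags, candidates, keep_missing=True):
    """Return the candidates that pass a 0/1 filter metric.

    Hypothetical helper: `flags` maps host name -> 0 or 1, where 1 means
    "filter out". Hosts without a sample are kept when `keep_missing` is
    True (an assumption, not something the issue specifies).
    """
    kept = []
    for host in candidates:
        flag = flags.get(host)
        if flag is None:
            if keep_missing:
                kept.append(host)
            continue
        if flag == 0:
            kept.append(host)
    return kept


# Made-up results of the node-irq-pressure-filter datasource,
# already mapped to Nova compute host names.
irq_pressure = {"compute-node-01": 0, "compute-node-02": 1}
filter_candidates = ["compute-node-01", "compute-node-02", "compute-node-03"]

print(filter_hosts(irq_pressure, filter_candidates))
print(filter_hosts(irq_pressure, filter_candidates, keep_missing=False))
```

Whether an unknown host should pass the filter (fail open) or be dropped (fail closed) probably deserves its own parameter, since a Prometheus outage would otherwise either disable the filter or empty the candidate list.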

Key Challenge: Label-to-Subject Mapping

Prometheus metrics use labels such as instance, node, host, or custom labels to identify targets. However, Nova (and other OpenStack services) use their own naming conventions for compute hosts (e.g., compute_host from the OpenStack API).

Example Mismatch:

Prometheus Label                          Nova Compute Host
instance="10.0.0.5:9100"                  compute-node-01
node="worker-1.internal"                  nova-compute-worker-1
hostsystem="esxi-host-42.vcenter.local"   vc-a-0-runq42

Current State:
Cortex already solves this for vROps metrics using mapping Knowledges like vmware-resolved-hostsystems that translate vROps hostsystem names to Nova compute hosts.

Needed Solution:
The generic metric implementation needs a flexible way to map Prometheus labels to scheduling subjects. Some ideas:

  1. Direct Mapping: Label value directly matches subject name (simplest case)
  2. Mapping Knowledge Reference: Reference an existing Knowledge CR that contains the label-to-subject mapping
  3. Label Transformation Template: Apply a transformation to the label value (e.g., strip a suffix, regex extract), for example expressed as a Go text/template
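
To make the three ideas concrete, here is a minimal Python sketch of each strategy. All function names are hypothetical, the lookup table stands in for whatever a mapping Knowledge CR would provide, and a regex substitution is used in place of a template purely for illustration.

```python
import re


def map_direct(labels, label):
    """Idea 1: the label value already is the subject name."""
    return labels.get(label)


def map_via_table(labels, label, table):
    """Idea 2: look the label value up in a mapping (stand-in for a Knowledge CR)."""
    return table.get(labels.get(label))


def map_via_regex(labels, label, pattern, replacement=""):
    """Idea 3: derive the subject name by transforming the label value."""
    value = labels.get(label)
    if value is None:
        return None
    return re.sub(pattern, replacement, value)


# Made-up label sets and mapping table, based on the mismatch examples above.
direct = map_direct({"node": "compute-node-01"}, "node")
looked_up = map_via_table(
    {"instance": "10.0.0.5:9100"},
    "instance",
    {"10.0.0.5:9100": "compute-node-01"},
)
stripped = map_via_regex({"instance": "compute-node-01:9100"}, "instance", r":\d+$")

print(direct, looked_up, stripped)
```

All three return None when the label is missing or unmapped, which leaves the decision about unmapped series to the caller.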

Proposed API (Ideas Welcome!)

apiVersion: cortex.cloud/v1alpha1
kind: Pipeline
metadata:
  name: nova-external-scheduler
spec:
  schedulingDomain: nova
  type: filter-weigher
  filters:
  - name: generic
    params:
      datasource: node-irq-pressure-filter
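      mapping:
        # Options A-C illustrate the three mapping ideas; a real config would likely set only one.
        # Option A: Label value directly matches subject name
        label: "instance"

        # Option B: Reference an existing mapping Knowledge CR
        knowledgeRef: "prometheus-to-nova-mapping"

        # Option C: Apply a transformation template to the label
        # (trimSuffix is not a text/template builtin; it would need a helper such as Sprig's)
        transform: "{{ trimSuffix \":9100\" .instance }}"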
  weighers:
  - name: generic
    weight: 1.0
    params:
      datasource: node-cpu-idle-ratio
      mapping:
        knowledgeRef: "prometheus-to-nova-mapping"
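
As a rough illustration of how such a pipeline might evaluate, the following Python sketch runs one generic filter and one generic weigher over a candidate list. It is not Cortex's pipeline implementation: run_pipeline, the (labels, value) sample shape, the hard-coded instance label, and the treatment of missing samples (pass the filter, weight 0) are all assumptions.

```python
def run_pipeline(candidates, filter_samples, weigher_samples, mapping, weight=1.0):
    """Filter, then rank candidates using generic metric samples.

    Hypothetical sketch. Samples are (labels, value) pairs; `mapping`
    translates the `instance` label to a Nova compute host name.
    """

    def by_host(samples):
        result = {}
        for labels, value in samples:
            host = mapping.get(labels.get("instance"))
            if host is not None:
                result[host] = value
        return result

    flags = by_host(filter_samples)
    scores = by_host(weigher_samples)
    # Assumption: hosts without a filter sample pass the filter.
    kept = [h for h in candidates if flags.get(h, 0) == 0]
    # Assumption: hosts without a weigher sample get a score of 0.
    return sorted(kept, key=lambda h: weight * scores.get(h, 0.0), reverse=True)


# Made-up data for three hosts.
host_mapping = {
    "10.0.0.5:9100": "compute-node-01",
    "10.0.0.6:9100": "compute-node-02",
    "10.0.0.7:9100": "compute-node-03",
}
pressure = [
    ({"instance": "10.0.0.5:9100"}, 0.0),
    ({"instance": "10.0.0.6:9100"}, 1.0),
    ({"instance": "10.0.0.7:9100"}, 0.0),
]
idle = [
    ({"instance": "10.0.0.5:9100"}, 0.4),
    ({"instance": "10.0.0.6:9100"}, 0.95),
    ({"instance": "10.0.0.7:9100"}, 0.8),
]
pipeline_result = run_pipeline(
    ["compute-node-01", "compute-node-02", "compute-node-03"],
    pressure,
    idle,
    host_mapping,
)
print(pipeline_result)
```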

Questions for Discussion

  1. General Interest: Do you consider this feature broadly useful? In my case, it would simplify energy-aware weighing.
  2. Mapping Strategy: Which approach (or combination) for label-to-subject mapping makes the most sense?
  3. Query Execution: Should the generic metric always be evaluated as an instant query (single point in time), with all temporal logic expressed in PromQL itself?
  4. Knowledge Integration: The current workflow is built around Knowledge CRs. Should we use a Knowledge CR for each Datasource, or a single shared one?
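
Regarding question 3: an instant query against the Prometheus HTTP API (/api/v1/query) returns an instant vector with one value per series, which is the shape a generic metric would consume. The sketch below parses such a response in Python; the helper name parse_instant_vector and the sample payload are made up for illustration, and the HTTP call itself is omitted.

```python
import json


def parse_instant_vector(body):
    """Parse a Prometheus instant-query response into (labels, value) pairs.

    Hypothetical helper; it expects the JSON body of /api/v1/query.
    """
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected an instant vector, got {data['resultType']}")
    # Each sample carries its labels and a [timestamp, "value"] pair;
    # the value is transmitted as a string.
    return [(s["metric"], float(s["value"][1])) for s in data["result"]]


# Made-up response body in the documented instant-vector format.
response_body = """
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {"metric": {"instance": "10.0.0.5:9100"}, "value": [1700000000.0, "0.87"]},
      {"metric": {"instance": "10.0.0.6:9100"}, "value": [1700000000.0, "0.12"]}
    ]
  }
}
"""
samples = parse_instant_vector(response_body)
print(samples)
```

With this shape, any temporal logic (rates, averages over a window) stays inside the PromQL expression, and the scheduler only ever sees one number per series.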
