Add prometheus-exporter contrib #104

IsaiahStapleton wants to merge 1 commit into openshift-psap:main from
Conversation
SEE README.md. This tool automatically discovers KServe InferenceService models in a Kubernetes/OpenShift cluster, runs load tests against them, and exports the results as Prometheus metrics.

Signed-off-by: IsaiahStapleton <istaplet@redhat.com>
📝 Walkthrough

This PR introduces a complete Prometheus metrics exporter system for llm-load-test: it discovers KServe InferenceService models in a cluster, runs load tests against them, and exposes the results as Prometheus metrics.
Sequence Diagram(s)

```mermaid
sequenceDiagram
participant Runner as Runner Container
participant K8sAPI as Kubernetes API
participant Secret as Secrets Store
participant LLM as Load Test CLI
participant Volume as Shared Volume
participant Exporter as Exporter Container
participant Prom as Prometheus
Runner->>K8sAPI: List pods with<br/>serving.kserve.io/inferenceservice label
K8sAPI-->>Runner: Pod list with gather_llm_metrics labels
loop For each eligible model
Runner->>K8sAPI: Resolve service endpoint<br/>for model
K8sAPI-->>Runner: Service URL + port info
alt Auth enabled on pod
Runner->>Secret: Retrieve bearer token<br/>from service account secret
Secret-->>Runner: Auth token
end
Runner->>LLM: Execute load-test<br/>with model config
LLM-->>LLM: Run load tests
LLM-->>Volume: Write JSON results<br/>{model}_{namespace}.json
end
Runner->>Volume: Clean stale JSON files
par Metrics Export
Exporter->>Volume: Read JSON result files
Volume-->>Exporter: Model latency/throughput data
Exporter->>Exporter: Parse & update<br/>Prometheus Gauges
Exporter-->>Prom: /metrics endpoint<br/>(model, namespace labels)
and Prometheus Scraping
Prom->>Exporter: GET /metrics<br/>every 120s
Exporter-->>Prom: Metric samples
end
```
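To make the export half of this flow concrete, here is a minimal Python sketch of one exporter refresh pass. It assumes a `/shared` volume mount, the `{model}_{namespace}.json` naming scheme from the diagram, and a `summary.tpot.mean` field in the result JSON — the mount path and JSON shape are illustrative assumptions, not necessarily what this PR's exporter.py does:

```python
import json
from pathlib import Path

from prometheus_client import Gauge

RESULTS_DIR = Path("/shared")  # assumed shared-volume mount point

# Gauge name and help text taken from the example metrics in this review;
# the (model, namespace) label set matches the sequence diagram above.
TPOT_MEAN = Gauge(
    "llm_load_test_tpot_mean_ms",
    "Mean Time Per Output Token (ms)",
    ["model", "namespace"],
)


def refresh_metrics() -> None:
    """Re-read each {model}_{namespace}.json result file and update gauges."""
    for path in RESULTS_DIR.glob("*_*.json"):
        # Split on the last underscore: model names may contain "_",
        # but Kubernetes namespaces cannot.
        model, _, namespace = path.stem.rpartition("_")
        data = json.loads(path.read_text())
        # Assumed result shape: {"summary": {"tpot": {"mean": <float>}}}.
        tpot = data.get("summary", {}).get("tpot", {}).get("mean")
        if tpot is not None:
            TPOT_MEAN.labels(model=model, namespace=namespace).set(tpot)
```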
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 1
🧹 Nitpick comments (7)
contrib/prometheus-exporter/exporter/Containerfile (1)
1-12: Consider running as a non-root user for security hardening.

The container runs as root by default. For better security posture, especially in Kubernetes/OpenShift environments, create and switch to a non-root user.
🛡️ Proposed fix to add non-root user
```diff
 FROM python:3.12-slim

 WORKDIR /app

+RUN useradd --create-home --shell /bin/bash appuser
+
 COPY requirements.txt ./
 RUN pip install --no-cache-dir -r requirements.txt

 COPY exporter.py wsgi.py ./

+RUN chown -R appuser:appuser /app
+USER appuser
+
 EXPOSE 8080

 CMD ["gunicorn", "wsgi:app", "--log-level=info", "--workers", "2", "--bind", "0.0.0.0:8080"]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@contrib/prometheus-exporter/exporter/Containerfile` around lines 1 - 12, The image runs as root; add a non-root user and switch to it: create a system group/user (e.g., "app"), ensure /app is owned by that user (chown) after copying files and installing deps, and then set USER app before the CMD so gunicorn runs unprivileged; keep WORKDIR, EXPOSE and CMD unchanged but perform pip install as root (or use a temporary root step) then drop privileges by switching to the created user (reference symbols: WORKDIR /app, COPY, RUN pip install, and CMD ["gunicorn", "wsgi:app"...], and add USER <name>).

contrib/prometheus-exporter/exporter/requirements.txt (1)
1-3: Consider pinning exact versions for reproducible container builds.

The `>=` constraints allow flexibility but can lead to non-reproducible builds when `pip install` runs during container builds. While this pattern is consistent with other contrib tools in the repository, pinning exact versions or adding a lock file would ensure the container image remains identical across rebuilds and deployments.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@contrib/prometheus-exporter/exporter/requirements.txt` around lines 1 - 3, Replace the floating `>=` version constraints in contrib/prometheus-exporter/exporter/requirements.txt with exact pinned versions (e.g., set flask, prometheus-client, gunicorn to specific versions) or alternatively add a generated lock file (pip-tools requirements.txt/.lock or pipfile.lock) and update CI/container build steps to install from that lock file so container builds are reproducible; target the package lines for flask, prometheus-client, and gunicorn in the file when making the change.
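As a hedged sketch of the pinned alternative, the file could look like the following — the version numbers are illustrative placeholders; resolve the real ones with `pip freeze` or pip-tools' `pip-compile`:

```text
flask==3.0.3
prometheus-client==0.20.0
gunicorn==22.0.0
```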
contrib/prometheus-exporter/runner/datasets/dataset.jsonl (1)

1-1: Update the license placeholder values.

The MIT license text contains `[year]` and `[fullname]` placeholders that should be replaced with actual values for proper attribution.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@contrib/prometheus-exporter/runner/datasets/dataset.jsonl` at line 1, The license field in the dataset entry contains placeholder tokens "[year]" and "[fullname]" which must be replaced with real attribution values; update the "license" string in the JSON object (the entry with "name": "synthetic-data") to substitute [year] with the current copyright year and [fullname] with the project or author name so the MIT license text is complete and accurate.

contrib/prometheus-exporter/base/deployment.yaml (2)
14-66: Consider adding security hardening to container specs.

Static analysis flags missing security context settings. While not blocking for a contrib tool, adding these settings would improve security posture and serve as a good example for users customizing this deployment.
🔒 Suggested security context additions
```diff
       spec:
         containers:
           - name: exporter
             image: quay.io/openshift-psap/llm-load-test-exporter:latest
             imagePullPolicy: Always
+            securityContext:
+              allowPrivilegeEscalation: false
+              runAsNonRoot: true
+              capabilities:
+                drop:
+                  - ALL
             ports:
               - name: web
                 containerPort: 8080
```

Apply the same `securityContext` block to the `runner` container as well.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@contrib/prometheus-exporter/base/deployment.yaml` around lines 14 - 66, Add container-level security hardening to both containers ("exporter" and "runner"): set securityContext with runAsNonRoot: true and a non-root runAsUser (e.g., 1000), set allowPrivilegeEscalation: false, set readOnlyRootFilesystem: true where feasible, and drop all capabilities (capabilities.drop: ["ALL"]); also consider adding a minimal podSecurityContext (fsGroup/runAsUser) consistent with the container runAsUser and ensure the serviceAccountName (llm-load-test-sa) has no elevated permissions. Update the "exporter" container block to include these securityContext settings and apply the same securityContext to the "runner" container.
17-17: Consider pinning image tags instead of using `:latest`.

Using `:latest` tags can lead to unpredictable deployments when images are updated. For reproducibility, consider using specific version tags or SHA digests, especially when documenting deployment instructions.

Also applies to: 48-48
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@contrib/prometheus-exporter/base/deployment.yaml` at line 17, The deployment uses an unpinned image tag "quay.io/openshift-psap/llm-load-test-exporter:latest" which can cause unpredictable deployments; update the image field(s) (e.g., the "image:" entries for the container in the Deployment manifest and the other occurrence noted) to a specific version tag or an immutable SHA digest (quay pullspec@sha256:...) so deployments are reproducible and stable.
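For example, a digest-pinned pullspec would look like the following — the digest here is a placeholder, not this image's real digest:

```yaml
image: quay.io/openshift-psap/llm-load-test-exporter@sha256:<digest>
```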
contrib/prometheus-exporter/README.md (1)

94-188: Add a language specifier to the fenced code block.

The example metrics output block lacks a language specifier. Use `text` or `promql` to satisfy the linter and improve syntax highlighting.

✏️ Suggested fix
````diff
-```
+```text
 # HELP llm_load_test_tpot_mean_ms Mean Time Per Output Token (ms)
````

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@contrib/prometheus-exporter/README.md` around lines 94 - 188, The fenced example metrics block in contrib/prometheus-exporter/README.md lacks a language specifier, causing linter/syntax-highlighting issues; update the opening triple-backticks for that example metrics block to include a language (e.g., add "text" or "promql") so the block becomes ```text and the linter stops complaining and highlighting works correctly.

contrib/prometheus-exporter/exporter/exporter.py (1)
131-132: Use the public `clear()` method instead of accessing the private `_metrics` attribute.

The `prometheus_client.Gauge` class provides a documented `clear()` method that safely empties all labelsets. Change `gauge._metrics.clear()` to `gauge.clear()` to avoid relying on implementation details that may change across library versions.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@contrib/prometheus-exporter/exporter/exporter.py` around lines 131 - 132, Replace direct access to the private attribute with the public API: instead of calling gauge._metrics.clear() in the loop over ALL_GAUGES, call gauge.clear() so you use the documented prometheus_client.Gauge.clear() method; update the loop that iterates ALL_GAUGES and each gauge variable to invoke clear() (referencing ALL_GAUGES and the Gauge instances) to avoid relying on the private _metrics attribute.
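A small self-contained sketch of the recommended pattern, with a hypothetical `ALL_GAUGES` list standing in for the exporter's actual registry:

```python
from prometheus_client import Gauge

# Hypothetical gauge mirroring one of the exporter's metrics; the real
# module defines several of these and collects them in ALL_GAUGES.
TPOT_MEAN = Gauge(
    "llm_load_test_tpot_mean_ms",
    "Mean Time Per Output Token (ms)",
    ["model", "namespace"],
)
ALL_GAUGES = [TPOT_MEAN]

TPOT_MEAN.labels(model="demo", namespace="default").set(42.0)

# Public API: clear() drops every labeled child, so stale (model, namespace)
# series disappear before the next refresh repopulates the gauges.
for gauge in ALL_GAUGES:
    gauge.clear()
```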
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 36352909-cef1-467d-bf2e-86f67ef75f9a
📒 Files selected for processing (19)
contrib/prometheus-exporter/README.md
contrib/prometheus-exporter/base/clusterrole.yaml
contrib/prometheus-exporter/base/clusterrolebinding.yaml
contrib/prometheus-exporter/base/deployment.yaml
contrib/prometheus-exporter/base/files/llm-load-test-config.env
contrib/prometheus-exporter/base/files/uwl_metrics_list.yaml
contrib/prometheus-exporter/base/kustomization.yaml
contrib/prometheus-exporter/base/service.yaml
contrib/prometheus-exporter/base/serviceaccount.yaml
contrib/prometheus-exporter/base/servicemonitor.yaml
contrib/prometheus-exporter/exporter/Containerfile
contrib/prometheus-exporter/exporter/exporter.py
contrib/prometheus-exporter/exporter/requirements.txt
contrib/prometheus-exporter/exporter/wsgi.py
contrib/prometheus-exporter/grafana/grafana-llm-load-test-dashboard.json
contrib/prometheus-exporter/runner/Containerfile
contrib/prometheus-exporter/runner/datasets/dataset.jsonl
contrib/prometheus-exporter/runner/requirements.txt
contrib/prometheus-exporter/runner/runner.py
contrib/prometheus-exporter/runner/runner.py:

```python
def discover_and_test_models() -> None:
    """Discover KServe InferenceService models and run load tests."""
    try:
        model_pods = v1.list_pod_for_all_namespaces(
            label_selector="serving.kserve.io/inferenceservice"
        )
    except Exception as exc:
        LOG.error("Failed to list model pods: %s", exc)
        return

    active_files: set[str] = set()

    for pod in model_pods.items:
        model_name = pod.metadata.labels.get(
            "serving.kserve.io/inferenceservice", "unknown"
        )
        namespace = pod.metadata.namespace

        # Only test pods that are Running and opted-in
        gather = pod.metadata.labels.get("gather_llm_metrics")
        if pod.status.phase != "Running" or not gather:
            LOG.debug(
                "Skipping %s/%s (phase=%s, gather_llm_metrics=%s)",
                namespace, model_name, pod.status.phase, gather,
            )
            continue

        active_files.add(f"{model_name}_{namespace}.json")

        # Check if token auth is required
        annotations = pod.metadata.annotations or {}
        enable_auth = (
            annotations.get("security.opendatahub.io/enable-auth") == "true"
        )
        auth_token = get_auth_token(model_name, namespace) if enable_auth else None

        host_url = _discover_service_url(model_name, namespace)
        if host_url is None:
            host_url = f"https://{model_name}.{namespace}.svc.cluster.local"
            LOG.warning("No predictor service found for %s/%s, falling back to %s",
                        namespace, model_name, host_url)

        LOG.info("Running load test for model %s in namespace %s (url=%s)",
                 model_name, namespace, host_url)

        cfg = build_config(model_name, host_url, namespace, auth_token)
        run_load_test(cfg)

        LOG.info("Completed load test for model %s in namespace %s",
                 model_name, namespace)

    _remove_stale_files(active_files)
```
Duplicate load tests run for models with multiple replicas.
The code iterates over all pods matching the serving.kserve.io/inferenceservice label. If an InferenceService has multiple replicas, each replica pod triggers a separate load test, all writing to the same output file. This wastes resources and produces non-deterministic results.
Deduplicate by (model_name, namespace) before running tests.
🔧 Suggested fix
```diff
 def discover_and_test_models() -> None:
     """Discover KServe InferenceService models and run load tests."""
     try:
         model_pods = v1.list_pod_for_all_namespaces(
             label_selector="serving.kserve.io/inferenceservice"
         )
     except Exception as exc:
         LOG.error("Failed to list model pods: %s", exc)
         return

     active_files: set[str] = set()
+    seen_models: set[tuple[str, str]] = set()

     for pod in model_pods.items:
         model_name = pod.metadata.labels.get(
             "serving.kserve.io/inferenceservice", "unknown"
         )
         namespace = pod.metadata.namespace

         # Only test pods that are Running and opted-in
         gather = pod.metadata.labels.get("gather_llm_metrics")
         if pod.status.phase != "Running" or not gather:
             LOG.debug(
                 "Skipping %s/%s (phase=%s, gather_llm_metrics=%s)",
                 namespace, model_name, pod.status.phase, gather,
             )
             continue

+        # Skip if we've already processed this model
+        model_key = (model_name, namespace)
+        if model_key in seen_models:
+            LOG.debug("Skipping duplicate pod for %s/%s", namespace, model_name)
+            continue
+        seen_models.add(model_key)
+
         active_files.add(f"{model_name}_{namespace}.json")
```
🪛 Ruff (0.15.7)
[warning] 204-204: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@contrib/prometheus-exporter/runner/runner.py` around lines 198 - 249,
discover_and_test_models currently iterates pods and runs load tests per pod
which causes duplicate tests for the same (model_name, namespace); deduplicate
by tracking seen (model_name, namespace) pairs before building cfg and calling
run_load_test. Modify discover_and_test_models to maintain a set (e.g.,
seen_models) of tuples (model_name, namespace), skip processing if the tuple is
already present, and only add to active_files and call get_auth_token,
build_config, and run_load_test for the first occurrence; keep existing logging
and fallback URL logic unchanged.
This tool was originally created by me to run against models deployed in the Mass Open Cloud (MOC) environment: https://github.com/IsaiahStapleton/llm-load-test-exporter.