feat: add configurable live metrics publishing for CPU servers #363
attafosu wants to merge 6 commits into
Add a new configuration parameter 'enable_live_metrics' (default=True) that controls periodic live metrics publishing. This addresses resource contention in which the metrics aggregator subprocess competes with CPU servers for cycles and starves them.

When disabled (--enable-live-metrics=false or runtime.enable_live_metrics=false in YAML), the aggregator skips the live tick task, eliminating the periodic registry.build_snapshot() calls that cause CPU contention. Final snapshots (used by Report) are unaffected and continue to provide exact metrics.

Changes:
- RuntimeConfig: Add 'enable_live_metrics: bool = True' parameter (CLI/YAML)
- RuntimeSettings: Add field with default=True for backward compatibility
- MetricsPublisher: Treat 'publish_interval_s <= 0' as disabled state (log and skip tick task)
- execute.py: Conditionally create metrics subscriber only when enabled, pass publish-interval to aggregator (0.25 if enabled, 0 if disabled), add None checks for conditional subscriber usage
- Config templates: Regenerated with enable_live_metrics field
- test_publisher.py: Add unit test for disabled publish_interval_s path

Backward compatible: Default=True maintains existing behavior.

Fixes CPU contention on CPU-only server deployments where the metrics aggregator competes with the inference workload for shared L3/LLC resources.

Signed-off-by: attafosu <thomas.atta-fosu@intel.com>
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅
Code Review
This pull request introduces the ability to disable live metrics publishing via a new configuration option enable_live_metrics. When disabled, the live metrics tick task is skipped, while final snapshots are still written. The review feedback suggests two improvements: first, reordering the checks in MetricsPublisher.start to ensure that duplicate calls still trigger a warning even when live publishing is disabled; second, adding a negative CLI alias (--no-live-metrics) to the configuration schema for a more intuitive user experience.
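The negative-alias suggestion can be illustrated with a small sketch. It uses the standard library's `argparse` purely for illustration; the project's actual CLI layer is not shown in this conversation and may work differently. `build_parser` and `parse_bool` are hypothetical names.

```python
import argparse


def parse_bool(text: str) -> bool:
    """Parse a textual boolean such as 'true' or 'false' (hypothetical helper)."""
    lowered = text.strip().lower()
    if lowered in {"true", "1", "yes"}:
        return True
    if lowered in {"false", "0", "no"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {text!r}")


def build_parser() -> argparse.ArgumentParser:
    """Illustrative parser only; not the project's real CLI definition."""
    parser = argparse.ArgumentParser(prog="benchmark-sketch")
    # Positive form: accepts an optional value, so both
    # `--enable-live-metrics` and `--enable-live-metrics=false` parse.
    parser.add_argument(
        "--enable-live-metrics",
        dest="enable_live_metrics",
        nargs="?",
        const=True,
        default=True,
        type=parse_bool,
    )
    # Negative alias from the review feedback: writes False to the same dest.
    parser.add_argument(
        "--no-live-metrics",
        dest="enable_live_metrics",
        action="store_false",
    )
    return parser
```

Both spellings write to the same destination, so downstream code only ever reads one `enable_live_metrics` value.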
Pull request overview
Adds a runtime-configurable switch to disable periodic live metrics snapshot publishing (while preserving final snapshot/report generation) to reduce CPU contention on CPU-only deployments.
Changes:
- Introduces `settings.runtime.enable_live_metrics` (default `true`) and plumbs it through `RuntimeSettings`.
- Passes `--publish-interval 0` to the metrics aggregator when live publishing is disabled, and updates `MetricsPublisher.start()` to treat `publish_interval_s <= 0` as "no live tick task".
- Updates full config templates and adds a unit test covering the disabled-live-publishing path.
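The behavior summarized in these bullets can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's `MetricsPublisher`: `SketchPublisher`, `run_once`, and the `build_snapshot` callable are hypothetical names, and the real publisher's API may differ.

```python
from __future__ import annotations

import asyncio
import logging
from typing import Callable

logger = logging.getLogger("metrics_sketch")


class SketchPublisher:
    """Illustrative stand-in for a metrics publisher (hypothetical)."""

    def __init__(self, build_snapshot: Callable[[], dict]) -> None:
        self._build_snapshot = build_snapshot
        self._tick_task: asyncio.Task | None = None
        self.live_snapshots: list[dict] = []

    def start(self, publish_interval_s: float) -> None:
        # A non-positive interval means "no live tick task".
        if publish_interval_s <= 0:
            logger.info("live publishing disabled; skipping tick task")
            return
        self._tick_task = asyncio.create_task(self._tick(publish_interval_s))

    async def _tick(self, interval_s: float) -> None:
        # Periodic work that competes with the workload when enabled.
        while True:
            await asyncio.sleep(interval_s)
            self.live_snapshots.append(self._build_snapshot())

    async def close(self) -> dict:
        # Stop the tick task if one exists, then always build a final snapshot.
        if self._tick_task is not None:
            self._tick_task.cancel()
            try:
                await self._tick_task
            except asyncio.CancelledError:
                pass
            self._tick_task = None
        return self._build_snapshot()


async def run_once(publish_interval_s: float) -> tuple[int, dict]:
    """Simulate five completed requests and return (live ticks, final snapshot)."""
    completed = {"count": 0}
    publisher = SketchPublisher(lambda: dict(completed))
    publisher.start(publish_interval_s)
    for _ in range(5):
        completed["count"] += 1
        await asyncio.sleep(0.01)
    final = await publisher.close()
    return len(publisher.live_snapshots), final
```

With `run_once(0.0)` no live snapshots are recorded, yet the final snapshot still reflects all five simulated requests, which mirrors the "final report unaffected" claim in the overview.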
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| tests/unit/async_utils/services/metrics_aggregator/test_publisher.py | Adds unit coverage ensuring publish_interval_s <= 0 skips the tick task but still produces a final snapshot. |
| src/inference_endpoint/config/templates/online_template_full.yaml | Adds settings.runtime.enable_live_metrics to the full online template. |
| src/inference_endpoint/config/templates/offline_template_full.yaml | Adds settings.runtime.enable_live_metrics to the full offline template. |
| src/inference_endpoint/config/templates/concurrency_template_full.yaml | Adds settings.runtime.enable_live_metrics to the full concurrency template. |
| src/inference_endpoint/config/schema.py | Adds the new runtime config field and CLI alias --enable-live-metrics. |
| src/inference_endpoint/config/runtime_settings.py | Carries enable_live_metrics into immutable RuntimeSettings. |
| src/inference_endpoint/commands/benchmark/execute.py | Plumbs the toggle into metrics-aggregator launch args via --publish-interval. |
| src/inference_endpoint/async_utils/services/metrics_aggregator/publisher.py | Treats publish_interval_s <= 0 as “live publishing disabled” (no tick task). |
    if publish_interval_s <= 0:
        logger.info(
            "Live metrics publishing disabled "
            "(publish_interval_s=%s, skipping tick task)",
            publish_interval_s,
        )
        return
    if self._tick_task is not None:
        logger.warning(
            "MetricsPublisher.start called again while tick task is "
            "still running (id=%r); ignoring the second start.",
            id(self._tick_task),
        )
        return
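The reordering raised in the review feedback (duplicate-start guard before the disabled check) might look roughly like the sketch below. It is an assumption-laden illustration rather than the committed code: `StartOrderSketch` is a hypothetical class, the tick task is replaced by a placeholder object, and `start` returns a status string only so the ordering is easy to observe.

```python
import logging
from typing import Optional

logger = logging.getLogger("metrics_sketch")


class StartOrderSketch:
    """Hypothetical stand-in showing only the order of the two guards."""

    def __init__(self) -> None:
        self._tick_task: Optional[object] = None

    def start(self, publish_interval_s: float) -> str:
        # Guard 1: a second start is reported regardless of the interval.
        if self._tick_task is not None:
            logger.warning(
                "start called again while tick task is still running (id=%r)",
                id(self._tick_task),
            )
            return "duplicate"
        # Guard 2: a non-positive interval disables the live tick task.
        if publish_interval_s <= 0:
            logger.info(
                "live publishing disabled (publish_interval_s=%s)",
                publish_interval_s,
            )
            return "disabled"
        # Placeholder for creating the real asyncio tick task.
        self._tick_task = object()
        return "started"
```

With this order, calling `start(0.25)` and then `start(0.0)` on the same instance reports the duplicate call instead of silently logging "disabled".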
…/publisher.py Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Add negative alias to `enable-live-metrics` Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
    # Control live metrics publishing via config parameter
    # publish_interval_s = 0 disables live tick task in publisher.start()
    publish_interval_s = 0.25 if ctx.rt_settings.enable_live_metrics else 0.0
    aggregator_args.extend(["--publish-interval", str(publish_interval_s)])
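Pulled out into a standalone function, the mapping from the toggle to the aggregator argument could be exercised on its own. This is a sketch only: `live_metrics_args` and its `live_interval_s` parameter are hypothetical and do not appear in the PR, which inlines the logic in `execute.py`.

```python
from __future__ import annotations


def live_metrics_args(
    enable_live_metrics: bool, live_interval_s: float = 0.25
) -> list[str]:
    """Build the --publish-interval argument pair for the aggregator.

    Hypothetical helper mirroring the inline logic in the diff:
    the live interval when enabled, 0.0 when disabled.
    """
    publish_interval_s = live_interval_s if enable_live_metrics else 0.0
    return ["--publish-interval", str(publish_interval_s)]


# Example: extend an existing argument list, as the diff does.
aggregator_args: list[str] = []
aggregator_args.extend(live_metrics_args(enable_live_metrics=False))
```

Note that `str(0.0)` yields `"0.0"` rather than `"0"`; once the aggregator parses the value as a float, the `publish_interval_s <= 0` check treats both the same.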
@viraatc @arekay-nv @nvzhihanj Could you please take a look? This fixes a performance regression on CPU servers introduced by #306 (periodic and live metric snapshots).
What does this PR do?
Summary
This PR adds a runtime toggle to control live metrics publishing in the metrics aggregator:
- `settings.runtime.enable_live_metrics` (default: `true`)
- `--enable-live-metrics`
- `publish_interval=0`, which skips periodic live snapshot ticks while preserving final snapshot/report generation.

Why
On CPU-only server deployments, periodic live snapshot building can compete with inference workloads for CPU cycles.
This change allows disabling live ticks to reduce contention while keeping end-of-run reporting intact.

What Changed
- … `enable_live_metrics` to runtime schema and runtime settings (default true)
- … `enable_live_metrics`
- … `publish_interval <= 0` as live publishing disabled

Validation
- `enable_live_metrics=true` (existing behavior)
- `enable_live_metrics=false` (no periodic live ticks, final report still generated)

Snapshots
Performance snapshots:
- … (`enable_live_metrics: false`)

These artifacts illustrate reduced CPU contention and improved workload stability when live metrics are disabled on CPU servers.
Type of change
Related issues
Testing
Checklist