Enable /metrics endpoint with prometheus metrics #1412

dobesv wants to merge 2 commits into kagent-dev:main
Conversation
Force-pushed from 0e4a95e to 316d263
Signed-off-by: Dobes Vandermeer <dobes.vandermeer@newsela.com>
Force-pushed from 316d263 to 9c0aa53
Pull request overview
This PR exposes a /metrics HTTP endpoint on the FastAPI application to enable Prometheus scraping of OpenTelemetry metrics, primarily targeting LLM usage monitoring (token counts, operation durations). It adds the opentelemetry-exporter-prometheus and prometheus-client dependencies and introduces a LiteLLM callback to capture metrics for providers that bypass their Python SDK (e.g., Anthropic).
Changes:
- Added `opentelemetry-exporter-prometheus` as a hard dependency and locked its version along with `prometheus-client` in `uv.lock`.
- Refactored `configure()` to set up the `MeterProvider` (Prometheus) before instrumentors, and conditionally register a `/metrics` GET endpoint on the FastAPI app.
- Added `_register_litellm_metrics_callback()`, which records `gen_ai.client.token.usage` and `gen_ai.client.operation.duration` histograms via a LiteLLM `CustomLogger` for non-SDK providers.
Reviewed changes
Copilot reviewed 2 out of 3 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| `python/packages/kagent-core/pyproject.toml` | Adds `opentelemetry-exporter-prometheus>=0.52b0` as a hard project dependency |
| `python/packages/kagent-core/src/kagent/core/tracing/_utils.py` | Adds metrics provider setup, `/metrics` endpoint registration, and LiteLLM callback for non-SDK providers |
| `python/uv.lock` | Locks `opentelemetry-exporter-prometheus==0.59b0` and `prometheus-client==0.24.1` |
| "opentelemetry-exporter-prometheus>=0.52b0", | ||
| "typing-extensions>=4.0.0", | ||
| ] | ||
|
|
The opentelemetry-exporter-prometheus package is added as a hard (unconditional) dependency in pyproject.toml, so the ImportError fallback at lines 95-100 can never actually be triggered. This creates a contradiction: the hard dependency guarantees the package is always present, rendering the warning log dead code, while users who install kagent-core will always pull in prometheus-client even when they never intend to use Prometheus metrics.
If the intention is to keep metrics optional (the try/except pattern suggests this), the dependency should be made optional (e.g., `[project.optional-dependencies]` with a `metrics` extra). If the intention is to always require it, the try/except ImportError block can be removed and the import can be moved to the top-level imports.
| "opentelemetry-exporter-prometheus>=0.52b0", | |
| "typing-extensions>=4.0.0", | |
| ] | |
| "typing-extensions>=4.0.0", | |
| ] | |
| [project.optional-dependencies] | |
| metrics = [ | |
| "opentelemetry-exporter-prometheus>=0.52b0", | |
| ] |
```python
    if metrics_enabled:
        _register_litellm_metrics_callback()
```
When the PrometheusMetricReader / MeterProvider import fails (the ImportError is caught at line 95), metrics_enabled remains True and _register_litellm_metrics_callback() is still called at line 184. That function calls metrics.get_meter(...) against the default NoOpMeterProvider, so the histograms are silently created as no-ops and the callback is registered but does nothing. The call to _register_litellm_metrics_callback() should be guarded on whether the MeterProvider was successfully configured, not just on metrics_enabled.
A straightforward fix is to track whether provider setup succeeded with a boolean flag (e.g. meter_provider_configured = False) and set it to True inside the try block after calling metrics.set_meter_provider(), then gate the callback on meter_provider_configured instead of metrics_enabled.
| "gen_ai.response.model": model, | ||
| "gen_ai.operation.name": "chat", | ||
| "server.address": "", | ||
| "stream": stream, |
The "stream" key in base_attrs holds a raw Python bool (line 246). OpenTelemetry's Prometheus exporter serializes boolean attributes by calling str() on them, producing "True" or "False" (capital first letter). The OpenAI SDK instrumentation, if it records the same attribute, uses the lowercase string "true" / "false" following the OpenTelemetry semantic conventions. This mismatch means the Prometheus label values will differ between metrics emitted by the OpenAI instrumentor (for SDK-instrumented providers) and those emitted by this callback (for non-SDK providers), making cross-provider dashboards inconsistent.
Convert the value to a lowercase string explicitly, e.g. "stream": str(stream).lower().
| "stream": stream, | |
| "stream": str(stream).lower(), |
```python
    def log_success_event(self, kwargs, response_obj, start_time, end_time):
        try:
            self._record_metrics(kwargs, response_obj, start_time, end_time)
        except Exception:
            logging.debug("Failed to record LiteLLM metrics", exc_info=True)

    async def async_log_success_event(self, kwargs, response_obj, start_time, end_time):
        self.log_success_event(kwargs, response_obj, start_time, end_time)
```
The _MetricsCallback only overrides log_success_event and async_log_success_event, but does not override log_failure_event or async_log_failure_event. As a result, the gen_ai.client.operation.duration histogram (and the error-status token counts, if applicable) will be missing entirely for failed LLM calls. According to the OpenTelemetry GenAI semantic conventions, operation duration should be recorded for both successful and failed operations, with an error.type attribute set on failure. Without this, latency dashboards and SLO calculations will be inaccurate because failed (often slow or timed-out) requests are silently excluded.
| """ | ||
| tracing_enabled = os.getenv("OTEL_TRACING_ENABLED", "false").lower() == "true" | ||
| logging_enabled = os.getenv("OTEL_LOGGING_ENABLED", "false").lower() == "true" | ||
| metrics_enabled = os.getenv("OTEL_METRICS_ENABLED", "false").lower() == "true" |
The `OTEL_METRICS_ENABLED` environment variable name is consistent with the existing `OTEL_TRACING_ENABLED` and `OTEL_LOGGING_ENABLED` variables, which already form a pattern. However, the new variable should at minimum be documented somewhere users can discover it. More critically, it is not mentioned in the `configure()` docstring, even though `OTEL_TRACING_ENABLED` and `OTEL_LOGGING_ENABLED` are implicitly documented there. This will make it hard for users to discover how to enable metrics without reading the source code.
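For discoverability, the docstring could enumerate all three flags. The wording below is a suggestion, sketched against the `configure()` signature visible in the diff (body elided):

```python
def configure(name: str = "kagent", namespace: str = "kagent", fastapi_app=None):
    """Configure OpenTelemetry tracing, logging, and metrics.

    Behavior is controlled by environment variables (all default to "false"):
      OTEL_TRACING_ENABLED  -- enable tracing instrumentation.
      OTEL_LOGGING_ENABLED  -- enable GenAI event logging.
      OTEL_METRICS_ENABLED  -- enable Prometheus metrics and the /metrics endpoint.
    """
```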
```diff
@@ -90,10 +129,8 @@ def configure(name: str = "kagent", namespace: str = "kagent", fastapi_app: Fast
     trace.set_tracer_provider(tracer_provider)
     logging.info("Created new TracerProvider")
 
-    HTTPXClientInstrumentor().instrument()
-    if fastapi_app:
-        FastAPIInstrumentor().instrument_app(fastapi_app)
-    # Configure logging if enabled
+    # 1c. Logging provider
+    event_logger_provider = None
     if logging_enabled:
         logging.info("Enabling logging for GenAI events")
         logger_provider = LoggerProvider(resource=resource)
@@ -114,15 +151,136 @@ def configure(name: str = "kagent", namespace: str = "kagent", fastapi_app: Fast
 
         _logs.set_logger_provider(logger_provider)
         logging.info("Log provider configured with OTLP")
-        # When logging is enabled, use new event-based approach (input/output as log events in Body)
-        logging.info("OpenAI instrumentation configured with event logging capability")
-        # Create event logger provider using the configured logger provider
+        # Create event logger provider for instrumentors
         event_logger_provider = EventLoggerProvider(logger_provider)
 
+    # ------------------------------------------------------------------ #
+    # 2. Instrument libraries — all providers are now available.         #
+    # ------------------------------------------------------------------ #
+
+    if tracing_enabled:
+        HTTPXClientInstrumentor().instrument()
+        if fastapi_app:
+            FastAPIInstrumentor().instrument_app(fastapi_app)
+
+    if event_logger_provider:
+        # Event logging mode: input/output as log events in Body
+        logging.info("OpenAI instrumentation configured with event logging capability")
+        OpenAIInstrumentor(use_legacy_attributes=False).instrument(event_logger_provider=event_logger_provider)
+        _instrument_anthropic(event_logger_provider)
+    else:
-        # Use legacy attributes (input/output as GenAI span attributes)
+        # Legacy attributes mode: input/output as GenAI span attributes
+        logging.info("OpenAI instrumentation configured with legacy GenAI span attributes")
+        OpenAIInstrumentor().instrument()
+        _instrument_anthropic()
+    _instrument_google_generativeai()
+
+    # ------------------------------------------------------------------ #
+    # 3. LiteLLM metrics callback for providers that bypass their SDK.   #
+    #    LiteLLM uses raw httpx for some providers (e.g., Anthropic),    #
+    #    so the SDK instrumentors never fire. This callback fills the gap.#
+    # ------------------------------------------------------------------ #
+
+    if metrics_enabled:
+        _register_litellm_metrics_callback()
```
There are no tests covering the new `metrics_enabled` code path in `configure()` or the `_register_litellm_metrics_callback()` function, even though the existing test file `test_tracing_configure.py` provides coverage for the `tracing_enabled` and `logging_enabled` paths. At minimum, tests should verify:

- When `OTEL_METRICS_ENABLED=true` and `fastapi_app` is provided, the `/metrics` endpoint is registered.
- When `OTEL_METRICS_ENABLED=true` and `litellm` is available, the LiteLLM callback is appended to `litellm.callbacks`.
- The callback correctly skips SDK-instrumented providers and records histograms for others.
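A first test for the endpoint-registration case might look like this sketch; the test name, the use of `pytest.importorskip`, and the `kagent.core.tracing` import path are assumptions based on the file layout shown above:

```python
import pytest


def test_metrics_endpoint_registered(monkeypatch):
    fastapi = pytest.importorskip("fastapi")
    monkeypatch.setenv("OTEL_METRICS_ENABLED", "true")

    # configure() lives in kagent/core/tracing/_utils.py per the PR's file list;
    # the re-export from kagent.core.tracing is assumed here.
    from kagent.core.tracing import configure

    app = fastapi.FastAPI()
    configure(fastapi_app=app)
    assert "/metrics" in [route.path for route in app.routes]
```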
Prometheus metrics are mostly supported by the libraries we are already using, so it is not hard to expose them. This allows simple creation of a Grafana dashboard to monitor agent LLM usage.