
Enable /metrics endpoint with prometheus metrics #1412

Open
dobesv wants to merge 2 commits into kagent-dev:main from dobesv:prom-metrics-exporter

Conversation

Contributor

@dobesv dobesv commented Mar 1, 2026

Prometheus metrics are mostly supported by the libraries we already use, so exposing them is not hard. This allows simple creation of a Grafana dashboard to monitor agent LLM usage.

[screenshot: Grafana dashboard of agent LLM usage metrics]

@dobesv dobesv force-pushed the prom-metrics-exporter branch 2 times, most recently from 0e4a95e to 316d263 Compare March 1, 2026 20:09
@dobesv dobesv changed the title Enable /metrics endpoint Enable /metrics endpoint with prometheus metrics Mar 1, 2026
Signed-off-by: Dobes Vandermeer <dobes.vandermeer@newsela.com>
@dobesv dobesv force-pushed the prom-metrics-exporter branch from 316d263 to 9c0aa53 Compare March 1, 2026 20:50
@dobesv dobesv marked this pull request as ready for review March 8, 2026 08:27
Copilot AI review requested due to automatic review settings March 8, 2026 08:27
Contributor

Copilot AI left a comment
Pull request overview

This PR exposes a /metrics HTTP endpoint on the FastAPI application to enable Prometheus scraping of OpenTelemetry metrics, primarily targeting LLM usage monitoring (token counts, operation durations). It adds the opentelemetry-exporter-prometheus and prometheus-client dependencies and introduces a LiteLLM callback to capture metrics for providers that bypass their Python SDK (e.g., Anthropic).

Changes:

  • Added opentelemetry-exporter-prometheus as a hard dependency and locked its version along with prometheus-client in uv.lock.
  • Refactored configure() to set up the MeterProvider (Prometheus) before instrumentors, and conditionally registers a /metrics GET endpoint on the FastAPI app.
  • Added _register_litellm_metrics_callback() which records gen_ai.client.token.usage and gen_ai.client.operation.duration histograms via a LiteLLM CustomLogger for non-SDK providers.
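The wiring the changes describe can be sketched as follows. This is an illustrative outline, not the PR's actual code; `setup_prometheus_metrics` and `metrics_enabled_from_env` are hypothetical names, and the imports are guarded here only to show the shape of the setup.

```python
# Hypothetical sketch of the metrics wiring described above.
def metrics_enabled_from_env(environ) -> bool:
    """Mirror the OTEL_METRICS_ENABLED gating used by configure()."""
    return environ.get("OTEL_METRICS_ENABLED", "false").lower() == "true"

def setup_prometheus_metrics(fastapi_app, environ) -> bool:
    if not metrics_enabled_from_env(environ):
        return False
    try:
        # Hard dependencies in the PR; guarded here to illustrate an
        # optional-dependency pattern.
        from opentelemetry import metrics
        from opentelemetry.exporter.prometheus import PrometheusMetricReader
        from opentelemetry.sdk.metrics import MeterProvider
        from prometheus_client import CONTENT_TYPE_LATEST, generate_latest
    except ImportError:
        return False

    # The reader bridges OTel instruments into the default Prometheus registry.
    reader = PrometheusMetricReader()
    metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))

    if fastapi_app is not None:
        @fastapi_app.get("/metrics")
        def metrics_endpoint():
            from fastapi import Response
            return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)

    return True
```

Returning a boolean lets callers know whether the provider was actually configured, which matters for the callback registration discussed in the review comments.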

Reviewed changes

Copilot reviewed 2 out of 3 changed files in this pull request and generated 6 comments.

File Description
python/packages/kagent-core/pyproject.toml Adds opentelemetry-exporter-prometheus>=0.52b0 as a hard project dependency
python/packages/kagent-core/src/kagent/core/tracing/_utils.py Adds metrics provider setup, /metrics endpoint registration, and LiteLLM callback for non-SDK providers
python/uv.lock Locks opentelemetry-exporter-prometheus==0.59b0 and prometheus-client==0.24.1


Comment on lines +21 to 24
"opentelemetry-exporter-prometheus>=0.52b0",
"typing-extensions>=4.0.0",
]

Copilot AI Mar 8, 2026

The opentelemetry-exporter-prometheus package is added as a hard (unconditional) dependency in pyproject.toml, so the ImportError fallback at lines 95-100 can never actually be triggered. This creates a contradiction: the hard dependency guarantees the package is always present, rendering the warning log dead code, while users who install kagent-core will always pull in prometheus-client even when they never intend to use Prometheus metrics.

If the intention is to keep metrics optional (the try/except pattern suggests this), the dependency should be made optional (e.g., [project.optional-dependencies] with a metrics extra). If the intention is to always require it, the try/except ImportError block can be removed and the import can be moved to the top-level imports.

Suggested change

Before:
    "opentelemetry-exporter-prometheus>=0.52b0",
    "typing-extensions>=4.0.0",
]

After:
    "typing-extensions>=4.0.0",
]

[project.optional-dependencies]
metrics = [
    "opentelemetry-exporter-prometheus>=0.52b0",
]

Comment on lines +184 to +185
if metrics_enabled:
_register_litellm_metrics_callback()
Copilot AI Mar 8, 2026

When the PrometheusMetricReader / MeterProvider import fails (the ImportError is caught at line 95), metrics_enabled remains True and _register_litellm_metrics_callback() is still called at line 184. That function calls metrics.get_meter(...) against the default NoOpMeterProvider, so the histograms are silently created as no-ops and the callback is registered but does nothing. The call to _register_litellm_metrics_callback() should be guarded on whether the MeterProvider was successfully configured, not just on metrics_enabled.

A straightforward fix is to track whether provider setup succeeded with a boolean flag (e.g. meter_provider_configured = False) and set it to True inside the try block after calling metrics.set_meter_provider(), then gate the callback on meter_provider_configured instead of metrics_enabled.
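The suggested guard can be sketched as below. `gate_litellm_callback`, `setup_meter_provider`, and `register_callback` are hypothetical stand-ins for the PR's configure() logic, the try/except block around the PrometheusMetricReader setup, and `_register_litellm_metrics_callback()` respectively.

```python
# Illustrative sketch of gating the callback on successful provider setup.
def gate_litellm_callback(metrics_enabled, setup_meter_provider, register_callback):
    meter_provider_configured = False
    if metrics_enabled:
        try:
            setup_meter_provider()  # e.g. metrics.set_meter_provider(...)
            meter_provider_configured = True
        except ImportError:
            pass  # exporter not installed; leave metrics off

    # Gate on successful setup, not just the env flag, so the LiteLLM
    # callback is never registered against a NoOpMeterProvider.
    if meter_provider_configured:
        register_callback()
    return meter_provider_configured
```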

"gen_ai.response.model": model,
"gen_ai.operation.name": "chat",
"server.address": "",
"stream": stream,
Copilot AI Mar 8, 2026

The "stream" key in base_attrs holds a raw Python bool (line 246). OpenTelemetry's Prometheus exporter serializes boolean attributes by calling str() on them, producing "True" or "False" (capital first letter). The OpenAI SDK instrumentation, if it records the same attribute, uses the lowercase string "true" / "false" following the OpenTelemetry semantic conventions. This mismatch means the Prometheus label values will differ between metrics emitted by the OpenAI instrumentor (for SDK-instrumented providers) and those emitted by this callback (for non-SDK providers), making cross-provider dashboards inconsistent.

Convert the value to a lowercase string explicitly, e.g. "stream": str(stream).lower().

Suggested change

Before:
    "stream": stream,

After:
    "stream": str(stream).lower(),
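The normalization is trivial but easy to overlook; a small hedged helper (the name `otel_bool` is illustrative, not from the PR) makes the intent explicit:

```python
def otel_bool(value: bool) -> str:
    """Serialize a boolean the way OTel semantic conventions expect:
    lowercase "true" / "false" rather than Python's str() "True" / "False"."""
    return str(bool(value)).lower()
```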

Comment on lines +276 to +283
def log_success_event(self, kwargs, response_obj, start_time, end_time):
try:
self._record_metrics(kwargs, response_obj, start_time, end_time)
except Exception:
logging.debug("Failed to record LiteLLM metrics", exc_info=True)

async def async_log_success_event(self, kwargs, response_obj, start_time, end_time):
self.log_success_event(kwargs, response_obj, start_time, end_time)
Copilot AI Mar 8, 2026

The _MetricsCallback only overrides log_success_event and async_log_success_event, but does not override log_failure_event or async_log_failure_event. As a result, the gen_ai.client.operation.duration histogram (and the error-status token counts, if applicable) will be missing entirely for failed LLM calls. According to the OpenTelemetry GenAI semantic conventions, operation duration should be recorded for both successful and failed operations, with an error.type attribute set on failure. Without this, latency dashboards and SLO calculations will be inaccurate because failed (often slow or timed-out) requests are silently excluded.

"""
tracing_enabled = os.getenv("OTEL_TRACING_ENABLED", "false").lower() == "true"
logging_enabled = os.getenv("OTEL_LOGGING_ENABLED", "false").lower() == "true"
metrics_enabled = os.getenv("OTEL_METRICS_ENABLED", "false").lower() == "true"
Copilot AI Mar 8, 2026

The OTEL_METRICS_ENABLED environment variable name is consistent with the existing OTEL_TRACING_ENABLED and OTEL_LOGGING_ENABLED pattern, which is good. However, the new variable is not mentioned in the configure() docstring, even though OTEL_TRACING_ENABLED and OTEL_LOGGING_ENABLED are implicitly documented there. This will make it hard for users to discover how to enable metrics without reading the source code.
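One way to address this is to list all three flags in the docstring. The sketch below is illustrative only; the real configure() has a fuller signature and body:

```python
def configure(name: str = "kagent", namespace: str = "kagent", fastapi_app=None):
    """Configure OpenTelemetry tracing, logging, and metrics.

    Behavior is controlled by environment variables (all default to "false"):
        OTEL_TRACING_ENABLED: enable span export and HTTPX/FastAPI instrumentation.
        OTEL_LOGGING_ENABLED: enable GenAI event logging.
        OTEL_METRICS_ENABLED: enable the Prometheus /metrics endpoint and the
            LiteLLM metrics callback.
    """
```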

Comment on lines 75 to +185
@@ -90,10 +129,8 @@ def configure(name: str = "kagent", namespace: str = "kagent", fastapi_app: Fast
trace.set_tracer_provider(tracer_provider)
logging.info("Created new TracerProvider")

HTTPXClientInstrumentor().instrument()
if fastapi_app:
FastAPIInstrumentor().instrument_app(fastapi_app)
# Configure logging if enabled
# 1c. Logging provider
event_logger_provider = None
if logging_enabled:
logging.info("Enabling logging for GenAI events")
logger_provider = LoggerProvider(resource=resource)
@@ -114,15 +151,136 @@ def configure(name: str = "kagent", namespace: str = "kagent", fastapi_app: Fast

_logs.set_logger_provider(logger_provider)
logging.info("Log provider configured with OTLP")
# When logging is enabled, use new event-based approach (input/output as log events in Body)
logging.info("OpenAI instrumentation configured with event logging capability")
# Create event logger provider using the configured logger provider
# Create event logger provider for instrumentors
event_logger_provider = EventLoggerProvider(logger_provider)

# ------------------------------------------------------------------ #
# 2. Instrument libraries — all providers are now available. #
# ------------------------------------------------------------------ #

if tracing_enabled:
HTTPXClientInstrumentor().instrument()
if fastapi_app:
FastAPIInstrumentor().instrument_app(fastapi_app)

if event_logger_provider:
# Event logging mode: input/output as log events in Body
logging.info("OpenAI instrumentation configured with event logging capability")
OpenAIInstrumentor(use_legacy_attributes=False).instrument(event_logger_provider=event_logger_provider)
_instrument_anthropic(event_logger_provider)
else:
# Use legacy attributes (input/output as GenAI span attributes)
# Legacy attributes mode: input/output as GenAI span attributes
logging.info("OpenAI instrumentation configured with legacy GenAI span attributes")
OpenAIInstrumentor().instrument()
_instrument_anthropic()
_instrument_google_generativeai()

# ------------------------------------------------------------------ #
# 3. LiteLLM metrics callback for providers that bypass their SDK. #
# LiteLLM uses raw httpx for some providers (e.g., Anthropic), #
# so the SDK instrumentors never fire. This callback fills the gap.#
# ------------------------------------------------------------------ #

if metrics_enabled:
_register_litellm_metrics_callback()
Copilot AI Mar 8, 2026

There are no tests covering the new metrics_enabled code path in configure() or the _register_litellm_metrics_callback() function, even though the existing test file test_tracing_configure.py provides coverage for the tracing_enabled and logging_enabled paths. At minimum, tests should verify:

  • When OTEL_METRICS_ENABLED=true and fastapi_app is provided, the /metrics endpoint is registered.
  • When OTEL_METRICS_ENABLED=true and litellm is available, the LiteLLM callback is appended to litellm.callbacks.
  • The callback correctly skips SDK-instrumented providers and records histograms for others.
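The first of these checks can be exercised without an HTTP stack. The sketch below is hypothetical: `FakeApp` stands in for the FastAPI instance, and `register_metrics_route` mirrors only the conditional route registration, not the full configure() path.

```python
class FakeApp:
    """Minimal stand-in recording which routes get registered."""
    def __init__(self):
        self.routes = []

    def get(self, path):
        def decorator(fn):
            self.routes.append(path)
            return fn
        return decorator

def register_metrics_route(app, metrics_enabled: bool) -> None:
    # Mirrors the conditional /metrics registration in configure().
    if metrics_enabled:
        @app.get("/metrics")
        def metrics_endpoint():
            return "metrics payload"

def test_metrics_route_registered_only_when_enabled():
    enabled = FakeApp()
    register_metrics_route(enabled, metrics_enabled=True)
    assert "/metrics" in enabled.routes

    disabled = FakeApp()
    register_metrics_route(disabled, metrics_enabled=False)
    assert disabled.routes == []

test_metrics_route_registered_only_when_enabled()
```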
