Enable /metrics endpoint with prometheus metrics #1412
```diff
@@ -2,7 +2,7 @@
 import os

 from fastapi import FastAPI
-from opentelemetry import _logs, trace
+from opentelemetry import _logs, metrics, trace
 from opentelemetry.exporter.otlp.proto.grpc._log_exporter import OTLPLogExporter
 from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
 from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
```
```diff
@@ -44,23 +44,62 @@ def _instrument_google_generativeai():


 def configure(name: str = "kagent", namespace: str = "kagent", fastapi_app: FastAPI | None = None):
-    """Configure OpenTelemetry tracing and logging for this service.
+    """Configure OpenTelemetry tracing, logging, and metrics for this service.

-    This sets up OpenTelemetry providers and exporters for tracing and logging,
-    using environment variables to determine whether each is enabled.
+    This sets up OpenTelemetry providers and exporters for tracing, logging,
+    and metrics, using environment variables to determine whether each is enabled.
+
+    Providers are configured before instrumentors so that instrumentors can
+    discover and use all available providers (TracerProvider, MeterProvider, etc.).

     Args:
         name: service name to report to OpenTelemetry (used as ``service.name``). Default is "kagent".
         namespace: logical namespace for the service (used as ``service.namespace``). Default is "kagent".
         fastapi_app: Optional FastAPI application instance to instrument. If
             provided and tracing is enabled, FastAPI routes will be instrumented.
+            If metrics is enabled, a ``/metrics`` endpoint will be added for
+            Prometheus scraping.
     """
     tracing_enabled = os.getenv("OTEL_TRACING_ENABLED", "false").lower() == "true"
     logging_enabled = os.getenv("OTEL_LOGGING_ENABLED", "false").lower() == "true"
+    metrics_enabled = os.getenv("OTEL_METRICS_ENABLED", "false").lower() == "true"

     resource = Resource({"service.name": name, "service.namespace": namespace})

-    # Configure tracing if enabled
+    # ------------------------------------------------------------------ #
+    # 1. Configure providers BEFORE instrumentors so that instrumentors  #
+    #    can discover MeterProvider, TracerProvider, etc. at init time.  #
+    # ------------------------------------------------------------------ #
+
+    # 1a. Metrics provider (Prometheus pull endpoint)
+    if metrics_enabled:
+        logging.info("Enabling Prometheus metrics")
+        try:
+            from opentelemetry.exporter.prometheus import PrometheusMetricReader
+            from opentelemetry.sdk.metrics import MeterProvider
+
+            reader = PrometheusMetricReader()
+            meter_provider = MeterProvider(resource=resource, metric_readers=[reader])
+            metrics.set_meter_provider(meter_provider)
+            logging.info("MeterProvider configured with Prometheus exporter")
+
+            if fastapi_app:
+                from prometheus_client import CONTENT_TYPE_LATEST, generate_latest
+                from starlette.responses import Response
+
+                @fastapi_app.get("/metrics")
+                async def metrics_endpoint():
+                    return Response(content=generate_latest(), media_type=CONTENT_TYPE_LATEST)
+
+                logging.info("Added /metrics endpoint for Prometheus scraping")
+        except ImportError:
+            logging.warning(
+                "opentelemetry-exporter-prometheus is not installed; "
+                "metrics endpoint will not be available. "
+                "Install it with: pip install opentelemetry-exporter-prometheus"
+            )
+
+    # 1b. Tracing provider
     if tracing_enabled:
         logging.info("Enabling tracing")
         # Check standard OTEL env vars: signal-specific endpoint first, then general endpoint
```
```diff
@@ -90,10 +129,8 @@ def configure(name: str = "kagent", namespace: str = "kagent", fastapi_app: FastAPI | None = None):
         trace.set_tracer_provider(tracer_provider)
         logging.info("Created new TracerProvider")

-    HTTPXClientInstrumentor().instrument()
-    if fastapi_app:
-        FastAPIInstrumentor().instrument_app(fastapi_app)
-    # Configure logging if enabled
+    # 1c. Logging provider
     event_logger_provider = None
     if logging_enabled:
         logging.info("Enabling logging for GenAI events")
         logger_provider = LoggerProvider(resource=resource)
```
```diff
@@ -114,15 +151,136 @@ def configure(name: str = "kagent", namespace: str = "kagent", fastapi_app: FastAPI | None = None):

         _logs.set_logger_provider(logger_provider)
         logging.info("Log provider configured with OTLP")
-        # When logging is enabled, use new event-based approach (input/output as log events in Body)
-        logging.info("OpenAI instrumentation configured with event logging capability")
-        # Create event logger provider using the configured logger provider
+        # Create event logger provider for instrumentors
         event_logger_provider = EventLoggerProvider(logger_provider)

+    # ------------------------------------------------------------------ #
+    # 2. Instrument libraries — all providers are now available.         #
+    # ------------------------------------------------------------------ #
+
+    if tracing_enabled:
+        HTTPXClientInstrumentor().instrument()
+        if fastapi_app:
+            FastAPIInstrumentor().instrument_app(fastapi_app)
+
     if event_logger_provider:
+        # Event logging mode: input/output as log events in Body
+        logging.info("OpenAI instrumentation configured with event logging capability")
         OpenAIInstrumentor(use_legacy_attributes=False).instrument(event_logger_provider=event_logger_provider)
         _instrument_anthropic(event_logger_provider)
     else:
-        # Use legacy attributes (input/output as GenAI span attributes)
+        # Legacy attributes mode: input/output as GenAI span attributes
         logging.info("OpenAI instrumentation configured with legacy GenAI span attributes")
         OpenAIInstrumentor().instrument()
         _instrument_anthropic()
     _instrument_google_generativeai()

+    # ------------------------------------------------------------------ #
+    # 3. LiteLLM metrics callback for providers that bypass their SDK.   #
+    #    LiteLLM uses raw httpx for some providers (e.g., Anthropic),    #
+    #    so the SDK instrumentors never fire. This callback fills the gap.#
+    # ------------------------------------------------------------------ #
+
+    if metrics_enabled:
+        _register_litellm_metrics_callback()
```
*Copilot commented on lines +184 to +185 (see the review comments below).*
```diff
+
+def _register_litellm_metrics_callback():
+    """Register a LiteLLM callback that records GenAI metrics for providers
+    where LiteLLM bypasses the provider's Python SDK (e.g., Anthropic).
+
+    LiteLLM uses raw httpx POST requests for some providers instead of their
+    official Python SDKs. This means the OpenTelemetry instrumentors for those
+    SDKs never fire and no metrics are recorded. This callback fills that gap
+    by recording metrics directly from LiteLLM's success/failure callbacks.
+
+    Providers where LiteLLM uses the SDK directly (e.g., OpenAI) are skipped
+    to avoid double-counting with the existing instrumentor metrics.
+    """
+    try:
+        import litellm
+        from litellm.integrations.custom_logger import CustomLogger
+    except ImportError:
+        logging.debug("litellm not installed; skipping LiteLLM metrics callback")
+        return
+
+    meter = metrics.get_meter("kagent.litellm")
+    token_histogram = meter.create_histogram(
+        name="gen_ai.client.token.usage",
+        unit="token",
+        description="Measures number of input and output tokens used",
+    )
+    duration_histogram = meter.create_histogram(
+        name="gen_ai.client.operation.duration",
+        unit="s",
+        description="GenAI operation duration",
+    )
+
+    # Providers where LiteLLM uses the Python SDK directly, so the
+    # SDK instrumentor already captures metrics. Skip these to avoid
+    # double-counting.
+    SDK_INSTRUMENTED_PROVIDERS = frozenset(
+        {
+            "openai",
+            "azure",
+            "azure_text",
+            "azure_ai",
+        }
+    )
+
+    class _MetricsCallback(CustomLogger):
+        def _record_metrics(self, kwargs, response_obj, start_time, end_time):
+            provider = kwargs.get("custom_llm_provider", "")
+            if provider in SDK_INSTRUMENTED_PROVIDERS:
+                return
+
+            model = kwargs.get("model", "unknown")
+            stream = kwargs.get("stream", False)
+            # Match attribute names used by the OpenAI instrumentor
+            # so all providers appear with consistent labels in Prometheus.
+            base_attrs = {
+                "gen_ai.system": provider or "unknown",
+                "gen_ai.response.model": model,
+                "gen_ai.operation.name": "chat",
+                "server.address": "",
+                "stream": stream,
```

Copilot suggested normalizing the `stream` attribute to a string label:

```diff
-                "stream": stream,
+                "stream": str(stream).lower(),
```
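The double-counting guard at the top of `_record_metrics` reduces to a pure predicate. A stand-alone sketch (the provider list is copied from the diff; the function name is illustrative):

```python
# Providers LiteLLM calls through their Python SDKs; the SDK instrumentors
# already record metrics for these, so the callback must skip them.
SDK_INSTRUMENTED_PROVIDERS = frozenset({"openai", "azure", "azure_text", "azure_ai"})

def should_record(provider: str) -> bool:
    # Record only for providers LiteLLM drives over raw httpx,
    # where no SDK instrumentor ever fires.
    return provider not in SDK_INSTRUMENTED_PROVIDERS

print(should_record("anthropic"))  # True  -- raw httpx path, record here
print(should_record("openai"))     # False -- SDK instrumentor already records
```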
**Copilot** (Mar 8, 2026):

> The `_MetricsCallback` only overrides `log_success_event` and `async_log_success_event`, but does not override `log_failure_event` or `async_log_failure_event`. As a result, the `gen_ai.client.operation.duration` histogram (and the error-status token counts, if applicable) will be missing entirely for failed LLM calls. According to the OpenTelemetry GenAI semantic conventions, operation duration should be recorded for both successful and failed operations, with an `error.type` attribute set on failure. Without this, latency dashboards and SLO calculations will be inaccurate, because failed (often slow or timed-out) requests are silently excluded.
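Addressing this comment would mean recording duration on the failure path as well. A hedged, stand-alone sketch of the attribute and duration computation such an override might use (the helper names and exception-handling shape are assumptions, not the PR's code):

```python
from datetime import datetime, timedelta

def failure_attributes(provider: str, model: str, exc: BaseException) -> dict:
    # Same identifying labels as the success path, plus error.type, which
    # the OTel GenAI semantic conventions require for failed operations.
    return {
        "gen_ai.system": provider or "unknown",
        "gen_ai.response.model": model,
        "gen_ai.operation.name": "chat",
        "error.type": type(exc).__name__,
    }

def operation_seconds(start_time: datetime, end_time: datetime) -> float:
    # Duration is recorded for failures too; a log_failure_event override
    # would pass this value to duration_histogram.record(...).
    return (end_time - start_time).total_seconds()

start = datetime(2026, 3, 8, 12, 0, 0)
print(operation_seconds(start, start + timedelta(seconds=2.5)))  # 2.5
print(failure_attributes("anthropic", "claude-3", TimeoutError())["error.type"])  # TimeoutError
```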
**Copilot:**

> The `opentelemetry-exporter-prometheus` package is added as a hard (unconditional) dependency in `pyproject.toml`, so the `ImportError` fallback at lines 95-100 can never actually be triggered. This creates a contradiction: the hard dependency guarantees the package is always present, rendering the warning log dead code, while users who install `kagent-core` will always pull in `prometheus-client` even when they never intend to use Prometheus metrics.
>
> If the intention is to keep metrics optional (the try/except pattern suggests this), the dependency should be made optional (e.g., `[project.optional-dependencies]` with a `metrics` extra). If the intention is to always require it, the `try/except ImportError` block can be removed and the import moved to the top-level imports.
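Following the comment's first suggestion, the extra might be declared like this; a sketch of a hypothetical `metrics` extra, not the project's actual `pyproject.toml`:

```toml
[project.optional-dependencies]
# Hypothetical extra -- the exact entries and any version pins are assumptions.
metrics = [
    "opentelemetry-exporter-prometheus",
]
```

Users who want the `/metrics` endpoint would then run `pip install "kagent-core[metrics]"`, while a default install would no longer pull in `prometheus-client`.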