1 change: 1 addition & 0 deletions python/packages/kagent-core/pyproject.toml
@@ -18,6 +18,7 @@ dependencies = [
"opentelemetry-instrumentation-httpx >= 0.52.0",
"opentelemetry-instrumentation-fastapi>=0.52.0",
"opentelemetry-instrumentation-google-generativeai>=0.52.5",
"opentelemetry-exporter-prometheus>=0.52b0",
"typing-extensions>=4.0.0",
]

Comment on lines +21 to 24
Copilot AI Mar 8, 2026

The opentelemetry-exporter-prometheus package is added as a hard (unconditional) dependency in pyproject.toml, so the ImportError fallback at lines 95-100 can never actually be triggered. This creates a contradiction: the hard dependency guarantees the package is always present, rendering the warning log dead code, while users who install kagent-core will always pull in prometheus-client even when they never intend to use Prometheus metrics.

If the intention is to keep metrics optional (the try/except pattern suggests this), the dependency should be made optional (e.g., [project.optional-dependencies] with a metrics extra). If the intention is to always require it, the try/except ImportError block can be removed and the import moved to the top-level imports.

Suggested change
"opentelemetry-exporter-prometheus>=0.52b0",
"typing-extensions>=4.0.0",
]
"typing-extensions>=4.0.0",
]
[project.optional-dependencies]
metrics = [
"opentelemetry-exporter-prometheus>=0.52b0",
]

184 changes: 171 additions & 13 deletions python/packages/kagent-core/src/kagent/core/tracing/_utils.py
@@ -2,7 +2,7 @@
import os

from fastapi import FastAPI
from opentelemetry import _logs, trace
from opentelemetry import _logs, metrics, trace
from opentelemetry.exporter.otlp.proto.grpc._log_exporter import OTLPLogExporter
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
@@ -44,23 +44,62 @@ def _instrument_google_generativeai():


def configure(name: str = "kagent", namespace: str = "kagent", fastapi_app: FastAPI | None = None):
"""Configure OpenTelemetry tracing and logging for this service.
"""Configure OpenTelemetry tracing, logging, and metrics for this service.

This sets up OpenTelemetry providers and exporters for tracing and logging,
using environment variables to determine whether each is enabled.
This sets up OpenTelemetry providers and exporters for tracing, logging,
and metrics, using environment variables to determine whether each is enabled.

Providers are configured before instrumentors so that instrumentors can
discover and use all available providers (TracerProvider, MeterProvider, etc.).

Args:
name: service name to report to OpenTelemetry (used as ``service.name``). Default is "kagent".
namespace: logical namespace for the service (used as ``service.namespace``). Default is "kagent".
fastapi_app: Optional FastAPI application instance to instrument. If
provided and tracing is enabled, FastAPI routes will be instrumented.
If metrics is enabled, a ``/metrics`` endpoint will be added for
Prometheus scraping.
"""
tracing_enabled = os.getenv("OTEL_TRACING_ENABLED", "false").lower() == "true"
logging_enabled = os.getenv("OTEL_LOGGING_ENABLED", "false").lower() == "true"
metrics_enabled = os.getenv("OTEL_METRICS_ENABLED", "false").lower() == "true"
Copilot AI Mar 8, 2026

The new OTEL_METRICS_ENABLED environment variable follows the naming pattern established by the existing OTEL_TRACING_ENABLED and OTEL_LOGGING_ENABLED variables, but it is not documented anywhere users can discover it. More critically, it is missing from the configure() docstring even though OTEL_TRACING_ENABLED and OTEL_LOGGING_ENABLED are implicitly documented there. Without this, users will have to read the source code to find out how to enable metrics.


resource = Resource({"service.name": name, "service.namespace": namespace})

# Configure tracing if enabled
# ------------------------------------------------------------------ #
# 1. Configure providers BEFORE instrumentors so that instrumentors #
# can discover MeterProvider, TracerProvider, etc. at init time. #
# ------------------------------------------------------------------ #

# 1a. Metrics provider (Prometheus pull endpoint)
if metrics_enabled:
logging.info("Enabling Prometheus metrics")
try:
from opentelemetry.exporter.prometheus import PrometheusMetricReader
from opentelemetry.sdk.metrics import MeterProvider

reader = PrometheusMetricReader()
meter_provider = MeterProvider(resource=resource, metric_readers=[reader])
metrics.set_meter_provider(meter_provider)
logging.info("MeterProvider configured with Prometheus exporter")

if fastapi_app:
from prometheus_client import CONTENT_TYPE_LATEST, generate_latest
from starlette.responses import Response

@fastapi_app.get("/metrics")
async def metrics_endpoint():
return Response(content=generate_latest(), media_type=CONTENT_TYPE_LATEST)

logging.info("Added /metrics endpoint for Prometheus scraping")
except ImportError:
logging.warning(
"opentelemetry-exporter-prometheus is not installed; "
"metrics endpoint will not be available. "
"Install it with: pip install opentelemetry-exporter-prometheus"
)

# 1b. Tracing provider
if tracing_enabled:
logging.info("Enabling tracing")
# Check standard OTEL env vars: signal-specific endpoint first, then general endpoint
@@ -90,10 +129,8 @@ def configure(name: str = "kagent", namespace: str = "kagent", fastapi_app: Fast
trace.set_tracer_provider(tracer_provider)
logging.info("Created new TracerProvider")

HTTPXClientInstrumentor().instrument()
if fastapi_app:
FastAPIInstrumentor().instrument_app(fastapi_app)
# Configure logging if enabled
# 1c. Logging provider
event_logger_provider = None
if logging_enabled:
logging.info("Enabling logging for GenAI events")
logger_provider = LoggerProvider(resource=resource)
@@ -114,15 +151,136 @@ def configure(name: str = "kagent", namespace: str = "kagent", fastapi_app: Fast

_logs.set_logger_provider(logger_provider)
logging.info("Log provider configured with OTLP")
# When logging is enabled, use new event-based approach (input/output as log events in Body)
logging.info("OpenAI instrumentation configured with event logging capability")
# Create event logger provider using the configured logger provider
# Create event logger provider for instrumentors
event_logger_provider = EventLoggerProvider(logger_provider)

# ------------------------------------------------------------------ #
# 2. Instrument libraries — all providers are now available. #
# ------------------------------------------------------------------ #

if tracing_enabled:
HTTPXClientInstrumentor().instrument()
if fastapi_app:
FastAPIInstrumentor().instrument_app(fastapi_app)

if event_logger_provider:
# Event logging mode: input/output as log events in Body
logging.info("OpenAI instrumentation configured with event logging capability")
OpenAIInstrumentor(use_legacy_attributes=False).instrument(event_logger_provider=event_logger_provider)
_instrument_anthropic(event_logger_provider)
else:
# Use legacy attributes (input/output as GenAI span attributes)
# Legacy attributes mode: input/output as GenAI span attributes
logging.info("OpenAI instrumentation configured with legacy GenAI span attributes")
OpenAIInstrumentor().instrument()
_instrument_anthropic()
_instrument_google_generativeai()

# ------------------------------------------------------------------ #
# 3. LiteLLM metrics callback for providers that bypass their SDK. #
# LiteLLM uses raw httpx for some providers (e.g., Anthropic), #
# so the SDK instrumentors never fire. This callback fills the gap.#
# ------------------------------------------------------------------ #

if metrics_enabled:
_register_litellm_metrics_callback()
Comment on lines +184 to +185
Copilot AI Mar 8, 2026

When the PrometheusMetricReader / MeterProvider import fails (the ImportError is caught at line 95), metrics_enabled remains True and _register_litellm_metrics_callback() is still called at line 184. That function calls metrics.get_meter(...) against the default NoOpMeterProvider, so the histograms are silently created as no-ops and the callback is registered but does nothing. The call to _register_litellm_metrics_callback() should be guarded on whether the MeterProvider was successfully configured, not just on metrics_enabled.

A straightforward fix is to track whether provider setup succeeded with a boolean flag (e.g. meter_provider_configured = False) and set it to True inside the try block after calling metrics.set_meter_provider(), then gate the callback on meter_provider_configured instead of metrics_enabled.
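The suggested fix can be sketched as follows; `configure_metrics` is a hypothetical distillation of the relevant branch (the real MeterProvider wiring is elided), with an injectable `importer` seam so the ImportError path is exercisable without uninstalling the package:

```python
def configure_metrics(metrics_enabled: bool, importer=None) -> bool:
    """Return True only when the Prometheus exporter actually imported
    and the MeterProvider was installed -- not merely when the env flag
    is set. The caller gates the LiteLLM callback on this result."""
    if not metrics_enabled:
        return False
    try:
        if importer is not None:
            importer()  # test seam; stands in for the real import below
        else:
            from opentelemetry.exporter.prometheus import PrometheusMetricReader  # noqa: F401
    except ImportError:
        return False
    # ... MeterProvider(...) / metrics.set_meter_provider(...) would go here ...
    return True

def _raise():
    raise ImportError("opentelemetry-exporter-prometheus not installed")

meter_provider_configured = configure_metrics(True, importer=_raise)
print(meter_provider_configured)  # → False: callback registration is skipped
```

With this shape, `_register_litellm_metrics_callback()` runs only when `meter_provider_configured` is True, so the no-op-histogram scenario cannot occur.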

Comment on lines 75 to +185
Copilot AI Mar 8, 2026

There are no tests covering the new metrics_enabled code path in configure() or the _register_litellm_metrics_callback() function, even though the existing test file test_tracing_configure.py provides coverage for the tracing_enabled and logging_enabled paths. At minimum, tests should verify:

  • When OTEL_METRICS_ENABLED=true and fastapi_app is provided, the /metrics endpoint is registered.
  • When OTEL_METRICS_ENABLED=true and litellm is available, the LiteLLM callback is appended to litellm.callbacks.
  • The callback correctly skips SDK-instrumented providers and records histograms for others.
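A minimal pytest-style sketch of the third bullet, with the skip guard factored into a standalone helper (`should_record` is a hypothetical name; in the PR the check lives inline in `_record_metrics`):

```python
SDK_INSTRUMENTED_PROVIDERS = frozenset({"openai", "azure", "azure_text", "azure_ai"})

def should_record(provider: str) -> bool:
    """Mirror of the early-return guard in _record_metrics."""
    return provider not in SDK_INSTRUMENTED_PROVIDERS

def test_sdk_instrumented_providers_are_skipped():
    # These go through their own SDK, so the OpenAI instrumentor already
    # counts them; recording here would double-count.
    assert not should_record("openai")
    assert not should_record("azure")

def test_non_sdk_providers_are_recorded():
    # LiteLLM talks to these over raw httpx, so the callback must record.
    assert should_record("anthropic")
    assert should_record("bedrock")

test_sdk_instrumented_providers_are_skipped()
test_non_sdk_providers_are_recorded()
```

The /metrics-endpoint and callback-registration bullets would additionally need a FastAPI TestClient and a monkeypatched `litellm.callbacks` list, respectively.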



def _register_litellm_metrics_callback():
"""Register a LiteLLM callback that records GenAI metrics for providers
where LiteLLM bypasses the provider's Python SDK (e.g., Anthropic).

LiteLLM uses raw httpx POST requests for some providers instead of their
official Python SDKs. This means the OpenTelemetry instrumentors for those
SDKs never fire and no metrics are recorded. This callback fills that gap
by recording metrics directly from LiteLLM's success/failure callbacks.

Providers where LiteLLM uses the SDK directly (e.g., OpenAI) are skipped
to avoid double-counting with the existing instrumentor metrics.
"""
try:
import litellm
from litellm.integrations.custom_logger import CustomLogger
except ImportError:
logging.debug("litellm not installed; skipping LiteLLM metrics callback")
return

meter = metrics.get_meter("kagent.litellm")
token_histogram = meter.create_histogram(
name="gen_ai.client.token.usage",
unit="token",
description="Measures number of input and output tokens used",
)
duration_histogram = meter.create_histogram(
name="gen_ai.client.operation.duration",
unit="s",
description="GenAI operation duration",
)

# Providers where LiteLLM uses the Python SDK directly, so the
# SDK instrumentor already captures metrics. Skip these to avoid
# double-counting.
SDK_INSTRUMENTED_PROVIDERS = frozenset(
{
"openai",
"azure",
"azure_text",
"azure_ai",
}
)

class _MetricsCallback(CustomLogger):
def _record_metrics(self, kwargs, response_obj, start_time, end_time):
provider = kwargs.get("custom_llm_provider", "")
if provider in SDK_INSTRUMENTED_PROVIDERS:
return

model = kwargs.get("model", "unknown")
stream = kwargs.get("stream", False)
# Match attribute names used by the OpenAI instrumentor
# so all providers appear with consistent labels in Prometheus.
base_attrs = {
"gen_ai.system": provider or "unknown",
"gen_ai.response.model": model,
"gen_ai.operation.name": "chat",
"server.address": "",
"stream": stream,
Copilot AI Mar 8, 2026

The "stream" key in base_attrs holds a raw Python bool (line 246). OpenTelemetry's Prometheus exporter serializes boolean attributes by calling str() on them, producing "True" or "False" (capital first letter). The OpenAI SDK instrumentation, if it records the same attribute, uses the lowercase string "true" / "false" following the OpenTelemetry semantic conventions. This mismatch means the Prometheus label values will differ between metrics emitted by the OpenAI instrumentor (for SDK-instrumented providers) and those emitted by this callback (for non-SDK providers), making cross-provider dashboards inconsistent.

Convert the value to a lowercase string explicitly, e.g. "stream": str(stream).lower().

Suggested change
"stream": stream,
"stream": str(stream).lower(),

}

duration_s = (end_time - start_time).total_seconds()
duration_histogram.record(duration_s, attributes=base_attrs)

usage = getattr(response_obj, "usage", None)
if usage is None and isinstance(response_obj, dict):
usage = response_obj.get("usage")
if usage is None:
return

input_tokens = getattr(usage, "prompt_tokens", None)
if input_tokens is None and isinstance(usage, dict):
input_tokens = usage.get("prompt_tokens", 0)
output_tokens = getattr(usage, "completion_tokens", None)
if output_tokens is None and isinstance(usage, dict):
output_tokens = usage.get("completion_tokens", 0)

if input_tokens:
token_histogram.record(
input_tokens,
attributes={**base_attrs, "gen_ai.token.type": "input"},
)
if output_tokens:
token_histogram.record(
output_tokens,
attributes={**base_attrs, "gen_ai.token.type": "output"},
)

def log_success_event(self, kwargs, response_obj, start_time, end_time):
try:
self._record_metrics(kwargs, response_obj, start_time, end_time)
except Exception:
logging.debug("Failed to record LiteLLM metrics", exc_info=True)

async def async_log_success_event(self, kwargs, response_obj, start_time, end_time):
self.log_success_event(kwargs, response_obj, start_time, end_time)
Comment on lines +276 to +283
Copilot AI Mar 8, 2026

The _MetricsCallback only overrides log_success_event and async_log_success_event, but does not override log_failure_event or async_log_failure_event. As a result, the gen_ai.client.operation.duration histogram (and the error-status token counts, if applicable) will be missing entirely for failed LLM calls. According to the OpenTelemetry GenAI semantic conventions, operation duration should be recorded for both successful and failed operations, with an error.type attribute set on failure. Without this, latency dashboards and SLO calculations will be inaccurate because failed (often slow or timed-out) requests are silently excluded.
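A hedged sketch of the missing failure path. The histogram is replaced by a plain list so the snippet is self-contained, and the base class is omitted; the assumption that litellm passes the raised exception as `kwargs["exception"]` should be verified against the installed litellm version:

```python
from datetime import datetime, timedelta

recorded = []  # stands in for duration_histogram.record(value, attributes=...)

class _MetricsCallback:
    def log_failure_event(self, kwargs, response_obj, start_time, end_time):
        duration_s = (end_time - start_time).total_seconds()
        exc = kwargs.get("exception")  # assumed litellm field; verify
        recorded.append((duration_s, {
            "gen_ai.system": kwargs.get("custom_llm_provider") or "unknown",
            # OTel GenAI semconv: set error.type on failed operations
            "error.type": type(exc).__name__ if exc is not None else "_OTHER",
        }))

    async def async_log_failure_event(self, kwargs, response_obj, start_time, end_time):
        self.log_failure_event(kwargs, response_obj, start_time, end_time)

cb = _MetricsCallback()
t0 = datetime(2026, 3, 8, 12, 0, 0)
cb.log_failure_event(
    {"custom_llm_provider": "anthropic", "exception": TimeoutError("slow")},
    None, t0, t0 + timedelta(seconds=30),
)
print(recorded[0][1]["error.type"])  # → TimeoutError
```

Recording the 30-second timeout above is exactly what keeps latency percentiles honest: failed calls are often the slowest ones.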


litellm.callbacks.append(_MetricsCallback())
logging.info("Registered LiteLLM metrics callback for non-SDK providers")
25 changes: 25 additions & 0 deletions python/uv.lock

Some generated files are not rendered by default.