[Discussion] OpenTelemetry Tracing Enhancement - Semantic Conventions, SpanKind, and Production-Grade Features #172

@RichardoMrMu

Background

I previously submitted PR #130 to enhance the OpenTelemetry tracing implementation in OpenDerisk, which was reviewed and merged by @csunny. However, PR #133 subsequently simplified the opentelemetry.py file and removed most of the enhancements from PR #130.

I'd like to open this discussion to align on the desired direction for OpenTelemetry tracing in OpenDerisk, and propose a plan to incrementally re-introduce production-grade tracing features.

Current State

The current opentelemetry.py (after PR #133) provides basic span creation and export, but lacks several features that are important for production observability:

| Feature | Status | Impact |
| --- | --- | --- |
| Semantic Conventions (HTTP/GenAI) | ❌ Removed | Traces not queryable in Jaeger/Tempo by standard attributes |
| SpanKind mapping (SERVER/CLIENT/INTERNAL) | ❌ Removed | No distinction between HTTP entry, LLM calls, and internal ops |
| Span Events (errors, token usage) | ❌ Removed | No error details or LLM usage tracking in traces |
| Span Status (OK/ERROR) | ❌ Removed | Cannot filter failed spans |
| Stale span cleanup | ❌ Removed | Memory leak risk from orphaned spans |
| Multi-exporter support (gRPC/HTTP/Console) | ❌ Removed | Only gRPC exporter available |
| Graceful degradation | ❌ Removed | Hard ImportError if opentelemetry not installed |
| Auto-discovery constructor compatibility | ❌ Removed | Incompatible with model_scan mechanism |

Proposal

I propose re-introducing these features incrementally through smaller, focused PRs, making each change easier to review and discuss:

Phase 1: Foundation (Small PRs)

  1. Graceful degradation — Silently disable when opentelemetry is not installed, instead of crashing
  2. Constructor compatibility — Support both model_scan(system_app, tracer_parameters) and legacy (service_name) signatures
  3. Rich Resource attributes — Add service.namespace, deployment.environment, host.name, etc.
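To make Phase 1 concrete, here is a minimal sketch of how the three items could fit together. It is not the code from PR #130: the class name `TracingStorageSketch`, the `service_name` attribute on `tracer_parameters`, the fallback name `"openderisk"`, and the environment variables `SERVICE_NAMESPACE` and `DEPLOY_ENV` are all assumptions made for illustration.

```python
import os
import socket

try:
    from opentelemetry.sdk.resources import Resource
    from opentelemetry.sdk.trace import TracerProvider
    OTEL_AVAILABLE = True
except ImportError:
    # Graceful degradation: note that tracing is unavailable instead of crashing.
    Resource = TracerProvider = None
    OTEL_AVAILABLE = False


def build_resource_attributes(service_name):
    """Resource attributes named in the proposal; the values are assumptions."""
    return {
        "service.name": service_name,
        "service.namespace": os.getenv("SERVICE_NAMESPACE", "openderisk"),
        "deployment.environment": os.getenv("DEPLOY_ENV", "development"),
        "host.name": socket.gethostname(),
    }


class TracingStorageSketch:
    """Hypothetical stand-in for the storage class in opentelemetry.py."""

    def __init__(self, system_app=None, tracer_parameters=None, service_name=None):
        # Legacy call style: TracingStorageSketch("my-service")
        if isinstance(system_app, str):
            service_name, system_app = system_app, None
        self.system_app = system_app
        # Assumption: tracer_parameters may carry a service_name attribute.
        self.service_name = (
            service_name
            or getattr(tracer_parameters, "service_name", None)
            or "openderisk"
        )
        # When the SDK is missing, the instance stays usable but inert.
        self.enabled = OTEL_AVAILABLE
        self.provider = None
        if self.enabled:
            resource = Resource.create(build_resource_attributes(self.service_name))
            self.provider = TracerProvider(resource=resource)
```

With this shape, both `TracingStorageSketch("my-service")` and `TracingStorageSketch(system_app, tracer_parameters)` construct successfully, and callers can check `enabled` before exporting anything.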

Phase 2: Standards Compliance

  1. SpanKind mapping — SERVER for HTTP entry (Webserver), CLIENT for LLM calls (ModelWorker), INTERNAL for others
  2. Semantic Conventions — Standard OTel attributes for HTTP (http.request.method, http.route) and GenAI (gen_ai.request.model, gen_ai.usage.*)
  3. Span Status & Events — OK/ERROR status, exception events, token usage events
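As a rough illustration of Phase 2, the sketch below keeps the mapping and attribute logic in plain functions and only touches the OpenTelemetry API in the two helpers at the end. The component keys (`"webserver"`, `"modelworker"`) and all function names are assumptions for this example; the attribute keys follow the OTel HTTP and GenAI semantic conventions mentioned above.

```python
# Assumption: each span carries a component name such as "Webserver" or "ModelWorker".
KIND_BY_COMPONENT = {
    "webserver": "SERVER",    # HTTP entry point
    "modelworker": "CLIENT",  # outbound LLM call
}


def span_kind_name(component):
    """Return the SpanKind member name for a component, defaulting to INTERNAL."""
    return KIND_BY_COMPONENT.get((component or "").lower(), "INTERNAL")


def http_attributes(method, route):
    """Standard HTTP attributes for a SERVER span."""
    return {"http.request.method": method.upper(), "http.route": route}


def genai_attributes(model, input_tokens=None, output_tokens=None):
    """GenAI attributes for a CLIENT span; token counts are optional."""
    attrs = {"gen_ai.request.model": model}
    if input_tokens is not None:
        attrs["gen_ai.usage.input_tokens"] = input_tokens
    if output_tokens is not None:
        attrs["gen_ai.usage.output_tokens"] = output_tokens
    return attrs


def to_otel_kind(component):
    """Resolve the mapped name to opentelemetry.trace.SpanKind (needs the API package)."""
    from opentelemetry.trace import SpanKind
    return SpanKind[span_kind_name(component)]


def finish_span(span, error=None):
    """Set OK/ERROR status and record the exception event, if any."""
    from opentelemetry.trace import Status, StatusCode
    if error is None:
        span.set_status(Status(StatusCode.OK))
    else:
        span.record_exception(error)
        span.set_status(Status(StatusCode.ERROR, str(error)))
```

Keeping the mapping as data makes it easy to review in a small PR, and the lazy imports mean the pure helpers still work when opentelemetry is not installed.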

Phase 3: Production Hardening

  1. Stale span cleanup — Daemon thread to prevent memory leaks (configurable TTL)
  2. Multi-exporter support — OTLP/gRPC (default), OTLP/HTTP, Console (dev mode)
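For Phase 3, one possible shape is a small registry that a daemon thread sweeps on an interval, plus an exporter factory that imports each exporter lazily. This is a sketch under assumptions, not the removed implementation: `StaleSpanRegistry`, `create_exporter`, the default TTL of 300 seconds, and the injectable `clock` are all hypothetical. The exporter classes themselves are the standard ones from the OpenTelemetry Python packages.

```python
import threading
import time

SUPPORTED_EXPORTERS = ("grpc", "http", "console")


class StaleSpanRegistry:
    """Tracks open spans and drops those older than a configurable TTL."""

    def __init__(self, ttl_seconds=300.0, sweep_interval=60.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._interval = sweep_interval
        self._clock = clock
        self._lock = threading.Lock()
        self._open = {}  # span_id -> (span, registered_at)
        self._stop = threading.Event()
        self._thread = None

    def register(self, span_id, span):
        with self._lock:
            self._open[span_id] = (span, self._clock())

    def finish(self, span_id):
        """Remove and return a span that ended normally (None if unknown)."""
        with self._lock:
            entry = self._open.pop(span_id, None)
        return entry[0] if entry else None

    def sweep(self):
        """Remove spans older than the TTL; end them if they expose end()."""
        cutoff = self._clock() - self._ttl
        with self._lock:
            stale = [sid for sid, (_, ts) in self._open.items() if ts < cutoff]
            removed = [self._open.pop(sid)[0] for sid in stale]
        for span in removed:
            end = getattr(span, "end", None)
            if callable(end):
                end()
        return stale

    def start(self):
        """Start the background sweeper; daemon=True so it never blocks shutdown."""
        if self._thread is None:
            self._thread = threading.Thread(
                target=self._run, name="stale-span-sweeper", daemon=True
            )
            self._thread.start()

    def stop(self):
        self._stop.set()

    def _run(self):
        # Event.wait returns False on timeout, True once stop() is called.
        while not self._stop.wait(self._interval):
            self.sweep()

    def __len__(self):
        with self._lock:
            return len(self._open)


def create_exporter(kind="grpc", endpoint=None):
    """Build a span exporter by name; imports are lazy so unused ones need not be installed."""
    if kind not in SUPPORTED_EXPORTERS:
        raise ValueError(f"unsupported exporter {kind!r}; expected one of {SUPPORTED_EXPORTERS}")
    if kind == "console":
        from opentelemetry.sdk.trace.export import ConsoleSpanExporter
        return ConsoleSpanExporter()
    if kind == "grpc":
        from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
    else:
        from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
    return OTLPSpanExporter(endpoint=endpoint)
```

Both pieces could sit behind configuration: a TTL of zero or a missing exporter setting would simply skip `start()` or fall back to the gRPC default.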

Why This Matters

For traces to be useful in tools like Jaeger, Grafana Tempo, or Datadog, they must follow OpenTelemetry Semantic Conventions. Without proper SpanKind, semantic attributes, and error handling, traces are hard to query, filter, and visualize.

This is especially important for OpenDerisk's multi-agent architecture, where understanding the full request lifecycle across HTTP → Agent → LLM → Tool calls is critical for debugging and performance optimization.

Reference

Questions for Maintainers

  1. Is the incremental approach acceptable, or do you prefer a single comprehensive PR?
  2. Are there specific features from PR #130 that you'd like to keep simplified?
  3. Should advanced features (multi-exporter, stale cleanup) be behind a configuration flag?

I'm happy to discuss and adjust the plan based on the team's preferences. Looking forward to your feedback! 🙏
