[Discussion] OpenTelemetry Tracing Enhancement - Semantic Conventions, SpanKind, and Production-Grade Features #172
Background
I previously submitted PR #130 to enhance the OpenTelemetry tracing implementation in OpenDerisk, which was reviewed and merged by @csunny. However, PR #133 subsequently simplified the opentelemetry.py file and removed most of the enhancements from PR #130.
I'd like to open this discussion to align on the desired direction for OpenTelemetry tracing in OpenDerisk, and propose a plan to incrementally re-introduce production-grade tracing features.
Current State
The current opentelemetry.py (after PR #133) provides basic span creation and export, but lacks several features that are important for production observability:
| Feature | Status | Impact |
|---|---|---|
| Semantic Conventions (HTTP/GenAI) | ❌ Removed | Traces not queryable in Jaeger/Tempo by standard attributes |
| SpanKind mapping (SERVER/CLIENT/INTERNAL) | ❌ Removed | No distinction between HTTP entry, LLM calls, and internal ops |
| Span Events (errors, token usage) | ❌ Removed | No error details or LLM usage tracking in traces |
| Span Status (OK/ERROR) | ❌ Removed | Cannot filter failed spans |
| Stale span cleanup | ❌ Removed | Memory leak risk from orphaned spans |
| Multi-exporter support (gRPC/HTTP/Console) | ❌ Removed | Only gRPC exporter available |
| Graceful degradation | ❌ Removed | Hard ImportError if opentelemetry not installed |
| Auto-discovery constructor compatibility | ❌ Removed | Incompatible with model_scan mechanism |
Proposal
I propose re-introducing these features incrementally through smaller, focused PRs, making each change easier to review and discuss:
Phase 1: Foundation (Small PRs)
- Graceful degradation: Silently disable tracing when `opentelemetry` is not installed, instead of crashing
- Constructor compatibility: Support both `model_scan(system_app, tracer_parameters)` and legacy `(service_name)` signatures
- Rich Resource attributes: Add `service.namespace`, `deployment.environment`, `host.name`, etc.
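To make Phase 1 concrete, here is a minimal sketch of how the three items could fit together. It is not the code from PR #130. The names `TracerSketch`, `build_resource_attributes`, and `_resolve_service_name`, the `"derisk"` defaults, and the assumption that `tracer_parameters` exposes a `service_name` attribute are all hypothetical and only serve the illustration.

```python
import logging
import socket
from typing import Any, Dict, Tuple

logger = logging.getLogger(__name__)

# Graceful degradation: a missing dependency flips a flag instead of raising.
try:
    from opentelemetry.sdk.resources import Resource
    from opentelemetry.sdk.trace import TracerProvider

    OTEL_AVAILABLE = True
except ImportError:
    OTEL_AVAILABLE = False


def build_resource_attributes(
    service_name: str,
    namespace: str = "derisk",  # hypothetical default
    environment: str = "development",  # hypothetical default
) -> Dict[str, str]:
    """Return the Resource attributes listed in Phase 1."""
    return {
        "service.name": service_name,
        "service.namespace": namespace,
        "deployment.environment": environment,
        "host.name": socket.gethostname(),
    }


def _resolve_service_name(
    args: Tuple[Any, ...], kwargs: Dict[str, Any], default: str = "derisk"
) -> str:
    """Accept both (service_name) and (system_app, tracer_parameters)."""
    if isinstance(kwargs.get("service_name"), str):
        return kwargs["service_name"]
    if args and isinstance(args[0], str):  # legacy signature
        return args[0]
    params = kwargs.get("tracer_parameters")
    if params is None and len(args) > 1:  # auto-discovery signature
        params = args[1]
    # Assumption: tracer_parameters carries a `service_name` attribute.
    return getattr(params, "service_name", None) or default


class TracerSketch:
    """Hypothetical tracer wrapper; becomes a no-op without opentelemetry."""

    def __init__(self, *args: Any, **kwargs: Any) -> None:
        self.service_name = _resolve_service_name(args, kwargs)
        self.enabled = OTEL_AVAILABLE
        self.tracer = None
        if not self.enabled:
            logger.debug("opentelemetry is not installed; tracing disabled")
            return
        resource = Resource.create(build_resource_attributes(self.service_name))
        self.tracer = TracerProvider(resource=resource).get_tracer(__name__)
```

Callers would check `enabled` (or a `None` tracer) before creating spans, so the rest of the application never needs to know whether the dependency is present.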
Phase 2: Standards Compliance
- SpanKind mapping: SERVER for HTTP entry (Webserver), CLIENT for LLM calls (ModelWorker), INTERNAL for others
- Semantic Conventions: Standard OTel attributes for HTTP (`http.request.method`, `http.route`) and GenAI (`gen_ai.request.model`, `gen_ai.usage.*`)
- Span Status & Events: OK/ERROR status, exception events, token usage events
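As a rough illustration of Phase 2, the sketch below maps span types to SpanKind names, builds HTTP and GenAI attribute dictionaries, and sets status on completion. The helper names and the span-type labels (`webserver`, `modelworker`) are assumptions for this example, and token usage is shown as attributes rather than events. The two span helpers import `opentelemetry` lazily, so the pure helpers work without it.

```python
from typing import Any, Dict, Optional, Union

AttrValue = Union[str, int]

# Assumed span-type labels, mirroring the Webserver and ModelWorker roles.
_KIND_BY_SPAN_TYPE = {"webserver": "SERVER", "modelworker": "CLIENT"}


def span_kind_name(span_type: str) -> str:
    """Map a span type to a SpanKind member name; INTERNAL is the fallback."""
    key = span_type.replace("_", "").lower()
    return _KIND_BY_SPAN_TYPE.get(key, "INTERNAL")


def http_attributes(
    method: str, route: str, status_code: Optional[int] = None
) -> Dict[str, AttrValue]:
    """HTTP semantic-convention attributes for a server span."""
    attrs: Dict[str, AttrValue] = {
        "http.request.method": method,
        "http.route": route,
    }
    if status_code is not None:
        attrs["http.response.status_code"] = status_code
    return attrs


def genai_attributes(
    model: str,
    input_tokens: Optional[int] = None,
    output_tokens: Optional[int] = None,
) -> Dict[str, AttrValue]:
    """GenAI semantic-convention attributes for an LLM call span."""
    attrs: Dict[str, AttrValue] = {"gen_ai.request.model": model}
    if input_tokens is not None:
        attrs["gen_ai.usage.input_tokens"] = input_tokens
    if output_tokens is not None:
        attrs["gen_ai.usage.output_tokens"] = output_tokens
    return attrs


def start_span(tracer: Any, name: str, span_type: str,
               attributes: Dict[str, AttrValue]) -> Any:
    """Start a span with the mapped kind (requires opentelemetry-api)."""
    from opentelemetry.trace import SpanKind

    kind = SpanKind[span_kind_name(span_type)]
    return tracer.start_span(name, kind=kind, attributes=attributes)


def end_span(span: Any, error: Optional[BaseException] = None) -> None:
    """Set OK/ERROR status, record the exception event, then end the span."""
    from opentelemetry.trace import Status, StatusCode

    if error is not None:
        span.record_exception(error)
        span.set_status(Status(StatusCode.ERROR, str(error)))
    else:
        span.set_status(Status(StatusCode.OK))
    span.end()
```

With this split, the mapping and attribute logic can be unit-tested without an exporter, and only `start_span` and `end_span` touch the OTel API.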
Phase 3: Production Hardening
- Stale span cleanup: A daemon thread ends and evicts spans left open longer than a configurable TTL, preventing memory leaks from orphaned spans
- Multi-exporter support: OTLP/gRPC (default), OTLP/HTTP, and Console (dev mode)
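The stale span cleanup could look roughly like the sketch below. It assumes open spans are tracked in a dictionary keyed by span ID and that each span object exposes an `end()` method, as OTel spans do. The class name `StaleSpanReaper`, its parameters, and its default values are hypothetical, not taken from PR #130.

```python
import threading
import time
from typing import Any, Callable, Dict, Optional, Tuple


class StaleSpanReaper:
    """Ends and forgets spans that stay open longer than `ttl_seconds`."""

    def __init__(
        self,
        ttl_seconds: float = 300.0,  # hypothetical default
        interval_seconds: float = 60.0,  # hypothetical default
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl_seconds
        self._interval = interval_seconds
        self._clock = clock
        self._lock = threading.Lock()
        self._open: Dict[str, Tuple[Any, float]] = {}
        self._stop = threading.Event()
        # daemon=True so the thread never blocks interpreter shutdown.
        self._thread = threading.Thread(
            target=self._run, name="stale-span-reaper", daemon=True
        )

    def start(self) -> None:
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()
        if self._thread.is_alive():
            self._thread.join()

    def track(self, span_id: str, span: Any) -> None:
        """Register a span when it starts."""
        with self._lock:
            self._open[span_id] = (span, self._clock())

    def untrack(self, span_id: str) -> Optional[Any]:
        """Forget a span that ended normally; returns it if it was tracked."""
        with self._lock:
            entry = self._open.pop(span_id, None)
        return entry[0] if entry else None

    def evict_stale(self) -> int:
        """End every span older than the TTL and return how many were evicted."""
        cutoff = self._clock() - self._ttl
        with self._lock:
            stale = [
                (sid, span)
                for sid, (span, started) in self._open.items()
                if started <= cutoff
            ]
            for sid, _ in stale:
                del self._open[sid]
        # End spans outside the lock so a slow exporter cannot block tracking.
        for _, span in stale:
            try:
                span.end()
            except Exception:
                pass  # a failing span must not kill the cleanup loop
        return len(stale)

    def _run(self) -> None:
        # Event.wait returns False on timeout, True once stop() is called.
        while not self._stop.wait(self._interval):
            self.evict_stale()
```

Injecting the clock keeps the eviction logic testable without sleeping, and the TTL and sweep interval are plain constructor arguments, so they could be wired to a configuration flag if maintainers prefer the feature to be opt-in.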
Why This Matters
For traces to be useful in tools like Jaeger, Grafana Tempo, or Datadog, they must follow OpenTelemetry Semantic Conventions. Without proper SpanKind, semantic attributes, and error handling, traces are hard to query, filter, and visualize.
This is especially important for OpenDerisk's multi-agent architecture, where understanding the full request lifecycle across HTTP → Agent → LLM → Tool calls is critical for debugging and performance optimization.
Reference
- PR #130, "[Feature] Enhance OpenTelemetry tracing with semantic conventions, SpanKind, and error handling" (merged, then reverted): full implementation with all features
- PR #133, "feat: Frontend optimization and enhance OpenTelemetry tracing" (simplified): removed most PR #130 enhancements
- PR #128, "[Feature] Add Prometheus metrics endpoint for OpenDerisk observability" (merged): Prometheus metrics, which complement tracing for complete observability
- OTel HTTP Semantic Conventions: https://opentelemetry.io/docs/specs/semconv/http/
- OTel GenAI Semantic Conventions: https://opentelemetry.io/docs/specs/semconv/gen-ai/
Questions for Maintainers
- Is the incremental approach acceptable, or do you prefer a single comprehensive PR?
- Are there specific features from PR #130 that you'd like to keep simplified?
- Should advanced features (multi-exporter, stale cleanup) be behind a configuration flag?
I'm happy to discuss and adjust the plan based on the team's preferences. Looking forward to your feedback! 🙏