
OpenTelemetry (OTLP) protocol support #27

@mattmezza

Description

  • Implement OTLP/HTTP receiver endpoints for traces, metrics, and logs in Zig
  • Map OTLP data model to Monlight's internal SQLite schema
  • Add a new Trace Viewer service for distributed trace exploration
  • Write conformance tests against the OpenTelemetry Collector test suite
  • Documentation and migration guide for existing OTLP-instrumented applications

Implementation Plan

Goals

  1. Accept telemetry from any standard OpenTelemetry SDK (Python, JS, Go, Java, .NET, Rust) by pointing OTEL_EXPORTER_OTLP_ENDPOINT at Monlight.
  2. Persist OTLP logs into logs.db, OTLP metrics into metrics.db, and OTLP spans into a new traces.db — reusing the existing SQLite modules and retention strategy.
  3. Preserve the "each service < 20MB, total stack < 50MB RAM" constraint. No full protobuf runtime, no gRPC stack, no extra daemons.
  4. Keep the existing native JSON APIs untouched. OTLP is additive.

Non-goals (v1)

  • OTLP/gRPC. Needs HTTP/2 + full protobuf; defer until OTLP/HTTP is solid. Document that users should set OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf or http/json.
  • OpenTelemetry Collector feature parity (processors, samplers, tail sampling, routing). Monlight stays a terminal receiver, not a collector.
  • OTLP push-based profiling signal. Still experimental in the spec.
  • Remote configuration / OpAMP.

Architecture

Two new components, both Zig + SQLite, matching the existing microservice pattern:

   OTel SDK (any lang)
          |
   OTLP/HTTP POST
   /v1/traces  /v1/metrics  /v1/logs
          |
          v
   +------+-------+
   | otlp-receiver|  :5014   (new)
   | stateless    |
   | protobuf+JSON|
   +---+---+---+--+
       |   |   |
       |   |   +---> metrics-collector DB writer (reuses shared/sqlite.zig)
       |   +-------> log-viewer DB writer
       +-----------> trace-viewer DB writer
                           |
                    +------+-------+
                    | trace-viewer |  :5015   (new)
                    | query + UI   |
                    +------+-------+
                           |
                        [traces.db]

Decision: separate receiver vs. extending existing services. A dedicated otlp-receiver keeps OTLP parsing (protobuf decoder, attribute flattening, resource merging) in one place, so each existing service stays focused on one wire format. The receiver writes directly to the three SQLite files rather than forwarding HTTP to the other services — this avoids double-serialisation and keeps the per-signal path under one syscall batch. The trade-off is that the receiver must link the same DB modules as the downstream services; acceptable because shared/sqlite.zig is already the common layer.

Trace Viewer is a separate binary because its query surface (span tree assembly, waterfall UI, trace-by-id lookup, service dependency graph) has no overlap with the three existing services.

Phase 1 — OTLP/HTTP receiver skeleton

  • New otlp-receiver/ service (src/main.zig, src/config.zig, src/http_router.zig).
  • Routes: POST /v1/traces, POST /v1/metrics, POST /v1/logs, GET /health.
  • Content-Type dispatch: application/x-protobuf → protobuf decoder, application/json → std.json.
  • Auth: accept Authorization: Bearer <key> and X-API-Key (OTel SDKs set the former via OTEL_EXPORTER_OTLP_HEADERS).
  • Response bodies follow the OTLP spec (ExportTraceServiceResponse etc. with partial_success on partial rejects).
  • Wire up the same shared/rate_limit.zig and shared/auth.zig used by other services.
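
The dispatch and auth rules above can be sketched in a few lines. This is an illustrative Python sketch, not the planned Zig code: the function names are hypothetical, and the choice to answer an unknown Content-Type with something like HTTP 415 is an assumption rather than something the plan states.

```python
import hmac
from typing import Mapping, Optional


def pick_decoder(content_type: str, accept_json: bool = True) -> Optional[str]:
    """Return "protobuf", "json", or None when the media type is unsupported."""
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type == "application/x-protobuf":
        return "protobuf"
    if media_type == "application/json" and accept_json:
        return "json"
    return None  # caller would reject, e.g. with 415 (assumption)


def extract_api_key(headers: Mapping[str, str]) -> Optional[str]:
    """Accept either 'Authorization: Bearer <key>' or 'X-API-Key: <key>'."""
    lowered = {name.lower(): value for name, value in headers.items()}
    scheme, _, token = lowered.get("authorization", "").partition(" ")
    if scheme.lower() == "bearer" and token.strip():
        return token.strip()
    return lowered.get("x-api-key") or None


def is_authorized(headers: Mapping[str, str], expected_key: str) -> bool:
    presented = extract_api_key(headers)
    if presented is None:
        return False
    # Constant-time comparison avoids leaking key prefixes through timing.
    return hmac.compare_digest(presented.encode(), expected_key.encode())
```

Passing accept_json=False mirrors the ACCEPT_JSON toggle listed under the receiver's environment variables.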

Phase 2 — Minimal protobuf decoder

  • Hand-rolled streaming decoder in shared/otlp_proto.zig, supporting only the OTLP message subset:
    • ExportTraceServiceRequest, ResourceSpans, ScopeSpans, Span, Status, KeyValue, AnyValue, ArrayValue.
    • ExportMetricsServiceRequest, ResourceMetrics, ScopeMetrics, Metric, Sum, Gauge, Histogram, NumberDataPoint, HistogramDataPoint.
    • ExportLogsServiceRequest, ResourceLogs, ScopeLogs, LogRecord, SeverityNumber.
  • Decoder is tag/wire-type driven, zero-copy into arena allocator owned by the request handler.
  • Generate field-number constants from .proto files checked into shared/otlp/ (vendored from the opentelemetry-proto repo at a pinned commit so upgrades are auditable).
  • Fuzz the decoder with zig build fuzz-otlp against malformed payloads before it touches the DB layer.
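
For reference, a tag/wire-type driven decoder reduces to a loop like the one below. It is a Python sketch of the protobuf wire format only, not the proposed shared/otlp_proto.zig; the helper names are hypothetical. The example field numbers (Span.name = 5, Span.kind = 6) are quoted from memory of the upstream trace.proto and should be checked against the vendored definitions. Python slices copy bytes, whereas the plan calls for zero-copy views into the request arena.

```python
from typing import Iterator, Tuple, Union

# Protobuf wire types handled here (groups, types 3 and 4, are rejected).
VARINT, I64, LEN, I32 = 0, 1, 2, 5


def read_varint(buf: bytes, pos: int) -> Tuple[int, int]:
    """Decode one base-128 varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7
        if shift >= 70:  # more than 10 bytes cannot encode a 64-bit value
            raise ValueError("varint too long")


def _take(buf: bytes, pos: int, n: int) -> Tuple[bytes, int]:
    """Return n bytes starting at pos, failing loudly on truncated input."""
    end = pos + n
    if end > len(buf):
        raise ValueError("truncated field")
    return buf[pos:end], end


def iter_fields(buf: bytes) -> Iterator[Tuple[int, int, Union[int, bytes]]]:
    """Yield (field_number, wire_type, value) for each top-level field."""
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        field_number, wire_type = key >> 3, key & 0x07
        if wire_type == VARINT:
            value, pos = read_varint(buf, pos)
        elif wire_type == I64:
            value, pos = _take(buf, pos, 8)
        elif wire_type == LEN:
            length, pos = read_varint(buf, pos)
            value, pos = _take(buf, pos, length)
        elif wire_type == I32:
            value, pos = _take(buf, pos, 4)
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        yield field_number, wire_type, value


# A two-field Span fragment: name = "GET" (field 5), kind = 2 (field 6).
sample = bytes([0x2A, 0x03]) + b"GET" + bytes([0x30, 0x02])
decoded = list(iter_fields(sample))
```

Nested messages such as ResourceSpans arrive as LEN fields, so a full decoder would call iter_fields again on the returned bytes. Unknown field numbers can simply be skipped, which keeps the decoder compatible with newer proto revisions.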

Phase 3 — Data model mapping

Logs → logs.db: existing schema already has timestamp, level, body, container, source. Add columns trace_id TEXT, span_id TEXT, resource_json TEXT, attributes_json TEXT, service_name TEXT (materialized from resource.attributes["service.name"] for fast filtering). SeverityNumber maps to the existing level enum via a lookup table. Non-Docker-container log sources get container = NULL; the UI already tolerates this, but filters need updating.
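
A possible shape for the severity lookup and the service_name materialization is sketched below in Python. The SeverityNumber ranges (1-4 TRACE through 21-24 FATAL) follow the OpenTelemetry logs data model; the level names on the Monlight side and both function names are assumptions, since the existing level enum is not shown here.

```python
from typing import Optional, Sequence

# (low, high, level): ranges per the OTel logs data model.
# The level strings are placeholders for Monlight's existing enum.
LEVEL_BY_SEVERITY_RANGE = (
    (1, 4, "TRACE"),
    (5, 8, "DEBUG"),
    (9, 12, "INFO"),
    (13, 16, "WARN"),
    (17, 20, "ERROR"),
    (21, 24, "FATAL"),
)


def level_for_severity(severity_number: int, default: str = "INFO") -> str:
    """Map an OTLP SeverityNumber to a level; 0 (unspecified) gets the default."""
    for low, high, level in LEVEL_BY_SEVERITY_RANGE:
        if low <= severity_number <= high:
            return level
    return default


def service_name_from_resource(attributes: Sequence[dict]) -> Optional[str]:
    """Pull service.name out of OTLP/JSON-style resource attributes."""
    for attribute in attributes:
        if attribute.get("key") == "service.name":
            return attribute.get("value", {}).get("stringValue")
    return None  # caller decides how to treat a missing service.name
```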

Metrics → metrics.db:

  • Sum (monotonic) → counter
  • Sum (non-monotonic) or Gauge → gauge
  • Histogram → histogram (bucket boundaries become a JSON blob; existing aggregation code needs a path that trusts pre-bucketed data instead of recomputing).
  • ExponentialHistogram → unsupported in v1; recorded as partial_success with a rejection reason.
  • OTLP resource + scope attributes flatten to the existing labels map with reserved prefixes (resource.*, scope.*) to avoid collision.
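
The type mapping and label flattening above could look roughly like this. The Python sketch assumes the metric arrives as an OTLP/JSON-style dict (keys sum, gauge, histogram, isMonotonic) and that attributes have already been reduced to plain key/value dicts; the function names are hypothetical.

```python
from typing import Dict, Mapping, Optional


def metric_kind(metric: Mapping) -> Optional[str]:
    """Return the internal metric type, or None for unsupported kinds."""
    if "sum" in metric:
        return "counter" if metric["sum"].get("isMonotonic", False) else "gauge"
    if "gauge" in metric:
        return "gauge"
    if "histogram" in metric:
        return "histogram"
    return None  # e.g. exponentialHistogram: report via partial_success


def flatten_labels(
    resource: Mapping[str, object],
    scope: Mapping[str, object],
    point: Mapping[str, object],
) -> Dict[str, str]:
    """Merge attributes into one labels map using the reserved prefixes."""
    labels = {f"resource.{key}": str(value) for key, value in resource.items()}
    labels.update({f"scope.{key}": str(value) for key, value in scope.items()})
    # Data point attributes keep their own names; the prefixes above
    # prevent them from colliding with resource or scope attributes.
    labels.update({key: str(value) for key, value in point.items()})
    return labels
```

A data point attribute that itself begins with resource. or scope. would still collide; whether to reject or rename such keys is left open here.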

Traces → traces.db (new):

CREATE TABLE spans (
    trace_id         BLOB NOT NULL,        -- 16 bytes
    span_id          BLOB NOT NULL,        -- 8 bytes
    parent_span_id   BLOB,                 -- 8 bytes, nullable for root
    name             TEXT NOT NULL,
    kind             INTEGER NOT NULL,     -- 0..5 per OTLP SpanKind
    start_unix_nano  INTEGER NOT NULL,
    end_unix_nano    INTEGER NOT NULL,
    duration_ns      INTEGER NOT NULL,     -- materialized for sort/index
    status_code      INTEGER NOT NULL,     -- 0=unset,1=ok,2=error
    status_message   TEXT,
    service_name     TEXT NOT NULL,
    scope_name       TEXT,
    attributes_json  TEXT NOT NULL,
    events_json      TEXT,                 -- list of SpanEvent
    links_json       TEXT,                 -- list of SpanLink
    resource_json    TEXT NOT NULL,
    ingested_at      INTEGER NOT NULL,
    PRIMARY KEY (trace_id, span_id)
);
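-- Note: the PRIMARY KEY (trace_id, span_id) index already serves trace_id lookups,
-- so idx_spans_trace may be redundant.
CREATE INDEX idx_spans_trace       ON spans(trace_id);
CREATE INDEX idx_spans_service_ts  ON spans(service_name, start_unix_nano DESC);
CREATE INDEX idx_spans_duration    ON spans(service_name, duration_ns DESC);
-- External-content FTS5 table: it is not updated automatically, so it needs
-- INSERT/DELETE triggers on spans (or explicit writes) to stay in sync.
CREATE VIRTUAL TABLE spans_fts USING fts5(name, attributes_json, content='spans', content_rowid='rowid');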

Retention mirrors metrics: RETENTION_TRACES env var (default 7 days), periodic DELETE WHERE ingested_at < ? sweep.
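
To make the lookup and retention behaviour concrete, here is a small Python sqlite3 sketch against an abbreviated spans table (only the columns it touches). It assumes ingested_at holds Unix seconds, which the schema does not state, and the function names are hypothetical.

```python
import sqlite3
import time
from typing import List, Optional, Tuple

# Abbreviated version of the spans table: only the columns used below.
SCHEMA = """
CREATE TABLE spans (
    trace_id        BLOB NOT NULL,
    span_id         BLOB NOT NULL,
    parent_span_id  BLOB,
    name            TEXT NOT NULL,
    start_unix_nano INTEGER NOT NULL,
    ingested_at     INTEGER NOT NULL,  -- assumed: Unix seconds
    PRIMARY KEY (trace_id, span_id)
);
"""


def spans_for_trace(
    db: sqlite3.Connection, trace_id_hex: str
) -> List[Tuple[str, Optional[str], str]]:
    """Return (span_id, parent_span_id, name) rows for a hex-encoded trace ID."""
    if len(trace_id_hex) != 32:
        raise ValueError("trace id must be 32 hex characters")
    trace_id = bytes.fromhex(trace_id_hex)  # ValueError on non-hex input
    if len(trace_id) != 16:
        raise ValueError("trace id must decode to 16 bytes")
    rows = db.execute(
        "SELECT span_id, parent_span_id, name FROM spans "
        "WHERE trace_id = ? ORDER BY start_unix_nano",
        (trace_id,),
    ).fetchall()
    return [
        (span_id.hex(), parent.hex() if parent else None, name)
        for span_id, parent, name in rows
    ]


def sweep_expired(
    db: sqlite3.Connection, retention_days: int, now: Optional[int] = None
) -> int:
    """Delete spans older than the retention window; return rows removed."""
    now = int(time.time()) if now is None else now
    cutoff = now - retention_days * 86400
    cursor = db.execute("DELETE FROM spans WHERE ingested_at < ?", (cutoff,))
    db.commit()
    return cursor.rowcount
```

Ordering by start_unix_nano means a parent usually precedes its children, although clock skew between services can break that, so tree assembly should not rely on it.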

Phase 4 — Trace Viewer service

  • New trace-viewer/ service on :5015.
  • Endpoints:
    • GET /api/traces — list recent traces, filters: service, min_duration, status, search (FTS5), time range.
    • GET /api/traces/{trace_id} — all spans for a trace, ordered for tree assembly.
    • GET /api/services — distinct service_name values with span counts.
    • GET /api/dependencies — service-to-service edge list derived from parent/child service boundaries (computed on demand over the last 1h, cached for 60s).
    • GET /health.
  • Web UI: waterfall view (uPlot rectangles reusing the metrics-collector chart library), span detail panel, service dependency graph rendered with a tiny SVG layout (no D3 — keep bundle small).
  • UI ships as embedded static assets, same pattern as log-viewer/src/static/.
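
The edge list behind GET /api/dependencies can be derived with a single pass over recent spans once each span's parent service is known. The Python sketch below is one way to do it under simple assumptions: spans are given as (trace_id, span_id, parent_span_id, service_name) tuples, and a span whose parent is missing from the window is ignored. The function name is hypothetical.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence, Tuple

SpanRow = Tuple[Hashable, Hashable, Optional[Hashable], str]


def dependency_edges(spans: Sequence[SpanRow]) -> Counter:
    """Count caller -> callee service pairs across parent/child spans."""
    # Span IDs are only unique within a trace, so key on (trace_id, span_id).
    service_by_span = {
        (trace_id, span_id): service for trace_id, span_id, _, service in spans
    }
    edges: Counter = Counter()
    for trace_id, _span_id, parent_id, service in spans:
        parent_service = service_by_span.get((trace_id, parent_id))
        # Only a change of service between parent and child is an edge.
        if parent_service is not None and parent_service != service:
            edges[(parent_service, service)] += 1
    return edges
```

In SQL the same result could come from a self-join of spans on (trace_id, parent_span_id), restricted to the last hour of start_unix_nano.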

Phase 5 — Deployment integration

  • Add otlp-receiver and trace-viewer to docker-compose.yml and deploy/docker-compose.monitoring.yml.
  • New env vars documented in the README table (see below).
  • Volume mounts: trace-viewer shares ./data/traces.db, otlp-receiver mounts all three DB paths read-write.
  • Makefile: release-otlp-receiver, release-trace-viewer, updated release-services.
  • CI workflow (.github/workflows/): add build+test jobs for the two new services mirroring the existing matrix.

Phase 6 — Conformance testing

The OpenTelemetry project does not publish a drop-in receiver conformance suite, so we build a pragmatic equivalent:

  1. Round-trip tests (fast, in-repo): zig build test generates encoded OTLP payloads using the vendored .proto definitions, posts them to a locally spawned receiver, and asserts DB rows match.
  2. SDK compatibility tests (slow, CI-gated): deploy/otlp-conformance/ containers running the official OTel SDKs for Python, JS, Go, Java, .NET, and Rust, each emitting a canned workload against the receiver. Assertions run from outside via the query endpoints.
  3. Protocol negative tests: malformed protobuf, oversized payloads, unsupported ExponentialHistogram, missing service.name — all must produce the spec-compliant partial_success or 4xx response.
  4. Interop with otel-cli: smoke-test script under deploy/smoke-test.sh that pipes otel-cli output through the receiver and reads it back from trace-viewer.
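
For the negative tests, it helps to pin down what a partial_success body looks like in OTLP/JSON. The Python sketch below builds one; the field names (partialSuccess, rejectedSpans, rejectedDataPoints, rejectedLogRecords, errorMessage) follow the lowerCamelCase JSON mapping of the OTLP response messages, and 64-bit counts are written as strings per the proto3 JSON mapping. The helper name is hypothetical.

```python
import json

# Name of the rejected-count field for each signal's partial success message.
REJECTED_FIELD = {
    "traces": "rejectedSpans",
    "metrics": "rejectedDataPoints",
    "logs": "rejectedLogRecords",
}


def export_response(signal: str, rejected: int = 0, error_message: str = "") -> str:
    """Build the JSON body of an Export*ServiceResponse for one signal."""
    if rejected == 0 and not error_message:
        return "{}"  # full success: partial_success is left unset
    body = {
        "partialSuccess": {
            REJECTED_FIELD[signal]: str(rejected),  # int64 is a string in JSON
            "errorMessage": error_message,
        }
    }
    return json.dumps(body)
```

A test for the ExponentialHistogram case would then assert that the response carries a non-zero rejectedDataPoints and a reason, while the accepted data points still land in metrics.db.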

Phase 7 — Documentation & migration guide

  • New docs/otlp.md with: endpoint URLs, auth header format, supported signals, unsupported features, size limits, per-signal semantics.
  • Migration guide: side-by-side env var table for popular SDKs showing how to point them at Monlight, e.g.
    OTEL_EXPORTER_OTLP_ENDPOINT=http://monlight.local:5014
    OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
    OTEL_EXPORTER_OTLP_HEADERS="authorization=Bearer <API_KEY>"
    
  • Update the README architecture diagram to show the two new services.
  • Release notes entry explaining that native JSON ingestion is unchanged and both can be used simultaneously (useful during a gradual OTel rollout).
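
For users without an SDK at hand, the migration guide could also show a raw OTLP/JSON request. The Python sketch below builds one with only the standard library: it appends /v1/traces to the base endpoint (as SDKs do for OTEL_EXPORTER_OTLP_ENDPOINT) and sends the key as a Bearer token. The function name, span contents, and 5 ms duration are illustrative.

```python
import json
import secrets
import time
import urllib.request


def build_trace_request(
    endpoint: str, api_key: str, service_name: str, span_name: str
) -> urllib.request.Request:
    """Build a POST carrying a single-span OTLP/JSON trace export."""
    end_ns = time.time_ns()
    start_ns = end_ns - 5_000_000  # pretend the span took 5 ms
    payload = {
        "resourceSpans": [{
            "resource": {"attributes": [
                {"key": "service.name", "value": {"stringValue": service_name}},
            ]},
            "scopeSpans": [{
                "scope": {"name": "manual-example"},
                "spans": [{
                    # OTLP/JSON encodes IDs as hex strings, not base64.
                    "traceId": secrets.token_hex(16),
                    "spanId": secrets.token_hex(8),
                    "name": span_name,
                    "kind": 2,  # SPAN_KIND_SERVER
                    "startTimeUnixNano": str(start_ns),
                    "endTimeUnixNano": str(end_ns),
                }],
            }],
        }]
    }
    return urllib.request.Request(
        url=endpoint.rstrip("/") + "/v1/traces",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )
```

Sending it is a matter of passing the request to urllib.request.urlopen and checking for a 200 response; that step is omitted here because it needs a running receiver.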

New environment variables

otlp-receiver

Variable          | Required | Default           | Description
------------------+----------+-------------------+------------------------------------------------
API_KEY           | Yes      |                   | Accepted via Authorization: Bearer or X-API-Key
TRACES_DB_PATH    | No       | ./data/traces.db  | SQLite path for spans
LOGS_DB_PATH      | No       | ./data/logs.db    | Shared with log-viewer
METRICS_DB_PATH   | No       | ./data/metrics.db | Shared with metrics-collector
MAX_PAYLOAD_BYTES | No       | 4194304           | Reject requests larger than 4 MiB
ACCEPT_JSON       | No       | true              | Toggle OTLP/JSON support
LOG_LEVEL         | No       | INFO              |

trace-viewer

Variable       | Required | Default          | Description
---------------+----------+------------------+----------------------
API_KEY        | Yes      |                  |
DATABASE_PATH  | No       | ./data/traces.db |
RETENTION_DAYS | No       | 7                | Days of spans to keep
LOG_LEVEL      | No       | INFO             |

Open questions

  1. Shared DB write access. Having otlp-receiver write to logs.db and metrics.db while the owning services also write (or read) raises the WAL contention question. Needs a quick benchmark before Phase 3 lands; fallback is an in-process HTTP forward to the owning service.
  2. Attribute cardinality. OTLP attributes can be unbounded. Propose a default per-label cardinality cap (e.g. 1000 distinct values per key) enforced at ingestion, configurable via env var. Matches what the metrics-collector already assumes.
  3. Head sampling. The receiver does no sampling, but we should document that downstream retention trims traces; users who want head sampling should configure it in their SDK.
  4. Binary-encoded trace IDs in URLs. GET /api/traces/{trace_id} will accept hex-encoded IDs (32 chars) to stay URL-safe; the DB stores raw BLOB for space.
  5. Backpressure. The OTLP/HTTP spec lets an overloaded server answer 429 or 503, optionally with a Retry-After header; we need to decide at what queue depth the receiver starts shedding.
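
To ground the cardinality question (item 2), a per-key cap can be as small as the Python sketch below. It assumes an in-memory set per label key and leaves open what happens to a rejected value (drop the label, drop the data point, or report it via partial_success). The class name and the default of 1000 are illustrative, the latter taken from the example figure above.

```python
from collections import defaultdict
from typing import DefaultDict, Set


class CardinalityLimiter:
    """Track distinct values per label key and refuse new ones past a cap."""

    def __init__(self, max_values_per_key: int = 1000) -> None:
        self.max_values_per_key = max_values_per_key
        self._seen: DefaultDict[str, Set[str]] = defaultdict(set)

    def admit(self, key: str, value: str) -> bool:
        """Return True if (key, value) is known or still fits under the cap."""
        seen = self._seen[key]
        if value in seen:
            return True
        if len(seen) >= self.max_values_per_key:
            return False
        seen.add(value)
        return True
```

Because the receiver is described as stateless, a real implementation would have to decide whether this state is rebuilt from the DB on startup or simply resets on restart.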

Milestones

#  | Deliverable                              | Exit criterion
---+------------------------------------------+--------------------------------------------------------------------------------
M1 | Receiver skeleton, routes, auth, /health | curl against all three endpoints returns spec-compliant empty responses
M2 | Protobuf decoder + fuzz target           | Round-trips a span emitted by the Python OTel SDK into in-memory structs
M3 | Logs + metrics persistence               | Python SDK → receiver → visible in existing log-viewer and metrics-collector UIs
M4 | traces.db + trace-viewer service + UI    | Waterfall renders a multi-service trace end-to-end
M5 | Conformance suite + docs + release       | All six SDK containers green in CI; docs/otlp.md merged
