Implementation Plan
Goals
- Accept telemetry from any standard OpenTelemetry SDK (Python, JS, Go, Java, .NET, Rust) by pointing OTEL_EXPORTER_OTLP_ENDPOINT at Monlight.
- Persist OTLP logs into logs.db, OTLP metrics into metrics.db, and OTLP spans into a new traces.db, reusing the existing SQLite modules and retention strategy.
- Preserve the "each service < 20MB, total stack < 50MB RAM" constraint. No full protobuf runtime, no gRPC stack, no extra daemons.
- Keep the existing native JSON APIs untouched. OTLP is additive.
Non-goals (v1)
- OTLP/gRPC. Needs HTTP/2 + full protobuf; defer until OTLP/HTTP is solid. Document that users should set OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf or http/json.
- OpenTelemetry Collector feature parity (processors, samplers, tail sampling, routing). Monlight stays a terminal receiver, not a collector.
- OTLP push-based profiling signal. Still experimental in the spec.
- Remote configuration / OpAMP.
Architecture
Two new components, both Zig + SQLite, matching the existing microservice pattern:
            OTel SDK (any lang)
                    |
              OTLP/HTTP POST
      /v1/traces /v1/metrics /v1/logs
                    |
                    v
            +-------+-------+
            | otlp-receiver |  :5014 (new)
            |   stateless   |
            | protobuf+JSON |
            +---+---+---+---+
                |   |   |
                |   |   +---> metrics-collector DB writer (reuses shared/sqlite.zig)
                |   +-------> log-viewer DB writer
                +-----------> trace-viewer DB writer
                                    |
                            +-------+-------+
                            | trace-viewer  |  :5015 (new)
                            |  query + UI   |
                            +-------+-------+
                                    |
                               [traces.db]
Decision: separate receiver vs. extending existing services. A dedicated otlp-receiver keeps OTLP parsing (protobuf decoder, attribute flattening, resource merging) in one place, so each existing service stays focused on one wire format. The receiver writes directly to the three SQLite files rather than forwarding HTTP to the other services — this avoids double-serialisation and keeps the per-signal path under one syscall batch. The trade-off is that the receiver must link the same DB modules as the downstream services; acceptable because shared/sqlite.zig is already the common layer.
Trace Viewer is a separate binary because its query surface (span tree assembly, waterfall UI, trace-by-id lookup, service dependency graph) has no overlap with the three existing services.
Phase 1 — OTLP/HTTP receiver skeleton
- New otlp-receiver/ service (src/main.zig, src/config.zig, src/http_router.zig).
- Routes: POST /v1/traces, POST /v1/metrics, POST /v1/logs, GET /health.
- Content-Type dispatch: application/x-protobuf → protobuf decoder, application/json → std.json.
- Auth: accept Authorization: Bearer <key> and X-API-Key (OTel SDKs set the former via OTEL_EXPORTER_OTLP_HEADERS).
- Response bodies follow the OTLP spec (ExportTraceServiceResponse etc., with partial_success on partial rejects).
- Wire up the same shared/rate_limit.zig and shared/auth.zig used by other services.
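The Content-Type dispatch and the two accepted auth headers can be sketched as follows. This is an illustration in Python rather than the Zig that the service would be written in; the function names (`pick_decoder`, `extract_api_key`, `is_authorized`) are hypothetical, and mapping the `ValueError` to an HTTP 415 response is an assumption, not something the plan specifies.

```python
import hmac
from typing import Mapping, Optional


def pick_decoder(content_type: str, accept_json: bool = True) -> str:
    """Return "protobuf" or "json"; raise ValueError for anything else."""
    # Drop parameters such as "; charset=utf-8" and normalise case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type == "application/x-protobuf":
        return "protobuf"
    if media_type == "application/json" and accept_json:
        return "json"
    raise ValueError(f"unsupported media type: {media_type!r}")


def extract_api_key(headers: Mapping[str, str]) -> Optional[str]:
    """Prefer Authorization: Bearer <key>, fall back to X-API-Key."""
    lowered = {name.lower(): value for name, value in headers.items()}
    scheme, _, token = lowered.get("authorization", "").partition(" ")
    if scheme.lower() == "bearer" and token.strip():
        return token.strip()
    return lowered.get("x-api-key") or None


def is_authorized(headers: Mapping[str, str], expected_key: str) -> bool:
    """Compare in constant time so response timing does not leak key prefixes."""
    presented = extract_api_key(headers)
    if presented is None:
        return False
    return hmac.compare_digest(presented.encode(), expected_key.encode())
```

Header names are matched case-insensitively because HTTP header names are case-insensitive and SDKs differ in how they capitalise the keys taken from OTEL_EXPORTER_OTLP_HEADERS.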
Phase 2 — Minimal protobuf decoder
- Hand-rolled streaming decoder in shared/otlp_proto.zig, supporting only the OTLP message subset:
  - ExportTraceServiceRequest, ResourceSpans, ScopeSpans, Span, Status, KeyValue, AnyValue, ArrayValue.
  - ExportMetricsServiceRequest, ResourceMetrics, ScopeMetrics, Metric, Sum, Gauge, Histogram, NumberDataPoint, HistogramDataPoint.
  - ExportLogsServiceRequest, ResourceLogs, ScopeLogs, LogRecord, SeverityNumber.
- Decoder is tag/wire-type driven, zero-copy into an arena allocator owned by the request handler.
- Generate field-number constants from .proto files checked into shared/otlp/ (vendored from the opentelemetry-proto repo at a pinned commit so upgrades are auditable).
- Fuzz the decoder with zig build fuzz-otlp against malformed payloads before it touches the DB layer.
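To make "tag/wire-type driven" concrete, here is a minimal sketch of a protobuf field iterator, written in Python for illustration only (the plan calls for Zig in shared/otlp_proto.zig). The names `read_varint`, `iter_fields`, and `_take` are hypothetical. Length-delimited values are returned as `memoryview` slices of the input, which mirrors the zero-copy idea; every length is bounds-checked, since malformed input is exactly what the fuzz target would feed it.

```python
from typing import Iterator, Tuple, Union

WIRE_VARINT, WIRE_I64, WIRE_LEN, WIRE_I32 = 0, 1, 2, 5
FieldValue = Union[int, memoryview]


def read_varint(buf, pos: int) -> Tuple[int, int]:
    """Decode a base-128 varint starting at pos; return (value, next_pos)."""
    result, shift = 0, 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        if shift >= 70:
            raise ValueError("varint longer than 10 bytes")
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            return result & 0xFFFFFFFFFFFFFFFF, pos
        shift += 7


def _take(view: memoryview, pos: int, length: int) -> Tuple[memoryview, int]:
    """Slice length bytes without copying, refusing to read past the end."""
    end = pos + length
    if end > len(view):
        raise ValueError("field overruns buffer")
    return view[pos:end], end


def iter_fields(buf) -> Iterator[Tuple[int, int, FieldValue]]:
    """Yield (field_number, wire_type, value) for one message's top level."""
    view = memoryview(buf)
    pos = 0
    while pos < len(view):
        key, pos = read_varint(view, pos)
        field_number, wire_type = key >> 3, key & 0x07
        if field_number == 0:
            raise ValueError("field number 0 is invalid")
        if wire_type == WIRE_VARINT:
            value, pos = read_varint(view, pos)
        elif wire_type == WIRE_LEN:
            length, pos = read_varint(view, pos)
            value, pos = _take(view, pos, length)
        elif wire_type in (WIRE_I64, WIRE_I32):
            raw, pos = _take(view, pos, 8 if wire_type == WIRE_I64 else 4)
            value = int.from_bytes(raw, "little")
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        yield field_number, wire_type, value


# A hand-encoded fragment of a Span: name (field 5), kind (field 6),
# start_time_unix_nano (field 7, fixed64).
payload = (
    bytes([0x2A, 5]) + b"GET /"
    + bytes([0x30, 2])
    + bytes([0x39]) + (1_700_000_000_000_000_000).to_bytes(8, "little")
)
decoded = [
    (number, bytes(value) if isinstance(value, memoryview) else value)
    for number, _, value in iter_fields(payload)
]
```

Nested messages such as ResourceSpans or ScopeSpans arrive as length-delimited fields, so decoding them amounts to calling `iter_fields` again on the returned slice.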
Phase 3 — Data model mapping
Logs → logs.db: existing schema already has timestamp, level, body, container, source. Add columns trace_id TEXT, span_id TEXT, resource_json TEXT, attributes_json TEXT, service_name TEXT (materialized from resource.attributes["service.name"] for fast filtering). SeverityNumber maps to the existing level enum via a lookup table. Non-Docker-container log sources get container = NULL; the UI already tolerates this, but filters need updating.
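The SeverityNumber lookup and the service_name materialization might look like the sketch below (Python for illustration). The OTLP severity ranges are fixed by the spec (1-4 TRACE, 5-8 DEBUG, 9-12 INFO, 13-16 WARN, 17-20 ERROR, 21-24 FATAL); the level strings, the fallback to "info" for unspecified values, and both function names are assumptions, since the existing level enum is not shown here.

```python
from typing import Mapping, Optional

# Assumed spelling of the existing level enum; adjust to the real one.
LEVELS = ("trace", "debug", "info", "warn", "error", "fatal")


def severity_to_level(severity_number: int, default: str = "info") -> str:
    """Map an OTLP SeverityNumber (1..24) onto six coarse levels."""
    if 1 <= severity_number <= 24:
        return LEVELS[(severity_number - 1) // 4]
    # 0 is SEVERITY_NUMBER_UNSPECIFIED; anything else is out of range.
    return default


def service_name_of(resource_attributes: Mapping[str, object]) -> Optional[str]:
    """Value for the materialized service_name column, or None if absent."""
    value = resource_attributes.get("service.name")
    return str(value) if value not in (None, "") else None
```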
Metrics → metrics.db:
- Sum (monotonic) → counter
- Sum (non-monotonic) or Gauge → gauge
- Histogram → histogram (bucket boundaries become a JSON blob; existing aggregation code needs a path that trusts pre-bucketed data instead of recomputing).
- ExponentialHistogram → unsupported in v1; recorded as partial_success with a rejection reason.
- OTLP resource + scope attributes flatten to the existing labels map with reserved prefixes (resource.*, scope.*) to avoid collision.
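The type mapping and the prefixed label flattening can be sketched as below, in Python for illustration. The `data_kind` strings and both function names are hypothetical. The sketch ignores aggregation temporality (delta vs. cumulative), which the mapping above does not address.

```python
from typing import Dict, Mapping, Optional


def map_metric_type(data_kind: str, is_monotonic: bool = False) -> Optional[str]:
    """Return the native metric type, or None when the point must be rejected."""
    if data_kind == "sum":
        return "counter" if is_monotonic else "gauge"
    if data_kind in ("gauge", "histogram"):
        return data_kind
    # e.g. "exponential_histogram": the caller counts it in partial_success.
    return None


def flatten_labels(
    resource_attrs: Mapping[str, object],
    scope_attrs: Mapping[str, object],
    point_attrs: Mapping[str, object],
) -> Dict[str, str]:
    """Merge the three attribute sets into one flat labels map."""
    labels = {key: str(value) for key, value in point_attrs.items()}
    # Prefixed keys are written last, so a data-point attribute that happens
    # to be named "resource.x" cannot shadow the real resource attribute.
    for key, value in scope_attrs.items():
        labels[f"scope.{key}"] = str(value)
    for key, value in resource_attrs.items():
        labels[f"resource.{key}"] = str(value)
    return labels
```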
Traces → traces.db (new):
CREATE TABLE spans (
    trace_id        BLOB    NOT NULL,  -- 16 bytes
    span_id         BLOB    NOT NULL,  -- 8 bytes
    parent_span_id  BLOB,              -- 8 bytes, nullable for root
    name            TEXT    NOT NULL,
    kind            INTEGER NOT NULL,  -- 0..5 per OTLP SpanKind
    start_unix_nano INTEGER NOT NULL,
    end_unix_nano   INTEGER NOT NULL,
    duration_ns     INTEGER NOT NULL,  -- materialized for sort/index
    status_code     INTEGER NOT NULL,  -- 0=unset,1=ok,2=error
    status_message  TEXT,
    service_name    TEXT    NOT NULL,
    scope_name      TEXT,
    attributes_json TEXT    NOT NULL,
    events_json     TEXT,              -- list of SpanEvent
    links_json      TEXT,              -- list of SpanLink
    resource_json   TEXT    NOT NULL,
    ingested_at     INTEGER NOT NULL,
    PRIMARY KEY (trace_id, span_id)
);
CREATE INDEX idx_spans_trace      ON spans(trace_id);
CREATE INDEX idx_spans_service_ts ON spans(service_name, start_unix_nano DESC);
CREATE INDEX idx_spans_duration   ON spans(service_name, duration_ns DESC);
CREATE VIRTUAL TABLE spans_fts USING fts5(name, attributes_json, content='spans', content_rowid='rowid');
Retention mirrors metrics: RETENTION_TRACES env var (default 7 days), periodic DELETE FROM spans WHERE ingested_at < ? sweep.
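A minimal sketch of the retention sweep, using Python's sqlite3 module against a trimmed-down copy of the spans table. The unit of ingested_at is not stated in the schema; Unix seconds is assumed here, and the function name `sweep_expired_spans` is hypothetical.

```python
import sqlite3

SECONDS_PER_DAY = 86_400


def sweep_expired_spans(conn: sqlite3.Connection, now_unix_s: int,
                        retention_days: int = 7) -> int:
    """Delete spans ingested before the retention cutoff; return rows removed."""
    cutoff = now_unix_s - retention_days * SECONDS_PER_DAY
    with conn:  # commits on success, rolls back on error
        cursor = conn.execute(
            "DELETE FROM spans WHERE ingested_at < ?", (cutoff,)
        )
    return cursor.rowcount


# Demo on an in-memory database with only the columns the sweep touches.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE spans (
           trace_id    BLOB    NOT NULL,
           span_id     BLOB    NOT NULL,
           name        TEXT    NOT NULL,
           ingested_at INTEGER NOT NULL,  -- assumed: Unix seconds
           PRIMARY KEY (trace_id, span_id)
       )"""
)
now = 1_700_000_000
trace_id = bytes(15) + b"\x01"
conn.executemany(
    "INSERT INTO spans VALUES (?, ?, ?, ?)",
    [
        (trace_id, bytes(7) + b"\x01", "old", now - 8 * SECONDS_PER_DAY),
        (trace_id, bytes(7) + b"\x02", "fresh", now - 1 * SECONDS_PER_DAY),
    ],
)
deleted = sweep_expired_spans(conn, now)
remaining = [row[0] for row in conn.execute("SELECT name FROM spans")]
```

The schema as written has no index on ingested_at, so this DELETE scans the whole table; whether that matters at the expected span volume is worth measuring. Because spans_fts is an external-content FTS5 table, its index would also need to be kept in sync on delete (for example with triggers), which this sketch leaves out.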
Phase 4 — Trace Viewer service
- New trace-viewer/ service on :5015.
- Endpoints:
  - GET /api/traces: list recent traces, filters: service, min_duration, status, search (FTS5), time range.
  - GET /api/traces/{trace_id}: all spans for a trace, ordered for tree assembly.
  - GET /api/services: distinct service_name values with span counts.
  - GET /api/dependencies: service-to-service edge list derived from parent/child service boundaries (computed on demand over the last 1h, cached for 60s).
  - GET /health.
- Web UI: waterfall view (uPlot rectangles reusing the metrics-collector chart library), span detail panel, service dependency graph rendered with a tiny SVG layout (no D3, to keep the bundle small).
- UI ships as embedded static assets, same pattern as log-viewer/src/static/.
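The edge list behind GET /api/dependencies could be derived as in this sketch (Python for illustration; the dict-based span shape and the function name `dependency_edges` are hypothetical). An edge is counted whenever a span's parent belongs to a different service; spans whose parent is missing from the window are skipped.

```python
from collections import Counter
from typing import Dict, List, Tuple

Span = Dict[str, object]


def dependency_edges(spans: List[Span]) -> Counter:
    """Count (caller_service, callee_service) pairs across service boundaries."""
    # span_id is only unique within a trace, so key on (trace_id, span_id).
    service_by_span = {
        (span["trace_id"], span["span_id"]): span["service_name"]
        for span in spans
    }
    edges: Counter = Counter()
    for span in spans:
        parent_key = (span["trace_id"], span["parent_span_id"])
        parent_service = service_by_span.get(parent_key)
        if parent_service is None:
            continue  # root span, or parent outside the queried window
        if parent_service != span["service_name"]:
            edges[(parent_service, span["service_name"])] += 1
    return edges


def span(trace: str, sid: str, parent, service: str) -> Span:
    """Tiny helper for building demo rows."""
    return {"trace_id": trace, "span_id": sid,
            "parent_span_id": parent, "service_name": service}


demo = [
    span("t1", "a", None, "frontend"),
    span("t1", "b", "a", "frontend"),   # same service: no edge
    span("t1", "c", "b", "checkout"),   # frontend -> checkout
    span("t1", "d", "c", "db"),         # checkout -> db
    span("t2", "a", None, "frontend"),
    span("t2", "b", "a", "checkout"),   # frontend -> checkout again
    span("t3", "x", "missing", "db"),   # orphan: skipped
]
edges = dependency_edges(demo)
```

In the service itself the same join could be done in SQL over the last hour of spans; the in-memory version here is only meant to pin down what counts as an edge.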
Phase 5 — Deployment integration
- Add otlp-receiver and trace-viewer to docker-compose.yml and deploy/docker-compose.monitoring.yml.
- New env vars documented in the README table (see below).
- Volume mounts: trace-viewer shares ./data/traces.db, otlp-receiver mounts all three DB paths read-write.
- Makefile: release-otlp-receiver, release-trace-viewer, updated release-services.
- CI workflow (.github/workflows/): add build+test jobs for the two new services mirroring the existing matrix.
Phase 6 — Conformance testing
The OpenTelemetry project does not publish a drop-in receiver conformance suite, so we build a pragmatic equivalent:
- Round-trip tests (fast, in-repo): zig build test generates encoded OTLP payloads using the vendored .proto definitions, posts them to a locally spawned receiver, and asserts DB rows match.
- SDK compatibility tests (slow, CI-gated): deploy/otlp-conformance/ containers running the official OTel SDKs for Python, JS, Go, Java, .NET, and Rust, each emitting a canned workload against the receiver. Assertions run from outside via the query endpoints.
- Protocol negative tests: malformed protobuf, oversized payloads, unsupported ExponentialHistogram, missing service.name. All must produce the spec-compliant partial_success or 4xx response.
- Interop with otel-cli: smoke-test script at deploy/smoke-test.sh that pipes otel-cli output through the receiver and reads it back from trace-viewer.
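For the negative tests it helps to pin down what a partial_success body looks like in OTLP/JSON. The sketch below (Python, hypothetical function name `export_response_json`) builds the JSON form of the three Export*ServiceResponse messages. It assumes the standard protobuf JSON mapping, under which field names are lowerCamelCase and int64 counters are written as decimal strings; the protobuf-encoded variant of the response is not shown.

```python
import json

# JSON name of the "rejected" counter in each signal's partial-success message.
REJECTED_FIELD = {
    "traces": "rejectedSpans",
    "metrics": "rejectedDataPoints",
    "logs": "rejectedLogRecords",
}


def export_response_json(signal: str, rejected: int = 0,
                         error_message: str = "") -> str:
    """Build the JSON body for a full or partial success response."""
    if rejected == 0 and not error_message:
        return "{}"  # full success: partial_success is left unset
    partial = {}
    if rejected:
        partial[REJECTED_FIELD[signal]] = str(rejected)
    if error_message:
        partial["errorMessage"] = error_message
    return json.dumps({"partialSuccess": partial})


body = export_response_json(
    "metrics", rejected=3,
    error_message="ExponentialHistogram is not supported",
)
```

A test for the unsupported-ExponentialHistogram case could then assert on the parsed body rather than on the raw string, so key order does not matter.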
Phase 7 — Documentation & migration guide
- New docs/otlp.md with: endpoint URLs, auth header format, supported signals, unsupported features, size limits, per-signal semantics.
- Migration guide: side-by-side env var table for popular SDKs showing how to point them at Monlight, e.g.
  OTEL_EXPORTER_OTLP_ENDPOINT=http://monlight.local:5014
  OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
  OTEL_EXPORTER_OTLP_HEADERS="authorization=Bearer <API_KEY>"
- Update the README architecture diagram to show the two new services.
- Release notes entry explaining that native JSON ingestion is unchanged and both can be used simultaneously (useful during a gradual OTel rollout).
New environment variables
otlp-receiver
| Variable | Required | Default | Description |
|---|---|---|---|
| API_KEY | Yes | | Accepted via Authorization: Bearer or X-API-Key |
| TRACES_DB_PATH | No | ./data/traces.db | SQLite path for spans |
| LOGS_DB_PATH | No | ./data/logs.db | Shared with log-viewer |
| METRICS_DB_PATH | No | ./data/metrics.db | Shared with metrics-collector |
| MAX_PAYLOAD_BYTES | No | 4194304 | Reject requests larger than 4MB |
| ACCEPT_JSON | No | true | Toggle OTLP/JSON support |
| LOG_LEVEL | No | INFO | |
trace-viewer
| Variable | Required | Default | Description |
|---|---|---|---|
| API_KEY | Yes | | |
| DATABASE_PATH | No | ./data/traces.db | |
| RETENTION_DAYS | No | 7 | Days of spans to keep |
| LOG_LEVEL | No | INFO | |
Open questions
- Shared DB write access. Having otlp-receiver write to logs.db and metrics.db while the owning services also write (or read) raises the WAL contention question. Needs a quick benchmark before Phase 3 lands; fallback is an in-process HTTP forward to the owning service.
- Attribute cardinality. OTLP attributes can be unbounded. Propose a default per-label cardinality cap (e.g. 1000 distinct values per key) enforced at ingestion, configurable via env var. Matches what the metrics-collector already assumes.
- Head sampling. The receiver does no sampling, but we should document that downstream retention trims traces; users who want head sampling should configure it in their SDK.
- Binary-encoded trace IDs in URLs. GET /api/traces/{trace_id} will accept hex-encoded IDs (32 chars) to stay URL-safe; the DB stores raw BLOB for space.
- Backpressure. The OTLP/HTTP spec lets an overloaded server respond with 429 or 503, optionally with a Retry-After header; we need to decide at what queue depth the receiver starts shedding.
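Two of the questions above are small enough to sketch: parsing a hex trace ID from the URL into the 16-byte BLOB key, and a per-key cardinality cap applied at ingestion. This is Python for illustration; `parse_trace_id` and `CardinalityLimiter` are hypothetical names, and keeping the seen values in memory (rather than in SQLite) is an assumption made for brevity.

```python
import string
from collections import defaultdict
from typing import DefaultDict, Set

_HEX_DIGITS = set(string.hexdigits)


def parse_trace_id(hex_id: str) -> bytes:
    """Convert a 32-character hex trace ID into its 16 raw bytes."""
    if len(hex_id) != 32 or not set(hex_id) <= _HEX_DIGITS:
        raise ValueError("trace ID must be exactly 32 hex characters")
    raw = bytes.fromhex(hex_id)
    if raw == bytes(16):
        raise ValueError("all-zero trace ID is invalid")
    return raw


class CardinalityLimiter:
    """Admit at most max_values_per_key distinct values for each label key."""

    def __init__(self, max_values_per_key: int = 1000) -> None:
        self.max_values_per_key = max_values_per_key
        self._seen: DefaultDict[str, Set[str]] = defaultdict(set)

    def admit(self, key: str, value: str) -> bool:
        values = self._seen[key]
        if value in values:
            return True  # already counted against the cap
        if len(values) >= self.max_values_per_key:
            return False  # caller drops the label or rejects the point
        values.add(value)
        return True


limiter = CardinalityLimiter(max_values_per_key=2)
admitted = [limiter.admit("http.route", route)
            for route in ("/a", "/b", "/a", "/c")]
```

The character check runs before `bytes.fromhex` because that function skips whitespace, which would otherwise let a padded string of the right length through. What happens to a rejected value (drop the label, drop the data point, or report it in partial_success) is left open here, as it is in the plan.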
Milestones
| # | Deliverable | Exit criterion |
|---|---|---|
| M1 | Receiver skeleton, routes, auth, /health | curl against all three endpoints returns spec-compliant empty responses |
| M2 | Protobuf decoder + fuzz target | Round-trips a span emitted by the Python OTel SDK into in-memory structs |
| M3 | Logs + metrics persistence | Python SDK → receiver → visible in existing log-viewer and metrics-collector UIs |
| M4 | traces.db + trace-viewer service + UI | Waterfall renders a multi-service trace end-to-end |
| M5 | Conformance suite + docs + release | All six SDK containers green in CI; docs/otlp.md merged |