
feat(cloudflare): add telemetry collection and /metrics endpoint #400

Open
vahidlazio wants to merge 9 commits into main from feat/cloudflare-telemetry-analytics-engine

Conversation

@vahidlazio
Contributor

Summary

  • Add Prometheus-compatible telemetry to the Cloudflare resolver with the same metric names as the WASM providers (confidence_resolve_latency_microseconds, confidence_resolves_total, confidence_memory_bytes) so they can share Grafana dashboards
  • Telemetry recording is deferred to ctx.wait_until — the resolve hot path only captures two Date.now() timestamps
  • Queue consumer accumulates cross-isolate deltas into a cumulative TelemetrySnapshot persisted in KV, serving a /metrics endpoint that any Prometheus scraper can hit
  • Deployer auto-creates the KV namespace (same pattern as existing queue creation)
  • Adds accumulate_delta() to TelemetrySnapshot in the shared crate for reconstructing full histograms from compressed BucketSpan deltas
  • Adds serde::Serialize/Deserialize derives to TelemetrySnapshot (behind json feature gate)
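To make the accumulate_delta() bullet concrete, here is a minimal sketch of how merging compressed BucketSpan deltas into a flat histogram could work. The type names follow the PR description, but the field shapes (offset, counts), the MAX_BUCKETS cap, and the skip-on-invalid-span policy are assumptions, not the actual shared-crate implementation:

```rust
/// Hypothetical compressed delta: a run of bucket counts starting at `offset`.
#[derive(Debug, Clone)]
pub struct BucketSpan {
    pub offset: i64,
    pub counts: Vec<u64>,
}

/// Hypothetical cumulative snapshot holding a flat histogram.
#[derive(Debug, Default)]
pub struct TelemetrySnapshot {
    pub buckets: Vec<u64>,
}

impl TelemetrySnapshot {
    /// Merge compressed deltas into the cumulative flat histogram,
    /// guarding against negative or oversized offsets as the PR describes.
    pub fn accumulate_delta(&mut self, spans: &[BucketSpan]) {
        const MAX_BUCKETS: usize = 1024; // assumed upper bound
        for span in spans {
            // Reject negative offsets outright.
            let Ok(start) = usize::try_from(span.offset) else { continue };
            // Reject spans that would grow the histogram past the cap.
            let end = start.saturating_add(span.counts.len());
            if end > MAX_BUCKETS {
                continue;
            }
            if self.buckets.len() < end {
                self.buckets.resize(end, 0);
            }
            for (i, c) in span.counts.iter().enumerate() {
                // Bounds are guaranteed by the checks above.
                self.buckets[start + i] = self.buckets[start + i].saturating_add(*c);
            }
        }
    }
}
```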

Data flow

Hot path:     Date.now() → resolve → Date.now() → push {elapsed_us, reasons}
wait_until:   drain pending → TELEMETRY.record/mark → delta_snapshot → Queue
Queue consumer (batched, all isolates):
  ├─ KV: read cumulative → accumulate deltas → write back + prometheus text
  └─ POST to Confidence backend (existing)
/metrics:     read prometheus text from KV
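The hot-path/deferred split above can be modeled outside the Workers runtime with plain Rust. This sketch uses std::time::Instant in place of performance.now() and a Vec drained later in place of ctx.wait_until; the names PendingResolve, resolve_with_timing, and drain_pending are illustrative, not from the PR:

```rust
use std::time::Instant;

/// One pending telemetry event captured on the hot path (hypothetical shape).
struct PendingResolve {
    elapsed_us: u64,
}

/// The hot path only takes two timestamps and pushes a small struct;
/// no aggregation happens here.
fn resolve_with_timing(pending: &mut Vec<PendingResolve>) -> &'static str {
    let start = Instant::now();
    let result = "RESOLVED"; // actual flag resolution would go here
    let elapsed_us = start.elapsed().as_micros() as u64;
    pending.push(PendingResolve { elapsed_us });
    result
}

/// Deferred work, analogous to what runs inside ctx.wait_until:
/// drain pending events and fold them into a count and a latency sum.
fn drain_pending(pending: &mut Vec<PendingResolve>) -> (u64, u64) {
    let mut total = 0u64;
    let mut sum_us = 0u64;
    for ev in pending.drain(..) {
        total += 1;
        sum_us = sum_us.saturating_add(ev.elapsed_us);
    }
    (total, sum_us)
}
```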

Test plan

  • cargo test -p confidence_resolver -- telemetry passes (22 tests)
  • cargo check -p confidence-cloudflare-resolver compiles cleanly
  • Deploy to a test account and verify /metrics returns valid Prometheus text
  • Verify Prometheus scraper can ingest the /metrics endpoint
  • Verify metric names match WASM provider output

🤖 Generated with Claude Code

Add Prometheus-compatible telemetry to the Cloudflare resolver, using
the same metric names as the WASM providers so they can share dashboards.

- Collect per-flag resolve latency and reason in the fetch handler,
  deferred to ctx.wait_until to keep the hot path clean
- Include telemetry deltas in WriteFlagLogsRequest via checkpoint()
- Queue consumer accumulates cross-isolate deltas into a cumulative
  TelemetrySnapshot persisted in KV
- Serve /metrics endpoint reading Prometheus text from KV
- Add serde derives to TelemetrySnapshot and accumulate_delta() method
  for reconstructing flat histograms from compressed BucketSpans
- Deployer auto-creates KV namespace (same pattern as queue creation)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@vahidlazio vahidlazio marked this pull request as draft May 7, 2026 09:17
vahidlazio and others added 5 commits May 7, 2026 11:19
- Await KV put operations in update_prometheus_kv (were fire-and-forget)
- Guard against negative/oversized BucketSpan offsets in accumulate_delta
- Add race condition comment on KV read-modify-write
- Add CORS headers to /metrics endpoint for consistency
- Add unit tests for accumulate_delta: basic, negative offset, oversized

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Allow indexing_slicing and arithmetic_side_effects on the method since
bounds are checked before every index. Use saturating_add for resize.
Re-sync Go WASM module.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@vahidlazio vahidlazio marked this pull request as ready for review May 7, 2026 11:50
Previously aggregate_batch discarded all telemetry data except the SDK
field from the first message. Now it merges latency histograms, resolve
rate counters, and gauge fields across all messages in the batch, so the
Confidence backend receives aggregated telemetry matching what the WASM
providers send.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
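The batch-merge fix described in that commit can be illustrated with a small model. The struct fields (sdk, resolves_total, latency_buckets) and the function shape are assumptions standing in for the real aggregate_batch; counters add and histogram buckets add element-wise instead of keeping only the first message:

```rust
/// Minimal model of a per-message telemetry payload (fields assumed).
#[derive(Clone)]
struct TelemetryDelta {
    sdk: String,
    resolves_total: u64,
    latency_buckets: Vec<u64>,
}

/// Merge every message in a batch: the first message seeds the
/// accumulator (keeping its SDK field), then counters and histogram
/// buckets from the rest are folded in.
fn aggregate_batch(batch: &[TelemetryDelta]) -> Option<TelemetryDelta> {
    let mut iter = batch.iter();
    let mut acc = iter.next()?.clone();
    for msg in iter {
        acc.resolves_total = acc.resolves_total.saturating_add(msg.resolves_total);
        if acc.latency_buckets.len() < msg.latency_buckets.len() {
            acc.latency_buckets.resize(msg.latency_buckets.len(), 0);
        }
        for (i, c) in msg.latency_buckets.iter().enumerate() {
            acc.latency_buckets[i] = acc.latency_buckets[i].saturating_add(*c);
        }
    }
    Some(acc)
}
```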
.evaluation_context
.clone()
.unwrap_or_default();
let start = js_sys::Date::now();
Prefer using Performance::now() for better resolution.

};
let body = text.unwrap_or_default();
let headers = Headers::new();
headers.set("Content-Type", "text/plain; version=0.0.4; charset=utf-8")?;
I think there might be some special Content-Type we should use for Prometheus.
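For reference, text/plain; version=0.0.4 is the conventional content type for the Prometheus text exposition format, and the body the endpoint serves follows that format. A minimal renderer sketch (the metric name matches the PR; the label set and function name are illustrative):

```rust
/// Render one counter in the Prometheus text exposition format.
/// Each metric gets a `# TYPE` line followed by `name{labels} value`.
fn render_counter(name: &str, labels: &[(&str, &str)], value: u64) -> String {
    let label_str = labels
        .iter()
        .map(|(k, v)| format!("{k}=\"{v}\""))
        .collect::<Vec<_>>()
        .join(",");
    format!("# TYPE {name} counter\n{name}{{{label_str}}} {value}\n")
}
```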

let router = Router::new();

let response = router
.get_async("/metrics", |_req, ctx| {

@andreas-karlsson May 8, 2026


We should use this parser to check that we can parse the output of this.

Edit: No need to do this since it's the same serializer that we test in Go.


Also we could consider supporting the same options on this endpoint as we do in local:

But that could also wait for a followup PR. Maybe even better.

cat >> wrangler.toml <<EOF

[[kv_namespaces]]
binding = "METRICS_KV"

Change to CONFIDENCE_METRICS_KV

vahidlazio and others added 2 commits May 8, 2026 16:54
- Use performance.now() instead of Date.now() for better timing resolution
- Rename METRICS_KV binding to CONFIDENCE_METRICS_KV
- Fallback to Date.now() if performance API is unavailable

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>