feat(cloudflare): add telemetry collection and /metrics endpoint #400
Open
vahidlazio wants to merge 9 commits into main
Conversation
Add Prometheus-compatible telemetry to the Cloudflare resolver, matching the same metric names as the WASM providers so they can share dashboards.

- Collect per-flag resolve latency and reason in the fetch handler, deferred to ctx.wait_until to keep the hot path clean
- Include telemetry deltas in WriteFlagLogsRequest via checkpoint()
- Queue consumer accumulates cross-isolate deltas into a cumulative TelemetrySnapshot persisted in KV
- Serve /metrics endpoint reading Prometheus text from KV
- Add serde derives to TelemetrySnapshot and accumulate_delta() method for reconstructing flat histograms from compressed BucketSpans
- Deployer auto-creates KV namespace (same pattern as queue creation)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Await KV put operations in update_prometheus_kv (were fire-and-forget)
- Guard against negative/oversized BucketSpan offsets in accumulate_delta
- Add race condition comment on KV read-modify-write
- Add CORS headers to /metrics endpoint for consistency
- Add unit tests for accumulate_delta: basic, negative offset, oversized

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
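The offset guards described in this commit could look roughly like the sketch below. It is not the code from the PR: the `BucketSpan` field layout, the `MAX_BUCKETS` cap, the function signature, and the convention that `offset` counts buckets from the end of the previous span are all assumptions made for illustration.

```rust
/// Hypothetical compressed span. `offset` is assumed to be the gap (in
/// buckets) from the end of the previous span, or from bucket 0 for the
/// first span; `length` is the number of consecutive counts it carries.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct BucketSpan {
    pub offset: i32,
    pub length: u32,
}

/// Assumed upper bound on the flat histogram size (illustrative value).
pub const MAX_BUCKETS: usize = 1024;

/// Adds a sparse delta onto a flat cumulative histogram.
/// Returns `false` and leaves `buckets` untouched if any span has a
/// negative offset, reaches past `MAX_BUCKETS`, or if the spans do not
/// account for exactly `counts.len()` values.
pub fn accumulate_delta(buckets: &mut Vec<u64>, spans: &[BucketSpan], counts: &[u64]) -> bool {
    // Pass 1: validate every span and resolve it to (start index, length).
    let mut ranges: Vec<(usize, usize)> = Vec::with_capacity(spans.len());
    let mut cursor: usize = 0; // end of the previous span
    let mut expected: usize = 0; // total number of counts the spans describe
    for span in spans {
        // try_from fails for negative offsets, which are rejected.
        let Ok(offset) = usize::try_from(span.offset) else { return false };
        let Ok(length) = usize::try_from(span.length) else { return false };
        let start = cursor.saturating_add(offset);
        let end = start.saturating_add(length);
        if end > MAX_BUCKETS {
            return false;
        }
        ranges.push((start, length));
        cursor = end;
        expected = expected.saturating_add(length);
    }
    if expected != counts.len() {
        return false;
    }

    // Grow the flat histogram so every validated index exists.
    if cursor > buckets.len() {
        buckets.resize(cursor, 0);
    }

    // Pass 2: apply the counts; iterators avoid unchecked indexing.
    let mut remaining = counts.iter();
    for (start, length) in ranges {
        for slot in buckets.iter_mut().skip(start).take(length) {
            if let Some(count) = remaining.next() {
                *slot = slot.saturating_add(*count);
            }
        }
    }
    true
}
```

Validating all spans before mutating anything means a malformed delta cannot leave the cumulative histogram half-updated, which matters when the result is written back to KV.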
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Allow the Clippy lints indexing_slicing and arithmetic_side_effects on the method, since bounds are checked before every index. Use saturating_add for resize. Re-sync Go WASM module.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Previously aggregate_batch discarded all telemetry data except the SDK field from the first message. Now it merges latency histograms, resolve rate counters, and gauge fields across all messages in the batch, so the Confidence backend receives aggregated telemetry matching what the WASM providers send.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
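A minimal sketch of the batch merge this commit describes, under stated assumptions: the `TelemetryDelta` struct and its fields are hypothetical stand-ins for the real message type, and the gauge policy (last reported value wins) is a guess, since the PR text does not say how gauges are combined.

```rust
use std::collections::BTreeMap;

/// Hypothetical per-message telemetry; field names are illustrative only.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct TelemetryDelta {
    pub sdk: Option<String>,
    /// Flat latency histogram bucket counts.
    pub latency_buckets: Vec<u64>,
    pub latency_sum_micros: u64,
    /// Resolve counters keyed by resolve reason.
    pub resolves_by_reason: BTreeMap<String, u64>,
    /// Gauge: memory in bytes, if the sender reported it.
    pub memory_bytes: Option<u64>,
}

/// Merges every message in the batch into a single delta.
pub fn aggregate_batch(messages: &[TelemetryDelta]) -> TelemetryDelta {
    let mut out = TelemetryDelta::default();
    for msg in messages {
        // Keep the SDK field from the first message that carries one.
        if out.sdk.is_none() {
            out.sdk = msg.sdk.clone();
        }
        // Histograms: element-wise sum, growing to the longest input.
        if msg.latency_buckets.len() > out.latency_buckets.len() {
            out.latency_buckets.resize(msg.latency_buckets.len(), 0);
        }
        for (slot, count) in out.latency_buckets.iter_mut().zip(&msg.latency_buckets) {
            *slot = slot.saturating_add(*count);
        }
        out.latency_sum_micros = out.latency_sum_micros.saturating_add(msg.latency_sum_micros);
        // Counters: sum per reason.
        for (reason, n) in &msg.resolves_by_reason {
            let entry = out.resolves_by_reason.entry(reason.clone()).or_insert(0);
            *entry = entry.saturating_add(*n);
        }
        // Gauges: ASSUMPTION, the most recent reported value wins.
        if msg.memory_bytes.is_some() {
            out.memory_bytes = msg.memory_bytes;
        }
    }
    out
}
```

Counters and histogram buckets are additive, so summing is safe regardless of message order; gauges are point-in-time values, which is why they need an explicit policy rather than a sum.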
        .evaluation_context
        .clone()
        .unwrap_or_default();
    let start = js_sys::Date::now();
Contributor
Prefer using Performance::now() for better resolution.
    };
    let body = text.unwrap_or_default();
    let headers = Headers::new();
    headers.set("Content-Type", "text/plain; version=0.0.4; charset=utf-8")?;
Contributor
I think there might be some special Content-Type we should use for Prometheus.
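For reference, the Prometheus text exposition format (version 0.0.4) is conventionally served as `text/plain; version=0.0.4`, which is the value the line under review sets. Below is a rough sketch of what a counter looks like in that format. The `reason` label name and the help text are assumptions; only the metric name `confidence_resolves_total` comes from this PR.

```rust
use std::fmt::Write;

/// Content type for the Prometheus text exposition format, version 0.0.4.
pub const PROMETHEUS_TEXT_CONTENT_TYPE: &str = "text/plain; version=0.0.4; charset=utf-8";

/// Escapes a label value: backslash, double quote, and newline.
pub fn escape_label(value: &str) -> String {
    value
        .replace('\\', "\\\\")
        .replace('"', "\\\"")
        .replace('\n', "\\n")
}

/// Renders one counter family with a hypothetical `reason` label.
pub fn render_counter(name: &str, help: &str, samples: &[(&str, u64)]) -> String {
    let mut out = String::new();
    // Writing to a String cannot fail, so the results are ignored.
    let _ = writeln!(out, "# HELP {name} {help}");
    let _ = writeln!(out, "# TYPE {name} counter");
    for (reason, value) in samples {
        let _ = writeln!(out, "{name}{{reason=\"{}\"}} {value}", escape_label(reason));
    }
    out
}
```

Each sample is one line of the form `name{label="value"} number`, preceded by optional `# HELP` and `# TYPE` comment lines for the metric family.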
    let router = Router::new();

    let response = router
        .get_async("/metrics", |_req, ctx| {
Contributor
We should use this parser to check that we can parse the output of this.
Edit: No need to do this, since it's the same serializer that we test in Go.
Contributor
Also, we could consider supporting the same options on this endpoint as we do in local.
But that could wait for a follow-up PR, which might even be better.
    cat >> wrangler.toml <<EOF

    [[kv_namespaces]]
    binding = "METRICS_KV"
Contributor
Change to CONFIDENCE_METRICS_KV
- Use performance.now() instead of Date.now() for better timing resolution
- Rename METRICS_KV binding to CONFIDENCE_METRICS_KV
- Fallback to Date.now() if performance API is unavailable

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
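The fallback and the unit conversion can be sketched in plain Rust as below. This is not the PR's code: the function names are hypothetical, and the clock readings are passed in as plain numbers so the logic can be exercised without `js_sys` or `web_sys`. Both browser clocks report milliseconds, while the latency metric is in microseconds.

```rust
/// Returns the high-resolution reading when the performance API produced
/// one, otherwise the wall-clock fallback. Both values are milliseconds.
pub fn now_ms(performance_now: Option<f64>, date_now: f64) -> f64 {
    performance_now.unwrap_or(date_now)
}

/// Converts two millisecond readings into elapsed whole microseconds.
/// Negative or non-finite differences are clamped to zero.
pub fn elapsed_micros(start_ms: f64, end_ms: f64) -> u64 {
    let micros = (end_ms - start_ms) * 1000.0;
    if micros.is_finite() && micros > 0.0 {
        micros.round() as u64
    } else {
        0
    }
}
```

One caveat worth keeping in mind: `performance.now()` and `Date.now()` count from different origins, so the start and end readings of a single measurement should come from the same clock.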
Summary
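- Add Prometheus-compatible telemetry to the Cloudflare resolver, matching the same metric names as the WASM providers (confidence_resolve_latency_microseconds, confidence_resolves_total, confidence_memory_bytes) so they can share Grafana dashboards
- Collect telemetry in work deferred to ctx.wait_until; the resolve hot path only captures two Date.now() timestamps
- Queue consumer accumulates cross-isolate deltas into a cumulative TelemetrySnapshot persisted in KV, serving a /metrics endpoint that any Prometheus scraper can hit
- Add accumulate_delta() to TelemetrySnapshot in the shared crate for reconstructing full histograms from compressed BucketSpan deltas
- Add serde::Serialize/Deserialize derives to TelemetrySnapshot (behind json feature gate)

Data flow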
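The flow summarized here (per-isolate deltas, a queue consumer that folds them into a cumulative snapshot in KV, and a /metrics handler that reads it back) can be modeled with an in-memory stand-in. Everything below is a sketch under assumptions: `InMemoryKv`, `SNAPSHOT_KEY`, the line-based encoding, and the `reason` label are hypothetical and do not reflect the real Workers KV API or the PR's serialization.

```rust
use std::collections::BTreeMap;

/// In-memory stand-in for a KV namespace (hypothetical, not the Workers API).
#[derive(Default)]
pub struct InMemoryKv {
    entries: BTreeMap<String, String>,
}

impl InMemoryKv {
    pub fn get(&self, key: &str) -> Option<&str> {
        self.entries.get(key).map(String::as_str)
    }
    pub fn put(&mut self, key: &str, value: String) {
        self.entries.insert(key.to_string(), value);
    }
}

/// Hypothetical key under which the cumulative snapshot is stored.
pub const SNAPSHOT_KEY: &str = "telemetry_snapshot";

/// Illustrative encoding: one "reason count" pair per line
/// (assumes reasons contain no spaces or newlines).
fn encode(totals: &BTreeMap<String, u64>) -> String {
    totals.iter().map(|(r, c)| format!("{r} {c}\n")).collect()
}

fn decode(text: &str) -> BTreeMap<String, u64> {
    text.lines()
        .filter_map(|line| {
            let (reason, count) = line.split_once(' ')?;
            Some((reason.to_string(), count.parse::<u64>().ok()?))
        })
        .collect()
}

/// Queue consumer: read the cumulative totals, add the batch, write back.
/// This read-modify-write is not atomic; two consumers running at the same
/// time against a real KV store could overwrite each other's update.
pub fn consume_batch(kv: &mut InMemoryKv, deltas: &[BTreeMap<String, u64>]) {
    let mut totals = kv.get(SNAPSHOT_KEY).map(decode).unwrap_or_default();
    for delta in deltas {
        for (reason, n) in delta {
            let entry = totals.entry(reason.clone()).or_insert(0);
            *entry = entry.saturating_add(*n);
        }
    }
    kv.put(SNAPSHOT_KEY, encode(&totals));
}

/// /metrics handler: render whatever the consumer last persisted.
pub fn serve_metrics(kv: &InMemoryKv) -> String {
    let totals = kv.get(SNAPSHOT_KEY).map(decode).unwrap_or_default();
    let mut body = String::from("# TYPE confidence_resolves_total counter\n");
    for (reason, count) in &totals {
        body.push_str(&format!("confidence_resolves_total{{reason=\"{reason}\"}} {count}\n"));
    }
    body
}
```

Keeping the scrape path read-only means a Prometheus scrape never blocks on, or interferes with, the consumer's write.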
Test plan
- cargo test -p confidence_resolver -- telemetry passes (22 tests)
- cargo check -p confidence-cloudflare-resolver compiles clean
- … /metrics returns valid Prometheus text
- … /metrics endpoint

🤖 Generated with Claude Code