This document describes the Prometheus metrics exposed by the gateway at the configured metrics path (default `/metrics`).
Phase 1 established the foundational proxy layer. Three Prometheus instruments were introduced to measure baseline gateway behavior before any routing logic existed.
| Metric | Type | Labels | Description |
|---|---|---|---|
| `draftthinker_requests_total` | Counter | model, status | Total requests by model name and HTTP status code |
| `draftthinker_upstream_latency_seconds` | Histogram | provider | Upstream LLM provider latency in seconds |
| `draftthinker_errors_total` | Counter | type | Total errors by error type |
`upstream_latency_seconds` uses the following bucket boundaries (in seconds):

0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30
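Prometheus histograms are cumulative: an observation is counted in every bucket whose upper bound (`le`) is greater than or equal to the observed value, plus the implicit `+Inf` bucket. A minimal sketch of how a latency observation maps onto the boundaries above (pure Python illustration, not the gateway's actual instrumentation):

```python
import bisect

# Bucket upper bounds for upstream_latency_seconds (from this document).
BOUNDS = [0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30]

def bucket_counts(observations):
    """Cumulative per-bucket counts plus the implicit +Inf bucket,
    mirroring Prometheus histogram semantics."""
    counts = [0] * (len(BOUNDS) + 1)  # last slot is +Inf
    for v in observations:
        # bisect_left finds the first bound >= v (buckets are `le`, inclusive).
        i = bisect.bisect_left(BOUNDS, v)
        for j in range(i, len(counts)):
            counts[j] += 1
    return counts

# A 0.3 s upstream call lands in the le=0.5 bucket and every wider one.
print(bucket_counts([0.3]))  # → [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
```

This cumulative shape is what lets `histogram_quantile` estimate percentiles from bucket counters.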
| Target | Measurement |
|---|---|
| Proxy overhead < 5ms P99 | Gateway processing time excluding upstream model inference |
Phase 1 did not include entropy or routing metrics. Those were added in Phase 2.
Phase 2 added entropy-based routing. Two new instruments track the entropy analysis and routing decision outcomes.
| Metric | Type | Labels | Description |
|---|---|---|---|
| `draftthinker_entropy_distribution` | Histogram | (none) | Distribution of per-token Shannon entropy values in bits |
| `draftthinker_routing_decisions_total` | Counter | decision | Total routing decisions by outcome (accept or escalate) |
`entropy_distribution` uses the following bucket boundaries (in bits):

0, 0.25, 0.5, 0.75, 1.0, 1.5, 2.0, 2.5, 3.0
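Per-token Shannon entropy in bits is H = -Σ p·log₂ p over the token probability distribution. A minimal sketch of the computation (where the probabilities come from, e.g. drafter logprobs, is an assumption, not something this document specifies):

```python
import math

def shannon_entropy_bits(probs):
    """Shannon entropy in bits of a token probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A confident next-token prediction has low entropy; a uniform
# distribution over 4 tokens has the maximum, exactly 2.0 bits.
confident = shannon_entropy_bits([0.97, 0.01, 0.01, 0.01])  # ≈ 0.24 bits
uniform = shannon_entropy_bits([0.25, 0.25, 0.25, 0.25])    # 2.0 bits
```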
The `decision` label on `routing_decisions_total` takes one of:

- `accept`: drafter response served directly
- `escalate`: request forwarded to the heavyweight model
Phase 3 calibration produces offline metrics (not Prometheus). These are generated by the sweep tool in benchmarks/cmd/sweep/ and written to CSV.
| Metric | Description |
|---|---|
| Escalation Rate | Fraction of requests routed to heavyweight at a given threshold |
| Draft Accuracy | Fraction of accepted drafts judged acceptable (TN / (TN + FN)) |
| Cost Reduction | 1 - estimated_cost / baseline_cost |
| Precision | TP / (TP + FP) |
| Recall | TP / (TP + FN) |
| F1 | Harmonic mean of precision and recall |
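Given the confusion-matrix cells below, the offline metrics reduce to a few lines of arithmetic. A sketch of the formulas in the table (illustrative, not the sweep tool's actual code):

```python
def offline_metrics(tp, fp, tn, fn):
    """Precision, recall, F1, and draft accuracy from confusion-matrix cells."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    # Draft accuracy: of the drafts that were accepted, how many were fine.
    draft_accuracy = tn / (tn + fn) if tn + fn else 0.0
    return precision, recall, f1, draft_accuracy

# Hypothetical counts: 8 correct escalations, 12 unnecessary ones,
# 178 correct acceptances, 2 bad answers served.
p, r, f1, acc = offline_metrics(tp=8, fp=12, tn=178, fn=2)
```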
| | Would Escalate | Would Accept |
|---|---|---|
| Draft Unacceptable | TP (correct escalation) | FN (bad answer served) |
| Draft Acceptable | FP (unnecessary cost) | TN (correct acceptance) |
The sweep tool auto-selects the threshold with the highest F1 score among thresholds where draft accuracy >= 95%. Results are written to benchmarks/results/sweep.csv with a human-readable summary printed to stdout.
| Threshold | Escalation Rate | Draft Accuracy | Cost Reduction | F1 |
|---|---|---|---|---|
| 1.00 | 68.9% | 100.0% | 8.2% | 0.06 |
| 1.25 | 49.2% | 99.6% | 31.0% | 0.08 |
| 1.50 | 30.9% | 98.6% | 56.2% | 0.07 |
| 1.75 | 13.9% | 98.4% | 81.2% | 0.10 |
| 2.00 | 6.0% | 98.2% | 91.6% | 0.10 |
| 2.25 | 0.4% | 97.9% | 99.0% | 0.00 |
| 2.50 | 0.0% | 97.9% | 99.2% | 0.00 |
Selected threshold: T=2.0 (94% draft acceptance, 98.2% accuracy, 91.6% cost reduction).
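The selection rule (highest F1 among thresholds with draft accuracy >= 95%) can be sketched over the sample sweep rows above. Note the table has two rows tied at F1 = 0.10; breaking the tie toward higher cost reduction, which reproduces the selected T=2.0, is an assumption about the tool's behavior:

```python
# (threshold, escalation_rate, draft_accuracy, cost_reduction, f1)
# Values transcribed from the sample sweep table above.
ROWS = [
    (1.00, 0.689, 1.000, 0.082, 0.06),
    (1.25, 0.492, 0.996, 0.310, 0.08),
    (1.50, 0.309, 0.986, 0.562, 0.07),
    (1.75, 0.139, 0.984, 0.812, 0.10),
    (2.00, 0.060, 0.982, 0.916, 0.10),
    (2.25, 0.004, 0.979, 0.990, 0.00),
    (2.50, 0.000, 0.979, 0.992, 0.00),
]

def select_threshold(rows, min_accuracy=0.95):
    eligible = [r for r in rows if r[2] >= min_accuracy]
    # Highest F1; ties broken by cost reduction (assumed tie-break).
    best = max(eligible, key=lambda r: (r[4], r[3]))
    return best[0]

print(select_threshold(ROWS))  # → 2.0
```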
Phase 4 added speculative execution. Three new Prometheus instruments track when the gateway fires a parallel heavyweight call, whether it gets used or cancelled, and how much latency the head start saves.
| Metric | Type | Labels | Description |
|---|---|---|---|
| `draftthinker_speculative_triggers_total` | Counter | (none) | Total speculative heavyweight calls fired (soft threshold exceeded) |
| `draftthinker_speculative_cancellations_total` | Counter | (none) | Speculative calls cancelled (drafter recovered before hard threshold) |
| `draftthinker_speculative_latency_saved_seconds` | Histogram | (none) | Head-start time saved on escalated requests that had a running speculative call |
`speculative_latency_saved_seconds` uses the same bucket boundaries as `upstream_latency_seconds`:

0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30
- Trigger rate = `speculative_triggers_total` / `requests_total`. Healthy range depends on workload; higher means more requests hit the soft threshold.
- Cancellation ratio = `speculative_cancellations_total` / `speculative_triggers_total`. This is wasted compute. Target < 10% of total escalation cost.
- Latency saved = the histogram shows how much head start the heavyweight got before escalation was confirmed. Higher values mean more latency eliminated from the user-facing request.
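The trigger/cancel lifecycle described above can be sketched as a tiny state machine: a speculative heavyweight call fires when per-token entropy crosses the soft threshold, is cancelled if entropy drops back below it before the hard threshold is ever crossed, and otherwise the hard threshold confirms the escalation. The threshold values and the per-token entropy input are illustrative assumptions, not the gateway's actual configuration:

```python
def speculate(entropies, soft=1.5, hard=2.0):
    """Scan per-token entropy values for one request and return
    (outcome, triggered, cancelled). Illustrative sketch only."""
    triggered = cancelled = speculating = False
    for h in entropies:
        if h >= hard:
            # Hard threshold: escalate. A still-running speculative call
            # gives the heavyweight a head start (latency_saved_seconds).
            return "escalate", triggered, cancelled
        if h >= soft and not speculating:
            speculating = True
            triggered = True   # speculative_triggers_total
        elif h < soft and speculating:
            speculating = False
            cancelled = True   # speculative_cancellations_total
    return "accept", triggered, cancelled

# Drafter wobbles past the soft threshold, then recovers:
print(speculate([0.4, 1.6, 1.7, 0.9, 0.3]))  # → ('accept', True, True)
```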
Phase 5 added a semantic cache layer. Three new Prometheus instruments track cache effectiveness and lookup performance.
| Metric | Type | Labels | Description |
|---|---|---|---|
| `draftthinker_cache_hits_total` | Counter | (none) | Total cache hits (semantically similar prompt found and response returned) |
| `draftthinker_cache_misses_total` | Counter | (none) | Total cache misses (no similar prompt or expired Redis entry) |
| `draftthinker_cache_lookup_latency_seconds` | Histogram | (none) | End-to-end cache lookup latency including embedding and vector search |
`cache_lookup_latency_seconds` uses the following bucket boundaries (in seconds):

0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5
Phase 5 adds a new `decision` label value to `routing_decisions_total`:

- `cache_hit`: response served from the semantic cache (skips the entire draft pipeline)
- Hit rate = `cache_hits_total` / (`cache_hits_total` + `cache_misses_total`). Higher means more requests skip the draft pipeline entirely.
- Lookup latency = dominated by the embedding API call. Target < 50 ms for the full lookup (embed + vector search + Redis get).
- Cache miss with lazy cleanup = a vector match was found in Qdrant but the Redis TTL had expired. The orphaned Qdrant point is deleted. Logged as "cache: lazy cleanup of orphaned qdrant point".
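The lookup path above, including the lazy-cleanup branch, can be sketched with stubbed-out dependencies. The function names (`embed`, `vector_search`, `redis_get`, `delete_point`) and the similarity threshold are hypothetical stand-ins for illustration, not the gateway's API:

```python
def cache_lookup(prompt, embed, vector_search, redis_get, delete_point,
                 min_similarity=0.92):
    """Semantic cache lookup: embed -> vector search -> Redis get.
    Returns the cached response, or None on a miss. Illustrative sketch;
    min_similarity is an assumed value."""
    vec = embed(prompt)
    match = vector_search(vec)  # -> (point_id, similarity) or None
    if match is None or match[1] < min_similarity:
        return None  # cache_misses_total
    point_id, _ = match
    response = redis_get(point_id)
    if response is None:
        # Redis TTL expired: lazy cleanup of the orphaned Qdrant point.
        delete_point(point_id)
        return None  # cache_misses_total
    return response  # cache_hits_total

# Expired entry: a vector match exists but Redis returns nothing,
# so the orphaned point is deleted and the lookup counts as a miss.
deleted = []
result = cache_lookup(
    "what is 2+2",
    embed=lambda p: [0.0],
    vector_search=lambda v: ("pt1", 0.97),
    redis_get=lambda pid: None,
    delete_point=deleted.append,
)
```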