This document describes the Prometheus metrics exposed by the gateway at the configured metrics path (default `/metrics`).
Phase 1 established the foundational proxy layer. Three Prometheus instruments were introduced to measure baseline gateway behavior before any routing logic existed.
| Metric | Type | Labels | Description |
|---|---|---|---|
| `draftthinker_requests_total` | Counter | model, status | Total requests by model name and HTTP status code |
| `draftthinker_upstream_latency_seconds` | Histogram | provider | Upstream LLM provider latency in seconds |
| `draftthinker_errors_total` | Counter | type | Total errors by error type |
`upstream_latency_seconds` uses the following bucket boundaries (in seconds):

0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30
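Prometheus histograms are cumulative: an observation is counted in every bucket whose upper bound (`le`) is greater than or equal to the observed value, plus the implicit `+Inf` bucket. A minimal sketch of how a latency observation maps onto the boundaries above (pure Python illustration, not the gateway's actual instrumentation):

```python
import bisect

# Bucket upper bounds for upstream_latency_seconds (from this document).
BOUNDS = [0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30]

def bucket_counts(observations):
    """Cumulative per-bucket counts plus the implicit +Inf bucket,
    mirroring Prometheus histogram semantics."""
    counts = [0] * (len(BOUNDS) + 1)  # last slot is +Inf
    for v in observations:
        # bisect_left finds the first bound >= v (buckets are `le`, inclusive).
        i = bisect.bisect_left(BOUNDS, v)
        for j in range(i, len(counts)):
            counts[j] += 1
    return counts

# A 0.3 s upstream call lands in the le=0.5 bucket and every wider one.
print(bucket_counts([0.3]))  # → [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
```

This cumulative shape is what lets `histogram_quantile` estimate percentiles from bucket counters.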
| Target | Measurement |
|---|---|
| Proxy overhead < 5ms P99 | Gateway processing time excluding upstream model inference |
Phase 1 did not include entropy or routing metrics. Those were added in Phase 2.
Phase 2 added entropy-based routing. Two new instruments track the entropy analysis and routing decision outcomes.
| Metric | Type | Labels | Description |
|---|---|---|---|
| `draftthinker_entropy_distribution` | Histogram | (none) | Distribution of per-token Shannon entropy values in bits |
| `draftthinker_routing_decisions_total` | Counter | decision | Total routing decisions by outcome (accept or escalate) |
`entropy_distribution` uses the following bucket boundaries (in bits):

0, 0.25, 0.5, 0.75, 1.0, 1.5, 2.0, 2.5, 3.0
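Per-token Shannon entropy in bits is H = -Σ p·log₂ p over the token probability distribution. A minimal sketch of the computation (where the probabilities come from, e.g. drafter logprobs, is an assumption, not something this document specifies):

```python
import math

def shannon_entropy_bits(probs):
    """Shannon entropy in bits of a token probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A confident next-token prediction has low entropy; a uniform
# distribution over 4 tokens has the maximum, exactly 2.0 bits.
confident = shannon_entropy_bits([0.97, 0.01, 0.01, 0.01])  # ≈ 0.24 bits
uniform = shannon_entropy_bits([0.25, 0.25, 0.25, 0.25])    # 2.0 bits
```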
The `decision` label on `routing_decisions_total` takes one of:

- `accept`: drafter response served directly
- `escalate`: request forwarded to the heavyweight model
Phase 3 calibration produces offline metrics (not Prometheus). These are generated by the sweep tool in benchmarks/cmd/sweep/ and written to CSV.
| Metric | Description |
|---|---|
| Escalation Rate | Fraction of requests routed to heavyweight at a given threshold |
| Draft Accuracy | Fraction of accepted drafts judged acceptable (TN / (TN + FN)) |
| Cost Reduction | 1 - estimated_cost / baseline_cost |
| Precision | TP / (TP + FP) |
| Recall | TP / (TP + FN) |
| F1 | Harmonic mean of precision and recall |
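Given the confusion-matrix cells below, the offline metrics reduce to a few lines of arithmetic. A sketch of the formulas in the table (illustrative, not the sweep tool's actual code):

```python
def offline_metrics(tp, fp, tn, fn):
    """Precision, recall, F1, and draft accuracy from confusion-matrix cells."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    # Draft accuracy: of the drafts that were accepted, how many were fine.
    draft_accuracy = tn / (tn + fn) if tn + fn else 0.0
    return precision, recall, f1, draft_accuracy

# Hypothetical counts: 8 correct escalations, 12 unnecessary ones,
# 178 correct acceptances, 2 bad answers served.
p, r, f1, acc = offline_metrics(tp=8, fp=12, tn=178, fn=2)
```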
| | Would Escalate | Would Accept |
|---|---|---|
| Draft Unacceptable | TP (correct escalation) | FN (bad answer served) |
| Draft Acceptable | FP (unnecessary cost) | TN (correct acceptance) |
The sweep tool auto-selects the threshold with the highest F1 score among thresholds where draft accuracy >= 95%. Results are written to benchmarks/results/sweep.csv with a human-readable summary printed to stdout.
| Threshold | Escalation Rate | Draft Accuracy | Cost Reduction | F1 |
|---|---|---|---|---|
| 1.00 | 68.9% | 100.0% | 8.2% | 0.06 |
| 1.25 | 49.2% | 99.6% | 31.0% | 0.08 |
| 1.50 | 30.9% | 98.6% | 56.2% | 0.07 |
| 1.75 | 13.9% | 98.4% | 81.2% | 0.10 |
| 2.00 | 6.0% | 98.2% | 91.6% | 0.10 |
| 2.25 | 0.4% | 97.9% | 99.0% | 0.00 |
| 2.50 | 0.0% | 97.9% | 99.2% | 0.00 |
Selected threshold: T=2.0 (94% draft acceptance, 98.2% accuracy, 91.6% cost reduction).
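The selection rule (highest F1 among thresholds with draft accuracy >= 95%) can be sketched over the sample sweep rows above. Note the table has two rows tied at F1 = 0.10; breaking the tie toward higher cost reduction, which reproduces the selected T=2.0, is an assumption about the tool's behavior:

```python
# (threshold, escalation_rate, draft_accuracy, cost_reduction, f1)
# Values transcribed from the sample sweep table above.
ROWS = [
    (1.00, 0.689, 1.000, 0.082, 0.06),
    (1.25, 0.492, 0.996, 0.310, 0.08),
    (1.50, 0.309, 0.986, 0.562, 0.07),
    (1.75, 0.139, 0.984, 0.812, 0.10),
    (2.00, 0.060, 0.982, 0.916, 0.10),
    (2.25, 0.004, 0.979, 0.990, 0.00),
    (2.50, 0.000, 0.979, 0.992, 0.00),
]

def select_threshold(rows, min_accuracy=0.95):
    eligible = [r for r in rows if r[2] >= min_accuracy]
    # Highest F1; ties broken by cost reduction (assumed tie-break).
    best = max(eligible, key=lambda r: (r[4], r[3]))
    return best[0]

print(select_threshold(ROWS))  # → 2.0
```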
Phase 4 added speculative execution. Three new Prometheus instruments track when the gateway fires a parallel heavyweight call, whether it gets used or cancelled, and how much latency the head start saves.
| Metric | Type | Labels | Description |
|---|---|---|---|
| `draftthinker_speculative_triggers_total` | Counter | (none) | Total speculative heavyweight calls fired (soft threshold exceeded) |
| `draftthinker_speculative_cancellations_total` | Counter | (none) | Speculative calls cancelled (drafter recovered before hard threshold) |
| `draftthinker_speculative_latency_saved_seconds` | Histogram | (none) | Head-start time saved on escalated requests that had a running speculative call |
`speculative_latency_saved_seconds` uses the same bucket boundaries as `upstream_latency_seconds`:

0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30
- Trigger rate = `speculative_triggers_total` / `requests_total`. Healthy range depends on workload; higher means more requests hit the soft threshold.
- Cancellation ratio = `speculative_cancellations_total` / `speculative_triggers_total`. This is wasted compute. Target < 10% of total escalation cost.
- Latency saved = the histogram shows how much head start the heavyweight got before escalation was confirmed. Higher values mean more latency eliminated from the user-facing request.
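The trigger/cancel lifecycle described above can be sketched as a tiny state machine: a speculative heavyweight call fires when per-token entropy crosses the soft threshold, is cancelled if entropy drops back below it before the hard threshold is ever crossed, and otherwise the hard threshold confirms the escalation. The threshold values and the per-token entropy input are illustrative assumptions, not the gateway's actual configuration:

```python
def speculate(entropies, soft=1.5, hard=2.0):
    """Scan per-token entropy values for one request and return
    (outcome, triggered, cancelled). Illustrative sketch only."""
    triggered = cancelled = speculating = False
    for h in entropies:
        if h >= hard:
            # Hard threshold: escalate. A still-running speculative call
            # gives the heavyweight a head start (latency_saved_seconds).
            return "escalate", triggered, cancelled
        if h >= soft and not speculating:
            speculating = True
            triggered = True   # speculative_triggers_total
        elif h < soft and speculating:
            speculating = False
            cancelled = True   # speculative_cancellations_total
    return "accept", triggered, cancelled

# Drafter wobbles past the soft threshold, then recovers:
print(speculate([0.4, 1.6, 1.7, 0.9, 0.3]))  # → ('accept', True, True)
```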
Phase 5 added a semantic cache layer. Three new Prometheus instruments track cache effectiveness and lookup performance.
| Metric | Type | Labels | Description |
|---|---|---|---|
| `draftthinker_cache_hits_total` | Counter | (none) | Total cache hits (semantically similar prompt found and response returned) |
| `draftthinker_cache_misses_total` | Counter | (none) | Total cache misses (no similar prompt or expired Redis entry) |
| `draftthinker_cache_lookup_latency_seconds` | Histogram | (none) | End-to-end cache lookup latency including embedding and vector search |
`cache_lookup_latency_seconds` uses the following bucket boundaries (in seconds):

0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5
Phase 5 adds a new `decision` label value to `routing_decisions_total`:

- `cache_hit`: response served from the semantic cache (skips the entire draft pipeline)
- Hit rate = `cache_hits_total` / (`cache_hits_total` + `cache_misses_total`). Higher means more requests skip the draft pipeline entirely.
- Lookup latency = dominated by the embedding API call. Target < 50 ms for the full lookup (embed + vector search + Redis get).
- Cache miss with lazy cleanup = a vector match was found in Qdrant but the Redis TTL had expired. The orphaned Qdrant point is deleted. Logged as "cache: lazy cleanup of orphaned qdrant point".
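The lookup path above, including the lazy-cleanup branch, can be sketched with stubbed-out dependencies. The function names (`embed`, `vector_search`, `redis_get`, `delete_point`) and the similarity threshold are hypothetical stand-ins for illustration, not the gateway's API:

```python
def cache_lookup(prompt, embed, vector_search, redis_get, delete_point,
                 min_similarity=0.92):
    """Semantic cache lookup: embed -> vector search -> Redis get.
    Returns the cached response, or None on a miss. Illustrative sketch;
    min_similarity is an assumed value."""
    vec = embed(prompt)
    match = vector_search(vec)  # -> (point_id, similarity) or None
    if match is None or match[1] < min_similarity:
        return None  # cache_misses_total
    point_id, _ = match
    response = redis_get(point_id)
    if response is None:
        # Redis TTL expired: lazy cleanup of the orphaned Qdrant point.
        delete_point(point_id)
        return None  # cache_misses_total
    return response  # cache_hits_total

# Expired entry: a vector match exists but Redis returns nothing,
# so the orphaned point is deleted and the lookup counts as a miss.
deleted = []
result = cache_lookup(
    "what is 2+2",
    embed=lambda p: [0.0],
    vector_search=lambda v: ("pt1", 0.97),
    redis_get=lambda pid: None,
    delete_point=deleted.append,
)
```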