
# Metrics

Prometheus metrics exposed by the gateway at the configured metrics path (default `/metrics`).

## Phase 1 Metrics

Phase 1 established the foundational proxy layer. Three Prometheus instruments were introduced to measure baseline gateway behavior before any routing logic existed.

### Instruments

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| `draftthinker_requests_total` | Counter | `model`, `status` | Total requests by model name and HTTP status code |
| `draftthinker_upstream_latency_seconds` | Histogram | `provider` | Upstream LLM provider latency in seconds |
| `draftthinker_errors_total` | Counter | `type` | Total errors by error type |

### Latency Histogram Buckets

`upstream_latency_seconds` uses the following bucket boundaries (in seconds):

`0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30`
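Prometheus histograms are cumulative: each `le` bucket counts every observation less than or equal to its boundary, plus an implicit `+Inf` bucket that counts all observations. A minimal stdlib sketch of that semantics, using the boundaries above (the function and variable names here are illustrative, not the gateway's actual code):

```python
import bisect

# Bucket boundaries for upstream_latency_seconds (from the list above).
LATENCY_BUCKETS = [0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30]

def observe(counts: list, value: float) -> None:
    """Increment every cumulative bucket whose upper bound covers value.

    counts has len(LATENCY_BUCKETS) + 1 slots; the last slot is +Inf.
    """
    # First bucket whose boundary is >= value; that bucket and every
    # larger one (including +Inf) include this observation.
    i = bisect.bisect_left(LATENCY_BUCKETS, value)
    for j in range(i, len(counts)):
        counts[j] += 1

counts = [0] * (len(LATENCY_BUCKETS) + 1)
for latency in (0.03, 0.2, 0.2, 1.7, 42.0):
    observe(counts, latency)

# le=0.05 saw one observation; le=30 saw all but the 42 s outlier;
# +Inf always equals the total observation count.
print(counts)  # → [1, 1, 3, 3, 3, 4, 4, 4, 4, 5]
```

This is why bucket boundaries matter for the < 5ms P99 exit criterion below: quantiles computed from a histogram are interpolated between boundaries, so precision is limited by bucket granularity.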

### Exit Criteria

| Target | Measurement |
| --- | --- |
| Proxy overhead < 5ms P99 | Gateway processing time excluding upstream model inference |

Phase 1 did not include entropy or routing metrics. Those were added in Phase 2.

## Phase 2 Metrics

Phase 2 added entropy-based routing. Two new instruments track the entropy analysis and routing decision outcomes.

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| `draftthinker_entropy_distribution` | Histogram | (none) | Distribution of per-token Shannon entropy values in bits |
| `draftthinker_routing_decisions_total` | Counter | `decision` | Total routing decisions by outcome (accept or escalate) |

### Entropy Histogram Buckets

`entropy_distribution` uses the following bucket boundaries (in bits):

`0, 0.25, 0.5, 0.75, 1.0, 1.5, 2.0, 2.5, 3.0`

### Routing Decision Labels

The `decision` label on `routing_decisions_total` takes one of:

- `accept`: drafter response served directly
- `escalate`: request forwarded to heavyweight model
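The accept/escalate decision can be sketched end to end: compute the Shannon entropy of each drafter token's probability distribution and compare an aggregate against a calibrated threshold. This is an illustrative stdlib sketch, not the gateway's implementation; the mean-entropy aggregation is an assumption, and the default threshold of 2.0 comes from the Phase 3 calibration results below.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy in bits of one token's probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def route(token_probs, threshold=2.0):
    """Return 'accept' or 'escalate' for the routing_decisions_total label.

    Here the mean per-token entropy is compared to the threshold; the
    actual aggregation (mean, max, windowed) is a design choice.
    """
    mean_entropy = sum(shannon_entropy(p) for p in token_probs) / len(token_probs)
    return "accept" if mean_entropy < threshold else "escalate"

# A confident token (~0.36 bits) vs a uniform 8-way guess (3 bits).
confident = [0.95, 0.02, 0.02, 0.01]
uncertain = [1 / 8] * 8

print(route([confident, confident]))  # → accept
print(route([uncertain, uncertain]))  # → escalate
```

Each per-token entropy value would also be observed into `entropy_distribution`, which is why its buckets top out at 3.0 bits: near-uniform distributions over even a handful of candidate tokens already saturate the highest bucket.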

## Phase 3 Metrics

Phase 3 calibration produces offline metrics (not Prometheus). These are generated by the sweep tool in `benchmarks/cmd/sweep/` and written to CSV.

| Metric | Description |
| --- | --- |
| Escalation Rate | Fraction of requests routed to heavyweight at a given threshold |
| Draft Accuracy | Fraction of accepted drafts judged acceptable: TN / (TN + FN) |
| Cost Reduction | 1 - estimated_cost / baseline_cost |
| Precision | TP / (TP + FP) |
| Recall | TP / (TP + FN) |
| F1 | Harmonic mean of precision and recall |

### Confusion Matrix

| | Would Escalate | Would Accept |
| --- | --- | --- |
| Draft Unacceptable | TP (correct escalation) | FN (bad answer served) |
| Draft Acceptable | FP (unnecessary cost) | TN (correct acceptance) |
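Given counts in the four cells above, the Phase 3 metrics follow directly. A stdlib sketch (the cell counts in the example are made up):

```python
def sweep_metrics(tp, fp, tn, fn):
    """Compute Phase 3 sweep metrics from confusion-matrix cell counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        # Of everything accepted, how much was actually fine to serve.
        "draft_accuracy": tn / (tn + fn),
        # Fraction of all requests sent to the heavyweight model.
        "escalation_rate": (tp + fp) / (tp + fp + tn + fn),
    }

# Hypothetical sweep at one threshold: 30 correct escalations,
# 20 unnecessary ones, 440 correct acceptances, 10 bad answers served.
m = sweep_metrics(tp=30, fp=20, tn=440, fn=10)
print({k: round(v, 3) for k, v in m.items()})
```

Note the asymmetry the matrix encodes: FP only wastes money, while FN serves a bad answer to a user, which is why threshold selection constrains draft accuracy rather than just maximizing F1.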

### Threshold Selection

The sweep tool auto-selects the threshold with the highest F1 score among thresholds where draft accuracy >= 95%. Results are written to `benchmarks/results/sweep.csv`, with a human-readable summary printed to stdout.
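The selection rule reduces to a few lines. This is an illustrative sketch, not the sweep tool itself (which lives in `benchmarks/cmd/sweep/` and works from CSV); the rows below are hypothetical `(threshold, draft_accuracy, f1)` tuples:

```python
def select_threshold(rows, min_accuracy=0.95):
    """Pick the highest-F1 threshold among rows meeting the accuracy floor.

    rows: (threshold, draft_accuracy, f1) tuples, one per swept threshold.
    """
    eligible = [r for r in rows if r[1] >= min_accuracy]
    if not eligible:
        raise ValueError("no threshold meets the accuracy floor")
    return max(eligible, key=lambda r: r[2])[0]

# Hypothetical sweep output.
rows = [
    (1.0, 0.999, 0.06),
    (1.5, 0.986, 0.07),
    (2.0, 0.982, 0.10),
    (2.5, 0.930, 0.20),  # excluded: below the 95% accuracy floor
]
print(select_threshold(rows))  # → 2.0
```

The accuracy floor acts as a hard constraint, so a threshold with a better F1 but too many bad answers served (like the last row) can never win.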

### Calibration Results

518 valid prompts, gpt-4.1-nano drafter, gpt-4.1 heavyweight.

| Threshold | Escalation Rate | Draft Accuracy | Cost Reduction | F1 |
| --- | --- | --- | --- | --- |
| 1.00 | 68.9% | 100.0% | 8.2% | 0.06 |
| 1.25 | 49.2% | 99.6% | 31.0% | 0.08 |
| 1.50 | 30.9% | 98.6% | 56.2% | 0.07 |
| 1.75 | 13.9% | 98.4% | 81.2% | 0.10 |
| 2.00 | 6.0% | 98.2% | 91.6% | 0.10 |
| 2.25 | 0.4% | 97.9% | 99.0% | 0.00 |
| 2.50 | 0.0% | 97.9% | 99.2% | 0.00 |

Selected threshold: T=2.0 (94% draft acceptance, 98.2% accuracy, 91.6% cost reduction).

## Phase 4 Metrics

Phase 4 added speculative execution. Three new Prometheus instruments track when the gateway fires a parallel heavyweight call, whether it gets used or cancelled, and how much latency the head start saves.

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| `draftthinker_speculative_triggers_total` | Counter | (none) | Total speculative heavyweight calls fired (soft threshold exceeded) |
| `draftthinker_speculative_cancellations_total` | Counter | (none) | Speculative calls cancelled (drafter recovered before hard threshold) |
| `draftthinker_speculative_latency_saved_seconds` | Histogram | (none) | Head-start time saved on escalated requests that had a running speculative call |

### Latency Saved Histogram Buckets

`speculative_latency_saved_seconds` uses the same bucket boundaries as `upstream_latency_seconds`:

`0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30`

### Interpreting Speculative Metrics

- Trigger rate = `speculative_triggers_total / requests_total`. The healthy range depends on workload; higher means more requests hit the soft threshold.
- Cancellation ratio = `speculative_cancellations_total / speculative_triggers_total`. Each cancelled call is wasted heavyweight compute; target < 10% of total escalation cost.
- Latency saved: the histogram shows how much head start the heavyweight model got before escalation was confirmed. Higher values mean more latency removed from the user-facing request.
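The two ratios above are simple quotients of counter values scraped from the metrics endpoint. A small sketch with illustrative numbers (the function name and counters are hypothetical):

```python
def speculative_ratios(requests, triggers, cancellations):
    """Derive the Phase 4 health ratios from raw counter values."""
    trigger_rate = triggers / requests
    # Guard against division by zero before any speculative call fires.
    cancellation_ratio = cancellations / triggers if triggers else 0.0
    return trigger_rate, cancellation_ratio

# Hypothetical counter values scraped from the metrics endpoint.
rate, wasted = speculative_ratios(requests=10_000, triggers=800, cancellations=64)
print(f"trigger rate {rate:.1%}, cancellation ratio {wasted:.1%}")
# → trigger rate 8.0%, cancellation ratio 8.0%
```

Since counters only ever increase, a dashboard would normally apply these quotients to `rate()`-windowed values rather than raw lifetime totals, so the ratios reflect current behavior instead of all-time history.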

## Phase 5 Metrics

Phase 5 added a semantic cache layer. Three new Prometheus instruments track cache effectiveness and lookup performance.

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| `draftthinker_cache_hits_total` | Counter | (none) | Total cache hits (semantically similar prompt found and response returned) |
| `draftthinker_cache_misses_total` | Counter | (none) | Total cache misses (no similar prompt or expired Redis entry) |
| `draftthinker_cache_lookup_latency_seconds` | Histogram | (none) | End-to-end cache lookup latency including embedding and vector search |

### Cache Lookup Latency Histogram Buckets

`cache_lookup_latency_seconds` uses the following bucket boundaries (in seconds):

`0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5`

### Routing Decision Labels

Phase 5 adds a new `decision` label value to `routing_decisions_total`:

- `cache_hit`: response served from semantic cache (skips entire draft pipeline)

### Interpreting Cache Metrics

- Hit rate = `cache_hits_total / (cache_hits_total + cache_misses_total)`. Higher means more requests skip the draft pipeline entirely.
- Lookup latency is dominated by the embedding API call. Target < 50ms for the full lookup (embed + vector search + Redis get).
- Cache miss with lazy cleanup: a vector match was found in Qdrant, but the Redis TTL had expired. The orphaned Qdrant point is deleted, and the event is logged as "cache: lazy cleanup of orphaned qdrant point".
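The lookup flow with lazy cleanup can be sketched with in-memory stand-ins for Qdrant and Redis. Everything here (names, the toy similarity search, the 0.92 score floor) is illustrative, not the gateway's code:

```python
import math
import time

# In-memory stand-ins: a vector index (Qdrant) and a TTL'd KV store (Redis).
vector_index = {}  # point_id -> embedding
kv_store = {}      # point_id -> (cached response, expiry timestamp)

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def cache_lookup(embedding, min_score=0.92, now=None):
    """Return a cached response or None; lazily delete orphaned points.

    Mirrors the flow described above: vector match -> Redis get ->
    if the TTL has expired, drop the Qdrant point and report a miss.
    """
    now = time.time() if now is None else now
    match = max(vector_index.items(),
                key=lambda kv: cosine(embedding, kv[1]),
                default=None)
    if match is None or cosine(embedding, match[1]) < min_score:
        return None                       # cache_misses_total++
    point_id = match[0]
    entry = kv_store.get(point_id)
    if entry is None or entry[1] < now:
        # Lazy cleanup of the orphaned Qdrant point.
        vector_index.pop(point_id, None)
        kv_store.pop(point_id, None)
        return None                       # cache_misses_total++
    return entry[0]                       # cache_hits_total++

# Seed one entry, then look it up before and after its TTL expires.
expires_at = time.time() + 60
vector_index["p1"] = [1.0, 0.0]
kv_store["p1"] = ("cached answer", expires_at)
print(cache_lookup([0.99, 0.01]))                      # → cached answer
print(cache_lookup([0.99, 0.01], now=expires_at + 1))  # → None (lazy cleanup)
```

The cleanup is "lazy" because expired entries are only reconciled when a lookup happens to hit them, which avoids a background sweep of the vector index at the cost of occasionally paying a miss on a stale point.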