A/B testing: validate token savings heuristics against real LLM token counts #461

## Summary

The token savings estimation system (`TokenSavingsEstimator`, `TokenizerApproximation`, `ModelFamily`) currently estimates token counts with **heuristic approximations**: chars-per-token ratios and baseline scale factors. These values are educated guesses based on published tokenizer research and have not been validated against real-world MCP interactions.

We should run structured A/B comparisons to capture actual token counts and validate (or tune) these heuristics.

## What needs validation

| Heuristic | Current value (Unknown/default model family) | Notes |
|---|---|---|
| Prose chars-per-token | 4.0 | Based on GPT-4 BPE averages |
| JSON chars-per-token | 3.2 | Structured text yields more tokens per character |
| Code chars-per-token | 3.5 | Symbols + identifiers |
| Baseline scale factor | 1.0 (varies by model family) | Multiplier on baseline overhead |
| Content-kind detection | density threshold 4% | `{};()` punctuation ratio for Code vs Prose |

Model-family-specific ratios also need validation (e.g., prose chars-per-token of 3.8 for Claude Haiku and 3.9 for GPT-4o).
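For reference while analyzing collected data, the heuristics in the table can be mirrored in a few lines of Python. This is a rough sketch, not the logic in `TokenSavingsEstimator.cs`: the function names, the rounding (ceiling), and the `>=` comparison against the 4% threshold are assumptions, and JSON detection is left out because this issue does not describe how it works.

```python
import math
from typing import Optional

# Chars-per-token ratios from the table above (Unknown/default profile).
CHARS_PER_TOKEN = {"prose": 4.0, "json": 3.2, "code": 3.5}
CODE_PUNCTUATION = set("{};()")
CODE_DENSITY_THRESHOLD = 0.04  # the 4% density threshold from the table


def detect_content_kind(text: str) -> str:
    """Classify text as 'code' or 'prose' by `{};()` punctuation density.

    Assumption: density at or above the threshold counts as code.
    """
    if not text:
        return "prose"
    density = sum(ch in CODE_PUNCTUATION for ch in text) / len(text)
    return "code" if density >= CODE_DENSITY_THRESHOLD else "prose"


def estimate_tokens(text: str, kind: Optional[str] = None) -> int:
    """Estimate tokens as ceil(len(text) / chars_per_token) (rounding is assumed)."""
    kind = kind or detect_content_kind(text)
    return math.ceil(len(text) / CHARS_PER_TOKEN[kind])
```

Keeping a script-side copy like this makes it easy to try alternative ratios or thresholds on logged data before changing the C# profile.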

## Proposed approach

### Phase 1: Instrumentation
- [ ] Add opt-in telemetry that captures **actual token counts** from LLM API responses alongside our heuristic estimates
- [ ] Log paired data: `(heuristic_estimate, actual_tokens, model_id, content_kind, content_length)`
- [ ] Store in a local JSONL file (privacy-first — no external telemetry without consent)
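The paired record and the JSONL store could look roughly like the following. The field names come from the tuple above; the class name `TokenObservation` and the helper functions are hypothetical, and the real instrumentation would presumably live in the C# telemetry code rather than in Python.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import List


@dataclass
class TokenObservation:
    """One paired observation (hypothetical record shape)."""
    heuristic_estimate: int
    actual_tokens: int
    model_id: str
    content_kind: str
    content_length: int


def append_observation(path: Path, obs: TokenObservation) -> None:
    """Append one observation to the local file as a single JSON line."""
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(obs)) + "\n")


def read_observations(path: Path) -> List[TokenObservation]:
    """Load all observations, skipping blank lines."""
    with path.open(encoding="utf-8") as f:
        return [TokenObservation(**json.loads(line)) for line in f if line.strip()]
```

Note that the record stores only counts and metadata, not the content itself, which fits the privacy-first goal.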

### Phase 2: Data collection
- [ ] Run a representative set of MCP workflows (project creation, package management, build+test, template discovery) across multiple model families
- [ ] Capture at least 100 paired observations per model family so that per-family error estimates are reasonably stable
- [ ] Include diverse content types: short prompts, long tool responses, JSON payloads, code output
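A small coverage check can show which model families still need more data. This sketch assumes each observation has already been mapped to a family label; the function name and the family strings in the usage note are illustrative only.

```python
from collections import Counter
from typing import Dict, Iterable

MIN_OBSERVATIONS = 100  # per-family target from the checklist above


def coverage_gaps(
    observed_families: Iterable[str],
    required_families: Iterable[str],
    minimum: int = MIN_OBSERVATIONS,
) -> Dict[str, int]:
    """Return {family: observations still missing} for families below the minimum."""
    counts = Counter(observed_families)
    return {
        family: minimum - counts[family]
        for family in required_families
        if counts[family] < minimum
    }
```

For example, with 100 observations for one family, 40 for a second, and none for a third, the result lists the second as 60 short and the third as 100 short.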

### Phase 3: Analysis & calibration
- [ ] Compute per-model-family mean absolute error (MAE) and mean absolute percentage error (MAPE)
- [ ] Fit updated chars-per-token ratios via least-squares regression on actual data
- [ ] Validate the `ContentKind` detection accuracy (confusion matrix: Prose/Json/Code)
- [ ] Determine if `BaselineScaleFactor` values track reality or need restructuring
- [ ] Publish calibrated profile as `v2` (keeping `v1` as fallback)
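The error metrics and the ratio fit can be sketched as below. The fit assumes the simplest model, `actual_tokens ≈ content_length / ratio` with no intercept, so it is a least-squares line through the origin; a model with an intercept might capture fixed per-message overhead better, and that choice is open. All function names are hypothetical.

```python
from typing import Sequence


def mae(estimates: Sequence[float], actuals: Sequence[float]) -> float:
    """Mean absolute error, in tokens."""
    return sum(abs(e - a) for e, a in zip(estimates, actuals)) / len(actuals)


def mape(estimates: Sequence[float], actuals: Sequence[float]) -> float:
    """Mean absolute percentage error, in percent.

    Pairs with zero actual tokens are skipped to avoid division by zero.
    """
    pairs = [(e, a) for e, a in zip(estimates, actuals) if a > 0]
    return 100.0 * sum(abs(e - a) / a for e, a in pairs) / len(pairs)


def fit_chars_per_token(lengths: Sequence[float], actuals: Sequence[float]) -> float:
    """Fit actual ≈ slope * length by least squares through the origin.

    Returns the implied chars-per-token ratio, 1 / slope.
    """
    slope = sum(n * a for n, a in zip(lengths, actuals)) / sum(n * n for n in lengths)
    return 1.0 / slope
```

Running these per model family and per content kind gives one candidate ratio for each cell of a `v2` profile, which can then be compared against the current `v1` values.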

### Phase 4: Ongoing validation
- [ ] Add a CI smoke test that compares heuristic estimates to a frozen set of known-good pairs
- [ ] Consider a `--calibrate` mode that auto-tunes from collected data
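The smoke test idea can be illustrated with a helper that flags frozen pairs whose relative error exceeds a tolerance. The per-pair tolerance of 15% is an assumption, and the estimator is passed in as a callable so the sketch does not depend on any particular implementation. In the repository this would more likely be a test alongside `TokenSavingsEstimatorTests.cs`.

```python
from typing import Callable, List, Sequence, Tuple

FrozenPair = Tuple[str, str, int]  # (text, content_kind, actual_tokens)


def failing_pairs(
    frozen_pairs: Sequence[FrozenPair],
    estimate: Callable[[str, str], float],
    tolerance: float = 0.15,  # assumed per-pair relative error limit
) -> List[Tuple[str, str, int, float]]:
    """Return the frozen pairs whose relative error exceeds the tolerance."""
    failures = []
    for text, kind, actual in frozen_pairs:
        predicted = estimate(text, kind)
        if abs(predicted - actual) / actual > tolerance:
            failures.append((text, kind, actual, predicted))
    return failures
```

A CI job would fail when the returned list is non-empty and print the offending pairs so regressions are easy to diagnose.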

## Success criteria

- MAPE < 15% across all model families for MCP token estimates
- MAPE < 25% for baseline estimates (inherently noisier due to prompt engineering variance)
- `ContentKind` detection accuracy > 90%
- At least 3 model families validated (OpenAI GPT-4o, Claude Sonnet, Gemini Pro)
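These criteria can be checked mechanically once the analysis has run. The sketch below builds the Prose/Json/Code confusion matrix, derives accuracy from it, and applies the thresholds listed above. The function names and the dictionary-based inputs are hypothetical, and ground-truth content kinds are assumed to come from manual labeling.

```python
from collections import Counter
from typing import Dict, Sequence

KINDS = ("Prose", "Json", "Code")


def confusion_matrix(
    true_kinds: Sequence[str], predicted_kinds: Sequence[str]
) -> Dict[str, Dict[str, int]]:
    """Return matrix[true][predicted] counts over the three content kinds."""
    counts = Counter(zip(true_kinds, predicted_kinds))
    return {t: {p: counts[(t, p)] for p in KINDS} for t in KINDS}


def accuracy(matrix: Dict[str, Dict[str, int]]) -> float:
    """Fraction of observations on the matrix diagonal."""
    total = sum(sum(row.values()) for row in matrix.values())
    correct = sum(matrix[k][k] for k in KINDS)
    return correct / total


def meets_criteria(
    mcp_mape_by_family: Dict[str, float],
    baseline_mape_by_family: Dict[str, float],
    kind_accuracy: float,
) -> bool:
    """Apply the thresholds above: MAPE in percent, accuracy as a fraction."""
    return (
        len(mcp_mape_by_family) >= 3
        and all(m < 15.0 for m in mcp_mape_by_family.values())
        and all(m < 25.0 for m in baseline_mape_by_family.values())
        and kind_accuracy > 0.90
    )
```

The off-diagonal cells of the matrix show which kinds are confused with each other, which helps decide whether the 4% density threshold needs tuning.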

## Related code

- `DotNetMcp/Telemetry/TokenSavingsEstimator.cs` — core estimation logic
- `DotNetMcp/Telemetry/TokenSavingsModels.cs` — `ModelFamily`, `ContentKind`, `TokenizerApproximation`
- `DotNetMcp.Tests/Telemetry/TokenSavingsEstimatorTests.cs` — current test coverage
