Summary
The token savings estimation system (TokenSavingsEstimator, TokenizerApproximation, ModelFamily) currently uses heuristic approximations — chars-per-token ratios and baseline scale factors — to estimate token counts. These values are educated guesses based on published tokenizer research, but they have not been validated against real-world MCP interactions.
We should run structured A/B comparisons to capture actual token counts and validate (or tune) these heuristics.
What needs validation
| Heuristic | Current Value (Unknown/default) | Notes |
|---|---|---|
| Prose chars-per-token | 4.0 | Based on GPT-4 BPE averages |
| JSON chars-per-token | 3.2 | Structured text tokenizes denser |
| Code chars-per-token | 3.5 | Symbols + identifiers |
| Baseline scale factor | 1.0 (varies by model family) | Multiplier on baseline overhead |
| Content-kind detection | density threshold 4% | {};() punctuation ratio for Code vs Prose |
Model-family-specific ratios also need validation (e.g., Claude Haiku at 3.8 prose, GPT-4o at 3.9).
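To make the heuristics under test concrete, here is a minimal Python sketch of a chars-per-token estimate and a punctuation-density check, using the Unknown/default values listed above. It is not the C# code in TokenSavingsEstimator; the function names, the round-up behavior, and the "at or above the threshold means Code" rule are assumptions made for illustration, and JSON detection is left out because the issue does not describe how it works.

```python
import math

# Ratios for the Unknown/default model family, as listed in this issue.
CHARS_PER_TOKEN = {"prose": 4.0, "json": 3.2, "code": 3.5}

# Punctuation set and threshold used to separate Code from Prose.
CODE_PUNCTUATION = set("{};()")
CODE_DENSITY_THRESHOLD = 0.04  # 4% of all characters


def punctuation_density(text: str) -> float:
    """Return the fraction of characters that are one of {};()."""
    if not text:
        return 0.0
    return sum(ch in CODE_PUNCTUATION for ch in text) / len(text)


def detect_code_or_prose(text: str) -> str:
    """Classify text as 'code' or 'prose' by punctuation density.

    Assumption: a density at or above the threshold counts as code.
    """
    if punctuation_density(text) >= CODE_DENSITY_THRESHOLD:
        return "code"
    return "prose"


def estimate_tokens(text: str, kind: str) -> int:
    """Estimate tokens as character count divided by chars-per-token.

    Assumption: the result is rounded up to a whole token.
    """
    return math.ceil(len(text) / CHARS_PER_TOKEN[kind])
```

Under these assumptions a 40-character prose string is estimated at 10 tokens. Validation would compare that kind of estimate against the token count the model provider actually reports.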
Proposed approach
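Phase 1: Instrumentation
- Fields to capture: (heuristic_estimate, actual_tokens, model_id, content_kind, content_length)
Phase 2: Data collection
Phase 3: Analysis & calibration
- ContentKind detection accuracy (confusion matrix: Prose/Json/Code)
- Whether BaselineScaleFactor values track reality or need restructuring
- [...] v2 (keeping v1 as fallback)
Phase 4: Ongoing validation
- A --calibrate mode that auto-tunes from collected data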
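As a rough illustration of how the instrumentation and calibration phases could fit together, the Python sketch below stores one record per interaction with the fields heuristic_estimate, actual_tokens, model_id, content_kind, and content_length, then refits a chars-per-token ratio from the collected records. The CalibrationSample class, the JSON Lines format, and the pooled-ratio fit are hypothetical choices for this sketch, not part of the existing C# code.

```python
import json
from dataclasses import asdict, dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class CalibrationSample:
    """One observed interaction (hypothetical record shape)."""
    heuristic_estimate: int  # tokens predicted by the current heuristic
    actual_tokens: int       # tokens reported for the real interaction
    model_id: str
    content_kind: str        # "prose", "json", or "code"
    content_length: int      # length in characters


def to_jsonl(samples: Iterable[CalibrationSample]) -> str:
    """Serialize samples as JSON Lines, one record per line."""
    return "\n".join(json.dumps(asdict(s), sort_keys=True) for s in samples)


def fit_chars_per_token(samples: Iterable[CalibrationSample],
                        content_kind: str) -> Optional[float]:
    """Pooled ratio for one content kind: total characters / total tokens.

    Returns None when no tokens were observed for that kind.
    """
    matching = [s for s in samples if s.content_kind == content_kind]
    total_chars = sum(s.content_length for s in matching)
    total_tokens = sum(s.actual_tokens for s in matching)
    if total_tokens == 0:
        return None
    return total_chars / total_tokens
```

A pooled ratio weights long payloads more heavily than short ones. Averaging per-sample ratios instead would be an equally plausible choice, and could be compared during analysis.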
Success criteria
- MAPE < 15% across all model families for MCP token estimates
- MAPE < 25% for baseline estimates (inherently noisier due to prompt engineering variance)
- ContentKind detection accuracy > 90%
- At least 3 model families validated (OpenAI GPT-4o, Claude Sonnet, Gemini Pro)
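The criteria above could be checked with a few lines of analysis code. The Python sketch below computes MAPE (mean absolute percentage error) between estimated and actual token counts, plus a confusion matrix and overall accuracy for ContentKind detection. The function names are illustrative, and skipping samples with zero actual tokens is an assumption made to avoid division by zero.

```python
from collections import Counter
from typing import Sequence


def mape(estimates: Sequence[float], actuals: Sequence[float]) -> float:
    """Mean absolute percentage error, in percent.

    Assumption: pairs with zero actual tokens are skipped.
    """
    errors = [abs(e - a) / a for e, a in zip(estimates, actuals) if a > 0]
    if not errors:
        raise ValueError("no samples with a positive actual token count")
    return 100.0 * sum(errors) / len(errors)


def confusion_matrix(predicted: Sequence[str], actual: Sequence[str]) -> Counter:
    """Count (actual_kind, predicted_kind) pairs."""
    return Counter(zip(actual, predicted))


def accuracy(predicted: Sequence[str], actual: Sequence[str]) -> float:
    """Fraction of samples whose predicted kind matches the actual kind."""
    if not actual:
        raise ValueError("no samples")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)
```

With these helpers, the MCP criterion would read `mape(estimates, actuals) < 15` per model family, and the detection criterion `accuracy(predicted, actual) > 0.90`.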
Related code
- DotNetMcp/Telemetry/TokenSavingsEstimator.cs: core estimation logic
- DotNetMcp/Telemetry/TokenSavingsModels.cs: ModelFamily, ContentKind, TokenizerApproximation
- DotNetMcp.Tests/Telemetry/TokenSavingsEstimatorTests.cs: current test coverage