
A/B testing: validate token savings heuristics against real LLM token counts #461

Description

@jongalloway

Summary

The token savings estimation system (TokenSavingsEstimator, TokenizerApproximation, ModelFamily) currently uses heuristic approximations — chars-per-token ratios and baseline scale factors — to estimate token counts. These values are educated guesses based on published tokenizer research, but they have not been validated against real-world MCP interactions.

We should run structured A/B comparisons to capture actual token counts and validate (or tune) these heuristics.

What needs validation

| Heuristic | Current value (Unknown/default) | Notes |
|---|---|---|
| Prose chars-per-token | 4.0 | Based on GPT-4 BPE averages |
| JSON chars-per-token | 3.2 | Structured text tokenizes denser |
| Code chars-per-token | 3.5 | Symbols + identifiers |
| Baseline scale factor | 1.0 (varies by model family) | Multiplier on baseline overhead |
| Content-kind detection density threshold | 4% | {};() punctuation ratio for Code vs Prose |

Model-family-specific ratios also need validation (e.g., Claude Haiku at 3.8 prose, GPT-4o at 3.9, etc.).
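For readers unfamiliar with the estimator, the heuristics above can be sketched roughly as follows. This is a minimal Python illustration, not the project's C# implementation: the function names, the JSON check (parsing the text), and the `>=` comparison against the 4% threshold are all assumptions. Only the numeric values come from the table.

```python
import json
import math

# Ratios and threshold copied from the table (Unknown/default family).
CHARS_PER_TOKEN = {"prose": 4.0, "json": 3.2, "code": 3.5}
CODE_PUNCTUATION = set("{};()")
CODE_DENSITY_THRESHOLD = 0.04  # 4% punctuation ratio


def detect_content_kind(text: str) -> str:
    """Classify text as 'json', 'code', or 'prose' (hypothetical rules)."""
    stripped = text.strip()
    if not stripped:
        return "prose"
    # Assumption: JSON is recognized by successfully parsing it.
    if stripped[0] in "{[":
        try:
            json.loads(stripped)
            return "json"
        except ValueError:
            pass  # not valid JSON; fall through to the density check
    density = sum(ch in CODE_PUNCTUATION for ch in stripped) / len(stripped)
    # Assumption: the threshold is inclusive.
    return "code" if density >= CODE_DENSITY_THRESHOLD else "prose"


def estimate_tokens(text: str) -> int:
    """Estimate tokens as character count divided by the per-kind ratio."""
    kind = detect_content_kind(text)
    return math.ceil(len(text) / CHARS_PER_TOKEN[kind])
```

Under these assumptions, a 40-character prose string is estimated at 10 tokens, which is the kind of figure the validation work would compare against the provider-reported count.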

Proposed approach

Phase 1: Instrumentation

  • Add opt-in telemetry that captures actual token counts from LLM API responses alongside our heuristic estimates
  • Log paired data: (heuristic_estimate, actual_tokens, model_id, content_kind, content_length)
  • Store in a local JSONL file (privacy-first — no external telemetry without consent)
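A minimal sketch of the paired-data log, assuming one JSON object per line and an explicit opt-in flag. The field names follow the tuple above; `TokenObservation`, `append_observation`, and `read_observations` are hypothetical names, and how `actual_tokens` is read from each provider's API response is left out.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass(frozen=True)
class TokenObservation:
    heuristic_estimate: int
    actual_tokens: int
    model_id: str
    content_kind: str
    content_length: int


def append_observation(path: Path, obs: TokenObservation, enabled: bool) -> bool:
    """Append one record as a JSON line; write nothing unless opted in."""
    if not enabled:
        return False
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(asdict(obs)) + "\n")
    return True


def read_observations(path: Path) -> list[TokenObservation]:
    """Load every non-empty line back into a TokenObservation."""
    with path.open(encoding="utf-8") as fh:
        return [TokenObservation(**json.loads(line)) for line in fh if line.strip()]
```

Appending line by line keeps the file usable even if a session is interrupted, and only lengths and counts are stored, never the content itself.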

Phase 2: Data collection

  • Run a representative set of MCP workflows (project creation, package management, build+test, template discovery) across multiple model families
  • Capture at least 100 paired observations per model family so that per-family error estimates are reasonably stable
  • Include diverse content types: short prompts, long tool responses, JSON payloads, code output

Phase 3: Analysis & calibration

  • Compute per-model-family mean absolute error (MAE) and mean absolute percentage error (MAPE)
  • Fit updated chars-per-token ratios via least-squares regression on actual data
  • Validate the ContentKind detection accuracy (confusion matrix: Prose/Json/Code)
  • Determine if BaselineScaleFactor values track reality or need restructuring
  • Publish calibrated profile as v2 (keeping v1 as fallback)
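The analysis steps above could look roughly like this in Python. It is a sketch under stated assumptions: the fit models actual tokens as content length divided by a single ratio with no intercept, it would be run separately per model family and content kind, and all function names are hypothetical.

```python
from collections import Counter
from typing import Sequence


def mae(estimates: Sequence[float], actuals: Sequence[float]) -> float:
    """Mean absolute error, in tokens."""
    return sum(abs(e - a) for e, a in zip(estimates, actuals)) / len(actuals)


def mape(estimates: Sequence[float], actuals: Sequence[float]) -> float:
    """Mean absolute percentage error; pairs with zero actual tokens are skipped."""
    pairs = [(e, a) for e, a in zip(estimates, actuals) if a > 0]
    return 100.0 * sum(abs(e - a) / a for e, a in pairs) / len(pairs)


def fit_chars_per_token(lengths: Sequence[float], actuals: Sequence[float]) -> float:
    """Least-squares fit of actual = k * length through the origin.

    Minimizing sum((actual - k * length)^2) gives k = sum(x*y) / sum(x*x);
    the chars-per-token ratio is 1 / k.
    """
    sxy = sum(x * y for x, y in zip(lengths, actuals))
    sxx = sum(x * x for x in lengths)
    return sxx / sxy


def confusion_matrix(expected: Sequence[str], predicted: Sequence[str]) -> Counter:
    """Count (expected, predicted) content-kind pairs."""
    return Counter(zip(expected, predicted))


def accuracy(matrix: Counter) -> float:
    """Fraction of observations on the diagonal of the confusion matrix."""
    correct = sum(n for (exp, pred), n in matrix.items() if exp == pred)
    return correct / sum(matrix.values())
```

A fit through the origin weights long responses more heavily than short ones. If short prompts matter as much, fitting on per-observation ratios instead may be worth comparing.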

Phase 4: Ongoing validation

  • Add a CI smoke test that compares heuristic estimates to a frozen set of known-good pairs
  • Consider a --calibrate mode that auto-tunes from collected data
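One possible shape for the smoke test is sketched below: each frozen pair stores a text and its known-good token count, and the check reports any pair whose estimate drifts beyond a tolerance. The `FrozenPair` type, the `failing_pairs` helper, and the per-pair 15% tolerance are assumptions for illustration, and the estimator is passed in as a plain callable.

```python
from typing import Callable, Iterable, NamedTuple


class FrozenPair(NamedTuple):
    text: str
    actual_tokens: int  # known-good count captured during data collection


def failing_pairs(
    pairs: Iterable[FrozenPair],
    estimate: Callable[[str], int],
    max_pct_error: float = 15.0,  # assumed tolerance, not a project setting
) -> list[tuple[FrozenPair, int, float]]:
    """Return (pair, estimate, percent error) for each pair outside tolerance."""
    failures = []
    for pair in pairs:
        est = estimate(pair.text)
        pct = 100.0 * abs(est - pair.actual_tokens) / pair.actual_tokens
        if pct > max_pct_error:
            failures.append((pair, est, pct))
    return failures
```

A CI job would load the frozen pairs from a checked-in file and fail the build when the returned list is non-empty, which catches accidental changes to the ratios without calling any LLM API.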

Success criteria

  • MAPE < 15% across all model families for MCP token estimates
  • MAPE < 25% for baseline estimates (inherently noisier due to prompt engineering variance)
  • ContentKind detection accuracy > 90%
  • At least 3 model families validated (OpenAI GPT-4o, Claude Sonnet, Gemini Pro)
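These criteria are simple enough to check mechanically once per-family results exist. The sketch below assumes the baseline threshold is applied per model family, which the list above does not state explicitly; `FamilyResult` and `unmet_criteria` are hypothetical names.

```python
from dataclasses import dataclass

MCP_MAPE_LIMIT = 15.0         # percent
BASELINE_MAPE_LIMIT = 25.0    # percent
MIN_DETECTION_ACCURACY = 0.90
MIN_FAMILIES = 3


@dataclass(frozen=True)
class FamilyResult:
    mcp_mape: float       # percent
    baseline_mape: float  # percent


def unmet_criteria(results: dict[str, FamilyResult], detection_accuracy: float) -> list[str]:
    """Return a description of every criterion that is not met."""
    problems = []
    if len(results) < MIN_FAMILIES:
        problems.append(f"only {len(results)} model families validated")
    for family, r in sorted(results.items()):
        if r.mcp_mape >= MCP_MAPE_LIMIT:
            problems.append(f"{family}: MCP MAPE {r.mcp_mape:.1f}% is not below 15%")
        # Assumption: the baseline limit is checked per family.
        if r.baseline_mape >= BASELINE_MAPE_LIMIT:
            problems.append(f"{family}: baseline MAPE {r.baseline_mape:.1f}% is not below 25%")
    if detection_accuracy <= MIN_DETECTION_ACCURACY:
        problems.append(f"ContentKind accuracy {detection_accuracy:.0%} is not above 90%")
    return problems
```

An empty list means every criterion passed; otherwise each entry names the family and the metric that missed.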

Related code

  • DotNetMcp/Telemetry/TokenSavingsEstimator.cs — core estimation logic
  • DotNetMcp/Telemetry/TokenSavingsModels.cs — ModelFamily, ContentKind, TokenizerApproximation
  • DotNetMcp.Tests/Telemetry/TokenSavingsEstimatorTests.cs — current test coverage

Metadata

    Labels

    enhancement: New feature or request
