Cloud cost inflation is eating startup margins. LLM context windows are expanding exponentially, but your runway is not. Developers are sending thousands of redundant code lines on every turn, drowning local CPUs in pre-fill latency and blowing up cloud API bills.
TokenCounter v2.0.0 is the asymmetrical leverage you need. By combining mathematically precise AST-Based Foveated Pruning with an automated Stable Session Cache, TokenCounter guarantees a 100% KV Cache Hit rate and slashes multi-turn response latency by up to 60% with zero code modifications.
This is the ultimate local AI multiplier. Integrate it in thirty seconds, unlock infinite context margins, and stop burning capital on redundant token evaluations.
In a standard multi-turn chat, your IDE client re-sends your codebase files on every turn. The moment you type a follow-up query, the developer context changes. Under standard proxies, this subtle change alters the pruning boundaries, modifying the codebase block by a few characters.
A single token change invalidates the entire LLM prefix cache. This triggers a complete cache miss, forcing the model to re-evaluate the entire prompt from scratch.
TokenCounter v2.0.0 solves this with a Stable Session Cache.
- Content-Hashed Anchoring: On Turn 1, we parse and prune the workspace files, caching the output.
- Prefix Lock: On subsequent turns, as long as the developer does not edit the files (verified via fast byte hashing), the proxy reuses the exact same pruned code.
- Guaranteed Hits: The codebase block remains 100% identical at the byte level. The LLM server matches the prefix perfectly, restoring the cached state instantly with 0ms evaluation overhead.
Legacy pruners rely on brittle regular expressions that choke on complex class scopes, nested methods, and arrow functions. TokenCounter v2.0.0 utilizes an offline-first web-tree-sitter engine running pre-compiled language WASMs to build a high-fidelity Abstract Syntax Tree.
- Mathematical Precision: We traverse the AST and surgically stub out unreferenced function/method bodies with single-line placeholders, keeping the syntactical structure and signature intact.
- Zero Telemetry: Operates entirely in-memory on your local machine. 100% data sovereignty, fully compliant with strict data protection frameworks (ISO 27001 / UU PDP).
The following benchmark demonstrates the real-world performance metrics of TokenCounter v2.0.0. The test was conducted using a local Qwen 3.5 2B (GGUF) model running in Single-Slot Mode (-np 1, -t 4, -c 4096) on an Intel Core i3 (12th Gen) CPU with 8GB RAM, simulating a standard developer workspace.
| Metric / Turn | Direct Flow (No Proxy) | Proxy Flow (TokenCounter v2.0.0) | Performance Multiplier / Savings |
|---|---|---|---|
| Turn 1: Input Tokens | 1,062 tokens | 985 tokens | 77 tokens saved (7.3% reduction) via AST Pruning |
| Turn 1: Output Tokens | 702 tokens | 880 tokens | Baseline generation phase |
| Turn 1: Total Latency | 87,185 ms | 96,250 ms | Initial pre-fill and model warm-up phase |
| Turn 2: Input Tokens | 1,522 tokens | 1,464 tokens | 58 tokens saved (3.8% reduction) |
| Turn 2: Cache Status | Cache Miss (Jumbled file order) | 100% KV Cache Hit (Prefix stabilized) | 876 tokens restored instantly from cache |
| Turn 2: Prompt Pre-fill | 34,335.86 ms (All evaluated) | 13,683.43 ms (Bypassed codebase) | 20,652.43 ms saved (60.1% faster pre-fill) |
| Turn 2: Total Latency | 268,538 ms | 107,363 ms | 161,175 ms saved (60.0% faster total response) |
- The Latency Arbitrage: By hitting the KV Cache on Turn 2, the proxy bypassed prompt evaluation for 876 tokens. The developer received their answer 161 seconds faster (a 60% speedup), completely eliminating CPU throttling bottlenecks.
- Infinite Context Economics: For cloud providers supporting prompt caching (e.g. Anthropic, DeepSeek), stabilizing the prefix and hitting the cache cuts input API costs by up to 90%.
- Frictionless Integration: No modifications are required on the client or server. The proxy runs as a transparent local server, delivering massive compounding advantages out-of-the-box.
Ensure you have Node.js installed, then run:
# Install dependencies and compile
npm install
npm run compile
# Package into a VS Code extension
npx @vscode/vsce packageInstall the generated tokencounter-2.0.0.vsix directly into your VS Code editor.
Start the local reverse proxy server to intercept and optimize LLM payloads:
$env:TOKENCOUNTER_TEST_TARGET="localhost:8080"; node out/proxy_launcher.jsThe proxy will listen on http://localhost:9099 and seamlessly forward optimized payloads to your target LLM server.
Simply redirect your preferred AI client's API base URL to http://localhost:9099/v1.
- Navigate to Cursor Settings -> Models -> OpenAI API.
- Override the base URL and enter
http://localhost:9099/v1. - Enter your API Key.
- Open Cline Settings -> API Provider -> Select OpenAI Compatible.
- Set Base URL to
http://localhost:9099/v1. - Enter your target model and API Key.
This project is licensed under the Business Source License 1.1 (BSL 1.1).
- Free for Personal & Educational Use: Free to use, modify, and distribute for personal, educational, and testing purposes.
- Free for Small Teams: Commercial production use is 100% free for organizations with 5 or fewer developers.
- Commercial License Required: Production use by organizations with more than 5 developers requires a commercial license from the Licensor (
dnislno). - Open Source Transition: On June 26, 2030, the license will automatically convert to the Apache License, Version 2.0, ensuring permanent open-source availability.
For full details, please refer to the LICENSE file.