Read the full launch post and watch the 2D tensor collision demo here: https://maniksundar.substack.com/p/the-physics-illusion-why-llms-still
LLM-generated unit tests for KV-cache routing kernels suffer from a silent failure mode: the LLM hallucinates the same bug in both the implementation and the test, causing the test to pass while the kernel remains incorrect. This happens because LLMs reason from the same flawed mental model when writing both code and tests. ImpactArbiter addresses this by using a two-stage RAG pipeline: first, a Distill Agent extracts and summarizes the routing logic from the actual research paper; second, a Coding Agent writes the implementation and test based on that summary. The generated code is then run through a PyTorch autograd trap that compares gradient signatures against SymPy oracles. The trap catches bugs that unit tests miss, even when the LLM's own test_route() assertions pass.
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -e .

OpenAI:
export OPENAI_API_KEY="your-openai-api-key"
# On Windows (PowerShell): $env:OPENAI_API_KEY="your-openai-api-key"

Claude (Anthropic):
export ANTHROPIC_API_KEY="your-anthropic-api-key"
# On Windows (PowerShell): $env:ANTHROPIC_API_KEY="your-anthropic-api-key"

Gemini / Vertex AI:
# Ensure gcloud is authenticated and project is set
gcloud auth login
gcloud config set project impactagent
export GOOGLE_CLOUD_PROJECT="impactagent"
# On Windows (PowerShell): $env:GOOGLE_CLOUD_PROJECT="impactagent"

For persistent configuration, add these to your .env file:
OPENAI_API_KEY=your-openai-api-key
ANTHROPIC_API_KEY=your-anthropic-api-key
GOOGLE_CLOUD_PROJECT=impactagent
impactarbiter auto-heal --oracle radix --model gemini

- --full-agent-trace: Display LLM Chain-of-Thought reasoning before code generation and heal attempts
- --live: Use live LLM API calls instead of cached deterministic replay (requires API key)
- --mock: Run offline evaluation with deterministic replay (default if no API key)
Example with live LLM generation and full trace:
impactarbiter verify --workflow agentic-kv-scheduler --full-agent-trace --live

━━━━━━━━━━━━━━━━━━━ IMPACT ARBITER - AUTO-HEAL ━━━━━━━━━━━━━━━━━━━
Model: vertex_ai/gemini-2.5-pro
[PAPER DOWNLOADED]
https://arxiv.org/pdf/2312.07104.pdf
[QUICK DISTILL]
### KV Cache Routing Specification: Planner-Executor Handoff
...
[GENERATED CODE & TESTS]
def route_radix_2d(b_local_idx, head_idx, prefix_length_h, total_blocks_h, block_size):
    k = prefix_length_h + b_local_idx
    logical_block = k // block_size
    offset = k % block_size
    return (head_idx, logical_block, offset)
[LLM UNIT TEST PASS ✅]
LLM self-validation passed.
[AUTOGRAD TRAP FAIL - HARD_BLOCK]
divergence=1.00e+00 > tol=1e-04
GRADIENT DIVERGENCE MAP - KV_cache.grad (head × block × offset)
Token (b=5,h=0,prefix_h=60,N_h=4) | Expected: head=0 block=0 offset=1 | Got: head=0 block=4 offset=1
Non-zero gradient at: [0, 4, 1, :] → misrouted 128 floats
[AUTO-HEAL attempt 1/3]
def route_radix_2d(b_local_idx, head_idx, prefix_length_h, total_blocks_h, block_size):
    absolute_idx = prefix_length_h + b_local_idx
    logical_block = (absolute_idx // block_size) % total_blocks_h
    offset = absolute_idx % block_size
    return (head_idx, logical_block, offset)
[FINAL PASS ✅]
divergence=0.00e+00 (after 1 heal attempts)
The autograd trap itself is fully deterministic, which means identical code always produces identical gradient divergence results.
What varies is whether the LLM generates correct or incorrect routing logic on a given run.
This mirrors real production reality: agent-generated serving code sometimes passes, but often silently fails on ragged boundaries.
ImpactArbiter gives you deterministic verification of whichever code the agent produces, so you're not relying on hoping the model "got it right this time."
In practice, Gemini 2.5 Pro generates incorrect routing on roughly 65% of attempts for the critical 2D ring-buffer wrap cases. The trap catches every incorrect implementation with zero false negatives.
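The trap idea described above can be sketched in a few lines of PyTorch. This is a minimal illustration under stated assumptions, not ImpactArbiter's actual implementation: the names `gradient_signature` and `trap`, the cache shape, and `BLOCK_SIZE = 16` (inferred from the fixture arithmetic, e.g. 65 // 16 = 4) are hypothetical, and a plain Python function stands in for the SymPy oracle. The two routing functions are copied from the example trace.

```python
import torch

BLOCK_SIZE = 16   # assumption: inferred from the fixtures, not stated in the README
HEAD_DIM = 128    # matches "misrouted 128 floats" in the example trace


def route_generated(b_local_idx, head_idx, prefix_length_h, total_blocks_h, block_size):
    # First LLM attempt from the trace: no ring-buffer wrap.
    k = prefix_length_h + b_local_idx
    return (head_idx, k // block_size, k % block_size)


def route_reference(b_local_idx, head_idx, prefix_length_h, total_blocks_h, block_size):
    # Healed version from the trace, used here as a stand-in oracle.
    absolute_idx = prefix_length_h + b_local_idx
    return (head_idx, (absolute_idx // block_size) % total_blocks_h, absolute_idx % block_size)


def gradient_signature(route_fn, token, num_heads=4, alloc_blocks=16):
    """Return d(loss)/d(KV_cache) when the loss reads only the routed slot."""
    b, h, prefix, n_h = token
    # Over-allocate blocks so a misrouted index shows up as a gradient in the
    # wrong slot instead of raising an IndexError.
    kv = torch.zeros(num_heads, alloc_blocks, BLOCK_SIZE, HEAD_DIM, requires_grad=True)
    head, block, offset = route_fn(b, h, prefix, n_h, BLOCK_SIZE)
    kv[head, block, offset].sum().backward()
    return kv.grad


def trap(route_fn, oracle_fn, token, tol=1e-4):
    """Compare gradient signatures; return (divergence, passed, slots touched)."""
    got = gradient_signature(route_fn, token)
    expected = gradient_signature(oracle_fn, token)
    divergence = (got - expected).abs().max().item()
    hot_slots = got.abs().sum(dim=-1).nonzero().tolist()
    return divergence, divergence <= tol, hot_slots


if __name__ == "__main__":
    token = (5, 0, 60, 4)  # (b, h, prefix_h, N_h) from the example trace
    print(trap(route_generated, route_reference, token))
    print(trap(route_reference, route_reference, token))
```

Because the loss touches exactly one slot, the gradient is non-zero only where the routing function pointed, so the comparison is deterministic for a given function and token.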
Recommended for production-relevant demo.
| b_local_idx | head_idx | prefix_h | N_h | expected block | expected offset | note |
|---|---|---|---|---|---|---|
| 0 | 0 | 47 | 8 | 2 | 15 | ragged straddle: partial-block carry-over |
| 5 | 0 | 60 | 4 | 0 | 1 | ring-buffer wrap: abs=65, block 4 wraps to 0 |
| 0 | 3 | 200 | 4 | 0 | 8 | ring-buffer deep wrap (multiple revolutions) |
Boundary fixtures [15, 99, 100, 105, 128] are maintained for historical comparison mode.
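To see why the wrap cases in the table matter, the fixtures can be replayed in plain Python against the two routing functions from the example trace. This is a hedged sketch: `FIXTURES`, `failing_fixtures`, and `BLOCK_SIZE = 16` are assumptions for illustration (the block size is inferred from the table's arithmetic, not stated in the project), and the real fixtures live in src/fuzzer/.

```python
BLOCK_SIZE = 16  # assumption: consistent with every row of the fixture table

# (b_local_idx, head_idx, prefix_h, N_h, expected_block, expected_offset)
FIXTURES = [
    (0, 0, 47, 8, 2, 15),   # ragged straddle
    (5, 0, 60, 4, 0, 1),    # ring-buffer wrap: abs=65, block 4 wraps to 0
    (0, 3, 200, 4, 0, 8),   # deep wrap: block 12 wraps to 0
]


def route_without_wrap(b_local_idx, head_idx, prefix_length_h, total_blocks_h, block_size):
    # First generated version from the trace.
    k = prefix_length_h + b_local_idx
    return (head_idx, k // block_size, k % block_size)


def route_with_wrap(b_local_idx, head_idx, prefix_length_h, total_blocks_h, block_size):
    # Healed version from the trace: block index taken modulo the ring size.
    absolute_idx = prefix_length_h + b_local_idx
    return (head_idx, (absolute_idx // block_size) % total_blocks_h, absolute_idx % block_size)


def failing_fixtures(route_fn):
    """Return the fixtures whose routed (head, block, offset) differs from the table."""
    failures = []
    for b, h, prefix, n_h, exp_block, exp_offset in FIXTURES:
        got = route_fn(b, h, prefix, n_h, BLOCK_SIZE)
        if got != (h, exp_block, exp_offset):
            failures.append(((b, h, prefix, n_h), got))
    return failures


if __name__ == "__main__":
    print("without wrap:", failing_fixtures(route_without_wrap))
    print("with wrap:   ", failing_fixtures(route_with_wrap))
```

The version without the modulo still passes the ragged-straddle row, which is the kind of partial agreement that lets a self-written unit test pass while the wrap cases stay broken.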
| Oracle | Total Runs | Trap Fired | Healed Successfully |
|---|---|---|---|
| radix-2d | 32 | 21 | 21 |
| radix (1D) | 15 | 9 | 9 |
| vllm (Paged) | 15 | 0 | 0 |
src/
├── oracles/   # SymPy ASTs + lambdified callables
├── trap/      # autograd trap & ASCII divergence map
├── fuzzer/    # explicit boundary fixtures
├── cli/       # auto-heal pipeline + litellm agent + paper extractor
└── db/        # nextpaper.db (SQLite) validation_traces
tests/
├── test_paged_oracle.py
├── test_radix_oracle.py
└── test_trap.py
pytest tests/ -v

The four load-bearing claims in tests/test_trap.py must all pass.
ImpactArbiter is an open project; contributions of new oracles, fuzz cases, and bug reports are welcome. See CONTRIBUTING.md for the full contributor guide. A short summary:
- Reporting issues: Open a GitHub issue using the relevant template under `.github/ISSUE_TEMPLATE/`. Bug reports must include the exact CLI command, the model used, the divergence map (or stack trace), and the contents of `nextpaper.db` row(s) when relevant.
- Contributing a new oracle: Every oracle must ship as a triple: (1) a SymPy AST plus a `lambdified` callable in `src/oracles/`, (2) a deterministic autograd trap in `src/trap/`, and (3) explicit boundary fixtures in `src/fuzzer/`. PRs without all three will be sent back.
- Discussing methodology: Open a methodology issue. These are reviewed weekly and used to calibrate the mock hallucination rates and per-oracle session windows.
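As a rough picture of the first element of that triple, a SymPy expression plus a lambdified callable for the radix-2d wrap rule might look like the sketch below. The symbol names, the `oracle` variable, and the choice of `modules="math"` are assumptions for illustration; the project's actual oracle layout in src/oracles/ may differ.

```python
import sympy as sp

# Hypothetical symbol names; all are non-negative integers.
b, prefix, n_blocks, block_size = sp.symbols(
    "b prefix n_blocks block_size", integer=True, nonnegative=True
)

# Symbolic form of the healed routing rule shown in the example trace.
absolute_idx = prefix + b
block_expr = sp.Mod(sp.floor(absolute_idx / block_size), n_blocks)
offset_expr = sp.Mod(absolute_idx, block_size)

# lambdify turns the expressions into a plain Python callable returning
# (logical_block, offset); the head index passes through routing unchanged.
oracle = sp.lambdify(
    (b, prefix, n_blocks, block_size), (block_expr, offset_expr), modules="math"
)

if __name__ == "__main__":
    # Fixture rows from the README table, with block_size=16 as an assumption.
    for b_local_idx, prefix_h, n_h in [(0, 47, 8), (5, 60, 4), (0, 200, 4)]:
        print((b_local_idx, prefix_h, n_h), oracle(b_local_idx, prefix_h, n_h, 16))
```

Keeping the rule as a symbolic expression means the same definition can be inspected, printed, and evaluated, rather than living only inside generated code.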
| Template | When to use |
|---|---|
| bug_report.md | The trap, auto-heal, evaluator, or CSV export produces incorrect or crashing behavior. |
| oracle_contribution.md | You want to propose a new attention/routing oracle (e.g. FlashInfer, MLA, sliding-window). |
| methodology.md | You disagree with a hallucination-rate calibration, session-window definition, or trap tolerance. |
A bug is only actionable when we can replay the failure deterministically. Please include:
- Command line: The exact `impactarbiter ...` invocation, including all flags.
- Environment: Python version, OS, and whether `--live` or `--mock` was used.
- Model identity (if `--live`): e.g. `vertex_ai/gemini-2.5-pro`. Do not paste API keys.
- Failure surface: Paste either the gradient divergence map, the auto-heal stack trace, or the `results.csv` summary block.
- Expected vs actual: What the oracle predicts vs what the agent / trap returned.
If the bug is in the trap itself (false PASS or false FAIL), attach the offending agent function and the corresponding boundary case from src/fuzzer/. Trap correctness bugs are treated as P0.
MIT.