Evaluation toolkit and reproduction bundle for the paper:
Incompressible Knowledge Probes: Estimating Black-Box LLM Parameter Counts via Factual Capacity. Bojie Li, Pine AI.
IKP is a 1,400-question factual benchmark — 200 items × 7 obscurity tiers (T1: universal knowledge … T7: extreme long-tail). Accuracy on IKP scales log-linearly with parameter count across 89 open-weight models from 135M to 1.6T (R² = 0.917), so a single black-box API call budget is enough to estimate the effective knowledge capacity of any deployed model — including closed-source frontier models whose sizes are undisclosed.
- Paper PDF:
paper/main.pdf - Companion website (interactive): https://01.me/research/ikp
- Source: https://github.com/19PINE-AI/ikp
# 1. Install deps (Python ≥ 3.10)
pip install -r requirements.txt
# 2. Point at any OpenAI-compatible endpoint and run
export OPENROUTER_API_KEY=sk-or-...
python scripts/ikp_estimate.py --model openai/gpt-4.1Output:
╔══════════════════════════════════════════════════════════╗
║ IKP Estimation Results ║
║ Model: openai/gpt-4.1 ║
║ Probes: 1400 ║
║ Accuracy: 58.2% (penalized) 63.9% (raw) ║
║ Estimated: 400B parameters ║
╚══════════════════════════════════════════════════════════╝
T1 99% … T7 4%
Effective tier: T6
Estimated size: 400B (calibrated on 89 open models, R²=0.917)
Faster stratified sample (200 probes, ~1 min):
python scripts/ikp_estimate.py --model openai/gpt-4.1 --sample 200Non-OpenRouter endpoint (vLLM, OpenAI, Together, local):
python scripts/ikp_estimate.py \
--api-base http://localhost:8000/v1 \
--api-key <your-key> \
--model my-local-model
# Judge always runs on OpenRouter (google/gemini-3-flash-preview);
# OPENROUTER_API_KEY must still be set for the judge.Full CLI reference, including how to plug in a different judge or export
per-probe verdicts: see TOOLKIT.md.
A second, lighter CLI (python -m cli) lets readers poke at the
benchmark without running the full estimator. It has two modes.
Research mode — query the six tier landmarks plus three frontier models (GPT-5.5, DeepSeek V4 Pro, Claude Opus 4.7) with a researcher name or any free-form factual question:
export OPENROUTER_API_KEY=sk-or-...
# Look up a researcher (substring match against the probe set)
python -m cli research --researcher "Stjepan Picek"
# Ask any factual question
python -m cli research --question "Who founded the field of cache-oblivious algorithms?"Evaluation mode — re-run any probe against the preset models plus
any models you specify, scored with the paper's exact judge prompt
(google/gemini-3-flash-preview, CORRECT / WRONG / REFUSAL):
# Score a single tier-7 probe against the preset 9 models
python -m cli eval IKP_T7_1234
# Add your own models; --model is repeatable
python -m cli eval IKP_T5_0123 \
--model openai/gpt-4o \
--model id=qwen/qwen3-32b,name=q3-32b,thinking=trueT1 uses a local Ollama landmark (qwen2.5:0.5b); install Ollama or
ignore that row. The other eight models all run via OpenRouter.
Every figure and table in the paper, with the exact script, inputs and
expected outputs, is listed in REPRODUCTION.md.
Short path:
# Fastest: regenerate all figures from already-scored results
python paper/figures/generate_figures.py
python paper/figures/generate_appendix_figures.py
# Rebuild PDF (TeX Live)
cd paper && latexmk -pdf main.texTo score additional models and extend the dataset:
python scripts/run_all_models.py --skip-existing
python scripts/run_evaluation.py --rebuild-summary # refreshes evaluation_summary.jsonThe Makefile is the single entry point.
make help # list every target
# Paper
make figs # regenerate every figure under paper/figures/
make pdf # one pdflatex pass (fast, no bibtex)
make full # full rebuild with bibtex (4 passes)
# Calibration / data refresh after a new model lands in data/results/
make calibration # rerun loo_cv_analysis.py + analyze_results.py
make website # rebuild website/public/data/*.json (must precede website-build)
make data # = calibration + website
# Website
make website-dev # vite dev server → http://localhost:5173
make website-build # static build → website/dist/
make website-preview # preview the production build
make website-deploy # rsync website/dist/ to DEPLOY_HOST:DEPLOY_PATH
# override per invocation:
# make website-deploy DEPLOY_HOST=user@host \
# DEPLOY_PATH=/var/www/research/ikp/
make all # data → figs → pdfFor subpath deploys (e.g. https://example.com/research/ikp/), set
BASE_URL=/research/ikp/ make website-build. See website/README.md for full
website documentation, nginx config, and GitHub Pages instructions.
ikp-paper/
├── README.md ← this file
├── TOOLKIT.md ← ikp_estimate.py reference
├── REPRODUCTION.md ← figure/table ⇄ script map
├── requirements.txt
│
├── paper/ ← LaTeX sources
│ ├── main.tex main.pdf appendix.tex references.bib
│ ├── research-plan.md ← original planning document
│ └── figures/ ← PDF/PNG figures + generators (all main & appendix figs)
│ ├── generate_figures.py (main-text figs 1–6, 8)
│ └── generate_appendix_figures.py (appendix figs A1–A4)
│
├── configs/
│ ├── experiment.json ← tier definitions, API settings, seeds
│ ├── models.json ← calibration-set models (open, known size)
│ └── all_models.json ← full roster (188 models evaluated)
│
├── data/ ← see data/README.md for schemas
│ ├── probes/
│ │ ├── final_probe_set_v8.json ← THE 1,400 probes (the benchmark)
│ │ ├── researcher_probes.json ← researcher sub-probe source
│ │ └── archive/ ← earlier probe versions (v1..v7, batches, candidates)
│ ├── results/<model>.json ← per-model raw evaluations (188 files)
│ ├── results/evaluation_summary.json ← aggregated, consumed by every figure
│ ├── calibration/calibration_fit.json ← fitted log-linear calibration
│ ├── researcher_citations.json ← T4–T7 researcher metadata
│ ├── researcher_recognition_rates.json
│ ├── densing_analysis_data.csv ← Densing-Law table (for Fig 8)
│ ├── notes/ ← exploratory analysis markdown
│ └── archive/ ← superseded runs (results_v7, …)
│
├── results/
│ ├── figures/archive/ ← early-draft plots (superseded by paper/figures/)
│ └── tables/ ← .tex tables \input'ed by the paper
│
├── scripts/ ← see scripts/README.md for a full index
│ ├── ikp_estimate.py ← one-model estimator (public entrypoint)
│ ├── run_all_models.py ← bulk evaluation across the full roster
│ ├── run_evaluation.py ← single-model evaluator
│ ├── 01_..15_*.py ← numbered dataset pipeline
│ ├── analyze_results.py, loo_cv_analysis.py, show_progress.py
│ └── legacy/ ← one-off / superseded dev scripts (kept for audit)
│
├── pipeline/ ← probe generation + calibration library
├── src/ ← evaluation runtime (api_client, probe_runner, scorer, …)
├── cli/ ← interactive reader CLI (research + eval modes)
└── website/ ← React companion site
All active scripts resolve paths via Path(__file__).parent.parent, so
they expect to live in scripts/. Scripts under scripts/legacy/
have been patched to three-.. (.parent.parent.parent) and still
work when invoked directly.
Each probe is a short factual question with a gold answer, scored by a
Gemini 3 Flash Preview judge. Researcher subfield probes use a 4-way
evidence-aware judge (CORRECT_STRONG = subfield + verifiable evidence
item; CORRECT_WEAK = subfield only; REFUSAL; WRONG); other probes use
a 3-way judge (CORRECT / REFUSAL / WRONG). Penalized accuracy scores
each probe in {+1.0, +0.5, 0, λ} for the four classes with λ = -1
(WRONG); hallucinations are penalized to discourage guessing. The
calibration curve is log10(params_B) = 6.790 · accuracy − 0.899
(R² = 0.917 on 89 open models; LOO median fold error 1.59×, 68.5%
within 2× and 87.6% within 3×). For MoE models, total parameters
predict accuracy (R² = 0.79) much better than active parameters
(R² = 0.51) — so the curve is fit against total parameter count.
- Python ≥ 3.10
- An API key for the model(s) you want to evaluate (OpenRouter covers all 188 evaluated models; OpenAI-compatible endpoints also work)
- An
OPENROUTER_API_KEYfor the judge (always Gemini 3 Flash Preview) - ~$0.10–$3 per model to score the full 1,400 probes, depending on the model priced at OpenRouter rates
@misc{li2026incompressibleknowledgeprobesestimating,
title = {Incompressible Knowledge Probes: Estimating Black-Box LLM
Parameter Counts via Factual Capacity},
author = {Bojie Li},
year = {2026},
eprint = {2604.24827},
archivePrefix = {arXiv},
primaryClass = {cs.LG},
url = {https://arxiv.org/abs/2604.24827}
}Code: MIT. Probe set and per-model results: CC BY 4.0.