Incompressible Knowledge Probes (IKP)

Evaluation toolkit and reproduction bundle for the paper:

Incompressible Knowledge Probes: Estimating Black-Box LLM Parameter Counts via Factual Capacity. Bojie Li, Pine AI.

IKP is a 1,400-question factual benchmark: 200 items × 7 obscurity tiers (T1: universal knowledge … T7: extreme long-tail). Accuracy on IKP scales log-linearly with parameter count across 89 open-weight models from 135M to 1.6T parameters (R² = 0.917), so a small, fixed budget of black-box API calls is enough to estimate the effective knowledge capacity of any deployed model, including closed-source frontier models whose sizes are undisclosed.

Paper PDF: paper/main.pdf
Companion website (interactive): https://01.me/research/ikp
Source: https://github.com/19PINE-AI/ikp

Quickstart — estimate a model

# 1. Install deps (Python ≥ 3.10)
pip install -r requirements.txt

# 2. Point at any OpenAI-compatible endpoint and run
export OPENROUTER_API_KEY=sk-or-...
python scripts/ikp_estimate.py --model openai/gpt-4.1

Output:

  ╔══════════════════════════════════════════════════════════╗
  ║  IKP Estimation Results                                 ║
  ║  Model:     openai/gpt-4.1                              ║
  ║  Probes:    1400                                         ║
  ║  Accuracy:  58.2% (penalized)  63.9% (raw)              ║
  ║  Estimated:  400B parameters                             ║
  ╚══════════════════════════════════════════════════════════╝
  T1   99%  …  T7    4%
  Effective tier: T6
  Estimated size: 400B (calibrated on 89 open models, R²=0.917)

Faster stratified sample (200 probes, ~1 min):

python scripts/ikp_estimate.py --model openai/gpt-4.1 --sample 200
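
How `--sample` allocates probes across tiers is not spelled out in this README. The sketch below shows one plausible stratified scheme, not the toolkit's actual code. It assumes each probe is a dict with a hypothetical `tier` field, that the budget is split as evenly as possible across tiers, and that any remainder goes to the lowest tiers.

```python
import random
from collections import defaultdict


def stratified_sample(probes, n, seed=0):
    """Pick n probes, spread as evenly as possible across tiers."""
    by_tier = defaultdict(list)
    for probe in probes:
        by_tier[probe["tier"]].append(probe)

    tiers = sorted(by_tier)
    base, extra = divmod(n, len(tiers))  # 200 over 7 tiers -> 28 each, 4 left over
    rng = random.Random(seed)            # fixed seed keeps the subset reproducible

    sample = []
    for i, tier in enumerate(tiers):
        k = base + (1 if i < extra else 0)
        k = min(k, len(by_tier[tier]))   # never ask for more than the tier holds
        sample.extend(rng.sample(by_tier[tier], k))
    return sample


# Toy stand-in for the real probe set: 7 tiers x 200 items (IDs are illustrative).
probes = [
    {"id": f"IKP_T{t}_{i:04d}", "tier": t}
    for t in range(1, 8)
    for i in range(200)
]
subset = stratified_sample(probes, 200)
```

With 200 probes over 7 tiers, this split gives 29 probes to each of T1 to T4 and 28 to each of T5 to T7.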

Non-OpenRouter endpoint (vLLM, OpenAI, Together, local):

python scripts/ikp_estimate.py \
    --api-base http://localhost:8000/v1 \
    --api-key  <your-key> \
    --model    my-local-model
# Judge always runs on OpenRouter (google/gemini-3-flash-preview);
# OPENROUTER_API_KEY must still be set for the judge.

Full CLI reference, including how to plug in a different judge or export per-probe verdicts: see TOOLKIT.md.

Interactive CLI — explore the benchmark

A second, lighter CLI (python -m cli) lets readers poke at the benchmark without running the full estimator. It has two modes.

Research mode — query the six tier landmarks plus three frontier models (GPT-5.5, DeepSeek V4 Pro, Claude Opus 4.7) with a researcher name or any free-form factual question:

export OPENROUTER_API_KEY=sk-or-...

# Look up a researcher (substring match against the probe set)
python -m cli research --researcher "Stjepan Picek"

# Ask any factual question
python -m cli research --question "Who founded the field of cache-oblivious algorithms?"

Evaluation mode — re-run any probe against the preset models plus any models you specify, scored with the paper's exact judge prompt (google/gemini-3-flash-preview, CORRECT / WRONG / REFUSAL):

# Score a single tier-7 probe against the preset 9 models
python -m cli eval IKP_T7_1234

# Add your own models; --model is repeatable
python -m cli eval IKP_T5_0123 \
    --model openai/gpt-4o \
    --model id=qwen/qwen3-32b,name=q3-32b,thinking=true

T1 uses a local Ollama landmark (qwen2.5:0.5b); install Ollama or ignore that row. The other eight models all run via OpenRouter.
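
The two `--model` forms shown above (a bare model id, or comma-separated `key=value` pairs) could be parsed along the following lines. This is a hypothetical sketch rather than the CLI's own parser: `parse_model_spec` and the defaults for `name` and `thinking` are assumptions. Only `argparse` with `action="append"` is standard-library behaviour.

```python
import argparse


def parse_model_spec(spec):
    """Turn one --model value into a settings dict (hypothetical helper)."""
    if "=" not in spec:
        # Bare form: the whole string is the model id.
        return {"id": spec, "name": spec, "thinking": False}
    fields = dict(part.split("=", 1) for part in spec.split(","))
    return {
        "id": fields["id"],
        "name": fields.get("name", fields["id"]),  # assumed default: reuse the id
        "thinking": fields.get("thinking", "false").lower() == "true",
    }


parser = argparse.ArgumentParser()
# action="append" collects every occurrence of a repeatable flag into a list.
parser.add_argument("--model", action="append", default=[])

args = parser.parse_args([
    "--model", "openai/gpt-4o",
    "--model", "id=qwen/qwen3-32b,name=q3-32b,thinking=true",
])
models = [parse_model_spec(spec) for spec in args.model]
```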

Reproducing the paper

Every figure and table in the paper, with the exact script, inputs and expected outputs, is listed in REPRODUCTION.md.

Short path:

# Fastest: regenerate all figures from already-scored results
python paper/figures/generate_figures.py
python paper/figures/generate_appendix_figures.py

# Rebuild PDF (TeX Live)
cd paper && latexmk -pdf main.tex

To score additional models and extend the dataset:

python scripts/run_all_models.py --skip-existing
python scripts/run_evaluation.py --rebuild-summary  # refreshes evaluation_summary.json

Build the paper / website

The Makefile is the single entry point.

make help              # list every target

# Paper
make figs              # regenerate every figure under paper/figures/
make pdf               # one pdflatex pass (fast, no bibtex)
make full              # full rebuild with bibtex (4 passes)

# Calibration / data refresh after a new model lands in data/results/
make calibration       # rerun loo_cv_analysis.py + analyze_results.py
make website           # rebuild website/public/data/*.json (must precede website-build)
make data              # = calibration + website

# Website
make website-dev       # vite dev server  → http://localhost:5173
make website-build     # static build     → website/dist/
make website-preview   # preview the production build
make website-deploy    # rsync website/dist/ to DEPLOY_HOST:DEPLOY_PATH
                       # override per invocation:
                       #   make website-deploy DEPLOY_HOST=user@host \
                       #                       DEPLOY_PATH=/var/www/research/ikp/

make all               # data → figs → pdf

For subpath deploys (e.g. https://example.com/research/ikp/), pass BASE_URL on the command line: BASE_URL=/research/ikp/ make website-build. See website/README.md for full website documentation, nginx config, and GitHub Pages instructions.

Repo layout

ikp-paper/
├── README.md               ← this file
├── TOOLKIT.md              ← ikp_estimate.py reference
├── REPRODUCTION.md         ← figure/table ⇄ script map
├── requirements.txt
│
├── paper/                  ← LaTeX sources
│   ├── main.tex  main.pdf  appendix.tex  references.bib
│   ├── research-plan.md    ← original planning document
│   └── figures/            ← PDF/PNG figures + generators (all main & appendix figs)
│       ├── generate_figures.py            (main-text figs 1–6, 8)
│       └── generate_appendix_figures.py   (appendix figs A1–A4)
│
├── configs/
│   ├── experiment.json     ← tier definitions, API settings, seeds
│   ├── models.json         ← calibration-set models (open, known size)
│   └── all_models.json     ← full roster (188 models evaluated)
│
├── data/                   ← see data/README.md for schemas
│   ├── probes/
│   │   ├── final_probe_set_v8.json  ← THE 1,400 probes (the benchmark)
│   │   ├── researcher_probes.json   ← researcher sub-probe source
│   │   └── archive/                 ← earlier probe versions (v1..v7, batches, candidates)
│   ├── results/<model>.json         ← per-model raw evaluations (188 files)
│   ├── results/evaluation_summary.json  ← aggregated, consumed by every figure
│   ├── calibration/calibration_fit.json ← fitted log-linear calibration
│   ├── researcher_citations.json        ← T4–T7 researcher metadata
│   ├── researcher_recognition_rates.json
│   ├── densing_analysis_data.csv        ← Densing-Law table (for Fig 8)
│   ├── notes/                           ← exploratory analysis markdown
│   └── archive/                         ← superseded runs (results_v7, …)
│
├── results/
│   ├── figures/archive/    ← early-draft plots (superseded by paper/figures/)
│   └── tables/             ← .tex tables \input'ed by the paper
│
├── scripts/                ← see scripts/README.md for a full index
│   ├── ikp_estimate.py     ← one-model estimator (public entrypoint)
│   ├── run_all_models.py   ← bulk evaluation across the full roster
│   ├── run_evaluation.py   ← single-model evaluator
│   ├── 01_..15_*.py        ← numbered dataset pipeline
│   ├── analyze_results.py, loo_cv_analysis.py, show_progress.py
│   └── legacy/             ← one-off / superseded dev scripts (kept for audit)
│
├── pipeline/               ← probe generation + calibration library
├── src/                    ← evaluation runtime (api_client, probe_runner, scorer, …)
├── cli/                    ← interactive reader CLI (research + eval modes)
└── website/                ← React companion site

All active scripts resolve paths via Path(__file__).parent.parent, so they expect to live directly in scripts/. Scripts under scripts/legacy/ sit one directory deeper, so they have been patched to climb three levels (.parent.parent.parent) and still work when invoked directly.
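
To illustrate why the number of `.parent` hops matters, here is a minimal sketch of the path arithmetic. The `/repo` prefix, the `repo_root` helper and `old_script.py` are made up for the example. A real script would start from `Path(__file__)` instead of a literal string.

```python
from pathlib import Path


def repo_root(script_path, levels_up=2):
    """Walk up `levels_up` parents from a script's path."""
    root = Path(script_path)
    for _ in range(levels_up):
        root = root.parent
    return root


# scripts/<name>.py needs two hops; scripts/legacy/<name>.py needs three.
active = repo_root("/repo/scripts/run_evaluation.py")         # .parent.parent
legacy = repo_root("/repo/scripts/legacy/old_script.py", 3)   # .parent.parent.parent

# Both land on the same root, so data paths resolve identically.
probe_file = active / "data" / "probes" / "final_probe_set_v8.json"
```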

How it works (one paragraph)

Each probe is a short factual question with a gold answer, scored by a Gemini 3 Flash Preview judge. Researcher subfield probes use a 4-way evidence-aware judge (CORRECT_STRONG = subfield + verifiable evidence item; CORRECT_WEAK = subfield only; REFUSAL; WRONG); other probes use a 3-way judge (CORRECT / REFUSAL / WRONG). Penalized accuracy scores the four classes as CORRECT_STRONG = +1.0, CORRECT_WEAK = +0.5, REFUSAL = 0 and WRONG = λ, with λ = −1, so a hallucinated answer costs more than a refusal and guessing is discouraged. The calibration curve is log10(params_B) = 6.790 · accuracy − 0.899, where params_B is the parameter count in billions (R² = 0.917 on 89 open models; leave-one-out (LOO) median fold error 1.59×, 68.5% within 2× and 87.6% within 3×). For MoE models, total parameters predict accuracy (R² = 0.79) much better than active parameters (R² = 0.51), so the curve is fit against total parameter count.
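
The scoring and calibration steps in the paragraph above can be sketched in a few lines. This is an illustration under stated assumptions, not the toolkit's implementation: it assumes penalized accuracy is the mean per-probe score, that the 3-way judge's CORRECT earns full credit, that the calibration takes penalized accuracy as a fraction between 0 and 1, and that fold error is the usual symmetric ratio. The function names are made up for the example.

```python
# Per-verdict scores; WRONG carries the penalty lambda = -1.
SCORES = {
    "CORRECT_STRONG": 1.0,
    "CORRECT_WEAK": 0.5,
    "CORRECT": 1.0,   # assumption: 3-way CORRECT counts as full credit
    "REFUSAL": 0.0,
    "WRONG": -1.0,
}
SLOPE, INTERCEPT = 6.790, -0.899  # coefficients of the calibration curve


def penalized_accuracy(verdicts):
    """Mean per-probe score; negative when wrong answers dominate."""
    return sum(SCORES[v] for v in verdicts) / len(verdicts)


def estimate_params_b(accuracy):
    """Invert log10(params_B) = 6.790 * accuracy - 0.899; result in billions."""
    return 10 ** (SLOPE * accuracy + INTERCEPT)


def fold_error(predicted, actual):
    """Symmetric multiplicative error: 2.0 means off by 2x in either direction."""
    return max(predicted / actual, actual / predicted)


# Toy run: 6 correct, 3 refusals, 1 wrong -> (6 - 1) / 10 = 0.5
verdicts = ["CORRECT"] * 6 + ["REFUSAL"] * 3 + ["WRONG"]
acc = penalized_accuracy(verdicts)
size_b = estimate_params_b(acc)   # roughly 313 (billion parameters)
```

Because refusals score 0 and wrong answers score −1, a model that guesses on every probe it does not know ends up with a lower penalized accuracy, and therefore a lower size estimate, than one that abstains.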

Requirements

Python ≥ 3.10
An API key for the model(s) you want to evaluate (OpenRouter covers all 188 evaluated models; OpenAI-compatible endpoints also work)
An OPENROUTER_API_KEY for the judge (always Gemini 3 Flash Preview)
~$0.10–$3 per model, at OpenRouter rates, to score the full 1,400 probes; the exact cost depends on the model's pricing

Citing

@misc{li2026incompressibleknowledgeprobesestimating,
  title         = {Incompressible Knowledge Probes: Estimating Black-Box LLM
                   Parameter Counts via Factual Capacity},
  author        = {Bojie Li},
  year          = {2026},
  eprint        = {2604.24827},
  archivePrefix = {arXiv},
  primaryClass  = {cs.LG},
  url           = {https://arxiv.org/abs/2604.24827}
}

License

Code: MIT. Probe set and per-model results: CC BY 4.0.
