Compute and trust should scale with the difficulty of the single question — not with the size of the model.
We took an AI cost-model idea, checked it against the 2019–2026 literature, found that most of it already existed, isolated the one piece that is genuinely new and measurable, and tested it on real data. This README reports the honest result — including where it falls short.
- The idea (difficulty-bound AI): a trivial question (`2+2`) should cost almost nothing and come with a correctness certificate; a hard question should cost more compute and carry a stronger proof. Cost should track the instance, not the model size.
- The honest finding: the "adaptive compute" and "cheap correctness proof" parts are already published. We don't claim them. (See the table below.)
- The one new, defensible, measurable piece: verification locality `L = d_min(x)`, i.e. how many sub-claims you must read together to catch an error. It is an axis of verification difficulty, orthogonal to generation, and it can be measured.
- The hard number: on HotpotQA (300 multi-hop questions) `L = 2.47` (mean). Small and constant: it does not grow with context.
- The verdict: a real but bounded contribution. It does not revolutionize AI. It is a small, falsifiable result aligned with where AI is already going (test-time compute, hallucination detection).
"Proportional Intelligence" is a cost model for difficulty-bound AI. Instead of paying a fixed per-token price, each input x is described by a per-instance Resource Trihedron R_x(B, D, V):
- B: compute spent on `x`
- D: semantic distortion (how far the answer drifts from truth)
- V: verifiability deficit (how hard the answer is to check)
bound by an additive conservation law: B + Φ(D) + V ≥ difficulty(x).
The point is that this is a property of the instance, not of the model.
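To make the bookkeeping concrete, here is a minimal sketch of the trihedron as a per-instance record. The class name `Trihedron`, the methods `budget` and `satisfies_bound`, the example `phi`, and all numeric values are illustrative assumptions, not the API of `esperimento/triedro.py`. In practice `Φ` and `difficulty(x)` are only available as proxies.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Trihedron:
    """Hypothetical per-instance resource triple R_x(B, D, V)."""
    B: float  # compute spent on x
    D: float  # semantic distortion
    V: float  # verifiability deficit

    def budget(self, phi: Callable[[float], float]) -> float:
        """Left-hand side of the conservation law: B + phi(D) + V."""
        return self.B + phi(self.D) + self.V

    def satisfies_bound(self, difficulty: float,
                        phi: Callable[[float], float]) -> bool:
        """True when B + phi(D) + V >= difficulty (a proxy value)."""
        return self.budget(phi) >= difficulty


# Illustrative numbers only; phi is an assumed placeholder for Φ.
phi = lambda d: 2.0 * d
r_x = Trihedron(B=1.0, D=0.5, V=0.25)
print(r_x.budget(phi))                # left-hand side of the law
print(r_x.satisfies_bound(2.0, phi))  # compare against a proxy difficulty
```

The record is attached to an input `x`, not to a model, which is the distinction the framing relies on.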
This table is the core of the repo's credibility. We validated the original "novelty" claims against the literature and most of them did not survive.
| Claim we initially thought was new | Already exists? | Prior work |
|---|---|---|
| Compute adapts to the knowledge/difficulty of the input | ✅ Yes | Adaptive-RAG (2403.14403), PEER (2407.04153), Memory Layers at Scale (2412.09764) |
| A cheap proof / certificate of output correctness | ✅ Yes | Semantic Entropy (Nature, 2024), Self-Proving Models (2405.15722), Proof-Carrying Numbers (2509.06902) |
| The per-instance trihedron R_x(B, D, V) as an object (not a per-model metric) | 🟡 Framing | New framing; the components draw on existing notions |
| Verification locality L = d_min(x) as a measurable axis | ❌ No | This is the contribution |
| Falsifiable law q* ~ m^(1 - 1/L) linking L to spot-check cost | ❌ No | Derived and tested here |
Takeaway: the contribution is not a new mechanism. It is making L = d_min a measurable quantity that predicts when randomized verification beats deterministic verification.
Some 2025–2026 citations above still need manual verification — see Honest limitations.
L = d_min(x) = how many sub-claims must be read together to discover an error. It is an axis of verification difficulty, orthogonal to generation, and it can be measured.
From it follows a falsifiable prediction about the number of spot-checks q* needed to catch an error, where m = number of sub-claims:
q* ~ m^(1 - 1/L)
- L = 1 (fully decomposable) → `q*` is constant → the gap `G_x = V_det − V` grows with answer length.
- L = m (holistic) → `q* ~ m` → `G_x → 0` (collapse: nothing is gained by spot-checking).
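The two regimes above can be tabulated directly from the scaling law. This is a small sketch that evaluates `m^(1 - 1/L)` with the hidden constant set to 1, which is an assumption: the law is stated only up to a constant factor. The function name `predicted_spot_checks` is hypothetical.

```python
def predicted_spot_checks(m: int, L: float) -> float:
    """Scaling prediction q* ~ m^(1 - 1/L), with the constant factor taken as 1."""
    if m < 1 or L < 1:
        raise ValueError("need m >= 1 and L >= 1")
    return m ** (1.0 - 1.0 / L)


for m in (10, 100, 1000):
    row = {L: round(predicted_spot_checks(m, L), 1) for L in (1, 2, 4, m)}
    print(m, row)
```

For `L = 1` the prediction stays at 1 regardless of `m`, while for `L = m` it approaches `m` itself, which is the collapse case.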
For structured tasks (arithmetic) the Non-Collapse is exact, via a Freivalds-style randomized check: to check c == a·b, verify that c mod p equals ((a mod p)·(b mod p)) mod p for random primes p, which gives an O(s log) certificate against a Θ(n) answer.
The synthetic bench confirms the law, with measured exponents 0.15, 0.51, 0.77, 0.89, 1.00 for L = 1, 2, 4, 8, m — monotonically increasing, as predicted. (These are from a representative run; the simulator is stochastic, so exact values vary slightly between runs — what is robust is the monotone ordering and the L=1 → 0, L=m → 1 endpoints.)
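For reference, the law predicts exponents `1 - 1/L` of 0, 0.5, 0.75, and 0.875 for L = 1, 2, 4, 8, approaching 1 for L = m. One common way to estimate such an exponent from (m, q*) pairs is a least-squares slope on log-log axes. The sketch below shows that approach under the assumption that it is comparable to what the bench does; `fit_exponent` is a hypothetical helper, not a function from `semantic_pcp.py`, and the input data is generated from the law itself rather than measured.

```python
import math


def fit_exponent(ms: list[float], qs: list[float]) -> float:
    """Least-squares slope of log(q) against log(m), i.e. the scaling exponent."""
    xs = [math.log(m) for m in ms]
    ys = [math.log(q) for q in qs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Noise-free synthetic input for L = 2, so the fit should recover 1 - 1/2.
ms = [16, 64, 256, 1024]
qs = [m ** (1 - 1 / 2) for m in ms]
print(fit_exponent(ms, qs))
```

With stochastic simulator output the fitted slope will deviate from the ideal value, which is consistent with the run-to-run variation noted above.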
We then asked: for natural language (where no formal verifier like Freivalds exists), does the gap G_x still grow with difficulty? We measured L on real datasets.
The hard number — HotpotQA, 300 multi-hop questions, L = |supporting_facts|:
| statistic | value |
|---|---|
| mean | 2.47 |
| median | 2 |
| range | 2 – 5 |
| distribution | L2: 65% · L3: 24% · L4: 9% · L5: 1% |
Real multi-hop reasoning needs to combine 2–3 facts (rarely up to 5), and this does not grow with context size. L is a small constant.
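The measurement itself is simple counting. The sketch below assumes each record carries a `supporting_facts` list of `[title, sentence_index]` pairs, as in the original HotpotQA JSON release. The helper names and the three toy records are illustrative placeholders, not the layout of `hotpot_L.json` and not real data.

```python
from collections import Counter
from statistics import mean, median


def locality(example: dict) -> int:
    """L for one question: the number of distinct supporting facts."""
    # Assumed record shape: {"supporting_facts": [[title, sentence_index], ...]}
    return len({(title, idx) for title, idx in example["supporting_facts"]})


def summarize(examples: list[dict]) -> dict:
    """Mean, median, range, and distribution of L over a set of questions."""
    values = [locality(ex) for ex in examples]
    counts = Counter(values)
    return {
        "mean": mean(values),
        "median": median(values),
        "range": (min(values), max(values)),
        "distribution": {L: counts[L] / len(values) for L in sorted(counts)},
    }


# Toy placeholder records, only to show the shape of the computation.
toy = [
    {"supporting_facts": [["Doc A", 0], ["Doc B", 1]]},
    {"supporting_facts": [["Doc C", 0], ["Doc D", 2]]},
    {"supporting_facts": [["Doc E", 0], ["Doc F", 1], ["Doc F", 3]]},
]
summary = summarize(toy)
print(summary)
```

Because `L` is read off the annotation rather than the context length, adding distractor paragraphs to a question does not change it.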
The rest of the spectrum (smaller N, in-context judgement, not hard measurement):
| Response type | L | basis |
|---|---|---|
| Wikipedia biography | ≈ 1 | judgement: each fact independently checkable |
| Multi-hop QA (HotpotQA) | 2.47 | measured from supporting_facts |
| Summary — local claims | ≈ 1 | judgement |
| Summary — aggregate/global claims | ≈ m | judgement: refuting them requires reading almost everything |
Neither "total revolution" nor "just renaming." A real and bounded contribution:
- The third (verification) coordinate is real for natural language in the decomposable class, which is most useful factual / multi-hop NL. There `L` is a small constant ⇒ `G_x` grows ⇒ Non-Collapse holds even without a formal verifier. That was not obvious.
- A genuinely holistic class exists (aggregate / global / coherence claims) where `L ≈ m` and `G_x → 0`. It is the exception, but it is a publishable impossibility result in its own right.
- So the value is not "we invented cheap verification" (Semantic Entropy etc. already exist); it is making `L = d_min` measurable and showing real factual NL sits in the favorable regime.
This does not revolutionize AI. It is a small, falsifiable result pointing in the same direction the field is already moving.
No dependencies — pure stdlib Python 3. Runs immediately:
# 1) Measure the Trihedron (B,D,V) on a real model with 4 falsifiable checks (mock backend, no LLM needed)
python3 esperimento/run.py --backend mock --task both --samples 8
# 2) Synthetic Phase 3: confirm the law q* ~ m^(1 - 1/L)
python3 esperimento/semantic_pcp.py
# 3) Phase 3 on real downloaded data (HotpotQA L=2.47, Wikipedia, CNN/DailyMail)
python3 esperimento/phase3_real.py

`run.py` also supports real backends: `--backend ollama` or `--backend openai` (OpenAI-compatible). See `esperimento/README.md` for the full experimental protocol.
proportional-intelligence/
├── README.md ← you are here (EN)
├── PAPER.md ← the write-up (EN)
├── LICENSE ← MIT
├── CITATION.cff ← cite this work
├── CONTRIBUTING.md
├── assets/ ← hero.svg, spectrum.svg, result.svg, method.svg
├── esperimento/ ← code + data (do not rename)
│ ├── run.py ← Trihedron (B,D,V), 4 checks, semantic clustering of H_adv
│ ├── backends.py ← mock / ollama / openai-compat backends (stdlib only)
│ ├── triedro.py ← Trihedron infrastructure
│ ├── semantic_pcp.py ← synthetic Phase 3: q* ~ m^(1 - 1/L)
│ ├── phase3_real.py ← Phase 3 on real data
│ ├── real_data.json ← downloaded Wikipedia / CNN-DailyMail samples
│ ├── hotpot_L.json ← measured HotpotQA L values (300 items)
│ └── README.md ← detailed experimental protocol
└── (Italian deep-dives — see below)
This section is deliberately prominent — it is the point of the repo.
- The Kolmogorov-flavored quantities (`C^t`, `Φ`, `difficulty`) are incomputable. We use them as proxies, not as absolute thresholds.
- No automated judge-model ran in the loop. The environment had no script-accessible LLM, so the `L` values for biographies and summaries are small-N in-context judgements by the author. Only HotpotQA is hard data (300 items, measured from `supporting_facts`).
- `L` was not estimated by injecting errors and measuring the true detection `q*` with an automated judge; it was inferred from data structure and reasoning. The next rigorous step is exactly that, on a labeled faithfulness dataset (e.g. FActScore).
- The prevalence of the holistic tail in real NL is not quantified. Establishing it needs a judge-LLM in the loop over thousands of labeled items.
- The optimality of the conservation law is conjectural: only the lower bound is argued.
- Everything rests on the `q* ~ m^(1 - 1/L)` simulator, which assumes independent errors and well-decomposed claims. Shared latent premises in real NL can correlate claims and raise the effective `L`.
- Some 2025–2026 citations still need manual verification.
- Put a judge-LLM in the loop: inject errors, measure the true detection `q*` against predicted `m^(1 - 1/L)`.
- Quantify the prevalence of the holistic (`L ≈ m`) tail on a large labeled faithfulness dataset (FActScore-style).
- Formalize the impossibility result for the holistic class.
- Tighten the conservation law from a lower bound toward an optimality argument.
- Verify the 2025–2026 references by hand and pin DOIs in `CITATION.cff`.
If you use this work, please cite it via CITATION.cff.
Contributions, replications, and especially falsifications are welcome — the whole point is rigor. See CONTRIBUTING.md. If you can break the q* ~ m^(1 - 1/L) prediction on real data, please open an issue.
MIT © Wavy (@condorstark)
The long-form reasoning chain is written in Italian:
- `Intelligenza-Proporzionale.md`: the concept.
- `Analisi-Critica-Intelligenza-Proporzionale.md`: critique against the literature.
- `Soluzione-Rivoluzionaria.md`: isolating the new piece (the filename is historical; the content is the bounded result described above, not a revolution).
- `RISULTATO-Fase3.md`: the Phase 3 real-data result.