condorstark/proportional-intelligence

Proportional Intelligence

Compute and trust should scale with the difficulty of the single question — not with the size of the model.

License: MIT · Made with stdlib Python · Honest by design · No dependencies · Real data

[Figure: assets/hero.svg]

We took an AI cost-model idea, checked it against the 2019–2026 literature, found that most of it already existed, isolated the one piece that is genuinely new and measurable, and tested it on real data. This README reports the honest result — including where it falls short.


TL;DR

  • The idea (difficulty-bound AI): a trivial question (2+2) should cost almost nothing and come with a correctness certificate; a hard question should cost more compute and carry a stronger proof. Cost should track the instance, not the model size.
  • The honest finding: the "adaptive compute" and "cheap correctness proof" parts are already published. We don't claim them. (See the table below.)
  • The one new, defensible, measurable piece: verification locality L = d_min(x) — how many sub-claims you must read together to catch an error. It is an axis of verification difficulty, orthogonal to generation, and it can be measured.
  • The hard number: on HotpotQA (300 multi-hop questions), mean L = 2.47 (median 2, range 2–5). It is small and roughly constant: it does not grow with context.
  • The verdict: a real but bounded contribution. It does not revolutionize AI. It is a small, falsifiable result aligned with where AI is already going (test-time compute, hallucination detection).

The idea

"Proportional Intelligence" is a cost model for difficulty-bound AI. Instead of paying a fixed per-token price, each input x is described by a per-instance Resource Trihedron R_x(B, D, V):

  • B — compute spent on x
  • D — semantic distortion (how far the answer drifts from truth)
  • V — verifiability deficit (how hard the answer is to check)

bound by an additive conservation law: B + Φ(D) + V ≥ difficulty(x).

The point is that this is a property of the instance, not of the model.
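As a minimal sketch of what "per-instance" means here, the three coordinates can be held in a small record and checked against the bound. The names `Trihedron` and `slack` are hypothetical and are not the API of esperimento/triedro.py. The README does not specify Φ, and difficulty(x) is incomputable, so both are passed in by the caller as stand-ins.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Trihedron:
    """Per-instance resource triple R_x(B, D, V) for one input x."""
    B: float  # compute spent on x
    D: float  # semantic distortion
    V: float  # verifiability deficit

    def slack(self, phi: Callable[[float], float], difficulty: float) -> float:
        """Return B + phi(D) + V - difficulty.

        A non-negative value means the bound B + phi(D) + V >= difficulty
        holds for this instance under the supplied stand-ins.
        """
        return self.B + phi(self.D) + self.V - difficulty


# Illustrative numbers only; phi and difficulty are assumed, not measured.
r = Trihedron(B=3.0, D=0.5, V=1.0)
print(r.slack(phi=lambda d: 2.0 * d, difficulty=4.0))
```

The record is attached to an input, not to a model, which is the distinction the text draws.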

[Figure: assets/method.svg]


What's genuinely new vs. what already existed

This table is the core of the repo's credibility. We validated the original "novelty" claims against the literature and most of them did not survive.

| Claim we initially thought was new | Already exists? | Prior work |
|---|---|---|
| Compute adapts to the knowledge/difficulty of the input | ✅ Yes | Adaptive-RAG (2403.14403), PEER (2407.04153), Memory Layers at Scale (2412.09764) |
| A cheap proof / certificate of output correctness | ✅ Yes | Semantic Entropy (Nature, 2024), Self-Proving Models (2405.15722), Proof-Carrying Numbers (2509.06902) |
| The per-instance trihedron R_x(B, D, V) as an object (not a per-model metric) | 🟡 Framing | New framing; the components draw on existing notions |
| Verification locality L = d_min(x) as a measurable axis | No | This is the contribution |
| Falsifiable law q* ~ m^(1 - 1/L) linking L to spot-check cost | No | Derived and tested here |

Takeaway: the contribution is not a new mechanism. It is making L = d_min a measurable quantity that predicts when randomized verification beats deterministic verification.

Some 2025–2026 citations above still need manual verification — see Honest limitations.


The measurable new piece: verification locality L = d_min

L = d_min(x) = how many sub-claims must be read together to discover an error. It is an axis of verification difficulty, orthogonal to generation, and it can be measured.

From it follows a falsifiable prediction about the number of spot-checks q* needed to catch an error, where m = number of sub-claims:

q*  ~  m^(1 - 1/L)
  • L = 1 (fully decomposable) → q* is constant → the gap G_x = V_det − V grows with answer length.
  • L = m (holistic) → q* ~ m → the gap G_x shrinks to 0 (collapse: nothing is gained by spot-checking).
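To make the two endpoints concrete, the predicted scale can be tabulated for a fixed m. This is only a sketch of the formula as stated: `predicted_spot_checks` is a hypothetical helper, and the constant factor hidden by "~" is ignored.

```python
def predicted_spot_checks(m: int, L: int) -> float:
    """Predicted spot-check scale q* ~ m^(1 - 1/L), ignoring constant factors."""
    if not 1 <= L <= m:
        raise ValueError("expected 1 <= L <= m")
    return m ** (1.0 - 1.0 / L)


# Tabulate the prediction for m = 64 sub-claims across the locality spectrum.
m = 64
for L in (1, 2, 4, 8, m):
    print(f"L={L:>2}  q* ~ {predicted_spot_checks(m, L):.2f}")
```

At L = 1 the exponent is 0 and the prediction is flat in m. As L approaches m the exponent approaches 1 and the prediction approaches reading everything.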

For structured tasks (arithmetic), Non-Collapse is exact, via a Freivalds-style randomized check: to check c == a·b, verify (a mod p)·(b mod p) ≡ c (mod p) for s random primes p. That is an O(s log n) certificate against a Θ(n) answer.
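The modular check described above can be sketched in a few lines of stdlib Python. The function names, the 20-bit prime size, the default of 8 rounds, and the trial-division prime search are all assumptions made for illustration, not the repository's code. The check is one-sided: a correct product always passes, and a wrong one passes only if every sampled prime happens to divide a·b − c.

```python
import random


def is_prime(n: int) -> bool:
    """Trial division; adequate for the small primes sampled below."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    i = 3
    while i * i <= n:
        if n % i == 0:
            return False
        i += 2
    return True


def random_prime(bits: int, rng: random.Random) -> int:
    """Draw random odd `bits`-bit integers until one is prime."""
    while True:
        candidate = rng.getrandbits(bits) | (1 << (bits - 1)) | 1
        if is_prime(candidate):
            return candidate


def check_product(a: int, b: int, c: int, rounds: int = 8,
                  bits: int = 20, seed=None) -> bool:
    """Randomized check of c == a * b using residues modulo random primes."""
    rng = random.Random(seed)
    for _ in range(rounds):
        p = random_prime(bits, rng)
        if (a % p) * (b % p) % p != c % p:
            return False  # residues disagree: the claimed product is wrong
    return True  # every residue agreed: correct with high probability


a, b = 123456789123456789, 987654321987654321
print(check_product(a, b, a * b, seed=1))      # correct product
print(check_product(a, b, a * b + 1, seed=1))  # off-by-one product
```

Each round compares residues only a few bytes long, however long the claimed product is.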

[Figure: assets/spectrum.svg]

The synthetic bench confirms the law, with measured exponents 0.15, 0.51, 0.77, 0.89, 1.00 for L = 1, 2, 4, 8, m — monotonically increasing, as predicted. (These are from a representative run; the simulator is stochastic, so exact values vary slightly between runs — what is robust is the monotone ordering and the L=1 → 0, L=m → 1 endpoints.)
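One common way to read an exponent off (m, q*) measurements is a least-squares line in log-log space. The sketch below is not taken from esperimento/semantic_pcp.py. `fit_exponent` is a hypothetical helper, and the data points are noise-free synthetic values generated from the law itself, so the fit recovers the exponent only as a sanity check.

```python
import math


def fit_exponent(ms, qs):
    """Least-squares slope of log(q) against log(m)."""
    xs = [math.log(m) for m in ms]
    ys = [math.log(q) for q in qs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic points following q* = 3 * m^(1 - 1/L) with L = 2 (assumed, not measured).
ms = [16, 64, 256, 1024]
qs = [3.0 * m ** 0.5 for m in ms]
print(round(fit_exponent(ms, qs), 3))
```

With real simulator output, the fitted slope would be compared against 1 − 1/L for each L.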


Result on real data

We then asked: for natural language (where no formal verifier like Freivalds exists), does the gap G_x still grow with difficulty? We measured L on real datasets.

[Figure: assets/result.svg]

The hard number — HotpotQA, 300 multi-hop questions, L = |supporting_facts|:

| Statistic | Value |
|---|---|
| mean | 2.47 |
| median | 2 |
| range | 2 – 5 |
| distribution | L2: 65% · L3: 24% · L4: 9% · L5: 1% |

Real multi-hop reasoning needs to combine 2–3 facts (rarely up to 5), and this does not grow with context size. L is a small constant.
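The measurement itself is simple to reproduce in outline. The sketch below assumes each record stores supporting_facts as a list of [title, sentence_id] pairs, as in the original HotpotQA JSON release. The layout of hotpot_L.json is not described here, so the code runs on a few made-up toy records instead, and `locality` and `summarize` are hypothetical names.

```python
from collections import Counter
from statistics import mean, median


def locality(record: dict) -> int:
    """L for one question: the number of supporting facts it needs."""
    return len(record["supporting_facts"])


def summarize(values):
    """Mean, median, range, and share of each L value."""
    counts = Counter(values)
    n = len(values)
    return {
        "mean": mean(values),
        "median": median(values),
        "range": (min(values), max(values)),
        "share": {L: counts[L] / n for L in sorted(counts)},
    }


# Toy records for illustration only; these are not HotpotQA items.
toy = [
    {"supporting_facts": [["A", 0], ["B", 1]]},
    {"supporting_facts": [["A", 0], ["B", 1], ["C", 2]]},
    {"supporting_facts": [["A", 0], ["B", 1]]},
    {"supporting_facts": [["A", 0], ["B", 1], ["C", 0], ["D", 3]]},
]
stats = summarize([locality(r) for r in toy])
print(stats)
```

Run over the 300 real items, the same summary would yield the mean, median, range, and distribution reported above.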

The rest of the spectrum (smaller N, in-context judgement, not hard measurement):

| Response type | L | Basis |
|---|---|---|
| Wikipedia biography | ≈ 1 | judgement: each fact independently checkable |
| Multi-hop QA (HotpotQA) | 2.47 | measured from supporting_facts |
| Summary (local claims) | ≈ 1 | judgement |
| Summary (aggregate/global claims) | ≈ m | judgement: refuting them requires reading almost everything |

Honest verdict

Neither "total revolution" nor "just renaming." A real and bounded contribution:

  1. The third (verification) coordinate is real for natural language in the decomposable class — which is most useful factual / multi-hop NL. There L is a small constant ⇒ G_x grows ⇒ Non-Collapse holds even without a formal verifier. That was not obvious.
  2. A genuinely holistic class exists (aggregate / global / coherence claims) where L ≈ m and G_x → 0. It is the exception, but it is a publishable impossibility result in its own right.
  3. So the value is not "we invented cheap verification" (Semantic Entropy etc. already exist) — it is making L = d_min measurable and showing real factual NL sits in the favorable regime.

This does not revolutionize AI. It is a small, falsifiable result pointing in the same direction the field is already moving.


Quickstart

No dependencies — pure stdlib Python 3. Runs immediately:

# 1) Measure the Trihedron (B,D,V) with 4 falsifiable checks (mock backend, no LLM needed)
python3 esperimento/run.py --backend mock --task both --samples 8

# 2) Synthetic Phase 3: confirm the law q* ~ m^(1 - 1/L)
python3 esperimento/semantic_pcp.py

# 3) Phase 3 on real downloaded data (HotpotQA L=2.47, Wikipedia, CNN/DailyMail)
python3 esperimento/phase3_real.py

run.py also supports real backends: --backend ollama or --backend openai (OpenAI-compatible). See esperimento/README.md for the full experimental protocol.


Repository structure

proportional-intelligence/
├── README.md                  ← you are here (EN)
├── PAPER.md                   ← the write-up (EN)
├── LICENSE                    ← MIT
├── CITATION.cff               ← cite this work
├── CONTRIBUTING.md
├── assets/                    ← hero.svg, spectrum.svg, result.svg, method.svg
├── esperimento/               ← code + data (do not rename)
│   ├── run.py                 ← Trihedron (B,D,V), 4 checks, semantic clustering of H_adv
│   ├── backends.py            ← mock / ollama / openai-compat backends (stdlib only)
│   ├── triedro.py             ← Trihedron infrastructure
│   ├── semantic_pcp.py        ← synthetic Phase 3: q* ~ m^(1 - 1/L)
│   ├── phase3_real.py         ← Phase 3 on real data
│   ├── real_data.json         ← downloaded Wikipedia / CNN-DailyMail samples
│   ├── hotpot_L.json          ← measured HotpotQA L values (300 items)
│   └── README.md              ← detailed experimental protocol
└── (Italian deep-dives — see below)

Honest limitations

This section is deliberately prominent — it is the point of the repo.

  • The Kolmogorov-flavored quantities (C^t, Φ, difficulty) are incomputable. We use them as proxies, not as absolute thresholds.
  • No automated judge-model ran in the loop. The environment had no script-accessible LLM, so the L values for biographies and summaries are small-N in-context judgements by the author. Only HotpotQA is hard data (300 items, measured from supporting_facts).
  • L was not estimated by injecting errors and measuring the true detection q* with an automated judge — it was inferred from data structure and reasoning. The next rigorous step is exactly that, on a labeled faithfulness dataset (e.g. FActScore).
  • The prevalence of the holistic tail in real NL is not quantified. Establishing it needs a judge-LLM in the loop over thousands of labeled items.
  • The optimality of the conservation law is conjectural — only the lower bound is argued.
  • Everything rests on the q* ~ m^(1 - 1/L) simulator, which assumes independent errors and well-decomposed claims. Shared latent premises in real NL can correlate claims and raise the effective L.
  • Some 2025–2026 citations still need manual verification.

Roadmap

  • Put a judge-LLM in the loop: inject errors, measure the true detection q* against predicted m^(1 - 1/L).
  • Quantify the prevalence of the holistic (L ≈ m) tail on a large labeled faithfulness dataset (FActScore-style).
  • Formalize the impossibility result for the holistic class.
  • Tighten the conservation law from a lower bound toward an optimality argument.
  • Verify the 2025–2026 references by hand and pin DOIs in CITATION.cff.
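The first roadmap item can be prototyped without an LLM by simulating error injection. The sketch below is not the repository's simulator and rests on a toy assumption: errors occupy disjoint groups of L sub-claims, and a spot-check catches an error only when all L claims of some corrupted group land in the sample. `detection_rate` and its parameters are hypothetical.

```python
import random


def detection_rate(m: int, L: int, q: int, corrupted_groups: int,
                   trials: int = 2000, seed: int = 0) -> float:
    """Monte Carlo estimate of the chance that q random spot-checks catch an error.

    Toy model: claims 0..m-1, with the first `corrupted_groups` blocks of L
    consecutive claims each hiding one error that is visible only when the
    whole block is read together.
    """
    if corrupted_groups * L > m or not 0 <= q <= m:
        raise ValueError("need corrupted_groups * L <= m and 0 <= q <= m")
    rng = random.Random(seed)
    groups = [frozenset(range(g * L, (g + 1) * L)) for g in range(corrupted_groups)]
    hits = 0
    for _ in range(trials):
        sample = set(rng.sample(range(m), q))  # q claims, without replacement
        if any(group <= sample for group in groups):
            hits += 1
    return hits / trials


# Detection rate as the spot-check budget q grows, for m = 64 and L = 2.
for q in (4, 16, 64):
    print(q, detection_rate(64, 2, q, corrupted_groups=4))
```

Replacing the toy sampler with a judge model that reads the q sampled claims would turn this into the measurement the roadmap describes.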

Citation

If you use this work, please cite it via CITATION.cff.

Contributing

Contributions, replications, and especially falsifications are welcome — the whole point is rigor. See CONTRIBUTING.md. If you can break the q* ~ m^(1 - 1/L) prediction on real data, please open an issue.

License

MIT © Wavy (@condorstark)


Deep-dives (in Italian 🇮🇹)

The long-form reasoning chain is written in Italian:

  1. Intelligenza-Proporzionale.md — the concept.
  2. Analisi-Critica-Intelligenza-Proporzionale.md — critique against the literature.
  3. Soluzione-Rivoluzionaria.md — isolating the new piece (the filename is historical; the content is the bounded result described above, not a revolution).
  4. RISULTATO-Fase3.md — the Phase 3 real-data result.
