ML evaluation claims should be locked before the experiment runs, not reported after.
falsify commits a claim — metric, threshold, dataset hash, seed — as a SHA-256 manifest. Run the eval. The hash either matches or it doesn't.
$ falsify lock claim.yaml
locked: sha256:a3f9...c821
$ falsify verdict claim.yaml
PASS accuracy 0.934 >= 0.90 (hash verified)
# tampered:
$ falsify verdict claim.yaml
TAMPERED sha256 mismatch — spec modified after locking (exit 3)

Four reference implementations — Python, JavaScript, Go, Rust — byte-equivalent on 12 conformance vectors. Designed for ML eval rigor. Maps to EU AI Act Article 12 evidence as a side effect.
Pre-registration + CI for AI-agent claims. Lock the claim and threshold with SHA-256 before running the experiment — or the result doesn't count.
Code: MIT. "FALSIFY" name and chevron logo: ™ reserved. See NOTICE · docs/COMMERCIAL.md.
Latest — 2026-05-02 · v0.1.3 released (release notes ·
pip install falsify==0.1.3). PRML v0.1 specification published with four reference implementations (Python · JavaScript · Go · Rust) all reproducing the 12 v0.1 vectors and 6 v0.2 candidate vectors byte-for-byte. 14-page arXiv preprint and v0.2 RFC roadmap (freeze 2026-05-22) open for public review.
Your team claims the model hits 94% accuracy. You ship it. Three weeks later a customer proves the real number is 71%.
The claim was never falsifiable. Nobody wrote down — cryptographically, before the experiment ran — what "94%" meant, which dataset, which metric, which threshold. So when the number changed, nobody could say whether the claim was wrong, the data drifted, or the metric got silently relaxed.
Falsify fixes this with a single idea from science: you must pre-register the claim before you run the experiment. If you change the spec after seeing the data, the hash changes, the audit trail breaks, and CI fails with exit code 3.
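The locking mechanism can be sketched in a few lines of Python. This is an illustration of the idea only — PRML defines its own canonicalization grammar, and the `lock` function and spec fields here are hypothetical:

```python
import hashlib
import json

def lock(spec: dict) -> str:
    # Canonicalize: sorted keys, fixed separators, so the same logical
    # spec always serializes to the same bytes and the same digest.
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return "sha256:" + hashlib.sha256(canonical.encode()).hexdigest()

spec = {"metric": "accuracy", "threshold": 0.90, "seed": 42}
locked = lock(spec)

# Any post-hoc edit -- even relaxing the threshold by 0.01 -- changes the hash,
# which is exactly the mismatch that makes `falsify verdict` exit with code 3.
spec["threshold"] = 0.89
assert lock(spec) != locked
```

The point is that tampering is detected mechanically: no one has to argue about whether the spec changed; the digest either matches or it doesn't.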
$ falsify lock accuracy_claim # SHA-256 the spec
$ falsify run accuracy_claim # reproducible experiment
$ falsify verdict accuracy_claim # exit 0 = PASS, 10 = FAIL, 3 = tampered
Deterministic exit codes are the API. CI gates on them. Humans read the audit trail. The claim either survives contact with the data or it doesn't.
▶ Watch the 90-second demo on YouTube
Lock a claim, run it, watch it PASS. Then tamper with the threshold and watch CI refuse to run. Full storyboard in docs/DEMO_SCRIPT.md.
Every week another paper, blog post, or product launch claims an AI metric that quietly evaporates under scrutiny. It's not usually malice — it's that the claim was never structured to be falsifiable. Falsify is the smallest possible tool that forces that structure.
- ML teams — gate deploys on pre-registered accuracy / NDCG / recall
- DevOps — treat p95 latency claims the same way you treat tests
- LLM pipelines — pin prompt + eval + threshold so "it works" means something
- Research — replicate a paper by running its spec.lock.json
See docs/CASE_STUDIES.md for three concrete adoption stories.
Current version: 0.1.3 — run python3 falsify.py --version.
Working with Claude Code? See CLAUDE.md.
Falsify is the reference implementation of PRML v0.1 — Pre-Registered ML Manifest Specification. The spec, conformance suite, and adjacent documents live under spec/:
- `spec/PRML-v0.1.md` — the spec (RFC-style, CC BY 4.0)
- `spec/test-vectors/v0.1/` — 12 conformance vectors with locked SHA-256 digests
- `spec/analysis/positioning-v0.1.md` — PRML vs in-toto / SLSA / Model Cards / HELM / ClinicalTrials.gov
- `spec/analysis/canonicalization-portability-v0.1.md` — three cross-language findings from the JS second implementation
- `spec/compliance/AI-Act-mapping-v0.1.md` — EU AI Act Article 12/17/18/50/72/73 mapping
- `spec/compliance/landing.md` — compliance-audience landing copy
- `spec/paper/` — 14-page arXiv preprint (LaTeX, CC BY 4.0)
- `spec/v0.2/ROADMAP.md` — v0.2 RFC roadmap (freeze 2026-05-22)
Reference implementations (four languages, all 12 v0.1 + 6 v0.2 candidate vectors pass byte-for-byte):
- Python: `falsify.py` — original reference, uses PyYAML
- Node.js: `impl/js/` — second reference, ~400 LOC, hand-rolled, zero deps
- Go: `impl/go/` — third reference, ~450 LOC, hand-rolled, stdlib only
- Rust: `impl/rust/` — fourth reference, ~600 LOC, hand-rolled, two deps (`serde_json`, `sha2`)
Hosted spec at spec.falsify.dev/v0.1. Public review thread at GitHub Discussion #6. Comments via hello@studio-11.co.
AI agents make empirical claims all day — "accuracy is up", "the new retriever is faster", "this filter catches every edge case". We rarely pin down the threshold, the metric, or the stopping rule before the data arrives.
Without pre-registration, every verdict is post-hoc rationalization: the goalposts move a little, the sample is chosen a little, the winning explanation is kept.
Falsification Engine forces scientific discipline onto that loop. You declare the test, lock the spec with a cryptographic hash, run the experiment, and read the exit code. PASS or FAIL is mechanical, not rhetorical — and CI enforces it on every push.
- A single-file CLI (`falsify`) with 18 subcommands: `init`, `lock`, `run`, `verdict`, `guard`, `list`, `stats`, `diff`, `hook`, `doctor`, `version`, `export`, `verify`, `replay`, `why`, `trend`, `score`, `bench`.
- A `commit-msg` git hook that blocks commits whose messages contradict a locked verdict.
- A GitHub Actions workflow that re-verdicts every push and PR across Python 3.11 and 3.12.
- Five Claude Code skills and two forked-context subagents that draft specs, audit arbitrary text against the verdict log, review PR diffs for honesty violations, and keep the log itself fresh.
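The commit-message guard's core check can be sketched as a toy Python function. This is illustrative only — the real hook's matching is richer (the `claim-audit` skill handles paraphrases), and the function name and rule here are hypothetical:

```python
def guard(commit_msg: str, verdicts: dict) -> bool:
    """Return True if the commit message is allowed. A message that
    asserts a claim passes while its locked verdict is FAIL is blocked
    (the CLI surfaces this as exit code 11)."""
    msg = commit_msg.lower()
    for claim, verdict in verdicts.items():
        # Toy rule: the message names a claim and asserts "pass"
        # while the recorded verdict says FAIL.
        if claim in msg and "pass" in msg and verdict == "FAIL":
            return False
    return True

assert guard("juju passes the calibration gate", {"juju": "FAIL"}) is False
assert guard("fix typo in docs", {"juju": "FAIL"}) is True
```

The design choice is that honesty is enforced at commit time, before a misleading message enters history.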
pip install falsify

That's it. The `falsify` command is on your PATH, the docs site is at https://falsify.dev, and the project page is at https://pypi.org/project/falsify.
Requires Python 3.11+.
git clone https://github.com/studio-11-co/falsify
cd falsify
pip install -e .

The -e editable form is for hacking on falsify itself — your edits to falsify.py take effect immediately without reinstalling.
docker build -t falsify-demo . && docker run --rm -it falsify-demo

Runs the auto-demo in a clean container. See docs/DOCKER.md for interactive and repo-mount modes.
Consume falsify's hooks from your own repo:
repos:
- repo: https://github.com/studio-11-co/falsify
rev: v0.1.3
hooks:
- id: falsify-guard
- id: falsify-doctor

Then `pre-commit install && pre-commit install --hook-type commit-msg`. See docs/PRE_COMMIT.md for the full list of exported hooks and how this repo eats its own dog food.
./demo.sh # auto-narrated: PASS → tamper → FAIL → guard block
# Either form works — `falsify` is the installed entry point,
# `python3 falsify.py` is the uninstalled fallback.
falsify init my_claim
# edit .falsify/my_claim/spec.yaml to fill in the template
falsify lock my_claim
falsify run my_claim
falsify verdict my_claim
falsify hook install # enable the commit-msg guard

Exit code 0 on PASS, 10 on FAIL. Everything else is documented below.
New to pre-registration? Walk through TUTORIAL.md — 15 minutes, zero to first locked claim.
falsify init --template accuracy
falsify lock accuracy
falsify run accuracy
falsify verdict accuracy

Five templates ship with a runnable spec + metric + dataset:

- `accuracy` — classifier holdout accuracy ≥ 0.80
- `latency` — p95 request latency ≤ 200 ms
- `brier` — probabilistic calibration Brier ≤ 0.25
- `llm-judge` — LLM-judge agreement rate ≥ 0.75
- `ab` — A/B test absolute lift ≥ 0.05
Each scaffolds into claims/<name>/ (sources) and mirrors
spec.yaml into .falsify/<name>/ so the CLI runtime works
without further setup. Override the default name with --name
or the directory with --dir.
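The `brier` template's metric is the standard Brier score: mean squared error between predicted probabilities and binary outcomes. A minimal sketch of the computation (illustrative, not the template's bundled code):

```python
def brier(probs, outcomes):
    """Brier score: mean squared error between predicted probabilities
    and 0/1 outcomes. Lower is better; the template's claim is <= 0.25."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

# Four predictions against their realized outcomes.
score = brier([0.9, 0.2, 0.7, 0.1], [1, 0, 1, 0])
assert abs(score - 0.0375) < 1e-9   # well under the 0.25 threshold
```

Because the metric, dataset, and threshold are all locked together, "calibration is fine" becomes a checkable statement rather than a vibe.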
make install # pip install pyyaml
make test # run unittest suite
make smoke # run tests/smoke_test.sh
make demo # JUJU end-to-end (lock → run → verdict)

See Makefile for all targets (make help).
Questions and objections? See docs/FAQ.md — 15 direct answers to "why not just X?" questions.
Feature matrix vs adjacent tools: docs/COMPARISON.md.
falsify why <name> is the human-friendly companion to verdict
— it always exits 0 and tells you exactly what the next honest
move is:
claim: juju
state: STALE
reasoning: the spec has been edited (sha256:1038219d75a8) but no run
exists against this hash. Last run was against sha256:164f619d4860.
locked: yes (sha256:164f619d4860, 2h ago)
last run: 2026-04-22T02:10:17+00:00 (2h ago)
next action: `falsify run <name>` to produce a fresh verdict against
the current spec.
Add --json for a scripted pipeline, --verbose for full hashes
and the last five runs.
falsify trend <name> draws an ASCII sparkline of the metric
across its recorded runs, marks the threshold line, and classifies
the trajectory as improving, degrading, flat, or
mixed.
claim: juju
threshold: 0.25 (direction: below)
runs: 20 shown (of 20)
▁▂▂▃▃▄▄▅▅▆▆▆▇▇████
TT
threshold=0.25 (shown)
first: 0.12 @ ... (PASS)
last: 0.23 @ ... (PASS)
min: 0.09
max: 0.23
mean: 0.17
latest verdict: PASS
trend: degrading
--ascii swaps in _.oO#; --width resizes the sparkline;
--last caps history (default 20, max 200).
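The sparkline-and-trend idea is easy to sketch. This toy version (not falsify's actual renderer) maps each run onto eight block glyphs and calls a lower-is-better metric "degrading" when it drifts upward:

```python
BLOCKS = "▁▂▃▄▅▆▇█"

def sparkline(values):
    """Scale values into the 8 block glyphs, min..max."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0
    return "".join(
        BLOCKS[int((v - lo) / span * (len(BLOCKS) - 1))] for v in values
    )

def trend(values, direction="below"):
    """'below' means lower is better (e.g. Brier): rising values degrade."""
    delta = values[-1] - values[0]
    if abs(delta) < 1e-9:
        return "flat"
    rising = delta > 0
    degrading = (rising and direction == "below") or (
        not rising and direction == "above"
    )
    return "degrading" if degrading else "improving"

runs = [0.12, 0.14, 0.17, 0.20, 0.23]      # Brier creeping toward 0.25
print(sparkline(runs), trend(runs, "below"))
```

This matches the example above: every run still PASSes, but the trajectory is degrading — which is exactly the early warning the command exists to surface.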
falsify bench spawns each subcommand under a fresh temporary
directory and records per-command latency (min / median / p95 /
max / mean / stddev). Useful as a sanity check before a release
or when investigating a suspected startup-time regression.
falsify bench --runs 5 --commands "--help,list,stats,score"
falsify bench --runs 5 --json # machine-readable output

--runs <N> sets the timed-iteration count (default 5, capped at 100); --warmup <N> discards the first N spawns so JIT / import caches stabilize before timing (default 1).
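Spawn-and-time benchmarking of this kind is a few lines of stdlib Python. A sketch in the spirit of `falsify bench` (the `bench` function and its summary shape are illustrative, not the CLI's implementation):

```python
import statistics
import subprocess
import sys
import time

def bench(cmd, runs=5, warmup=1):
    """Spawn `cmd` repeatedly and summarize wall-clock latency:
    min / median / p95 / max / mean / stdev, discarding warmup spawns."""
    samples = []
    for i in range(warmup + runs):
        t0 = time.perf_counter()
        subprocess.run(cmd, capture_output=True)
        dt = time.perf_counter() - t0
        if i >= warmup:              # warmup spawns are not timed
            samples.append(dt)
    samples.sort()
    p95 = samples[min(len(samples) - 1, int(0.95 * len(samples)))]
    return {
        "min": samples[0],
        "median": statistics.median(samples),
        "p95": p95,
        "max": samples[-1],
        "mean": statistics.fmean(samples),
        "stdev": statistics.stdev(samples),
    }

stats = bench([sys.executable, "-c", "pass"], runs=5)
```

Sorting once and indexing for p95 is the usual small-sample shortcut; for the handful of runs a startup check needs, it is accurate enough.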
| Code | Meaning |
|---|---|
| 0 | PASS |
| 10 | FAIL |
| 2 | Bad spec / INCONCLUSIVE |
| 3 | Hash mismatch (spec tampered) |
| 11 | Guard violation (commit blocked) |
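A CI gate over these codes can be very small. A Python sketch (assumes `falsify` is on PATH; the `gate` helper is illustrative, not shipped by the project):

```python
import subprocess
import sys

# The documented deterministic exit codes.
EXIT_MEANING = {
    0: "PASS",
    2: "Bad spec / INCONCLUSIVE",
    3: "Hash mismatch (spec tampered)",
    10: "FAIL",
    11: "Guard violation",
}

def gate(claim: str) -> None:
    """Run `falsify verdict` and stop the pipeline on anything but PASS,
    propagating the deterministic exit code for downstream tooling."""
    code = subprocess.run(["falsify", "verdict", claim]).returncode
    meaning = EXIT_MEANING.get(code, "unknown exit code")
    print(f"{claim}: {meaning} (exit {code})")
    if code != 0:
        sys.exit(code)
```

Keeping TAMPERED (3) distinct from FAIL (10) matters: a failed claim is an honest negative result, while a tampered spec is a broken audit trail, and CI can route them to different owners.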
Skills (.claude/skills/) — in-session helpers that fire on
trigger phrases.
- `hypothesis-author` walks the user through a 5-question dialogue and writes a falsifiable `spec.yaml`.
- `falsify` is the orchestrator: routes any empirical claim to the right place in the init → lock → run → verdict pipeline.
- `claim-audit` runs a fast keyword+regex audit over pasted text and escalates to the `claim-auditor` subagent when paraphrases or 2+ claims show up.
- `claim-review` reads a PR diff and flags unlocked specs, silent threshold edits, and `metric_fn` references to missing modules — runs in PR CI, exits 1 on any CRITICAL finding. See docs/PR_REVIEW.md.
- `falsify-ci-doctor` ingests `make release-check` output and maps each FAIL gate to a likely cause and an exact fix command — one-shot triage when CI is red.
Subagents (.claude/agents/) — forked-context agents invoked
via the Task tool for heavier work.
- `claim-auditor` does the semantic cross-reference that the keyword-pass `claim-audit` skill deliberately skips; used on PR bodies, release notes, and README edits.
- `verdict-refresher` scans `.falsify/*/` for STALE, INCONCLUSIVE, or UNRUN verdicts and re-runs them through the CLI — keeping `guard` decisions trustworthy.
Slash commands (.claude/commands/) — in-IDE shortcuts that
compose the skills and CLI.
- `/new-claim <template> [name]` — guided scaffold → lock → run → verdict for one of the five templates.
- `/audit-claims` — repo-wide semantic audit; merges `list` / `stats` / `score` with findings from the `claim-audit` skill into a single markdown report.
- `/ship-verdict <name>` — four-gate release check (verdict, freshness, replay, audit-chain). Exits non-zero on any gate failure. Does not ship; only verifies.
CI (.github/workflows/falsify.yml) — on every push and PR,
the workflow runs the unittest suite, tests/smoke_test.sh, the
JUJU end-to-end (lock → run → verdict), a guard self-check,
and a skill-lint pass over every SKILL.md and agent file.
- Walk through the pipeline in 5 runnable steps: DEMO.md.
- Second-by-second shooting script for the 3-minute video: docs/DEMO_SHOT_LIST.md.
- Four more claim types (accuracy regression, latency gate, prediction calibration, LLM agreement, AB test): docs/EXAMPLES.md.
Expose the verdict store to Claude Desktop / Claude Code via
Model Context Protocol with four read-only tools (list_verdicts,
get_verdict, get_stats, check_claim) and three resource URIs.
pip install -e '.[mcp]'
python -m mcp_server # speaks MCP over stdio

Then merge the snippet in
mcp_server/claude_desktop_config.example.json
into your Claude Desktop config, pointing cwd at your local
clone. Every Claude session in your org can now query live
verdicts — no more "I think the latency claim still passes";
Claude just asks the MCP server. Falsify itself runs without the
SDK; if mcp isn't installed, python -m mcp_server exits 2 with
a clear install hint. Full surface in
mcp_server/README.md.
Deploy the two subagents (verdict-refresher, claim-auditor)
to Anthropic Console for scheduled and on-demand execution.
See docs/MANAGED_AGENTS.md for the
setup recipe and manifests under
managed_agents/.
cp hooks/commit-msg .git/hooks/commit-msg
chmod +x .git/hooks/commit-msg

Or, as a symlink so hook updates propagate automatically:

ln -sf "$(pwd)/hooks/commit-msg" .git/hooks/commit-msg

- `falsify.py` — single-file Python CLI, stdlib + pyyaml only.
- `impl/js/falsify.js` — Node.js second reference implementation (12/12 vectors).
- `spec/PRML-v0.1.md` + `spec/test-vectors/v0.1/` — spec + conformance suite.
- `spec/analysis/` — positioning + canonicalization portability findings.
- `spec/compliance/` — EU AI Act mapping + compliance landing copy.
- `spec/paper/` — 14-page arXiv preprint (LaTeX).
- `spec/v0.2/ROADMAP.md` — v0.2 RFC roadmap.
- `hypothesis.schema.yaml` — spec schema (claim, falsification, experiment, environment, artifacts).
- `examples/hello_claim/` — tiny smoke-test fixture.
- `examples/juju_sample/` — anonymized 20-row prediction ledger for the Brier score demo.
- `hooks/commit-msg` — the guard hook.
- `tests/` — `unittest` suite plus `smoke_test.sh` end-to-end driver.
- `.claude/skills/` — the five in-session skills.
- `.claude/agents/` — the two forked-context subagents.
- `.claude/commands/` — the three slash commands.
- `.github/workflows/` — CI + PRML manifest verification.
Falsify uses itself. Three real claims about this codebase live
under claims/self/:
- `cli_startup` — CLI startup stays under 500 ms median
- `test_coverage_count` — test suite has more than 400 test methods
- `claude_surface` — Claude integration ships more than 8 artifacts
Run make dogfood to re-verify. CI runs these on every PR.
See CHANGELOG.md for release history.
Two roadmaps run alongside each other:
- CLI tool roadmap: ROADMAP.md — `falsify` features, integrations, dependencies. CLI v0.2 targeted 2026-06-15.
- Specification roadmap: spec/v0.2/ROADMAP.md — PRML format evolution, canonicalization grammar, conformance. Spec v0.2 freeze 2026-05-22.
The CLI is downstream of the spec: when spec v0.2 freezes, CLI v0.2 follows about three weeks later. CLI v0.3 is loosely scoped for Q4 2026.
Falsify is a discipline tool, not a zero-trust system. For a full enumeration of attacks defended and NOT defended, with the exact exit code or command that catches each, see docs/ADVERSARIAL.md. For private disclosure of invariant breaks, see .github/SECURITY.md.
MIT. See LICENSE.
See CODE_OF_CONDUCT.md for community standards. See .github/CODEOWNERS for module-level reviewers and .github/dependabot.yml for automated dependency updates. See docs/GLOSSARY.md for definitions of every term used across the docs. See docs/CASE_STUDIES.md for three concrete adoption scenarios: ML team, DevOps team, research group.
Claude Opus 4.7 (1M context), in three days, for the Anthropic Built with Opus 4.7 hackathon.