Agent performance benchmarking and regression tracking — maintain quality standards and detect regressions before production.
LEADERBOARD.md is a plain-text Markdown file you place in the root of any AI agent project. It defines performance metrics (task completion rate, accuracy, cost efficiency, latency, safety compliance), per-agent benchmarking rules, tier classifications (gold/silver/bronze), regression alerts, and audit reporting, so that you can measure quality, track trends, and catch degradation early.
- Full specification: leaderboard.md
- AI-readable: llms.txt
- License: MIT
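As a rough illustration of the tier and regression concepts described above, the following Python sketch maps a task completion rate to a tier and flags a drop against a baseline. The tier floors (0.95, 0.85, 0.70), the 5% relative tolerance, and the names `classify_tier` and `is_regression` are assumptions made for this example; they are not taken from the specification.

```python
from typing import Optional

# Hypothetical tier floors, highest first. These numbers are illustrative
# assumptions, not values defined by LEADERBOARD.md.
TIER_FLOORS = [("gold", 0.95), ("silver", 0.85), ("bronze", 0.70)]


def classify_tier(task_completion_rate: float) -> Optional[str]:
    """Return the first tier whose floor the rate meets, or None if below all."""
    for tier, floor in TIER_FLOORS:
        if task_completion_rate >= floor:
            return tier
    return None


def is_regression(baseline: float, current: float, tolerance: float = 0.05) -> bool:
    """Flag a regression when the metric falls more than `tolerance`
    (as a fraction of the baseline) below the baseline value."""
    if baseline <= 0:
        # No meaningful baseline to compare against.
        return False
    return (baseline - current) / baseline > tolerance
```

With these assumed thresholds, an agent that completes 91% of tasks would land in the silver tier, and a fall from 0.92 to 0.85 (a relative drop of about 7.6%) would be flagged as a regression.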
Copy LEADERBOARD.md into your project root:
your-project/
├── AGENTS.md
├── CLAUDE.md
├── LEADERBOARD.md ← add this
├── README.md
└── src/
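To confirm the file is where the layout above expects it, a project could run a small check in CI. The sketch below is one possible approach; the helper name `has_leaderboard` is hypothetical and not part of the specification.

```python
from pathlib import Path


def has_leaderboard(project_root: str = ".") -> bool:
    """Return True if LEADERBOARD.md is a regular file directly under project_root."""
    return (Path(project_root) / "LEADERBOARD.md").is_file()
```

A CI step could call `has_leaderboard(".")` from the repository root and fail the build when it returns `False`.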
LEADERBOARD.md is part of a twelve-file open standard for AI agent safety, quality, and accountability:
| Spec | Purpose | Repo | Site |
|---|---|---|---|
| THROTTLE.md | Rate and cost control — slow down before hitting limits | throttle-md/spec | throttle.md |
| ESCALATE.md | Human notification and approval protocols | escalate-md/spec | escalate.md |
| FAILSAFE.md | Safe fallback to last known good state | failsafe-md/spec | failsafe.md |
| KILLSWITCH.md | Emergency stop — halt all agent activity | killswitch-md/spec | killswitch.md |
| TERMINATE.md | Permanent shutdown — no restart without human intervention | terminate-md/spec | terminate.md |
| ENCRYPT.md | Data classification and protection requirements | encrypt-md/spec | encrypt.md |
| ENCRYPTION.md | Technical encryption standards and key rotation | encryption-md/spec | encryption.md |
| SYCOPHANCY.md | Anti-sycophancy — require citations, enforce honest disagreement | sycophancy-md/spec | sycophancy.md |
| COMPRESSION.md | Context compression — summarise safely, verify coherence | compression-md/spec | compression.md |
| COLLAPSE.md | Drift prevention — detect collapse, enforce recovery | collapse-md/spec | collapse.md |
| FAILURE.md | Failure mode mapping — every error state and response | failure-md/spec | failure.md |
| LEADERBOARD.md | Agent benchmarking — track quality, detect regression | leaderboard-md/spec | leaderboard.md |
AI agents spend money, send messages, modify files, and call external APIs — often autonomously. Regulations are catching up:
- EU AI Act (August 2026) — mandates human oversight and shutdown capabilities
- Colorado AI Act (June 2026) — requires impact assessments and transparency
- US state laws — California, Texas, Illinois and others have active AI governance requirements
These specifications give you a standardised, auditable record of your agent's safety boundaries.
PRs welcome for additional detection patterns, language-specific parsers, and integration guides.
MIT — see LICENSE for details.
This specification is provided "as-is" without warranty of any kind. It does not constitute legal, regulatory, or compliance advice in any jurisdiction. Use does not guarantee compliance with any applicable law, regulation, or standard — including the EU AI Act (2024/1689), Colorado AI Act (SB 24-205), or any other legislation. Organisations should consult qualified professionals to determine their regulatory obligations. The authors accept no liability for any loss or consequence arising from use of this specification.