Deterministic scoring code for the Brain-Wrought personal-brain benchmark.
All scoring math. No LLM-in-the-loop except where unavoidable (judge panel for Axis C). Everything else is deterministic Python with seeded randomness.
- `retrieval/` — P@k, Recall@k, MRR, nDCG@k, personalization weighting, temporal qrel evaluation, abstention scoring
- `ingestion/` — entity recall, backlink F1, citation accuracy, schema completeness, setup friction
- `assistant/` — judge panel orchestration via LiteLLM (Sonnet 4.6 + Opus 4.7 + GPT-5.4), bootstrap confidence intervals
- `fixtures/` — seeded fixture generation, randomization
- `leaderboard/` — composite score aggregation
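The repo's own implementations are the source of truth; as a rough illustration, the simpler retrieval metrics reduce to plain set arithmetic over a ranked list. A minimal sketch (function bodies here are assumptions for illustration, not the repo's code):

```python
from typing import Sequence, Set


def precision_at_k(relevant: Set[str], retrieved: Sequence[str], k: int) -> float:
    """Fully deterministic — bit-identical output for the same input.

    Fraction of the top-k slots occupied by relevant documents.
    """
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


def mrr(relevant: Set[str], retrieved: Sequence[str]) -> float:
    """Fully deterministic — reciprocal rank of the first relevant hit."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0
```

Note that dividing by `k` (rather than by the number of retrieved items) penalizes under-retrieval, which is the conventional P@k definition.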
- No CLI (that's `brain-wrought-harness`)
- No markdown skills or docs (those are `brain-wrought-skills`)
- No qrels, gold graphs, or actual judge rubrics (sealed private repo, fetched at eval time via CI)
Every function is classified in its docstring as one of:
- Fully deterministic — bit-identical output for the same input (IEEE 754 caveats)
- Seeded-stochastic — identical output given the same seed
- Bounded-stochastic — reruns fall within declared confidence interval
CI enforces these claims.
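The seeded-stochastic class is the interesting one in practice: the bootstrap confidence intervals for Axis C must reproduce exactly under the same seed while still being honest resamples. A minimal sketch of that pattern, assuming a hypothetical `bootstrap_ci` helper (the name and signature are illustrative, not the repo's API):

```python
import random
import statistics
from typing import Sequence


def bootstrap_ci(
    scores: Sequence[float], seed: int, n_resamples: int = 1000, alpha: float = 0.05
) -> tuple[float, float]:
    """Seeded-stochastic — identical output given the same seed."""
    rng = random.Random(seed)  # local RNG; global random state is banned
    means = sorted(
        statistics.fmean(rng.choices(scores, k=len(scores)))
        for _ in range(n_resamples)
    )
    # Percentile bounds of the resampled means.
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

Because the RNG is constructed from the seed inside the function, two calls with identical arguments are bit-identical, which is what a CI reproducibility check can assert.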
- Python 3.12.3 (pinned)
- Pydantic v2 for all data contracts
- pytest + pytest-randomly + hypothesis for tests
- mypy strict
- ruff format + lint (line length 100)
- 100% coverage on scoring modules
- No use of global random state (`random.random()` is banned; use `random.Random(seed)`)
- No direct LLM SDK calls — always via LiteLLM
- Every function crossing a module boundary has a Pydantic contract on I/O
- No side effects in scoring functions (pure functions; any logging happens in the caller)
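Taken together, the last two rules mean a boundary-crossing scoring function looks roughly like the following sketch. The model and function names here are hypothetical, not the repo's actual contracts:

```python
from pydantic import BaseModel, Field


class RetrievalScoreRequest(BaseModel):
    """Hypothetical input contract for a boundary-crossing scorer."""

    relevant: set[str]
    retrieved: list[str]
    k: int = Field(gt=0)


class RetrievalScoreResult(BaseModel):
    """Hypothetical output contract; bounds are validated by Pydantic."""

    precision: float = Field(ge=0.0, le=1.0)


def score_retrieval(req: RetrievalScoreRequest) -> RetrievalScoreResult:
    """Fully deterministic — pure function: no logging, no I/O, no mutation."""
    hits = sum(1 for doc_id in req.retrieved[: req.k] if doc_id in req.relevant)
    return RetrievalScoreResult(precision=hits / req.k)
```

Keeping the function pure makes the determinism classes checkable: CI can rerun it on fixed fixtures and compare outputs byte-for-byte, while any logging stays in the caller.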
See CLAUDE.md for the full coding standard.
Install with `pip install brain-wrought-engine`:

```python
from brain_wrought_engine.retrieval import precision_at_k, ndcg_at_k

p10 = precision_at_k(relevant={"a", "b", "c"}, retrieved=["a", "x", "b", "y", "c"], k=10)
```

MIT.