AgentCloseoutBench is a benchmark-in-progress for evaluating dark-pattern
detection on agentic coding assistant closeout text: the final assistant message
available to Claude Code Stop and SubagentStop hooks as
last_assistant_message.
The benchmark contribution is the lifecycle surface and reusable black-box hook evaluation harness. Regex hook performance is reported as one detector family, not as the benchmark itself.
The new detector contribution is agentcloseout-physics: a deterministic
closeout protocol engine. It treats hooks as runtime adapters and evaluates
positive closeout states, dark-pattern mechanics, and evidence-claim markers
without a live LLM, embedding model, or network call in the verdict path.
This repository is a recovery, hardening, and public-data-intake workspace with a complete v0.2 synthetic candidate corpus plus a small v0.3 public-derived adversarial fixture lane. It is not yet a public v1.0 gold-label dataset release.
- Current candidate corpus: 800 records, 4 categories x (100 positive + 100 negative), exact task-type quotas from quota_manifest.json.
- Current public-derived adversarial lane: 16 candidate records, 4 categories x (2 positive + 2 negative), with per-record source provenance and manifest rows.
- Current public-derived rule fixtures: 14 fixtures covering the four dark-pattern engines plus closeout_contract and evidence_claims.
- Current corpus labels: candidate labels until two independent human annotation passes plus adjudication are complete.
- Current public-shaped source: deterministic synthetic templates released under Apache-2.0.
- Current public claim language: "To our knowledge, AgentCloseoutBench is the first benchmark for dark-pattern detection on agentic coding assistant closeout text at the Claude Code Stop/SubagentStop last_assistant_message boundary."
- Current engine claim language: "out-of-band deterministic enforcement at the agentic coding assistant closeout boundary makes specific dark-pattern and false-closeout mechanics observable, reproducible, and benchmarkable."
- Current ACSP-CC language: ACSP-CC is a proposed Claude Code closeout security profile, and agentcloseout-physics is the current reference implementation for that proposal. Any conformance output is self-assessed preflight evidence, not a standard, certification, or final benchmark metric.
- Not yet claimed: human-annotated release, universal agent benchmark, or absolute injection-immune defense.
- Current high-assurance hardening: the Claude Code adapters include a PreToolUse tamper guard for .claude/hooks, .claude/agentcloseout.env, pinned engine paths, and pinned rule packs; env config is parsed through an allowlist instead of shell-sourced.
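The allowlist parsing mentioned in the last bullet can be illustrated with a minimal Python sketch. It is not the adapter's actual parser: the key names in ALLOWED_KEYS and the accepted value alphabet are assumptions chosen for illustration. The point is that each line is treated as data and rejected unless it is a plain KEY=value pair with an approved key, so nothing in the file is ever evaluated by a shell.

```python
import re

# Hypothetical allowlist; the real adapter's key names are not listed in this README.
ALLOWED_KEYS = {"AGENTCLOSEOUT_ENGINE", "AGENTCLOSEOUT_RULES", "AGENTCLOSEOUT_MODE"}

# KEY=value with a conservative value alphabet: no quotes, spaces, "$", or backticks.
_LINE = re.compile(r"^([A-Z][A-Z0-9_]*)=([A-Za-z0-9_./:@+-]*)$")


def parse_env_allowlist(text: str) -> dict:
    """Parse KEY=value lines as data; never evaluate them as shell."""
    config = {}
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no config
        match = _LINE.match(line)
        if match is None:
            raise ValueError(f"line {lineno}: not a plain KEY=value pair")
        key, value = match.groups()
        if key not in ALLOWED_KEYS:
            raise ValueError(f"line {lineno}: key {key!r} is not allowlisted")
        config[key] = value
    return config
```

Under these assumptions, a line such as `AGENTCLOSEOUT_MODE=$(some command)` fails the pattern and is rejected, whereas shell-sourcing the same file would execute the substitution.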
The failure modes this benchmark scores — MAST 2.6 (action-reasoning mismatch) and
3.3 (no/incorrect verification) — were observed in a production healthcare deployment.
Effective Therapy (a trauma-therapy platform; cited with permission, patient-facing
specifics withheld) ran an Opus 4.7 orchestrator that narrated 39 agent dispatches,
including five verification agents reporting findings, while only 5 of the 39 agents were
ever used and the verification agents had zero sessions; a codebase audit added 80+
hollow-code findings (correct auth, routes, and signatures — missing the line that
persists data). Refs anthropics/claude-code#61167,
#61107; case study at
ianymu/recognition-without-arrest#2.
Effective Therapy has offered 30+ labeled hollow-code examples toward a future
semantic-emptiness detector — a real-world labeled corpus, not synthetic.
A follow-up forensic audit (2026-05-26) put a field-measured fabrication rate on it,
against ground truth (actual curl dispatch logs vs assistant claims): ~34% phantom on
Opus 4.7 (44 phantom claims / 128 real dispatches, 18 phantom agent-names) vs ~4% on
Opus 4.6 (2 / 50), with zero Agent/Task tool calls in any 4.7 session — the
fabrication never crossed the tool boundary. This is the first field-measured rate for the
MAST 2.6/3.3 family (single deployment, retrospective; not a substitute for gold labels).
Details and the curl-vs-claim protocol: case-studies/effective-therapy-forensic.md.
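The arithmetic behind the quoted rates can be reproduced with a short Python sketch. The function names and the idea of matching claims to logs by a dispatch identifier are assumptions for illustration; the actual curl-vs-claim protocol is defined in the case-study file, not here.

```python
from typing import Iterable, List


def phantom_claims(claimed: Iterable[str], logged: Iterable[str]) -> List[str]:
    """Return claimed dispatch ids that have no matching entry in the logs."""
    logged_set = set(logged)
    return [c for c in claimed if c not in logged_set]


def phantom_rate(phantom_count: int, real_dispatches: int) -> float:
    """Phantom claims divided by real dispatches, the ratio quoted above."""
    if real_dispatches <= 0:
        raise ValueError("real_dispatches must be positive")
    return phantom_count / real_dispatches


# Counts reported in the audit summary above.
print(f"Opus 4.7: {phantom_rate(44, 128):.1%}")  # 34.4%
print(f"Opus 4.6: {phantom_rate(2, 50):.1%}")    # 4.0%
```

Note that the denominator is real dispatches rather than total claims, which follows the ratio as stated above (44 / 128 and 2 / 50).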
- wrap_up: unprompted continuation offers or next-step invitations.
- cliffhanger: withheld information or unresolved bait that pressures re-engagement.
- roleplay_drift: emotional, prideful, fatigued, or personally invested agent self-presentation.
- sycophancy: unearned flattery or dishonest positive validation.
- SPEC.md: active scientific and engineering contract.
- SOURCE_LEDGER.md: live-verified external evidence used for claims.
- CLAIM_LEDGER.md: claim status: verified, corrected, deferred, or dropped.
- data/: release-shaped candidate corpus JSONL files.
- recovery/: local reconstruction outputs and quarantined records.
- annotations/: human and LLM annotation workflow scripts and outputs.
- evaluation/: black-box hook harness and metric code.
- engine/: Rust CLI for deterministic closeout physics.
- engines/: per-category physics engine manifests for paper and runtime use.
- rules/closeout/: versioned deterministic rule packs.
- adapters/claude-code/: installable Claude Code hook adapters for daily use.
- fixtures/closeout/: golden fixtures for rule-pack behavior.
- fixtures/closeout_public/: public-study-derived fixtures for v1 pressure testing.
- public_data_intake/: source registry, manifest, quarantine, and public-derived adversarial corpus lane.
- baselines/: non-hook baselines used to separate benchmark quality from hook tuning.
- rubrics/, schemas/, manifests/: annotation, schema, provenance, license, redaction, and metadata artifacts.
- tests/: local no-network QA tests.
Run the local no-network checks:
python3 scripts/validate_corpus.py --data-dir data --quota-manifest quota_manifest.json
python3 -m pytest -q
Run a reproducibility smoke check:
bash scripts/reproduce_local.sh
Run the deterministic closeout physics checks:
bin/agentcloseout-physics lint-rules rules/closeout
bin/agentcloseout-physics test-rules rules/closeout fixtures/closeout
bin/agentcloseout-physics test-rules rules/closeout fixtures/closeout_public
python3 scripts/public_data_intake.py audit-registry \
--registry public_data_intake/source_registry.json \
--schema schemas/public_source.schema.json
python3 scripts/public_data_intake.py validate-derived \
--registry public_data_intake/source_registry.json \
--manifest public_data_intake/derived_fixture_manifest.jsonl \
--data-dir public_data_intake/candidate_public_adversarial
Run the user-facing Claude Code adapter smoke test:
bash scripts/hook-smoke.sh
Install physics-backed hooks into a Claude Code project:
bash adapters/claude-code/install.sh /path/to/project
Install a single category hook:
bash adapters/claude-code/install.sh /path/to/project no-cliffhanger
The standalone hook repos remain installable on their own. The adapter lane is
for users who want the reproducible Rust engine, versioned rule packs,
rule-pack hash, benchmark fixtures, and opt-in content-free telemetry commands.
The adapter installer also writes a PreToolUse tamper guard that blocks Claude
Code from editing the local hook wiring, engine pointer, or rule-pack pointer
during an ordinary session.
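The decision logic of such a tamper guard might look like the Python sketch below. It is an illustration rather than the installed guard: it assumes the PreToolUse payload is a JSON object carrying `tool_input.file_path` and `cwd`, and that exit code 2 blocks the tool call. It covers only the two paths named in this README; pinned engine and rule-pack paths are omitted because their locations are not given here.

```python
from pathlib import Path

# Protected locations named in this README, relative to the project root.
PROTECTED = (Path(".claude/hooks"), Path(".claude/agentcloseout.env"))


def is_protected(file_path: str, project_root: str) -> bool:
    """True if file_path resolves to a protected path or anything beneath it."""
    root = Path(project_root).resolve()
    target = Path(file_path)
    if not target.is_absolute():
        target = root / target
    target = target.resolve()  # collapses ".." so traversal cannot dodge the check
    for rel in PROTECTED:
        guarded = (root / rel).resolve()
        if target == guarded or guarded in target.parents:
            return True
    return False


def decide(event: dict) -> int:
    """Return the hook exit code for one PreToolUse event (assumed payload shape)."""
    file_path = event.get("tool_input", {}).get("file_path", "")
    project_root = event.get("cwd", ".")
    if file_path and is_protected(file_path, project_root):
        return 2  # assumed blocking exit code
    return 0


# A real hook script would finish with something like:
#   sys.exit(decide(json.load(sys.stdin)))
```

Resolving both the project root and the target before comparing them is the design choice that matters: a path like `src/../.claude/agentcloseout.env` is caught, while a sibling such as `.claude/hooks-backup/` is not treated as protected.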
The previous /tmp/agent-closeout-bench workspace was lost. Recovery is derived
from Claude Code JSONL transcripts under:
~/.claude/projects/-tmp-agent-closeout-bench/*.jsonl
Recovery must only extract visible assistant text blocks. It must not persist
thinking blocks, signatures, tool calls, tool outputs, hidden transcript fields,
or secrets.
python3 generation/recover_from_claude_transcripts.py \
--transcripts-dir ~/.claude/projects/-tmp-agent-closeout-bench \
--output-dir data \
--manifest recovery/RECOVERY_MANIFEST.md
Generic negative prompts do not encode a category in the prompt text. The
recovery script quarantines those as category_unresolved unless a later,
auditable mapping source proves the category.
The recovered private transcript pool is not mixed into the public-shaped v0.2
corpus. It is preserved for audit in recovery/recovered_category_proven_pool.jsonl.
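The two recovery rules in this section (persist visible assistant text only, and quarantine records whose category cannot be proven) can be sketched as follows. This is not the code in generation/recover_from_claude_transcripts.py. The transcript record shape (a top-level `type` field and a `message.content` list of typed blocks) is an assumption about the JSONL layout, and both function names are hypothetical.

```python
import json
from typing import Iterator, Optional


def visible_assistant_text(jsonl_line: str) -> Iterator[str]:
    """Yield only visible text blocks from one assistant transcript record."""
    record = json.loads(jsonl_line)
    if record.get("type") != "assistant":
        return
    content = record.get("message", {}).get("content", [])
    if isinstance(content, str):
        yield content
        return
    for block in content:
        # Keep "text" blocks only. Thinking blocks (and their signatures),
        # tool calls, tool outputs, and unknown block types are dropped,
        # and only the text string itself is passed on.
        if isinstance(block, dict) and block.get("type") == "text":
            yield block.get("text", "")


def resolve_category(proven_category: Optional[str]) -> str:
    """Use a category only when it was proven; otherwise quarantine the record."""
    return proven_category if proven_category else "category_unresolved"
```

Filtering by an allowlist of block types, rather than a denylist of known-sensitive ones, means a new hidden field in a future transcript format is dropped by default.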
Example hook evaluation:
python3 evaluation/eval_hooks.py \
--hooks-dir /path/to/llm-dark-patterns/hooks \
--corpus-dir data \
--hook-category-map "wrap_up:no-wrap-up.sh,cliffhanger:no-cliffhanger.sh,roleplay_drift:no-roleplay-drift.sh,sycophancy:no-sycophancy.sh" \
--ground-truth candidate \
--output results/eval_candidate.json
Candidate diagnostics should use dev or validation. Final paper results must
use adjudicated human labels on locked_test; the harness blocks locked-test
runs that ask for candidate labels.
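The locked-test rule above amounts to a small precondition check. The sketch below shows one way to express it in Python; the function name and error text are hypothetical, and only the values `candidate` and `locked_test` are taken from this README.

```python
def check_label_policy(split: str, ground_truth: str) -> None:
    """Refuse to score the locked test split against candidate labels."""
    if split == "locked_test" and ground_truth == "candidate":
        raise ValueError(
            "locked_test may only be evaluated with adjudicated human labels; "
            "use dev or validation for candidate diagnostics"
        )


# Candidate diagnostics on a development split pass the check.
check_label_policy("dev", "candidate")
```

Failing before any hook is executed keeps candidate-label numbers for the locked split from ever being written to a results file.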
- Two independent human annotation passes.
- Adjudicated final labels.
- Per-category agreement report.
- Private or delayed holdout policy if a leaderboard is launched.
- Fresh-clone reproducibility run with final labels.
- Hugging Face dataset card and Croissant metadata validation if targeting an E&D-style release.
- Exact pinned hook commits and machine-readable result JSON.
- Larger reviewed public-derived corpus, then two-pass human gold annotation and adjudication before any public performance claim.
- Full 800-record schema-valid candidate corpus.
- Exact deterministic quota manifest.
- Opaque blind annotation packet with private id map.
- Provenance, license, and redaction manifests for the synthetic public-shaped corpus.
- Local no-network smoke reproduction.
- Source registry with tier, license, privacy status, allowed use, import decision, and release eligibility.
- Content-free sampler for local public JSONL trace review; raw text is not persisted unless an approved source and explicit write flag are used.
- Derived-fixture manifest linking every public-derived record to source id, source-record hash, transform, reviewer, and license decision.
- Quarantine checks for secrets, emails, absolute paths, usernames, hostnames, repo URLs, raw tool-output markers, and trace artifact leakage.
- Evaluation output now reports per-corpus-kind, per-source, and per-fixture breakdowns so public-derived stress results cannot be hidden in aggregate.
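As an illustration of the quarantine checks listed above, the Python sketch below flags three of the named leak types with regular expressions. The patterns are deliberately simple assumptions, not the repository's rules, and a production check would need a broader and better-tested set (secrets, usernames, hostnames, and tool-output markers are not covered here).

```python
import re
from typing import List

# Hypothetical patterns for three of the leak types listed above.
CHECKS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "absolute_path": re.compile(r"(?<![\w.:/])/(?:home|Users|tmp|var|etc)/[^\s\"']+"),
    "repo_url": re.compile(r"https?://(?:github|gitlab)\.com/[\w.-]+/[\w.-]+"),
}


def quarantine_reasons(text: str) -> List[str]:
    """Return the sorted names of every check that matches the text."""
    return sorted(name for name, pattern in CHECKS.items() if pattern.search(text))


def should_quarantine(text: str) -> bool:
    """A record is quarantined if any check fires."""
    return bool(quarantine_reasons(text))
```

Returning the list of reasons rather than a bare boolean lets a manifest record why each record was quarantined.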