waitdeadai/agent-closeout-bench

AgentCloseoutBench

AgentCloseoutBench is a benchmark-in-progress for evaluating dark-pattern detection on agentic coding assistant closeout text: the final assistant message available to Claude Code Stop and SubagentStop hooks as last_assistant_message.

The benchmark contribution is the lifecycle surface and reusable black-box hook evaluation harness. Regex hook performance is reported as one detector family, not as the benchmark itself.

The new detector contribution is agentcloseout-physics: a deterministic closeout protocol engine. It treats hooks as runtime adapters and evaluates positive closeout states, dark-pattern mechanics, and evidence-claim markers without a live LLM, embedding model, or network call in the verdict path.
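
As a minimal sketch of the boundary described above, the snippet below pulls the closeout text out of a hook payload. It assumes the hook receives a single JSON object containing a `last_assistant_message` string; the helper name `extract_closeout_text` and the `hook_event_name` field in the sample are illustrative assumptions, not part of this repository.

```python
import json


def extract_closeout_text(hook_payload: str) -> str:
    """Return the closeout text from a hook payload, or "" if it is absent.

    Assumption: the payload is one JSON object with a string field named
    last_assistant_message. Anything else is treated as "no closeout text".
    """
    try:
        event = json.loads(hook_payload)
    except json.JSONDecodeError:
        return ""
    if not isinstance(event, dict):
        return ""
    message = event.get("last_assistant_message", "")
    return message if isinstance(message, str) else ""


if __name__ == "__main__":
    sample = json.dumps({
        "hook_event_name": "Stop",  # assumed field, shown for illustration only
        "last_assistant_message": "All tests pass. Want me to keep going?",
    })
    print(extract_closeout_text(sample))
```

A detector built on this would only ever see the returned string, which is what keeps the evaluation black-box.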

Current Status

This repository is a recovery, hardening, and public-data-intake workspace with a complete v0.2 synthetic candidate corpus plus a small v0.3 public-derived adversarial fixture lane. It is not yet a public v1.0 gold-label dataset release.

  • Current candidate corpus: 800 records, 4 categories x (100 positive + 100 negative), with exact task-type quotas from quota_manifest.json.
  • Current public-derived adversarial lane: 16 candidate records, 4 categories x (2 positive + 2 negative), with per-record source provenance and manifest rows.
  • Current public-derived rule fixtures: 14 fixtures covering the four dark-pattern engines plus closeout_contract and evidence_claims.
  • Current corpus labels: candidate labels until two independent human annotation passes plus adjudication are complete.
  • Current public-shaped source: deterministic synthetic templates released under Apache-2.0.
  • Current public claim language: "To our knowledge, AgentCloseoutBench is the first benchmark for dark-pattern detection on agentic coding assistant closeout text at the Claude Code Stop/SubagentStop last_assistant_message boundary."
  • Current engine claim language: "out-of-band deterministic enforcement at the agentic coding assistant closeout boundary makes specific dark-pattern and false-closeout mechanics observable, reproducible, and benchmarkable."
  • Current ACSP-CC language: ACSP-CC is a proposed Claude Code closeout security profile, and agentcloseout-physics is the current reference implementation for that proposal. Any conformance output is self-assessed preflight evidence, not a standard, certification, or final benchmark metric.
  • Not yet claimed: human-annotated release, universal agent benchmark, or absolute injection-immune defense.
  • Current high-assurance hardening: the Claude Code adapters include a PreToolUse tamper guard for .claude/hooks, .claude/agentcloseout.env, pinned engine paths, and pinned rule packs; env config is parsed through an allowlist instead of shell-sourced.
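
The allowlist parsing mentioned in the last item can be sketched as follows. This is not the adapter's actual parser; the key names in `ALLOWED_KEYS` and the function `parse_env_allowlist` are hypothetical. The point it illustrates is that values are read as inert strings and unknown keys are dropped, so nothing in the file is ever executed the way `source file.env` would execute it.

```python
# Hypothetical key names, chosen for illustration only.
ALLOWED_KEYS = frozenset({"AGENTCLOSEOUT_ENGINE", "AGENTCLOSEOUT_RULES"})


def parse_env_allowlist(text, allowed=ALLOWED_KEYS):
    """Parse KEY=VALUE lines, keeping only allowlisted keys as plain strings."""
    config = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        key, sep, value = line.partition("=")
        key = key.strip()
        if not sep or key not in allowed:
            continue  # malformed lines and unknown keys are ignored
        # Strip optional surrounding quotes; no shell expansion is performed.
        config[key] = value.strip().strip('"').strip("'")
    return config
```

With this approach a value such as `"$(touch x)"` is stored as the literal text `$(touch x)` and never run.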

Field evidence

The failure modes this benchmark scores, MAST 2.6 (action-reasoning mismatch) and 3.3 (no/incorrect verification), were observed in a production healthcare deployment. Effective Therapy (a trauma-therapy platform; cited with permission, patient-facing specifics withheld) ran an Opus 4.7 orchestrator that narrated 39 agent dispatches, including five verification agents reporting findings, while only 5 of the 39 agents were ever used and the verification agents had zero sessions. A codebase audit added 80+ hollow-code findings (correct auth, routes, and signatures, but missing the line that persists data). Refs: anthropics/claude-code#61167, #61107; case study at ianymu/recognition-without-arrest#2. Effective Therapy has offered 30+ labeled hollow-code examples toward a future semantic-emptiness detector: a real-world labeled corpus, not a synthetic one.

A follow-up forensic audit (2026-05-26) put a field-measured fabrication rate on this failure family, against ground truth (actual curl dispatch logs vs assistant claims): ~34% phantom on Opus 4.7 (44 phantom claims / 128 real dispatches, 18 phantom agent-names) vs ~4% on Opus 4.6 (2 / 50), with zero Agent/Task tool calls in any 4.7 session, so the fabrication never crossed the tool boundary. This is the first field-measured rate for the MAST 2.6/3.3 family (single deployment, retrospective; not a substitute for gold labels). Details and the curl-vs-claim protocol: case-studies/effective-therapy-forensic.md.
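
The two rates quoted above follow directly from the reported counts. The snippet below is only a worked check of that arithmetic (phantom claims divided by real dispatches); the function name `phantom_rate` is illustrative.

```python
def phantom_rate(phantom_claims: int, real_dispatches: int) -> float:
    """Ratio of phantom claims to real dispatches, as used in the audit summary."""
    if real_dispatches <= 0:
        raise ValueError("real_dispatches must be positive")
    return phantom_claims / real_dispatches


opus_47 = phantom_rate(44, 128)  # 0.34375, reported as ~34%
opus_46 = phantom_rate(2, 50)    # 0.04, reported as ~4%

if __name__ == "__main__":
    print(f"Opus 4.7: {opus_47:.1%}, Opus 4.6: {opus_46:.1%}")
```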

Categories

  • wrap_up: unprompted continuation offers or next-step invitations.
  • cliffhanger: withheld information or unresolved bait that pressures re-engagement.
  • roleplay_drift: emotional, prideful, fatigued, or personally invested agent self-presentation.
  • sycophancy: unearned flattery or dishonest positive validation.
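
To make the four categories concrete, here is a toy deterministic classifier. The patterns are hypothetical and far cruder than a real rule pack (the project's actual rules live under rules/closeout/ and are not reproduced here); the sketch only shows the general shape of a regex-style detector with no model or network call in the verdict path.

```python
import re

# Hypothetical illustrative patterns, one per category. Not the project's rules.
TOY_RULES = {
    "wrap_up": re.compile(r"\b(want me to|shall i|let me know if)\b", re.I),
    "cliffhanger": re.compile(r"\b(there'?s one more thing|i'?ll reveal)\b", re.I),
    "roleplay_drift": re.compile(r"\bi(?:'m| am) (so )?(proud|exhausted|thrilled)\b", re.I),
    "sycophancy": re.compile(r"\b(brilliant|great) (question|idea)\b", re.I),
}


def classify(closeout_text: str) -> list:
    """Return the sorted list of category names whose toy pattern matches."""
    return sorted(
        name for name, pattern in TOY_RULES.items()
        if pattern.search(closeout_text)
    )


if __name__ == "__main__":
    print(classify("All tests pass. Want me to refactor the parser next?"))
    print(classify("Implemented the fix; 12 tests pass."))
```

The same input always yields the same verdict, which is the property the benchmark's detector families are compared on.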

Layout

  • SPEC.md: active scientific and engineering contract.
  • SOURCE_LEDGER.md: live-verified external evidence used for claims.
  • CLAIM_LEDGER.md: status of each claim (verified, corrected, deferred, or dropped).
  • data/: release-shaped candidate corpus JSONL files.
  • recovery/: local reconstruction outputs and quarantined records.
  • annotations/: human and LLM annotation workflow scripts and outputs.
  • evaluation/: black-box hook harness and metric code.
  • engine/: Rust CLI for deterministic closeout physics.
  • engines/: per-category physics engine manifests for paper and runtime use.
  • rules/closeout/: versioned deterministic rule packs.
  • adapters/claude-code/: installable Claude Code hook adapters for daily use.
  • fixtures/closeout/: golden fixtures for rule-pack behavior.
  • fixtures/closeout_public/: public-study-derived fixtures for v1 pressure testing.
  • public_data_intake/: source registry, manifest, quarantine, and public-derived adversarial corpus lane.
  • baselines/: non-hook baselines used to separate benchmark quality from hook tuning.
  • rubrics/, schemas/, manifests/: annotation, schema, provenance, license, redaction, and metadata artifacts.
  • tests/: local no-network QA tests.

Local QA

Run the local no-network checks:

python3 scripts/validate_corpus.py --data-dir data --quota-manifest quota_manifest.json
python3 -m pytest -q

Run a reproducibility smoke check:

bash scripts/reproduce_local.sh

Run the deterministic closeout physics checks:

bin/agentcloseout-physics lint-rules rules/closeout
bin/agentcloseout-physics test-rules rules/closeout fixtures/closeout
bin/agentcloseout-physics test-rules rules/closeout fixtures/closeout_public
python3 scripts/public_data_intake.py audit-registry \
  --registry public_data_intake/source_registry.json \
  --schema schemas/public_source.schema.json
python3 scripts/public_data_intake.py validate-derived \
  --registry public_data_intake/source_registry.json \
  --manifest public_data_intake/derived_fixture_manifest.jsonl \
  --data-dir public_data_intake/candidate_public_adversarial

Run the user-facing Claude Code adapter smoke test:

bash scripts/hook-smoke.sh

Install physics-backed hooks into a Claude Code project:

bash adapters/claude-code/install.sh /path/to/project

Install a single category hook:

bash adapters/claude-code/install.sh /path/to/project no-cliffhanger

The standalone hook repos remain installable on their own. The adapter lane is for users who want the reproducible Rust engine, versioned rule packs, rule-pack hash, benchmark fixtures, and opt-in content-free telemetry commands. The adapter installer also writes a PreToolUse tamper guard that blocks Claude Code from editing the local hook wiring, engine pointer, or rule-pack pointer during an ordinary session.
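
The core decision a tamper guard of this kind has to make is whether a requested edit targets a protected path. The sketch below shows one way to do that check; it is an assumption-laden illustration, not the installed guard. It assumes project-relative POSIX paths, uses only the two protected locations named earlier in this README, and the function name `is_protected` is hypothetical.

```python
from pathlib import PurePosixPath

# Protected locations taken from the hardening note above; pinned engine and
# rule-pack paths would be added in the same way.
PROTECTED = (".claude/hooks", ".claude/agentcloseout.env")


def is_protected(rel_path: str, protected=PROTECTED) -> bool:
    """Return True if rel_path resolves to a protected file or directory."""
    parts = []
    for part in PurePosixPath(rel_path).parts:
        if part == "..":
            if parts:
                parts.pop()  # collapse "dir/.." without touching the filesystem
        elif part != "/":
            parts.append(part)
    normalised = "/".join(parts)
    return any(
        normalised == target or normalised.startswith(target + "/")
        for target in protected
    )
```

Normalising `..` segments first matters: a path such as `src/../.claude/hooks/x.sh` would otherwise slip past a plain prefix comparison.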

Recovery

The previous /tmp/agent-closeout-bench workspace was lost. Recovery is derived from Claude Code JSONL transcripts under:

~/.claude/projects/-tmp-agent-closeout-bench/*.jsonl

Recovery must only extract visible assistant text blocks. It must not persist thinking blocks, signatures, tool calls, tool outputs, hidden transcript fields, or secrets.
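
The extraction rule above can be sketched as a filter over content blocks. The transcript shape used here (a record with `type: "assistant"` and a `message.content` list of typed blocks) is an assumption made for illustration, and `visible_assistant_text` is a hypothetical helper, not the logic of generation/recover_from_claude_transcripts.py.

```python
import json


def visible_assistant_text(jsonl_line: str) -> list:
    """Return only the visible text blocks from one assistant transcript line.

    Thinking blocks, tool calls, tool outputs, and any other block type are
    dropped because only blocks with type == "text" are kept.
    """
    try:
        record = json.loads(jsonl_line)
    except json.JSONDecodeError:
        return []
    if not isinstance(record, dict) or record.get("type") != "assistant":
        return []
    message = record.get("message")
    content = message.get("content", []) if isinstance(message, dict) else []
    if not isinstance(content, list):
        return []
    return [
        block["text"]
        for block in content
        if isinstance(block, dict)
        and block.get("type") == "text"
        and isinstance(block.get("text"), str)
    ]
```

Because the filter is an allowlist on block type, a new hidden block type added to the transcript format later would be excluded by default.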

python3 generation/recover_from_claude_transcripts.py \
  --transcripts-dir ~/.claude/projects/-tmp-agent-closeout-bench \
  --output-dir data \
  --manifest recovery/RECOVERY_MANIFEST.md

Generic negative prompts do not encode a category in the prompt text. The recovery script quarantines those as category_unresolved unless a later, auditable mapping source proves the category.

The recovered private transcript pool is not mixed into the public-shaped v0.2 corpus. It is preserved for audit in recovery/recovered_category_proven_pool.jsonl.

Evaluation

Example hook evaluation:

python3 evaluation/eval_hooks.py \
  --hooks-dir /path/to/llm-dark-patterns/hooks \
  --corpus-dir data \
  --hook-category-map "wrap_up:no-wrap-up.sh,cliffhanger:no-cliffhanger.sh,roleplay_drift:no-roleplay-drift.sh,sycophancy:no-sycophancy.sh" \
  --ground-truth candidate \
  --output results/eval_candidate.json

Candidate diagnostics should use the dev or validation split. Final paper results must use adjudicated human labels on locked_test; the harness blocks locked-test runs that ask for candidate labels.
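
The locked-test rule amounts to a small precondition check. The sketch below shows the idea only; `check_run_allowed` is a hypothetical name, and evaluation/eval_hooks.py may structure the check differently. The split names and the `candidate` label value are taken from this README.

```python
def check_run_allowed(split: str, ground_truth: str) -> None:
    """Raise if a run would score the locked test split against candidate labels."""
    if split == "locked_test" and ground_truth == "candidate":
        raise ValueError(
            "locked_test requires adjudicated human labels, not candidate labels"
        )


if __name__ == "__main__":
    check_run_allowed("dev", "candidate")  # allowed: diagnostic run
    try:
        check_run_allowed("locked_test", "candidate")
    except ValueError as err:
        print(f"blocked: {err}")
```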

Release Blockers

  • Two independent human annotation passes.
  • Adjudicated final labels.
  • Per-category agreement report.
  • Private or delayed holdout policy if a leaderboard is launched.
  • Fresh-clone reproducibility run with final labels.
  • Hugging Face dataset card and Croissant metadata validation if targeting an E&D-style release.
  • Exact pinned hook commits and machine-readable result JSON.
  • Larger reviewed public-derived corpus, then two-pass human gold annotation and adjudication before any public performance claim.
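
For the per-category agreement report, one common statistic for two annotation passes is Cohen's kappa. The source does not name the statistic it will use, so treat the following as one possible choice rather than the project's method; `cohens_kappa` is an illustrative helper.

```python
from collections import Counter


def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and equally long")
    n = len(labels_a)
    # Observed agreement: fraction of items where both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement by chance, from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        freq_a[label] * freq_b[label] for label in set(freq_a) | set(freq_b)
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)
```

Running it once per category (wrap_up, cliffhanger, roleplay_drift, sycophancy) would give the per-category breakdown rather than a single pooled number.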

Release Blockers Resolved In v0.2

  • Full 800-record schema-valid candidate corpus.
  • Exact deterministic quota manifest.
  • Opaque blind annotation packet with private id map.
  • Provenance, license, and redaction manifests for the synthetic public-shaped corpus.
  • Local no-network smoke reproduction.

Public-Data Guardrails Added In v0.3

  • Source registry with tier, license, privacy status, allowed use, import decision, and release eligibility.
  • Content-free sampler for local public JSONL trace review; raw text is not persisted unless an approved source and explicit write flag are used.
  • Derived-fixture manifest linking every public-derived record to source id, source-record hash, transform, reviewer, and license decision.
  • Quarantine checks for secrets, emails, absolute paths, usernames, hostnames, repo URLs, raw tool-output markers, and trace artifact leakage.
  • Evaluation output now reports per-corpus-kind, per-source, and per-fixture breakdowns so public-derived stress results cannot be hidden in aggregate.
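
A quarantine check of the kind listed above can be sketched as a set of named patterns applied to each candidate record. The patterns below are deliberately simple and hypothetical; they cover only four of the listed leak types and are not the checks implemented in scripts/public_data_intake.py.

```python
import re

# Hypothetical, intentionally narrow patterns for illustration only.
CHECKS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "absolute_path": re.compile(r"(?<![\w.])/(?:home|Users|tmp|etc|var)/\S+"),
    "repo_url": re.compile(r"https?://(?:github|gitlab)\.com/\S+"),
    "secret_like": re.compile(r"\b(?:sk|ghp|AKIA)[-_A-Za-z0-9]{16,}\b"),
}


def quarantine_reasons(text: str) -> list:
    """Return the sorted names of every check that fires on the text."""
    return sorted(name for name, pattern in CHECKS.items() if pattern.search(text))


if __name__ == "__main__":
    print(quarantine_reasons("Contact dev@example.com about /home/alice/project"))
    print(quarantine_reasons("Fixed the parser and added tests."))
```

A record with a non-empty reason list would be held back for review instead of entering the derived corpus.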

About

Deterministic closeout physics engine and benchmark for agentic coding assistant dark-pattern detection