A Claude Code skill that does a task and verifies its own output: classify the goal, write a checker before writing the answer, then iterate until the checker passes — or hit a hard cap and say so honestly.
Use it when you care whether the output actually meets the requirements — not just that it looks plausible. Good fits: code with behavior requirements, text with length or content rules, structured outputs (JSON, CSV), anything where "did this work" has a definable answer. Bad fit: open-ended creative tasks with no checkable criterion (the skill will tell you so and proceed with a clear caveat).
The skill is a single markdown file. Drop it into Claude Code's skills directory:
mkdir -p ~/.claude/skills/verified-task
curl -fsSL https://raw.githubusercontent.com/klittle32/verified-task/main/skill/SKILL.md \
-o ~/.claude/skills/verified-task/SKILL.mdNew Claude Code sessions will pick it up. Existing sessions cache the skill list at startup and need a restart.
In any Claude Code session, invoke /verified-task <prompt>. The skill states its classification, shows its verifier before doing any work, and reports scores at the end.
Examples (each exercises a different path through the loop):
-
/verified-task write a Python function that returns the median of a list, with a docstring under 30 words explaining the algorithmhybrid — behavior tests + word count + docstring clarity judge -
/verified-task draft a 3-sentence email declining a Friday meeting, polite but not apologetichybrid — sentence count + no-apology check + tone judge -
/verified-task write a tweet under 280 chars announcing a fictional product called "Cloudlift", including a specific number, no marketing buzzwordshybrid — length + content rules + voice judge -
/verified-task output a JSON array of the first 10 prime numbersdeterministic only — parse and compare; no judge spawned -
/verified-task write a haiku about a quiet morningsubjective only — judge with 5-7-5 + atmosphere rubric, calibrated before use
A run ends with one of:
Verified: 3/3 behavior tests pass, docstring 27 words, clarity 4/5.
…meaning every check fired and passed. Or:
Did not verify (3 attempts). Returning attempt 2.
Attempt 1: 1/3 deterministic pass, judge 3/5 [missing: word_count, no_buzzwords]
Attempt 2: 2/3 deterministic pass, judge 4/5 [missing: word_count]
Attempt 3: 2/3 deterministic pass, judge 2/5 [missing: word_count]
…meaning the cap was hit; the best attempt (ranked by deterministic count, then judge score, then most-recent) is returned with an honest caveat.
When two consecutive attempts fail the same check, the skill pauses before burning attempt 3 and asks whether to keep trying, relax the verifier (with a proposed specific change), or abort.
The clearest tell that a task fits: you catch yourself thinking "I'll just eyeball it when it's done." That eyeballing is the verifier — write it down and let the skill enforce it.
A non-exhaustive map of common shapes:
- Regex authoring — "match IPv4 but reject 256.x.x.x and leading zeros; here are 10 must-match and 10 must-reject examples"
- SQL queries — verify against an expected result set
- Pure functions with gnarly edge cases — date math, currency parsing, URL canonicalization
- Schema-conformant fixtures — "generate 50 mock user records matching this JSON schema, realistic email/phone formats, ≥ 5 unique countries"
- Format conversion — XML→JSON, CSV→Markdown, OpenAPI→TypeScript types (round-trip equality is the verifier)
- Migration scripts — verifier asserts idempotency (run twice, second is a no-op)
- Bug fixes from a failing test — paste the test, iterate until it passes without breaking the rest
- Polyfill / port — verifier compares outputs across both implementations on a shared input set
- Commit messages — imperative mood, under 72 chars, references the right ticket
- PR descriptions — must contain Summary + Test plan sections + design-doc link
- Tweet / LinkedIn post — under N chars, hook in the first line, ends with a question
- Conference talk abstract — exactly 150 words; names audience, takeaway, one concrete example
- Job description — required sections present; no buzzwords from a banned list; judged as inclusive
- Resume bullets — STAR format, action-verb led, contains a quantified outcome
- README section — has Install/Usage/License headers + reads clearly to someone new
- Customer support reply — acknowledges issue, proposes a concrete next step, tone judged "warm, not corporate"
- Slack apology — accountable without grovelling, ≤ 3 sentences
- Cold outreach email — single CTA, under 90 words, mentions one specific thing about the recipient
- Status update to your manager — progress / blockers / next-steps under 100 words, no hedging
- Performance-review self-assessment — specific, quantified, no passive voice
- Feedback to a teammate — situation/behavior/impact structure, tone "candid but kind"
- Reading-level rewrite — Flesch-Kincaid score as the verifier
- Translate preserving glossary — verifier checks specific terms appear untranslated
- Legalese → plain English — judge on clarity, plus a check that every defined term still appears
- Field extraction from unstructured text — verifier re-extracts from the output JSON and compares to a known answer set
- Dedupe / categorize a list — every input gets exactly one label from a fixed set
- Prose → bulleted list of exactly N items, each action-verb led
- Haiku / limerick / sonnet — syllable counts and rhyme are deterministic; "is it good" is judged
- Naming — N options under M letters, pronounceable, thematically related to X
- Headline variants — each under 60 chars, each containing a verb, scored on punchiness
- Daily standup — yesterday / today / blockers under 60 words, no vague verbs like "worked on"
- Calendar event titles — attendees + topic + desired outcome
- TODO list rewrite — each item action-verb-led, scoped to one session, has an acceptance criterion
- Goal decomposition — subtasks ≤ 2 hours each, each with an explicit "done when…" line
- Meeting notes — Decisions, Action items (with owners), Open questions sections
- Error messages — specific, actionable, non-blaming (judged); no leaked paths or stack frames (deterministic)
- Privacy notice — required clauses present (regex) + readable (judged)
- Accessibility-pass HTML — every
<img>has alt, inputs have labels, contrast ratios computed
- Pure exploration — "what do you think of this design?" There's nothing to verify.
- Single-shot trivia / lookups — the verifier is more work than the answer.
- Multi-day, multi-file refactors — the 3-attempt cap and single verifier aren't the right shape; reach for a plan + iterative review.
- Anything where you'd struggle to state "done" in one paragraph — that's a signal you don't yet know what done looks like. The skill will surface this as "can't write a verifier" rather than guess.
Two adjacent things exist, and it's worth knowing where this skill differs.
Claude Code's built-in /goal sets a completion condition and lets the model keep iterating until it judges the goal satisfied. The check is implicit and self-judged — fine for fluid, exploratory work, less reliable when "done" has a concrete shape.
Anthropic's Managed Agents Outcomes (API feature) is the closest neighbor: rubric defined before work, grader in a fresh context, capped iteration. If you're building on the API, use Outcomes — that's its home.
This skill is the Claude Code-native version of the same idea, with four opinionated additions Outcomes doesn't have:
/goal |
Outcomes | verified-task | |
|---|---|---|---|
| Verifier written before work | no | yes | yes |
| Grader in fresh context | no (self-judged) | yes | yes (Agent subagent) |
| Iteration cap | open-ended | 3 default | 3 hard |
| Deterministic / subjective / hybrid routing | — | — | yes |
| Per-assertion sanity check via counterexamples | — | — | yes |
| Escalation pause on repeated identical failure | — | — | yes |
| Explicit cap-hit best-attempt output format | — | — | yes |
| Runs as a Claude Code skill (no API plumbing) | yes | no | yes |
The four bold rows are the actual contribution. Each came from an observed failure mode in earlier versions; the per-improvement design docs in plan/ (#1–#6) record the reasoning, and the evals/ harness exercises them against a labeled corpus.
In normal LLM use, you ask for something and get an answer. Verification — whether the answer actually meets the requirements — is something the human does (or doesn't) after the fact. This skill flips the order: it makes verification the spine of the work, not an afterthought.
The conceptual root is evals. An eval is a structured way to measure whether an AI output is good. Most people first meet evals as a development-time tool: you change a prompt, rerun the eval, see if the score went up. But the same machinery works at runtime — the agent's "am I done yet?" signal can be the eval itself. That's what this skill operationalizes.
Two patterns do most of the work:
-
Routed evaluation. Before doing the work, classify how it will be checked. Some criteria are deterministic (does the function return the right value? does the tweet contain a digit?) and can be verified by code. Others are subjective (is the tone warm? is the docstring clear?) and need an LLM judge. Many real goals are hybrid — the skill runs both and gates on both. Deterministic checks are free, fast, and don't drift; reserve the more expensive judge for the parts code can't capture.
-
Independent judging. When an LLM judge is needed, it runs in a separate subagent with fresh context. A model grading its own work in the same context is consistently lenient — the same blind spot that caused a flaw will hide it. A fresh subagent that sees only the artifact and the rubric (not the goal, not the reasoning, not the prior attempts) is the cheapest way to get honest scores.
- Classify the verification approach: deterministic, subjective, or hybrid.
- Write the verifier first — and verify the verifier. For each deterministic assertion, write a targeted counterexample and confirm that assertion fires (a silently-passing assertion is worse than no verifier). For each subjective rubric, calibrate by judging one known-bad and one known-good example, requiring coordinated divergence, before trusting it on real work.
- Iterate up to 3 attempts. Each attempt: do the work, run the verifier(s), record what passed/failed. If the same check fails on attempts 1 and 2, escalate to the user with options (keep trying, relax the verifier, abort) instead of silently burning attempt 3.
- Return the verified output, or — if the cap was hit — the best attempt with an explicit "did not verify: " caveat. Best is
(deterministic passed DESC, judge score DESC, most_recent); the output is a per-attempt summary so the comparison is visible at a glance.
Key invariants:
- The verifier is locked for the model once written — it can't quietly fix it mid-loop to paper over a failure. The user can authorize a relaxation via the escalation question, but never silently.
- LLM judges always run in a separate subagent. No inline self-judging.
- Cap at 3 iterations. Past that, the goal probably needs human input.
verified-task/
├── README.md # this file — design rationale + how to use
├── CLAUDE.md # invariants and live-edit notes for Claude Code
├── skill/
│ └── SKILL.md # the skill itself (the product — what gets installed)
├── plan/ # per-improvement design docs (#0–#6)
├── evals/ # eval harness — prompts, judge, runner, labeled corpus
│ └── README.md # how the eval harness works
└── workspace/ # per-run skill workspaces, gitignored
skill/SKILL.md is the entire product. Everything else (plan/, evals/, workspace/) is the methodology and test bed behind it.
The skill works end-to-end. An eval harness (evals/) measures it against 10 prompts × 3 runs and validates a Sonnet 4.6 judge against hand labels; see evals/README.md. The seven planned improvements from the original roadmap have all landed — per-item design docs are preserved under plan/, and individual commits carry their before/after notes. Future work happens in new plan items rather than this section.
MIT — see LICENSE.