verified-task

A Claude Code skill that does a task and verifies its own output: classify the goal, write a checker before writing the answer, then iterate until the checker passes — or hit a hard cap and say so honestly.

Use it when you care whether the output actually meets the requirements — not just that it looks plausible. Good fits: code with behavior requirements, text with length or content rules, structured outputs (JSON, CSV), anything where "did this work" has a definable answer. Bad fit: open-ended creative tasks with no checkable criterion (the skill will tell you so and proceed with a clear caveat).

Install

The skill is a single markdown file. Drop it into Claude Code's skills directory:

mkdir -p ~/.claude/skills/verified-task
curl -fsSL https://raw.githubusercontent.com/klittle32/verified-task/main/skill/SKILL.md \
  -o ~/.claude/skills/verified-task/SKILL.md

New Claude Code sessions will pick it up. Existing sessions cache the skill list at startup and need a restart.

How to use it

In any Claude Code session, invoke /verified-task <prompt>. The skill states its classification, shows its verifier before doing any work, and reports scores at the end.

Examples (each exercises a different path through the loop):

/verified-task write a Python function that returns the median of a list, with a docstring under 30 words explaining the algorithm hybrid — behavior tests + word count + docstring clarity judge
/verified-task draft a 3-sentence email declining a Friday meeting, polite but not apologetic hybrid — sentence count + no-apology check + tone judge
/verified-task write a tweet under 280 chars announcing a fictional product called "Cloudlift", including a specific number, no marketing buzzwords hybrid — length + content rules + voice judge
/verified-task output a JSON array of the first 10 prime numbers deterministic only — parse and compare; no judge spawned
/verified-task write a haiku about a quiet morning subjective only — judge with 5-7-5 + atmosphere rubric, calibrated before use

A run ends with one of:

Verified: 3/3 behavior tests pass, docstring 27 words, clarity 4/5.

…meaning every check fired and passed. Or:

Did not verify (3 attempts). Returning attempt 2.

Attempt 1: 1/3 deterministic pass, judge 3/5 [missing: word_count, no_buzzwords]
Attempt 2: 2/3 deterministic pass, judge 4/5 [missing: word_count]
Attempt 3: 2/3 deterministic pass, judge 2/5 [missing: word_count]

…meaning the cap was hit; the best attempt (ranked by deterministic count, then judge score, then most-recent) is returned with an honest caveat.

When two consecutive attempts fail the same check, the skill pauses before burning attempt 3 and asks whether to keep trying, relax the verifier (with a proposed specific change), or abort.

Where it fits

The clearest tell that a task fits: you catch yourself thinking "I'll just eyeball it when it's done." That eyeballing is the verifier — write it down and let the skill enforce it.

A non-exhaustive map of common shapes:

Code & data (deterministic-heavy — usually no judge)

Regex authoring — "match IPv4 but reject 256.x.x.x and leading zeros; here are 10 must-match and 10 must-reject examples"
SQL queries — verify against an expected result set
Pure functions with gnarly edge cases — date math, currency parsing, URL canonicalization
Schema-conformant fixtures — "generate 50 mock user records matching this JSON schema, realistic email/phone formats, ≥ 5 unique countries"
Format conversion — XML→JSON, CSV→Markdown, OpenAPI→TypeScript types (round-trip equality is the verifier)
Migration scripts — verifier asserts idempotency (run twice, second is a no-op)
Bug fixes from a failing test — paste the test, iterate until it passes without breaking the rest
Polyfill / port — verifier compares outputs across both implementations on a shared input set

Writing with hard constraints (hybrid sweet spot)

Commit messages — imperative mood, under 72 chars, references the right ticket
PR descriptions — must contain Summary + Test plan sections + design-doc link
Tweet / LinkedIn post — under N chars, hook in the first line, ends with a question
Conference talk abstract — exactly 150 words; names audience, takeaway, one concrete example
Job description — required sections present; no buzzwords from a banned list; judged as inclusive
Resume bullets — STAR format, action-verb led, contains a quantified outcome
README section — has Install/Usage/License headers + reads clearly to someone new

Communications & tone (judge-heavy)

Customer support reply — acknowledges issue, proposes a concrete next step, tone judged "warm, not corporate"
Slack apology — accountable without grovelling, ≤ 3 sentences
Cold outreach email — single CTA, under 90 words, mentions one specific thing about the recipient
Status update to your manager — progress / blockers / next-steps under 100 words, no hedging
Performance-review self-assessment — specific, quantified, no passive voice
Feedback to a teammate — situation/behavior/impact structure, tone "candid but kind"

Transformations & extractions

Reading-level rewrite — Flesch-Kincaid score as the verifier
Translate preserving glossary — verifier checks specific terms appear untranslated
Legalese → plain English — judge on clarity, plus a check that every defined term still appears
Field extraction from unstructured text — verifier re-extracts from the output JSON and compares to a known answer set
Dedupe / categorize a list — every input gets exactly one label from a fixed set
Prose → bulleted list of exactly N items, each action-verb led

Creative with structure

Haiku / limerick / sonnet — syllable counts and rhyme are deterministic; "is it good" is judged
Naming — N options under M letters, pronounceable, thematically related to X
Headline variants — each under 60 chars, each containing a verb, scored on punchiness

Personal / workflow

Daily standup — yesterday / today / blockers under 60 words, no vague verbs like "worked on"
Calendar event titles — attendees + topic + desired outcome
TODO list rewrite — each item action-verb-led, scoped to one session, has an acceptance criterion
Goal decomposition — subtasks ≤ 2 hours each, each with an explicit "done when…" line
Meeting notes — Decisions, Action items (with owners), Open questions sections

Compliance-ish

Error messages — specific, actionable, non-blaming (judged); no leaked paths or stack frames (deterministic)
Privacy notice — required clauses present (regex) + readable (judged)
Accessibility-pass HTML — every <img> has alt, inputs have labels, contrast ratios computed

Where it's not worth it

Pure exploration — "what do you think of this design?" There's nothing to verify.
Single-shot trivia / lookups — the verifier is more work than the answer.
Multi-day, multi-file refactors — the 3-attempt cap and single verifier aren't the right shape; reach for a plan + iterative review.
Anything where you'd struggle to state "done" in one paragraph — that's a signal you don't yet know what done looks like. The skill will surface this as "can't write a verifier" rather than guess.

How this compares to `/goal` and the Outcomes API

Two adjacent things exist, and it's worth knowing where this skill differs.

Claude Code's built-in /goal sets a completion condition and lets the model keep iterating until it judges the goal satisfied. The check is implicit and self-judged — fine for fluid, exploratory work, less reliable when "done" has a concrete shape.

Anthropic's Managed Agents Outcomes (API feature) is the closest neighbor: rubric defined before work, grader in a fresh context, capped iteration. If you're building on the API, use Outcomes — that's its home.

This skill is the Claude Code-native version of the same idea, with four opinionated additions Outcomes doesn't have:

	`/goal`	Outcomes	verified-task
Verifier written before work	no	yes	yes
Grader in fresh context	no (self-judged)	yes	yes (Agent subagent)
Iteration cap	open-ended	3 default	3 hard
Deterministic / subjective / hybrid routing	—	—	yes
Per-assertion sanity check via counterexamples	—	—	yes
Escalation pause on repeated identical failure	—	—	yes
Explicit cap-hit best-attempt output format	—	—	yes
Runs as a Claude Code skill (no API plumbing)	yes	no	yes

The four bold rows are the actual contribution. Each came from an observed failure mode in earlier versions; the per-improvement design docs in plan/ (#1–#6) record the reasoning, and the evals/ harness exercises them against a labeled corpus.

Why this exists

In normal LLM use, you ask for something and get an answer. Verification — whether the answer actually meets the requirements — is something the human does (or doesn't) after the fact. This skill flips the order: it makes verification the spine of the work, not an afterthought.

The conceptual root is evals. An eval is a structured way to measure whether an AI output is good. Most people first meet evals as a development-time tool: you change a prompt, rerun the eval, see if the score went up. But the same machinery works at runtime — the agent's "am I done yet?" signal can be the eval itself. That's what this skill operationalizes.

Two patterns do most of the work:

Routed evaluation. Before doing the work, classify how it will be checked. Some criteria are deterministic (does the function return the right value? does the tweet contain a digit?) and can be verified by code. Others are subjective (is the tone warm? is the docstring clear?) and need an LLM judge. Many real goals are hybrid — the skill runs both and gates on both. Deterministic checks are free, fast, and don't drift; reserve the more expensive judge for the parts code can't capture.
Independent judging. When an LLM judge is needed, it runs in a separate subagent with fresh context. A model grading its own work in the same context is consistently lenient — the same blind spot that caused a flaw will hide it. A fresh subagent that sees only the artifact and the rubric (not the goal, not the reasoning, not the prior attempts) is the cheapest way to get honest scores.

The loop

Classify the verification approach: deterministic, subjective, or hybrid.
Write the verifier first — and verify the verifier. For each deterministic assertion, write a targeted counterexample and confirm that assertion fires (a silently-passing assertion is worse than no verifier). For each subjective rubric, calibrate by judging one known-bad and one known-good example, requiring coordinated divergence, before trusting it on real work.
Iterate up to 3 attempts. Each attempt: do the work, run the verifier(s), record what passed/failed. If the same check fails on attempts 1 and 2, escalate to the user with options (keep trying, relax the verifier, abort) instead of silently burning attempt 3.
Return the verified output, or — if the cap was hit — the best attempt with an explicit "did not verify: " caveat. Best is (deterministic passed DESC, judge score DESC, most_recent); the output is a per-attempt summary so the comparison is visible at a glance.

Key invariants:

The verifier is locked for the model once written — it can't quietly fix it mid-loop to paper over a failure. The user can authorize a relaxation via the escalation question, but never silently.
LLM judges always run in a separate subagent. No inline self-judging.
Cap at 3 iterations. Past that, the goal probably needs human input.

Project layout

verified-task/
├── README.md           # this file — design rationale + how to use
├── CLAUDE.md           # invariants and live-edit notes for Claude Code
├── skill/
│   └── SKILL.md        # the skill itself (the product — what gets installed)
├── plan/               # per-improvement design docs (#0–#6)
├── evals/              # eval harness — prompts, judge, runner, labeled corpus
│   └── README.md       # how the eval harness works
└── workspace/          # per-run skill workspaces, gitignored

skill/SKILL.md is the entire product. Everything else (plan/, evals/, workspace/) is the methodology and test bed behind it.

Status

The skill works end-to-end. An eval harness (evals/) measures it against 10 prompts × 3 runs and validates a Sonnet 4.6 judge against hand labels; see evals/README.md. The seven planned improvements from the original roadmap have all landed — per-item design docs are preserved under plan/, and individual commits carry their before/after notes. Future work happens in new plan items rather than this section.

License

MIT — see LICENSE.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

verified-task

Install

How to use it

Where it fits

Code & data (deterministic-heavy — usually no judge)

Writing with hard constraints (hybrid sweet spot)

Communications & tone (judge-heavy)

Transformations & extractions

Creative with structure

Personal / workflow

Compliance-ish

Where it's not worth it

How this compares to `/goal` and the Outcomes API

Why this exists

The loop

Project layout

Status

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 19 Commits
evals		evals
plan		plan
skill		skill
.gitignore		.gitignore
CLAUDE.md		CLAUDE.md
LICENSE		LICENSE
README.md		README.md

Folders and files

Latest commit

History

Repository files navigation

verified-task

Install

How to use it

Where it fits

Code & data (deterministic-heavy — usually no judge)

Writing with hard constraints (hybrid sweet spot)

Communications & tone (judge-heavy)

Transformations & extractions

Creative with structure

Personal / workflow

Compliance-ish

Where it's not worth it

How this compares to /goal and the Outcomes API

Why this exists

The loop

Project layout

Status

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

How this compares to `/goal` and the Outcomes API

Packages