klittle32/verified-task

verified-task

A Claude Code skill that does a task and verifies its own output: classify the goal, write a checker before writing the answer, then iterate until the checker passes — or hit a hard cap and say so honestly.

Use it when you care whether the output actually meets the requirements — not just that it looks plausible. Good fits: code with behavior requirements, text with length or content rules, structured outputs (JSON, CSV), anything where "did this work" has a definable answer. Bad fit: open-ended creative tasks with no checkable criterion (the skill will tell you so and proceed with a clear caveat).

Install

The skill is a single markdown file. Drop it into Claude Code's skills directory:

mkdir -p ~/.claude/skills/verified-task
curl -fsSL https://raw.githubusercontent.com/klittle32/verified-task/main/skill/SKILL.md \
  -o ~/.claude/skills/verified-task/SKILL.md

New Claude Code sessions will pick it up. Existing sessions cache the skill list at startup and need a restart.

How to use it

In any Claude Code session, invoke /verified-task <prompt>. The skill states its classification, shows its verifier before doing any work, and reports scores at the end.

Examples (each exercises a different path through the loop):

  • /verified-task write a Python function that returns the median of a list, with a docstring under 30 words explaining the algorithm
    hybrid: behavior tests + word count + docstring clarity judge

  • /verified-task draft a 3-sentence email declining a Friday meeting, polite but not apologetic
    hybrid: sentence count + no-apology check + tone judge

  • /verified-task write a tweet under 280 chars announcing a fictional product called "Cloudlift", including a specific number, no marketing buzzwords
    hybrid: length + content rules + voice judge

  • /verified-task output a JSON array of the first 10 prime numbers
    deterministic only: parse and compare; no judge spawned

  • /verified-task write a haiku about a quiet morning
    subjective only: judge with 5-7-5 + atmosphere rubric, calibrated before use
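
To make the "length + content rules" part of the tweet example concrete, here is a minimal sketch of what a deterministic checker for it might look like. The function name, the check names, and the buzzword list are illustrative assumptions; the skill writes its own verifier per prompt, and nothing here is taken from SKILL.md.

```python
import re

# Hypothetical banned list, for illustration only.
BUZZWORDS = {"synergy", "revolutionary", "game-changing", "disrupt"}

def check_tweet(text: str) -> dict[str, bool]:
    """Return one pass/fail result per deterministic rule."""
    words = set(re.findall(r"[a-z][a-z'-]*", text.lower()))
    return {
        "length": len(text) < 280,                      # "under 280 chars"
        "mentions_product": "Cloudlift" in text,
        "has_number": re.search(r"[0-9]", text) is not None,
        "no_buzzwords": words.isdisjoint(BUZZWORDS),
    }

results = check_tweet("Cloudlift moves 40 TB to the cloud in under an hour.")
print(all(results.values()))  # True
```

Returning one named result per rule, rather than a single boolean, is what lets a failed run report which checks are still missing.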

A run ends with one of:

Verified: 3/3 behavior tests pass, docstring 27 words, clarity 4/5.

…meaning every check fired and passed. Or:

Did not verify (3 attempts). Returning attempt 2.

Attempt 1: 1/3 deterministic pass, judge 3/5 [missing: word_count, no_buzzwords]
Attempt 2: 2/3 deterministic pass, judge 4/5 [missing: word_count]
Attempt 3: 2/3 deterministic pass, judge 2/5 [missing: word_count]

…meaning the cap was hit; the best attempt (ranked by deterministic count, then judge score, then most-recent) is returned with an honest caveat.
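
The ranking rule can be sketched as a tuple comparison. The `Attempt` class and its field names are assumptions made for this example; the data mirrors the three attempts shown above.

```python
from dataclasses import dataclass

@dataclass
class Attempt:
    number: int                # 1-based attempt index
    deterministic_passed: int  # how many deterministic checks passed
    judge_score: int           # rubric score from the judge, e.g. out of 5

def best_attempt(attempts: list[Attempt]) -> Attempt:
    # Tuples compare left to right, so later fields only break ties:
    # deterministic count first, then judge score, then most recent.
    return max(
        attempts,
        key=lambda a: (a.deterministic_passed, a.judge_score, a.number),
    )

attempts = [Attempt(1, 1, 3), Attempt(2, 2, 4), Attempt(3, 2, 2)]
print(best_attempt(attempts).number)  # 2
```

Attempts 2 and 3 tie on deterministic passes (2/3), so the judge score decides and attempt 2 wins, matching the sample output.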

When two consecutive attempts fail the same check, the skill pauses before burning attempt 3 and asks whether to keep trying, relax the verifier (with a proposed specific change), or abort.
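
A rough sketch of the trigger for that pause, assuming each attempt records the set of check names it failed (the function name and data shape are hypothetical):

```python
def repeated_failures(failures_by_attempt: list[set[str]]) -> set[str]:
    """Return the checks that failed in both of the two most recent attempts."""
    if len(failures_by_attempt) < 2:
        return set()
    return failures_by_attempt[-1] & failures_by_attempt[-2]

history = [{"word_count", "no_buzzwords"}, {"word_count"}]
print(repeated_failures(history))  # {'word_count'}
```

A non-empty result is the signal to stop and ask the user instead of spending the last attempt.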

Where it fits

The clearest tell that a task fits: you catch yourself thinking "I'll just eyeball it when it's done." That eyeballing is the verifier — write it down and let the skill enforce it.

A non-exhaustive map of common shapes:

Code & data (deterministic-heavy — usually no judge)

  • Regex authoring: "match IPv4 but reject 256.x.x.x and leading zeros; here are 10 must-match and 10 must-reject examples"
  • SQL queries — verify against an expected result set
  • Pure functions with gnarly edge cases — date math, currency parsing, URL canonicalization
  • Schema-conformant fixtures: "generate 50 mock user records matching this JSON schema, realistic email/phone formats, ≥ 5 unique countries"
  • Format conversion — XML→JSON, CSV→Markdown, OpenAPI→TypeScript types (round-trip equality is the verifier)
  • Migration scripts — verifier asserts idempotency (run twice, second is a no-op)
  • Bug fixes from a failing test — paste the test, iterate until it passes without breaking the rest
  • Polyfill / port — verifier compares outputs across both implementations on a shared input set
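
As an illustration of a deterministic-only verifier, here is a sketch for the regex item: the example lists are the verifier, and the pattern is the artifact under test. The example strings and the names `verify_examples`, `MUST_MATCH`, and `MUST_REJECT` are assumptions made for this sketch, with fewer examples than the ten the prompt asks for.

```python
import re

# Candidate under test: four octets in 0-255, no leading zeros.
OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])"
IPV4 = re.compile(rf"{OCTET}(?:\.{OCTET}){{3}}")

MUST_MATCH = ["0.0.0.0", "192.168.1.1", "255.255.255.255", "10.0.0.1"]
MUST_REJECT = ["256.1.1.1", "01.2.3.4", "1.2.3", "1.2.3.4.5", "1.2.3.04"]

def verify_examples(pattern: re.Pattern) -> list[str]:
    """Return a description of every example the pattern gets wrong."""
    failures = [f"should match: {s}" for s in MUST_MATCH if not pattern.fullmatch(s)]
    failures += [f"should reject: {s}" for s in MUST_REJECT if pattern.fullmatch(s)]
    return failures

print(verify_examples(IPV4))  # []
```

A naive pattern such as `[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+` fails this verifier on `256.1.1.1`, `01.2.3.4`, and `1.2.3.04`, which is exactly the feedback the next attempt needs.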

Writing with hard constraints (hybrid sweet spot)

  • Commit messages — imperative mood, under 72 chars, references the right ticket
  • PR descriptions — must contain Summary + Test plan sections + design-doc link
  • Tweet / LinkedIn post — under N chars, hook in the first line, ends with a question
  • Conference talk abstract — exactly 150 words; names audience, takeaway, one concrete example
  • Job description — required sections present; no buzzwords from a banned list; judged as inclusive
  • Resume bullets — STAR format, action-verb led, contains a quantified outcome
  • README section — has Install/Usage/License headers + reads clearly to someone new

Communications & tone (judge-heavy)

  • Customer support reply — acknowledges issue, proposes a concrete next step, tone judged "warm, not corporate"
  • Slack apology — accountable without grovelling, ≤ 3 sentences
  • Cold outreach email — single CTA, under 90 words, mentions one specific thing about the recipient
  • Status update to your manager — progress / blockers / next-steps under 100 words, no hedging
  • Performance-review self-assessment — specific, quantified, no passive voice
  • Feedback to a teammate — situation/behavior/impact structure, tone "candid but kind"

Transformations & extractions

  • Reading-level rewrite — Flesch-Kincaid score as the verifier
  • Translate preserving glossary — verifier checks specific terms appear untranslated
  • Legalese → plain English — judge on clarity, plus a check that every defined term still appears
  • Field extraction from unstructured text — verifier re-extracts from the output JSON and compares to a known answer set
  • Dedupe / categorize a list — every input gets exactly one label from a fixed set
  • Prose → bulleted list of exactly N items, each action-verb led
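
For the categorize case, a verifier might look roughly like this. It assumes the input items are unique strings and that the output is a list of `(item, label)` pairs; both the shape and the function name are assumptions for the sketch.

```python
from collections import Counter

def check_labels(
    inputs: list[str],
    labeled: list[tuple[str, str]],
    allowed: set[str],
) -> list[str]:
    """Return every way the labeling breaks 'exactly one allowed label per input'."""
    problems = []
    counts = Counter(item for item, _ in labeled)
    for item in inputs:
        if counts[item] != 1:
            problems.append(f"{item!r} has {counts[item]} labels, expected 1")
    known = set(inputs)
    for item, label in labeled:
        if item not in known:
            problems.append(f"{item!r} is not an input item")
        if label not in allowed:
            problems.append(f"{item!r} has unknown label {label!r}")
    return problems

inputs = ["crash on save", "add dark mode"]
output = [("crash on save", "bug"), ("add dark mode", "feature")]
print(check_labels(inputs, output, {"bug", "feature"}))  # []
```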

Creative with structure

  • Haiku / limerick / sonnet — syllable counts and rhyme are deterministic; "is it good" is judged
  • Naming — N options under M letters, pronounceable, thematically related to X
  • Headline variants — each under 60 chars, each containing a verb, scored on punchiness

Personal / workflow

  • Daily standup — yesterday / today / blockers under 60 words, no vague verbs like "worked on"
  • Calendar event titles — attendees + topic + desired outcome
  • TODO list rewrite — each item action-verb-led, scoped to one session, has an acceptance criterion
  • Goal decomposition — subtasks ≤ 2 hours each, each with an explicit "done when…" line
  • Meeting notes — Decisions, Action items (with owners), Open questions sections

Compliance-ish

  • Error messages — specific, actionable, non-blaming (judged); no leaked paths or stack frames (deterministic)
  • Privacy notice — required clauses present (regex) + readable (judged)
  • Accessibility-pass HTML — every <img> has alt, inputs have labels, contrast ratios computed
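
The `<img>` part of the accessibility check can be done with the standard library alone. This is a sketch of one possible deterministic check, not the skill's own code; it only tests that an `alt` attribute is present (an empty `alt=""` counts as present), and the label and contrast checks are left out.

```python
from html.parser import HTMLParser

class ImgAltChecker(HTMLParser):
    """Collect the src of every <img> tag that has no alt attribute."""

    def __init__(self) -> None:
        super().__init__()
        self.missing: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attr_map = dict(attrs)
            if "alt" not in attr_map:
                self.missing.append(attr_map.get("src") or "<img without src>")

def images_missing_alt(html: str) -> list[str]:
    checker = ImgAltChecker()
    checker.feed(html)
    checker.close()
    return checker.missing

page = '<p><img src="a.png" alt="Logo"><img src="b.png"></p>'
print(images_missing_alt(page))  # ['b.png']
```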

Where it's not worth it

  • Pure exploration — "what do you think of this design?" There's nothing to verify.
  • Single-shot trivia / lookups — the verifier is more work than the answer.
  • Multi-day, multi-file refactors — the 3-attempt cap and single verifier aren't the right shape; reach for a plan + iterative review.
  • Anything where you'd struggle to state "done" in one paragraph — that's a signal you don't yet know what done looks like. The skill will surface this as "can't write a verifier" rather than guess.

How this compares to /goal and the Outcomes API

Two adjacent things exist, and it's worth knowing where this skill differs.

Claude Code's built-in /goal sets a completion condition and lets the model keep iterating until it judges the goal satisfied. The check is implicit and self-judged — fine for fluid, exploratory work, less reliable when "done" has a concrete shape.

Anthropic's Managed Agents Outcomes (API feature) is the closest neighbor: rubric defined before work, grader in a fresh context, capped iteration. If you're building on the API, use Outcomes — that's its home.

This skill is the Claude Code-native version of the same idea, with four opinionated additions Outcomes doesn't have:

|                                                | /goal             | Outcomes  | verified-task        |
|------------------------------------------------|-------------------|-----------|----------------------|
| Verifier written before work                   | no                | yes       | yes                  |
| Grader in fresh context                        | no (self-judged)  | yes       | yes (Agent subagent) |
| Iteration cap                                  | open-ended        | 3 default | 3 hard               |
| Deterministic / subjective / hybrid routing    |                   |           | yes                  |
| Per-assertion sanity check via counterexamples |                   |           | yes                  |
| Escalation pause on repeated identical failure |                   |           | yes                  |
| Explicit cap-hit best-attempt output format    |                   |           | yes                  |
| Runs as a Claude Code skill (no API plumbing)  | yes               | no        | yes                  |

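The four rows that only verified-task fills in (routing, the counterexample sanity check, the escalation pause, and the cap-hit output format) are the actual contribution. Each came from a failure mode observed in earlier versions; the per-improvement design docs in plan/ (#1–#6) record the reasoning, and the evals/ harness exercises them against a labeled corpus.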

Why this exists

In normal LLM use, you ask for something and get an answer. Verification — whether the answer actually meets the requirements — is something the human does (or doesn't) after the fact. This skill flips the order: it makes verification the spine of the work, not an afterthought.

The conceptual root is evals. An eval is a structured way to measure whether an AI output is good. Most people first meet evals as a development-time tool: you change a prompt, rerun the eval, see if the score went up. But the same machinery works at runtime — the agent's "am I done yet?" signal can be the eval itself. That's what this skill operationalizes.

Two patterns do most of the work:

  1. Routed evaluation. Before doing the work, classify how it will be checked. Some criteria are deterministic (does the function return the right value? does the tweet contain a digit?) and can be verified by code. Others are subjective (is the tone warm? is the docstring clear?) and need an LLM judge. Many real goals are hybrid — the skill runs both and gates on both. Deterministic checks are free, fast, and don't drift; reserve the more expensive judge for the parts code can't capture.

  2. Independent judging. When an LLM judge is needed, it runs in a separate subagent with fresh context. A model grading its own work in the same context is consistently lenient — the same blind spot that caused a flaw will hide it. A fresh subagent that sees only the artifact and the rubric (not the goal, not the reasoning, not the prior attempts) is the cheapest way to get honest scores.
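
A minimal sketch of how the two patterns combine into one gate. The `judge` callable is a stand-in for the fresh-context subagent (here it receives only the artifact), and the `pass_score` threshold of 4 is an assumption for illustration; neither is the skill's actual interface.

```python
from typing import Callable, Optional

def gate(
    artifact: str,
    checks: dict[str, Callable[[str], bool]],
    judge: Optional[Callable[[str], int]] = None,
    pass_score: int = 4,  # assumed threshold, not taken from the skill
) -> tuple[bool, list[str], Optional[int]]:
    """Run every deterministic check, then the judge if the route has one."""
    failed = [name for name, check in checks.items() if not check(artifact)]
    # Deterministic-only routes pass judge=None and never spawn a judge.
    score = judge(artifact) if judge is not None else None
    passed = not failed and (score is None or score >= pass_score)
    return passed, failed, score

checks = {"has_digit": lambda s: any(ch.isdigit() for ch in s)}
print(gate("Ships in 3 days", checks, judge=lambda s: 5))  # (True, [], 5)
```

The judge still runs when a deterministic check fails, so a failed attempt can report both numbers, as in the per-attempt summary shown earlier.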

The loop

  1. Classify the verification approach: deterministic, subjective, or hybrid.
  2. Write the verifier first — and verify the verifier. For each deterministic assertion, write a targeted counterexample and confirm that assertion fires (a silently-passing assertion is worse than no verifier). For each subjective rubric, calibrate by judging one known-bad and one known-good example, requiring coordinated divergence, before trusting it on real work.
  3. Iterate up to 3 attempts. Each attempt: do the work, run the verifier(s), record what passed/failed. If the same check fails on attempts 1 and 2, escalate to the user with options (keep trying, relax the verifier, abort) instead of silently burning attempt 3.
  4. Return the verified output, or — if the cap was hit — the best attempt with an explicit "did not verify: " caveat. Best is (deterministic passed DESC, judge score DESC, most_recent); the output is a per-attempt summary so the comparison is visible at a glance.

Key invariants:

  • Once written, the verifier is locked: the model cannot quietly edit it mid-loop to paper over a failure. The user can authorize a relaxation through the escalation question, but it never happens silently.
  • LLM judges always run in a separate subagent. No inline self-judging.
  • Cap at 3 iterations. Past that, the goal probably needs human input.
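
The deterministic half of the loop can be sketched in a few lines. This is a simplified illustration under stated assumptions: `generate` stands in for the model producing an attempt, all names are hypothetical, and the judge and the escalation pause are left out.

```python
from typing import Callable

Check = Callable[[str], bool]

def assert_checks_fire(checks: dict[str, Check], counterexamples: dict[str, str]) -> None:
    """Verify the verifier: each check must fail the counterexample written for it."""
    for name, check in checks.items():
        if check(counterexamples[name]):
            raise ValueError(f"check {name!r} passed its counterexample; it never fires")

def run_loop(
    generate: Callable[[int], str],
    checks: dict[str, Check],
    counterexamples: dict[str, str],
    max_attempts: int = 3,
) -> tuple[bool, str, list[tuple[int, str, list[str]]]]:
    assert_checks_fire(checks, counterexamples)  # done before any work starts
    history = []
    for attempt in range(1, max_attempts + 1):
        artifact = generate(attempt)
        failed = [name for name, check in checks.items() if not check(artifact)]
        history.append((attempt, artifact, failed))
        if not failed:
            return True, artifact, history
    # Cap hit: fewest failed checks wins, the later attempt breaks ties.
    _, best, _ = min(history, key=lambda h: (len(h[2]), -h[0]))
    return False, best, history

checks = {
    "short": lambda s: len(s) <= 5,
    "has_digit": lambda s: any(ch.isdigit() for ch in s),
}
counterexamples = {"short": "far too long", "has_digit": "abc"}
drafts = {1: "abcdefgh", 2: "abc", 3: "ab1"}

verified, artifact, history = run_loop(lambda n: drafts[n], checks, counterexamples)
print(verified, artifact)  # True ab1
```

The checks are built once and only read inside the loop, which is the code-level version of the locked-verifier invariant.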

Project layout

verified-task/
├── README.md           # this file — design rationale + how to use
├── CLAUDE.md           # invariants and live-edit notes for Claude Code
├── skill/
│   └── SKILL.md        # the skill itself (the product — what gets installed)
├── plan/               # per-improvement design docs (#0–#6)
├── evals/              # eval harness — prompts, judge, runner, labeled corpus
│   └── README.md       # how the eval harness works
└── workspace/          # per-run skill workspaces, gitignored

skill/SKILL.md is the entire product. Everything else (plan/, evals/, workspace/) is the methodology and test bed behind it.

Status

The skill works end-to-end. An eval harness (evals/) measures it against 10 prompts × 3 runs and validates a Sonnet 4.6 judge against hand labels; see evals/README.md. The seven planned improvements from the original roadmap have all landed — per-item design docs are preserved under plan/, and individual commits carry their before/after notes. Future work happens in new plan items rather than this section.

License

MIT — see LICENSE.
