Terminal-Bench Task Workflow

Company-neutral tooling, prompts, examples, and agent instructions for authoring Harbor / Terminal-Bench style tasks with LLM assistance.

The repo is designed for people who want to create realistic terminal tasks, verify them with a golden solution, review instruction/test alignment, calibrate model pass rates, and package tasks without leaking private run artifacts.

Why This Exists

Terminal task creation is easiest to get wrong in quiet ways: the instruction drifts from the tests, the tests hardcode expected answers, the Docker image leaks verifier files, or the task is classified as medium/hard before anyone has measured model pass rates. This workflow makes those steps explicit and repeatable.

It supports a practical authoring loop:

Draft a natural user-style instruction.
Create a Harbor task skeleton.
Build production-like model-visible inputs under environment/src/.
Write dynamic verifier tests that recompute expected behavior from the inputs.
Keep a deterministic golden solution.
Run oracle, local checks, and LLM alignment review.
Calibrate difficulty with a target/test model and a stronger reference model.
Package only the artifacts that should be published or submitted.
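
The "dynamic verifier" step above can be illustrated with a minimal sketch. The file names, record fields, and function names here are hypothetical and not taken from this repo; the only point is that the expected value is recomputed from the model-visible input instead of being hardcoded in the test.

```python
import json
import tempfile
from pathlib import Path


def expected_slowest_stage(runs_path: Path) -> str:
    """Recompute the expected answer from the input data (hypothetical schema)."""
    totals: dict[str, float] = {}
    for line in runs_path.read_text().splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        totals[record["stage"]] = totals.get(record["stage"], 0.0) + record["seconds"]
    # Tie-break on stage name so the expectation is deterministic.
    return min(totals, key=lambda stage: (-totals[stage], stage))


def check_report(runs_path: Path, report_path: Path) -> bool:
    """Compare the solution's report against the recomputed expectation."""
    report = json.loads(report_path.read_text())
    return report.get("slowest_stage") == expected_slowest_stage(runs_path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        runs = Path(tmp) / "runs.jsonl"
        report = Path(tmp) / "report.json"
        runs.write_text(
            '{"stage": "build", "seconds": 40}\n'
            '{"stage": "test", "seconds": 95}\n'
            '{"stage": "build", "seconds": 30}\n'
        )
        report.write_text(json.dumps({"slowest_stage": "test"}))
        print(check_report(runs, report))
```

Because the check derives its expectation from the input file, swapping in different input data does not require editing the test, and a solution that simply echoes a memorized answer would fail on new inputs.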

What Is Included

AGENTS.md: the operational guide for an LLM coding agent working in this repo.
docs/: human-readable workflow, quality, calibration, packaging, and review guidance.
tools/: small local helpers around Harbor task creation, validation, run orchestration, review, debug, and zipping.
prompts/: reusable LLM review prompts.
templates/: starting points for task state, task.toml, tests/test.sh, and pytest files.
examples/: sanitized full task packages that keep only the golden-solution-log oracle artifacts, plus task history notes.

Example Tasks

Two complete example packages are included:

examples/incident-drill-packet-builder-medium/: application-development task that fixes an incident drill packet generator across YAML, JSONL, TOML, and schema inputs.
examples/pipeline-bottleneck-profiler-medium/: build-and-deployment task that analyzes CI run exports and produces a deterministic bottleneck report.

Each example includes the full task package and a small model-run-logs/golden-solution-log/ oracle artifact so readers can inspect the expected Harbor log shape. Target/test model logs and strong/reference model logs are intentionally not checked in.

Quick Start

Install Harbor and make sure Docker is available, then create a package skeleton:

python3 tools/new_task.py init-package \
  --domain application-development \
  --slug incident-drill-packet-builder \
  --difficulty medium

Implement the task, then run local checks:

python3 tools/validate_task.py my_tasks/<package>/Tasks/<domain>/<task-dir>
python3 tools/package_validator.py my_tasks/<package>

Run the golden solution through Harbor:

tools/harbor_stage.sh --package-root my_tasks/<package> --stage oracle

Calibrate difficulty only after the task is correct. Calibration compares three kinds of run: the oracle run of the golden solution, runs with the target/test model you are trying to evaluate or improve, and runs with a stronger reference model that serve as a solvability check:

tools/harbor_stage.sh --package-root my_tasks/<package> --stage model-run --model "$TARGET_MODEL" --k 10 --n 10 --r 3
tools/harbor_stage.sh --package-root my_tasks/<package> --stage model-run --model "$REFERENCE_MODEL" --k 10 --n 10 --r 3

The pass-rate bands live in profiles/default.yaml; adjust them to match your benchmark objective.
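
As a rough illustration of how pass-rate bands might turn measured runs into a difficulty label, here is a minimal sketch. The thresholds, labels, and function names are invented for the example; the real bands and their schema live in profiles/default.yaml and are not reproduced here.

```python
from typing import Optional

# Hypothetical bands, highest threshold first: (label, minimum pass rate).
# These numbers are placeholders, not the values in profiles/default.yaml.
EXAMPLE_BANDS = [
    ("easy", 0.7),
    ("medium", 0.3),
    ("hard", 0.0),
]


def pass_rate(results: list[bool]) -> float:
    """Fraction of runs that passed the verifier."""
    if not results:
        raise ValueError("need at least one run")
    return sum(results) / len(results)


def classify(rate: float, bands=EXAMPLE_BANDS) -> Optional[str]:
    """Return the first band whose minimum pass rate is met."""
    for label, minimum in bands:
        if rate >= minimum:
            return label
    return None


if __name__ == "__main__":
    target_runs = [True] * 4 + [False] * 6  # made-up outcomes for 10 runs
    rate = pass_rate(target_runs)
    print(rate, classify(rate))
```

Under these placeholder bands, a task the target model solves in 4 of 10 runs would land in the middle band. The reference model's pass rate would be checked separately, since its role is to confirm the task is solvable at all.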

For the full process, read docs/workflow.md.

Core Docs

docs/workflow.md: end-to-end task creation phases.
docs/quality-checklist.md: anti-cheating and verifier quality checklist.
docs/calibration.md: target/reference model pass-rate calibration.
docs/llm-review.md: how to run and interpret the LLM reviewer.
docs/packaging.md: final package validation and ZIP creation.
docs/publishing-safety.md: checks to run before publishing examples.

Use Cases

Create open Terminal-Bench task examples from scratch.
Use LLM coding agents to draft, build, review, and iterate on tasks.
Maintain task state across long calibration loops.
Calibrate task difficulty against a target/test model and a strong/reference model.
Package task artifacts while keeping private logs and credentials out of source control.

Safety

Do not commit real .env files, private model transcripts, private assignment material, generated ZIPs, or local jobs/ and checks/ directories. The checked-in examples keep only golden oracle logs, not target/reference model logs.
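
One way to enforce this list mechanically is a small path filter run before committing. The sketch below is an assumption about how such a check could look, not a tool shipped in this repo; the deny-list and the is_blocked name are hypothetical, and it covers only the items that can be recognized from a path (not transcripts or assignment material).

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Hypothetical deny-list derived from the paragraph above; adjust to your repo.
BLOCKED_DIRS = {"jobs", "checks"}
BLOCKED_NAME_PATTERNS = [".env", ".env.*", "*.zip"]
# Assumption: a placeholder env file with no real secrets is safe to commit.
ALLOWED_NAMES = {".env.example"}


def is_blocked(path: str) -> bool:
    """Return True if a repo-relative path looks like it should not be committed."""
    parts = PurePosixPath(path).parts
    if not parts:
        return False
    name = parts[-1]
    if name in ALLOWED_NAMES:
        return False
    # Anything nested under a local jobs/ or checks/ directory is blocked.
    if any(part in BLOCKED_DIRS for part in parts[:-1]):
        return True
    return any(fnmatchcase(name, pattern) for pattern in BLOCKED_NAME_PATTERNS)


if __name__ == "__main__":
    staged = [".env", ".env.example", "docs/workflow.md", "my_tasks/pkg/jobs/run1/log.txt"]
    for path in staged:
        print(path, "BLOCKED" if is_blocked(path) else "ok")
```

In practice the list of paths could come from `git diff --cached --name-only`, with the commit rejected if any path is blocked. A .gitignore remains the first line of defense; a check like this only catches files that were force-added or staged before the ignore rule existed.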
