Company-neutral tooling, prompts, examples, and agent instructions for authoring Harbor / Terminal-Bench style tasks with LLM assistance.
The repo is designed for people who want to create realistic terminal tasks, verify them with a golden solution, review instruction/test alignment, calibrate model pass rates, and package tasks without leaking private run artifacts.
Terminal task creation is easy to get wrong in quiet ways: the instruction drifts from the tests, the tests hardcode expected answers, the Docker image leaks verifier files, or the task is classified as medium/hard before anyone has measured model pass rates. This workflow makes the checks that catch those failures explicit and repeatable.
It supports a practical authoring loop:
- Draft a natural user-style instruction.
- Create a Harbor task skeleton.
- Build production-like model-visible inputs under environment/src/.
- Write dynamic verifier tests that recompute expected behavior from the inputs.
- Keep a deterministic golden solution.
- Run oracle, local checks, and LLM alignment review.
- Calibrate difficulty with a target/test model and a stronger reference model.
- Package only the artifacts that should be published or submitted.
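The "dynamic verifier" step in the loop above can be sketched as follows. This is a minimal illustration, not code from this repo: the JSONL input, the `service` field, the report format, and both function names are hypothetical. The point is only that the expected answer is derived from the model-visible inputs at test time instead of being hardcoded in the test.

```python
import json
from pathlib import Path


def expected_totals(input_path: Path) -> dict[str, int]:
    """Recompute expected per-service event counts directly from the JSONL input.

    The "service" field is a hypothetical example, not a schema from this repo.
    """
    totals: dict[str, int] = {}
    for line in input_path.read_text().splitlines():
        if not line.strip():
            continue  # tolerate blank lines in the input
        record = json.loads(line)
        totals[record["service"]] = totals.get(record["service"], 0) + 1
    return totals


def check_report(input_path: Path, report_path: Path) -> bool:
    """Compare the solver's JSON report with values derived from the inputs."""
    report = json.loads(report_path.read_text())
    return report == expected_totals(input_path)
```

Because the expected values are recomputed, changing the files under environment/src/ changes what the test expects, so a solution that memorizes one fixed answer would not pass.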
- AGENTS.md: the operational guide for an LLM coding agent working in this repo.
- docs/: human-readable workflow, quality, calibration, packaging, and review guidance.
- tools/: small local helpers around Harbor task creation, validation, run orchestration, review, debug, and zipping.
- prompts/: reusable LLM review prompts.
- templates/: starting points for task state, task.toml, tests/test.sh, and pytest files.
- examples/: sanitized full task packages with golden-solution-log oracle artifacts only, plus task history notes.
Two complete example packages are included:
- examples/incident-drill-packet-builder-medium/: application-development task that fixes an incident drill packet generator across YAML, JSONL, TOML, and schema inputs.
- examples/pipeline-bottleneck-profiler-medium/: build-and-deployment task that analyzes CI run exports and produces a deterministic bottleneck report.
Each example includes the full task package and a small model-run-logs/golden-solution-log/ oracle artifact so readers can inspect the expected Harbor log shape. Target/test model logs and strong/reference model logs are intentionally not checked in.
Install Harbor and make sure Docker is available, then create a package skeleton:
python3 tools/new_task.py init-package \
--domain application-development \
--slug incident-drill-packet-builder \
--difficulty medium
Implement the task, then run local checks:
python3 tools/validate_task.py my_tasks/<package>/Tasks/<domain>/<task-dir>
python3 tools/package_validator.py my_tasks/<package>
Run the golden solution through Harbor:
tools/harbor_stage.sh --package-root my_tasks/<package> --stage oracle
Calibrate difficulty only after the task is correct. Use an oracle run for the golden solution, a target/test model for the model you are trying to evaluate or improve, and a stronger reference model as a solvability check:
tools/harbor_stage.sh --package-root my_tasks/<package> --stage model-run --model "$TARGET_MODEL" --k 10 --n 10 --r 3
tools/harbor_stage.sh --package-root my_tasks/<package> --stage model-run --model "$REFERENCE_MODEL" --k 10 --n 10 --r 3
The pass-rate bands live in profiles/default.yaml; adjust them to match your benchmark objective.
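To make the idea of pass-rate bands concrete, here is a small sketch of turning per-trial outcomes into a difficulty label. The band thresholds and the function names below are illustrative assumptions only; they are not the contents or schema of profiles/default.yaml, and the real bands should be read from that file.

```python
# Hypothetical bands: (label, minimum pass rate), checked from highest to lowest.
# These numbers are placeholders, NOT the values in profiles/default.yaml.
EXAMPLE_BANDS = [("easy", 0.7), ("medium", 0.3), ("hard", 0.0)]


def pass_rate(results: list[bool]) -> float:
    """Fraction of trials that passed; refuses to guess on an empty run."""
    if not results:
        raise ValueError("no trial results to calibrate on")
    return sum(results) / len(results)


def classify(rate: float, bands: list[tuple[str, float]] = EXAMPLE_BANDS) -> str:
    """Return the first band whose minimum pass rate the measured rate reaches."""
    for label, minimum in bands:
        if rate >= minimum:
            return label
    raise ValueError(f"pass rate {rate} is below every band")


# Example: 4 passes out of 10 target-model trials.
target_results = [True] * 4 + [False] * 6
label = classify(pass_rate(target_results))
```

Under these placeholder bands, a target model that passes 4 of 10 trials would land in the "medium" band. The label comes from measured runs, which is why calibration happens only after the oracle run succeeds.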
For the full process, read docs/workflow.md.
- docs/workflow.md: end-to-end task creation phases.
- docs/quality-checklist.md: anti-cheating and verifier quality checklist.
- docs/calibration.md: target/reference model pass-rate calibration.
- docs/llm-review.md: how to run and interpret the LLM reviewer.
- docs/packaging.md: final package validation and ZIP creation.
- docs/publishing-safety.md: checks to run before publishing examples.
- Create open Terminal-Bench task examples from scratch.
- Use LLM coding agents to draft, build, review, and iterate on tasks.
- Maintain task state across long calibration loops.
- Calibrate task difficulty against a target/test model and a strong/reference model.
- Package task artifacts while keeping private logs and credentials out of source control.
Do not commit real .env files, private model transcripts, private assignment material, generated ZIPs, or local jobs/ and checks/ directories. The checked-in examples keep only golden oracle logs, not target/reference model logs.
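As a rough illustration of the rule above, the following sketch walks a directory tree and reports paths that match the name-based items in that list. It is a hypothetical helper, not one of the scripts in tools/, and it cannot recognize private transcripts or assignment material by name, so it complements rather than replaces the checks in docs/publishing-safety.md.

```python
from pathlib import Path

# Name-based patterns taken from the rule above; the matching logic is an assumption.
FORBIDDEN_FILE_NAMES = {".env"}
FORBIDDEN_DIR_NAMES = {"jobs", "checks"}
FORBIDDEN_SUFFIXES = {".zip"}


def find_private_artifacts(root: Path) -> list[Path]:
    """Return paths (relative to root) that should not be committed."""
    hits: list[Path] = []
    for path in sorted(root.rglob("*")):
        relative = path.relative_to(root)
        if path.is_dir():
            if path.name in FORBIDDEN_DIR_NAMES:
                hits.append(relative)
        elif path.name in FORBIDDEN_FILE_NAMES or path.suffix.lower() in FORBIDDEN_SUFFIXES:
            hits.append(relative)
    return hits
```

Calling `find_private_artifacts(Path("."))` from the repo root before committing returns an empty list when none of the listed patterns are present.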