lemegetonV/terminal-bench-task-workflow


Terminal-Bench Task Workflow

Company-neutral tooling, prompts, examples, and agent instructions for authoring Harbor / Terminal-Bench style tasks with LLM assistance.

The repo is designed for people who want to create realistic terminal tasks, verify them with a golden solution, review instruction/test alignment, calibrate model pass rates, and package tasks without leaking private run artifacts.

Why This Exists

Terminal task creation tends to go wrong in quiet ways: the instruction drifts from the tests, the tests hardcode expected answers, the Docker image leaks verifier files, or the task is labeled medium or hard before anyone has measured model pass rates. This workflow turns each of those checks into an explicit, repeatable step.

It supports a practical authoring loop:

  1. Draft a natural user-style instruction.
  2. Create a Harbor task skeleton.
  3. Build production-like model-visible inputs under environment/src/.
  4. Write dynamic verifier tests that recompute expected behavior from the inputs.
  5. Keep a deterministic golden solution.
  6. Run oracle, local checks, and LLM alignment review.
  7. Calibrate difficulty with a target/test model and a stronger reference model.
  8. Package only the artifacts that should be published or submitted.
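
Step 4 is where hardcoded answers tend to creep in. The following is a minimal sketch of a verifier helper that recomputes the expected result from the model-visible input instead of comparing against a constant. It assumes a hypothetical task whose input is a JSONL file of job records and whose output is a JSON report naming the slowest job; the field names (`name`, `duration_s`, `slowest_job`) and both function names are illustrative and are not taken from the example packages.

```python
import json
from pathlib import Path


def recompute_slowest_job(input_path: Path) -> str:
    """Derive the expected answer from the input file, not from a constant."""
    jobs = [
        json.loads(line)
        for line in input_path.read_text().splitlines()
        if line.strip()  # tolerate blank lines in the JSONL file
    ]
    # Hypothetical schema: each record has "name" and "duration_s".
    return max(jobs, key=lambda job: job["duration_s"])["name"]


def check_report(input_path: Path, report_path: Path) -> bool:
    """Compare the submitted report against the recomputed expectation."""
    report = json.loads(report_path.read_text())
    return report.get("slowest_job") == recompute_slowest_job(input_path)
```

A pytest test could call `check_report` with the task's input and the solver's output. Because the expectation is derived at test time, changing the input data changes the expected answer with it, which makes a memorized or copied answer much less likely to pass.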

What Is Included

  • AGENTS.md: the operational guide for an LLM coding agent working in this repo.
  • docs/: human-readable workflow, quality, calibration, packaging, and review guidance.
  • tools/: small local helpers around Harbor task creation, validation, run orchestration, review, debug, and zipping.
  • prompts/: reusable LLM review prompts.
  • templates/: starting points for task state, task.toml, tests/test.sh, and pytest files.
  • examples/: sanitized full task packages that ship only the golden-solution-log oracle artifacts, plus task history notes.

Example Tasks

Two complete example packages are included:

  • examples/incident-drill-packet-builder-medium/: application-development task that fixes an incident drill packet generator across YAML, JSONL, TOML, and schema inputs.
  • examples/pipeline-bottleneck-profiler-medium/: build-and-deployment task that analyzes CI run exports and produces a deterministic bottleneck report.

Each example includes the full task package and a small model-run-logs/golden-solution-log/ oracle artifact so readers can inspect the expected Harbor log shape. Target/test model logs and strong/reference model logs are intentionally not checked in.

Quick Start

Install Harbor and make sure Docker is available, then create a package skeleton:

python3 tools/new_task.py init-package \
  --domain application-development \
  --slug incident-drill-packet-builder \
  --difficulty medium

Implement the task, then run local checks:

python3 tools/validate_task.py my_tasks/<package>/Tasks/<domain>/<task-dir>
python3 tools/package_validator.py my_tasks/<package>

Run the golden solution through Harbor:

tools/harbor_stage.sh --package-root my_tasks/<package> --stage oracle

Calibrate difficulty only after the task is correct. Use an oracle run for the golden solution, a target/test model for the model you are trying to evaluate or improve, and a stronger reference model as a solvability check:

tools/harbor_stage.sh --package-root my_tasks/<package> --stage model-run --model "$TARGET_MODEL" --k 10 --n 10 --r 3
tools/harbor_stage.sh --package-root my_tasks/<package> --stage model-run --model "$REFERENCE_MODEL" --k 10 --n 10 --r 3

The pass-rate bands live in profiles/default.yaml; adjust them to match your benchmark objective.
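
As a rough illustration of how pass-rate bands can turn run results into a difficulty label, here is a small sketch. The band edges and labels are made-up placeholders, not the contents of profiles/default.yaml, and the `classify` function is hypothetical rather than part of tools/.

```python
from typing import Sequence

# Hypothetical bands: (label, inclusive lower bound, exclusive upper bound).
# These numbers are placeholders; use the values from your own profile.
EXAMPLE_BANDS = [
    ("hard", 0.0, 0.3),
    ("medium", 0.3, 0.7),
    ("easy", 0.7, 1.01),  # upper bound above 1.0 so a perfect rate is included
]


def classify(results: Sequence[bool], bands=EXAMPLE_BANDS) -> tuple[float, str]:
    """Return (pass_rate, label) for a list of per-run pass/fail outcomes."""
    if not results:
        raise ValueError("need at least one run to compute a pass rate")
    rate = sum(results) / len(results)
    for label, low, high in bands:
        if low <= rate < high:
            return rate, label
    raise ValueError(f"pass rate {rate} is not covered by any band")


if __name__ == "__main__":
    # Four passes out of ten runs for the target model.
    print(classify([True] * 4 + [False] * 6))
```

A low target-model rate only says something about difficulty if the reference model solves the task at least some of the time, so it is worth classifying both result sets and reading them together.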

For the full process, read docs/workflow.md.

Core Docs

  • docs/workflow.md: end-to-end task creation phases.
  • docs/quality-checklist.md: anti-cheating and verifier quality checklist.
  • docs/calibration.md: target/reference model pass-rate calibration.
  • docs/llm-review.md: how to run and interpret the LLM reviewer.
  • docs/packaging.md: final package validation and ZIP creation.
  • docs/publishing-safety.md: checks to run before publishing examples.

Use Cases

  • Create open Terminal-Bench task examples from scratch.
  • Use LLM coding agents to draft, build, review, and iterate on tasks.
  • Maintain task state across long calibration loops.
  • Calibrate task difficulty against a target/test model and a strong/reference model.
  • Package task artifacts while keeping private logs and credentials out of source control.

Safety

Do not commit real .env files, private model transcripts, private assignment material, generated ZIPs, or local jobs/ and checks/ directories. The checked-in examples keep only golden oracle logs, not target/reference model logs.
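
The path-based part of this rule can be checked mechanically before a commit. The sketch below walks a directory tree and reports `.env` files, ZIP archives, and `jobs/` or `checks/` directories. It is an illustrative helper, not one of the scripts in tools/, and the function and constant names are assumptions. It cannot recognize private transcripts or assignment material by content, so those still need a manual review.

```python
import fnmatch
from pathlib import Path

# Patterns drawn from the safety list above; extend them for your own layout.
FORBIDDEN_FILE_PATTERNS = [".env", "*.zip"]
FORBIDDEN_DIR_NAMES = {"jobs", "checks"}


def find_unsafe_paths(root: Path) -> list[Path]:
    """Return paths (relative to root) that should not be committed."""
    unsafe = []
    # Path.rglob("*") also yields dotfiles, unlike a shell glob.
    for path in sorted(root.rglob("*")):
        rel = path.relative_to(root)
        if path.is_dir():
            if path.name in FORBIDDEN_DIR_NAMES:
                unsafe.append(rel)
            continue
        if any(fnmatch.fnmatch(path.name, pat) for pat in FORBIDDEN_FILE_PATTERNS):
            unsafe.append(rel)
    return unsafe
```

Calling `find_unsafe_paths(Path("."))` from the repository root and failing when the list is non-empty is one way to wire this into a pre-commit hook or CI step.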

About

Company-neutral workflow kit for creating, reviewing, calibrating, and packaging Harbor / Terminal-Bench tasks with LLM agents.
