Terminal Task Dataset

Curated terminal environment tasks in Harbor format for RL training of coding agents.

Dataset Summary

Metric	Value
Total tasks (v0)	8,876
Accepted tasks (v5)	5,659
Sources	Endless Terminals, Nemotron Terminal, SETA
Format	Harbor (task.toml + instruction.md + Dockerfile + test.sh)

Source Breakdown (accepted)

Source	Tasks	Description
Endless Terminals	2,340	Diverse terminal tasks, already Harbor format
Nemotron Terminal	2,991	NVIDIA synthetic skill-based tasks
SETA	328	Converted from YAML-based custom format

Task Format

Each task directory contains:

task_name/
  task.toml              # Config: version, metadata, verifier, agent, environment
  instruction.md         # Natural language task description
  environment/
    Dockerfile           # Container definition
  tests/
    test.sh              # Verifier — writes reward to /logs/verifier/reward.txt
  solution/              # Optional
    solve.sh             # Reference solution
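
As a rough illustration of the layout above, the following sketch checks a task directory for the four required files. The helper name `missing_files` is hypothetical and not part of Harbor or this dataset. `solution/solve.sh` is treated as optional, as noted in the tree.

```python
from pathlib import Path

# Required files relative to the task directory, taken from the layout above.
# solution/solve.sh is optional and therefore not listed.
REQUIRED = (
    "task.toml",
    "instruction.md",
    "environment/Dockerfile",
    "tests/test.sh",
)


def missing_files(task_dir):
    """Return the required relative paths that are absent from task_dir."""
    root = Path(task_dir)
    return [rel for rel in REQUIRED if not (root / rel).is_file()]
```

An empty list means the directory has every required file. This checks presence only and says nothing about whether the contents are valid.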

The verifier (tests/test.sh) writes its results under /logs/verifier/:

/logs/verifier/reward.txt — plain text (1 for pass, 0 for fail)
/logs/verifier/reward.json — JSON with numeric metrics
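
A minimal sketch of how a training harness might read these files back after a run. The function name `read_reward` is hypothetical, and the sketch assumes that either file may be absent and that `reward.txt` holds a single number.

```python
import json
from pathlib import Path


def read_reward(verifier_dir="/logs/verifier"):
    """Return (reward, metrics); each element is None if its file is absent."""
    root = Path(verifier_dir)
    reward, metrics = None, None

    txt = root / "reward.txt"
    if txt.is_file():
        # Plain text: 1 for pass, 0 for fail.
        reward = float(txt.read_text().strip())

    js = root / "reward.json"
    if js.is_file():
        # JSON object with numeric metrics.
        metrics = json.loads(js.read_text())

    return reward, metrics
```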

Version History

Tag	Description
`v0_unified`	Initial unification of all 3 sources (8,876 tasks)
`v1_fixed`	Round 1 LLM fix agent on 2,579 FIXABLE tasks (76.8% applied)
`v2_retriaged`	Round 1 retriage — 5,159 ACCEPTED
`v2.5_fresh_triage`	Fresh triage for 402 stale tasks
`v3_fixed_r2`	Round 2 fix agent on 1,060 FIXABLE tasks (72.1% applied)
`v4_retriaged_r2`	Round 2 retriage — 5,659 ACCEPTED
`v5_accepted_flat`	Latest — flat structure, accepted tasks only, no reports

Tags v0 through v4 preserve the full pipeline history with triage/fix reports. Tag v5_accepted_flat (HEAD) is the clean release for downstream use.

Pipeline

Unify — Convert all sources into Harbor format
Triage — LLM-powered semantic review (MiniMax-M2.5, 229B MoE)
Fix — LLM-powered targeted edits for FIXABLE tasks
Retriage — Re-evaluate fixed tasks
Release — Filter to ACCEPTED, flatten structure
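
The control flow of these stages can be sketched as below. This is an illustration only, not the actual pipeline code: `run_pipeline`, `triage`, and `fix` are hypothetical names, the tasks are assumed to be already unified, and a fix that could not be applied is modeled as `fix` returning `None`.

```python
ACCEPTED, FIXABLE = "ACCEPTED", "FIXABLE"


def run_pipeline(tasks, triage, fix, rounds=2):
    """Triage, then fix and retriage FIXABLE tasks, then keep ACCEPTED ones.

    tasks:  dict mapping task name to task content (already unified)
    triage: callable(task) returning a verdict string
    fix:    callable(task) returning a fixed task, or None if no fix applied
    """
    tasks = dict(tasks)  # work on a copy, leave the caller's dict untouched
    verdicts = {name: triage(task) for name, task in tasks.items()}

    for _ in range(rounds):
        for name, verdict in list(verdicts.items()):
            if verdict != FIXABLE:
                continue
            fixed = fix(tasks[name])
            if fixed is None:
                continue  # fix not applied, verdict stays FIXABLE
            tasks[name] = fixed
            verdicts[name] = triage(fixed)  # retriage the fixed task

    # Release: filter to ACCEPTED tasks only.
    return {n: t for n, t in tasks.items() if verdicts[n] == ACCEPTED}
```

The default of two rounds mirrors the two fix and retriage rounds listed in the version history.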

Triage Methodology

Each task was reviewed first by a deterministic structural checker and then by an LLM semantic reviewer. The LLM reviewer looks for:

Hidden constraints not communicated in instructions
Instruction-verifier misalignment
Weak verifier bypass vectors
Data leakage or answer exposure
Impossible or ambiguous success criteria

Variance analysis over three independent triage runs showed ~62% pairwise agreement on per-task verdicts, while the aggregate verdict distributions stayed stable within ±2%.
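
To make the two quantities concrete, here is a small sketch of how pairwise agreement and the aggregate verdict distribution could be computed from several triage runs. The function names are hypothetical, and averaging the agreement rate over all pairs of runs is an assumption; the source does not say exactly how the figure was aggregated.

```python
from collections import Counter
from itertools import combinations


def pairwise_agreement(runs):
    """Mean fraction of tasks on which two runs give the same verdict.

    runs: list of dicts mapping task name to verdict, all with the same keys.
    """
    rates = []
    for a, b in combinations(runs, 2):
        same = sum(a[task] == b[task] for task in a)
        rates.append(same / len(a))
    return sum(rates) / len(rates)


def verdict_distribution(run):
    """Fraction of tasks assigned to each verdict in a single run."""
    total = len(run)
    return {v: c / total for v, c in Counter(run.values()).items()}
```

The two measures can diverge: runs may disagree on which individual tasks are accepted while still accepting roughly the same share overall.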

License

See individual task sources for licensing terms.
