Curated terminal environment tasks in Harbor format for RL training of coding agents.
| Metric | Value |
|---|---|
| Total tasks (v0) | 8,876 |
| Accepted tasks (v5) | 5,659 |
| Sources | Endless Terminals, Nemotron Terminal, SETA |
| Format | Harbor (task.toml + instruction.md + Dockerfile + test.sh) |

| Source | Accepted tasks (v5) | Description |
|---|---|---|
| Endless Terminals | 2,340 | Diverse terminal tasks, already in Harbor format |
| Nemotron Terminal | 2,991 | NVIDIA synthetic skill-based tasks |
| SETA | 328 | Converted from YAML-based custom format |
Each task directory contains:
    task_name/
      task.toml          # Config: version, metadata, verifier, agent, environment
      instruction.md     # Natural language task description
      environment/
        Dockerfile       # Container definition
      tests/
        test.sh          # Verifier: writes reward to /logs/verifier/reward.txt
      solution/          # Optional
        solve.sh         # Reference solution
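The layout above can be checked mechanically. The following is a minimal sketch, not the dataset's actual structural checker: the function name `missing_task_files` and the choice to treat `solution/` as unchecked are assumptions made for illustration.

```python
from pathlib import Path

# Required paths, taken from the layout above.
# solution/ is optional, so it is deliberately not checked here.
REQUIRED_FILES = (
    "task.toml",
    "instruction.md",
    "environment/Dockerfile",
    "tests/test.sh",
)


def missing_task_files(task_dir):
    """Return the required relative paths that are absent from task_dir.

    Hypothetical helper: an empty list means the directory has the
    minimum Harbor layout described in this README.
    """
    root = Path(task_dir)
    return [rel for rel in REQUIRED_FILES if not (root / rel).is_file()]
```

Calling `missing_task_files("task_name")` on a complete task directory should return an empty list; anything else names the files to add.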
The verifier (tests/test.sh) outputs a reward file:
- `/logs/verifier/reward.txt`: plain text (`1` for pass, `0` for fail)
- `/logs/verifier/reward.json`: JSON with numeric metrics
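A consumer of these files might read the reward as sketched below. This is an illustration under assumptions: the helper name `read_reward`, the preference for `reward.txt` over `reward.json` when both exist, and returning the JSON content as-is (its schema is not specified here) are all choices made for this example.

```python
import json
from pathlib import Path


def read_reward(verifier_dir="/logs/verifier"):
    """Read the verifier's reward from verifier_dir.

    Hypothetical helper. Returns a float parsed from reward.txt if that
    file exists, otherwise the parsed content of reward.json, otherwise
    None. Preferring the text file is an assumption of this sketch.
    """
    root = Path(verifier_dir)

    txt_path = root / "reward.txt"
    if txt_path.is_file():
        # Plain text: "1" for pass, "0" for fail.
        return float(txt_path.read_text().strip())

    json_path = root / "reward.json"
    if json_path.is_file():
        # JSON with numeric metrics; returned without interpretation.
        return json.loads(json_path.read_text())

    return None
```

An empty or non-numeric `reward.txt` raises `ValueError` in this sketch, which surfaces a broken verifier instead of silently scoring it as a failure.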
| Tag | Description |
|---|---|
| `v0_unified` | Initial unification of all 3 sources (8,876 tasks) |
| `v1_fixed` | Round 1 LLM fix agent on 2,579 FIXABLE tasks (76.8% applied) |
| `v2_retriaged` | Round 1 retriage: 5,159 ACCEPTED |
| `v2.5_fresh_triage` | Fresh triage for 402 stale tasks |
| `v3_fixed_r2` | Round 2 fix agent on 1,060 FIXABLE tasks (72.1% applied) |
| `v4_retriaged_r2` | Round 2 retriage: 5,659 ACCEPTED |
| `v5_accepted_flat` | Latest: flat structure, accepted tasks only, no reports |
Tags v0 through v4 preserve the full pipeline history with triage/fix reports.
Tag v5_accepted_flat (HEAD) is the clean release for downstream use.
- Unify — Convert all sources into Harbor format
- Triage — LLM-powered semantic review (MiniMax-M2.5, 229B MoE)
- Fix — LLM-powered targeted edits for FIXABLE tasks
- Retriage — Re-evaluate fixed tasks
- Release — Filter to ACCEPTED, flatten structure
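The final Release step (filter to ACCEPTED, then flatten) could look roughly like the sketch below. It is not the pipeline's actual code: the function `release_accepted`, the `verdicts` mapping from relative task path to label, and a nested `source/task_name` input layout are all assumptions for illustration.

```python
import shutil
from pathlib import Path


def release_accepted(src_root, verdicts, out_root):
    """Copy ACCEPTED tasks from src_root into a flat out_root/<task_name>/.

    Hypothetical helper. `verdicts` maps a task's path relative to
    src_root (e.g. "some_source/task_name") to its triage label.
    Returns the list of released task names.
    """
    src_root, out_root = Path(src_root), Path(out_root)
    out_root.mkdir(parents=True, exist_ok=True)

    released = []
    for rel_path, verdict in sorted(verdicts.items()):
        if verdict != "ACCEPTED":
            continue  # every other label is left out of the release
        task_dir = src_root / rel_path
        dest = out_root / task_dir.name  # flatten: drop the source prefix
        if dest.exists():
            # Two sources may reuse a task name; fail loudly, do not overwrite.
            raise FileExistsError(f"duplicate task name after flattening: {dest.name}")
        shutil.copytree(task_dir, dest)
        released.append(dest.name)
    return released
```

Raising on a name collision is a design choice of this sketch: flattening removes the source prefix, so silently overwriting would lose a task.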
Each task was reviewed by a deterministic structural checker followed by an LLM semantic reviewer. The LLM evaluates:
- Hidden constraints not communicated in instructions
- Instruction-verifier misalignment
- Weak verifier bypass vectors
- Data leakage or answer exposure
- Impossible or ambiguous success criteria
Variance analysis (3-way independent triage) showed ~62% pairwise agreement, with aggregate distributions stable within ±2%.
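Per-task agreement and aggregate stability measure different things, which is why the two figures above can coexist. The sketch below shows one plausible way to compute both; the exact method used for the reported numbers is not stated here, and the function names and toy verdicts are made up for illustration.

```python
from collections import Counter
from itertools import combinations


def pairwise_agreement(runs):
    """Mean share of tasks with identical verdicts, averaged over run pairs.

    `runs` is a list of dicts mapping task name to verdict label.
    Only tasks present in both runs of a pair are compared.
    """
    rates = []
    for a, b in combinations(runs, 2):
        shared = a.keys() & b.keys()
        if not shared:
            continue
        same = sum(a[task] == b[task] for task in shared)
        rates.append(same / len(shared))
    if not rates:
        raise ValueError("no overlapping tasks between any pair of runs")
    return sum(rates) / len(rates)


def verdict_shares(run):
    """Fraction of tasks per verdict label within a single run."""
    counts = Counter(run.values())
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


# Toy data (invented): three runs that disagree on individual tasks
# while each still accepts exactly half of them.
run_1 = {"t1": "ACCEPTED", "t2": "FIXABLE", "t3": "ACCEPTED", "t4": "FIXABLE"}
run_2 = {"t1": "ACCEPTED", "t2": "ACCEPTED", "t3": "FIXABLE", "t4": "FIXABLE"}
run_3 = {"t1": "ACCEPTED", "t2": "FIXABLE", "t3": "FIXABLE", "t4": "ACCEPTED"}

print(pairwise_agreement([run_1, run_2, run_3]))
print([verdict_shares(r) for r in (run_1, run_2, run_3)])
```

In the toy data each pair of runs agrees on only half of the tasks, yet every run yields the same label shares, mirroring how moderate per-task agreement can sit alongside stable aggregate distributions.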
See individual task sources for licensing terms.