LLM360/terminal-task-dataset

Terminal Task Dataset

Curated terminal environment tasks in Harbor format for RL training of coding agents.

Dataset Summary

| Metric | Value |
| --- | --- |
| Total tasks (v0) | 8,876 |
| Accepted tasks (v5) | 5,659 |
| Sources | Endless Terminals, Nemotron Terminal, SETA |
| Format | Harbor (task.toml + instruction.md + Dockerfile + test.sh) |

Source Breakdown (accepted)

| Source | Tasks | Description |
| --- | --- | --- |
| Endless Terminals | 2,340 | Diverse terminal tasks, already Harbor format |
| Nemotron Terminal | 2,991 | NVIDIA synthetic skill-based tasks |
| SETA | 328 | Converted from YAML-based custom format |

Task Format

Each task directory contains:

task_name/
  task.toml              # Config: version, metadata, verifier, agent, environment
  instruction.md         # Natural language task description
  environment/
    Dockerfile           # Container definition
  tests/
    test.sh              # Verifier — writes reward to /logs/verifier/reward.txt
  solution/              # Optional
    solve.sh             # Reference solution
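
As a rough illustration of the layout above, the following sketch checks that a task directory contains the four required files. The function name `missing_task_files` and the constant `REQUIRED_FILES` are hypothetical and not part of the dataset or of Harbor; `solution/solve.sh` is left out because it is optional.

```python
from pathlib import Path

# Required files per the task layout; solution/solve.sh is optional.
REQUIRED_FILES = (
    "task.toml",
    "instruction.md",
    "environment/Dockerfile",
    "tests/test.sh",
)


def missing_task_files(task_dir):
    """Return the required relative paths that are absent from task_dir."""
    root = Path(task_dir)
    return [rel for rel in REQUIRED_FILES if not (root / rel).is_file()]
```

An empty list means the directory has the expected structure; it says nothing about whether the contents of those files are valid.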

The verifier (tests/test.sh) writes its result under /logs/verifier/:

  • /logs/verifier/reward.txt — plain text (1 for pass, 0 for fail)
  • /logs/verifier/reward.json — JSON with numeric metrics
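
A minimal sketch of how a training harness might read these files after a run. It assumes reward.txt holds a single number and reward.json holds a JSON object; the README does not specify the JSON keys, so the sketch returns them unchanged. The function name `read_reward` is hypothetical.

```python
import json
from pathlib import Path


def read_reward(verifier_dir="/logs/verifier"):
    """Return (reward, metrics) from the verifier output directory.

    reward is a float parsed from reward.txt, or None if the file is absent.
    metrics is the parsed reward.json, or an empty dict if the file is absent.
    """
    root = Path(verifier_dir)

    reward = None
    txt = root / "reward.txt"
    if txt.is_file():
        reward = float(txt.read_text().strip())

    metrics = {}
    js = root / "reward.json"
    if js.is_file():
        metrics = json.loads(js.read_text())

    return reward, metrics
```

Treating a missing file as `None` or `{}` rather than raising lets the caller decide how to handle a verifier that wrote only one of the two files.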

Version History

| Tag | Description |
| --- | --- |
| v0_unified | Initial unification of all 3 sources (8,876 tasks) |
| v1_fixed | Round 1 LLM fix agent on 2,579 FIXABLE tasks (76.8% applied) |
| v2_retriaged | Round 1 retriage: 5,159 ACCEPTED |
| v2.5_fresh_triage | Fresh triage for 402 stale tasks |
| v3_fixed_r2 | Round 2 fix agent on 1,060 FIXABLE tasks (72.1% applied) |
| v4_retriaged_r2 | Round 2 retriage: 5,659 ACCEPTED |
| v5_accepted_flat | Latest: flat structure, accepted tasks only, no reports |

Tags v0 through v4 preserve the full pipeline history with triage/fix reports. Tag v5_accepted_flat (HEAD) is the clean release for downstream use.

Pipeline

  1. Unify — Convert all sources into Harbor format
  2. Triage — LLM-powered semantic review (MiniMax-M2.5, 229B MoE)
  3. Fix — LLM-powered targeted edits for FIXABLE tasks
  4. Retriage — Re-evaluate fixed tasks
  5. Release — Filter to ACCEPTED, flatten structure

Triage Methodology

Each task was reviewed by a deterministic structural checker followed by an LLM semantic reviewer. The LLM reviewer checks for:

  • Hidden constraints not communicated in instructions
  • Instruction-verifier misalignment
  • Weak verifier bypass vectors
  • Data leakage or answer exposure
  • Impossible or ambiguous success criteria

A variance analysis with three independent triage runs showed ~62% pairwise agreement between runs, while the aggregate verdict distributions stayed stable within ±2%.
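
The README does not say how pairwise agreement was computed. One common definition, sketched here as an assumption, is the fraction of tasks on which two runs give the same verdict, averaged over all pairs of runs. The function name `pairwise_agreement` and the task names and verdicts in the usage example are illustrative.

```python
from itertools import combinations


def pairwise_agreement(runs):
    """Mean fraction of tasks on which each pair of runs gives the same verdict.

    runs: list of dicts mapping task name -> verdict, all covering the same
    non-empty set of tasks.
    """
    rates = []
    for a, b in combinations(runs, 2):
        shared = a.keys() & b.keys()
        same = sum(1 for task in shared if a[task] == b[task])
        rates.append(same / len(shared))
    return sum(rates) / len(rates)


# Usage with three made-up triage runs over four tasks.
run1 = {"t1": "ACCEPTED", "t2": "FIXABLE", "t3": "ACCEPTED", "t4": "ACCEPTED"}
run2 = {"t1": "ACCEPTED", "t2": "ACCEPTED", "t3": "ACCEPTED", "t4": "FIXABLE"}
run3 = {"t1": "ACCEPTED", "t2": "FIXABLE", "t3": "FIXABLE", "t4": "ACCEPTED"}
agreement = pairwise_agreement([run1, run2, run3])  # pairs agree on 2/4, 3/4, 1/4 tasks
```

Per-task agreement can be moderate while aggregate counts stay stable, because disagreements in opposite directions cancel out in the totals.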

License

See individual task sources for licensing terms.
