Simple terminal-using agents.

💻 Code · 🤗 Models & Data · 📜 Paper · 📓 Blog

Tmax is our project for training simple, powerful terminal-using agents. This codebase covers data generation, training, and evaluation. We use it to train our 'tmax' series of models, which achieve strong performance; please refer to our paper for details.

Below, we give a quick overview of the codebase and how to use it.

News

Initial release of the codebase and models! Please read our paper for more details.

What's here

The repo is organised around four stages of building a terminal agent:

| Stage | Where | What it does |
| --- | --- | --- |
| Data generation | `rl_data/` | A simple, scalable, diverse, difficulty-aware pipeline for synthesising terminal-agent tasks, solving them at pass@k, analysing the corpus, and publishing it to the Hugging Face Hub. Tasks are sampled as an independent product of structured axes and packaged as self-contained Apptainer/Docker environments with programmatic verifiers. |
| Agent | `Vanillux2Agent/` | The Harbor agent used for solving and evaluation: a direct LiteLLM agent built on the vanillux prompt harness (mini-SWE-agent-derived prompts, bash tool schema, submit marker, format-error recovery, and output truncation), executing commands through Harbor's active environment. |
| Training | `training/open-instruct/` | A fork of open-instruct with fixes for Qwen 3.5 and terminal-agent training. SFT and DPPO RL launch scripts for the tmax models live under `training/open-instruct/scripts/tmax/`. |
| Evaluation | `scripts/` + `beaker_configs/` | Shell/Slurm launchers and a Beaker pipeline that serves a model with vLLM and runs Harbor datasets (Terminal-Bench, TB-Lite, SWE-bench) against it. |
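
To make "an independent product of structured axes" concrete, here is a minimal Python sketch. The axis names and values (`domain`, `difficulty`, `tooling`) are hypothetical placeholders, not the axes used in `rl_data/` (those are described in rl_data/README.md). The point is only that each axis is drawn independently, so the space of task specs is the Cartesian product of the axes.

```python
import itertools
import random

# Hypothetical axes for illustration only; the real axes live in rl_data/.
AXES = {
    "domain": ["text-processing", "git", "networking"],
    "difficulty": ["easy", "medium", "hard"],
    "tooling": ["coreutils", "python", "sqlite"],
}


def sample_task_spec(rng: random.Random) -> dict[str, str]:
    """Draw one value per axis, independently of the other axes."""
    return {axis: rng.choice(values) for axis, values in AXES.items()}


def all_task_specs() -> list[dict[str, str]]:
    """Enumerate the full Cartesian product of the axes."""
    names = list(AXES)
    return [dict(zip(names, combo)) for combo in itertools.product(*AXES.values())]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the draw is reproducible
    print(sample_task_spec(rng))
    print(len(all_task_specs()), "possible specs")
```

Because the axes are independent, adding one value to any axis multiplies the number of reachable specs rather than adding to it, which is one way a small set of axes can yield a large, diverse corpus.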

Quickstart

Python is run via uv; all commands run from the repo root.

# Install dependencies
uv sync

Generate task data

# 1. generate a small task corpus
NUM_TASKS=10 OUT_DIR=rl_data/output/tasks_smoke \
    bash rl_data/scripts/generate_tasks/run_generate_tasks.sh

# 2. solve the tasks with an LLM agent at pass@k
TASKS_DIR=rl_data/output/tasks_smoke \
    bash rl_data/scripts/generate_solutions/run_generate_solutions.sh

# 3. analyse pass@k + composition/balance stats
TASKS_DIR=rl_data/output/tasks_smoke \
    bash rl_data/scripts/analyze/run_analyze.sh

rl_data/README.md is the entry point for the data side of the project: it walks through the compositional sampler (the orthogonal axes that define a task), the four-stage pipeline (generate_tasks → generate_solutions → analyze → upload_to_hf), the corpus kinds (legacy / sft_v2 / rl_v2), the SFT warm-start set, and how to evaluate on our published Harbor dataset. Start there for anything data-related.
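
Steps 2 and 3 above report results at pass@k. This README does not say which estimator the analysis stage uses; a common choice is the unbiased estimator from Chen et al. (2021), pass@k = 1 - C(n-c, k) / C(n, k) for n attempts of which c succeed. The sketch below assumes that estimator, and the function names are illustrative rather than taken from `rl_data/`.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n attempts with c successes, passes."""
    if not 0 <= c <= n:
        raise ValueError("need 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k attempts failed, giving pass@k = 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: dict[str, list[bool]], k: int) -> float:
    """Average pass@k over tasks; `results` maps task name -> per-attempt outcomes."""
    scores = [pass_at_k(len(r), sum(r), k) for r in results.values()]
    return sum(scores) / len(scores)
```

For example, a task solved in 1 of 5 attempts has pass@1 = 0.2 and pass@5 = 1.0 under this estimator.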

Train a model

SFT and DPPO RL are run via the open-instruct fork in training/open-instruct/. Launch scripts for the tmax models live under training/open-instruct/scripts/tmax/ (SFT/ and RL/):

# from training/open-instruct/, e.g. RL on Qwen3.5-4B
bash scripts/tmax/RL/qwen35_4b.sh <beaker-image>

See training/open-instruct/scripts/tmax/README.md for how to read the scripts (mason.py launcher vs. the underlying training command) and how to run them off-cluster.

Evaluate a model

Run a Harbor dataset against a locally served model on Beaker:

./beaker_configs/launch_eval.sh allenai/open_instruct_dev \
    --revision sft_qwen3_4b_tmax_4node \
    --name sft-4b \
    --dataset terminal-bench@2.0

See scripts/beaker/README.md for the full eval pipeline, flags, and troubleshooting.

Evaluate manually

To run an eval yourself on a node with GPUs (no Beaker), serve the model with vLLM and point Harbor at it:

# 1. serve the model on localhost:8008 (a separate shell/process)
uvx vllm==0.19.1 serve allenai/tmax-9b \
    --served-model-name tmax-9b \
    --enable-auto-tool-choice --tool-call-parser qwen3_xml \
    --tensor-parallel-size 8 --port 8008

# 2. set the Daytona API key
export DAYTONA_API_KEY='xxx'

# 3. run a Harbor dataset against the local vLLM endpoint
uv run harbor run \
  --dataset terminal-bench@2.0 \
  --env daytona \
  --agent-import-path Vanillux2Agent:Vanillux2Agent \
  --model openai/tmax-9b \
  --agent-kwarg api_base=http://localhost:8008/v1 \
  --agent-kwarg max_format_errors=64 \
  --n-concurrent 16 \
  -k 5 \
  --job-name swerl-qwen32-2b-tmax-step100-vanillux2-daytona-tblite

Feel free to swap out daytona for your sandboxing backend of choice.
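
Before launching the Harbor run, it can help to confirm that the vLLM server is up and serving the expected model name. vLLM's OpenAI-compatible server lists served models at `/v1/models`; the sketch below queries that route using only the standard library. The helper names are illustrative, and the values in the usage block assume the `api_base` and `--served-model-name` from the commands above.

```python
import json
import urllib.error
import urllib.request


def models_url(api_base: str) -> str:
    """Build the model-listing URL from an OpenAI-style base such as http://localhost:8008/v1."""
    return api_base.rstrip("/") + "/models"


def served_models(api_base: str, timeout: float = 5.0) -> list[str]:
    """Return the model ids the endpoint reports, or [] if it is unreachable."""
    try:
        with urllib.request.urlopen(models_url(api_base), timeout=timeout) as resp:
            payload = json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        # Server not started yet, connection refused, or a non-JSON reply.
        return []
    return [m["id"] for m in payload.get("data", [])]


if __name__ == "__main__":
    ids = served_models("http://localhost:8008/v1")
    print("tmax-9b ready" if "tmax-9b" in ids else f"not ready, endpoint reports: {ids}")
```

Note that Harbor is given `openai/tmax-9b` (the `openai/` prefix selects LiteLLM's OpenAI-compatible provider), while the endpoint itself reports the bare served name `tmax-9b`.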

Task data on Harbor

We also ship the full 15k task corpus in Harbor format so you can run it out of the box. It's published on the Harbor registry as tmax/TMax-15K-Harbor (public): the legacy 10k self-contained tasks plus 5k newer intricate multi-modal tasks. Every task ships as a self-contained Harbor environment with a programmatic verifier, so you can evaluate any agent/model on it directly — no need to regenerate or build anything from rl_data/.

# evaluate an agent on a 10-task subset using the Daytona cloud sandbox
# (no local Docker required — swap `--env daytona` for `--env docker` to build locally)
export DAYTONA_API_KEY='xxx'
uv run harbor run -d "tmax/TMax-15K-Harbor@latest" \
    --agent terminus-2 --model "<model>" \
    --env daytona -l 10

Drop -l to run the full set, or use -i/-x to include/exclude tasks by name. Per-task rewards and agent/verifier logs land under jobs/<job-name>/. See rl_data/README.md for the full set of run options.

Requirements

- uv for dependency management (deps pinned in pyproject.toml / uv.lock).
- An LLM API key for the configured model (e.g. GEMINI_API_KEY); local vLLM / Ollama / OpenAI-compatible endpoints are also supported via env vars.
- (for data generation) apptainer on PATH for building and running task containers.
- (for training) A Docker Hub login and personal access token (PAT). Pulling images from Docker Hub at large scale will probably require a business account.
- HF_TOKEN for the Hugging Face upload stage and for pulling gated models.
- (to evaluate on the published tmax/TMax-15K-Harbor dataset) a model API key for your chosen agent plus a container runtime: Docker locally, or DAYTONA_API_KEY for the Daytona sandbox on Docker-less hosts.
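
The requirements above can be checked up front. The following is a minimal preflight sketch and is not part of the repo: the stage names (`data`, `upload`, `eval`) and the grouping of requirements per stage are one reading of the list above, and `GEMINI_API_KEY` stands in for whichever LLM key your configured model needs.

```python
import os
import shutil

# Requirements per stage, transcribed from the list above (the grouping is an assumption).
REQUIRED_TOOLS = {"all": ["uv"], "data": ["apptainer"]}
REQUIRED_ENV = {"data": ["GEMINI_API_KEY"], "upload": ["HF_TOKEN"]}


def missing_requirements(stage, env=None, which=shutil.which):
    """Return human-readable problems for `stage`; an empty list means ready."""
    env = os.environ if env is None else env
    problems = []
    for tool in REQUIRED_TOOLS["all"] + REQUIRED_TOOLS.get(stage, []):
        if which(tool) is None:
            problems.append(f"{tool} not found on PATH")
    for var in REQUIRED_ENV.get(stage, []):
        if not env.get(var):
            problems.append(f"{var} is not set")
    # Evaluation needs a container runtime: local Docker or the Daytona sandbox.
    if stage == "eval" and which("docker") is None and not env.get("DAYTONA_API_KEY"):
        problems.append("need docker on PATH or DAYTONA_API_KEY set")
    return problems


if __name__ == "__main__":
    for stage in ("data", "upload", "eval"):
        issues = missing_requirements(stage)
        print(f"{stage}: {'ok' if not issues else '; '.join(issues)}")
```

The `env` and `which` parameters exist only so the check can be exercised without touching the real environment.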

Licensing

This codebase is licensed under Apache 2.0 as given in LICENSE.

Citation

If you use this codebase or models in your research, please cite our paper:

@misc{ivison2026tmaxsimplerecipeterminal,
      title={Tmax: A simple recipe for terminal agents}, 
      author={Hamish Ivison and Junjie Oscar Yin and Rulin Shao and Teng Xiao and Nathan Lambert and Hannaneh Hajishirzi},
      year={2026},
      eprint={2606.23321},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2606.23321}, 
}
