tau is a small CLI for running a staged SWE workflow:
- `generate` mines a commit and creates a task.
- `solve` runs a solver against that task.
- `compare` scores two saved solutions by changed-line similarity.
- `eval` compares multiple solutions with an LLM judge.
- `delete` removes saved task artifacts.
- `private-submit` validates and stores a signed private miner submission.
- `serve-submissions-api` accepts private miner submissions over HTTP.
- `validate` runs the live king-of-the-hill validator loop.
- `restore-r2-kings` republishes the validator dashboard's recent king window.
The canonical miner-editable harness is a single file in the public
unarbos/ninja repository.
tau owns task generation, Docker execution, validation, scoring, and managed
inference; ninja is only the base agent for miners to edit.
- `agent.py` (plus comments and docs for miners)
- no task generators, validator code, pm2 configs, wallets, task pool tooling, or R2 helpers
For local tests you can run either the published ninja repo or a local clone:

```
source .venv/bin/activate
tau solve --task my-task --solution ninja-main --agent unarbos/ninja
tau solve --task my-task --solution local-ninja --agent ../ninja
```

`agent.py` must define:

```python
def solve(repo_path: str, issue: str, model: str, api_base: str, api_key: str) -> dict:
    ...
```

and should return `patch`, `logs`, `steps`, `cost`, and `success`.
model, api_base, and api_key are always provided by the validator and must
be treated as read-only invocation parameters.
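The contract above can be sketched as a minimal no-op agent. This stub is illustrative only (a real miner agent would call the validator-managed endpoint via `api_base`/`api_key` to produce a patch); the specific return values are assumptions beyond the documented keys.

```python
# Hypothetical minimal agent.py satisfying the solve(...) contract described above.
def solve(repo_path: str, issue: str, model: str, api_base: str, api_key: str) -> dict:
    # model, api_base, and api_key come from the validator; treat them as read-only.
    return {
        "patch": "",  # unified diff of the proposed fix (empty: no change proposed)
        "logs": f"no-op agent ran against {repo_path}",
        "steps": 0,
        "cost": 0.0,
        "success": False,  # this stub never solves anything
    }
```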
In production, miners do not submit code through public GitHub PRs. They submit
their agent.py privately to the validator API, and the validator stores a
private bundle under private-submissions/<submission-id>/. The validator tracks
the private bundle id and file hash internally:
```
private-submission:<submission-id>:<sha256-of-agent.py>
```
The private submission route rejects submissions that:
- change the `solve(...)` contract
- hardcode or import external model/provider credentials
- override provider routing (`api_base`, `api_key`, or `model`)
- set sampling/decoding params (`temperature`, `top_p`, `top_k`, `seed`, penalties, `logprobs`, etc.)
- add direct network/provider calls intended to bypass the validator-managed proxy
- fail Python compile or pyflakes smoke checks
- fail the OpenRouter private submission judge
The miner must sign this payload with the submitting hotkey:
```
tau-private-submission-v1:<hotkey>:<submission-id>:<sha256-of-agent.py>
```
The validator verifies that signature before queueing the private bundle, so a different miner cannot copy someone else's private code.
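Building the payload string is straightforward; a sketch follows (the helper name is illustrative, and actually signing the payload with the hotkey is wallet tooling outside this snippet):

```python
import hashlib
from pathlib import Path

def signature_payload(hotkey: str, submission_id: str, agent_path: str) -> str:
    """Build the exact v1 string the miner's hotkey must sign."""
    digest = hashlib.sha256(Path(agent_path).read_bytes()).hexdigest()
    return f"tau-private-submission-v1:{hotkey}:{submission_id}:{digest}"
```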
You can still test a local agent from any GitHub repo for research, e.g.:

```
source .venv/bin/activate
tau solve --task my-task --solution shared --agent owner/repo
```

or:

```
source .venv/bin/activate
tau solve --task my-task --solution shared --agent https://github.com/owner/repo
```

Production miner submissions should use the private submission API, not GitHub PRs or raw owner/repo@sha commitments.
- Python 3.11+
- `uv`
- Docker
- A GitHub token for task generation
- An OpenRouter API key for Docker file solves and evaluation
- A Cursor API key for Cursor solves
From the tau/ directory:

```
source .venv/bin/activate
uv pip install -e .
```

Create a `.env` file in tau/ if you do not already have one:

```
GITHUB_TOKEN=your_github_token
OPENROUTER_API_KEY=your_openrouter_api_key
CURSOR_API_KEY=your_cursor_api_key
```

tau loads `.env` automatically from the project root.
Optional environment defaults for centralized solver routing:

```
OPENROUTER_BASE_URL=https://openrouter.ai/api/v1
SOLVER_MAX_REQUESTS=40
SOLVER_MAX_TOTAL_TOKENS=200000
SOLVER_MAX_PROMPT_TOKENS=160000
SOLVER_MAX_COMPLETION_TOKENS=40000
SOLVER_MAX_TOKENS_PER_REQUEST=4096
SOLVER_MAX_COST=1.00
```

CLI flags still override these values for one-off runs.
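The documented precedence (CLI flag, then environment default, then a built-in value) can be sketched with a small helper; the function name and built-in defaults here are assumptions, not tau internals:

```python
import os

def resolve_limit(cli_value, env_var: str, default):
    # CLI flag wins outright.
    if cli_value is not None:
        return cli_value
    # Otherwise fall back to the .env/environment default, coerced to the
    # default's type, and finally to the built-in value.
    raw = os.environ.get(env_var)
    if raw is None:
        return default
    return type(default)(raw)
```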
The live validator scores private miner edits from local bundle storage. Miners
send agent.py, hotkey, submission id, and hotkey signature over a private
operator-controlled channel. The operator stores the bundle with:
```
tau private-submit \
  --hotkey <miner-hotkey> \
  --agent /path/to/submitted-agent.py \
  --base-agent /path/to/current-public-agent.py \
  --signature <hotkey-signature> \
  --private-submission-root /secure/private-submissions \
  --network finney
```

The command prints JSON with the private submission id/hash, the signature payload, `ci_checks`, and the raw `llm_judge` result. If the operator already knows the current registration block, `--registration-block` can be supplied instead of doing the chain lookup.
To serve the miner-facing private submission API behind ninja66.ai, run:
```
tau serve-submissions-api \
  --host 127.0.0.1 \
  --port 8066 \
  --base-agent /path/to/current-public-agent.py \
  --private-submission-root /secure/private-submissions \
  --max-request-bytes 5000000 \
  --max-agent-bytes 5000000 \
  --rate-limit-max-requests 6 \
  --rate-limit-max-failures 3 \
  --network finney
```

The HTTP API accepts `POST /api/submissions` as multipart form data with `agent`, `hotkey`, `submission_id` (optional), and `signature`. It returns the same acceptance JSON as `private-submit`; on failures, `ci_checks` and `llm_judge` are included in the body alongside a non-2xx status. Accepted submissions refresh the public accepted-submissions payload at sn66/api/submissions, which is exposed as https://ninja66.ai/api/submissions by the same R2/domain mapping used for the dashboard. The validator queues accepted API submissions directly from the private ledger after rechecking that the hotkey is still registered.
Run this API behind nginx, Cloudflare, or an equivalent edge proxy. The Python server rejects oversized submissions, limits concurrent expensive checks, and rate-limits each client IP, but network-layer floods should be absorbed before they reach the validator host.
private-submit accepts and stores at most one valid bundle for a hotkey's
current registration block. It records accepted submissions in
_accepted_submissions.json under the private submission root; a second valid
bundle from the same hotkey is rejected until the hotkey re-registers and the
registration block advances. The validator also re-checks registration status
before queueing an accepted API submission.
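The one-accepted-bundle-per-registration rule can be sketched as a ledger check. The real `_accepted_submissions.json` likely stores more fields; the function name and JSON shape here are assumptions:

```python
import json
from pathlib import Path

def try_accept(ledger_path: Path, hotkey: str, registration_block: int, submission_id: str) -> bool:
    """Accept at most one bundle per hotkey per registration block."""
    ledger = json.loads(ledger_path.read_text()) if ledger_path.exists() else {}
    prior = ledger.get(hotkey)
    # Reject unless this is the first acceptance, or the hotkey re-registered
    # at a later block since its last accepted submission.
    if prior is not None and registration_block <= prior["registration_block"]:
        return False
    ledger[hotkey] = {"registration_block": registration_block, "submission_id": submission_id}
    ledger_path.write_text(json.dumps(ledger))
    return True
```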
The validator only queues the private submission when all of these match:
- the submission comes from a registered subnet hotkey
- the hotkey has not already used an accepted submission in its current registration
- the private submission gate has accepted no other bundle from this hotkey in its current registration
- the private bundle exists under the configured private submission root
- `agent.py` hashes to the committed SHA256
- the bundle hotkey matches the submitting hotkey
- the hotkey signature verifies for the submitted payload
- local checks are green: `Agent Smoke`, `Submission Scope Guard`, and `OpenRouter Submission Judge`
A miner can resubmit from the same hotkey only after it is freshly registered again. Accepted API submissions are treated as spent for the hotkey's current registration period; submissions from an older registration are ignored after the hotkey re-registers.
Private bundles are checked against validator-side API gates:

- `Agent Smoke` compiles `agent.py` and runs pyflakes.
- `Submission Scope Guard` rejects edits that break the solve contract or attempt forbidden provider/sampling control.
- `OpenRouter Submission Judge` reviews the diff with the private submission gatekeeping prompt through OpenRouter using `anthropic/claude-opus-4.7`, temperature 0, medium reasoning effort, and a required score above `JUDGE_MIN_SCORE`.
The validator keeps two independent 50-task pools: a primary pool for the first challenger-vs-king duel, and a retest pool used only when the challenger wins the primary duel. Promotion requires the challenger to also win the retest, which checks the improvement on a separate task set before changing the king. Parallel duels run the full gathered task set instead of stopping early once an outcome is mathematically decided. By default both pools are static fixed-size sets: once each pool reaches 50 tasks, the validator reuses that same ordered set until the king changes or an operator explicitly enables pool refresh.
The production validator continuously drains queued candidates in queue order and refreshes accepted API submissions every 10 minutes, adding newly eligible private submissions to the queue. Each duel can run up to 25 round workers with challenger agent timeouts capped at 600 seconds. If a challenger hits 5 consecutive round timeouts, the validator stops submitting new rounds for that challenger and moves on after its already-running rounds finish.
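The consecutive-timeout cutoff can be sketched as a check on the trailing run of round results. The result labels and function name are assumptions for illustration:

```python
def should_stop_submitting(round_results: list[str], streak_limit: int = 5) -> bool:
    """True once the trailing run of "timeout" results reaches streak_limit."""
    streak = 0
    for result in reversed(round_results):
        if result != "timeout":
            break  # a completed round resets the consecutive-timeout count
        streak += 1
    return streak >= streak_limit
```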
When a private challenger becomes king, the validator publishes the winning
agent.py directly to the configured public base repo, records the king as the resulting base
repo commit while keeping the miner hotkey metadata, flushes the old task
pool, and assigns all validator weight to the winning hotkey on the next
allowed weight-set epoch.
The background pool filler pre-solves tasks before challengers arrive. It caps
Cursor and king pool solves at 300 seconds, skips timed-out or empty Cursor
baselines, and the duel gatherer preserves the cached task order so every
challenger sees the same sequence.
With the default settings, once the primary and retest pools are full they stay
static at 50 tasks each. Scheduled recycling is disabled unless
--task-pool-refresh-count and --task-pool-refresh-interval-seconds are set
to non-zero values.
start_validator.sh enables this production path with:
```
--solver-model minimax/minimax-m2.7 \
--solver-provider-only minimax/fp8,minimax/highspeed \
--round-concurrency 25 \
--candidate-timeout-streak-limit 5 \
--poll-interval-seconds 600 \
--task-pool-target 50 \
--task-pool-static \
--task-pool-fill-from-saved \
--task-pool-refresh-count 0 \
--task-pool-refresh-interval-seconds 0 \
--duel-rounds 50 \
--win-margin 3 \
--hotkey-spent-since-block 8104340 \
--pool-filler-concurrency 25 \
--watch-private-submissions \
--private-submission-only \
--publish-repo unarbos/ninja \
--publish-base main
```

`--private-submission-only` means normal unarbos/ninja@sha submissions are ignored by the live validator. This keeps miner submissions private until a challenger becomes king.
Each validation task still starts from a mined GitHub commit: task/original is the repo before the commit, task/reference is the repo after it, and task/reference.patch is used to filter out tiny tasks.
For duels, the scoring target is the Cursor baseline solution, saved as solutions/baseline. The pool filler runs Cursor and the current king on the same task, then stores the king's similarity to baseline. During a duel, the challenger is also compared to baseline.
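One way to picture "changed-line similarity" is a sequence ratio over the changed lines of two patches; tau's exact metric may tokenise, weight, or match files differently, so treat this as an illustrative reading:

```python
import difflib

def changed_line_similarity(changed_a: list[str], changed_b: list[str]) -> float:
    """Similarity of two patches' changed lines, in [0, 1]."""
    return difflib.SequenceMatcher(a=changed_a, b=changed_b).ratio()
```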
Round score is now blended: 1/2 Cursor-baseline similarity plus 1/2 LLM diff judgment. The diff judge uses openai/gpt-5.4 through OpenRouter at temperature 0 with medium reasoning effort and a 16000-token output cap, then scores the king and challenger patches against the task/reference context.
Cursor is only the measuring stick. The challenger does not need to beat Cursor directly; it only needs more decisive round wins than the current king plus the configured margin. start_validator.sh currently uses --win-margin 3.
The validator still compares king to challenger separately for copy detection, but that pairwise similarity does not replace the Cursor baseline scoring target.
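The blended round score and the win-margin decision above can be sketched as follows. Both scores are assumed normalised to [0, 1], and the strict-inequality reading of "more decisive round wins than the current king plus the configured margin" is an assumption:

```python
def round_score(baseline_similarity: float, judge_score: float) -> float:
    # Half Cursor-baseline similarity, half LLM diff judgment.
    return 0.5 * baseline_similarity + 0.5 * judge_score

def challenger_promoted_in_duel(challenger_wins: int, king_wins: int, win_margin: int = 3) -> bool:
    # Challenger must exceed the king's decisive wins by more than the margin.
    return challenger_wins > king_wins + win_margin
```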
Docker file agents receive a validator-managed OpenAI-compatible endpoint through solve(..., model, api_base, api_key). The upstream provider key is never passed into miner code.
The proxy forwards to OpenRouter and enforces:
- the validator-selected model, currently `deepseek/deepseek-v4-flash` for solver inference unless overridden by validator config
- `temperature=0.0`
- `top_p=1.0`
- removal of miner-controlled sampling fields such as `top_k`, `seed`, penalties, `logit_bias`, and `logprobs`
- request, token, and cost budgets
Miner agents should use only the supplied api_base and api_key. Attempts to choose another provider, model, sampling policy, or credential path are rejected by ninja CI and overwritten or stripped by the validator proxy.
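The kind of rewriting the proxy applies can be sketched as a request sanitizer. This is illustrative of the enforcement listed above, not tau's code; field names beyond those documented are assumptions:

```python
FORBIDDEN_SAMPLING_FIELDS = {
    "top_k", "seed", "presence_penalty", "frequency_penalty",
    "logit_bias", "logprobs", "top_logprobs",
}

def sanitize_request(body: dict, managed_model: str) -> dict:
    # Strip miner-controlled sampling fields entirely.
    cleaned = {k: v for k, v in body.items() if k not in FORBIDDEN_SAMPLING_FIELDS}
    # Overwrite routing and sampling policy with validator-managed values.
    cleaned["model"] = managed_model
    cleaned["temperature"] = 0.0
    cleaned["top_p"] = 1.0
    return cleaned
```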
Show top-level help:

```
source .venv/bin/activate
tau --help
```

All commands write their artifacts under:

```
workspace/tasks/
```

You can override that with `--workspace-root /path/to/root`.
```
source .venv/bin/activate
tau generate --task my-task
```

Useful options:
- `--generator-model <model>`
- `--seed <int>`
- `--max-mining-attempts <int>`
- `--agent-timeout <seconds>`
- `--debug`
`solve` supports multiple backends. The `--agent` value can be:
- `cursor` to run the Cursor CLI in Docker
- `claude` to run the local Claude CLI on the host
- `claw` to run the local Claw CLI on the host
- a local `agent.py` file for the Docker file solver
- a local repo root containing `agent.py` for the Docker file solver
- a GitHub repo URL or shorthand like `owner/repo` for the Docker file solver
Example using Cursor:

```
source .venv/bin/activate
tau solve --task my-task --solution cursor-run --agent cursor
```

Example using Claude:

```
source .venv/bin/activate
tau solve --task my-task --solution claude-run --agent claude
```

Example using Claw:

```
source .venv/bin/activate
tau solve --task my-task --solution claw-run --agent claw
```

Example using the public ninja harness:

```
source .venv/bin/activate
tau solve --task my-task --solution baseline --agent unarbos/ninja
```

Example using a local checkout of ninja:

```
source .venv/bin/activate
tau solve --task my-task --solution baseline --agent ../ninja
```

Useful options:
- `--solver-model <model>`
- `--baseline-model <model>`
- `--solver-max-requests <int>`
- `--solver-max-total-tokens <int>`
- `--solver-max-prompt-tokens <int>`
- `--solver-max-completion-tokens <int>`
- `--solver-max-tokens-per-request <int>`
- `--solver-max-cost <float>`
- `--solver-provider-sort price|throughput|latency`
- `--solver-provider-only <provider[,provider...]>`
- `--solver-provider-disable-fallbacks`
- `--solver-provider-min-throughput-p50 <float>`
- `--solver-provider-min-throughput-p90 <float>`
- `--docker-solver-memory 2g`
- `--docker-solver-cpus 2`
- `--docker-solver-no-cache`
- `--agent-timeout <seconds>`
- `--debug`
Compare two saved solutions using changed-lines-only similarity:

```
source .venv/bin/activate
tau compare --task my-task --solutions cursor-run baseline
```

Comma-separated values also work:

```
source .venv/bin/activate
tau compare --task my-task --solutions cursor-run,baseline
```

Compare two or more solutions for the same task:

```
source .venv/bin/activate
tau eval --task my-task --solutions baseline candidate-a candidate-b
```

Comma-separated values also work:

```
source .venv/bin/activate
tau eval --task my-task --solutions baseline,candidate-a,candidate-b
```

Useful options:
- `--eval-model <model>`
- `--seed <int>`
- `--agent-timeout <seconds>`
- `--debug`
Delete one task:

```
source .venv/bin/activate
tau delete --task my-task
```

Delete all saved tasks:

```
source .venv/bin/activate
tau delete task --all
```

```
source .venv/bin/activate
tau generate --task demo-task
tau solve --task demo-task --solution run-1 --agent cursor
tau solve --task demo-task --solution run-2 --agent unarbos/ninja
tau compare --task demo-task --solutions run-1 run-2
tau eval --task demo-task --solutions run-1 run-2
```

When you pass a local file, local repo directory, or GitHub repo to `--agent`, tau builds a small Python Docker image, imports `agent.py`, and calls its `solve(...)` function.
- A Docker image (`swe-eval/file-solver:<hash>`) is built from `python:3.11-slim`.
- A container starts with resource limits (memory, CPU, pids, tmpfs).
- The task repo is copied into the container at `/work/repo`.
- The submitted `agent.py` is copied into the container and imported.
- The validator calls `solve(repo_path="/work/repo", issue=..., model=..., api_base=..., api_key=...)` with the managed model id, local proxy URL, and per-run proxy token.
- The diff is collected from the container and applied back to the host repo.
- The container is torn down.
The submitted agent does not receive the upstream OpenRouter key. On Linux the solver container runs with Docker network disabled and reaches the validator proxy through a local socket bridge, so LLM calls flow through one managed endpoint.
When you pass --agent cursor, tau builds a Docker image, runs the Cursor CLI inside it, and collects the resulting diff.
- A Docker image (`swe-eval/cursor-solver:<hash>`) is built from `python:3.11-slim` with the Cursor CLI installed via `curl https://cursor.com/install | bash`.
- A container starts with resource limits (memory, CPU, pids, tmpfs).
- The task repo is copied into the container at `/work/repo` and the prompt is written to `/work/task.txt`.
- The Cursor `agent` CLI runs inside the container with `CURSOR_API_KEY` injected:

```
agent -p --force --trust --sandbox disabled --output-format stream-json \
  --workspace /work/repo "$PROMPT"
```

- The diff is collected from the container and applied back to the host repo.
- The container is torn down.
```
source .venv/bin/activate
tau solve --task my-task --solution cursor-run --agent cursor
```

`CURSOR_API_KEY` must be set in your environment or in `tau/.env`.
| Flag | Purpose |
|---|---|
| `--solver-model <model>` | Override the model used by Cursor |
| `--agent-timeout <seconds>` | Time limit for the solve |
| `--docker-solver-memory 2g` | Container memory limit |
| `--docker-solver-cpus 2` | Container CPU limit |
| `--docker-solver-no-cache` | Force rebuild the Docker image |
| `--debug` | Enable debug logging |
- `generate` needs `GITHUB_TOKEN` or `GH_TOKEN`.
- `tau solve --agent cursor` needs `CURSOR_API_KEY` and Docker.
- `tau solve --agent claude` needs the `claude` CLI installed on the host.
- `tau solve --agent claw` needs the `claw` CLI installed on the host.
- Docker file solves and `eval` need `OPENROUTER_API_KEY`.
- `compare` reads saved solution artifacts and does not call a model.
- Docker-backed solves use Docker, so Docker must be installed and running.
- Generated task, solution, and evaluation paths are printed by the CLI after each command finishes.