Recursive Reasoning Models (RRM) provide an inference-time runtime for parallelizing language-model reasoning.
An RRM run treats a reasoning problem as an execution graph. Each node receives one task and emits one of two actions:
- `atomic`: answer the task directly.
- `child`/`await`: expose smaller subproblems, wait for their conclusions, then continue the same node.
The runtime parses streamed node events, launches ready child nodes immediately, tracks dependencies, and resumes parent nodes when awaited child conclusions are available. Independent child nodes run concurrently on same-model replicas.
The key idea is not agent orchestration. A node is not a separate persona or worker. A node is a scoped continuation of the same recursive execution policy.
Execution flow:

1. root task
2. node expands
3. children launch as soon as their JSONL events stream in
4. independent children run concurrently
5. parent awaits required child conclusions
6. parent continues from the child conclusions
7. final answer returns from the root
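The flow above can be sketched with `asyncio`. This is a simplified illustration, not the actual `rrm/executor.py` scheduler: the real one parses streamed node events and enforces `depends_on` readiness, which this sketch omits, and `solve`/`expand` here are stand-ins for model calls.

```python
# Simplified sketch of the recursive execution policy: a node either answers
# atomically or expands into children, awaits them, and continues. Independent
# children run concurrently under a global concurrency cap (the semaphore).
import asyncio

async def run_node(task, solve, expand, sem, depth=0, max_depth=1):
    children = expand(task) if depth < max_depth else None
    if not children:                  # atomic: answer the task directly
        async with sem:
            return await solve(task)
    # child/await: launch children concurrently on the same recursive policy
    results = await asyncio.gather(*[
        run_node(child, solve, expand, sem, depth + 1, max_depth)
        for child in children
    ])
    async with sem:                   # parent continues from child conclusions
        return await solve(f"{task} | children: {results}")

async def main():
    sem = asyncio.Semaphore(4)        # global concurrency limit

    async def solve(task):            # stand-in for a model call
        return f"FINAL: {task}"

    def expand(task):                 # stand-in for the model's planned split
        return ["case_a", "case_b"] if task == "root" else []

    return await run_node("root", solve, expand, sem)

print(asyncio.run(main()))
```

Note that the node is not a separate worker: `run_node` is the same policy invoked on a smaller scope, matching the "scoped continuation" framing above.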
Node event protocol:

```jsonl
{"type":"child","id":"case_a","task":"self-contained subproblem","depends_on":[]}
{"type":"child","id":"case_b","task":"self-contained subproblem","depends_on":[]}
{"type":"await","children":["case_a","case_b"],"rule":"how the child conclusions answer the task"}
{"type":"atomic","answer":"complete answer ending with FINAL: <answer>"}
{"type":"done"}
```

The runtime stores full node paths such as `root.case_a`, enforces dependency readiness, applies a global concurrency limit, and records trace events for node expansion, child execution, parent continuation, overlap, replica assignment, latency, and correctness.
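A lenient parser in this spirit can be sketched as follows. This is an illustration of the event-protocol handling, not the actual `rrm/streaming.py` implementation:

```python
# Sketch: extract well-formed JSON event lines from streamed model output,
# tolerating prose and partial lines, stopping at the terminal "done" event.
import json

def parse_events(stream_text):
    events = []
    for line in stream_text.splitlines():
        line = line.strip()
        if not line.startswith("{"):
            continue  # tolerate prose the model mixes into the stream
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate truncated or garbled lines
        events.append(event)
        if event.get("type") == "done":
            break     # ignore anything emitted after the terminal event
    return events

raw = """thinking out loud...
{"type":"child","id":"case_a","task":"subproblem","depends_on":[]}
{"type":"await","children":["case_a"],"rule":"combine"}
{"type":"done"}
{"type":"child","id":"late","task":"ignored"}"""

print([e["type"] for e in parse_events(raw)])  # → ['child', 'await', 'done']
```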
- `rrm/executor.py`: recursive streaming scheduler.
- `rrm/streaming.py`: lenient JSONL event parser.
- `rrm/vllm_backend.py`: OpenAI-compatible vLLM backend with streaming, reasoning controls, model-call tracing, and replica pooling.
- `rrm/hf_kv_backend.py`: Transformers backend that preserves parent-node KV state across `await` for continuation experiments.
- `rrm/analysis.py`: correctness, latency, trust-label, overlap, and replica utilization reports.
- `rrm/demo_renderer.py`: static HTML renderer for side-by-side direct vs RRM traces.
- `benchmarks/aime_2024_rrm_candidates.jsonl`: AIME task subset used for the current demo.
```shell
python3 -m pip install -e .
```

The core package uses the Python standard library. Backends add their own runtime dependencies:

- `vllm` for local OpenAI-compatible serving.
- `transformers` and `torch` for the HF KV continuation backend.
- `openai` for the OpenAI Responses backend.
Start four same-model vLLM replicas, one per GPU:
```shell
MODEL=Qwen/Qwen3-14B scripts/launch_vllm_4replicas.sh
```

Run direct AIME solving:
```shell
PYTHONPATH=. python3 -m rrm.cli bench benchmarks/aime_2024_rrm_candidates.jsonl \
  --backend vllm \
  --model Qwen/Qwen3-14B \
  --base-urls http://127.0.0.1:18000/v1,http://127.0.0.1:18002/v1,http://127.0.0.1:18004/v1,http://127.0.0.1:18006/v1 \
  --modes direct \
  --prompt-style semantic-fastsplit-noformula \
  --max-output-tokens 7000 \
  --direct-max-output-tokens 7000 \
  --vllm-thinking on \
  --task-concurrency 4 \
  --out traces/aime_demo/direct.jsonl
```

Run model-planned recursive streaming:
```shell
PYTHONPATH=. python3 -m rrm.cli bench benchmarks/aime_2024_rrm_candidates.jsonl \
  --backend vllm \
  --model Qwen/Qwen3-14B \
  --base-urls http://127.0.0.1:18000/v1,http://127.0.0.1:18002/v1,http://127.0.0.1:18004/v1,http://127.0.0.1:18006/v1 \
  --modes streaming_parallel \
  --prompt-style semantic-fastsplit-noformula \
  --recombiner continue \
  --max-depth 1 \
  --max-concurrency 4 \
  --max-sibling-width 12 \
  --max-output-tokens 7000 \
  --planner-max-output-tokens 1536 \
  --worker-max-output-tokens 7000 \
  --recombiner-max-output-tokens 3000 \
  --vllm-thinking on \
  --vllm-graph-thinking off \
  --vllm-atomic-thinking on \
  --vllm-continuation-thinking off \
  --task-concurrency 4 \
  --out traces/aime_demo/rrm.jsonl
```

Analyze direct vs RRM:
```shell
cat traces/aime_demo/direct.jsonl traces/aime_demo/rrm.jsonl \
  > traces/aime_demo/combined.jsonl
PYTHONPATH=. python3 -m rrm.cli analyze traces/aime_demo/combined.jsonl \
  --tasks benchmarks/aime_2024_rrm_candidates.jsonl \
  --direct-vs-rrm
```

Render the static demo:
```shell
PYTHONPATH=. python3 -m rrm.cli render-demo traces/aime_demo/combined.jsonl \
  --tasks benchmarks/aime_2024_rrm_candidates.jsonl \
  --out traces/blogpost_demo/index.html \
  --title "Recursive Reasoning Models on AIME"
```

Every benchmark row records:
- answer and correctness
- end-to-end latency
- model-call count
- graph status
- prompt hash and prompt preview
- token usage when exposed by the backend
- max observed depth
- max sibling width
- total node count
- time to first child node
- ancestor/descendant overlap
- replica assignment and queueing metadata
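As an illustration of the overlap metric, one simple definition sums the wall-clock time a descendant's execution interval intersects its ancestor's. A minimal sketch under that assumed `(start, end)` interval representation (not the exact computation in `rrm/analysis.py`):

```python
# Sketch: total seconds during which any child's (start, end) execution
# interval overlaps the parent node's interval. Positive overlap means the
# parent and its descendants genuinely ran concurrently.
def overlap_seconds(parent, children):
    p_start, p_end = parent
    total = 0.0
    for c_start, c_end in children:
        # Intersection length of [c_start, c_end] with [p_start, p_end]
        total += max(0.0, min(p_end, c_end) - max(p_start, c_start))
    return total

# Children at (2, 5) and (8, 12) overlap a (0, 10) parent by 3 + 2 seconds.
print(overlap_seconds((0.0, 10.0), [(2.0, 5.0), (8.0, 12.0)]))  # → 5.0
```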
Headline RRM rows require a real emitted graph, correct direct answer, correct RRM answer, no oracle graph, no fallback, positive overlap, and real API timing. Graphless atomic rows are direct-style controls.
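The headline filter amounts to a conjunction over per-row flags. A hypothetical sketch of that predicate (the field names here are illustrative assumptions, not the recorded schema):

```python
# Hypothetical headline-row filter; every field name below is assumed for
# illustration and does not necessarily match the real trace schema.
def is_headline_rrm_row(row):
    return bool(
        row.get("graph_emitted")               # a real emitted graph
        and row.get("direct_correct")          # correct direct answer
        and row.get("rrm_correct")             # correct RRM answer
        and not row.get("oracle_graph")        # no oracle graph
        and not row.get("fallback")            # no fallback
        and row.get("overlap_seconds", 0) > 0  # positive overlap
        and row.get("real_api_timing")         # real API timing
    )

row = {"graph_emitted": True, "direct_correct": True, "rrm_correct": True,
       "oracle_graph": False, "fallback": False,
       "overlap_seconds": 1.7, "real_api_timing": True}
print(is_headline_rrm_row(row))  # → True
```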
```shell
python3 -m pytest
```

The tests cover event parsing, recursive scheduling, dependency readiness, continuation behavior, benchmark validation, analysis labels, and demo rendering.