Recursive Reasoning Models (RRM) provide an inference-time runtime for parallelizing language-model reasoning.
An RRM run treats a reasoning problem as an execution graph. Each node receives one task and emits one of two actions:
- `atomic`: answer the task directly.
- `child`/`await`: expose smaller subproblems, wait for their conclusions, then continue the same node.
The runtime parses streamed node events, launches ready child nodes immediately, tracks dependencies, and resumes parent nodes when awaited child conclusions are available. Independent child nodes run concurrently on same-model replicas.
The key idea is not agent orchestration. A node is not a separate persona or worker. A node is a scoped continuation of the same recursive execution policy.
Execution flow:

1. root task
2. node expands
3. children launch as soon as their JSONL events stream in
4. independent children run concurrently
5. parent awaits required child conclusions
6. parent continues from the child conclusions
7. final answer returns from the root
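The flow above can be sketched with `asyncio`. This is a simplified illustration, not the actual `rrm/executor.py` scheduler: the real one parses streamed node events and enforces `depends_on` readiness, which this sketch omits, and `solve`/`expand` here are stand-ins for model calls.

```python
# Simplified sketch of the recursive execution policy: a node either answers
# atomically or expands into children, awaits them, and continues. Independent
# children run concurrently under a global concurrency cap (the semaphore).
import asyncio

async def run_node(task, solve, expand, sem, depth=0, max_depth=1):
    children = expand(task) if depth < max_depth else None
    if not children:                  # atomic: answer the task directly
        async with sem:
            return await solve(task)
    # child/await: launch children concurrently on the same recursive policy
    results = await asyncio.gather(*[
        run_node(child, solve, expand, sem, depth + 1, max_depth)
        for child in children
    ])
    async with sem:                   # parent continues from child conclusions
        return await solve(f"{task} | children: {results}")

async def main():
    sem = asyncio.Semaphore(4)        # global concurrency limit

    async def solve(task):            # stand-in for a model call
        return f"FINAL: {task}"

    def expand(task):                 # stand-in for the model's planned split
        return ["case_a", "case_b"] if task == "root" else []

    return await run_node("root", solve, expand, sem)

print(asyncio.run(main()))
```

Note that the node is not a separate worker: `run_node` is the same policy invoked on a smaller scope, matching the "scoped continuation" framing above.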
Node event protocol:

```jsonl
{"type":"child","id":"case_a","task":"self-contained subproblem","depends_on":[]}
{"type":"child","id":"case_b","task":"self-contained subproblem","depends_on":[]}
{"type":"await","children":["case_a","case_b"],"rule":"how the child conclusions answer the task"}
{"type":"atomic","answer":"complete answer ending with FINAL: <answer>"}
{"type":"done"}
```

The runtime stores full node paths such as `root.case_a`, enforces dependency readiness, applies a global concurrency limit, and records trace events for node expansion, child execution, parent continuation, overlap, replica assignment, latency, and correctness.
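A lenient parser in this spirit can be sketched as follows. This is an illustration of the event-protocol handling, not the actual `rrm/streaming.py` implementation:

```python
# Sketch: extract well-formed JSON event lines from streamed model output,
# tolerating prose and partial lines, stopping at the terminal "done" event.
import json

def parse_events(stream_text):
    events = []
    for line in stream_text.splitlines():
        line = line.strip()
        if not line.startswith("{"):
            continue  # tolerate prose the model mixes into the stream
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate truncated or garbled lines
        events.append(event)
        if event.get("type") == "done":
            break     # ignore anything emitted after the terminal event
    return events

raw = """thinking out loud...
{"type":"child","id":"case_a","task":"subproblem","depends_on":[]}
{"type":"await","children":["case_a"],"rule":"combine"}
{"type":"done"}
{"type":"child","id":"late","task":"ignored"}"""

print([e["type"] for e in parse_events(raw)])  # → ['child', 'await', 'done']
```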
- `rrm/executor.py`: recursive streaming scheduler.
- `rrm/streaming.py`: lenient JSONL event parser.
- `rrm/vllm_backend.py`: OpenAI-compatible vLLM backend with streaming, reasoning controls, model-call tracing, and replica pooling.
- `rrm/hf_kv_backend.py`: Transformers backend that preserves parent-node KV state across `await` for continuation experiments.
- `rrm/analysis.py`: correctness, latency, trust-label, overlap, and replica utilization reports.
- `rrm/demo_renderer.py`: static HTML renderer for side-by-side direct vs RRM traces.
- `benchmarks/aime_2024_rrm_candidates.jsonl`: AIME task subset used for the current demo.
```shell
python3 -m pip install -e .
```

The core package uses the Python standard library. Backends add their own runtime dependencies:

- `vllm` for local OpenAI-compatible serving.
- `transformers` and `torch` for the HF KV continuation backend.
- `openai` for the OpenAI Responses backend.
Start four same-model vLLM replicas, one per GPU:
```shell
MODEL=Qwen/Qwen3-14B scripts/launch_vllm_4replicas.sh
```

Run direct AIME solving:
```shell
PYTHONPATH=. python3 -m rrm.cli bench benchmarks/aime_2024_rrm_candidates.jsonl \
  --backend vllm \
  --model Qwen/Qwen3-14B \
  --base-urls http://127.0.0.1:18000/v1,http://127.0.0.1:18002/v1,http://127.0.0.1:18004/v1,http://127.0.0.1:18006/v1 \
  --modes direct \
  --prompt-style semantic-fastsplit-noformula \
  --max-output-tokens 7000 \
  --direct-max-output-tokens 7000 \
  --vllm-thinking on \
  --task-concurrency 4 \
  --out traces/aime_demo/direct.jsonl
```

Run model-planned recursive streaming:
```shell
PYTHONPATH=. python3 -m rrm.cli bench benchmarks/aime_2024_rrm_candidates.jsonl \
  --backend vllm \
  --model Qwen/Qwen3-14B \
  --base-urls http://127.0.0.1:18000/v1,http://127.0.0.1:18002/v1,http://127.0.0.1:18004/v1,http://127.0.0.1:18006/v1 \
  --modes streaming_parallel \
  --prompt-style semantic-fastsplit-noformula \
  --recombiner continue \
  --max-depth 1 \
  --max-concurrency 4 \
  --max-sibling-width 12 \
  --max-output-tokens 7000 \
  --planner-max-output-tokens 1536 \
  --worker-max-output-tokens 7000 \
  --recombiner-max-output-tokens 3000 \
  --vllm-thinking on \
  --vllm-graph-thinking off \
  --vllm-atomic-thinking on \
  --vllm-continuation-thinking off \
  --task-concurrency 4 \
  --out traces/aime_demo/rrm.jsonl
```

Analyze direct vs RRM:
```shell
cat traces/aime_demo/direct.jsonl traces/aime_demo/rrm.jsonl \
  > traces/aime_demo/combined.jsonl
PYTHONPATH=. python3 -m rrm.cli analyze traces/aime_demo/combined.jsonl \
  --tasks benchmarks/aime_2024_rrm_candidates.jsonl \
  --direct-vs-rrm
```

Render the static demo:
```shell
PYTHONPATH=. python3 -m rrm.cli render-demo traces/aime_demo/combined.jsonl \
  --tasks benchmarks/aime_2024_rrm_candidates.jsonl \
  --out traces/blogpost_demo/index.html \
  --title "Recursive Reasoning Models on AIME"
```

Every benchmark row records:
- answer and correctness
- end-to-end latency
- model-call count
- graph status
- prompt hash and prompt preview
- token usage when exposed by the backend
- max observed depth
- max sibling width
- total node count
- time to first child node
- ancestor/descendant overlap
- replica assignment and queueing metadata
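As an illustration of the overlap metric, one simple definition sums the wall-clock time a descendant's execution interval intersects its ancestor's. A minimal sketch under that assumed `(start, end)` interval representation (not the exact computation in `rrm/analysis.py`):

```python
# Sketch: total seconds during which any child's (start, end) execution
# interval overlaps the parent node's interval. Positive overlap means the
# parent and its descendants genuinely ran concurrently.
def overlap_seconds(parent, children):
    p_start, p_end = parent
    total = 0.0
    for c_start, c_end in children:
        # Intersection length of [c_start, c_end] with [p_start, p_end]
        total += max(0.0, min(p_end, c_end) - max(p_start, c_start))
    return total

# Children at (2, 5) and (8, 12) overlap a (0, 10) parent by 3 + 2 seconds.
print(overlap_seconds((0.0, 10.0), [(2.0, 5.0), (8.0, 12.0)]))  # → 5.0
```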
Headline RRM rows require a real emitted graph, correct direct answer, correct RRM answer, no oracle graph, no fallback, positive overlap, and real API timing. Graphless atomic rows are direct-style controls.
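The headline filter amounts to a conjunction over per-row flags. A hypothetical sketch of that predicate (the field names here are illustrative assumptions, not the recorded schema):

```python
# Hypothetical headline-row filter; every field name below is assumed for
# illustration and does not necessarily match the real trace schema.
def is_headline_rrm_row(row):
    return bool(
        row.get("graph_emitted")               # a real emitted graph
        and row.get("direct_correct")          # correct direct answer
        and row.get("rrm_correct")             # correct RRM answer
        and not row.get("oracle_graph")        # no oracle graph
        and not row.get("fallback")            # no fallback
        and row.get("overlap_seconds", 0) > 0  # positive overlap
        and row.get("real_api_timing")         # real API timing
    )

row = {"graph_emitted": True, "direct_correct": True, "rrm_correct": True,
       "oracle_graph": False, "fallback": False,
       "overlap_seconds": 1.7, "real_api_timing": True}
print(is_headline_rrm_row(row))  # → True
```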
```shell
python3 -m pytest
```

The tests cover event parsing, recursive scheduling, dependency readiness, continuation behavior, benchmark validation, analysis labels, and demo rendering.