Recursive Reasoning Models

Recursive Reasoning Models (RRM) is an inference-time runtime for parallelizing language-model reasoning.

An RRM run treats a reasoning problem as an execution graph. Each node receives one task and emits one of two actions:

  • atomic: answer the task directly.
  • child / await: expose smaller subproblems, wait for their conclusions, then continue the same node.

The runtime parses streamed node events, launches ready child nodes immediately, tracks dependencies, and resumes parent nodes when awaited child conclusions are available. Independent child nodes run concurrently on same-model replicas.

The key point is that this is not agent orchestration: a node is not a separate persona or worker, but a scoped continuation of the same recursive execution policy.

Runtime Shape

root task
  node expands
  children launch as soon as their JSONL events stream in
  independent children run concurrently
  parent awaits required child conclusions
  parent continues from the child conclusions
  final answer returns from the root

Node event protocol:

{"type":"child","id":"case_a","task":"self-contained subproblem","depends_on":[]}
{"type":"child","id":"case_b","task":"self-contained subproblem","depends_on":[]}
{"type":"await","children":["case_a","case_b"],"rule":"how the child conclusions answer the task"}
{"type":"atomic","answer":"complete answer ending with FINAL: <answer>"}
{"type":"done"}

The runtime stores full node paths such as root.case_a, enforces dependency readiness, applies a global concurrency limit, and records trace events for node expansion, child execution, parent continuation, overlap, replica assignment, latency, and correctness.
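
The scheduling idea can be pictured with a short, self-contained sketch (standard library only). This is not rrm/executor.py: the event shapes follow the protocol above, but solve(), the sample events, and the node naming are illustrative stand-ins.

# Toy sketch of the streaming scheduler described above (NOT rrm/executor.py).
# Events follow the protocol listed earlier; solve() stands in for a model call.
import asyncio
import json

EVENTS = """
{"type":"child","id":"case_a","task":"self-contained subproblem","depends_on":[]}
{"type":"child","id":"case_b","task":"self-contained subproblem","depends_on":[]}
{"type":"await","children":["case_a","case_b"],"rule":"combine both cases"}
{"type":"done"}
"""

async def solve(path: str, task: str) -> str:
    # Stand-in for one model call on a replica; returns that node's conclusion.
    await asyncio.sleep(0.1)
    return f"conclusion of {path}"

async def run_node(path: str, event_lines: list[str]) -> str:
    pending: dict[str, asyncio.Task] = {}   # child id -> running task
    done: dict[str, str] = {}               # child id -> conclusion
    for raw in event_lines:
        line = raw.strip()
        if not line:
            continue                         # lenient: ignore blank lines
        event = json.loads(line)
        if event["type"] == "child":
            # Enforce dependency readiness before launching the child.
            for dep in event.get("depends_on", []):
                if dep not in done:
                    done[dep] = await pending.pop(dep)
            child_path = f"{path}.{event['id']}"   # full node path, e.g. root.case_a
            pending[event["id"]] = asyncio.create_task(solve(child_path, event["task"]))
        elif event["type"] == "await":
            # Independent children launched above are already running concurrently.
            for cid in event["children"]:
                if cid in pending:
                    done[cid] = await pending.pop(cid)
        elif event["type"] == "atomic":
            return event["answer"]
    # Parent continuation: the real runtime resumes the model from the awaited
    # conclusions; here we just join them.
    return "FINAL: " + "; ".join(done.values())

print(asyncio.run(run_node("root", EVENTS.splitlines())))

Independent children overlap because their tasks are created as their events arrive, before any of them is awaited.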

What Is In This Repo

  • rrm/executor.py: recursive streaming scheduler.
  • rrm/streaming.py: lenient JSONL event parser.
  • rrm/vllm_backend.py: OpenAI-compatible vLLM backend with streaming, reasoning controls, model-call tracing, and replica pooling.
  • rrm/hf_kv_backend.py: Transformers backend that preserves parent-node KV state across await for continuation experiments.
  • rrm/analysis.py: correctness, latency, trust-label, overlap, and replica utilization reports.
  • rrm/demo_renderer.py: static HTML renderer for side-by-side direct vs RRM traces.
  • benchmarks/aime_2024_rrm_candidates.jsonl: AIME task subset used for the current demo.

Install

python3 -m pip install -e .

The core package uses the Python standard library. Backends add their own runtime dependencies:

  • vllm for local OpenAI-compatible serving.
  • transformers and torch for the HF KV continuation backend.
  • openai for the OpenAI Responses backend.
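
A quick, purely illustrative way to check which optional backend dependencies the current environment already has installed:

# Illustrative probe of the optional backend dependencies listed above.
import importlib.util

for name in ("vllm", "transformers", "torch", "openai"):
    found = importlib.util.find_spec(name) is not None
    print(f"{name}: {'available' if found else 'not installed'}")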

Run The AIME Demo

Start four same-model vLLM replicas, one per GPU:

MODEL=Qwen/Qwen3-14B scripts/launch_vllm_4replicas.sh

Run direct AIME solving:

PYTHONPATH=. python3 -m rrm.cli bench benchmarks/aime_2024_rrm_candidates.jsonl \
  --backend vllm \
  --model Qwen/Qwen3-14B \
  --base-urls http://127.0.0.1:18000/v1,http://127.0.0.1:18002/v1,http://127.0.0.1:18004/v1,http://127.0.0.1:18006/v1 \
  --modes direct \
  --prompt-style semantic-fastsplit-noformula \
  --max-output-tokens 7000 \
  --direct-max-output-tokens 7000 \
  --vllm-thinking on \
  --task-concurrency 4 \
  --out traces/aime_demo/direct.jsonl

Run model-planned recursive streaming:

PYTHONPATH=. python3 -m rrm.cli bench benchmarks/aime_2024_rrm_candidates.jsonl \
  --backend vllm \
  --model Qwen/Qwen3-14B \
  --base-urls http://127.0.0.1:18000/v1,http://127.0.0.1:18002/v1,http://127.0.0.1:18004/v1,http://127.0.0.1:18006/v1 \
  --modes streaming_parallel \
  --prompt-style semantic-fastsplit-noformula \
  --recombiner continue \
  --max-depth 1 \
  --max-concurrency 4 \
  --max-sibling-width 12 \
  --max-output-tokens 7000 \
  --planner-max-output-tokens 1536 \
  --worker-max-output-tokens 7000 \
  --recombiner-max-output-tokens 3000 \
  --vllm-thinking on \
  --vllm-graph-thinking off \
  --vllm-atomic-thinking on \
  --vllm-continuation-thinking off \
  --task-concurrency 4 \
  --out traces/aime_demo/rrm.jsonl

Analyze direct vs RRM:

cat traces/aime_demo/direct.jsonl traces/aime_demo/rrm.jsonl \
  > traces/aime_demo/combined.jsonl

PYTHONPATH=. python3 -m rrm.cli analyze traces/aime_demo/combined.jsonl \
  --tasks benchmarks/aime_2024_rrm_candidates.jsonl \
  --direct-vs-rrm

Render the static demo:

PYTHONPATH=. python3 -m rrm.cli render-demo traces/aime_demo/combined.jsonl \
  --tasks benchmarks/aime_2024_rrm_candidates.jsonl \
  --out traces/blogpost_demo/index.html \
  --title "Recursive Reasoning Models on AIME"

Trace Semantics

Every benchmark row records:

  • answer and correctness
  • end-to-end latency
  • model-call count
  • graph status
  • prompt hash and prompt preview
  • token usage when exposed by the backend
  • max observed depth
  • max sibling width
  • total node count
  • time to first child node
  • ancestor/descendant overlap
  • replica assignment and queueing metadata

Headline RRM rows require a real emitted graph, correct direct answer, correct RRM answer, no oracle graph, no fallback, positive overlap, and real API timing. Graphless atomic rows are direct-style controls.
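
A sketch of what post-hoc filtering along these lines could look like; the field names below are hypothetical and may not match the actual trace schema, and rrm.cli analyze with --direct-vs-rrm is the supported path.

# Hypothetical field names throughout; the real schema is defined by rrm/analysis.py.
import json

def load_rows(path: str) -> list[dict]:
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

rows = load_rows("traces/aime_demo/combined.jsonl")

# Mirror the headline criteria described above: a real emitted graph, no oracle
# graph or fallback, positive overlap, and a correct answer. Pairing each RRM row
# with the matching correct direct row (per task id) is omitted for brevity.
headline = [
    r for r in rows
    if r.get("mode") == "streaming_parallel"   # assumed field name
    and r.get("correct")                        # assumed field name
    and r.get("node_count", 0) > 1              # assumed: graph actually emitted
    and not r.get("oracle_graph")               # assumed field name
    and not r.get("fallback")                   # assumed field name
    and r.get("overlap_seconds", 0) > 0         # assumed field name
]
print(f"{len(headline)} headline-style RRM rows out of {len(rows)} total")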

Tests

python3 -m pytest

The tests cover event parsing, recursive scheduling, dependency readiness, continuation behavior, benchmark validation, analysis labels, and demo rendering.
