RLM — Recursive Language Model vs Vanilla LLM

A local, reproducible comparison framework that puts Recursive Language Models (RLMs) head-to-head against standard LLM inference on long-context tasks — using the exact same model for both. Includes SFT training with QLoRA and Muon optimizer ablation.

Based on the research paper: Recursive Language Models (arXiv:2512.24601) by Zhang, Kraska & Khattab (2025/2026)

What is an RLM?

Standard LLMs suffer from context rot — the longer the document you feed them, the worse their reasoning gets. Even models with 128K token windows degrade badly on information-dense tasks.

An RLM fixes this by never putting the full document in the LLM's context window. Instead:

The document is stored as a Python variable (context) in an external REPL
The LLM writes Python code to examine slices of it
The LLM calls itself recursively via llm_query() on focused sub-chunks
When it has the answer, it calls FINAL(answer) to terminate

Vanilla LLM:
  [system prompt + FULL 32K document + query] -> LLM -> answer
  (context rot degrades quality above ~8K tokens)

RLM:
  [system prompt + metadata about document + query] -> root LLM -> Python code
                                                                        |
  REPL executes code, `context` variable holds full document in RAM
  LLM writes code like: chunks = [...]; results = [llm_query(chunk) for chunk in chunks]
                                                                        |
  Sub-LM answers focused questions on small, clean chunks
                                                                        |
  Root LLM consolidates -> FINAL(answer)

Same model. Completely different architecture. Measurably different results.

Key Results

Context	Vanilla	RLM	Winner
4K-8K	100%	100%	Tie (both fit in KV cache)
16K	100%	67%	Vanilla (RLM code quality issue)
32K	33%	33%	Tie (KV cache truncation begins)
64K	33%	67%	RLM (vanilla is blind, RLM finds it)
128K	0%	--	RLM (vanilla completely fails)

RLM advantage emerges at 32K+ where Ollama's KV cache truncates vanilla's context. RLM is unaffected because it searches the full document via REPL code.

Project Structure

RLM/
|-- src/
|   |-- rlm/                        # Core RLM engine
|   |   |-- clients/
|   |   |   |-- base.py             # BaseLLMClient (abstract)
|   |   |   |-- ollama.py           # OllamaClient -- HTTP to Ollama
|   |   |-- repl.py                 # LocalREPL -- exec-based REPL
|   |   |-- rlm_repl.py             # RLM_REPL -- full recursive loop
|   |-- baseline/
|   |   |-- vanilla_llm.py          # VanillaLLM -- direct context inference
|   |-- benchmarks/
|   |   |-- niah.py                 # Needle in a Haystack generator
|   |   |-- long_doc_qa.py          # Long Document QA generator
|   |-- comparison/
|   |   |-- runner.py               # Run vanilla + RLM, save JSONL
|   |   |-- eval.py                 # Score results
|   |-- training/                   # Phase 4: SFT + RL training
|       |-- muon.py                 # Muon optimizer (Newton-Schulz)
|
|-- scripts/
|   |-- smoke_test.py               # Quick single-task test
|   |-- run_niah.py                 # NIAH benchmark suite
|   |-- run_comparison.py           # Full suite (NIAH + LongDocQA)
|   |-- plot_results.py             # DeepSeek-R1-Zero style plots
|   |-- view_results.py             # Analyse saved JSONL results
|   |-- build_sft_dataset.py        # Trajectory -> SFT training data
|   |-- train_sft.py                # QLoRA SFT with AdamW or Muon
|   |-- eval_sft.py                 # Evaluate SFT model vs base
|
|-- prompts/
|   |-- rlm_system.txt              # System prompt for root LM
|
|-- experiments/                    # Results, checkpoints, plots
|   |-- plots/                      # Generated PNG charts
|   |-- checkpoints/                # SFT model checkpoints
|   |-- *.jsonl                     # Benchmark results
|   |-- *_trajectories.jsonl        # RLM code traces for training
|   |-- sft_dataset.jsonl           # Filtered SFT training data
|
|-- docs/                           # Research, decisions, logs
|   |-- dev_log.md                  # Session-by-session progress
|   |-- tasks.md                    # Master task tracker
|   |-- 01_rlm_deep_dive.md        # Paper breakdown
|   |-- 02_repl_explained.md       # REPL concept + build guide
|   |-- 03_hardware_constraints.md # RTX 4050 reality check
|   |-- 04_implementation_plan.md  # Architecture + phase plan
|   |-- 05_novel_research_ideas.md # Future directions
|
|-- pyproject.toml
|-- .python-version                 # Python 3.12

Hardware & Model


GPU	NVIDIA RTX 4050 Laptop GPU (6GB VRAM)
Model	`qwen3.5:2b` via Ollama (inference) / `Qwen2.5-1.5B` via HF (training)
Root LM	`qwen3.5:2b`
Sub LM	`qwen3.5:2b` (same model)
REPL	Local `exec()`-based Python
Training	QLoRA (4-bit NF4) + LoRA rank 8

Setup

Requirements: Python 3.12+, uv, Ollama

# 1. Clone / enter the project
cd RLM

# 2. Install base dependencies
uv sync

# 3. Start Ollama and pull model
ollama serve
ollama pull qwen3.5:2b

# 4. (Optional) Install training dependencies for Phase 4
uv pip install -e ".[train]"

Usage

Benchmarking (Phase 2-3)

# Smoke test — verify everything works (~1-2 min)
uv run python scripts/smoke_test.py

# NIAH benchmark — short contexts
uv run python scripts/run_niah.py --small

# NIAH benchmark — full grid (4K/8K/16K/32K)
uv run python scripts/run_niah.py --save-trajectories

# Long-context NIAH (32K/64K/128K) — where RLM shines
uv run python scripts/run_niah.py --long --save-trajectories

# Full comparison (NIAH + LongDocQA)
uv run python scripts/run_comparison.py --save-trajectories

# Resume an interrupted run (fingerprint-based skip)
uv run python scripts/run_niah.py --run-name niah_full_v2

# Generate plots (DeepSeek-R1-Zero style)
uv run python scripts/plot_results.py experiments/niah_full_v2.jsonl

# Merge multiple runs into one plot set
uv run python scripts/plot_results.py experiments/niah_full_v2.jsonl experiments/niah_long_v1.jsonl --merge

Training (Phase 4)

# Step 1: Build SFT dataset from collected trajectories
uv run python scripts/build_sft_dataset.py

# Step 2: STOP Ollama (frees VRAM for training)
ollama stop qwen3.5:2b

# Step 3a: Train with AdamW (baseline)
uv run python scripts/train_sft.py --optimizer adamw --epochs 3

# Step 3b: Train with Muon (ablation)
uv run python scripts/train_sft.py --optimizer muon --epochs 3

# Step 4: Evaluate SFT model
uv run python scripts/eval_sft.py experiments/checkpoints/sft_adamw/final
uv run python scripts/eval_sft.py experiments/checkpoints/sft_muon/final

# Step 5: Compare all results
uv run python scripts/plot_results.py experiments/eval_sft_adamw.jsonl experiments/eval_sft_muon.jsonl --merge

Training Parameters (6GB VRAM)

Parameter	Value	Reason
Quantization	4-bit NF4 (QLoRA)	Fits 2B model in ~3.5GB
LoRA rank	8	Low memory, sufficient for SFT
LoRA alpha	16	Standard 2x rank
Batch size	1	Minimum for 6GB
Grad accumulation	4	Effective batch = 4
Max seq length	2048	Caps memory usage
Optimizer	AdamW or Muon	Muon for 2D weights only

Muon Optimizer

Muon is an optimizer designed for weight matrices in transformers. Instead of standard gradient descent, it applies Newton-Schulz orthogonalization to compute a spectrally-normalized update direction.

How it works:

Accumulate gradient with Nesterov momentum
Apply Newton-Schulz iterations to approximate the polar decomposition: G -> U @ V^T
This gives a "direction-only" update (strips out magnitude)
Update: W -= lr * orthogonal_direction

Split strategy:

Muon: all 2D weight matrices (attention Q/K/V/O projections, MLP layers)
AdamW: everything else (embeddings, layernorm, biases)

Ablation question: Does Muon converge faster and produce better RLM code trajectories compared to standard AdamW?

How RLM Works (in code)

from rlm.clients.ollama import OllamaClient
from rlm.rlm_repl import RLM_REPL

client = OllamaClient(model="qwen3.5:2b")
rlm = RLM_REPL(root_client=client, max_iterations=20)

result = rlm.completion(
    context="...your very long document...",
    query="What is the activation code mentioned in section 3?"
)

print(result["answer"])       # The model's answer
print(result["iterations"])   # How many REPL turns it took
print(result["sub_calls"])    # How many llm_query() calls were made
print(result["trajectory"])   # Full step-by-step code + output log

Benchmarks

Needle in a Haystack (NIAH)

Context: random English words (haystack)
Needle: one specific fact inserted at a controlled position (10%, 50%, 90%)
Query: ask for that exact fact
Why: tests pure retrieval across increasing context lengths

Long Document QA

Context: filler sentences with 5 facts embedded throughout
Query: ask about one specific fact buried in filler
Why: more realistic than NIAH — coherent text, not random words

SFT Training Pipeline

Benchmark runs (--save-trajectories)
        |
        v
*_trajectories.jsonl  (raw RLM code traces)
        |
        v
build_sft_dataset.py  (filter: correct + FINAL called + no errors)
        |
        v
sft_dataset.jsonl     (chat-format messages for SFT)
        |
        v
train_sft.py          (QLoRA + AdamW or Muon)
        |
        v
checkpoints/sft_*/    (LoRA adapters)
        |
        v
eval_sft.py           (run SFT model through NIAH, compare to base)

Roadmap

Phase 0 -- Research & planning
Phase 1 -- Core infrastructure (REPL, RLM loop, Ollama client, vanilla baseline)
Phase 2 -- Benchmarks (NIAH, Long-Doc QA, comparison runner, eval, plots)
Phase 2.5 -- Resumable runs, trajectory saving, DeepSeek-style plots
[/] Phase 3 -- Analysis (NIAH complete, LongDocQA in progress)
[/] Phase 4a -- SFT Training with QLoRA (pipeline built, training pending)
Phase 4b -- GRPO Reinforcement Learning
Phase 4c -- Muon Optimizer Ablation (AdamW vs Muon)
Phase 5 -- Publication (Medium article, GitHub Pages)

See docs/tasks.md for the full checklist and docs/dev_log.md for session history.

Research Background

Resource	Link
RLM Paper (arXiv)	https://arxiv.org/abs/2512.24601
Author blog post	https://alexzhang13.github.io/blog/2025/rlm/
Official implementation	https://github.com/alexzhang13/rlm
Muon optimizer	https://github.com/KellerJordan/Muon
Unsloth (QLoRA training)	https://github.com/unslothai/unsloth

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

RLM — Recursive Language Model vs Vanilla LLM

What is an RLM?

Key Results

Project Structure

Hardware & Model

Setup

Usage

Benchmarking (Phase 2-3)

Training (Phase 4)

Training Parameters (6GB VRAM)

Muon Optimizer

How RLM Works (in code)

Benchmarks

Needle in a Haystack (NIAH)

Long Document QA

SFT Training Pipeline

Roadmap

Research Background

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
docs		docs
experiments		experiments
prompts		prompts
scripts		scripts
src		src
.gitignore		.gitignore
.python-version		.python-version
README.md		README.md
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Folders and files

Latest commit

History

Repository files navigation

RLM — Recursive Language Model vs Vanilla LLM

What is an RLM?

Key Results

Project Structure

Hardware & Model

Setup

Usage

Benchmarking (Phase 2-3)

Training (Phase 4)

Training Parameters (6GB VRAM)

Muon Optimizer

How RLM Works (in code)

Benchmarks

Needle in a Haystack (NIAH)

Long Document QA

SFT Training Pipeline

Roadmap

Research Background

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages