One context, run everywhere.
RLM-as-a-service: persistent, recursive reasoning for AI coding agents.
Built on top of alexzhang13/rlm — the open-source Recursive Language Model framework where LLMs offload context into a REPL environment and recursively call sub-LLMs to decompose complex tasks. On benchmarks like OOLONG (132k tokens), RLM(GPT-5-mini) outperforms GPT-5 by over 34 points at similar cost.
Monolith takes the core RLM and turns it into deployed infrastructure that AI agents can call as a tool. We wrap the RLM in an MCP server, deploy the compute on Modal serverless, and add a persistent memory layer (Modal Volume) so the RLM accumulates context across sessions. The result: plug it into Claude Code and the agent gains the ability to recursively reason over arbitrarily large contexts — and remember what it learned.
| Layer | What | Why |
|---|---|---|
| MCP server | `server.py` (stdio) + Cloudflare Worker (HTTP) | Exposes RLM as tools any MCP-compatible agent can call |
| Modal backend | `modal_runtime.py` — serverless functions + HTTP endpoints | No infra to manage; scales to zero when idle |
| Persistent memory | Modal Volume stores `{thread_id}/context.txt` | RLM builds on past sessions instead of starting from scratch |
| Session auto-upload | Claude Code Stop hook captures full transcripts | Every conversation becomes searchable context for the RLM |
| Modal Sandbox sub-LLMs | `ModalSandboxSubRLM` runs sub-LLM calls in isolated sandboxes | Safe code execution for recursive calls in the cloud |
| CLI tools | `python -m monolith.query` / `store` | Use RLM outside of Claude Code |
```
Claude Code
 │
 └─ MCP (stdio or streamable-http)
      │
      ▼
   MCP Server   ← thin routing layer
      │            (Python stdio server OR Cloudflare Worker)
      │
      ├─ chat_rlm_query(query, thread_id)
      │    │
      │    ▼
      │  Modal: run_rlm_remote()
      │    ├─ reads context from Volume: /{thread_id}/context.txt
      │    ├─ runs RLM_REPL reasoning loop:
      │    │     root LLM (gpt-5) ──writes code──▶ sandboxed REPL
      │    │                                             │
      │    │     REPL calls llm_query() ──────────▶ sub-LLM (gpt-5-nano)
      │    │                                             │
      │    │     results flow back to root LLM ◀─────────┘
      │    │     ... repeat up to N iterations
      │    ├─ appends Q&A turn to Volume
      │    └─ returns answer
      │
      └─ upload_context(transcript, session_id, thread_id)
           │
           ▼
         Modal: store_context()
           └─ appends transcript to Volume: /{thread_id}/context.txt
```
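For concreteness, here is a rough sketch of what the two Modal functions in the diagram can look like. Only `run_rlm_remote`, `store_context`, and the `rlm-shared-volume` Volume come from this README; the app name, mount path, and the `answer_with_rlm` placeholder are assumptions for illustration, not the repo's actual code.

```python
# Illustrative sketch only; see mcp-modal/modal_runtime.py for the real implementation.
import modal

app = modal.App("monolith-rlm")  # assumed app name
volume = modal.Volume.from_name("rlm-shared-volume", create_if_missing=True)

MOUNT = "/data"  # assumed mount point for the shared Volume


def answer_with_rlm(query: str, context: str) -> str:
    """Placeholder for the RLM_REPL reasoning loop (rlm/rlm_repl.py in the repo)."""
    raise NotImplementedError


@app.function(volumes={MOUNT: volume})
def run_rlm_remote(query: str, thread_id: str) -> str:
    from pathlib import Path

    # Read whatever context this thread has accumulated so far.
    ctx_path = Path(MOUNT) / thread_id / "context.txt"
    context = ctx_path.read_text() if ctx_path.exists() else ""

    answer = answer_with_rlm(query, context)  # recursive reasoning over the context

    # Append this Q&A turn so later sessions start from richer context.
    ctx_path.parent.mkdir(parents=True, exist_ok=True)
    with ctx_path.open("a") as f:
        f.write(f"\nQ: {query}\nA: {answer}\n")
    volume.commit()  # persist writes to the shared Volume
    return answer


@app.function(volumes={MOUNT: volume})
def store_context(transcript: str, session_id: str, thread_id: str = "transcripts") -> None:
    from pathlib import Path

    ctx_path = Path(MOUNT) / thread_id / "context.txt"
    ctx_path.parent.mkdir(parents=True, exist_ok=True)
    with ctx_path.open("a") as f:
        f.write(f"\n--- session {session_id} ---\n{transcript}\n")
    volume.commit()
```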
The RLM never sees the full context directly. Instead it interacts with it programmatically through a REPL:
- Recon — the root LLM reads the context file, checks its size, identifies the format and natural chunk boundaries
- Filter + Analyze — writes Python code to split the context along those boundaries, uses regex/keywords to find relevant sections, then calls `llm_query()` to delegate semantic analysis of each section to a sub-LLM
- Aggregate + Answer — synthesizes sub-LLM results via a final `llm_query()` call and returns the answer
The root LLM uses a powerful model (gpt-5) for orchestration while sub-LLMs use cheaper models (gpt-5-nano) for focused analysis — keeping cost low while handling arbitrarily large contexts.
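Here is a sketch of the kind of code the root LLM emits inside the REPL during these phases. `llm_query()` is the sub-LLM call the REPL exposes; the context path, the blank-line chunking, the keyword filter, and the example question are illustrative assumptions, not fixed behavior.

```python
# Sketch of code the root LLM might write inside the sandboxed REPL.
# llm_query(prompt) is provided by the REPL environment (rlm/repl.py); outside
# it you would need to supply your own implementation.
import re

CONTEXT_PATH = "context.txt"  # assumed location inside the REPL sandbox

# Phase 1 -- Recon: check size and structure before reading everything.
text = open(CONTEXT_PATH).read()
print(len(text), text[:500])

# Phase 2 -- Filter + Analyze: split on natural boundaries, keep likely-relevant
# chunks, and delegate semantic analysis of each one to a cheap sub-LLM.
chunks = re.split(r"\n{2,}", text)
relevant = [c for c in chunks if re.search(r"auth|login|token", c, re.I)]
findings = [
    llm_query(f"Summarize what this excerpt says about the login flow:\n{c}")
    for c in relevant[:20]  # cap the number of sub-LLM calls
]

# Phase 3 -- Aggregate + Answer: synthesize the sub-LLM results into one answer.
answer = llm_query(
    "Combine these notes into a direct answer to the question "
    "'How does authentication work in this project?':\n" + "\n".join(findings)
)
print(answer)
```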
To run Monolith you'll need:

- Python 3.12+
- Modal account
- OpenAI API key
- Claude Code
```bash
git clone https://github.com/WingchunSiu/Monolith.git
cd Monolith
modal token set
```

```bash
modal volume create rlm-shared-volume
echo "OPENAI_API_KEY=sk-..." > /tmp/.env
modal volume put rlm-shared-volume /tmp/.env .env
rm /tmp/.env
```

```bash
cd mcp-modal
pip install -r requirements.txt
modal deploy modal_runtime.py
```

Local mode (stdio — recommended for dev):

```bash
claude mcp add monolith --transport stdio -- python /path/to/Monolith/mcp-modal/server.py
```

Cloud mode (Cloudflare Worker → Modal HTTP):

```bash
cd mcp-modal/cloudflare/worker-gateway
# set MODAL_BACKEND_URL in wrangler.toml
npm install && npm run deploy

claude mcp add monolith --transport http \
  --url https://monolith-mcp-modal.<subdomain>.workers.dev/mcp
```

Open Claude Code — the `chat_rlm_query` and `upload_context` tools are available automatically. The RLM handles recursive reasoning; Claude Code handles everything else.
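For a sense of what the stdio side looks like, here is a minimal sketch of an MCP server exposing both tools with the MCP Python SDK's `FastMCP` class. It is not the repo's `server.py`, and the Modal app/function names passed to `modal.Function.from_name()` are assumptions about how the routing could be wired.

```python
# Illustrative sketch, not the actual mcp-modal/server.py.
import modal
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("monolith")


@mcp.tool()
def chat_rlm_query(query: str, thread_id: str) -> str:
    """Ask the RLM a question; context accumulates per thread."""
    fn = modal.Function.from_name("monolith-rlm", "run_rlm_remote")  # assumed names
    return fn.remote(query=query, thread_id=thread_id)


@mcp.tool()
def upload_context(transcript: str, session_id: str, thread_id: str = "transcripts") -> str:
    """Append a transcript to the RLM's persistent memory."""
    fn = modal.Function.from_name("monolith-rlm", "store_context")  # assumed names
    fn.remote(transcript=transcript, session_id=session_id, thread_id=thread_id)
    return f"stored under thread '{thread_id}'"


if __name__ == "__main__":
    mcp.run(transport="stdio")
```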
`chat_rlm_query`: Query the RLM with persistent thread context.

| Param | Type | Description |
|---|---|---|
| `query` | string | The question to ask |
| `thread_id` | string | Thread identifier — context accumulates per thread |
`upload_context`: Upload a transcript to the RLM's persistent memory.

| Param | Type | Description |
|---|---|---|
| `transcript` | string | Full transcript text |
| `session_id` | string | Session identifier |
| `thread_id` | string | Thread to store under (default: `transcripts`) |
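Because these are plain MCP tools, any MCP client can call them, not just Claude Code. Below is a minimal sketch using the MCP Python SDK's stdio client; the server path is a placeholder you would point at your own clone.

```python
# Minimal sketch of calling the Monolith tools from any MCP client.
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client


async def main() -> None:
    params = StdioServerParameters(
        command="python", args=["/path/to/Monolith/mcp-modal/server.py"]
    )
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            result = await session.call_tool(
                "chat_rlm_query",
                {"query": "What did we decide about the auth flow?", "thread_id": "myproject"},
            )
            print(result.content)


asyncio.run(main())
```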
Add to `.claude/settings.local.json` to automatically capture every Claude Code session:

```json
{
  "hooks": {
    "Stop": [{
      "type": "command",
      "command": "/path/to/Monolith/scripts/session_end_upload.sh"
    }]
  }
}
```

Each transcript is uploaded with metadata (developer, git branch, timestamps, message count) so the RLM can reason over your full development history.
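Roughly, that metadata gathering amounts to the sketch below, which assembles a header of that kind before the transcript is stored. Every field name, helper, and the crude message count are illustrative assumptions, not the actual interface of `session_end_upload.sh`.

```python
# Illustrative Python equivalent of the metadata a Stop-hook uploader might gather.
import datetime
import json
import subprocess
import sys


def build_payload(transcript_path: str, session_id: str) -> dict:
    transcript = open(transcript_path).read()
    git = lambda *args: subprocess.run(  # small helper for git metadata
        ["git", *args], capture_output=True, text=True
    ).stdout.strip()
    meta = {
        "developer": git("config", "user.name"),
        "git_branch": git("rev-parse", "--abbrev-ref", "HEAD"),
        "ended_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "message_count": transcript.count('"role":'),  # crude count for JSONL transcripts
    }
    return {
        "transcript": json.dumps(meta) + "\n" + transcript,
        "session_id": session_id,
        "thread_id": "transcripts",
    }


if __name__ == "__main__":
    print(json.dumps(build_payload(sys.argv[1], sys.argv[2]), indent=2)[:500])
```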
```
Monolith/
├── mcp-modal/                   # MCP + Modal deployment layer
│   ├── server.py                # MCP server (stdio)
│   ├── modal_runtime.py         # Modal functions + HTTP endpoints
│   ├── rlm/                     # RLM package (mounted into Modal image)
│   └── cloudflare/              # Cloudflare Worker gateway
├── rlm/                         # Core RLM (forked from alexzhang13/rlm)
│   ├── rlm/
│   │   ├── rlm_repl.py          # RLM_REPL — recursive reasoning loop
│   │   ├── repl.py              # Sandboxed REPL with llm_query()
│   │   ├── sub_rlm_worker.py    # Sub-LLM worker for Modal Sandboxes
│   │   └── utils/
│   │       ├── llm.py           # OpenAI client wrapper
│   │       └── prompts.py       # System prompts + 3-phase strategy
│   └── main.py                  # Needle-in-haystack example
├── monolith/                    # CLI entry points
│   ├── query.py                 # python -m monolith.query
│   └── store.py                 # python -m monolith.store
└── scripts/
    └── session_end_upload.sh    # Stop hook for auto-upload
```
- Recursive Language Models — Zhang, Kraska & Khattab (2025)
- RLM blog post and original codebase
- Model Context Protocol
- Modal