One context, run everywhere.
RLM-as-a-service: persistent, recursive reasoning for AI coding agents.
Built on top of alexzhang13/rlm — the open-source Recursive Language Model framework where LLMs offload context into a REPL environment and recursively call sub-LLMs to decompose complex tasks. On benchmarks like OOLONG (132k tokens), RLM(GPT-5-mini) outperforms GPT-5 by over 34 points at similar cost.
Monolith takes the core RLM and turns it into deployed infrastructure that AI agents can call as a tool. We wrap the RLM in an MCP server, deploy the compute on Modal serverless, and add a persistent memory layer (Modal Volume) so the RLM accumulates context across sessions. The result: plug it into Claude Code and the agent gains the ability to recursively reason over arbitrarily large contexts — and remember what it learned.
| Layer | What | Why |
|---|---|---|
| MCP server | `server.py` (stdio) + Cloudflare Worker (HTTP) | Exposes RLM as tools any MCP-compatible agent can call |
| Modal backend | `modal_runtime.py` — serverless functions + HTTP endpoints | No infra to manage; scales to zero when idle |
| Persistent memory | Modal Volume stores `{thread_id}/context.txt` | RLM builds on past sessions instead of starting from scratch |
| Session auto-upload | Claude Code Stop hook captures full transcripts | Every conversation becomes searchable context for the RLM |
| Modal Sandbox sub-LLMs | `ModalSandboxSubRLM` runs sub-LLM calls in isolated sandboxes | Safe code execution for recursive calls in the cloud |
| CLI tools | `python -m monolith.query` / `store` | Use RLM outside of Claude Code |
```
Claude Code
 │
 └─ MCP (stdio or streamable-http)
      │
      ▼
   MCP Server   ← thin routing layer
      │            (Python stdio server OR Cloudflare Worker)
      │
      ├─ chat_rlm_query(query, thread_id)
      │    │
      │    ▼
      │  Modal: run_rlm_remote()
      │    ├─ reads context from Volume: /{thread_id}/context.txt
      │    ├─ runs RLM_REPL reasoning loop:
      │    │     root LLM (gpt-5) ──writes code──▶ sandboxed REPL
      │    │                                             │
      │    │     REPL calls llm_query() ──────────▶ sub-LLM (gpt-5-nano)
      │    │                                             │
      │    │     results flow back to root LLM ◀─────────┘
      │    │     ... repeat up to N iterations
      │    ├─ appends Q&A turn to Volume
      │    └─ returns answer
      │
      └─ upload_context(transcript, session_id, thread_id)
           │
           ▼
         Modal: store_context()
           └─ appends transcript to Volume: /{thread_id}/context.txt
```
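For concreteness, here is a rough sketch of what the two Modal functions in the diagram can look like. Only `run_rlm_remote`, `store_context`, and the `rlm-shared-volume` Volume come from this README; the app name, mount path, and the `answer_with_rlm` placeholder are assumptions for illustration, not the repo's actual code.

```python
# Illustrative sketch only; see mcp-modal/modal_runtime.py for the real implementation.
import modal

app = modal.App("monolith-rlm")  # assumed app name
volume = modal.Volume.from_name("rlm-shared-volume", create_if_missing=True)

MOUNT = "/data"  # assumed mount point for the shared Volume


def answer_with_rlm(query: str, context: str) -> str:
    """Placeholder for the RLM_REPL reasoning loop (rlm/rlm_repl.py in the repo)."""
    raise NotImplementedError


@app.function(volumes={MOUNT: volume})
def run_rlm_remote(query: str, thread_id: str) -> str:
    from pathlib import Path

    # Read whatever context this thread has accumulated so far.
    ctx_path = Path(MOUNT) / thread_id / "context.txt"
    context = ctx_path.read_text() if ctx_path.exists() else ""

    answer = answer_with_rlm(query, context)  # recursive reasoning over the context

    # Append this Q&A turn so later sessions start from richer context.
    ctx_path.parent.mkdir(parents=True, exist_ok=True)
    with ctx_path.open("a") as f:
        f.write(f"\nQ: {query}\nA: {answer}\n")
    volume.commit()  # persist writes to the shared Volume
    return answer


@app.function(volumes={MOUNT: volume})
def store_context(transcript: str, session_id: str, thread_id: str = "transcripts") -> None:
    from pathlib import Path

    ctx_path = Path(MOUNT) / thread_id / "context.txt"
    ctx_path.parent.mkdir(parents=True, exist_ok=True)
    with ctx_path.open("a") as f:
        f.write(f"\n--- session {session_id} ---\n{transcript}\n")
    volume.commit()
```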
The RLM never sees the full context directly. Instead it interacts with it programmatically through a REPL:
- Recon — the root LLM reads the context file, checks its size, identifies the format and natural chunk boundaries
- Filter + Analyze — writes Python code to split the context along those boundaries, uses regex/keywords to find relevant sections, then calls `llm_query()` to delegate semantic analysis of each section to a sub-LLM
- Aggregate + Answer — synthesizes sub-LLM results via a final `llm_query()` call and returns the answer
The root LLM uses a powerful model (gpt-5) for orchestration while sub-LLMs use cheaper models (gpt-5-nano) for focused analysis — keeping cost low while handling arbitrarily large contexts.
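Here is a sketch of the kind of code the root LLM emits inside the REPL during these phases. `llm_query()` is the sub-LLM call the REPL exposes; the context path, the blank-line chunking, the keyword filter, and the example question are illustrative assumptions, not fixed behavior.

```python
# Sketch of code the root LLM might write inside the sandboxed REPL.
# llm_query(prompt) is provided by the REPL environment (rlm/repl.py); outside
# it you would need to supply your own implementation.
import re

CONTEXT_PATH = "context.txt"  # assumed location inside the REPL sandbox

# Phase 1 -- Recon: check size and structure before reading everything.
text = open(CONTEXT_PATH).read()
print(len(text), text[:500])

# Phase 2 -- Filter + Analyze: split on natural boundaries, keep likely-relevant
# chunks, and delegate semantic analysis of each one to a cheap sub-LLM.
chunks = re.split(r"\n{2,}", text)
relevant = [c for c in chunks if re.search(r"auth|login|token", c, re.I)]
findings = [
    llm_query(f"Summarize what this excerpt says about the login flow:\n{c}")
    for c in relevant[:20]  # cap the number of sub-LLM calls
]

# Phase 3 -- Aggregate + Answer: synthesize the sub-LLM results into one answer.
answer = llm_query(
    "Combine these notes into a direct answer to the question "
    "'How does authentication work in this project?':\n" + "\n".join(findings)
)
print(answer)
```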
To run Monolith you'll need:

- Python 3.12+
- Modal account
- OpenAI API key
- Claude Code
```bash
git clone https://github.com/WingchunSiu/Monolith.git
cd Monolith
modal token set
```

```bash
modal volume create rlm-shared-volume
echo "OPENAI_API_KEY=sk-..." > /tmp/.env
modal volume put rlm-shared-volume /tmp/.env .env
rm /tmp/.env
```

```bash
cd mcp-modal
pip install -r requirements.txt
modal deploy modal_runtime.py
```

Local mode (stdio — recommended for dev):

```bash
claude mcp add monolith --transport stdio -- python /path/to/Monolith/mcp-modal/server.py
```

Cloud mode (Cloudflare Worker → Modal HTTP):

```bash
cd mcp-modal/cloudflare/worker-gateway
# set MODAL_BACKEND_URL in wrangler.toml
npm install && npm run deploy

claude mcp add monolith --transport http \
  --url https://monolith-mcp-modal.<subdomain>.workers.dev/mcp
```

Open Claude Code — the `chat_rlm_query` and `upload_context` tools are available automatically. The RLM handles recursive reasoning; Claude Code handles everything else.
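For a sense of what the stdio side looks like, here is a minimal sketch of an MCP server exposing both tools with the MCP Python SDK's `FastMCP` class. It is not the repo's `server.py`, and the Modal app/function names passed to `modal.Function.from_name()` are assumptions about how the routing could be wired.

```python
# Illustrative sketch, not the actual mcp-modal/server.py.
import modal
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("monolith")


@mcp.tool()
def chat_rlm_query(query: str, thread_id: str) -> str:
    """Ask the RLM a question; context accumulates per thread."""
    fn = modal.Function.from_name("monolith-rlm", "run_rlm_remote")  # assumed names
    return fn.remote(query=query, thread_id=thread_id)


@mcp.tool()
def upload_context(transcript: str, session_id: str, thread_id: str = "transcripts") -> str:
    """Append a transcript to the RLM's persistent memory."""
    fn = modal.Function.from_name("monolith-rlm", "store_context")  # assumed names
    fn.remote(transcript=transcript, session_id=session_id, thread_id=thread_id)
    return f"stored under thread '{thread_id}'"


if __name__ == "__main__":
    mcp.run(transport="stdio")
```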
`chat_rlm_query`: Query the RLM with persistent thread context.

| Param | Type | Description |
|---|---|---|
| `query` | string | The question to ask |
| `thread_id` | string | Thread identifier — context accumulates per thread |
`upload_context`: Upload a transcript to the RLM's persistent memory.

| Param | Type | Description |
|---|---|---|
| `transcript` | string | Full transcript text |
| `session_id` | string | Session identifier |
| `thread_id` | string | Thread to store under (default: `transcripts`) |
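Because these are plain MCP tools, any MCP client can call them, not just Claude Code. Below is a minimal sketch using the MCP Python SDK's stdio client; the server path is a placeholder you would point at your own clone.

```python
# Minimal sketch of calling the Monolith tools from any MCP client.
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client


async def main() -> None:
    params = StdioServerParameters(
        command="python", args=["/path/to/Monolith/mcp-modal/server.py"]
    )
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            result = await session.call_tool(
                "chat_rlm_query",
                {"query": "What did we decide about the auth flow?", "thread_id": "myproject"},
            )
            print(result.content)


asyncio.run(main())
```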
Add to `.claude/settings.local.json` to automatically capture every Claude Code session:

```json
{
  "hooks": {
    "Stop": [{
      "type": "command",
      "command": "/path/to/Monolith/scripts/session_end_upload.sh"
    }]
  }
}
```

Each transcript is uploaded with metadata (developer, git branch, timestamps, message count) so the RLM can reason over your full development history.
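Roughly, that metadata gathering amounts to the sketch below, which assembles a header of that kind before the transcript is stored. Every field name, helper, and the crude message count are illustrative assumptions, not the actual interface of `session_end_upload.sh`.

```python
# Illustrative Python equivalent of the metadata a Stop-hook uploader might gather.
import datetime
import json
import subprocess
import sys


def build_payload(transcript_path: str, session_id: str) -> dict:
    transcript = open(transcript_path).read()
    git = lambda *args: subprocess.run(  # small helper for git metadata
        ["git", *args], capture_output=True, text=True
    ).stdout.strip()
    meta = {
        "developer": git("config", "user.name"),
        "git_branch": git("rev-parse", "--abbrev-ref", "HEAD"),
        "ended_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "message_count": transcript.count('"role":'),  # crude count for JSONL transcripts
    }
    return {
        "transcript": json.dumps(meta) + "\n" + transcript,
        "session_id": session_id,
        "thread_id": "transcripts",
    }


if __name__ == "__main__":
    print(json.dumps(build_payload(sys.argv[1], sys.argv[2]), indent=2)[:500])
```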
```
Monolith/
├── mcp-modal/                   # MCP + Modal deployment layer
│   ├── server.py                # MCP server (stdio)
│   ├── modal_runtime.py         # Modal functions + HTTP endpoints
│   ├── rlm/                     # RLM package (mounted into Modal image)
│   └── cloudflare/              # Cloudflare Worker gateway
├── rlm/                         # Core RLM (forked from alexzhang13/rlm)
│   ├── rlm/
│   │   ├── rlm_repl.py          # RLM_REPL — recursive reasoning loop
│   │   ├── repl.py              # Sandboxed REPL with llm_query()
│   │   ├── sub_rlm_worker.py    # Sub-LLM worker for Modal Sandboxes
│   │   └── utils/
│   │       ├── llm.py           # OpenAI client wrapper
│   │       └── prompts.py       # System prompts + 3-phase strategy
│   └── main.py                  # Needle-in-haystack example
├── monolith/                    # CLI entry points
│   ├── query.py                 # python -m monolith.query
│   └── store.py                 # python -m monolith.store
└── scripts/
    └── session_end_upload.sh    # Stop hook for auto-upload
```
- Recursive Language Models — Zhang, Kraska & Khattab (2025)
- RLM blog post and original codebase
- Model Context Protocol
- Modal