feat(rlm): add Recursive Language Model mode for large codebase review#2

Open
anderskev wants to merge 29 commits into main from feat/rlm-code-review

Conversation


@anderskev anderskev commented Feb 2, 2026

Summary

This PR implements improvements to the RLM (Recursive Language Model) code review system. The changes are informed by analysis of the RLM research paper (Khattab et al.), which documented challenges similar to those we encountered during implementation.

What is an RLM?

Traditional code review loads the entire codebase into context, which becomes impractical for large repositories. Even frontier models exhibit context rot, where quality degrades as context length increases. The paper demonstrates that the effective context window cannot be understood independently of task complexity: more complex problems degrade at shorter lengths.

The key insight: Long prompts should not be fed into the neural network directly but should instead be treated as part of the environment that the LLM can symbolically interact with.

┌─────────────────────────────────────────────────────────────────┐
│                    Traditional LLM Approach                      │
│  [Entire Codebase] ──────────────────────────────► [LLM Context] │
│       100K+ tokens                                  Context Rot! │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│                      RLM Approach                                │
│  [Codebase as Variable] ◄────► [Python REPL] ◄────► [LLM]       │
│       In Memory              Code Execution      Fresh Context   │
│                                                                  │
│  LLM writes code to:                                             │
│  1. Peek at structure (repo.files, repo.structure)               │
│  2. Filter relevant files (files_containing(), files_importing())│
│  3. Recursively call sub-LLMs on targeted snippets (llm_query()) │
│  4. Build findings incrementally in variables                    │
│  5. Return final report via FINAL() or FINAL_VAR()              │
└─────────────────────────────────────────────────────────────────┘
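To make the diagram concrete, here is an illustrative sketch of the kind of code the root LLM emits in one turn. The environment names (`repo.files`, `files_containing()`, `llm_query()`, `FINAL()`) come from the diagram above, but the stub implementations below are stand-ins, not the actual API:

```python
# Stand-in stubs for the REPL environment; the real versions are provided
# by rlm/repl.py and are not shown in this PR description.
repo_files = {
    "auth.py": "def authenticate(user): ...",
    "utils.py": "def slugify(s): ...",
}

def files_containing(needle: str) -> list[str]:
    """Stub: return paths whose content mentions the needle."""
    return [p for p, src in repo_files.items() if needle in src]

def llm_query(prompt: str) -> str:
    """Stub: a real sub-LLM call would return a fresh-context review."""
    return f"[review of {len(prompt)} chars]"

# What the root LLM's emitted code does: filter, fan out, aggregate.
relevant = files_containing("def authenticate")
findings = [llm_query(f"Review:\n{repo_files[p]}") for p in relevant]
report = "\n\n".join(findings)
```

The key point is that only the targeted snippets ever enter an LLM context; the full codebase stays in Python variables.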

Research Paper Insights Applied

Based on analysis of the recursive language models paper (Khattab et al.), we identified several key learnings:

1. FINAL() brittleness is a known issue

The paper noted "distinguishing between a final answer and a thought is brittle" and they "added minor safeguards" but punted full resolution to future model training. Our implementation provides robust safeguards with escape hatches.

2. Specific character targets are effective

The paper's fix for Qwen making too many sub-LLM calls was to add explicit guidance: "aim for around ~200k characters per call", noting that "sub-LLMs can handle around 500K chars". We've incorporated these specific targets.

3. Models make non-optimal choices

The paper observed models "carefully construct their final answer... but then discard this information and continue wasting sub-calls". We've added detection for this wasteful pattern.

4. Efficiency framing works better than punishment

The paper framed batching guidance around "runtime costs" rather than rejection. We've updated our warning messages accordingly.

Changes in This Iteration

  • Escalating prompts with efficiency-focused messaging (not just rejection warnings)
  • Force-accept escape hatch after max rejections to prevent infinite loops
  • Exploration score escape hatch for legitimate edge cases
  • Scaled thresholds based on repository size (logarithmic scaling for 100K to 10M+ token repos)
  • Agent behavior guardrails:
    • Block filesystem imports (pathlib, os, subprocess, shutil)
    • Enforce sub-LLM usage for large repos
    • Iteration checkpoints with strategy nudges
    • Pre-populated first turn showing repo context
    • Tool usage metrics tracking
  • PR mode detection via get_changed_files() from git diff
  • Review quality validation ensuring actual code analysis, not architecture summaries
  • FINAL() auto-sanitization converting problematic backtick patterns to FINAL_VAR()
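The logarithmic threshold scaling mentioned above could be sketched as follows. The constants here (`base`, `per_decade`, the 100K-token floor) are illustrative knobs, not the PR's actual values:

```python
import math

def scaled_threshold(repo_tokens: int, base: int = 3, per_decade: int = 2) -> int:
    """Sketch: scale a per-repo limit (e.g. minimum sub-LLM calls before
    FINAL is accepted) logarithmically with repo size. base and per_decade
    are made-up defaults for illustration."""
    if repo_tokens <= 100_000:
        return base
    # Add per_decade for every order of magnitude above 100K tokens.
    decades = math.log10(repo_tokens / 100_000)
    return base + round(per_decade * decades)
```

With these defaults a 100K-token repo gets the base threshold, a 1M-token repo gets base + 2, and a 10M-token repo gets base + 4, so requirements grow gently rather than linearly with size.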

Changes Planned for Next Iteration

See PLAN-rlm-fixes.md for detailed task breakdown:

Task 1.5: Wasteful Restart Pattern Detection

Detect when agent re-initializes variables or makes redundant queries that were already answered.
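One cheap way to detect re-initialization is to compare the names assigned in consecutive turns. This is a hypothetical sketch of Task 1.5, not the planned implementation:

```python
import ast

def detect_reinitialized_vars(previous_code: str, new_code: str) -> set[str]:
    """Hypothetical detector for the wasteful-restart pattern: variables the
    agent already assigned in an earlier turn being assigned again from
    scratch in the new turn."""
    def assigned_names(src: str) -> set[str]:
        names: set[str] = set()
        for node in ast.walk(ast.parse(src)):
            if isinstance(node, ast.Assign):
                for target in node.targets:
                    if isinstance(target, ast.Name):
                        names.add(target.id)
        return names

    return assigned_names(previous_code) & assigned_names(new_code)
```

A non-empty intersection (e.g. `findings = []` appearing again) would trigger a nudge reminding the agent that its variables persist across turns.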

Tasks 2.1-2.4: Improved Code Block Extraction

  • Extract LAST code block (most likely to be executable)
  • Syntax validation before execution
  • String literal detection for edge cases

Tasks 3.1-3.5: Enhanced Prompts

  • Capacity guidelines (~500K chars/call)
  • Tiered function ordering in prompt
  • Query similarity tracking to detect redundant sub-LLM calls
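Query similarity tracking could be as simple as token-set overlap against queries already issued this session. The 0.8 threshold and class shape below are assumptions for illustration:

```python
def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity; a cheap proxy for 'redundant query'."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

class QueryTracker:
    """Hypothetical sketch: flag sub-LLM queries that closely repeat one
    already asked. Threshold of 0.8 is an assumed default."""
    def __init__(self, threshold: float = 0.8) -> None:
        self.threshold = threshold
        self.seen: list[str] = []

    def is_redundant(self, query: str) -> bool:
        redundant = any(jaccard(query, q) >= self.threshold for q in self.seen)
        self.seen.append(query)
        return redundant
```

A redundant hit would add an efficiency-framed note to the continuation prompt rather than rejecting the turn outright.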

Architecture

cli.py (--rlm flag)
    │
    ▼
runner.py (_run_rlm_mode / run_rlm_review_with_fallback)
    │
    ▼
rlm/runner.py (RLMRunner orchestration loop)
    │
    ├──► get_review_prompt() ──► Comprehensive prompt with strategy guidance
    │
    ├──► _extract_code() ──► Smart extraction (skip examples, prefer FINAL)
    │    └──► _sanitize_final_call() ──► Auto-convert FINAL() with backticks
    │
    ├──► _validate_review_quality() ──► Check for severity markers, file refs
    │
    ├──► _build_continuation_prompt() ──► History + results + code-only enforcement
    │
    └──► REPL execution loop:
         │
         ▼
    rlm/repl.py (REPLProcess)
         │
         ├──► Import validation (blocks pathlib, os, subprocess, shutil)
         ├──► repo.files, repo.structure (codebase access)
         ├──► repo.changed_files (PR mode - auto-populated from git)
         ├──► llm_query(), llm_query_parallel() (sub-LLM calls)
         ├──► files_containing(), files_importing() (search)
         ├──► file_exists(), list_files_matching() (discovery)
         ├──► get_file_slice() (large file handling)
         └──► FINAL(), FINAL_VAR() (completion signals)
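The import-validation step shown in the diagram could work by scanning the agent's code with `ast` before execution. This sketch assumes a simple violations-list return shape; the real `rlm/repl.py` implementation may differ:

```python
import ast

BLOCKED_MODULES = {"pathlib", "os", "subprocess", "shutil"}

def check_imports(code: str) -> list[str]:
    """Sketch of the import-validation guardrail: reject filesystem modules
    before the code reaches the REPL, so the agent must use the provided
    repo.* accessors instead of reading disk directly."""
    violations: list[str] = []
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Import):
            violations += [a.name for a in node.names
                           if a.name.split(".")[0] in BLOCKED_MODULES]
        elif isinstance(node, ast.ImportFrom) and node.module:
            if node.module.split(".")[0] in BLOCKED_MODULES:
                violations.append(node.module)
    return violations
```

Any violation would short-circuit execution and feed an explanatory error back into the continuation prompt.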

Test Plan

  • Unit tests for escalating prompts and threshold calculations
  • Unit tests for code block extraction edge cases
  • Integration test with mock LLM that doesn't comply
  • Manual testing on small (<100K tokens), medium (100K-1M tokens), and large (1M+ tokens) repos

Testing Status

============================= test session starts ==============================
124 passed in 34.37s
mypy: Success, no issues found in 17 source files
ruff: All checks passed!

Checklist

  • Code follows project style guidelines
  • Self-review completed
  • Tests pass locally (124/124 passed)
  • Linting passes (ruff + mypy clean)
  • Documentation includes learnings from paper and implementation

🤖 Generated with Claude Code

anderskev and others added 16 commits February 1, 2026 21:45
Add design document for Recursive Language Model (RLM) integration
into Daydream for reviewing large codebases (1M+ tokens).

Key components:
- REPL environment API with repo.files, repo.structure, repo.services
- IPC via JSON-RPC over stdin/stdout with devcontainer sandbox
- Sub-LLM calls with fresh context (Haiku for sub-calls, Opus for root)
- System prompt aligned with RLM paper best practices

Also adds prompts module with production-ready system prompt that
incorporates paper learnings: iterative framing, batching warnings,
concrete examples, and verification patterns.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add clarifications based on RLM paper analysis:
- Add repo.file_sizes to planning context for batch sizing
- Document llm_query_parallel() enhancement over paper baseline
- Explain truncation as core design principle (not limitation)
- Add Design Decisions section covering:
  - Recursion depth = 1 rationale
  - Parallel sub-LLM execution benefits
  - Devcontainer sandboxing approach
- Resolve open question on recursion depth

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add inline documentation based on RLM paper analysis:
- Document stateless nature of llm_query() calls
- Add batch size recommendations (~100-200k chars per call)
- Note sub-LLM capacity (~500K chars)
- Explain why buffer variables enable unbounded output length
- Clarify FINAL_VAR() preference for complex reports

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Transform the RLM code review design from a draft overview into a
comprehensive implementation plan with TDD-style task specifications,
code snippets, and phased delivery structure.

Key additions:
- Three-phase implementation roadmap
- Detailed task breakdowns with test-first approach
- Code snippets for each component
- JSON-RPC protocol specification
- System prompt and review output formats
- Container lifecycle management design

Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add RLM mode execution path that loads codebase and displays stats
before indicating full orchestration is pending. Includes test for
the integration and fixes lint issues in repl.py.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Implement the full orchestration loop in RLMRunner.run():

- Add _build_system_prompt() method that generates a detailed system
  prompt describing the REPL environment, available functions (repo,
  llm_query, files_containing, files_importing, get_file_slice, FINAL,
  FINAL_VAR), and codebase metadata (file count, languages, token count,
  largest files)

- Add _extract_code() method to parse Python code from markdown fenced
  blocks in LLM responses

- Add _handle_llm_query() callback for routing sub-LLM calls from the
  REPL back to the LLM

- Add _build_continuation_prompt() to construct follow-up prompts that
  include execution results and errors

- Implement the main loop: prompt LLM -> extract code -> execute in
  REPL -> check for FINAL -> repeat until done or max_iterations

- Add max_iterations config option (default 50) to prevent infinite
  loops

- Handle errors gracefully by continuing after REPL errors and
  reporting them to the LLM for correction

- Add comprehensive integration tests that mock at the LLM boundary to
  verify the full orchestration flow

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Implement graceful degradation when RLM mode fails:

- Add run_rlm_review_with_fallback() function that catches RLM-specific
  errors (REPLCrashError, HeartbeatFailedError, ContainerError) and
  falls back to standard skill-based review
- Refactor _run_rlm_mode() to use the new fallback wrapper
- Extract run_rlm_review() and run_standard_review() as separate
  functions for better testability
- Add helper _skill_to_languages() to map skill names to language lists
- Add comprehensive tests for fallback triggering on each error type

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@anderskev anderskev added labels enhancement (New feature or request), area:cli (CLI commands and interface), and area:runner (Main orchestration and run loop) on Feb 2, 2026

coderabbitai bot commented Feb 2, 2026

Walkthrough

Adds a CLI --rlm flag wired into RunConfig.rlm_mode. Introduces daydream.prompts with review prompt builders and daydream.rlm package composed of errors, ipc, container, environment, repl, history, and runner modules implementing DevContainer management, JSON‑RPC IPC, REPL execution and namespace, conversation history, and an orchestration runner. Adds RLM timeout/limit constants to daydream.config, RLM-related tests under tests/rlm, CLI integration in daydream/runner.py, and a design doc outlining the RLM code‑review workflow.

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main change: adding a Recursive Language Model mode for large codebase review. It is specific, clear, and reflects the primary feature introduced by this PR.
Description check ✅ Passed The description is comprehensive and directly related to the changeset. It explains the RLM approach, references the research paper foundation, details architectural changes, and lists testing outcomes. It clearly covers the PR's primary objectives and implementation.
Docstring Coverage ✅ Passed Docstring coverage is 85.99% which is sufficient. The required threshold is 80.00%.


@anderskev anderskev self-assigned this Feb 2, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 7

🤖 Fix all issues with AI agents
In `@daydream/rlm/container.py`:
- Around line 124-141: The stop() method can hang because proc.communicate() has
no timeout; update daydream.rlm.container.Container.stop to wrap the
communicate() call in an asyncio.wait_for with a reasonable timeout (e.g.,
configurable constant), catch asyncio.TimeoutError to terminate the subprocess
(proc.kill()/proc.terminate() and await proc.wait()), and still clear
self.container_id in the finally block; reference stop(), proc.communicate(),
proc.kill()/proc.terminate(), proc.wait(), self.container_id, and
self.is_running when making the change.
- Around line 143-161: The exec_command implementation currently ignores the
stdin parameter and raises RuntimeError; update exec_command to raise the
module's ContainerError (not RuntimeError) when the container is not running or
container_id is None, and if stdin is provided, connect it to the command's
stdin writer by creating a piping task that reads from the provided
asyncio.StreamReader stdin and writes into the returned stdin StreamWriter (or
the container attach method if available), ensuring the returned tuple
(stdin_writer, stdout_reader, stderr_reader) is actually wired to that stdin
source; reference exec_command, stdin, is_running, container_id and
ContainerError when making the change.

In `@daydream/rlm/ipc.py`:
- Around line 99-134: In decode_message, validate JSON-RPC 2.0 conformance:
first assert data.get("jsonrpc") == "2.0" and raise a descriptive exception if
missing/wrong; when handling a request (if "method" in data) ensure "method" is
present and a string and then return JsonRpcRequest(method=...,
id=data.get("id"), params=data.get("params")); when handling a response (no
"method") enforce that exactly one of "result" or "error" is present (raise if
both or neither), and if "error" is present validate it contains "code" and
"message" before constructing JsonRpcError(code=..., message=..., data=...) and
returning JsonRpcResponse(id=data.get("id"), result=..., error=...).

In `@daydream/rlm/runner.py`:
- Line 70: The RLMConfig.timeout value is never enforced in the orchestration
loop; update the run() method in runner.py to enforce a wall-clock timeout by
recording a start timestamp at the beginning of run(), checking elapsed time
against RLMConfig.timeout at the top (or bottom) of each iteration of the main
loop, and breaking/raising a TimeoutError or returning a timeout status when
elapsed >= timeout; reference the RLMConfig.timeout field and the run() function
name so you add the check in the loop that currently uses max_iterations only,
and ensure any necessary cleanup/termination logic is executed when timing out.

In `@daydream/runner.py`:
- Around line 268-270: The RLM branch returns early without applying the chosen
model, so either pass the model through the RLM call or set it before returning:
update _run_rlm_mode to accept a model parameter (e.g., _run_rlm_mode(config,
target_dir, model)) and propagate config.model into run_rlm_review_with_fallback
inside that chain, or call set_model(config.model) immediately before the early
return in the config.rlm_mode branch so the RLM orchestration uses the selected
model when its TODO is implemented.

In `@docs/plans/2026-02-01-rlm-code-review-design.md`:
- Around line 21-23: Replace bolded subsection lines like "**Step 1: Write the
failing test**" with proper Markdown headings (for example "### Step 1: Write
the failing test") throughout the document to satisfy MD036; search for other
emphasized step lines (e.g., "**Step 2: ...**", "**Step 3: ...**") and convert
them consistently to heading syntax so they render as actual headings rather
than bold text.

In `@tests/test_cli.py`:
- Around line 39-59: The test test_rlm_mode_prints_not_implemented uses a weak
assertion that passes if exit_code == 0 even when no output indicates RLM;
update the assertion to require both a zero exit code and the expected output
string (e.g., require exit_code == 0 AND "RLM" in captured.out) so the test
actually verifies that run(RunConfig(..., rlm_mode=True)) prints the
not-implemented message; locate this logic in the test function
test_rlm_mode_prints_not_implemented and change the final assertion accordingly.
🧹 Nitpick comments (11)
daydream/prompts/review_system_prompt.py (1)

404-408: Guard PR-context insertion against base-prompt drift.
If the base prompt text changes, the replace call silently skips the PR context. Add a marker check (or placeholder) so failures are explicit.

Proposed fix
     base_prompt = build_review_system_prompt(metadata)
@@
-    return base_prompt.replace(
-        "Begin by exploring the codebase structure.",
-        pr_context + "\nBegin by exploring the changed files and their context.",
-    )
+    marker = "Begin by exploring the codebase structure."
+    if marker not in base_prompt:
+        raise ValueError("Base prompt missing insertion marker")
+    return base_prompt.replace(
+        marker,
+        pr_context + "\nBegin by exploring the changed files and their context.",
+    )
daydream/rlm/environment.py (2)

142-147: Tighten files_importing to avoid prefix false positives.
files_importing("os") currently matches import osx. Add word-boundary handling (and optional submodule dots) to reduce noise.

Proposed fix
-        pattern = rf"(?:^|\n)\s*(?:import\s+{re.escape(module)}|from\s+{re.escape(module)}\s+import)"
-        compiled = re.compile(pattern)
+        pattern = (
+            rf"^\s*(?:import\s+{re.escape(module)}(?:\b|\.|$)"
+            rf"|from\s+{re.escape(module)}(?:\.|\b)\s+import)"
+        )
+        compiled = re.compile(pattern, re.MULTILINE)

152-158: Add bounds checks for get_file_slice inputs.
Negative/zero starts or end < start can yield surprising slices due to Python’s negative indexing.

Proposed fix
     def get_file_slice(path: str, start_line: int, end_line: int) -> str:
         """Get specific line range from a file (1-based, inclusive)."""
         content = ctx.files.get(path, "")
         lines = content.split("\n")
+        if start_line < 1 or end_line < start_line:
+            raise ValueError("start_line must be >= 1 and <= end_line")
+        end_line = min(end_line, len(lines))
         # Convert to 0-based index, end_line is inclusive
         selected = lines[start_line - 1 : end_line]
         return "\n".join(selected)
tests/rlm/test_repl.py (1)

51-70: Consider consolidating duplicate test methods.

test_repl_process_init and test_repl_process_not_running_initially are nearly identical—both create a RepoContext, instantiate REPLProcess, and assert is_running is False. Consider merging them or differentiating their purpose if they test distinct behaviors.

tests/rlm/test_runner_integration.py (1)

90-126: Consider strengthening iteration count assertion.

Line 126 asserts len(iterations) >= 1, but given the mock behavior (code on first call, FINAL on second), the expected count should be at least 2. The current assertion may pass even if the orchestration loop doesn't complete its second iteration.

Proposed fix
-        assert len(iterations) >= 1
+        assert len(iterations) >= 2  # First call returns code, second returns FINAL
daydream/rlm/container.py (1)

119-122: Consider chaining exceptions for better debugging.

When re-raising ContainerError from TimeoutError or JSONDecodeError, chain the original exception to preserve the traceback context.

Proposed fix
         except asyncio.TimeoutError:
-            raise ContainerError(f"Container startup timed out after {timeout}s")
+            raise ContainerError(f"Container startup timed out after {timeout}s") from None
         except json.JSONDecodeError as e:
-            raise ContainerError(f"Failed to parse devcontainer output: {e}")
+            raise ContainerError(f"Failed to parse devcontainer output: {e}") from e
daydream/rlm/repl.py (1)

111-127: Consider adding separator when combining stdout and stderr.

When stderr is appended to stdout (line 120), there's no newline separator, which could result in confusing mixed output if stdout doesn't end with a newline.

Proposed fix
             # Combine stdout and stderr
             if stderr_output:
-                output = output + stderr_output
+                if output and not output.endswith("\n"):
+                    output = output + "\n"
+                output = output + stderr_output
daydream/runner.py (2)

140-145: Placeholder implementation noted.

The TODO indicates full orchestration is pending. This is acceptable for the initial PR, but tracking for follow-up would be helpful.

Would you like me to open an issue to track the full RLM orchestration implementation?


224-228: Consider specifying encoding for file write.

For consistency with other file operations in the codebase (e.g., read_text(encoding="utf-8") in load_codebase), consider explicitly specifying the encoding.

📝 Suggested improvement
         # Write review output to file if we got content
         if review_output:
             review_output_path = target_dir / REVIEW_OUTPUT_FILE
-            review_output_path.write_text(review_output)
+            review_output_path.write_text(review_output, encoding="utf-8")
             print_success(console, f"Review output written to: {review_output_path}")
daydream/rlm/runner.py (2)

242-242: Empty line when changed_files is None.

The f-string outputs an empty string when changed_files is None/empty, which may leave an extra blank line in the prompt. This is minor and doesn't affect functionality.

📝 Suggested improvement to avoid blank line
 - **Total tokens**: ~{ctx.total_tokens:,} estimated
-{f'- **Changed files (PR mode)**: {len(ctx.changed_files)} files' if ctx.changed_files else ''}
+{f"- **Changed files (PR mode)**: {len(ctx.changed_files)} files" + chr(10) if ctx.changed_files else ""}

Or restructure to build the section conditionally before the f-string.


419-421: Executing raw response may be risky.

When FINAL( is detected in the response but no code block is found, the code executes response.strip() directly. If the response contains markdown or other non-Python content alongside the FINAL call, this could cause execution errors.

📝 Suggested improvement
                 if not code:
                     # No code found - try to extract any runnable content
                     # or ask for clarification
                     if "FINAL(" in response:
-                        # Try to execute the raw response
-                        code = response.strip()
+                        # Try to extract just the FINAL call
+                        import re
+                        final_match = re.search(r'(FINAL\([^)]*\))', response)
+                        if final_match:
+                            code = final_match.group(1)
+                        else:
+                            code = response.strip()
                     else:

Alternatively, consider requiring code blocks and prompting the LLM to format correctly.

anderskev and others added 3 commits February 2, 2026 09:00
- Require explicit python/py tag in code block regex to prevent
  extracting ASCII diagrams and non-Python blocks as executable code
- Add ConversationHistory class with token budget enforcement to
  maintain context between iterations
- Update prompts to guide LLM toward FINAL_VAR() for complex reports
  containing code examples or triple-quotes
- Refactor DevContainer to use trailofbits/claude-code-devcontainer CLI
- Add query_llm_simple() for headless LLM queries without Rich UI
- Wire up full RLM orchestration loop with event logging

Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Add .devcontainer/ with Dockerfile and config for Claude Code sandbox
- Add docs/testing/test-plan.yaml for RLM feature testing

Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
These were development artifacts that are no longer needed:
- .devcontainer/ - Now using external trailofbits/claude-code-devcontainer
- docs/testing/test-plan.yaml - Development-time test plan for RLM feature

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🤖 Fix all issues with AI agents
In `@.devcontainer/Dockerfile`:
- Line 43: The Dockerfile currently pulls the uv image using the floating tag
"ghcr.io/astral-sh/uv:latest" which harms reproducibility; change the image
reference in the COPY instruction (the "ghcr.io/astral-sh/uv:latest" token) to a
pinned version or digest (for example a specific semver tag or an sha256 digest)
so builds are deterministic, and update any related documentation or build
scripts that assume :latest.

In `@daydream/rlm/repl.py`:
- Around line 111-172: The execute method currently catches Exception but not
SystemExit, so model-generated sys.exit() will terminate the host; add an
explicit except SystemExit block (placed before the generic except Exception)
inside execute of the REPL class to intercept SystemExit, capture stdout/stderr
(use stdout_capture.getvalue() and stderr_capture.getvalue() or
traceback.format_exc()) and return an ExecuteResult with the output and an
appropriate error message (e.g., include the SystemExit code or traceback) and
final_answer=None so the host process is not exited; reference the execute
function, StreamingStringIO stdout_capture/stderr_capture, and ExecuteResult
when implementing this change.

In `@daydream/rlm/runner.py`:
- Around line 596-598: The current synchronous call result =
self._repl.execute(code) blocks the event loop and can deadlock when REPL code
calls llm_query (which uses llm_callback and run_coroutine_threadsafe); change
it to run the REPL execution on a worker thread by replacing the direct call
with an awaited asyncio.to_thread(self._repl.execute, code) inside the async
context, ensure asyncio is imported and that the containing coroutine (e.g., the
method where this appears) awaits that call so REPL execution no longer blocks
the loop.

In `@daydream/runner.py`:
- Around line 122-206: query_llm_simple currently doesn't set the global client
pointer or use AgentState for model/debugging; update it to mirror run_agent by
assigning _current_client = client when the ClaudeSDKClient (or equivalent) is
created, use effective_model = model if model is not None else _state.model
instead of taking model directly, and replace direct _log_debug(...) calls with
_state.debug_log(...) so all state/logging flows through AgentState; ensure any
cleanup or overwrites of _current_client follow the same lifecycle pattern as
run_agent.
🧹 Nitpick comments (4)
daydream/agent.py (1)

185-250: Consider tracking _current_client for graceful shutdown support.

Unlike run_agent, this function doesn't set _current_client, which means external shutdown handling (e.g., signal handlers) won't be able to cancel an in-progress query. If RLM mode needs to be interruptible, this could cause issues during graceful termination.

If this is intentional for the simpler use case, consider documenting that in the docstring. Otherwise:

♻️ Proposed fix to track client for shutdown
 async def query_llm_simple(
     cwd: Path,
     prompt: str,
     model: str = "opus",
     label: str | None = None,
 ) -> str:
     """Simple LLM query returning text response only.

     Unlike run_agent(), this function:
     - Does not render Rich UI output
     - Does not support structured output
     - Returns only text content
     - Logs all fields needed for Rich UI upgrade
+    - Does not support graceful shutdown interruption

     ...
     """
+    global _current_client
     log_prefix = f"[{label}] " if label else ""
     ...
     async with ClaudeSDKClient(options=options) as client:
+        _current_client = client
-        await client.query(prompt)
-        async for msg in client.receive_response():
+        try:
+            await client.query(prompt)
+            async for msg in client.receive_response():
             ...
+        finally:
+            _current_client = None

Based on learnings: "maintain _current_client for SDK instance access".

.devcontainer/Dockerfile (1)

91-95: Consider exposing FZF_VERSION as a build arg in devcontainer.json for consistency.

Other version ARGs (GIT_DELTA_VERSION, ZSH_IN_DOCKER_VERSION) are configured in devcontainer.json, but FZF_VERSION is hardcoded only in the Dockerfile. For consistency and easier maintenance, consider adding it to devcontainer.json build args.

.devcontainer/post_install.py (1)

26-29: Silent suppression of JSONDecodeError may hide corruption.

If settings.json exists but contains malformed JSON, the file will be silently overwritten with default settings. Consider logging a warning when this occurs so users are aware their existing settings were replaced.

     settings = {}
     if settings_file.exists():
-        with contextlib.suppress(json.JSONDecodeError):
-            settings = json.loads(settings_file.read_text())
+        try:
+            settings = json.loads(settings_file.read_text())
+        except json.JSONDecodeError:
+            print(f"[post_install] Warning: Existing {settings_file} is malformed, resetting", file=sys.stderr)
.devcontainer/devcontainer.json (1)

16-19: Elevated network capabilities require justification.

The container requests NET_ADMIN and NET_RAW capabilities. These are needed for the network tools (iptables, ipset) installed in the Dockerfile for security testing. Consider adding a comment in the JSON (or a README note) explaining why these capabilities are required, so future maintainers understand the security implications.

- Add JSON-RPC 2.0 spec validation to decode_message (ipc.py)
- Add SystemExit handler in REPL to prevent host termination (repl.py)
- Add _current_client tracking and _state.model fallback to query_llm_simple (agent.py)
- Strengthen test assertion to require exit_code == 0 (test_cli.py)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
coderabbitai bot previously approved these changes Feb 2, 2026
@anderskev anderskev changed the title from "feat(rlm): add RLM mode for large codebase review" to "feat(rlm): add Recursive Language Model mode for large codebase review" on Feb 2, 2026
The LLM sometimes includes documentation examples with embedded code
blocks (e.g., `@dataclass` class definitions) that were being extracted
instead of the actual executable code.

Changes:
- Add smart code extraction that prefers blocks with FINAL/FINAL_VAR
- Skip blocks that look like examples (bare decorators, short snippets)
- Fall back to last code block (most likely executable)
- Add documentation clarifying repo.files values are strings directly

Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@tests/rlm/test_runner.py`:
- Around line 380-426: The async test function
test_consecutive_error_limit_exits_loop is missing the pytest asyncio marker;
add the `@pytest.mark.asyncio` decorator immediately above
test_consecutive_error_limit_exits_loop and ensure pytest is imported in the
test module if not already present so the async test runs correctly when calling
runner.run() and asserting behavior around MAX_CONSECUTIVE_ERRORS and emitted
events.
- Around line 428-475: The async test function
test_consecutive_error_counter_resets_on_valid_code is missing the pytest
asyncio marker; add `@pytest.mark.asyncio` immediately above the function
definition and ensure pytest is imported in the test module (add "import pytest"
at the top if not present) so the async test runs reliably with pytest;
reference the test function name
test_consecutive_error_counter_resets_on_valid_code and the test module
tests/rlm/test_runner.py when making the change.
🧹 Nitpick comments (2)
daydream/prompts/review_system_prompt.py (1)

414-418: Consider a more robust prompt composition approach.

The str.replace() approach for injecting PR context is fragile: if the target string in the base prompt changes, the replacement silently does nothing. The current implementation works, but future prompt refactoring could break it without warning.

🔧 Alternative: Use a placeholder marker
 # In build_review_system_prompt, at line 370:
-Begin by exploring the codebase structure.
+{begin_instruction}
 '''
+    # Then inject the appropriate instruction based on context

Or add a simple assertion to catch mismatches during development:

    result = base_prompt.replace(
        "Begin by exploring the codebase structure.",
        pr_context + "\nBegin by exploring the changed files and their context.",
    )
    # Sanity check that replacement occurred
    assert "Begin by exploring the changed files" in result, "PR prompt injection failed"
    return result
tests/rlm/test_runner_integration.py (1)

1-18: Unused imports can be cleaned up.

AsyncMock, MagicMock, and patch are imported but not used in the visible test code. If they're needed for tests not shown, this is fine; otherwise, consider removing them.

-from unittest.mock import AsyncMock, MagicMock, patch
+# Remove if unused, or keep if needed for additional tests

Comment on lines +380 to +426
async def test_consecutive_error_limit_exits_loop(self, tmp_path):
    """Runner should exit after MAX_CONSECUTIVE_ERRORS consecutive no-code responses."""
    # Create a minimal test file
    (tmp_path / "main.py").write_text("x = 1")

    cfg = RLMConfig(
        workspace_path=str(tmp_path),
        languages=["python"],
        use_container=False,  # Don't use container for testing
        max_iterations=20,  # Higher than MAX_CONSECUTIVE_ERRORS
    )
    runner = RLMRunner(cfg)

    # Track how many times LLM was called
    call_count = 0

    async def mock_llm(prompt: str) -> str:
        nonlocal call_count
        call_count += 1
        # Always return a response without valid code blocks
        return "I'll analyze the code but I forgot to include code blocks."

    runner._call_llm = mock_llm

    # Track emitted events
    events = []

    def track_events(iteration: int, event_type: str, data: dict) -> None:
        events.append((iteration, event_type, data))

    runner._on_event = track_events

    # Run the review
    result = await runner.run()

    # Should have exited after MAX_CONSECUTIVE_ERRORS
    assert call_count == MAX_CONSECUTIVE_ERRORS
    assert "Code Review (Failed)" in result
    assert f"{MAX_CONSECUTIVE_ERRORS} consecutive iterations" in result
    assert "without valid Python code" in result

    # Verify consecutive_error_limit event was emitted
    error_limit_events = [
        e for e in events if e[1] == "consecutive_error_limit"
    ]
    assert len(error_limit_events) == 1
    assert error_limit_events[0][2]["consecutive_errors"] == MAX_CONSECUTIVE_ERRORS


⚠️ Potential issue | 🟡 Minor

Missing @pytest.mark.asyncio decorator on async test.

The test is defined as async def but lacks the @pytest.mark.asyncio decorator. While some pytest configurations auto-detect async tests, explicit marking is best practice and ensures consistent behavior across different environments.

🔧 Add pytest marker
+    @pytest.mark.asyncio
     async def test_consecutive_error_limit_exits_loop(self, tmp_path):

Comment on lines +428 to +475
async def test_consecutive_error_counter_resets_on_valid_code(self, tmp_path):
    """Consecutive error counter should reset when valid code is found."""
    (tmp_path / "main.py").write_text("x = 1")

    cfg = RLMConfig(
        workspace_path=str(tmp_path),
        languages=["python"],
        use_container=False,
        max_iterations=10,
    )
    runner = RLMRunner(cfg)

    call_count = 0

    async def mock_llm(prompt: str) -> str:
        nonlocal call_count
        call_count += 1
        # First few calls: no code, then valid code, then no code again
        if call_count <= 3:
            # No code for first 3 calls
            return "Analyzing the codebase..."
        elif call_count == 4:
            # Valid code that calls FINAL
            return '```python\nFINAL("Review complete")\n```'
        else:
            # Should not reach here if FINAL works
            return "More analysis..."

    runner._call_llm = mock_llm

    events = []

    def track_events(iteration: int, event_type: str, data: dict) -> None:
        events.append((iteration, event_type, data))

    runner._on_event = track_events

    result = await runner.run()

    # Should have completed with FINAL, not hit error limit
    assert "Review complete" in result
    assert "Code Review (Failed)" not in result

    # Verify no consecutive_error_limit event was emitted
    error_limit_events = [
        e for e in events if e[1] == "consecutive_error_limit"
    ]
    assert len(error_limit_events) == 0


⚠️ Potential issue | 🟡 Minor

Missing @pytest.mark.asyncio decorator on async test.

Same issue as above — add explicit pytest marker for consistent behavior.

🔧 Add pytest marker
+    @pytest.mark.asyncio
     async def test_consecutive_error_counter_resets_on_valid_code(self, tmp_path):

anderskev and others added 2 commits February 2, 2026 20:50
Analysis of test_run.log revealed several inefficiencies causing ~33%
wasted iterations. These changes address the key issues identified:

**Single code block guidance**: The RLM paper shows GPT-5's iterative
single-block pattern (91.33%) dramatically outperformed Qwen3-Coder's
multi-block pattern (44.66%). Added explicit guidance to generate ONE
code block per turn to enable proper observe-act-observe feedback loop.

**Prominent API warning**: The model repeatedly tried `repo.files[path].content`
when the values are already strings. Moved the warning to a dedicated section
with an emoji header (## ⚠️ CRITICAL API NOTE) for visibility.

**File existence helpers**: Added `file_exists()` and `list_files_matching()`
to help models verify paths before reading, reducing hallucinated file errors.

**Code-only enforcement**: Added "RESPOND WITH PYTHON CODE ONLY" to the
continuation prompt after observing that iteration 5 generated prose instead
of code, wasting the entire iteration.

**Available paths in recovery**: Show top-level directories in recovery
prompt to help orient models when they're lost.

**Starter template**: Added Quick Start Template with orientation code
to reduce discovery phase iterations.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
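The file-existence helpers named in this commit could look roughly like the following. This is a sketch under stated assumptions: `build_path_helpers` is a hypothetical factory name, and it assumes the repo context exposes files as a dict mapping relative paths to content strings, as the API note above describes.

```python
import fnmatch

def build_path_helpers(files: dict[str, str]):
    """Build path-verification helpers over an in-memory repo mapping (sketch).

    Lets the model probe for paths before reading, reducing hallucinated
    file errors.
    """
    def file_exists(path: str) -> bool:
        # Membership test against the in-memory mapping, not the disk.
        return path in files

    def list_files_matching(pattern: str) -> list[str]:
        # Glob-style match over known paths.
        return sorted(p for p in files if fnmatch.fnmatch(p, pattern))

    return file_exists, list_files_matching
```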
The model sometimes passes a dict or other non-string type to FINAL().
When this happened, len(dict) returned the number of keys instead of
string length, causing confusing "6 chars" log messages and potential
downstream issues.

FINAL_VAR already does str() conversion, so FINAL should match.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
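The coercion fix described above amounts to a one-line change; a minimal sketch, assuming a `FinalAnswer` control-flow signal carries the answer out of the REPL (the real signal type is not shown in this PR):

```python
class FinalAnswer(Exception):
    """Control-flow signal carrying the review text out of the REPL (sketch)."""
    def __init__(self, text: str):
        self.text = text
        super().__init__(text)

def FINAL(answer) -> None:
    # Coerce to str so a dict argument reports its rendered length,
    # not its key count, matching FINAL_VAR's existing behavior.
    raise FinalAnswer(str(answer))
```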

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@daydream/rlm/environment.py`:
- Around line 134-137: The function files_containing currently calls
re.compile(pattern) which will raise re.error for invalid regexes; update
files_containing to catch re.error around re.compile (and/or compiled.search)
and handle it gracefully—either return an empty list or raise a clear ValueError
with the invalid pattern included—so the REPL won't crash; reference the
files_containing function and the compiled variable and ensure you still iterate
ctx.files.items() only when the regex compiled successfully.
- Around line 167-173: The helper get_file_slice currently accepts invalid
1-based indexes and produces incorrect slices (e.g., start_line=0 yields
lines[-1:]); add input validation at the start of get_file_slice to ensure
start_line and end_line are integers >= 1, that start_line <= end_line, and that
end_line does not exceed the number of lines in the file (len(lines)); if any
check fails, raise a ValueError with a clear message referencing the parameters
(start_line, end_line, path) so callers get deterministic errors instead of
silent incorrect results.
🧹 Nitpick comments (1)
daydream/prompts/review_system_prompt.py (1)

444-448: String replacement for prompt injection is fragile.

Using str.replace() to inject PR context relies on the exact string "Begin by exploring the codebase structure." existing in the base prompt. If that text changes, the replacement silently fails and returns the unmodified prompt.

Consider a more robust approach like using a placeholder or structured prompt sections.

♻️ Example using placeholder
 # In build_review_system_prompt, near line 400:
-Begin by exploring the codebase structure.
+{exploration_directive}
 '''
+    # At end of function, before return:
+    return prompt.format(exploration_directive="Begin by exploring the codebase structure.")

 # Then in build_pr_review_prompt:
-    return base_prompt.replace(
-        "Begin by exploring the codebase structure.",
-        pr_context + "\nBegin by exploring the changed files and their context.",
-    )
+    return base_prompt.format(
+        exploration_directive=pr_context + "\nBegin by exploring the changed files and their context."
+    )

anderskev and others added 2 commits February 2, 2026 21:14
Move the blocking self._repl.execute(code) call to a worker thread using
asyncio.to_thread(). This prevents deadlock when REPL code calls llm_query(),
which uses run_coroutine_threadsafe to schedule work on the event loop.

Previously, the synchronous execute() blocked the event loop, so any
llm_query() calls from REPL code would wait forever for the blocked
loop to process their coroutines.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
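The deadlock-avoidance pattern from this commit can be demonstrated in isolation. A minimal sketch, assuming a synchronous `execute` entry point standing in for `self._repl.execute`: because the blocking call runs in a worker thread via `asyncio.to_thread`, code inside it can schedule coroutines back onto the still-running event loop with `run_coroutine_threadsafe`.

```python
import asyncio

async def run_iteration(execute, code: str):
    """Run the blocking REPL call off the event loop (sketch of the fix)."""
    return await asyncio.to_thread(execute, code)

async def main():
    loop = asyncio.get_running_loop()

    async def llm_query_async(prompt: str) -> str:
        # Stand-in for the real sub-LLM call.
        return f"answer to {prompt!r}"

    def execute(code: str) -> str:
        # Simulates REPL code calling llm_query() from the worker thread.
        fut = asyncio.run_coroutine_threadsafe(llm_query_async(code), loop)
        # This would deadlock if execute() ran on the (then-blocked) loop.
        return fut.result()

    return await run_iteration(execute, "analyze main.py")

result = asyncio.run(main())
```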
- Add get_changed_files() to auto-detect PR mode from git diff
- Use comprehensive prompt template with "Probe → Filter → Batch → Aggregate" strategy
- Add mandatory first-turn template to identify review scope immediately
- Auto-sanitize FINAL() calls with embedded backticks to prevent syntax errors
- Add review quality validation (severity markers, file:line refs)
- Warn when agent completes without sub-LLM analysis calls

These changes ensure focused reviews on changed files (when applicable)
and catch superficial reviews that only explore structure without
actually analyzing code.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
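The PR auto-detection described above presumably shells out to git; a sketch of what `get_changed_files()` could look like, assuming a merge-base diff against a base ref (the actual comparison and signature are not shown in this PR):

```python
import subprocess

def get_changed_files(workspace: str, base_ref: str = "main") -> list[str]:
    """List files changed relative to a base ref to activate PR mode (sketch).

    Returns [] when git fails (not a repo, missing ref), so the caller
    falls back to full-repo review.
    """
    try:
        out = subprocess.run(
            ["git", "diff", "--name-only", f"{base_ref}...HEAD"],
            cwd=workspace,
            capture_output=True,
            text=True,
            check=True,
        ).stdout
    except (subprocess.CalledProcessError, FileNotFoundError):
        return []
    return [line for line in out.splitlines() if line.strip()]
```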

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@daydream/prompts/review_system_prompt.py`:
- Around line 470-474: The PR context insertion uses base_prompt.replace("Begin
by exploring the codebase structure.", pr_context + "\nBegin by exploring the
changed files and their context.") which silently does nothing if the marker
string changes; update the logic in the function that builds the prompt (the
call site using base_prompt.replace, pr_context and the marker string) to first
check whether the marker exists in base_prompt and perform the replace if found,
otherwise append (or prepend) pr_context to base_prompt as a fallback so the PR
block is never dropped; ensure you still include the updated "Begin by exploring
the changed files and their context." text when the replace succeeds.

In `@daydream/rlm/runner.py`:
- Around line 763-768: In run(), compute the repository's changed files and pass
that list into load_codebase so PR mode activates: locate the run() function
where you build workspace = Path(self.config.workspace_path) and call
load_codebase(workspace, self.config.languages) and update that call to include
the repo.changed_files (or compute changed_files via the repo object) as an
additional argument; ensure load_codebase signature and any callers accept a
changed_files parameter and that repo.changed_files (or the computed list) is
forwarded so PR-focused prompts reach the REPL.

anderskev and others added 2 commits February 2, 2026 22:43
- Block filesystem imports (pathlib, os, subprocess, shutil) to enforce
  repo.files usage instead of manual file reading
- Enforce minimum sub-LLM calls for large repos (>100k tokens) before
  accepting FINAL() - rejects early finalization with guidance
- Add iteration checkpoints at 5, 10, 15 with strategy nudges
- Improve FINAL() extraction for complex strings with triple-quotes
- Pre-populate first turn with repo context and explicit next steps
- Add RLMMetrics dataclass tracking all tool calls with event emission

These guardrails address issues found in production logs where agents
bypassed the batched sub-LLM analysis strategy by manually reading
files 54 times and never calling llm_query().

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
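The import-blocking guardrail from the first bullet can be sketched with a wrapped `__import__` installed into the REPL's `__builtins__`. This is an illustration, not the PR's implementation; `make_guarded_import` and the exact error message are hypothetical.

```python
import builtins

BLOCKED_MODULES = {"pathlib", "os", "subprocess", "shutil"}

def make_guarded_import(real_import=builtins.__import__):
    """Return an __import__ replacement that rejects filesystem modules (sketch)."""
    def guarded_import(name, globals=None, locals=None, fromlist=(), level=0):
        root = name.split(".")[0]
        if root in BLOCKED_MODULES:
            # Force generated code toward the in-memory repo.files API.
            raise ImportError(
                f"Module {name!r} is blocked; use repo.files to access code."
            )
        return real_import(name, globals, locals, fromlist, level)
    return guarded_import
```

In use, the REPL would exec generated code with `{"__builtins__": {..., "__import__": guarded_import}}` so `import os` raises immediately instead of letting the agent bypass the sub-LLM strategy.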
- Add try/except for re.error in files_containing() to handle invalid regex
- Add input validation/clamping in get_file_slice() for invalid line numbers
- Add fallback for missing marker in PR context insertion
- Pass changed_files to load_codebase() for PR mode support

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@daydream/rlm/runner.py`:
- Around line 773-789: _handle_files_importing currently compiles a regex from
the module name without guarding against invalid patterns; wrap the
re.compile(...) (or the pattern construction) in a try/except that catches
re.error, increments or logs the failure appropriately, and returns an empty
list (keeping metrics consistent), so that invalid module names do not raise;
specifically modify the block that builds pattern and calls re.compile(pattern)
in _handle_files_importing to catch re.error and return [] while preserving the
metrics update behavior.
- Around line 756-771: _handle_files_containing currently compiles a user regex
without handling re.error; wrap the re.compile(pattern) call in a try/except
re.error block, and on exception log the invalid pattern and exception via
self._logger.error (include pattern and error text) and return an empty list so
you don’t proceed to use compiled.search; keep the rest of the logic (iterating
self._context.files and updating self._metrics.unique_files_accessed) unchanged
when compilation succeeds.
🧹 Nitpick comments (3)
daydream/rlm/environment.py (2)

146-152: Unhandled re.error in files_importing if module contains regex metacharacters after escaping edge cases.

While re.escape(module) handles most cases, re.compile can still raise re.error for extremely malformed input or internal issues. For consistency with files_containing, consider wrapping in try/except.

🛡️ Optional: Add defensive error handling
     def files_importing(module: str) -> list[str]:
         """Find files that import a given module."""
         # Match: import X, from X import, import X as Y
         pattern = rf"(?:^|\n)\s*(?:import\s+{re.escape(module)}|from\s+{re.escape(module)}\s+import)"
-        compiled = re.compile(pattern)
-        return [path for path, content in ctx.files.items() if compiled.search(content)]
+        try:
+            compiled = re.compile(pattern)
+        except re.error as e:
+            print(f"[Warning] Invalid module pattern: {e}")
+            return []
+        return [path for path, content in ctx.files.items() if compiled.search(content)]

163-167: Consider adding error handling for invalid glob patterns.

fnmatch.fnmatch doesn't raise on invalid patterns but may produce unexpected results. For consistency and robustness, a brief validation or documentation note could help.

daydream/rlm/runner.py (1)

711-723: _handle_llm_query_parallel doesn't leverage parallel execution.

The method falls back to sequential execution via a list comprehension. While the docstring mentions "fall back to sequential," this defeats the purpose of llm_query_parallel. Consider implementing actual parallel execution using asyncio.gather or concurrent.futures.

♻️ Suggested parallel implementation
     def _handle_llm_query_parallel(self, prompts: list[str], model: str) -> list[str]:
         """Handle llm_query_parallel callback from REPL."""
         self._metrics.llm_query_parallel_calls += 1
-        # Fall back to sequential execution
-        return [self._handle_llm_query(p, model) for p in prompts]
+        # Execute queries in parallel using ThreadPoolExecutor
+        # (actual parallelism depends on _llm_callback implementation)
+        from concurrent.futures import ThreadPoolExecutor, as_completed
+        
+        results = [None] * len(prompts)
+        with ThreadPoolExecutor(max_workers=min(len(prompts), 5)) as executor:
+            future_to_idx = {
+                executor.submit(self._handle_llm_query, p, model): i
+                for i, p in enumerate(prompts)
+            }
+            for future in as_completed(future_to_idx):
+                idx = future_to_idx[future]
+                results[idx] = future.result()
+        return results

Comment on lines +756 to +771
def _handle_files_containing(self, pattern: str) -> list[str]:
    """Handle files_containing callback from REPL.

    Args:
        pattern: Regex pattern to search for.

    Returns:
        List of file paths matching the pattern.
    """
    self._metrics.files_containing_calls += 1
    if self._context is None:
        return []
    compiled = re.compile(pattern)
    results = [path for path, content in self._context.files.items() if compiled.search(content)]
    self._metrics.unique_files_accessed.update(results)
    return results


⚠️ Potential issue | 🟡 Minor

Unhandled re.error in _handle_files_containing.

This method compiles user-provided regex patterns without error handling. While files_containing in environment.py now handles this, this internal handler duplicates the logic without the same protection.

🛡️ Proposed fix to add error handling
     def _handle_files_containing(self, pattern: str) -> list[str]:
         """Handle files_containing callback from REPL."""
         self._metrics.files_containing_calls += 1
         if self._context is None:
             return []
+        try:
+            compiled = re.compile(pattern)
+        except re.error:
+            return []
-        compiled = re.compile(pattern)
         results = [path for path, content in self._context.files.items() if compiled.search(content)]
         self._metrics.unique_files_accessed.update(results)
         return results

Comment on lines +773 to +789
def _handle_files_importing(self, module: str) -> list[str]:
    """Handle files_importing callback from REPL.

    Args:
        module: Module name to search for.

    Returns:
        List of file paths that import the module.
    """
    self._metrics.files_importing_calls += 1
    if self._context is None:
        return []
    pattern = rf"(?:^|\n)\s*(?:import\s+{re.escape(module)}|from\s+{re.escape(module)}\s+import)"
    compiled = re.compile(pattern)
    results = [path for path, content in self._context.files.items() if compiled.search(content)]
    self._metrics.unique_files_accessed.update(results)
    return results


⚠️ Potential issue | 🟡 Minor

Same re.error risk in _handle_files_importing.

Similar to _handle_files_containing, this method should handle re.error for consistency.

🛡️ Proposed fix
     def _handle_files_importing(self, module: str) -> list[str]:
         """Handle files_importing callback from REPL."""
         self._metrics.files_importing_calls += 1
         if self._context is None:
             return []
         pattern = rf"(?:^|\n)\s*(?:import\s+{re.escape(module)}|from\s+{re.escape(module)}\s+import)"
-        compiled = re.compile(pattern)
+        try:
+            compiled = re.compile(pattern)
+        except re.error:
+            return []
         results = [path for path, content in self._context.files.items() if compiled.search(content)]
         self._metrics.unique_files_accessed.update(results)
         return results

Models often hallucinate a 'context' parameter when calling llm_query()
because it's a common pattern in LLM APIs. Instead of failing with
TypeError, the function now gracefully merges the context into the
prompt string.

This prevents reviews from failing due to the model using an incorrect
API signature.

Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
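The graceful merge described in this commit can be sketched as a wrapper over the real query function. `make_llm_query` is a hypothetical factory name, and the exact prompt-merging format is an assumption:

```python
def make_llm_query(llm_query_fn):
    """Wrap the real query function to tolerate a hallucinated context= (sketch).

    Models often pass context= because it is common in other LLM APIs;
    instead of raising TypeError, fold it into the prompt and ignore any
    other unknown kwargs.
    """
    def llm_query(prompt: str, model: str = "haiku", **kwargs) -> str:
        context = kwargs.pop("context", None)
        if context is not None:
            # Merge the hallucinated context into the prompt string.
            prompt = f"{prompt}\n\nContext:\n{context}"
        return llm_query_fn(prompt, model)
    return llm_query
```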

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@daydream/rlm/environment.py`:
- Around line 124-134: The llm_query_parallel wrapper currently lacks **kwargs
so a caller passing context= will raise TypeError; update both definitions of
llm_query_parallel (the branch that calls llm_query_parallel_fn and the fallback
that uses llm_query) to accept **kwargs, remove or ignore a hallucinated
'context' key (e.g., pop('context', None)) and then pass the remaining kwargs
through to llm_query_parallel_fn(...) or llm_query(...) in the list
comprehension; keep the function name llm_query_parallel and ensure
namespace["llm_query_parallel"] still points to it.

Comment on lines +124 to +134
# Parallel query function
if llm_query_parallel_fn is not None:
    def llm_query_parallel(prompts: list[str], model: str = "haiku") -> list[str]:
        """Batch multiple independent queries for efficiency."""
        return llm_query_parallel_fn(prompts, model)
else:
    def llm_query_parallel(prompts: list[str], model: str = "haiku") -> list[str]:
        """Fallback: execute queries sequentially."""
        return [llm_query_fn(p, model) for p in prompts]

namespace["llm_query_parallel"] = llm_query_parallel


⚠️ Potential issue | 🟡 Minor

Handle hallucinated context in llm_query_parallel to avoid REPL crashes.
Line 126: llm_query_parallel doesn’t accept **kwargs, so a model passing context= will raise TypeError, similar to the issue you already guarded against in llm_query. Consider mirroring the same behavior here.

🐛 Suggested fix
-    if llm_query_parallel_fn is not None:
-        def llm_query_parallel(prompts: list[str], model: str = "haiku") -> list[str]:
+    if llm_query_parallel_fn is not None:
+        def llm_query_parallel(
+            prompts: list[str],
+            model: str = "haiku",
+            **kwargs,
+        ) -> list[str]:
             """Batch multiple independent queries for efficiency."""
-            return llm_query_parallel_fn(prompts, model)
+            if "context" in kwargs:
+                prompts = [f"{p}\n\n{kwargs['context']}" for p in prompts]
+            return llm_query_parallel_fn(prompts, model)
     else:
-        def llm_query_parallel(prompts: list[str], model: str = "haiku") -> list[str]:
+        def llm_query_parallel(
+            prompts: list[str],
+            model: str = "haiku",
+            **kwargs,
+        ) -> list[str]:
             """Fallback: execute queries sequentially."""
-            return [llm_query_fn(p, model) for p in prompts]
+            return [llm_query(p, model, **kwargs) for p in prompts]

…ion complexity

Instead of asking the LLM for markdown-fenced code blocks and then
parsing them out, instruct the LLM to output raw Python directly.
This eliminates the fragile code extraction pipeline (regex patterns,
example detection, sanitization, recovery prompts) and lets SyntaxErrors
from non-Python output serve as natural self-correction signals.

Also improves FINAL_VAR error handling to return diagnostic info
via FinalAnswer instead of raising NameError.

Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@daydream/rlm/runner.py`:
- Around line 586-598: The current _handle_llm_query_parallel increments
self._metrics.llm_query_parallel_calls and then calls _handle_llm_query for each
prompt, which itself increments self._metrics.llm_query_calls, causing
double-counting; modify the fallback to invoke the LLM logic without
incrementing per-query metrics (either by adding a non-metric helper like
_handle_llm_query_no_metric or by adding a skip_metric flag to
_handle_llm_query) so that only llm_query_parallel_calls is incremented for the
parallel path and llm_query_calls remains unchanged during the sequential
fallback; ensure references to _handle_llm_query_parallel, _handle_llm_query,
llm_query_parallel_calls, and llm_query_calls are updated accordingly.
- Around line 631-664: Remove the unreachable runner-level handlers
_handle_files_containing and _handle_files_importing and their metrics updates:
delete both method definitions (including their increments to
self._metrics.files_containing_calls / files_importing_calls and calls to
self._metrics.unique_files_accessed.update). After removing the methods, search
for and remove any remaining references or unused metric fields
(files_containing_calls, files_importing_calls) on the metrics object if they
are now unused, since file-search closures are provided by build_repl_namespace
and these runner handlers are never invoked.

In `@tests/rlm/test_runner_integration.py`:
- Around line 238-280: The test reveals that RLMConfig.sub_model isn't being
passed into the REPL environment so llm_query falls back to its hardcoded
default; update the RLMRunner code that constructs the REPL namespace (where
environment.llm_query is exposed) to inject the configured sub-model
(cfg.sub_model) under the symbol expected by llm_query (e.g., pass sub_model
into the REPL globals or wrap llm_query to bind model=cfg.sub_model), so that
RLMRunner (and tests like test_llm_query_in_code_routes_correctly) use the
configured sub_model rather than the hardcoded default.
🧹 Nitpick comments (3)
daydream/rlm/runner.py (2)

600-629: Sub-LLM gating can cause the model to burn iterations without progress.

When the model calls FINAL but hasn't met the sub-LLM threshold, its answer is silently discarded (lines 924-926) and it gets a continuation prompt. There is no escalation: if the model repeatedly calls FINAL without ever invoking llm_query, it will loop until max_iterations with every FINAL rejected.

Consider adding a force-accept after N rejected FINAL attempts (e.g., 3), since the model may have legitimate reasons to skip sub-LLM calls (small meaningful change set, obvious issues found via direct inspection).
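A minimal sketch of that force-accept escalation, under the stated assumptions (`FinalGate`, `MAX_REJECTED_FINALS`, and the `gate_met` callable are illustrative names, not symbols from the PR):

```python
MAX_REJECTED_FINALS = 3  # assumed threshold; tune as needed

class FinalGate:
    """Accept FINAL once the sub-LLM gate passes, or after N rejections."""

    def __init__(self, gate_met) -> None:
        self._gate_met = gate_met        # stands in for _check_sublm_usage_before_final
        self._rejected_finals = 0

    def should_accept(self) -> bool:
        if self._gate_met():
            return True
        self._rejected_finals += 1
        # Force-accept after repeated rejections: the model may have
        # legitimate reasons to skip sub-LLM calls (small change set,
        # obvious issues found via direct inspection).
        return self._rejected_finals >= MAX_REJECTED_FINALS
```

With the threshold at 3, the first two unguarded FINAL attempts are rejected and the third is accepted, bounding wasted iterations.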


746-757: The default _llm_callback fallback on line 750 is a lambda that echoes a truncated copy of the prompt, which could cause confusion in debugging.

When _llm_callback is None but _call_llm is set (i.e., an orchestration LLM is configured but no sub-LLM), the fallback echo lambda silently swallows sub-LLM calls. This is fine for testing, but in production it could mask configuration issues. Consider logging a warning when this fallback is used.
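A sketch of the suggested warning, assuming the existing lambda truncates to some fixed length (200 here is an assumption, as is the `make_sub_llm_fallback` name):

```python
import logging

logger = logging.getLogger(__name__)

def make_sub_llm_fallback(max_len: int = 200):
    # Hypothetical: mirror the truncating echo lambda, but log loudly so a
    # missing sub-LLM configuration is visible in production logs.
    def echo_fallback(prompt: str, model: str) -> str:
        logger.warning(
            "No sub-LLM callback configured; echoing truncated prompt (model=%s)",
            model,
        )
        return prompt[:max_len]
    return echo_fallback
```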

tests/rlm/test_runner_integration.py (1)

1-18: Unused imports: AsyncMock, MagicMock, patch, Path, and RepoContext.

These are imported but never referenced in any test. They'll be caught by ruff's F401 rule.

Proposed fix
 import pytest
-from unittest.mock import AsyncMock, MagicMock, patch
-from pathlib import Path
 
 from daydream.rlm.runner import (
     RLMConfig,
     RLMRunner,
     load_codebase,
 )
-from daydream.rlm.environment import RepoContext

Comment on lines +586 to +598
def _handle_llm_query_parallel(self, prompts: list[str], model: str) -> list[str]:
    """Handle llm_query_parallel callback from REPL.

    Args:
        prompts: List of query prompts.
        model: The model to use (e.g., "haiku").

    Returns:
        List of LLM response texts.
    """
    self._metrics.llm_query_parallel_calls += 1
    # Fall back to sequential execution
    return [self._handle_llm_query(p, model) for p in prompts]

⚠️ Potential issue | 🟡 Minor

_handle_llm_query_parallel double-counts metrics when falling back to sequential execution.

Each call to _handle_llm_query_parallel increments llm_query_parallel_calls by 1, and then the sequential fallback on line 598 calls _handle_llm_query per prompt, which increments llm_query_calls for each one. So a parallel call with 3 prompts records 1 parallel call + 3 query calls = 4 total in _check_sublm_usage_before_final. This inflates the count and makes the sub-LLM usage gate easier to pass than intended.

Proposed fix — track parallel calls without double-counting
     def _handle_llm_query_parallel(self, prompts: list[str], model: str) -> list[str]:
         self._metrics.llm_query_parallel_calls += 1
         # Fall back to sequential execution
-        return [self._handle_llm_query(p, model) for p in prompts]
+        if self._llm_callback is None:
+            return ["[Error: No LLM callback configured]"] * len(prompts)
+        return [self._llm_callback(p, model) for p in prompts]

Comment on lines +631 to +664
def _handle_files_containing(self, pattern: str) -> list[str]:
    """Handle files_containing callback from REPL.

    Args:
        pattern: Regex pattern to search for.

    Returns:
        List of file paths matching the pattern.
    """
    self._metrics.files_containing_calls += 1
    if self._context is None:
        return []
    compiled = re.compile(pattern)
    results = [path for path, content in self._context.files.items() if compiled.search(content)]
    self._metrics.unique_files_accessed.update(results)
    return results

def _handle_files_importing(self, module: str) -> list[str]:
    """Handle files_importing callback from REPL.

    Args:
        module: Module name to search for.

    Returns:
        List of file paths that import the module.
    """
    self._metrics.files_importing_calls += 1
    if self._context is None:
        return []
    pattern = rf"(?:^|\n)\s*(?:import\s+{re.escape(module)}|from\s+{re.escape(module)}\s+import)"
    compiled = re.compile(pattern)
    results = [path for path, content in self._context.files.items() if compiled.search(content)]
    self._metrics.unique_files_accessed.update(results)
    return results

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check how REPLProcess is constructed and whether it wires runner callbacks for file search
echo "=== REPLProcess constructor and its parameters ==="
ast-grep --lang python --pattern 'class REPLProcess:
  $$$'

echo ""
echo "=== How files_containing is wired in REPL ==="
rg -n "files_containing" daydream/rlm/repl.py -B3 -A3

echo ""
echo "=== How REPLProcess is instantiated in runner ==="
rg -n "REPLProcess\(" daydream/rlm/runner.py -A10

echo ""
echo "=== build_repl_namespace call sites ==="
rg -n "build_repl_namespace" daydream/rlm/ -B2 -A5

Repository: existential-birds/daydream

Length of output: 2510


🏁 Script executed:

#!/bin/bash
# Look at the build_repl_namespace function implementation
echo "=== build_repl_namespace implementation ==="
rg -n "def build_repl_namespace" daydream/rlm/environment.py -A50

echo ""
echo "=== Search for _handle_files_containing usage ==="
rg -n "_handle_files_containing" daydream/rlm/

echo ""
echo "=== Search for _handle_files_importing usage ==="
rg -n "_handle_files_importing" daydream/rlm/

echo ""
echo "=== Check if runner passes callbacks to REPLProcess ==="
rg -n "REPLProcess" daydream/rlm/runner.py -A5 -B2

Repository: existential-birds/daydream

Length of output: 3821


🏁 Script executed:

#!/bin/bash
# Get the full build_repl_namespace implementation to see files_containing and files_importing
rg -n "def build_repl_namespace" daydream/rlm/environment.py -A100 | head -120

Repository: existential-birds/daydream

Length of output: 4439


Remove _handle_files_containing and _handle_files_importing methods; their metrics tracking is unreachable dead code.

REPLProcess is instantiated (runner.py:753) with only context, llm_callback, and llm_parallel_callback—no file search callbacks. build_repl_namespace (environment.py:90) creates its own files_containing and files_importing closures (lines 137, 149) that directly access ctx.files and are exposed in the REPL namespace. The runner-level handlers are never invoked, making their metrics counters (files_containing_calls, files_importing_calls) permanently zero and the code duplicate.


Comment on lines +238 to +280
class TestRLMRunnerSubLLMCalls:
    """Tests for sub-LLM (llm_query) handling."""

    @pytest.mark.asyncio
    async def test_llm_query_in_code_routes_correctly(self, tmp_path):
        """llm_query calls in REPL should route to sub-LLM handler."""
        (tmp_path / "code.py").write_text("def foo(): pass")

        cfg = RLMConfig(
            workspace_path=str(tmp_path),
            languages=["python"],
            use_container=False,
            sub_model="haiku",
        )
        runner = RLMRunner(cfg)

        sub_llm_calls = []

        def mock_sub_llm(prompt: str, model: str) -> str:
            sub_llm_calls.append((prompt, model))
            return "Analyzed code looks good"

        runner._llm_callback = mock_sub_llm

        call_count = [0]

        async def mock_llm_call(prompt: str) -> str:
            call_count[0] += 1
            if call_count[0] == 1:
                # Raw Python, no markdown
                return '''analysis = llm_query("Analyze the foo function")
print(analysis)'''
            else:
                return 'FINAL("Review done with sub-LLM analysis")'

        runner._call_llm = mock_llm_call

        result = await runner.run()

        # Sub-LLM should have been called
        assert len(sub_llm_calls) == 1
        assert "foo function" in sub_llm_calls[0][0]
        assert sub_llm_calls[0][1] == "haiku"

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check if sub_model is ever used to configure the REPL namespace default model
echo "=== References to sub_model in runner.py ==="
rg -n "sub_model" daydream/rlm/runner.py -B2 -A2

echo ""
echo "=== Default model in environment.py llm_query ==="
rg -n 'model.*=.*"haiku"' daydream/rlm/environment.py -B2 -A2

echo ""
echo "=== build_repl_namespace call with model parameter ==="
rg -n "build_repl_namespace" daydream/rlm/ -B2 -A5

Repository: existential-birds/daydream

Length of output: 3681


🏁 Script executed:

#!/bin/bash
# Check all uses of sub_model in runner.py
echo "=== All references to sub_model in runner.py ==="
rg -n "sub_model" daydream/rlm/runner.py -C5

echo ""
echo "=== Check where llm_callback is set/used ==="
rg -n "_llm_callback|llm_callback" daydream/rlm/runner.py -B2 -A2

echo ""
echo "=== Check REPL initialization in runner.py ==="
rg -n "REPLProcess|build_repl_namespace|\.start\(\)" daydream/rlm/runner.py -B3 -A3

Repository: existential-birds/daydream

Length of output: 1015


🏁 Script executed:

#!/bin/bash
# Read runner.py to see RLMRunner class and how it initializes REPL
echo "=== Check file size ==="
wc -l daydream/rlm/runner.py

echo ""
echo "=== Read RLMRunner class ==="
sed -n '87,200p' daydream/rlm/runner.py

echo ""
echo "=== Look for REPL or build_repl_namespace usage ==="
rg -n "REPL|build_repl" daydream/rlm/runner.py

Repository: existential-birds/daydream

Length of output: 5002


🏁 Script executed:

#!/bin/bash
# Read the REPL initialization section
sed -n '745,770p' daydream/rlm/runner.py

Repository: existential-birds/daydream

Length of output: 1068


🏁 Script executed:

#!/bin/bash
# Read REPLProcess class and initialization
echo "=== REPLProcess class in repl.py ==="
sed -n '1,150p' daydream/rlm/repl.py

echo ""
echo "=== Check if sub_model is stored in RLMRunner ==="
sed -n '300,380p' daydream/rlm/runner.py

Repository: existential-birds/daydream

Length of output: 7467


🏁 Script executed:

#!/bin/bash
# Check _handle_llm_query method
echo "=== _handle_llm_query method ==="
sed -n '572,600p' daydream/rlm/runner.py

echo ""
echo "=== Check if sub_model is ever passed to callbacks ==="
rg -n "sub_model|_llm_callback" daydream/rlm/ -A2 -B2

Repository: existential-birds/daydream

Length of output: 1113


🏁 Script executed:

#!/bin/bash
# Search all files in daydream/rlm for any use of config.sub_model or self.config.sub_model
echo "=== All uses of config.sub_model or self.config.sub_model ==="
rg -n "config\.sub_model|self\.config\.sub_model" daydream/rlm/

echo ""
echo "=== Check what happens during REPL initialization in runner ==="
sed -n '745,765p' daydream/rlm/runner.py

echo ""
echo "=== Check if sub_model is ever read from config in runner ==="
sed -n '330,360p' daydream/rlm/runner.py

Repository: existential-birds/daydream

Length of output: 1926


Wire config.sub_model into the REPL namespace, or add a test that verifies a non-default sub_model is actually used.

The sub_model config field is defined but never propagated to the REPL. The test passes because llm_query in environment.py defaults to "haiku" (line 115), which coincidentally matches cfg.sub_model="haiku". If you changed sub_model to "sonnet", the REPL would still use "haiku" since the default is hardcoded.
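One way to wire it, as a hedged sketch: bind the configured model with `functools.partial` when building the REPL namespace. The `build_namespace` helper and the stub `llm_query` below are illustrative, not the repository's actual code.

```python
import functools

def llm_query(prompt: str, model: str = "haiku") -> str:
    # Stand-in for environment.llm_query; the real one calls the sub-LLM.
    return f"[{model}] {prompt}"

def build_namespace(sub_model: str) -> dict:
    # Bind cfg.sub_model so the hardcoded "haiku" default is never relied on.
    return {"llm_query": functools.partial(llm_query, model=sub_model)}
```

Because `partial` binds `model` as a keyword, REPL code calling `llm_query(prompt)` gets the configured sub-model, while an explicit `model=` argument still overrides it.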


Labels

area:cli (CLI commands and interface), area:runner (Main orchestration and run loop), enhancement (New feature or request)
