From c647358ec9bc4bbacd1121419739362fd2f7dbb4 Mon Sep 17 00:00:00 2001
From: mathuryash5 <mathuryash5@gmail.com>
Date: Sat, 25 Apr 2026 17:19:42 -0400
Subject: [PATCH 01/24] docs(trove): add native tool calling design spec

Specifies the architecture for adapting TroVE's IMPORT mode to use native
OpenAI tool calling with vLLM-served gpt-oss models. CREATE and SKIP
remain text-based; reward selection, K-sampling, and trimming stay
faithful to the paper. Includes telemetry plan, vLLM version requirements
(>= v0.16.0 for PR #28729), defensive sanitizer for the open Harmony
control-token leakage bug (PR #35906), and the smoke-run done criteria.

Made-with: Cursor
---
 ...-04-25-trove-native-tool-calling-design.md | 374 ++++++++++++++++++
 1 file changed, 374 insertions(+)
 create mode 100644 docs/superpowers/specs/2026-04-25-trove-native-tool-calling-design.md

diff --git a/docs/superpowers/specs/2026-04-25-trove-native-tool-calling-design.md b/docs/superpowers/specs/2026-04-25-trove-native-tool-calling-design.md
new file mode 100644
index 00000000..ff0fc0a1
--- /dev/null
+++ b/docs/superpowers/specs/2026-04-25-trove-native-tool-calling-design.md
@@ -0,0 +1,374 @@
+# TroVE Native Tool Calling — Design Spec
+
+**Date:** 2026-04-25
+**Branch:** `trove_baseline`
+**Status:** Approved (sectional review complete; self-review applied)
+
+---
+
+## 1. Problem statement
+
+Our existing TroVE port (`symbolic_agent/baselines/trove/*`) faithfully implements the original 3-mode generation (IMPORT / CREATE / SKIP) via free-form text prompts. When run on PBEBench with `gpt-oss-20b` / `gpt-oss-120b` served via vLLM, two failure modes are observed:
+
+1. **The toolbox is populated but never used.** The model emits `**Toolbox**` and `**Solution**` blocks that ignore previously-induced functions even when those functions match the task family. CoT shows the model "discovering" the same primitive sequence repeatedly.
+2. **CoT and final code are decoupled.** Even when the prompt names a toolbox helper, the model's reasoning channel does not interleave concrete invocations of that helper — calls only appear (if at all) in the final code, with no per-call signal we can audit.
+
+The user requirement is: **the model's chain-of-thought should interleave with concrete function calls into the induced toolbox**. The mechanism for this is **native OpenAI tool calling**: the toolbox is exposed via the `tools=[...]` parameter of `chat.completions.create`, and the model emits structured `tool_calls` during its reasoning that vLLM dispatches back to us. Each tool call is real, auditable, and credited toward toolbox frequency.
+
+This spec adapts the IMPORT mode of TroVE to use that mechanism, while keeping CREATE and SKIP modes text-based and preserving the rest of the algorithm faithful to the paper.
+
+---
+
+## 2. Goals and non-goals
+
+### Goals
+
+- IMPORT-mode trajectories are produced by a multi-turn loop where `gpt-oss` calls toolbox functions natively via `tool_calls`.
+- Frequency credit reflects what the model actually called, not what appeared in text.
+- The 3-way generation (IMPORT, CREATE, SKIP), K-sampling, reward-based candidate selection, AST tie-breaking, and `C·log_{20}(n)` trimming all remain faithful to the original TroVE algorithm.
+- The smoke run produces enough telemetry (per-task `tool_calls` lists, per-mode wins, function-frequency table) to attribute any accuracy delta vs. the no-toolbox baseline to actual tool usage.
+- "Done" = code complete + 50-task PBEBench-Lite smoke run on `gpt-oss-20b` + numbers reported. **No prompt iteration to chase performance targets.**
+
+### Non-goals
+
+- We do not change CREATE or SKIP mode generation. They remain single-shot text completions exactly as today.
+- We do not pre-seed the toolbox.
+- We do not change reward semantics, the PBEBench harness, or the executor's I/O contract.
+- We do not chase a specific accuracy target. Per the original TroVE methodology, we report what the algorithm produces.
+- We do not test or report Anthropic backend numbers — only vLLM-served `gpt-oss`.
+
+---
+
+## 3. Architecture overview
+
+```mermaid
+flowchart TD
+    Task[PBEBench task] --> Controller[TroVEController._multi_way_generation]
+    Controller --> ImportBranch{toolbox non-empty<br/>AND backend == openai?}
+    Controller --> Create[CREATE mode<br/>K text-only completions]
+    Controller --> Skip[SKIP mode<br/>K text-only completions]
+
+    ImportBranch -->|yes| ImportTools[_generate_import_with_tools<br/>K multi-turn tool-calling trajectories]
+    ImportBranch -->|no, Anthropic or empty| LegacyImport[legacy text-based IMPORT<br/>defensive fallback path]
+
+    ImportTools --> ToolsApi[tools_api.toolbox_to_openai_tools<br/>top-k toolbox -> OpenAI tool schemas]
+    ImportTools --> ChatLoop[llm.chat_with_tools<br/>multi-turn loop, max_tool_iters=8]
+    ChatLoop --> Vllm[vLLM /v1/chat/completions<br/>--tool-call-parser openai<br/>--reasoning-parser openai_gptoss]
+    Vllm --> Dispatcher[tools_api.dispatch_tool_call<br/>sandbox execute via executor.run_solution]
+    Dispatcher --> ChatLoop
+
+    ImportTools --> ImportCands[K IMPORT candidates<br/>final assistant text + tool_call trajectory]
+    Create --> CreateCands[K CREATE candidates]
+    Skip --> SkipCands[K SKIP candidates]
+    LegacyImport --> ImportCands
+
+    ImportCands --> Pick[_select_best_by_reward<br/>tie-break by AST node count]
+    CreateCands --> Pick
+    SkipCands --> Pick
+
+    Pick --> Library[_update_library<br/>credit frequency from tool_calls]
+    Library --> Trim[periodic toolbox.trim<br/>C * log_20 n_processed, C=1.0]
+```
+
+**One-line summary:** Only the IMPORT branch changes. Everything else (CREATE, SKIP, K-sampling, selection, library updates, trimming) stays where the existing port already has it.
+
+---
+
+## 4. Data flow for IMPORT-with-tools
+
+```mermaid
+sequenceDiagram
+    participant Ctrl as TroVEController
+    participant Tools as tools_api
+    participant LLM as TroVELLMClient.chat_with_tools
+    participant vLLM
+    participant Exec as executor.run_solution
+
+    Ctrl->>Tools: toolbox_to_openai_tools(toolbox, topk=10)
+    Tools-->>Ctrl: tools_schema (list[dict])
+    Ctrl->>LLM: chat_with_tools(messages, tools_schema, model, max_tool_iters=8)
+    loop iter 1..N (N <= 8)
+        LLM->>vLLM: chat.completions.create(messages, tools=tools_schema)
+        vLLM-->>LLM: assistant message (content + reasoning_content + tool_calls)
+        alt tool_calls present
+            LLM->>Tools: dispatch_tool_call(toolbox, tool_call)
+            Tools->>Exec: run_solution(toolbox_src + call_expr, task_inputs)
+            Exec-->>Tools: stdout (truncated to 4096 chars) or error
+            Tools-->>LLM: tool result string
+            LLM->>LLM: append assistant + tool messages, loop
+        else no tool_calls
+            LLM-->>Ctrl: trajectory (final text + recorded tool_calls)
+        end
+    end
+    Ctrl->>Ctrl: parse **Solution** block from final text
+    Ctrl->>Ctrl: credit frequency by unique tool_call.function.name
+```
+
+---
+
+## 5. Components
+
+### 5.1 New file: `symbolic_agent/baselines/trove/tools_api.py`
+
+Two pure functions; no state.
+
+**`toolbox_to_openai_tools(toolbox: TroVEToolbox, topk: int = 10) -> list[dict]`**
+
+- Selects the top-k entries by frequency (matching the existing `format_toolbox(topk=10)` view).
+- For each entry, executes the toolbox source via `exec(toolbox.get_full_code(), namespace)` into a fresh dict, then reads `inspect.signature(namespace[fn_name])` to enumerate parameters and annotations.
+- Builds an OpenAI `chat.completions` tool dict:
+  ```json
+  {
+    "type": "function",
+    "function": {
+      "name": "<fn_name>",
+      "description": "<docstring or empty string>",
+      "parameters": {
+        "type": "object",
+        "properties": {"<param>": {"type": "<inferred>"}, ...},
+        "required": [<all params without defaults>]
+      }
+    }
+  }
+  ```
+- Type inference: `int → integer`, `float → number`, `bool → boolean`, `list/tuple → array`, `dict → object`, anything else (or unannotated) → `string`. Numeric and string defaults: pass through to the schema as `default`. Anything else: omit the default.
+- Functions with `*args` or `**kwargs` are excluded from the tool list (we cannot generate a meaningful schema; this is rare for induced TroVE helpers and is logged to the debug dir for inspection).
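
The schema translation described in the bullets above can be sketched roughly as follows. This is an illustrative sketch, not the module's actual code: `function_to_tool` and `_JSON_TYPES` are hypothetical names, and the sketch works on an already-loaded function object rather than on a `TroVEToolbox` entry.

```python
import inspect
import typing

# Annotation -> JSON Schema type, following the mapping in the spec text.
_JSON_TYPES = {int: "integer", float: "number", bool: "boolean",
               list: "array", tuple: "array", dict: "object"}


def function_to_tool(fn):
    """Return an OpenAI-style tool dict for fn, or None for variadic functions."""
    properties, required = {}, []
    for name, param in inspect.signature(fn).parameters.items():
        if param.kind in (inspect.Parameter.VAR_POSITIONAL,
                          inspect.Parameter.VAR_KEYWORD):
            return None  # *args / **kwargs: no meaningful schema
        # list[str] -> list; plain classes and missing annotations are unchanged.
        annotation = typing.get_origin(param.annotation) or param.annotation
        prop = {"type": _JSON_TYPES.get(annotation, "string")}
        if param.default is inspect.Parameter.empty:
            required.append(name)
        elif isinstance(param.default, (int, float, str)):
            # bool is an int subclass, so True/False also pass through here.
            prop["default"] = param.default
        properties[name] = prop
    return {
        "type": "function",
        "function": {
            "name": fn.__name__,
            "description": inspect.getdoc(fn) or "",
            "parameters": {"type": "object",
                           "properties": properties,
                           "required": required},
        },
    }
```

String annotations (for example under `from __future__ import annotations`) fall through to `string` in this sketch, which is the conservative choice the spec already makes for unknown types.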
+
+**`dispatch_tool_call(toolbox: TroVEToolbox, tool_call) -> str`**
+
+- Sanitizes the tool name: `name = tool_call.function.name.split("<|", 1)[0]`. This is a defensive 2-line workaround for the open vLLM bug tracked by PR #35906 (Harmony control tokens leaking into tool names like `find_replace_chain<|channel|>commentary`). If/when #35906 lands upstream, this becomes a no-op.
+- If `name` is not in `toolbox`, returns the JSON string `{"error": "tool '<name>' not in toolbox"}` (the model can recover).
+- Parses `tool_call.function.arguments` as JSON; on parse error returns `{"error": "argument JSON parse failed: <msg>"}`.
+- Builds a one-liner call expression: `print(repr(<name>(**<args>)))`.
+- Runs `executor.run_solution` with `toolbox.get_full_code() + "\n" + call_expr` and `inputs={}`. (PBEBench task inputs are not needed at the function-call level — the model passes inputs as arguments.)
+- Returns the captured stdout truncated to **4096 characters** (Unicode code points, not bytes, so truncation never splits a multi-byte character), or the error message on non-zero exit.
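
The pre-execution half of the dispatcher (name sanitizing, argument parsing, call-expression building, truncation) might look like the sketch below. The helper names and `MAX_RESULT_CHARS` are hypothetical, the call into `executor.run_solution` is deliberately left out, and the non-object argument check is an extra assumption that the spec does not state.

```python
import json

MAX_RESULT_CHARS = 4096  # truncation limit from the spec


def sanitize_tool_name(raw_name):
    """Drop everything from the first Harmony control-token marker onward."""
    return raw_name.split("<|", 1)[0]


def build_call_expr(raw_name, raw_arguments, known_names):
    """Return (call_expr, None) on success or (None, error_json) on failure."""
    name = sanitize_tool_name(raw_name)
    if name not in known_names:
        return None, json.dumps({"error": f"tool '{name}' not in toolbox"})
    try:
        args = json.loads(raw_arguments or "{}")
    except json.JSONDecodeError as exc:
        return None, json.dumps({"error": f"argument JSON parse failed: {exc}"})
    if not isinstance(args, dict):
        # Assumption: keyword expansion needs a JSON object at the top level.
        return None, json.dumps({"error": "arguments must be a JSON object"})
    # repr() of ordinary JSON-decoded values is valid Python source, so the
    # expression can be appended to the toolbox code and executed.
    return f"print(repr({name}(**{args!r})))", None


def truncate_result(stdout):
    """Truncate by characters (code points), never by bytes."""
    return stdout[:MAX_RESULT_CHARS]
```

Returning errors as JSON strings instead of raising keeps the failure inside the conversation, so the model sees it as a tool result and can retry with a corrected name or arguments.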
+
+### 5.2 Modify: `symbolic_agent/baselines/trove/llm.py`
+
+**`TroVELLMClient._call_openai`**
+
+- After reading `response.choices[0].message.content`, fall back to `getattr(response.choices[0].message, "reasoning_content", "")` when `content` is empty/None. This handles `gpt-oss` Harmony channel splits where the answer lands in the reasoning channel for non-tool-calling text completions (CREATE, SKIP, and legacy IMPORT). No change to the function signature.
+
+**New method: `TroVELLMClient.chat_with_tools(messages, tools, model, max_tokens, on_tool_call, tag, max_tool_iters=8) -> dict`**
+
+- Returns `{"final_text": str, "tool_calls": list[dict], "iterations": int, "stopped_reason": str}`.
+  - `final_text` is the assistant message content from the final iteration (`""` if none).
+  - `tool_calls` is the ordered list of recorded calls, each `{"name": str, "args_preview": str (≤200 chars), "result_preview": str (≤200 chars), "ok": bool}`.
+- Implements the multi-turn loop:
+  1. Append the user message.
+  2. POST `chat.completions.create(model, messages, tools, tool_choice="auto", max_tokens)`.
+  3. If `message.tool_calls` is empty: record `final_text` (with `reasoning_content` fallback) and return.
+  4. Otherwise: append the assistant message verbatim, then for each `tool_call` invoke `on_tool_call(tool_call)` (the controller passes a closure that calls `tools_api.dispatch_tool_call`). Append a `{"role": "tool", "tool_call_id": ..., "content": <result>}` message per call.
+  5. Increment iteration counter; if `iterations >= max_tool_iters`, stop with `stopped_reason="max_iters"` and return what we have.
+- Defensive guard: raises `NotImplementedError("chat_with_tools requires the openai backend")` on `self.backend == "anthropic"`. This guard is **never tripped in normal flow** because the controller branches on `self.backend == "openai"` before calling. It exists only to fail loudly if a future caller invokes the method directly.
+- Uses the same 3-attempt retry, the same per-call debug logging (writing one JSON file per LLM round-trip into the existing `_debug_dir` with the tag suffixed by `_iter{n}`), and the same token accounting as `_call_openai`.
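
The loop in steps 1 to 5 can be sketched as below. To stay self-contained the sketch makes several assumptions: `create` is an injected callable standing in for `chat.completions.create`, messages are plain dicts rather than SDK objects, retries, logging and the `ok` flag are omitted, and the `"final"` stop reason is a placeholder value that the spec does not name.

```python
def run_tool_loop(create, messages, tools, on_tool_call, max_tool_iters=8):
    """Run the multi-turn tool loop and return the recorded trajectory."""
    messages = list(messages)  # do not mutate the caller's list
    recorded = []
    for iteration in range(1, max_tool_iters + 1):
        msg = create(messages=messages, tools=tools, tool_choice="auto")
        calls = msg.get("tool_calls") or []
        if not calls:
            # No tool calls: this is the final answer turn.
            text = msg.get("content") or msg.get("reasoning_content") or ""
            return {"final_text": text, "tool_calls": recorded,
                    "iterations": iteration, "stopped_reason": "final"}
        messages.append(msg)  # assistant turn, kept verbatim
        for call in calls:
            result = on_tool_call(call)
            recorded.append({
                "name": call["function"]["name"],
                "args_preview": call["function"]["arguments"][:200],
                "result_preview": result[:200],
            })
            messages.append({"role": "tool",
                             "tool_call_id": call["id"],
                             "content": result})
    # Budget exhausted while the model was still requesting tools.
    return {"final_text": "", "tool_calls": recorded,
            "iterations": max_tool_iters, "stopped_reason": "max_iters"}
```

Injecting `create` and `on_tool_call` keeps the loop testable without a running vLLM server: a fake `create` that returns one tool call and then a final message exercises both branches.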
+
+### 5.3 Modify: `symbolic_agent/baselines/trove/controller.py`
+
+**`__init__`** — add two parameters:
+- `task_family: str = "default"` — passed through to `prompts.build_*_prompt` and `parse.parse_response`.
+- `selection: str = "reward"` — `"reward"` (default) uses the existing `_select_best_by_reward`; `"consistency"` uses the existing `_select_best_by_consistency`.
+
+**`_multi_way_generation`** — change the IMPORT branch only:
+- If `self.backend == "openai"` AND `len(self.toolbox) > 0`: call new `_generate_import_with_tools(task, K)`.
+- Else: call the existing legacy text-based IMPORT path. (Anthropic and empty-toolbox both fall through here; the latter is correct because there are no tools to expose anyway.)
+- CREATE and SKIP branches: unchanged.
+
+**New method: `_generate_import_with_tools(task, K) -> list[Candidate]`**
+
+- Builds the IMPORT-with-tools prompt via `prompts.build_import_with_tools_prompt(task, task_family=self.task_family)` (no `**Toolbox**` markdown — the toolbox is conveyed via the `tools=[...]` parameter).
+- Builds the tool schema once per task: `tools_schema = tools_api.toolbox_to_openai_tools(self.toolbox, topk=10)`.
+- For `i in range(K)`, calls `self.llm.chat_with_tools(...)` with the tag `f"trove_import_{task.id}_{i}"`.
+- Each returned trajectory becomes one Candidate. Solution code is parsed from the final text via `parse.parse_response(final_text, task_family="pbebench")` (strict `**Solution**` block; no fallback to "any python block").
+- Empty `final_text` → empty solution code → reward=0 → naturally loses in selection.
+
+**`_update_library`** — for `mode == "import"`, credit frequency by **unique `tool_call.function.name`** entries in the trajectory:
+- `unique_names = {sanitize(tc["name"]) for tc in trajectory.tool_calls}` where `sanitize` is the same `<|`-truncation used in `dispatch_tool_call` (defensive symmetry).
+- For each name, call `self.toolbox.update_frequency(name, example_idx)`. Names not present in the toolbox are silently no-ops thanks to the existing filter at `toolbox.py:68` — hallucinated tool names contribute nothing to frequency. Real tool calls (names matching a toolbox entry) get one credit per task per unique name.
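
A small sketch of the crediting rule, under stated assumptions: `names_to_credit` is a hypothetical helper, and the explicit set intersection stands in for the filter that the spec says already exists inside `toolbox.py`.

```python
def names_to_credit(tool_calls, toolbox_names):
    """Return the toolbox functions that earn one frequency credit for a task."""
    # Same "<|" truncation as the dispatcher, then de-duplicate per task.
    unique = {tc["name"].split("<|", 1)[0] for tc in tool_calls}
    # Hallucinated names are dropped; only real toolbox entries are credited.
    return sorted(unique & set(toolbox_names))


calls = [
    {"name": "find_replace_chain"},
    {"name": "find_replace_chain<|channel|>commentary"},
    {"name": "ghost_tool"},
]
print(names_to_credit(calls, {"find_replace_chain", "swap_case"}))
```

Three recorded calls collapse to a single credit here: the contaminated name sanitizes to the same function, and `ghost_tool` is not in the toolbox.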
+
+**`_make_result`** — emit passive telemetry fields per task. Add to the result dict (no behavior changes):
+- `won_mode: "import" | "create" | "skip"`
+- `import_eligible: bool` (true iff toolbox was non-empty when the task ran)
+- `import_was_winner: bool`
+- `tool_calls: list[{name, args_preview, result_preview, ok}]` (only populated when the IMPORT-with-tools path ran)
+- `tool_call_count: int`
+- `tools_called: list[str]` (unique names actually called)
+- `actually_called: list[str]` (functions from `toolbox` that appear as call-sites in the AST of the winning `**Solution**` code; computed via `parse.imported_callsites`)
+
+### 5.4 Modify: `symbolic_agent/baselines/trove/parse.py`
+
+**New helper: `imported_callsites(solution_code: str, tools_code: str, candidate_names: set[str]) -> set[str]`**
+- AST-walks `solution_code`, returns the subset of `candidate_names` that appear as `Call` targets (handles bare `Name` and `Attribute` callees like `toolbox.find_replace_chain`).
+- Used by `_make_result.actually_called`.
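
One possible shape for this helper is sketched below. It is not the final implementation: `tools_code` is accepted only to match the signature above and is unused here, and returning an empty set on a syntax error is an assumption.

```python
import ast


def imported_callsites(solution_code, tools_code, candidate_names):
    """Return the candidate names that appear as call targets in solution_code."""
    try:
        tree = ast.parse(solution_code)
    except SyntaxError:
        return set()  # unparsable code cannot contain auditable call sites
    found = set()
    for node in ast.walk(tree):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        if isinstance(func, ast.Name):         # find_replace_chain(...)
            name = func.id
        elif isinstance(func, ast.Attribute):  # toolbox.find_replace_chain(...)
            name = func.attr
        else:
            continue
        if name in candidate_names:
            found.add(name)
    return found
```

Matching on the attribute name alone means a method call such as `s.replace(...)` would count as a hit for a toolbox function that happens to be named `replace`. Checking the receiver would avoid that, at the cost of a more involved walk.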
+
+**Modify `parse_response`** — add `task_family: str = "default"` parameter:
+- For `task_family == "pbebench"`, do not fall back to `_extract_any_python_block` if the `**Solution**` block is missing — return empty solution code instead. This enforces strict format adherence and prevents the parser from accidentally promoting CoT scratchpad to the answer.
+- For all other families, behavior is unchanged.
+
+### 5.5 Modify: `symbolic_agent/baselines/trove/prompts.py`
+
+- Add PBEBench-shaped few-shot examples: `_CREATE_EXAMPLE_PBEBENCH` and `_SKIP_EXAMPLE_PBEBENCH`. Each demonstrates a sequence of `replace()` operations and (in CREATE's case) a small reusable helper such as `find_replace_chain(s, pairs)` so the model has a concrete pattern to imitate.
+- Add **`_IMPORT_INSTRUCTION_FOR_TOOLS`** and **`_IMPORT_EXAMPLE_FOR_TOOLS`**: the prompt for IMPORT-with-tools mode. These do *not* include a `**Toolbox**` markdown block (the toolbox is conveyed via the `tools=[...]` parameter). They instruct the model to use the available tools when helpful and to produce a final answer in a `**Solution**` block.
+- Add **`build_import_with_tools_prompt(task, task_family)`** and refactor `build_import_prompt`, `build_create_prompt`, `build_skip_prompt` to accept `task_family` and dispatch to the appropriate example set.
+- Make `_FORMAT_OVERRIDE` conditional: empty string for `task_family == "pbebench"` (the new PBEBench examples model the desired format directly); existing override for other families.
+
+### 5.6 Modify: `symbolic_agent/baselines/trove/toolbox.py`
+
+- `TroVEToolbox.trim`: change default `C: float = 0.5` → `C: float = 1.0` to match the original TroVE implementation.
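
To make the effect of the new default concrete, the threshold `C * log_20(n)` can be evaluated directly. `trim_threshold` is a hypothetical helper used only for illustration, and it assumes `n_processed >= 1`.

```python
import math


def trim_threshold(n_processed, C=1.0):
    """Frequency threshold C * log base 20 of the number of processed tasks."""
    return C * math.log(n_processed, 20)


for n in (20, 50, 400):
    old = trim_threshold(n, C=0.5)
    new = trim_threshold(n, C=1.0)
    print(f"n={n}: C=0.5 -> {old:.2f}, C=1.0 -> {new:.2f}")
```

At the end of the 50-task smoke run the threshold is about 1.31 with `C=1.0`, compared with about 0.65 under the previous default of `C=0.5`.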
+
+### 5.7 Modify: `symbolic_agent/baselines/trove/executor.py`
+
+- `DEFAULT_TIMEOUT = 10` → `DEFAULT_TIMEOUT = 60`. Closer to the original TroVE's ~100s; gives PBEBench's `replace()`-chain solutions and the multi-turn tool dispatch enough headroom on local vLLM.
+
+### 5.8 Modify: `main.py`
+
+- Add CLI flag `--trove-selection {reward,consistency}` with `default="reward"`. Plumb to `TroVEController(selection=args.trove_selection)`.
+- When `--dataset pbebench` is specified, pass `task_family="pbebench"` to the controller. Otherwise pass `"default"`.
+
+### 5.9 Modify: `scripts/launch_vllm_gpt_oss_120b.sh`
+
+Add three flags to the `vllm.entrypoints.openai.api_server` invocation:
+- `--enable-auto-tool-choice` — enables `tool_choice="auto"` to actually fire tool calls.
+- `--tool-call-parser openai` — the parser that knows how to extract `tool_calls` from the `gpt-oss` Harmony commentary channel.
+- `--reasoning-parser openai_gptoss` — routes Harmony analysis-channel content into `message.reasoning_content` rather than dropping it.
+
+### 5.10 New file: `scripts/analyze_trove_run.py`
+
+Read a TroVE JSONL output and print:
+- Overall accuracy (pass rate).
+- Final toolbox size.
+- Per-mode wins (counts of `won_mode == "import"`, `"create"`, `"skip"`).
+- IMPORT-mode behavior breakdown:
+  - Tasks with `import_eligible == True` and `tool_call_count >= 1` (rate).
+  - Mean `tool_call_count` across IMPORT-eligible tasks.
+  - Tool-call success rate: fraction of `tool_calls` entries with `ok == True`.
+- Top-10 most-called toolbox functions (by total call count across the run).
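
A rough sketch of the aggregation, assuming each JSONL row carries the telemetry fields from section 5.3. `load_rows` and `summarize` are hypothetical names, and overall accuracy is left out because the spec does not name the pass/fail field.

```python
import json
from collections import Counter


def load_rows(path):
    """Read one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def summarize(rows):
    """Compute mode wins and IMPORT tool-use statistics from result rows."""
    eligible = [r for r in rows if r.get("import_eligible")]
    used = [r for r in eligible if r.get("tool_call_count", 0) >= 1]
    calls = [tc for r in rows for tc in r.get("tool_calls") or []]
    n_eligible = len(eligible)
    total_calls = sum(r.get("tool_call_count", 0) for r in eligible)
    ok_calls = sum(1 for tc in calls if tc.get("ok"))
    return {
        "mode_wins": dict(Counter(r.get("won_mode") for r in rows)),
        "tool_use_rate": len(used) / n_eligible if n_eligible else 0.0,
        "mean_calls": total_calls / n_eligible if n_eligible else 0.0,
        "call_success_rate": ok_calls / len(calls) if calls else 0.0,
        "top_functions": Counter(tc["name"] for tc in calls).most_common(10),
    }
```

Guarding each ratio against an empty denominator matters for short runs, where the toolbox may still be empty and no task is IMPORT-eligible.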
+
+### 5.11 Rewrite: `symbolic_agent/baselines/trove/docs/deviations.md`
+
+(Path may need creation if it doesn't exist.) Four sections:
+
+1. **Algorithmic deviations:**
+   - Native OpenAI tool calling for IMPORT mode (replaces the original text-based "model selects from `**Toolbox**` markdown" mechanism).
+   - Reward-based candidate selection by default (vs. self-consistency in the paper); self-consistency available via `--trove-selection consistency`.
+   - PBEBench-shaped few-shot examples in CREATE and SKIP prompts.
+
+2. **Faithful elements:** 3-mode generation, K-sampling per mode, AST tie-breaking by node count, `C·log_{20}(n)` periodic trimming with `C=1.0`, frequency-based top-k retrieval for the toolbox view, dict-keyed toolbox structure mirroring `utils/code.py`.
+
+3. **Infrastructural patches:** JSONL-per-task checkpointing, `reasoning_content` fallback in `_call_openai`, executor timeout 60s, defensive `<|`-truncation sanitizer in the tool-call dispatcher (workaround for open vLLM PR #35906 covering Harmony control-token leakage).
+
+4. **Backend coverage caveat:** Anthropic backend code paths are still present and exercised by CREATE / SKIP / legacy IMPORT, but the smoke run and reported numbers are vLLM-served `gpt-oss` only. IMPORT-with-tools requires the OpenAI/vLLM backend.
+
+---
+
+## 6. Telemetry to be collected
+
+Per task (in the JSONL row):
+
+| Field | Type | Source |
+|---|---|---|
+| `won_mode` | string | controller `_make_result` |
+| `import_eligible` | bool | `len(toolbox) > 0` at task start |
+| `import_was_winner` | bool | `won_mode == "import"` |
+| `tool_calls` | list[dict] | `chat_with_tools` recorded list |
+| `tool_call_count` | int | `len(tool_calls)` |
+| `tools_called` | list[str] | unique names from `tool_calls` |
+| `actually_called` | list[str] | `parse.imported_callsites(winning_solution, ...)` |
+
+Per run (computed by `scripts/analyze_trove_run.py`):
+
+- Overall accuracy
+- Final toolbox size
+- Mode-win histogram
+- IMPORT-mode tool-use rate, mean calls/task, success rate
+- Top-10 most-called functions
+
+---
+
+## 7. Implementation defaults
+
+| Choice | Value | Rationale |
+|---|---|---|
+| `K` (samples per mode) | 3 | Matches existing controller; matches paper |
+| Tool schema top-k | 10 | Matches existing `format_toolbox(topk=10)` |
+| `max_tool_iters` | 8 | Allows multi-step compositions; bounded for safety |
+| Tool result truncation | 4096 characters | Avoids truncating mid-codepoint; safe for JSON |
+| Trim coefficient `C` | 1.0 | Matches the original TroVE `λ = log_{20}(n)` |
+| Executor timeout | 60s | PBEBench `replace()`-chains + multi-turn dispatch |
+| Selection default | `reward` | Existing PBEBench reward signal is reliable |
+| Tool name sanitization | `name.split("<\|", 1)[0]` | Defensive vs. open vLLM PR #35906 |
+
+---
+
+## 8. Smoke run
+
+**Command (filled when ready to execute):**
+
+```bash
+# Launch vLLM (after script is updated with the three new flags)
+bash scripts/launch_vllm_gpt_oss_120b.sh 8000
+
+# Run TroVE on 50 PBEBench-Lite tasks with gpt-oss-20b
+python main.py \
+  --dataset pbebench \
+  --baseline trove \
+  --model gpt-oss-20b \
+  --backend openai \
+  --base-url http://localhost:8000/v1 \
+  --num-tasks 50 \
+  --trove-selection reward \
+  --debug-dir ./outputs/trove_pbebench_smoke
+
+# Analyze
+python scripts/analyze_trove_run.py outputs/trove_pbebench_smoke/results.jsonl
+```
+
+**Pre-flight check.** Before kicking off the full 50-task run, run a single one-task smoke and verify:
+1. The OpenAI client request payload contains `tools=[...]` with at least one entry once the toolbox has been populated.
+2. The first response with a non-empty toolbox returns at least one `tool_call` from vLLM (visible in the debug log JSON for that round-trip).
+
+If `message.tool_calls` is None or missing on a non-empty-toolbox task, **verify all three vLLM flags (`--enable-auto-tool-choice`, `--tool-call-parser openai`, `--reasoning-parser openai_gptoss`) are present in the launcher script**, restart vLLM, and re-run the sanity check before proceeding.
+
+**Done criteria.**
+
+- All code changes merged on `trove_baseline`.
+- Smoke run completes without crashes.
+- Reported numbers (in plain text or a brief markdown summary):
+  - Overall accuracy (pass rate over 50 tasks)
+  - Final toolbox size
+  - Mode-win counts
+  - IMPORT tool-use rate among IMPORT-eligible tasks
+  - Top-10 most-called functions
+  - A short narrative of any anomalies observed (e.g. `<|channel|>` contamination from PR #35906, `max_iters` stops, JSON-arg parse failures).
+
+We **do not** iterate on prompts, schemas, or thresholds to chase a target number. The numbers are what they are.
+
+---
+
+## 9. vLLM version requirement and known caveats
+
+- **Minimum vLLM:** v0.16.0 (branch-cut 2026-02-08). Latest as of writing is v0.20.0.
+- **Required upstream change:** PR #28729 ("Multiple fixes for gpt-oss Chat Completion prompting"), merged 2025-12-12 by `@chaunceyjiang`. Without this, multi-turn tool-call flows fail to round-trip the analysis/commentary channels correctly. v0.16.0 is the first stable release branch-cut after the merge.
+- **Known open caveat:** PR #35906 ("Sanitize leaked Harmony control tokens in tool names and recipients") is **still open** as of late March 2026. Symptoms when this hits us: tool names contain Harmony tags, e.g. `find_replace_chain<|channel|>commentary`. Mitigation: the `<|`-truncation sanitizer in `dispatch_tool_call` and `_update_library`. If/when #35906 lands upstream, the sanitizer becomes a no-op and we leave it in place.
+
+---
+
+## 10. Cost envelope (smoke run upper bound)
+
+Per task baseline (no IMPORT branch, e.g. first ~10 tasks before the toolbox is populated): K=3 across CREATE and SKIP only = 6 single-shot calls + 3 legacy IMPORT (no-op when toolbox empty, but the call is still made) = 9 round-trips.
+
+Per IMPORT-eligible task (~40 of 50): K=3 multi-turn IMPORT trajectories × (up to 8 tool iterations + 1 final no-tool turn) = up to 27 calls; plus 6 for CREATE and SKIP = up to 33 round-trips.
+
+Total upper bound: 40·33 + 10·9 = **1410 round-trips** for the 50-task smoke. Acceptable for local vLLM.
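
The arithmetic behind this bound, written out as a short check. The constants come from the text above; the 40/10 split between IMPORT-eligible and cold-start tasks is the spec's own rough estimate, not a measured value.

```python
K = 3                # samples per mode
MAX_TOOL_ITERS = 8   # tool-calling iterations per IMPORT trajectory
N_TASKS = 50
N_ELIGIBLE = 40      # estimate: tasks that run with a non-empty toolbox

single_shot = 2 * K                              # CREATE + SKIP
cold_task = single_shot + K                      # plus K legacy IMPORT calls
warm_task = K * (MAX_TOOL_ITERS + 1) + single_shot

total = N_ELIGIBLE * warm_task + (N_TASKS - N_ELIGIBLE) * cold_task
print(cold_task, warm_task, total)
```

The bound scales linearly in `K` and in `MAX_TOOL_ITERS`, so the same check can be reused if either default changes.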
+
+---
+
+## 11. Out of scope (explicit)
+
+- Any change to PBEBench dataset/loader/scoring.
+- Any change to CREATE or SKIP generation paths.
+- Pre-seeding the toolbox.
+- Toolbox persistence across runs.
+- Any change to reward semantics.
+- Any per-task or per-prompt iteration after the smoke run lands.
+- Anthropic backend smoke runs.

From 0f872ac5278c4fd2730263901ca699e653a2791e Mon Sep 17 00:00:00 2001
From: mathuryash5 <mathuryash5@gmail.com>
Date: Sat, 25 Apr 2026 17:51:15 -0400
Subject: [PATCH 02/24] docs(trove): add native tool calling implementation
 plan

Step-by-step plan for implementing the design spec
(2026-04-25-trove-native-tool-calling-design.md). Eleven tasks covering
infra patches, tools_api module, chat_with_tools, controller branch,
prompts, CLI flags, vLLM launcher, analyzer script, deviations doc,
and the 50-task PBEBench-Lite smoke run + report.

Made-with: Cursor
---
 .../2026-04-25-trove-native-tool-calling.md   | 2274 +++++++++++++++++
 1 file changed, 2274 insertions(+)
 create mode 100644 docs/superpowers/plans/2026-04-25-trove-native-tool-calling.md

diff --git a/docs/superpowers/plans/2026-04-25-trove-native-tool-calling.md b/docs/superpowers/plans/2026-04-25-trove-native-tool-calling.md
new file mode 100644
index 00000000..76ecb582
--- /dev/null
+++ b/docs/superpowers/plans/2026-04-25-trove-native-tool-calling.md
@@ -0,0 +1,2274 @@
+# TroVE Native Tool Calling Implementation Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** Adapt the existing TroVE port so that the IMPORT mode uses native OpenAI tool calling (vLLM-served `gpt-oss`) while CREATE / SKIP / selection / trimming remain faithful to the paper, then run a 50-task PBEBench smoke and report numbers.
+
+**Architecture:** Keep `_multi_way_generation` unchanged for CREATE/SKIP. Replace the IMPORT branch (when toolbox non-empty AND backend is OpenAI) with a multi-turn loop that (a) translates top-k toolbox functions into OpenAI tool schemas, (b) lets the model emit `tool_calls` that are executed in a sandboxed subprocess, and (c) returns the final assistant text + recorded tool-call trajectory. Frequency credit comes from unique `tool_call.function.name` entries, not parsed `from toolbox import`. All other invariants (K-sampling, reward-based selection, AST tie-break, `C·log_{20}(n)` trimming) are unchanged.
+
+**Tech Stack:** Python 3.11, OpenAI Python SDK against a vLLM ≥ v0.16.0 endpoint serving `openai/gpt-oss-20b` (or `120b`), `subprocess`-based executor, `inspect` + `ast` from stdlib.
+
+**Spec:** [docs/superpowers/specs/2026-04-25-trove-native-tool-calling-design.md](../specs/2026-04-25-trove-native-tool-calling-design.md)
+
+---
+
+## File Structure
+
+| File | Status | Purpose |
+|---|---|---|
+| `symbolic_agent/baselines/trove/toolbox.py` | Modify | Trim coefficient `C=1.0` |
+| `symbolic_agent/baselines/trove/executor.py` | Modify | `DEFAULT_TIMEOUT=60` |
+| `symbolic_agent/baselines/trove/llm.py` | Modify | `reasoning_content` fallback in `_call_openai`; new `chat_with_tools` method |
+| `symbolic_agent/baselines/trove/parse.py` | Modify | `imported_callsites` helper; `task_family` parameter on `parse_response` |
+| `symbolic_agent/baselines/trove/prompts.py` | Modify | PBEBench-shaped few-shots; `build_import_with_tools_prompt`; `task_family` dispatch |
+| `symbolic_agent/baselines/trove/controller.py` | Modify | IMPORT-with-tools branch; telemetry fields; `task_family` + `selection` params |
+| `symbolic_agent/baselines/trove/tools_api.py` | Create | `toolbox_to_openai_tools`; `dispatch_tool_call` |
+| `symbolic_agent/baselines/trove/docs/deviations.md` | Create | Algorithmic deviations / faithful elements / infra patches |
+| `symbolic_agent/baselines/trove/tests/__init__.py` | Create | Marker file for the new tests package |
+| `symbolic_agent/baselines/trove/tests/test_tools_api.py` | Create | Unit tests for schema generation + dispatcher |
+| `symbolic_agent/baselines/trove/tests/test_parse_callsites.py` | Create | Unit tests for `imported_callsites` |
+| `main.py` | Modify | `--trove-selection` and `--trove-task-family` flags |
+| `scripts/launch_vllm_gpt_oss_120b.sh` | Modify | Add three vLLM tool-calling flags |
+| `scripts/analyze_trove_run.py` | Create | Post-hoc analysis of TroVE JSONL output |
+
+---
+
+## Task 1: Quick infrastructure patches (trim C, executor timeout, reasoning_content fallback)
+
+**Files:**
+- Modify: `symbolic_agent/baselines/trove/toolbox.py:117`
+- Modify: `symbolic_agent/baselines/trove/executor.py:19`
+- Modify: `symbolic_agent/baselines/trove/llm.py:192`
+
+These are three independent one-line changes. They are bundled because each is too small to warrant its own commit and all three are infrastructure patches.
+
+- [ ] **Step 1.1: Update trim coefficient default**
+
+In `symbolic_agent/baselines/trove/toolbox.py`, change the default of `trim`:
+
+```python
+def trim(self, n_processed: int, C: float = 1.0) -> set:
+    """
+    Remove functions whose frequency is below the threshold
+        C * log_{20}(n_processed)
+    and return the set of example indices that had used those functions.
+
+    Faithful to trim_library() in run_trove.py:
+        threshold = math.log(n, 20)   # log base 20
+    C defaults to 1.0, matching the original implementation (C·log_{20}(n)).
+    Note: the original uses log base-20 not base-10; we keep base-20.
+    """
+```
+
+- [ ] **Step 1.2: Update executor timeout default**
+
+In `symbolic_agent/baselines/trove/executor.py`, change the constant:
+
+```python
+DEFAULT_TIMEOUT = 60  # seconds — generous for PBEBench replace() chains and multi-turn dispatch
+```
+
+- [ ] **Step 1.3: Add reasoning_content fallback in `_call_openai`**
+
+In `symbolic_agent/baselines/trove/llm.py`, replace the line that reads `raw = response.choices[0].message.content or ""` with:
+
+```python
+                msg = response.choices[0].message
+                raw = msg.content or getattr(msg, "reasoning_content", "") or ""
+```
+
+Context (the surrounding `try` block stays unchanged):
+
+```python
+                response = self._client.chat.completions.create(
+                    model=model,
+                    max_tokens=max_tokens,
+                    messages=messages,
+                )
+                msg = response.choices[0].message
+                raw = msg.content or getattr(msg, "reasoning_content", "") or ""
+                u = getattr(response, "usage", None)
+```
+
+- [ ] **Step 1.4: Sanity-check the changes**
+
+Run: `python -c "from symbolic_agent.baselines.trove.toolbox import TroVEToolbox; from symbolic_agent.baselines.trove.executor import DEFAULT_TIMEOUT; import inspect; print(inspect.signature(TroVEToolbox.trim).parameters['C'].default, DEFAULT_TIMEOUT)"`
+
+Expected: `1.0 60`
+
+- [ ] **Step 1.5: Commit**
+
+```bash
+git add symbolic_agent/baselines/trove/toolbox.py symbolic_agent/baselines/trove/executor.py symbolic_agent/baselines/trove/llm.py
+git commit -m "$(cat <<'EOF'
+fix(trove): infra patches for native tool calling
+
+- toolbox.trim default C=1.0 (matches original TroVE)
+- executor DEFAULT_TIMEOUT=60s (PBEBench + multi-turn headroom)
+- llm._call_openai falls back to message.reasoning_content when
+  message.content is empty (gpt-oss Harmony channel split)
+EOF
+)"
+```
+
+---
+
+## Task 2: `parse.imported_callsites` helper + `task_family` parameter
+
+**Files:**
+- Modify: `symbolic_agent/baselines/trove/parse.py:86,106-114`
+- Create: `symbolic_agent/baselines/trove/tests/__init__.py`
+- Create: `symbolic_agent/baselines/trove/tests/test_parse_callsites.py`
+
+- [ ] **Step 2.1: Create the tests package marker**
+
+Create `symbolic_agent/baselines/trove/tests/__init__.py` as an empty file.
+
+- [ ] **Step 2.2: Write the failing test for `imported_callsites`**
+
+Create `symbolic_agent/baselines/trove/tests/test_parse_callsites.py`:
+
+```python
+"""Unit tests for parse.imported_callsites and parse_response(task_family=)."""
+
+from symbolic_agent.baselines.trove.parse import imported_callsites, parse_response
+
+
+# ---------------------------------------------------------------------------
+# imported_callsites
+# ---------------------------------------------------------------------------
+
+def test_callsites_bare_name():
+    code = "result = find_replace_chain(s, [('a', 'b')])\nprint(result)"
+    assert imported_callsites(code, tools_code="", candidate_names={"find_replace_chain", "other"}) == {"find_replace_chain"}
+
+
+def test_callsites_attribute_access():
+    code = "result = toolbox.find_replace_chain(s, pairs)\nprint(result)"
+    assert imported_callsites(code, tools_code="", candidate_names={"find_replace_chain"}) == {"find_replace_chain"}
+
+
+def test_callsites_no_match():
+    code = "print(s.replace('a', 'b'))"
+    assert imported_callsites(code, tools_code="", candidate_names={"find_replace_chain"}) == set()
+
+
+def test_callsites_multiple_calls_same_name_dedup():
+    code = "x = f(1)\ny = f(2)\nprint(x, y)"
+    assert imported_callsites(code, tools_code="", candidate_names={"f", "g"}) == {"f"}
+
+
+def test_callsites_syntax_error_returns_empty():
+    code = "this is not valid python ::"
+    assert imported_callsites(code, tools_code="", candidate_names={"f"}) == set()
+
+
+def test_callsites_empty_inputs():
+    assert imported_callsites("", "", set()) == set()
+    assert imported_callsites("print(1)", "", set()) == set()
+
+
+# ---------------------------------------------------------------------------
+# parse_response(task_family=)
+# ---------------------------------------------------------------------------
+
+def test_parse_response_pbebench_strict_no_solution_block():
+    text = "Here is some reasoning.\n```python\nprint('answer')\n```\n"
+    out = parse_response(text, task_family="pbebench")
+    assert out["solution_code"] == ""
+
+
+def test_parse_response_pbebench_with_solution_block():
+    text = "**Solution**\n```python\nprint('answer')\n```\n"
+    out = parse_response(text, task_family="pbebench")
+    assert out["solution_code"] == "print('answer')"
+
+
+def test_parse_response_default_falls_back_to_any_python_block():
+    text = "Here is some reasoning.\n```python\nprint('answer')\n```\n"
+    out = parse_response(text, task_family="default")
+    assert "print('answer')" in out["solution_code"]
+
+
+def test_parse_response_default_call_signature_unchanged():
+    text = "**Solution**\n```python\nprint('answer')\n```\n"
+    out = parse_response(text)
+    assert out["solution_code"] == "print('answer')"
+```
+
+- [ ] **Step 2.3: Run the tests to confirm they fail**
+
+Run: `python -m pytest symbolic_agent/baselines/trove/tests/test_parse_callsites.py -v`
+
+Expected: collection fails with an ImportError on `imported_callsites` (the function does not exist yet), so no tests run. Once that import resolves, the `parse_response(text, task_family=...)` tests would still fail on the unknown kwarg.
+
+- [ ] **Step 2.4: Implement `imported_callsites` and add `task_family` to `parse_response`**
+
+In `symbolic_agent/baselines/trove/parse.py`, add the helper at the end of the AST section (after `count_ast_nodes`):
+
+```python
+def imported_callsites(
+    solution_code: str,
+    tools_code: str,
+    candidate_names: set,
+) -> set:
+    """
+    Return the subset of `candidate_names` that appear as call-sites in
+    `solution_code`. Used for the `actually_called` telemetry field.
+
+    Detects two callee shapes:
+      - bare Name:        find_replace_chain(...)
+      - Attribute(name):  toolbox.find_replace_chain(...)
+
+    `tools_code` is currently unused (kept in the signature so callers can
+    pass through the **Tools** block context if we later want to filter by
+    what was actually imported).
+
+    Returns an empty set on empty input or SyntaxError.
+    """
+    if not solution_code or not candidate_names:
+        return set()
+    try:
+        tree = ast.parse(solution_code)
+    except SyntaxError:
+        return set()
+    found: set = set()
+    for node in ast.walk(tree):
+        if not isinstance(node, ast.Call):
+            continue
+        func = node.func
+        if isinstance(func, ast.Name) and func.id in candidate_names:
+            found.add(func.id)
+        elif isinstance(func, ast.Attribute) and func.attr in candidate_names:
+            found.add(func.attr)
+    return found
+```
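+
+Two edge cases of the matching rule are worth knowing when reading the `actually_called` telemetry. The sketch below uses a condensed restatement of the helper (named `callsites` here purely for illustration) so it runs on its own; the behaviour it shows follows from the two `isinstance` branches above.
+
+```python
+import ast
+
+
+def callsites(code: str, names: set) -> set:
+    """Condensed restatement of imported_callsites, for illustration only."""
+    try:
+        tree = ast.parse(code)
+    except SyntaxError:
+        return set()
+    found = set()
+    for node in ast.walk(tree):
+        if isinstance(node, ast.Call):
+            func = node.func
+            if isinstance(func, ast.Name) and func.id in names:
+                found.add(func.id)
+            elif isinstance(func, ast.Attribute) and func.attr in names:
+                found.add(func.attr)
+    return found
+
+
+names = {"find_replace_chain", "replace"}
+
+# A bare reference without a call is not counted.
+referenced_only = callsites("f = find_replace_chain", names)
+# Attribute calls match on the attribute name alone, so a str method is
+# counted when a toolbox function happens to share its name.
+method_call = callsites("print(s.replace('a', 'b'))", names)
+```
+
+If the toolbox ever induces a function named like a built-in method (`replace`, `split`, ...), the telemetry may over-count; filtering on the receiver name would be one way to tighten it.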
+
+Then modify `parse_response` (around line 86) to accept `task_family`:
+
+```python
+def parse_response(text: str, task_family: str = "default") -> dict:
+    """
+    Parse a TroVE-format LLM response.
+
+    Returns
+    -------
+    {
+        "solution_code": str,         # code inside **Solution** block
+        "tools_code":    str,         # code inside **Tools** block
+        "functions":     list[dict],  # parsed tool dicts from the Tools block
+    }
+
+    task_family
+    -----------
+    "default": if no **Solution** block is found, falls back to the first
+    ```python``` block anywhere (legacy behaviour).
+    "pbebench": no fallback. Strict **Solution**-block-only parsing avoids
+    accidentally promoting CoT scratchpad to the answer.
+    """
+    solution_code = _extract_code_block(text, "Solution") or ""
+    tools_code = _extract_code_block(text, "Tools") or ""
+
+    if not solution_code and task_family != "pbebench":
+        raw = _extract_any_python_block(text)
+        if raw:
+            solution_code = _make_executable(raw)
+
+    functions = parse_tools_in_chunk(tools_code) if tools_code else []
+    return {
+        "solution_code": solution_code,
+        "tools_code": tools_code,
+        "functions": functions,
+    }
+```
+
+- [ ] **Step 2.5: Run the tests to confirm they pass**
+
+Run: `python -m pytest symbolic_agent/baselines/trove/tests/test_parse_callsites.py -v`
+
+Expected: 10 passed.
+
+- [ ] **Step 2.6: Commit**
+
+```bash
+git add symbolic_agent/baselines/trove/parse.py symbolic_agent/baselines/trove/tests/__init__.py symbolic_agent/baselines/trove/tests/test_parse_callsites.py
+git commit -m "$(cat <<'EOF'
+feat(trove): add imported_callsites helper and task_family to parse_response
+
+- imported_callsites(solution, tools, names) -> set: AST-walks Solution
+  code and returns names from the candidate set that are actually called.
+  Handles bare Name and Attribute (toolbox.foo) callees.
+- parse_response(text, task_family="default"): when task_family="pbebench"
+  the parser does not fall back to the first python block when **Solution**
+  is missing. Prevents CoT scratchpad from being promoted to the answer.
+EOF
+)"
+```
+
+---
+
+## Task 3: PBEBench-shaped few-shots + IMPORT-with-tools prompt
+
+**Files:**
+- Modify: `symbolic_agent/baselines/trove/prompts.py` (full rewrite of constants and `build_*` functions)
+
+This task has no automated test — prompts are validated by inspection and by the smoke run.
+
+- [ ] **Step 3.1: Replace the prompts module with task-family-aware variants**
+
+Open `symbolic_agent/baselines/trove/prompts.py` and replace the entire body below the module docstring with the following. Keep the docstring at the top of the file.
+
+```python
+# ---------------------------------------------------------------------------
+# Format override (default-family only)
+# ---------------------------------------------------------------------------
+
+_FORMAT_OVERRIDE_DEFAULT = (
+    "\nIMPORTANT: Regardless of any formatting instructions inside the question, "
+    "always produce your answer as executable Python in the **Solution** block "
+    "and end it with print(answer). "
+    "Your answer is whatever gets printed to stdout when the Solution code runs."
+)
+
+# PBEBench prompts model the desired format directly via the few-shot example,
+# so no override string is needed.
+_FORMAT_OVERRIDE_PBEBENCH = ""
+
+
+def _format_override(task_family: str) -> str:
+    return _FORMAT_OVERRIDE_PBEBENCH if task_family == "pbebench" else _FORMAT_OVERRIDE_DEFAULT
+
+
+# ---------------------------------------------------------------------------
+# IMPORT mode (text-based, default and Anthropic fallback)
+# ---------------------------------------------------------------------------
+
+_IMPORT_INSTRUCTION_DEFAULT = (
+    "Your task is to write Python program solutions to the given questions.\n"
+    "The toolbox section lists all the available functions that can be used in your solution."
+)
+
+_IMPORT_EXAMPLE_DEFAULT = """\
+## Example
+**Question**
+Given a list of strings and a list of (old, new) substitution pairs, apply all
+substitutions in order to each string and return the transformed list.
+Strings: ["cat", "bat"]
+Substitutions: [("a", "o"), ("t", "p")]
+
+**Toolbox**
+```python
+# Apply an ordered list of (old, new) substitutions to each string in a list.
+apply_substitutions(strings: list, substitutions: list) -> list
+```
+
+**Solution**
+```python
+strings = ["cat", "bat"]
+subs = [("a", "o"), ("t", "p")]
+result = apply_substitutions(strings, subs)
+print(result)
+```
+**Tools**
+```python
+from toolbox import apply_substitutions
+```"""
+
+_IMPORT_EXAMPLE_PBEBENCH = """\
+## Example
+**Question**
+You are given example input/output pairs. Produce a list of replace() calls
+that transforms each input into its expected output.
+
+Input:  "hello world"
+Output: "HELLO_WORLD"
+
+**Toolbox**
+```python
+# Apply a chain of (old, new) replacements to a string.
+find_replace_chain(s: str, pairs: list) -> str
+```
+
+**Solution**
+```python
+result = find_replace_chain("hello world", [(" ", "_"), ("h", "H"), ("e", "E"), ("l", "L"), ("o", "O"), ("w", "W"), ("r", "R"), ("d", "D")])
+print(result)
+```
+**Tools**
+```python
+from toolbox import find_replace_chain
+```"""
+
+_IMPORT_TASK_TEMPLATE = """\
+## Task
+**Question**
+{question}
+
+**Toolbox**
+{toolbox}
+
+**Solution**
+"""
+
+
+def build_import_prompt(question: str, toolbox_str: str, task_family: str = "default") -> str:
+    """Build the text-based IMPORT-mode prompt (used for Anthropic and as fallback)."""
+    instruction = _IMPORT_INSTRUCTION_DEFAULT + _format_override(task_family)
+    example = _IMPORT_EXAMPLE_PBEBENCH if task_family == "pbebench" else _IMPORT_EXAMPLE_DEFAULT
+    return (
+        instruction
+        + "\n\n\n"
+        + example
+        + "\n\n\n"
+        + _IMPORT_TASK_TEMPLATE.format(question=question, toolbox=toolbox_str)
+    )
+
+
+# ---------------------------------------------------------------------------
+# IMPORT-with-tools mode (native OpenAI tool calling; no **Toolbox** block)
+# ---------------------------------------------------------------------------
+
+_IMPORT_WITH_TOOLS_INSTRUCTION_DEFAULT = (
+    "Your task is to write Python program solutions to the given questions.\n"
+    "You have a set of helper functions available as tools. Call any of them "
+    "when they help you solve the question; otherwise solve directly. After "
+    "you have computed the answer, output it as executable Python in a "
+    "**Solution** block and end with print(answer)."
+)
+
+_IMPORT_WITH_TOOLS_INSTRUCTION_PBEBENCH = (
+    "Your task is to produce a list of replace() calls that transforms each "
+    "input into its expected output for a Programming-by-Example task.\n"
+    "You have a set of helper functions available as tools. Call any of them "
+    "to test ideas or compute intermediate results; the final answer must be "
+    "produced as a Python program in the **Solution** block."
+)
+
+_IMPORT_WITH_TOOLS_EXAMPLE_DEFAULT = """\
+## Example
+**Question**
+Apply substitutions [("a","o"),("t","p")] to ["cat","bat"] and return the list.
+
+(After optionally calling `apply_substitutions` as a tool to confirm,
+the assistant produces:)
+
+**Solution**
+```python
+strings = ["cat", "bat"]
+subs = [("a", "o"), ("t", "p")]
+result = apply_substitutions(strings, subs)
+print(result)
+```"""
+
+_IMPORT_WITH_TOOLS_EXAMPLE_PBEBENCH = """\
+## Example
+**Question**
+Produce a sequence of replace() calls that transforms "hello world" into
+"HELLO_WORLD".
+
+(After optionally calling `find_replace_chain` as a tool to verify a
+candidate sequence, the assistant produces:)
+
+**Solution**
+```python
+result = find_replace_chain("hello world", [(" ", "_"), ("h", "H"), ("e", "E"), ("l", "L"), ("o", "O"), ("w", "W"), ("r", "R"), ("d", "D")])
+print(result)
+```"""
+
+_IMPORT_WITH_TOOLS_TASK_TEMPLATE = """\
+## Task
+**Question**
+{question}
+
+**Solution**
+"""
+
+
+def build_import_with_tools_prompt(question: str, task_family: str = "default") -> str:
+    """
+    Build the IMPORT-with-tools prompt. The toolbox is NOT shown as text — it
+    is conveyed via the OpenAI tools=[...] parameter on the chat completion call.
+    """
+    if task_family == "pbebench":
+        instruction = _IMPORT_WITH_TOOLS_INSTRUCTION_PBEBENCH
+        example = _IMPORT_WITH_TOOLS_EXAMPLE_PBEBENCH
+    else:
+        instruction = _IMPORT_WITH_TOOLS_INSTRUCTION_DEFAULT
+        example = _IMPORT_WITH_TOOLS_EXAMPLE_DEFAULT
+    return (
+        instruction
+        + "\n\n\n"
+        + example
+        + "\n\n\n"
+        + _IMPORT_WITH_TOOLS_TASK_TEMPLATE.format(question=question)
+    )
+
+
+# ---------------------------------------------------------------------------
+# CREATE mode
+# ---------------------------------------------------------------------------
+
+_CREATE_INSTRUCTION_DEFAULT = (
+    "Your task is to write Python program solutions to the given questions.\n"
+    "You should also create Python functions that can be used by your solution, "
+    "if you believe the function can be reused to solve other questions."
+)
+
+_CREATE_EXAMPLE_DEFAULT = """\
+## Example
+**Question**
+Given a list of strings and a list of (old, new) substitution pairs, apply all
+substitutions in order to each string and return the transformed list.
+Strings: ["hello", "world"]
+Substitutions: [("l", "r"), ("o", "0")]
+
+**Solution**
+```python
+strings = ["hello", "world"]
+subs = [("l", "r"), ("o", "0")]
+result = apply_substitutions(strings, subs)
+print(result)
+```
+**Tools**
+```python
+def apply_substitutions(strings, substitutions):
+    \"\"\"Apply an ordered list of (old, new) substitutions to each string in a list.\"\"\"
+    out = []
+    for s in strings:
+        for old, new in substitutions:
+            s = s.replace(old, new)
+        out.append(s)
+    return out
+```"""
+
+_CREATE_EXAMPLE_PBEBENCH = """\
+## Example
+**Question**
+Produce a sequence of replace() calls that transforms "hello world" into
+"HELLO_WORLD".
+
+**Solution**
+```python
+result = find_replace_chain("hello world", [(" ", "_"), ("h", "H"), ("e", "E"), ("l", "L"), ("o", "O"), ("w", "W"), ("r", "R"), ("d", "D")])
+print(result)
+```
+**Tools**
+```python
+def find_replace_chain(s, pairs):
+    \"\"\"Apply a chain of (old, new) replacements to a string.\"\"\"
+    for old, new in pairs:
+        s = s.replace(old, new)
+    return s
+```"""
+
+_CREATE_TASK_TEMPLATE = """\
+## Task
+**Question**
+{question}
+
+**Solution**
+"""
+
+
+def build_create_prompt(question: str, task_family: str = "default") -> str:
+    """Build the CREATE-mode prompt for a single task."""
+    instruction = _CREATE_INSTRUCTION_DEFAULT + _format_override(task_family)
+    example = _CREATE_EXAMPLE_PBEBENCH if task_family == "pbebench" else _CREATE_EXAMPLE_DEFAULT
+    return (
+        instruction
+        + "\n\n\n"
+        + example
+        + "\n\n\n"
+        + _CREATE_TASK_TEMPLATE.format(question=question)
+    )
+
+
+# ---------------------------------------------------------------------------
+# SKIP mode
+# ---------------------------------------------------------------------------
+
+_SKIP_INSTRUCTION_DEFAULT = (
+    "Your task is to write Python program solutions to the given questions."
+)
+
+_SKIP_EXAMPLE_DEFAULT = """\
+## Example
+**Question**
+Given the list of strings ["Hello", "World"], convert each to lowercase and
+return the resulting list.
+
+**Solution**
+```python
+strings = ["Hello", "World"]
+result = [s.lower() for s in strings]
+print(result)
+```
+**Tools**
+```python
+```"""
+
+_SKIP_EXAMPLE_PBEBENCH = """\
+## Example
+**Question**
+Produce a sequence of replace() calls that transforms "hello world" into
+"HELLO_WORLD".
+
+**Solution**
+```python
+s = "hello world"
+s = s.replace(" ", "_")
+s = s.replace("h", "H")
+s = s.replace("e", "E")
+s = s.replace("l", "L")
+s = s.replace("o", "O")
+s = s.replace("w", "W")
+s = s.replace("r", "R")
+s = s.replace("d", "D")
+print(s)
+```
+**Tools**
+```python
+```"""
+
+_SKIP_TASK_TEMPLATE = """\
+## Task
+**Question**
+{question}
+
+**Solution**
+"""
+
+
+def build_skip_prompt(question: str, task_family: str = "default") -> str:
+    """Build the SKIP-mode prompt for a single task."""
+    instruction = _SKIP_INSTRUCTION_DEFAULT + _format_override(task_family)
+    example = _SKIP_EXAMPLE_PBEBENCH if task_family == "pbebench" else _SKIP_EXAMPLE_DEFAULT
+    return (
+        instruction
+        + "\n\n\n"
+        + example
+        + "\n\n\n"
+        + _SKIP_TASK_TEMPLATE.format(question=question)
+    )
+
+
+def get_question(task_input: dict) -> str:
+    """
+    Extract the question/prompt string from a task_input dict.
+
+    Priority: question > prompt > task > str(task_input).
+    """
+    for key in ("question", "prompt", "task"):
+        val = task_input.get(key)
+        if val and isinstance(val, str) and val.strip():
+            return val.strip()
+    return str(task_input)
+```
+
+- [ ] **Step 3.2: Smoke-test the new prompts compile and dispatch correctly**
+
+Run: `python -c "from symbolic_agent.baselines.trove.prompts import build_import_prompt, build_create_prompt, build_skip_prompt, build_import_with_tools_prompt; print('--IMPORT default--'); print(build_import_prompt('Q?', 'TB')[:200]); print('--IMPORT pbebench--'); print(build_import_prompt('Q?', 'TB', task_family='pbebench')[:200]); print('--IMPORT_WITH_TOOLS pbebench--'); print(build_import_with_tools_prompt('Q?', task_family='pbebench')[:200])"`
+
+Expected: three short prompt previews, no exceptions, no `IMPORTANT:` line in the pbebench variant.
+
+- [ ] **Step 3.3: Commit**
+
+```bash
+git add symbolic_agent/baselines/trove/prompts.py
+git commit -m "$(cat <<'EOF'
+feat(trove): PBEBench-shaped few-shots and IMPORT-with-tools prompt
+
+- Add task_family parameter to all build_* prompt builders.
+- Add _CREATE_EXAMPLE_PBEBENCH and _SKIP_EXAMPLE_PBEBENCH demonstrating
+  replace()-chain solutions and a find_replace_chain helper.
+- Add build_import_with_tools_prompt for native tool calling: no
+  **Toolbox** markdown block (toolbox is conveyed via tools=[...]).
+- _FORMAT_OVERRIDE_PBEBENCH is empty, so no override is appended for
+  task_family="pbebench" (the example models the desired format directly).
+EOF
+)"
+```
+
+---
+
+## Task 4: New `tools_api.py` (toolbox -> OpenAI schemas, dispatcher)
+
+**Files:**
+- Create: `symbolic_agent/baselines/trove/tools_api.py`
+- Create: `symbolic_agent/baselines/trove/tests/test_tools_api.py`
+
+- [ ] **Step 4.1: Write the failing tests**
+
+Create `symbolic_agent/baselines/trove/tests/test_tools_api.py`:
+
+```python
+"""Unit tests for tools_api.toolbox_to_openai_tools and dispatch_tool_call."""
+
+import json
+from types import SimpleNamespace
+
+from symbolic_agent.baselines.trove.toolbox import TroVEToolbox
+from symbolic_agent.baselines.trove.tools_api import (
+    dispatch_tool_call,
+    toolbox_to_openai_tools,
+)
+
+
+def _make_toolbox_with(func_src: str, name: str, docstr: str = "") -> TroVEToolbox:
+    tb = TroVEToolbox()
+    tb.add(
+        {
+            "name": name,
+            "docstr": docstr,
+            "signature": f"def {name}(...)",
+            "function": func_src,
+            "type": "function",
+        },
+        example_idx=0,
+    )
+    return tb
+
+
+def _tool_call(name: str, args: dict, call_id: str = "call_1"):
+    return SimpleNamespace(
+        id=call_id,
+        function=SimpleNamespace(name=name, arguments=json.dumps(args)),
+    )
+
+
+# ---------------------------------------------------------------------------
+# toolbox_to_openai_tools
+# ---------------------------------------------------------------------------
+
+def test_schema_basic_function():
+    src = (
+        "def find_replace_chain(s: str, pairs: list) -> str:\n"
+        "    \"\"\"Apply a chain of (old, new) replacements to a string.\"\"\"\n"
+        "    for old, new in pairs:\n"
+        "        s = s.replace(old, new)\n"
+        "    return s\n"
+    )
+    tb = _make_toolbox_with(src, "find_replace_chain", docstr="Apply a chain of (old, new) replacements to a string.")
+    tools = toolbox_to_openai_tools(tb, topk=10)
+    assert len(tools) == 1
+    fn = tools[0]
+    assert fn["type"] == "function"
+    assert fn["function"]["name"] == "find_replace_chain"
+    assert fn["function"]["description"] == "Apply a chain of (old, new) replacements to a string."
+    params = fn["function"]["parameters"]
+    assert params["type"] == "object"
+    assert set(params["properties"].keys()) == {"s", "pairs"}
+    assert params["properties"]["s"]["type"] == "string"
+    assert params["properties"]["pairs"]["type"] == "array"
+    assert set(params["required"]) == {"s", "pairs"}
+
+
+def test_schema_unannotated_falls_back_to_string():
+    src = (
+        "def f(x):\n"
+        "    return x\n"
+    )
+    tb = _make_toolbox_with(src, "f")
+    tools = toolbox_to_openai_tools(tb, topk=10)
+    assert tools[0]["function"]["parameters"]["properties"]["x"]["type"] == "string"
+
+
+def test_schema_skips_varargs_kwargs():
+    src = (
+        "def f(*args, **kwargs):\n"
+        "    return args\n"
+    )
+    tb = _make_toolbox_with(src, "f")
+    tools = toolbox_to_openai_tools(tb, topk=10)
+    assert tools == []
+
+
+def test_schema_required_excludes_defaults():
+    src = (
+        "def f(x: int, y: int = 5):\n"
+        "    return x + y\n"
+    )
+    tb = _make_toolbox_with(src, "f")
+    tools = toolbox_to_openai_tools(tb, topk=10)
+    params = tools[0]["function"]["parameters"]
+    assert params["required"] == ["x"]
+    assert params["properties"]["y"]["type"] == "integer"
+
+
+def test_schema_topk_respects_frequency():
+    tb = TroVEToolbox()
+    for n, freq in [("a", 3), ("b", 2), ("c", 1)]:
+        tb.add(
+            {
+                "name": n,
+                "docstr": "",
+                "signature": f"def {n}()",
+                "function": f"def {n}():\n    return 0\n",
+                "type": "function",
+            },
+            example_idx=0,
+        )
+        for _ in range(freq - 1):
+            tb.update_frequency(n, example_idx=0)
+    tools = toolbox_to_openai_tools(tb, topk=2)
+    assert [t["function"]["name"] for t in tools] == ["a", "b"]
+
+
+def test_schema_empty_toolbox():
+    assert toolbox_to_openai_tools(TroVEToolbox(), topk=10) == []
+
+
+# ---------------------------------------------------------------------------
+# dispatch_tool_call
+# ---------------------------------------------------------------------------
+
+def test_dispatch_runs_function_and_returns_stdout():
+    src = (
+        "def reverse_str(s):\n"
+        "    return s[::-1]\n"
+    )
+    tb = _make_toolbox_with(src, "reverse_str")
+    result = dispatch_tool_call(tb, _tool_call("reverse_str", {"s": "hello"}))
+    assert "olleh" in result
+
+
+def test_dispatch_unknown_tool_returns_error():
+    tb = TroVEToolbox()
+    result = dispatch_tool_call(tb, _tool_call("nonexistent", {}))
+    assert "not in toolbox" in result
+
+
+def test_dispatch_bad_json_returns_error():
+    src = "def f(x):\n    return x\n"
+    tb = _make_toolbox_with(src, "f")
+    bad = SimpleNamespace(
+        id="x",
+        function=SimpleNamespace(name="f", arguments="{not json"),
+    )
+    result = dispatch_tool_call(tb, bad)
+    assert "argument JSON parse failed" in result
+
+
+def test_dispatch_sanitizes_harmony_contamination():
+    src = "def reverse_str(s):\n    return s[::-1]\n"
+    tb = _make_toolbox_with(src, "reverse_str")
+    tc = _tool_call("reverse_str<|channel|>commentary", {"s": "abc"})
+    result = dispatch_tool_call(tb, tc)
+    assert "cba" in result
+
+
+def test_dispatch_truncates_long_output():
+    src = (
+        "def long_output(n):\n"
+        "    return 'x' * n\n"
+    )
+    tb = _make_toolbox_with(src, "long_output")
+    result = dispatch_tool_call(tb, _tool_call("long_output", {"n": 10000}))
+    assert len(result) <= 4096 + 100  # +slack for repr quotes and truncation marker
+```
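+
+The last two dispatch tests depend on how the name sanitizer and the truncation marker behave. The sketch below restates both helpers from Step 4.3 under public names (an assumption for illustration) and works through the arithmetic behind the `+ 100` slack. It looks only at the `repr` string, so any trailing newline that the executor may add to captured stdout is not counted.
+
+```python
+MAX_RESULT_CHARS = 4096  # mirrors _MAX_RESULT_CHARS in tools_api.py
+
+
+def sanitize_name(name: str) -> str:
+    # Drop everything from the first Harmony control token onward.
+    return name.split("<|", 1)[0].strip()
+
+
+def truncate(s: str, limit: int = MAX_RESULT_CHARS) -> str:
+    if len(s) <= limit:
+        return s
+    return s[:limit] + f"\n... [truncated {len(s) - limit} chars]"
+
+
+clean = sanitize_name("reverse_str<|channel|>commentary")
+
+# repr('x' * 10000) adds two quote characters to the payload.
+printed = repr("x" * 10000)
+out = truncate(printed)
+marker_len = len(out) - MAX_RESULT_CHARS
+```
+
+For this input the marker is a few dozen characters, so the result stays well inside the `4096 + 100` bound asserted by `test_dispatch_truncates_long_output`.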
+
+- [ ] **Step 4.2: Run the tests to confirm they fail**
+
+Run: `python -m pytest symbolic_agent/baselines/trove/tests/test_tools_api.py -v`
+
+Expected: ImportError on `tools_api` module.
+
+- [ ] **Step 4.3: Create the `tools_api.py` module**
+
+Create `symbolic_agent/baselines/trove/tools_api.py`:
+
+```python
+"""Translate the TroVE toolbox into OpenAI Chat Completions tool schemas
+and dispatch tool calls back through the executor.
+
+This module is the bridge between TroVE's in-memory toolbox and vLLM's
+native tool-calling protocol. It is invoked only from the IMPORT-with-tools
+controller branch.
+"""
+
+from __future__ import annotations
+
+import inspect
+import json
+import logging
+from typing import Any
+
+from .executor import run_solution
+from .toolbox import TroVEToolbox
+
+logger = logging.getLogger(__name__)
+
+_MAX_RESULT_CHARS = 4096
+
+# Type inference: Python annotation -> JSON Schema type.
+_TYPE_MAP = {
+    int: "integer",
+    float: "number",
+    bool: "boolean",
+    str: "string",
+    list: "array",
+    tuple: "array",
+    dict: "object",
+}
+
+
+def _infer_type(annotation: Any) -> str:
+    if annotation is inspect.Parameter.empty:
+        return "string"
+    # Plain types (int, str, etc.)
+    if annotation in _TYPE_MAP:
+        return _TYPE_MAP[annotation]
+    # typing.List, typing.Dict, etc. — fall through to string if unrecognised.
+    origin = getattr(annotation, "__origin__", None)
+    if origin in _TYPE_MAP:
+        return _TYPE_MAP[origin]
+    return "string"
+
+
+def _function_to_schema(name: str, fn: Any, docstr: str) -> dict | None:
+    """
+    Build one OpenAI tool dict from a callable. Returns None if the signature
+    cannot be introspected or the function has *args or **kwargs (we cannot
+    generate a meaningful schema).
+    """
+    try:
+        sig = inspect.signature(fn)
+    except (TypeError, ValueError) as exc:
+        logger.debug("Could not introspect %s: %s", name, exc)
+        return None
+
+    properties: dict = {}
+    required: list = []
+
+    for pname, param in sig.parameters.items():
+        if param.kind in (
+            inspect.Parameter.VAR_POSITIONAL,
+            inspect.Parameter.VAR_KEYWORD,
+        ):
+            logger.debug("Skipping %s — has *args/**kwargs", name)
+            return None
+        prop: dict = {"type": _infer_type(param.annotation)}
+        if param.default is not inspect.Parameter.empty:
+            if isinstance(param.default, (int, float, bool, str)):
+                prop["default"] = param.default
+        else:
+            required.append(pname)
+        properties[pname] = prop
+
+    return {
+        "type": "function",
+        "function": {
+            "name": name,
+            "description": docstr or "",
+            "parameters": {
+                "type": "object",
+                "properties": properties,
+                "required": required,
+            },
+        },
+    }
+
+
+def toolbox_to_openai_tools(toolbox: TroVEToolbox, topk: int = 10) -> list:
+    """
+    Convert the top-k toolbox functions (by frequency) into OpenAI Chat
+    Completions tool dicts.
+
+    Functions with *args / **kwargs are silently excluded.
+    Returns [] when the toolbox is empty.
+    """
+    entries = toolbox.snapshot()
+    if not entries:
+        return []
+    entries.sort(key=lambda e: -int(e.get("frequency", 0)))
+    selected = entries[:topk]
+
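+    namespace: dict = {}
+    try:
+        # dont_inherit=True: this module's `from __future__ import annotations`
+        # would otherwise be inherited by the exec'd source and turn every
+        # parameter annotation into a string, which _infer_type cannot map.
+        code = compile(toolbox.get_full_code(), "<toolbox>", "exec", dont_inherit=True)
+        exec(code, namespace)
+    except Exception as exc:
+        logger.warning("Could not exec toolbox source for schema generation: %s", exc)
+        return []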
+
+    tools: list = []
+    for entry in selected:
+        name = entry.get("name", "")
+        if not name or name not in namespace:
+            continue
+        fn = namespace[name]
+        schema = _function_to_schema(name, fn, entry.get("docstr", ""))
+        if schema is not None:
+            tools.append(schema)
+    return tools
+
+
+def _sanitize_name(name: str) -> str:
+    """Defensive workaround for the Harmony control-token leak tracked in
+    open vLLM PR #35906 (tool names arriving as
+    `reverse_str<|channel|>commentary`)."""
+    return name.split("<|", 1)[0].strip()
+
+
+def _truncate(s: str, limit: int = _MAX_RESULT_CHARS) -> str:
+    if len(s) <= limit:
+        return s
+    return s[:limit] + f"\n... [truncated {len(s) - limit} chars]"
+
+
+def dispatch_tool_call(toolbox: TroVEToolbox, tool_call) -> str:
+    """
+    Resolve `tool_call` against the toolbox, run it via the sandbox executor,
+    and return the captured stdout (truncated to 4096 chars) or an error
+    message string. Always returns a string — never raises.
+    """
+    name = _sanitize_name(getattr(tool_call.function, "name", "") or "")
+    if not name:
+        return json.dumps({"error": "tool_call has no function name"})
+    if name not in {e["name"] for e in toolbox.snapshot()}:
+        return json.dumps({"error": f"tool '{name}' not in toolbox"})
+
+    raw_args = getattr(tool_call.function, "arguments", "") or "{}"
+    try:
+        args = json.loads(raw_args)
+        if not isinstance(args, dict):
+            return json.dumps({"error": f"argument JSON parse failed: expected object, got {type(args).__name__}"})
+    except json.JSONDecodeError as exc:
+        return json.dumps({"error": f"argument JSON parse failed: {exc}"})
+
+    call_expr = f"print(repr({name}(**{args!r})))"
+    is_ok, output = run_solution(
+        solution_code=call_expr,
+        tools_code="",
+        toolbox_code=toolbox.get_full_code(),
+    )
+    if not is_ok:
+        return json.dumps({"error": "execution failed", "stderr": _truncate(output)})
+    return _truncate(output)
+```
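+
+One detail is worth verifying when the schema tests run. Source passed to `exec()` as a string is compiled with the calling module's `__future__` flags by default, and `tools_api.py` begins with `from __future__ import annotations`. Under that flag annotations are stored as strings, which `_infer_type` maps to `"string"`. The sketch below simulates both cases with explicit `compile()` flags; `load_function` and `annotation_of` are hypothetical helpers, not part of the plan.
+
+```python
+import __future__
+import inspect
+
+SRC = "def f(x: int, pairs: list):\n    return x\n"
+
+
+def load_function(src: str, postponed: bool):
+    """Exec `src` with or without postponed (string) annotations."""
+    flags = __future__.annotations.compiler_flag if postponed else 0
+    namespace: dict = {}
+    # dont_inherit=True keeps the result independent of this file's own flags.
+    exec(compile(src, "<toolbox>", "exec", flags=flags, dont_inherit=True), namespace)
+    return namespace["f"]
+
+
+def annotation_of(fn, param: str):
+    return inspect.signature(fn).parameters[param].annotation
+
+
+real = annotation_of(load_function(SRC, postponed=False), "pairs")
+as_string = annotation_of(load_function(SRC, postponed=True), "pairs")
+```
+
+Compiling the toolbox source with `dont_inherit=True`, or resolving annotations with `inspect.signature(fn, eval_str=True)` on Python 3.10+, are two ways to keep real types available for schema inference.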
+
+- [ ] **Step 4.4: Run the tests to confirm they pass**
+
+Run: `python -m pytest symbolic_agent/baselines/trove/tests/test_tools_api.py -v`
+
+Expected: 11 passed.
+
+- [ ] **Step 4.5: Commit**
+
+```bash
+git add symbolic_agent/baselines/trove/tools_api.py symbolic_agent/baselines/trove/tests/test_tools_api.py
+git commit -m "$(cat <<'EOF'
+feat(trove): add tools_api for native OpenAI tool calling
+
+- toolbox_to_openai_tools(toolbox, topk=10): converts top-k toolbox
+  functions into OpenAI Chat Completions tool schemas. Infers parameter
+  types from inspect.signature; functions with *args/**kwargs are
+  silently excluded.
+- dispatch_tool_call(toolbox, tool_call): runs the requested function
+  in the sandbox executor, returns stdout truncated to 4096 chars or
+  a JSON error string. Sanitizes Harmony control-token contamination
+  in tool names (defensive vs. open vLLM PR #35906).
+EOF
+)"
+```
+
+---
+
+## Task 5: `chat_with_tools` method on `TroVELLMClient`
+
+**Files:**
+- Modify: `symbolic_agent/baselines/trove/llm.py` (add new method, no signature changes to existing methods)
+
+This task has no automated test — the multi-turn loop is validated by the controller-level integration plus the smoke run.
+
+- [ ] **Step 5.1: Add `chat_with_tools` to `TroVELLMClient`**
+
+In `symbolic_agent/baselines/trove/llm.py`, make sure the `typing` import near the top includes `Callable` (the other names may already be imported):
+
+```python
+from typing import Any, Callable, Dict, List, Optional
+```
+
+Then add the new method to the `TroVELLMClient` class (insert after `_call_openai`, before `_record`):
+
+```python
+    # ------------------------------------------------------------------
+    # Native tool calling (OpenAI/vLLM only)
+    # ------------------------------------------------------------------
+
+    def chat_with_tools(
+        self,
+        messages: List[Dict[str, Any]],
+        tools: List[Dict[str, Any]],
+        model: str,
+        max_tokens: int = DEFAULT_MAX_TOKENS,
+        max_tool_iters: int = 8,
+        on_tool_call: Optional[Callable[[Any], str]] = None,
+        tag: str = "",
+    ) -> Dict[str, Any]:
+        """
+        Multi-turn chat completion that supports native OpenAI tool calls.
+
+        Returns
+        -------
+        {
+            "final_text":     str,         # message.content (or reasoning_content fallback)
+            "tool_calls":     list[dict],  # ordered, each {name, args_preview, result_preview, ok}
+            "iterations":     int,         # number of round-trips actually used
+            "stopped_reason": str,         # "no_tool_calls" | "max_iters" | "error"
+        }
+
+        The caller is responsible for providing `on_tool_call(tc) -> str`,
+        which is invoked for every tool_call returned by the model. The
+        return value (already a string) is sent back as the tool message.
+
+        Anthropic backend is not supported — this method exists for the
+        OpenAI/vLLM tool-calling flow only. It raises NotImplementedError
+        on Anthropic as a defensive guard; controllers must check
+        `self.backend == "openai"` before calling.
+        """
+        if self.backend != "openai":
+            raise NotImplementedError("chat_with_tools requires the openai backend")
+
+        if on_tool_call is None:
+            raise ValueError("chat_with_tools requires an on_tool_call callback")
+
+        recorded_calls: List[Dict[str, Any]] = []
+        convo: List[Dict[str, Any]] = list(messages)
+        iterations = 0
+        final_text = ""
+        stopped_reason = "no_tool_calls"
+
+        for it in range(max_tool_iters + 1):
+            iterations = it + 1
+            iter_tag = f"{tag}_iter{it}" if tag else f"iter{it}"
+            response = None
+            last_exc = None
+
+            for attempt in range(3):
+                try:
+                    response = self._client.chat.completions.create(
+                        model=model,
+                        max_tokens=max_tokens,
+                        messages=convo,
+                        tools=tools,
+                        tool_choice="auto",
+                    )
+                    break
+                except Exception as exc:
+                    last_exc = exc
+                    if getattr(exc, "status_code", None) == 400:
+                        logger.warning(
+                            "OpenAI chat_with_tools 400 (tag=%s): %s", iter_tag, exc
+                        )
+                        self._record(iter_tag, model, json.dumps(convo)[:2000], "", max_tokens, {})
+                        return {
+                            "final_text": "",
+                            "tool_calls": recorded_calls,
+                            "iterations": iterations,
+                            "stopped_reason": "error",
+                        }
+                    if attempt < 2:
+                        wait = 5 * (2 ** attempt)
+                        logger.warning(
+                            "chat_with_tools failed (attempt %d/3, tag=%s): %s. Retrying in %ds.",
+                            attempt + 1, iter_tag, exc, wait,
+                        )
+                        time.sleep(wait)
+
+            if response is None:
+                logger.warning("All chat_with_tools retries exhausted (tag=%s): %s", iter_tag, last_exc)
+                stopped_reason = "error"
+                break
+
+            msg = response.choices[0].message
+            content = msg.content or getattr(msg, "reasoning_content", "") or ""
+            tool_calls = getattr(msg, "tool_calls", None) or []
+
+            u = getattr(response, "usage", None)
+            details = getattr(u, "completion_tokens_details", None)
+            usage = {
+                "input_tokens": getattr(u, "prompt_tokens", 0) or 0,
+                "output_tokens": getattr(u, "completion_tokens", 0) or 0,
+                "reasoning_tokens": (getattr(details, "reasoning_tokens", 0) or 0) if details else 0,
+            }
+            self._record(
+                iter_tag,
+                model,
+                json.dumps(convo)[:2000],
+                json.dumps({"content": content, "tool_calls_count": len(tool_calls)}),
+                max_tokens,
+                usage,
+            )
+
+            if not tool_calls:
+                final_text = content
+                stopped_reason = "no_tool_calls"
+                break
+
+            assistant_msg: Dict[str, Any] = {
+                "role": "assistant",
+                "content": content,
+                "tool_calls": [
+                    {
+                        "id": tc.id,
+                        "type": "function",
+                        "function": {
+                            "name": tc.function.name,
+                            "arguments": tc.function.arguments,
+                        },
+                    }
+                    for tc in tool_calls
+                ],
+            }
+            convo.append(assistant_msg)
+
+            for tc in tool_calls:
+                try:
+                    result = on_tool_call(tc)
+                    ok = True
+                except Exception as exc:
+                    result = json.dumps({"error": f"on_tool_call raised: {exc}"})
+                    ok = False
+                args_preview = (tc.function.arguments or "")[:200]
+                result_preview = (result or "")[:200]
+                recorded_calls.append(
+                    {
+                        "name": tc.function.name,
+                        "args_preview": args_preview,
+                        "result_preview": result_preview,
+                        "ok": ok,
+                    }
+                )
+                convo.append(
+                    {
+                        "role": "tool",
+                        "tool_call_id": tc.id,
+                        "content": result,
+                    }
+                )
+
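+            # Tool-call budget exhausted: stop without another round-trip, so the
+            # tool results appended above are never sent back to the model (and
+            # the `+ 1` in the loop bound is never reached).
+            if it >= max_tool_iters - 1:
+                stopped_reason = "max_iters"
+                final_text = content
+                break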
+
+        return {
+            "final_text": final_text,
+            "tool_calls": recorded_calls,
+            "iterations": iterations,
+            "stopped_reason": stopped_reason,
+        }
+```
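+
+The message bookkeeping in `chat_with_tools` can be hard to see through the retry and logging code. The sketch below mirrors only the conversation shape (an assistant turn carrying `tool_calls`, then one `tool` message per call). `run_tool_loop`, `scripted_model`, and `dispatch` are hypothetical names, and plain dicts stand in for the SDK's message objects:
+
+```python
+import json
+from typing import Any, Callable, Dict, List
+
+Message = Dict[str, Any]
+
+def run_tool_loop(
+    model_turn: Callable[[List[Message]], Message],
+    on_tool_call: Callable[[Message], str],
+    messages: List[Message],
+    max_tool_iters: int = 8,
+) -> Dict[str, Any]:
+    """Simplified mirror of the loop: no retries, logging, or token accounting."""
+    convo = list(messages)
+    content = ""
+    for it in range(max_tool_iters):
+        msg = model_turn(convo)
+        content = msg.get("content") or ""
+        tool_calls = msg.get("tool_calls") or []
+        if not tool_calls:
+            return {"final_text": content, "iterations": it + 1,
+                    "stopped_reason": "no_tool_calls", "convo": convo}
+        # Echo the assistant turn, then answer each call with a "tool" message.
+        convo.append({"role": "assistant", "content": content, "tool_calls": tool_calls})
+        for tc in tool_calls:
+            convo.append({"role": "tool", "tool_call_id": tc["id"],
+                          "content": on_tool_call(tc)})
+    return {"final_text": content, "iterations": max_tool_iters,
+            "stopped_reason": "max_iters", "convo": convo}
+
+def scripted_model(convo: List[Message]) -> Message:
+    # First turn: request one tool call. Once a tool result exists: answer in text.
+    if not any(m["role"] == "tool" for m in convo):
+        call = {"id": "call_0", "type": "function",
+                "function": {"name": "upper", "arguments": json.dumps({"s": "abc"})}}
+        return {"content": "", "tool_calls": [call]}
+    return {"content": "**Solution**: ABC", "tool_calls": []}
+
+def dispatch(tc: Message) -> str:
+    args = json.loads(tc["function"]["arguments"])
+    return args["s"].upper()
+
+out = run_tool_loop(scripted_model, dispatch, [{"role": "user", "content": "Uppercase abc"}])
+roles = [m["role"] for m in out["convo"]]
+```
+
+In this scripted run the model needs two round-trips: one that requests the tool and one that produces the final text after seeing the tool result.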
+
+- [ ] **Step 5.2: Smoke-test the method does not break import**
+
+Run: `python -c "from symbolic_agent.baselines.trove.llm import TroVELLMClient; print(hasattr(TroVELLMClient, 'chat_with_tools'))"`
+
+Expected: `True`.
+
+- [ ] **Step 5.3: Smoke-test the Anthropic guard fires**
+
+Run:
+
+```bash
+python -c "
+from symbolic_agent.baselines.trove.llm import TroVELLMClient
+c = TroVELLMClient(backend='anthropic', api_key='unused')
+try:
+    c.chat_with_tools([], [], model='x', on_tool_call=lambda x: '')
+    print('no exception (BUG)')
+except NotImplementedError as e:
+    print('guard fires:', e)
+"
+```
+
+Expected: `guard fires: chat_with_tools requires the openai backend`.
+
+- [ ] **Step 5.4: Commit**
+
+```bash
+git add symbolic_agent/baselines/trove/llm.py
+git commit -m "$(cat <<'EOF'
+feat(trove): add TroVELLMClient.chat_with_tools for native tool calls
+
+Multi-turn loop that handles tool_calls returned by gpt-oss/vLLM:
+appends assistant message + tool result messages until the model returns
+no tool_calls or max_tool_iters is reached. Records each call as
+{name, args_preview, result_preview, ok} for downstream telemetry.
+Reuses the existing 3-attempt retry, debug logging, and token accounting.
+
+Anthropic backend raises NotImplementedError as a defensive guard;
+controllers branch on self.backend == "openai" before calling.
+EOF
+)"
+```
+
+---
+
+## Task 6: Controller IMPORT-with-tools branch + telemetry fields
+
+**Files:**
+- Modify: `symbolic_agent/baselines/trove/controller.py`
+
+- [ ] **Step 6.1: Update imports and `__init__` signature**
+
+In `symbolic_agent/baselines/trove/controller.py`, replace the imports block at the top (currently lines 36-44) with:
+
+```python
+import logging
+from collections import Counter
+from typing import Callable, Dict, List, Optional
+
+from . import tools_api
+from .executor import run_solution
+from .llm import TroVELLMClient
+from .parse import count_ast_nodes, imported_callsites, parse_response
+from .prompts import (
+    build_create_prompt,
+    build_import_prompt,
+    build_import_with_tools_prompt,
+    build_skip_prompt,
+    get_question,
+)
+from .toolbox import TroVEToolbox
+```
+
+Then update `TroVEController.__init__` (currently around lines 78-105) to accept the new `task_family`, `selection`, `max_tool_iters`, and `tool_schema_topk` parameters:
+
+```python
+    def __init__(
+        self,
+        api_key: Optional[str] = None,
+        model: str = "claude-sonnet-4-5",
+        base_url: Optional[str] = None,
+        debug_dir: Optional[str] = None,
+        k: int = DEFAULT_K,
+        trim_every: int = DEFAULT_TRIM_EVERY,
+        trim_C: float = 1.0,
+        temperature: float = 0.3,
+        top_p: float = 0.95,
+        task_family: str = "default",
+        selection: str = "reward",
+        max_tool_iters: int = 8,
+        tool_schema_topk: int = 10,
+    ):
+        self.model = model
+        self.k = k
+        self.trim_every = trim_every
+        self.trim_C = trim_C
+        self.task_family = task_family
+        self.selection = selection
+        self.max_tool_iters = max_tool_iters
+        self.tool_schema_topk = tool_schema_topk
+
+        self.backend = "openai" if base_url else "anthropic"
+        self.llm = TroVELLMClient(
+            backend=self.backend,
+            base_url=base_url,
+            api_key=api_key,
+            temperature=temperature,
+            top_p=top_p,
+            debug_dir=debug_dir,
+        )
+        self.toolbox = TroVEToolbox()
+        self._n_processed: int = 0
+```
+
+(Note that the `trim_C` default is now 1.0, matching the toolbox change in Task 1; callers that rely on the default get the new behavior.)
+
+- [ ] **Step 6.2: Update existing build_* call-sites to pass `task_family`**
+
+In `_multi_way_generation`, find each call to the legacy `build_import_prompt(question, toolbox_str)`, to `build_create_prompt(question)`, and to `build_skip_prompt(question)`, and replace them (in that order) with:
+
+```python
+                prompt = build_import_prompt(question, toolbox_str, task_family=self.task_family)
+```
+
+```python
+            prompt = build_create_prompt(question, task_family=self.task_family)
+```
+
+```python
+            prompt = build_skip_prompt(question, task_family=self.task_family)
+```
+
+Also update `parse_response(raw)` calls to `parse_response(raw, task_family=self.task_family)`.
+
+- [ ] **Step 6.3: Insert the IMPORT-with-tools branch in `_multi_way_generation`**
+
+Locate the `# --- IMPORT mode ---` section (currently around lines 254-274). Replace it with:
+
+```python
+        # --- IMPORT mode ---
+        toolbox_nonempty = bool(toolbox_str)
+        use_tools_branch = toolbox_nonempty and self.backend == "openai"
+
+        if use_tools_branch:
+            import_candidates = self._generate_import_with_tools(
+                question, example_idx, reward_fn=reward_fn, entry=entry
+            )
+            best_import_idx, best_import_score = self._select_best(
+                import_candidates, reward_fn=reward_fn, entry=entry
+            )
+            best_import = import_candidates[best_import_idx]
+            best_import["_reward_score"] = best_import_score
+        elif toolbox_nonempty:
+            # Legacy text-based IMPORT (Anthropic or unforeseen non-OpenAI path).
+            import_candidates = []
+            for _ in range(self.k):
+                prompt = build_import_prompt(question, toolbox_str, task_family=self.task_family)
+                raw = self.llm.call(prompt, self.model, max_tokens=DEFAULT_MAX_TOKENS, tag="trove_import")
+                parsed = parse_response(raw, task_family=self.task_family)
+                is_ok, out = run_solution(
+                    parsed["solution_code"],
+                    parsed["tools_code"],
+                    self.toolbox.get_full_code(),
+                )
+                import_candidates.append(
+                    {**parsed, "is_success": is_ok, "exec_output": out, "tool_calls": [], "stopped_reason": "legacy"}
+                )
+            best_import_idx, best_import_score = self._select_best(
+                import_candidates, reward_fn=reward_fn, entry=entry
+            )
+            best_import = import_candidates[best_import_idx]
+            best_import["_reward_score"] = best_import_score
+        else:
+            best_import = {
+                "solution_code": "", "tools_code": "", "functions": [],
+                "is_success": False, "exec_output": "",
+                "tool_calls": [], "stopped_reason": "empty_toolbox",
+                "_reward_score": None,
+            }
+```
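+
+The three-way branch in Step 6.3 reduces to a small decision function. A sketch, with `import_branch` and its return labels as hypothetical names used only for illustration:
+
+```python
+def import_branch(toolbox_str: str, backend: str) -> str:
+    """Return a label for the IMPORT path the controller would take."""
+    if not toolbox_str:
+        # Empty toolbox: IMPORT is skipped and a placeholder response is used.
+        return "empty_toolbox"
+    if backend == "openai":
+        # Non-empty toolbox on the OpenAI/vLLM backend: native tool calling.
+        return "native_tools"
+    # Non-empty toolbox on any other backend: legacy text-based IMPORT.
+    return "legacy_text"
+
+branches = {
+    ("", "openai"): import_branch("", "openai"),
+    ("def f(): ...", "openai"): import_branch("def f(): ...", "openai"),
+    ("def f(): ...", "anthropic"): import_branch("def f(): ...", "anthropic"),
+}
+```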
+
+- [ ] **Step 6.4: Add the `_generate_import_with_tools` method**
+
+Insert this new method into the `TroVEController` class, after `_multi_way_generation`:
+
+```python
+    def _generate_import_with_tools(
+        self,
+        question: str,
+        example_idx: int,
+        reward_fn: Optional[Callable] = None,
+        entry: Optional[dict] = None,
+    ) -> List[dict]:
+        """
+        IMPORT-mode generation using native OpenAI tool calling.
+        Builds K trajectories; each trajectory may invoke toolbox functions
+        via tool_calls during the multi-turn loop. Returns K candidate dicts
+        compatible with _select_best.
+        """
+        prompt = build_import_with_tools_prompt(question, task_family=self.task_family)
+        tools_schema = tools_api.toolbox_to_openai_tools(self.toolbox, topk=self.tool_schema_topk)
+
+        candidates: List[dict] = []
+        for i in range(self.k):
+            tag = f"trove_import_t{example_idx}_{i}"
+            messages = [{"role": "user", "content": prompt}]
+            on_tc = lambda tc: tools_api.dispatch_tool_call(self.toolbox, tc)
+            traj = self.llm.chat_with_tools(
+                messages=messages,
+                tools=tools_schema,
+                model=self.model,
+                max_tokens=DEFAULT_MAX_TOKENS,
+                max_tool_iters=self.max_tool_iters,
+                on_tool_call=on_tc,
+                tag=tag,
+            )
+            parsed = parse_response(traj["final_text"], task_family=self.task_family)
+            is_ok, out = run_solution(
+                parsed["solution_code"],
+                parsed["tools_code"],
+                self.toolbox.get_full_code(),
+            )
+            candidates.append(
+                {
+                    **parsed,
+                    "is_success": is_ok,
+                    "exec_output": out,
+                    "tool_calls": traj["tool_calls"],
+                    "stopped_reason": traj["stopped_reason"],
+                    "iterations": traj["iterations"],
+                }
+            )
+        return candidates
+```
+
+- [ ] **Step 6.5: Wire `selection="consistency"` to the existing consistency selector**
+
+Replace `_select_best` (currently around lines 337-361) with:
+
+```python
+    def _select_best(
+        self,
+        candidates: List[dict],
+        reward_fn: Optional[Callable] = None,
+        entry: Optional[dict] = None,
+    ):
+        """
+        Select the best candidate from a list of response dicts.
+
+        Returns (best_index, score_or_None) where score is (reward, message)
+        when reward-based selection is used, or None otherwise.
+
+        Selection strategy is governed by self.selection:
+          - "reward" (default): reward-based when reward_fn+entry provided,
+            falls back to consistency when not.
+          - "consistency": original TroVE majority-vote algorithm.
+        """
+        if self.selection == "consistency":
+            return self._select_best_by_consistency(candidates), None
+        if reward_fn is not None and entry is not None:
+            return self._select_best_by_reward(candidates, reward_fn, entry)
+        return self._select_best_by_consistency(candidates), None
+```
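+
+The precedence in `_select_best` is: an explicit `selection="consistency"` always wins; otherwise reward-based selection runs only when both `reward_fn` and `entry` are supplied. A sketch with a hypothetical `pick_strategy` helper that returns the name of the selector that would run:
+
+```python
+from typing import Callable, Optional
+
+def pick_strategy(
+    selection: str,
+    reward_fn: Optional[Callable] = None,
+    entry: Optional[dict] = None,
+) -> str:
+    """Mirror the branch order of _select_best without running any selector."""
+    if selection == "consistency":
+        return "consistency"
+    if reward_fn is not None and entry is not None:
+        return "reward"
+    # selection == "reward" but the inputs for scoring are missing.
+    return "consistency"
+
+def dummy_reward(candidate, entry):
+    # Placeholder reward function; its signature is an assumption.
+    return 0.0
+
+with_inputs = pick_strategy("reward", dummy_reward, {"id": 1})
+without_entry = pick_strategy("reward", dummy_reward, None)
+forced = pick_strategy("consistency", dummy_reward, {"id": 1})
+```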
+
+- [ ] **Step 6.6: Update `_update_library` to credit frequency from tool_calls**
+
+Replace `_update_library` (currently around lines 419-432) with:
+
+```python
+    def _update_library(self, mode: str, resp: dict, example_idx: int) -> None:
+        """Update toolbox based on winning mode (faithful to run_trove.py)."""
+        if mode == "import":
+            tool_calls = resp.get("tool_calls") or []
+            if tool_calls:
+                # Native tool-calling path: credit by unique tool_call.function.name
+                # (defensive: sanitize and let toolbox.update_frequency filter unknowns).
+                unique_names = {
+                    tc["name"].split("<|", 1)[0].strip()
+                    for tc in tool_calls
+                    if tc.get("name")
+                }
+                for name in unique_names:
+                    if name:
+                        self.toolbox.update_frequency(name, example_idx)
+            else:
+                # Legacy text-based IMPORT: credit functions parsed from **Tools**.
+                for func_dict in resp.get("functions", []):
+                    name = func_dict.get("name", "")
+                    if name:
+                        self.toolbox.update_frequency(name, example_idx)
+        elif mode == "create" and resp.get("is_success"):
+            for func_dict in resp.get("functions", []):
+                self.toolbox.add(func_dict, example_idx)
+
+        # SKIP: no library changes
+```
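+
+Steps 6.6 and 6.7 both clean tool names with `split("<|", 1)[0].strip()`. A minimal sketch of what that expression does, using made-up contaminated names (the exact form of leaked Harmony tokens is an assumption here):
+
+```python
+def sanitize_tool_name(raw: str) -> str:
+    """Drop everything from the first '<|' onward, then trim whitespace."""
+    return raw.split("<|", 1)[0].strip()
+
+# Illustrative recorded calls; only the "name" key matters for crediting.
+tool_calls = [
+    {"name": "find_replace_chain<|channel|>commentary"},  # contaminated
+    {"name": "find_replace_chain"},                       # clean duplicate
+    {"name": " strip_prefix "},                           # stray whitespace
+    {"name": "<|channel|>"},                              # sanitizes to ""
+    {"name": ""},                                         # filtered by tc.get
+]
+
+unique_names = {
+    sanitize_tool_name(tc["name"]) for tc in tool_calls if tc.get("name")
+}
+# An all-token name collapses to "", which is why the plan re-checks `if name:`.
+credited = sorted(name for name in unique_names if name)
+```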
+
+- [ ] **Step 6.7: Add telemetry fields to `_make_result`**
+
+Replace `_make_result` (currently around lines 438-480) with:
+
+```python
+    def _make_result(
+        self,
+        task_input: dict,
+        task_type: str,
+        best_mode: str,
+        best_resp: dict,
+        is_success: bool,
+        output: str,
+        best_reward_score=None,
+    ) -> dict:
+        """
+        Build a result dict compatible with main.py's _print_result() and
+        _append_task_output(). Adds passive TroVE telemetry fields.
+        """
+        tool_calls = best_resp.get("tool_calls") or []
+        tools_called = sorted({
+            tc["name"].split("<|", 1)[0].strip()
+            for tc in tool_calls
+            if tc.get("name")
+        })
+        candidate_names = {e["name"] for e in self.toolbox.snapshot()}
+        actually_called = sorted(
+            imported_callsites(
+                solution_code=best_resp.get("solution_code", ""),
+                tools_code=best_resp.get("tools_code", ""),
+                candidate_names=candidate_names,
+            )
+        )
+        import_eligible = len(self.toolbox) > 0  # state AFTER this task's update
+        # Note: import_eligible reflects the current toolbox state after
+        # _update_library has already run for this task. The analyzer should
+        # interpret this as "a non-empty toolbox existed at some point during
+        # this task's processing". For pre-task eligibility, infer from
+        # toolbox snapshots in adjacent tasks.
+
+        return {
+            "task_type": task_type,
+            "original_prompt": str(task_input),
+            "solved": is_success,
+            "steps": 1,
+            "trace": [
+                {
+                    "step": 0,
+                    "agent": "trove",
+                    "action": best_mode,
+                    "is_success": is_success,
+                }
+            ],
+            "solution": best_resp.get("solution_code", ""),
+            "library_snapshot": self.toolbox.snapshot(),
+            "cost_summary": {},
+            "final_output": {
+                "answer": output,
+                "explanation": f"TroVE mode={best_mode}",
+                "confidence": "high" if is_success else "low",
+                "execution_result": output,
+            },
+            "agent_messages": self.llm.get_task_log(),
+            "reward_history": [],
+            "best_reward": None,
+            "final_reward": None,
+            "_best_reward_score": best_reward_score,
+            # TroVE native-tool-calling telemetry
+            "won_mode": best_mode,
+            "import_eligible": import_eligible,
+            "import_was_winner": best_mode == "import",
+            "tool_calls": tool_calls,
+            "tool_call_count": len(tool_calls),
+            "tools_called": tools_called,
+            "actually_called": actually_called,
+            "trove_stopped_reason": best_resp.get("stopped_reason", ""),
+        }
+```
+
+- [ ] **Step 6.8: Sanity-check the controller imports and constructs**
+
+Run: `python -c "from symbolic_agent.baselines.trove.controller import TroVEController; c = TroVEController(api_key='unused', model='x', task_family='pbebench', selection='reward'); print(c.task_family, c.selection, c.backend, c.max_tool_iters, c.tool_schema_topk)"`
+
+Expected: `pbebench reward anthropic 8 10`.
+
+- [ ] **Step 6.9: Run all tests to confirm no regressions**
+
+Run: `python -m pytest symbolic_agent/baselines/trove/tests/ -v`
+
+Expected: 16 passed (6 from parse_callsites + 10 from tools_api).
+
+- [ ] **Step 6.10: Commit**
+
+```bash
+git add symbolic_agent/baselines/trove/controller.py
+git commit -m "$(cat <<'EOF'
+feat(trove): controller branch for native IMPORT tool calling
+
+- Add task_family, selection, max_tool_iters, and tool_schema_topk
+  params to TroVEController.__init__.
+- IMPORT branch dispatches to _generate_import_with_tools when toolbox
+  is non-empty and backend is openai; otherwise falls back to legacy
+  text-based IMPORT.
+- _generate_import_with_tools builds K multi-turn trajectories via
+  TroVELLMClient.chat_with_tools, parses **Solution** strictly for
+  pbebench, and runs the result through the executor.
+- _update_library credits frequency by unique tool_call.function.name
+  for the native path; legacy path still credits parsed functions.
+- _make_result emits won_mode, import_eligible, import_was_winner,
+  tool_calls, tool_call_count, tools_called, actually_called,
+  trove_stopped_reason as passive telemetry.
+- _select_best honors selection="consistency" or "reward" (default).
+EOF
+)"
+```
+
+---
+
+## Task 7: `main.py` CLI flags (`--trove-selection`, `--trove-task-family`)
+
+**Files:**
+- Modify: `main.py:794-810` (add new flags) and `main.py:1002-1011` (pass through to controller)
+
+- [ ] **Step 7.1: Add the two new argparse flags**
+
+In `main.py`, after the existing `--trove-trim-every` argument (around line 810), insert:
+
+```python
+    parser.add_argument(
+        "--trove-selection",
+        choices=["reward", "consistency"],
+        default="reward",
+        help="[TroVE] Candidate selection strategy. 'reward' (default) uses "
+             "the per-task reward function with AST tie-breaking. "
+             "'consistency' uses the original TroVE majority-vote algorithm. "
+             "(default: reward)",
+    )
+    parser.add_argument(
+        "--trove-task-family",
+        choices=["default", "pbebench"],
+        default="default",
+        help="[TroVE] Task family for prompt selection and parser strictness. "
+             "'pbebench' uses PBEBench-shaped few-shots and strict **Solution** "
+             "parsing (no fallback to any python block). (default: default)",
+    )
+```
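+
+A quick way to check the flag definitions in isolation is to rebuild just these two arguments on a fresh parser. This sketch omits the help strings and every other `main.py` argument:
+
+```python
+import argparse
+
+parser = argparse.ArgumentParser()
+parser.add_argument(
+    "--trove-selection", choices=["reward", "consistency"], default="reward"
+)
+parser.add_argument(
+    "--trove-task-family", choices=["default", "pbebench"], default="default"
+)
+
+# No flags given: both fall back to their defaults.
+defaults = parser.parse_args([])
+
+# argparse maps dashes to underscores in attribute names.
+pbe = parser.parse_args(
+    ["--trove-task-family", "pbebench", "--trove-selection", "consistency"]
+)
+```
+
+Passing a value outside `choices` makes `parse_args` exit with a usage error, so typos in either flag fail fast.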
+
+- [ ] **Step 7.2: Plumb the flags into the `TroVEController` constructor**
+
+Find the `elif args.framework == "trove":` block (around line 1002) and replace the `controller = TroVEController(...)` call with:
+
+```python
+    elif args.framework == "trove":
+        controller = TroVEController(
+            api_key=api_key,
+            model=model,
+            base_url=base_url,
+            debug_dir=args.debug_dir,
+            k=args.trove_k,
+            trim_every=args.trove_trim_every,
+            task_family=args.trove_task_family,
+            selection=args.trove_selection,
+        )
+        logger.info(
+            "Framework: TroVE (k=%d, trim_every=%d, task_family=%s, selection=%s)",
+            args.trove_k, args.trove_trim_every, args.trove_task_family, args.trove_selection,
+        )
+```
+
+- [ ] **Step 7.3: Sanity-check the CLI parses both flags**
+
+Run: `python main.py --help 2>&1 | grep -E "trove-selection|trove-task-family"`
+
+Expected: two lines, one for each new flag, both showing the choices and defaults.
+
+- [ ] **Step 7.4: Sanity-check controller wires through**
+
+Construct an empty tasks file so the run finishes immediately after parsing args:
+
+```bash
+echo '[]' > /tmp/_pbebench_empty.json
+VLLM_API_KEY=EMPTY python main.py \
+  --framework trove \
+  --trove-task-family pbebench \
+  --trove-selection reward \
+  --tasks-file /tmp/_pbebench_empty.json \
+  --model openai/gpt-oss-20b \
+  --backend vllm \
+  --base-url http://localhost:8000/v1 \
+  2>&1 | grep -E "Framework: TroVE|ERROR" | head -5
+```
+
+Expected: `Framework: TroVE (k=5, trim_every=500, task_family=pbebench, selection=reward)` then an `ERROR: no records found` from the loader. Both confirm the flags parsed and the controller was constructed.
+
+- [ ] **Step 7.5: Commit**
+
+```bash
+git add main.py
+git commit -m "$(cat <<'EOF'
+feat(trove): CLI flags --trove-selection and --trove-task-family
+
+- --trove-selection {reward,consistency} (default: reward).
+- --trove-task-family {default,pbebench} (default: default). Plumbed
+  through to TroVEController; PBEBench runs should pass --trove-task-family
+  pbebench to enable PBEBench-shaped few-shots and strict **Solution**
+  parsing.
+EOF
+)"
+```
+
+---
+
+## Task 8: Update vLLM launcher script with tool-calling flags
+
+**Files:**
+- Modify: `scripts/launch_vllm_gpt_oss_120b.sh`
+
+- [ ] **Step 8.1: Add the three vLLM flags**
+
+Replace the body of `scripts/launch_vllm_gpt_oss_120b.sh` with:
+
+```bash
+#!/bin/bash
+
+mkdir -p /tmp/$USER-tiktoken-cache /tmp/$USER-tmp
+chmod 700 /tmp/$USER-tiktoken-cache /tmp/$USER-tmp
+export TIKTOKEN_CACHE_DIR=/tmp/$USER-tiktoken-cache
+export TMPDIR=/tmp/$USER-tmp
+
+ts=$(date +%Y%m%d_%H%M%S)
+
+# Required vLLM tool-calling flags (vLLM >= v0.16.0 for PR #28729):
+#   --enable-auto-tool-choice  enables tool_choice="auto"
+#   --tool-call-parser openai  parses gpt-oss Harmony commentary channel
+#   --reasoning-parser openai_gptoss  routes analysis-channel content into
+#                                     message.reasoning_content
+nohup python -m vllm.entrypoints.openai.api_server \
+  --model "openai/gpt-oss-120b" \
+  --tokenizer "openai/gpt-oss-120b" \
+  --dtype auto \
+  --port ${1} \
+  --gpu-memory-utilization 0.95 \
+  --tensor-parallel-size 2 \
+  --enable-auto-tool-choice \
+  --tool-call-parser openai \
+  --reasoning-parser openai_gptoss \
+  > vllm_logs/vllm_${1}_${ts}.log 2>&1 & echo $! > vllm_logs/vllm_${1}_${ts}.pid
+```
+
+- [ ] **Step 8.2: Lint the script**
+
+Run: `bash -n scripts/launch_vllm_gpt_oss_120b.sh && echo OK`
+
+Expected: `OK`.
+
+- [ ] **Step 8.3: Commit**
+
+```bash
+git add scripts/launch_vllm_gpt_oss_120b.sh
+git commit -m "$(cat <<'EOF'
+chore(launcher): enable native tool calling for gpt-oss-120b vLLM server
+
+Add three flags required for OpenAI-compatible tool calling on gpt-oss
+served by vLLM >= v0.16.0:
+  --enable-auto-tool-choice
+  --tool-call-parser openai
+  --reasoning-parser openai_gptoss
+
+Without these the controller's chat_with_tools loop sees no tool_calls
+in the response and degrades to no-tool behavior.
+EOF
+)"
+```
+
+---
+
+## Task 9: `scripts/analyze_trove_run.py`
+
+**Files:**
+- Create: `scripts/analyze_trove_run.py`
+
+- [ ] **Step 9.1: Create the analysis script**
+
+Create `scripts/analyze_trove_run.py`:
+
+```python
+#!/usr/bin/env python3
+"""Post-hoc analysis of a TroVE run JSONL output.
+
+Reads the per-task JSONL file produced by main.py --output-file and reports:
+  - Overall accuracy
+  - Final toolbox size
+  - Per-mode wins
+  - IMPORT-mode tool-use breakdown
+  - Top-10 most-called toolbox functions
+
+Usage:
+    python scripts/analyze_trove_run.py path/to/results.jsonl
+"""
+
+from __future__ import annotations
+
+import argparse
+import json
+import sys
+from collections import Counter
+from pathlib import Path
+
+
+def _load_rows(path: Path) -> list[dict]:
+    rows = []
+    with path.open() as f:
+        for lineno, line in enumerate(f, 1):
+            line = line.strip()
+            if not line:
+                continue
+            try:
+                rows.append(json.loads(line))
+            except json.JSONDecodeError as exc:
+                print(f"warning: line {lineno} is not valid JSON: {exc}", file=sys.stderr)
+    return rows
+
+
+def _result_dict(row: dict) -> dict:
+    """Tolerant accessor: return the nested 'result' dict if present, else the row itself."""
+    return row.get("result") or row
+
+
+def main() -> None:
+    parser = argparse.ArgumentParser(description=__doc__)
+    parser.add_argument("path", type=Path, help="Path to the TroVE results JSONL file")
+    args = parser.parse_args()
+
+    rows = _load_rows(args.path)
+    if not rows:
+        print("ERROR: no rows loaded", file=sys.stderr)
+        sys.exit(1)
+
+    n = len(rows)
+    results = [_result_dict(r) for r in rows]
+
+    # Overall accuracy
+    solved = sum(1 for r in results if r.get("solved"))
+    print(f"=== Run summary: {args.path.name} ===")
+    print(f"Tasks: {n}")
+    print(f"Solved: {solved}/{n} ({100 * solved / n:.1f}%)")
+
+    # Final toolbox size — take the snapshot from the last row.
+    last_snapshot = results[-1].get("library_snapshot") or []
+    print(f"Final toolbox size: {len(last_snapshot)}")
+
+    # Per-mode wins
+    mode_counter = Counter(r.get("won_mode", "?") for r in results)
+    print(f"Mode wins: {dict(mode_counter)}")
+
+    # IMPORT-mode tool-use breakdown
+    import_eligible = [r for r in results if r.get("import_eligible")]
+    if not import_eligible:
+        print("No IMPORT-eligible tasks observed.")
+    else:
+        with_calls = [r for r in import_eligible if (r.get("tool_call_count") or 0) >= 1]
+        n_eligible = len(import_eligible)
+        n_with = len(with_calls)
+        mean_calls = (
+            sum((r.get("tool_call_count") or 0) for r in import_eligible) / n_eligible
+        )
+        all_calls = [tc for r in import_eligible for tc in (r.get("tool_calls") or [])]
+        n_calls_total = len(all_calls)
+        n_calls_ok = sum(1 for tc in all_calls if tc.get("ok"))
+        success_rate = (100 * n_calls_ok / n_calls_total) if n_calls_total else 0.0
+        print(
+            f"IMPORT-eligible tasks: {n_eligible}\n"
+            f"  Tasks with >=1 tool call: {n_with}/{n_eligible} ({100 * n_with / n_eligible:.1f}%)\n"
+            f"  Mean tool calls / task:   {mean_calls:.2f}\n"
+            f"  Tool-call success rate:   {n_calls_ok}/{n_calls_total} ({success_rate:.1f}%)"
+        )
+
+    # Top-10 most-called functions
+    name_counter: Counter = Counter()
+    for r in results:
+        for tc in r.get("tool_calls") or []:
+            name = (tc.get("name") or "").split("<|", 1)[0].strip()
+            if name:
+                name_counter[name] += 1
+    if name_counter:
+        print("Top-10 most-called toolbox functions:")
+        for name, cnt in name_counter.most_common(10):
+            print(f"  {cnt:4d}  {name}")
+    else:
+        print("No tool calls recorded in this run.")
+
+
+if __name__ == "__main__":
+    main()
+```
+
+- [ ] **Step 9.2: Make the script executable and syntax-check it**
+
+Run: `chmod +x scripts/analyze_trove_run.py && python -c "import ast; ast.parse(open('scripts/analyze_trove_run.py').read())" && echo OK`
+
+Expected: `OK`.
+
+- [ ] **Step 9.3: Smoke-test on synthetic data**
+
+Run:
+
+```bash
+python -c "
+import json, tempfile, subprocess
+rows = [
+    {'result': {'solved': True,  'won_mode': 'import', 'import_eligible': True,  'tool_call_count': 2, 'tool_calls': [{'name':'find_replace_chain','ok':True},{'name':'find_replace_chain','ok':True}], 'library_snapshot':[{'name':'find_replace_chain'}]}},
+    {'result': {'solved': False, 'won_mode': 'create', 'import_eligible': False, 'tool_call_count': 0, 'tool_calls': [], 'library_snapshot':[{'name':'find_replace_chain'}]}},
+]
+with tempfile.NamedTemporaryFile('w', suffix='.jsonl', delete=False) as f:
+    for r in rows: f.write(json.dumps(r) + '\n')
+    p = f.name
+print(subprocess.check_output(['python','scripts/analyze_trove_run.py', p]).decode())
+"
+```
+
+Expected output contains `Solved: 1/2 (50.0%)`, `Final toolbox size: 1`, `Mode wins: {'import': 1, 'create': 1}`, `IMPORT-eligible tasks: 1`, a `Tool-call success rate:` line ending in `2/2 (100.0%)` (the label is padded for column alignment), and a row `2  find_replace_chain` in the top-10.
+
+- [ ] **Step 9.4: Commit**
+
+```bash
+git add scripts/analyze_trove_run.py
+git commit -m "$(cat <<'EOF'
+feat(trove): add analyze_trove_run.py for post-hoc telemetry reports
+
+Reads a TroVE JSONL output and reports overall accuracy, final toolbox
+size, per-mode wins, IMPORT-mode tool-use breakdown (>=1 call rate,
+mean calls/task, success rate), and the top-10 most-called toolbox
+functions. Sanitizes Harmony control-token contamination in tool names
+when aggregating.
+EOF
+)"
+```
+
+---
+
+## Task 10: Rewrite `docs/deviations.md`
+
+**Files:**
+- Create: `symbolic_agent/baselines/trove/docs/deviations.md`
+
+- [ ] **Step 10.1: Create the directory and the deviations doc**
+
+Create `symbolic_agent/baselines/trove/docs/deviations.md`:
+
+```markdown
+# TroVE Implementation: Deviations and Faithful Elements
+
+This document tracks how this port differs from — and where it stays
+faithful to — the original TroVE algorithm
+([Wang et al., 2024](https://arxiv.org/abs/2401.12869),
+[zorazrw/trove](https://github.com/zorazrw/trove)).
+
+## 1. Algorithmic deviations
+
+### 1.1 Native OpenAI tool calling for IMPORT mode
+The original TroVE shows the model a `**Toolbox**` markdown block
+listing top-k function signatures and asks it to write a `**Solution**`
+plus `**Tools**` block referencing those functions by name. We replace
+this for the IMPORT mode (when `backend == "openai"` and the toolbox is
+non-empty) with **native OpenAI tool calling**: the toolbox is exposed
+via the `tools=[...]` parameter of `chat.completions.create`, the model
+emits structured `tool_calls` during its reasoning, and `dispatch_tool_call`
+runs each one in the sandboxed executor and returns the stdout. This
+makes function usage observable and credit-able from the trajectory
+itself.
+
+### 1.2 Reward-based candidate selection (default)
+The paper uses self-consistency (majority vote on stdout, AST tie-break)
+to pick the best of K samples per mode. We default to **reward-based
+selection**: every candidate is scored by the per-task reward function,
+ties broken by minimum AST node count. This is more reliable on
+PBEBench (program-list outputs rarely tie as strings). The original
+self-consistency selector remains available via `--trove-selection consistency`.
+
+### 1.3 PBEBench-shaped few-shot examples
+For `task_family="pbebench"` we replace the generic CREATE / SKIP / IMPORT
+example pairs with PBEBench-shaped pairs that demonstrate `replace()`
+chains and a small reusable helper (`find_replace_chain`). The legacy
+default examples remain for `task_family="default"`.
+
+### 1.4 Strict **Solution** parsing for PBEBench
+The legacy parser falls back to "first ```python``` block anywhere" when
+no `**Solution**` block is present. For `task_family="pbebench"` this
+fallback is disabled, preventing CoT scratchpad from being accidentally
+promoted to the answer.
+
+## 2. Faithful elements
+
+- 3-mode generation (IMPORT, CREATE, SKIP).
+- K samples per mode (default K=5, paper).
+- AST-tie-breaking by node count (simplest solution wins).
+- Periodic toolbox trimming with threshold `C·log_{20}(n)`, default
+  `C=1.0`, matching the original implementation.
+- Frequency-based top-k retrieval for the toolbox view.
+- Dict-keyed toolbox structure mirroring `utils/code.py`.
+- Library updates: IMPORT credits frequency, CREATE adds new functions
+  on success, SKIP makes no library changes.
+
+## 3. Infrastructural patches
+
+- **JSONL-per-task checkpointing** via `--output-file`, with crash
+  resumption.
+- **`reasoning_content` fallback** in `_call_openai` for `gpt-oss` Harmony
+  channel splits where the answer text lives in `message.reasoning_content`.
+- **Executor timeout 60s** (vs. 10s in earlier versions of this port),
+  closer to the original's ~100s.
+- **`<|`-truncation sanitizer** in `dispatch_tool_call` and
+  `_update_library`. Defensive workaround for the open vLLM
+  [PR #35906](https://github.com/vllm-project/vllm/pull/35906) covering
+  Harmony control-token leakage into tool names. When that PR lands
+  upstream the sanitizer becomes a no-op and is left in place.
+
+## 4. Backend coverage caveat
+
+Anthropic backend code paths exist and are exercised by CREATE / SKIP and
+the legacy text-based IMPORT fallback, but **the smoke run and reported
+numbers are vLLM-served `gpt-oss` only**. IMPORT-with-tools requires
+the OpenAI/vLLM backend and is the only path we test end-to-end.
+
+## 5. vLLM version requirement
+
+- Minimum vLLM: **v0.16.0** (branch-cut 2026-02-08).
+- Required upstream change: [PR #28729](https://github.com/vllm-project/vllm/pull/28729)
+  ("Multiple fixes for gpt-oss Chat Completion prompting"), merged
+  2025-12-12. v0.16.0 is the first stable release branch-cut after the merge.
+- Known open caveat: [PR #35906](https://github.com/vllm-project/vllm/pull/35906)
+  ("Sanitize leaked Harmony control tokens"), still open as of late
+  March 2026 — see §3 for the sanitizer mitigation.
+```
+
+- [ ] **Step 10.2: Verify the file was written**
+
+Run: `head -20 symbolic_agent/baselines/trove/docs/deviations.md`
+
+Expected: the first 20 lines print, starting with the `# TroVE Implementation: Deviations and Faithful Elements` title.
+
+- [ ] **Step 10.3: Commit**
+
+```bash
+git add symbolic_agent/baselines/trove/docs/deviations.md
+git commit -m "$(cat <<'EOF'
+docs(trove): rewrite deviations.md for native tool calling
+
+Document algorithmic deviations (native OpenAI tool calling for IMPORT,
+reward-based selection by default, PBEBench-shaped few-shots, strict
+**Solution** parsing for pbebench), faithful elements (3-mode generation,
+K-sampling, AST tie-break, C*log_20(n) trimming with C=1.0), and
+infrastructural patches (JSONL checkpointing, reasoning_content
+fallback, 60s executor timeout, defensive <|-truncation sanitizer).
+
+Includes vLLM version requirement (>= v0.16.0 for PR #28729) and the
+backend coverage caveat (smoke run is vLLM-served gpt-oss only).
+EOF
+)"
+```
+
+---
+
+## Task 11: Pre-flight sanity check + 50-task smoke run + report
+
+**Files:** none modified. This is the validation task.
+
+- [ ] **Step 11.1: Re-launch vLLM with the new flags**
+
+The existing launcher is named `launch_vllm_gpt_oss_120b.sh` but the spec calls for `gpt-oss-20b`. Two options — pick one:
+
+(a) **Smoke on 120b directly** (no script change beyond Task 8). Run:
+
+```bash
+bash scripts/launch_vllm_gpt_oss_120b.sh 8000
+```
+
+Then in Steps 11.2 and 11.4, replace `--model openai/gpt-oss-20b` with `--model openai/gpt-oss-120b`.
+
+(b) **Smoke on 20b** (one-line edit). In `scripts/launch_vllm_gpt_oss_120b.sh`, change `openai/gpt-oss-120b` → `openai/gpt-oss-20b` for both `--model` and `--tokenizer`, and lower `--tensor-parallel-size 2` → `--tensor-parallel-size 1` (20b fits on one GPU). Then:
+
+```bash
+bash scripts/launch_vllm_gpt_oss_120b.sh 8000
+```
+
+(Do not commit the edit — restore the file before the final commit, or rename the script if you want the 20b variant kept.)
+
+Then wait 60–120 seconds and confirm the server is up:
+
+Run: `curl -sS http://localhost:8000/v1/models | head -5`
+
+Expected: a JSON response listing the model you launched.
+
+- [ ] **Step 11.2: Pre-flight: one-task smoke**
+
+Run a single task to verify the tool-calling round-trip works end-to-end. The codebase has no `--num-tasks` flag, so we slice the first row out of the 50-task PBEBench-Lite file:
+
+```bash
+mkdir -p outputs/trove_pbebench_preflight
+head -n 1 data/pbebench/lite_pilot_tasks.jsonl > /tmp/_pbebench_one.jsonl
+VLLM_API_KEY=EMPTY python main.py \
+  --framework trove \
+  --tasks-file /tmp/_pbebench_one.jsonl \
+  --output-file outputs/trove_pbebench_preflight/results.jsonl \
+  --model openai/gpt-oss-20b \
+  --backend vllm \
+  --base-url http://localhost:8000/v1 \
+  --trove-task-family pbebench \
+  --trove-selection reward \
+  --trove-k 3 \
+  --trove-trim-every 9999 \
+  --max-tokens 4096 \
+  --debug-dir outputs/trove_pbebench_preflight/debug
+```
+
+Expected: the run completes without crashing. The output file should contain one row.
+
+- [ ] **Step 11.3: Verify the tool-calling pre-flight check**
+
+This task starts with an empty toolbox so the IMPORT-with-tools branch will not run. Inspect the most recent debug-dir log file with `trove_create` or `trove_skip` in the name and confirm it contains a non-empty response:
+
+Run: `ls -t outputs/trove_pbebench_preflight/debug/trove_run_*/0001_*.json | head -1 | xargs python -c "import json,sys; d=json.load(open(sys.argv[1])); print('content length:', len(d['response']['content']))"`
+
+Expected: non-zero content length. If zero, the `reasoning_content` fallback (Task 1.3) is not engaging — debug before proceeding.
+
+- [ ] **Step 11.4: Run the 50-task smoke**
+
+`data/pbebench/lite_pilot_tasks.jsonl` is exactly 50 PBEBench-Lite tasks with per-task `reward: pbebench`, so no slicing or `--default-reward` flag is required.
+
+```bash
+mkdir -p outputs/trove_pbebench_smoke
+VLLM_API_KEY=EMPTY python main.py \
+  --framework trove \
+  --tasks-file data/pbebench/lite_pilot_tasks.jsonl \
+  --output-file outputs/trove_pbebench_smoke/results.jsonl \
+  --model openai/gpt-oss-20b \
+  --backend vllm \
+  --base-url http://localhost:8000/v1 \
+  --trove-task-family pbebench \
+  --trove-selection reward \
+  --trove-k 3 \
+  --trove-trim-every 9999 \
+  --max-tokens 4096 \
+  --debug-dir outputs/trove_pbebench_smoke/debug
+```
+
+Expected: ~30–60 minutes wall-clock on local vLLM. Run completes without crashes. Auto-resume from checkpoint is supported by `--output-file` if the run is interrupted.
+
+- [ ] **Step 11.5: Run the analysis script and capture the report**
+
+Run: `python scripts/analyze_trove_run.py outputs/trove_pbebench_smoke/results.jsonl | tee outputs/trove_pbebench_smoke/report.txt`
+
+Expected: the report shows accuracy, toolbox size, mode wins, IMPORT-mode tool-use breakdown, and top-10 functions.
+
+- [ ] **Step 11.6: Report numbers to the user (no prompt iteration)**
+
+Per the spec's "done criteria", report the contents of `outputs/trove_pbebench_smoke/report.txt` plus a short narrative paragraph noting any anomalies (e.g. `<|channel|>` contamination from PR #35906, `max_iters` stops, JSON-arg parse failures).
+
+**No prompt iteration. No threshold tuning. The numbers are what they are.**
+
+---
+
+## Self-Review
+
+### 1. Spec coverage
+
+| Spec section | Implementing task |
+|---|---|
+| §3 Architecture overview | Tasks 1–8 collectively |
+| §4 Data flow for IMPORT-with-tools | Tasks 4–6 |
+| §5.1 New `tools_api.py` | Task 4 |
+| §5.2 `_call_openai` reasoning fallback | Task 1 |
+| §5.2 `chat_with_tools` method | Task 5 |
+| §5.3 Controller `__init__` params, IMPORT branch, `_update_library`, `_make_result` | Task 6 |
+| §5.4 `imported_callsites`, `task_family` in parse_response | Task 2 |
+| §5.5 PBEBench prompts and IMPORT-with-tools prompt | Task 3 |
+| §5.6 Trim `C=1.0` | Task 1 |
+| §5.7 Executor timeout 60s | Task 1 |
+| §5.8 main.py CLI flags | Task 7 |
+| §5.9 vLLM launcher flags | Task 8 |
+| §5.10 `analyze_trove_run.py` | Task 9 |
+| §5.11 deviations.md rewrite | Task 10 |
+| §6 Telemetry fields | Task 6.7 |
+| §7 Implementation defaults | Tasks 4–6 |
+| §8 Smoke run + done criteria | Task 11 |
+
+All sections accounted for.
+
+### 2. Placeholder scan
+
+No `TBD`, `TODO`, `implement later`, "appropriate", "various", or "fill in details" in any task. All test code is fully written (not "write tests for the above"). All file paths are exact. All commit messages are pre-written.
+
+### 3. Type and signature consistency
+
+- `imported_callsites(solution_code, tools_code, candidate_names)` — defined in Task 2, called in Task 6.7 with matching kwargs.
+- `toolbox_to_openai_tools(toolbox, topk=10)` — defined in Task 4, called in Task 6.4.
+- `dispatch_tool_call(toolbox, tool_call) -> str` — defined in Task 4, called via the `on_tc` closure in Task 6.4.
+- `chat_with_tools(messages, tools, model, max_tokens, max_tool_iters, on_tool_call, tag)` — defined in Task 5, called in Task 6.4 with matching kwargs.
+- `build_import_with_tools_prompt(question, task_family)` — defined in Task 3, called in Task 6.4.
+- `build_import_prompt(question, toolbox_str, task_family)` — extended in Task 3, called in Task 6.3.
+- `parse_response(text, task_family)` — extended in Task 2, called in Tasks 6.3 and 6.4.
+- `TroVEController(__init__)` new params (`task_family`, `selection`, `max_tool_iters`, `tool_schema_topk`) — defined in Task 6.1, passed in Task 7.2 (only `task_family` and `selection` from CLI; the other two use defaults, which matches the spec's defaults table).
+
+All consistent.
+
+### 4. Plan quirks worth noting to the executor
+
+- Task 11.4 relies on `data/pbebench/lite_pilot_tasks.jsonl` containing the 50 PBEBench-Lite tasks. If that file is missing or has fewer tasks, swap to whichever PBEBench-Lite tasks file is available (the spec calls for "50 PBEBench-Lite tasks"; the exact filename is not load-bearing).
+- Task 11.5 `tee` output captures the report for the user-facing message in 11.6.
+- The `import_eligible` field in `_make_result` is computed *after* `_update_library` runs for the current task. The doc-comment in Task 6.7 explains the consequence; the analyzer in Task 9 doesn't depend on the pre-task value.
+- Task 6.5's `_select_best` change wraps the existing reward/consistency selectors. When `selection="consistency"` is set, the `reward_fn` and `entry` arguments are ignored — that is intentional and matches the user's choice to keep both flags as opt-ins.
+
+---
+
+## Execution Handoff
+
+Plan complete and saved to `docs/superpowers/plans/2026-04-25-trove-native-tool-calling.md`. Two execution options:
+
+**1. Subagent-Driven (recommended)** — I dispatch a fresh subagent per task, review between tasks, fast iteration.
+
+**2. Inline Execution** — Execute tasks in this session using executing-plans, batch execution with checkpoints.
+
+Which approach?

From a80fc2820a03364844c535a4aa06106046a915e2 Mon Sep 17 00:00:00 2001
From: mathuryash5 <mathuryash5@gmail.com>
Date: Sat, 25 Apr 2026 17:54:56 -0400
Subject: [PATCH 03/24] fix(trove): infra patches for native tool calling

- toolbox.trim default C=1.0 (matches original TroVE)
- executor DEFAULT_TIMEOUT=60s (PBEBench + multi-turn headroom)
- llm._call_openai falls back to message.reasoning_content when
  message.content is empty (gpt-oss Harmony channel split)

Made-with: Cursor
---
 symbolic_agent/baselines/trove/executor.py | 2 +-
 symbolic_agent/baselines/trove/llm.py      | 3 ++-
 symbolic_agent/baselines/trove/toolbox.py  | 4 ++--
 3 files changed, 5 insertions(+), 4 deletions(-)
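
Reviewer note (not part of the commit): a minimal sketch of the two behavioural changes in this patch, written to make the new defaults easy to sanity-check. `trim_threshold` and `extract_text` are hypothetical stand-ins, not functions in `toolbox.py` or `llm.py`. Returning 0.0 when `n_processed <= 1` is an assumption, since the body of that guard is not visible in the hunk below.

```python
import math
from types import SimpleNamespace


def trim_threshold(n_processed: int, C: float = 1.0) -> float:
    """Frequency cut-off C * log_20(n_processed), per the trim() docstring."""
    if n_processed <= 1:
        return 0.0  # assumption: nothing is trimmed before two tasks are seen
    return C * math.log(n_processed, 20)


def extract_text(msg) -> str:
    """Prefer message.content; fall back to reasoning_content; else ''."""
    return msg.content or getattr(msg, "reasoning_content", "") or ""


# Compare the old (C=0.5) and new (C=1.0) cut-offs at a few task counts.
for n in (20, 400):
    print(n, round(trim_threshold(n, C=0.5), 3), round(trim_threshold(n, C=1.0), 3))

# A Harmony-style message whose answer text landed in reasoning_content.
harmony_msg = SimpleNamespace(content=None, reasoning_content="print(answer)")
plain_msg = SimpleNamespace(content="hello")
print(repr(extract_text(harmony_msg)), repr(extract_text(plain_msg)))
```

At n = 20 the cut-off rises from 0.5 to 1.0, so the new default trims more aggressively for the same toolbox. The fallback only engages when `content` is empty or `None`, which keeps non-Harmony backends unaffected.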

diff --git a/symbolic_agent/baselines/trove/executor.py b/symbolic_agent/baselines/trove/executor.py
index cf23471b..1b8717e4 100644
--- a/symbolic_agent/baselines/trove/executor.py
+++ b/symbolic_agent/baselines/trove/executor.py
@@ -16,7 +16,7 @@
 
 logger = logging.getLogger(__name__)
 
-DEFAULT_TIMEOUT = 10  # seconds, matching TroVE's original
+DEFAULT_TIMEOUT = 60  # seconds — generous for PBEBench replace() chains and multi-turn dispatch
 
 
 def run_solution(
diff --git a/symbolic_agent/baselines/trove/llm.py b/symbolic_agent/baselines/trove/llm.py
index d27f8d28..ec98472f 100644
--- a/symbolic_agent/baselines/trove/llm.py
+++ b/symbolic_agent/baselines/trove/llm.py
@@ -189,7 +189,8 @@ def _call_openai(self, prompt: str, model: str, max_tokens: int, tag: str) -> st
                     messages=messages,
                     # No response_format — TroVE uses free-form text
                 )
-                raw = response.choices[0].message.content or ""
+                msg = response.choices[0].message
+                raw = msg.content or getattr(msg, "reasoning_content", "") or ""
                 u = getattr(response, "usage", None)
                 details = getattr(u, "completion_tokens_details", None)
                 usage = {
diff --git a/symbolic_agent/baselines/trove/toolbox.py b/symbolic_agent/baselines/trove/toolbox.py
index 9cae9532..617b66ae 100644
--- a/symbolic_agent/baselines/trove/toolbox.py
+++ b/symbolic_agent/baselines/trove/toolbox.py
@@ -114,7 +114,7 @@ def get_full_code(self) -> str:
     # Trimming
     # ------------------------------------------------------------------
 
-    def trim(self, n_processed: int, C: float = 0.5) -> set:
+    def trim(self, n_processed: int, C: float = 1.0) -> set:
         """
         Remove functions whose frequency is below the threshold
             C * log_{20}(n_processed)
@@ -122,7 +122,7 @@ def trim(self, n_processed: int, C: float = 0.5) -> set:
 
         Faithful to trim_library() in run_trove.py:
             threshold = math.log(n, 20)   # log base 20
-        C defaults to 0.5, matching the paper (§3.3): λ = ½ · log_{10}(n).
+        C defaults to 1.0, matching the original implementation (C·log_{20}(n)).
         Note: the original uses log base-20 not base-10; we keep base-20.
         """
         if n_processed <= 1:

From 91cd92f77b392b03378e05eb53789590a7d95bc6 Mon Sep 17 00:00:00 2001
From: mathuryash5 <mathuryash5@gmail.com>
Date: Sat, 25 Apr 2026 17:59:36 -0400
Subject: [PATCH 04/24] feat(trove): add imported_callsites helper and
 task_family to parse_response

- imported_callsites(solution, tools, names) -> set: AST-walks Solution
  code and returns names from the candidate set that are actually called.
  Handles bare Name and Attribute (toolbox.foo) callees.
- parse_response(text, task_family="default"): when task_family="pbebench"
  the parser does not fall back to the first python block when **Solution**
  is missing. Prevents CoT scratchpad from being promoted to the answer.

Made-with: Cursor
---
 symbolic_agent/baselines/trove/parse.py       | 56 ++++++++++++----
 .../baselines/trove/tests/__init__.py         |  0
 .../trove/tests/test_parse_callsites.py       | 65 +++++++++++++++++++
 3 files changed, 110 insertions(+), 11 deletions(-)
 create mode 100644 symbolic_agent/baselines/trove/tests/__init__.py
 create mode 100644 symbolic_agent/baselines/trove/tests/test_parse_callsites.py
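
Reviewer note (not part of the commit): the sketch below restates `imported_callsites` from the hunk that follows so it can run standalone, then probes two edge cases the bundled tests do not cover. It is meant as an illustration of what the `actually_called` telemetry will and will not count, not as an extra test file. The example inputs are made up.

```python
import ast


def imported_callsites(solution_code: str, tools_code: str, candidate_names: set) -> set:
    """Names from candidate_names that appear as callees in solution_code."""
    if not solution_code or not candidate_names:
        return set()
    try:
        tree = ast.parse(solution_code)
    except SyntaxError:
        return set()
    found = set()
    for node in ast.walk(tree):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        if isinstance(func, ast.Name) and func.id in candidate_names:
            found.add(func.id)
        elif isinstance(func, ast.Attribute) and func.attr in candidate_names:
            found.add(func.attr)
    return found


# Case 1: the helper is passed as a value, never called directly.
# Only `list` and `map` are callees here, so `f` is not reported.
passed_not_called = imported_callsites("xs = list(map(f, [1, 2]))", "", {"f"})
print(passed_not_called)

# Case 2: any receiver counts, because only the attribute name is checked.
# `other.f(...)` is reported even though `other` is not the toolbox.
any_receiver = imported_callsites("other.f(1)", "", {"f"})
print(any_receiver)
```

Both behaviours follow directly from matching on `ast.Call` nodes and on `func.attr` alone. They are probably acceptable for telemetry, but worth keeping in mind when reading per-function usage counts.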

diff --git a/symbolic_agent/baselines/trove/parse.py b/symbolic_agent/baselines/trove/parse.py
index 56a90cba..4a53a733 100644
--- a/symbolic_agent/baselines/trove/parse.py
+++ b/symbolic_agent/baselines/trove/parse.py
@@ -83,7 +83,7 @@ def _make_executable(code: str) -> str:
     return stripped
 
 
-def parse_response(text: str) -> dict:
+def parse_response(text: str, task_family: str = "default") -> dict:
     """
     Parse a TroVE-format LLM response.
 
@@ -95,20 +95,17 @@ def parse_response(text: str) -> dict:
         "functions":     list[dict],  # parsed tool dicts from the Tools block
     }
 
-    Fallback behaviour
-    ------------------
-    Tasks like PBEBench embed their own format instructions (e.g. "output a
-    **Program Sequence** block") that can override the TroVE **Solution**
-    header.  When no **Solution** block is found we grab the first ```python```
-    block in the response and, if it is a bare list/string literal, wrap it
-    in print() so it can be executed and its stdout captured as the answer.
+    task_family
+    -----------
+    "default": if no **Solution** block is found, falls back to the first
+    ```python``` block anywhere (legacy behaviour).
+    "pbebench": no fallback. Strict **Solution**-block-only parsing avoids
+    accidentally promoting CoT scratchpad to the answer.
     """
     solution_code = _extract_code_block(text, "Solution") or ""
     tools_code = _extract_code_block(text, "Tools") or ""
 
-    # Fallback: model followed the task's own format (e.g. **Program Sequence**)
-    # instead of the TroVE **Solution** header.
-    if not solution_code:
+    if not solution_code and task_family != "pbebench":
         raw = _extract_any_python_block(text)
         if raw:
             solution_code = _make_executable(raw)
@@ -267,3 +264,40 @@ def count_ast_nodes(code: str) -> int:
         return sum(1 for _ in ast.walk(tree))
     except SyntaxError:
         return 99_999
+
+
+def imported_callsites(
+    solution_code: str,
+    tools_code: str,
+    candidate_names: set,
+) -> set:
+    """
+    Return the subset of `candidate_names` that appear as call-sites in
+    `solution_code`. Used for the `actually_called` telemetry field.
+
+    Detects two callee shapes:
+      - bare Name:        find_replace_chain(...)
+      - Attribute(name):  toolbox.find_replace_chain(...)
+
+    `tools_code` is currently unused (kept in the signature so callers can
+    pass through the **Tools** block context if we later want to filter by
+    what was actually imported).
+
+    Returns an empty set on empty input or SyntaxError.
+    """
+    if not solution_code or not candidate_names:
+        return set()
+    try:
+        tree = ast.parse(solution_code)
+    except SyntaxError:
+        return set()
+    found: set = set()
+    for node in ast.walk(tree):
+        if not isinstance(node, ast.Call):
+            continue
+        func = node.func
+        if isinstance(func, ast.Name) and func.id in candidate_names:
+            found.add(func.id)
+        elif isinstance(func, ast.Attribute) and func.attr in candidate_names:
+            found.add(func.attr)
+    return found
diff --git a/symbolic_agent/baselines/trove/tests/__init__.py b/symbolic_agent/baselines/trove/tests/__init__.py
new file mode 100644
index 00000000..e69de29b
diff --git a/symbolic_agent/baselines/trove/tests/test_parse_callsites.py b/symbolic_agent/baselines/trove/tests/test_parse_callsites.py
new file mode 100644
index 00000000..3429061b
--- /dev/null
+++ b/symbolic_agent/baselines/trove/tests/test_parse_callsites.py
@@ -0,0 +1,65 @@
+"""Unit tests for parse.imported_callsites and parse_response(task_family=)."""
+
+from symbolic_agent.baselines.trove.parse import imported_callsites, parse_response
+
+
+# ---------------------------------------------------------------------------
+# imported_callsites
+# ---------------------------------------------------------------------------
+
+def test_callsites_bare_name():
+    code = "result = find_replace_chain(s, [('a', 'b')])\nprint(result)"
+    assert imported_callsites(code, tools_code="", candidate_names={"find_replace_chain", "other"}) == {"find_replace_chain"}
+
+
+def test_callsites_attribute_access():
+    code = "result = toolbox.find_replace_chain(s, pairs)\nprint(result)"
+    assert imported_callsites(code, tools_code="", candidate_names={"find_replace_chain"}) == {"find_replace_chain"}
+
+
+def test_callsites_no_match():
+    code = "print(s.replace('a', 'b'))"
+    assert imported_callsites(code, tools_code="", candidate_names={"find_replace_chain"}) == set()
+
+
+def test_callsites_multiple_calls_same_name_dedup():
+    code = "x = f(1)\ny = f(2)\nprint(x, y)"
+    assert imported_callsites(code, tools_code="", candidate_names={"f", "g"}) == {"f"}
+
+
+def test_callsites_syntax_error_returns_empty():
+    code = "this is not valid python ::"
+    assert imported_callsites(code, tools_code="", candidate_names={"f"}) == set()
+
+
+def test_callsites_empty_inputs():
+    assert imported_callsites("", "", set()) == set()
+    assert imported_callsites("print(1)", "", set()) == set()
+
+
+# ---------------------------------------------------------------------------
+# parse_response(task_family=)
+# ---------------------------------------------------------------------------
+
+def test_parse_response_pbebench_strict_no_solution_block():
+    text = "Here is some reasoning.\n```python\nprint('answer')\n```\n"
+    out = parse_response(text, task_family="pbebench")
+    assert out["solution_code"] == ""
+
+
+def test_parse_response_pbebench_with_solution_block():
+    text = "**Solution**\n```python\nprint('answer')\n```\n"
+    out = parse_response(text, task_family="pbebench")
+    assert out["solution_code"] == "print('answer')"
+
+
+def test_parse_response_default_falls_back_to_any_python_block():
+    text = "Here is some reasoning.\n```python\nprint('answer')\n```\n"
+    out = parse_response(text, task_family="default")
+    assert "print('answer')" in out["solution_code"]
+
+
+def test_parse_response_default_call_signature_unchanged():
+    text = "**Solution**\n```python\nprint('answer')\n```\n"
+    out = parse_response(text)
+    assert out["solution_code"] == "print('answer')"

From 7ffddbe3c9276fa192df4c4e9ccb7a8bb2e157bf Mon Sep 17 00:00:00 2001
From: mathuryash5 <mathuryash5@gmail.com>
Date: Sat, 25 Apr 2026 18:03:42 -0400
Subject: [PATCH 05/24] feat(trove): PBEBench-shaped few-shots and
 IMPORT-with-tools prompt

- Add task_family parameter to all build_* prompt builders.
- Add _CREATE_EXAMPLE_PBEBENCH and _SKIP_EXAMPLE_PBEBENCH demonstrating
  replace()-chain solutions and a find_replace_chain helper.
- Add build_import_with_tools_prompt for native tool calling: no
  **Toolbox** markdown block (toolbox is conveyed via tools=[...]).
- _FORMAT_OVERRIDE is empty for task_family="pbebench" (the example
  models the desired format directly).

Made-with: Cursor
---
 symbolic_agent/baselines/trove/prompts.py | 213 +++++++++++++++++++---
 1 file changed, 187 insertions(+), 26 deletions(-)
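
Reviewer note (not part of the commit): since the IMPORT-with-tools prompt no longer carries a **Toolbox** block, it may help to see roughly what a toolbox entry looks like once it travels via `tools=[...]`. The sketch below is a guess at the shape, not the project's `toolbox_to_openai_tools`, which is defined in a different patch. `function_to_tool` and the type mapping are hypothetical; only the outer `type` / `function` / `parameters` layout follows the OpenAI Chat Completions function-tool format.

```python
import inspect
import json


def find_replace_chain(s: str, pairs: list) -> str:
    """Apply a chain of (old, new) replacements to a string."""
    for old, new in pairs:
        s = s.replace(old, new)
    return s


# Hypothetical mapping from Python annotations to JSON Schema type names.
_JSON_TYPES = {str: "string", int: "integer", float: "number",
               bool: "boolean", list: "array", dict: "object"}


def function_to_tool(fn) -> dict:
    """Build one function-tool entry from a Python callable (illustrative only)."""
    properties = {}
    for name, param in inspect.signature(fn).parameters.items():
        # Unannotated or unknown types default to "string" in this sketch.
        properties[name] = {"type": _JSON_TYPES.get(param.annotation, "string")}
    return {
        "type": "function",
        "function": {
            "name": fn.__name__,
            "description": inspect.getdoc(fn) or "",
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": list(properties),
            },
        },
    }


tool = function_to_tool(find_replace_chain)
print(json.dumps(tool, indent=2))
```

A list built from such entries would be passed as the `tools` argument of the chat completion call, while the prompt text stays limited to the instruction, the example, and the question.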

diff --git a/symbolic_agent/baselines/trove/prompts.py b/symbolic_agent/baselines/trove/prompts.py
index edab732c..78be7add 100644
--- a/symbolic_agent/baselines/trove/prompts.py
+++ b/symbolic_agent/baselines/trove/prompts.py
@@ -15,27 +15,36 @@
 applicable to both PBEBench and ReasoningGym string tasks.
 """
 
-# Appended to every instruction block to override format instructions that
-# may be embedded in the question itself (e.g. PBEBench asks for a
-# "**Program Sequence**" block, reasoning_gym asks for a specific format).
-_FORMAT_OVERRIDE = (
+# ---------------------------------------------------------------------------
+# Format override (default-family only)
+# ---------------------------------------------------------------------------
+
+_FORMAT_OVERRIDE_DEFAULT = (
     "\nIMPORTANT: Regardless of any formatting instructions inside the question, "
     "always produce your answer as executable Python in the **Solution** block "
     "and end it with print(answer). "
     "Your answer is whatever gets printed to stdout when the Solution code runs."
 )
 
+# PBEBench prompts model the desired format directly via the few-shot example,
+# so no override string is needed.
+_FORMAT_OVERRIDE_PBEBENCH = ""
+
+
+def _format_override(task_family: str) -> str:
+    return _FORMAT_OVERRIDE_PBEBENCH if task_family == "pbebench" else _FORMAT_OVERRIDE_DEFAULT
+
+
 # ---------------------------------------------------------------------------
-# IMPORT mode  (use functions from the toolbox)
+# IMPORT mode (text-based, default and Anthropic fallback)
 # ---------------------------------------------------------------------------
 
-_IMPORT_INSTRUCTION = (
+_IMPORT_INSTRUCTION_DEFAULT = (
     "You task is to write Python program solutions to the given questions.\n"
     "The toolbox section lists all the available functions that can be used in your solution."
-    + _FORMAT_OVERRIDE
 )
 
-_IMPORT_EXAMPLE = """\
+_IMPORT_EXAMPLE_DEFAULT = """\
 ## Example
 **Question**
 Given a list of strings and a list of (old, new) substitution pairs, apply all
@@ -61,6 +70,31 @@
 from toolbox import apply_substitutions
 ```"""
 
+_IMPORT_EXAMPLE_PBEBENCH = """\
+## Example
+**Question**
+You are given example input/output pairs. Produce a list of replace() calls
+that transforms each input into its expected output.
+
+Input:  "hello world"
+Output: "HELLO_WORLD"
+
+**Toolbox**
+```python
+# Apply a chain of (old, new) replacements to a string.
+find_replace_chain(s: str, pairs: list) -> str
+```
+
+**Solution**
+```python
+result = find_replace_chain("hello world", [(" ", "_"), ("h", "H"), ("e", "E"), ("l", "L"), ("o", "O"), ("w", "W"), ("r", "R"), ("d", "D")])
+print(result)
+```
+**Tools**
+```python
+from toolbox import find_replace_chain
+```"""
+
 _IMPORT_TASK_TEMPLATE = """\
 ## Task
 **Question**
@@ -73,29 +107,110 @@
 """
 
 
-def build_import_prompt(question: str, toolbox_str: str) -> str:
-    """Build the IMPORT-mode prompt for a single task."""
+def build_import_prompt(question: str, toolbox_str: str, task_family: str = "default") -> str:
+    """Build the text-based IMPORT-mode prompt (used for Anthropic and as fallback)."""
+    instruction = _IMPORT_INSTRUCTION_DEFAULT + _format_override(task_family)
+    example = _IMPORT_EXAMPLE_PBEBENCH if task_family == "pbebench" else _IMPORT_EXAMPLE_DEFAULT
     return (
-        _IMPORT_INSTRUCTION
+        instruction
         + "\n\n\n"
-        + _IMPORT_EXAMPLE
+        + example
         + "\n\n\n"
         + _IMPORT_TASK_TEMPLATE.format(question=question, toolbox=toolbox_str)
     )
 
 
 # ---------------------------------------------------------------------------
-# CREATE mode  (create new reusable functions)
+# IMPORT-with-tools mode (native OpenAI tool calling; no **Toolbox** block)
 # ---------------------------------------------------------------------------
 
-_CREATE_INSTRUCTION = (
+_IMPORT_WITH_TOOLS_INSTRUCTION_DEFAULT = (
+    "Your task is to write Python program solutions to the given questions.\n"
+    "You have a set of helper functions available as tools. Call any of them "
+    "when they help you solve the question; otherwise solve directly. After "
+    "you have computed the answer, output it as executable Python in a "
+    "**Solution** block and end with print(answer)."
+)
+
+_IMPORT_WITH_TOOLS_INSTRUCTION_PBEBENCH = (
+    "Your task is to produce a list of replace() calls that transforms each "
+    "input into its expected output for a Programming-by-Example task.\n"
+    "You have a set of helper functions available as tools. Call any of them "
+    "to test ideas or compute intermediate results; the final answer must be "
+    "produced as a Python program in the **Solution** block."
+)
+
+_IMPORT_WITH_TOOLS_EXAMPLE_DEFAULT = """\
+## Example
+**Question**
+Apply substitutions [("a","o"),("t","p")] to ["cat","bat"] and return the list.
+
+(After optionally calling `apply_substitutions` as a tool to confirm,
+the assistant produces:)
+
+**Solution**
+```python
+strings = ["cat", "bat"]
+subs = [("a", "o"), ("t", "p")]
+result = apply_substitutions(strings, subs)
+print(result)
+```"""
+
+_IMPORT_WITH_TOOLS_EXAMPLE_PBEBENCH = """\
+## Example
+**Question**
+Produce a sequence of replace() calls that transforms "hello world" into
+"HELLO_WORLD".
+
+(After optionally calling `find_replace_chain` as a tool to verify a
+candidate sequence, the assistant produces:)
+
+**Solution**
+```python
+result = find_replace_chain("hello world", [(" ", "_"), ("h", "H"), ("e", "E"), ("l", "L"), ("o", "O"), ("w", "W"), ("r", "R"), ("d", "D")])
+print(result)
+```"""
+
+_IMPORT_WITH_TOOLS_TASK_TEMPLATE = """\
+## Task
+**Question**
+{question}
+
+**Solution**
+"""
+
+
+def build_import_with_tools_prompt(question: str, task_family: str = "default") -> str:
+    """
+    Build the IMPORT-with-tools prompt. The toolbox is NOT shown as text — it
+    is conveyed via the OpenAI tools=[...] parameter on the chat completion call.
+    """
+    if task_family == "pbebench":
+        instruction = _IMPORT_WITH_TOOLS_INSTRUCTION_PBEBENCH
+        example = _IMPORT_WITH_TOOLS_EXAMPLE_PBEBENCH
+    else:
+        instruction = _IMPORT_WITH_TOOLS_INSTRUCTION_DEFAULT
+        example = _IMPORT_WITH_TOOLS_EXAMPLE_DEFAULT
+    return (
+        instruction
+        + "\n\n\n"
+        + example
+        + "\n\n\n"
+        + _IMPORT_WITH_TOOLS_TASK_TEMPLATE.format(question=question)
+    )
+
+
+# ---------------------------------------------------------------------------
+# CREATE mode
+# ---------------------------------------------------------------------------
+
+_CREATE_INSTRUCTION_DEFAULT = (
     "You task is to write Python program solutions to the given questions.\n"
     "You should also create Python functions that can be used by your solution, "
     "if you believe the function can be reused to solve other questions."
-    + _FORMAT_OVERRIDE
 )
 
-_CREATE_EXAMPLE = """\
+_CREATE_EXAMPLE_DEFAULT = """\
 ## Example
 **Question**
 Given a list of strings and a list of (old, new) substitution pairs, apply all
@@ -122,6 +237,26 @@ def apply_substitutions(strings, substitutions):
     return out
 ```"""
 
+_CREATE_EXAMPLE_PBEBENCH = """\
+## Example
+**Question**
+Produce a sequence of replace() calls that transforms "hello world" into
+"HELLO_WORLD".
+
+**Solution**
+```python
+result = find_replace_chain("hello world", [(" ", "_"), ("h", "H"), ("e", "E"), ("l", "L"), ("o", "O"), ("w", "W"), ("r", "R"), ("d", "D")])
+print(result)
+```
+**Tools**
+```python
+def find_replace_chain(s, pairs):
+    \"\"\"Apply a chain of (old, new) replacements to a string.\"\"\"
+    for old, new in pairs:
+        s = s.replace(old, new)
+    return s
+```"""
+
 _CREATE_TASK_TEMPLATE = """\
 ## Task
 **Question**
@@ -131,27 +266,28 @@ def apply_substitutions(strings, substitutions):
 """
 
 
-def build_create_prompt(question: str) -> str:
+def build_create_prompt(question: str, task_family: str = "default") -> str:
     """Build the CREATE-mode prompt for a single task."""
+    instruction = _CREATE_INSTRUCTION_DEFAULT + _format_override(task_family)
+    example = _CREATE_EXAMPLE_PBEBENCH if task_family == "pbebench" else _CREATE_EXAMPLE_DEFAULT
     return (
-        _CREATE_INSTRUCTION
+        instruction
         + "\n\n\n"
-        + _CREATE_EXAMPLE
+        + example
         + "\n\n\n"
         + _CREATE_TASK_TEMPLATE.format(question=question)
     )
 
 
 # ---------------------------------------------------------------------------
-# SKIP mode  (inline solution, no new functions)
+# SKIP mode
 # ---------------------------------------------------------------------------
 
-_SKIP_INSTRUCTION = (
+_SKIP_INSTRUCTION_DEFAULT = (
     "You task is to write Python program solutions to the given questions."
-    + _FORMAT_OVERRIDE
 )
 
-_SKIP_EXAMPLE = """\
+_SKIP_EXAMPLE_DEFAULT = """\
 ## Example
 **Question**
 Given the list of strings ["Hello", "World"], convert each to lowercase and
@@ -167,6 +303,29 @@ def build_create_prompt(question: str) -> str:
 ```python
 ```"""
 
+_SKIP_EXAMPLE_PBEBENCH = """\
+## Example
+**Question**
+Produce a sequence of replace() calls that transforms "hello world" into
+"HELLO_WORLD".
+
+**Solution**
+```python
+s = "hello world"
+s = s.replace(" ", "_")
+s = s.replace("h", "H")
+s = s.replace("e", "E")
+s = s.replace("l", "L")
+s = s.replace("o", "O")
+s = s.replace("w", "W")
+s = s.replace("r", "R")
+s = s.replace("d", "D")
+print(s)
+```
+**Tools**
+```python
+```"""
+
 _SKIP_TASK_TEMPLATE = """\
 ## Task
 **Question**
@@ -176,12 +335,14 @@ def build_create_prompt(question: str) -> str:
 """
 
 
-def build_skip_prompt(question: str) -> str:
+def build_skip_prompt(question: str, task_family: str = "default") -> str:
     """Build the SKIP-mode prompt for a single task."""
+    instruction = _SKIP_INSTRUCTION_DEFAULT + _format_override(task_family)
+    example = _SKIP_EXAMPLE_PBEBENCH if task_family == "pbebench" else _SKIP_EXAMPLE_DEFAULT
     return (
-        _SKIP_INSTRUCTION
+        instruction
         + "\n\n\n"
-        + _SKIP_EXAMPLE
+        + example
         + "\n\n\n"
         + _SKIP_TASK_TEMPLATE.format(question=question)
     )

From 5cd4fd33529317d434671a131f2d1c24c2aa04e3 Mon Sep 17 00:00:00 2001
From: mathuryash5 <mathuryash5@gmail.com>
Date: Sat, 25 Apr 2026 18:11:27 -0400
Subject: [PATCH 06/24] feat(trove): add tools_api for native OpenAI tool
 calling

- toolbox_to_openai_tools(toolbox, topk=10): converts top-k toolbox
  functions into OpenAI Chat Completions tool schemas. Infers parameter
  types from inspect.signature; functions with *args/**kwargs are
  silently excluded.
- dispatch_tool_call(toolbox, tool_call): runs the requested function
  in the sandbox executor, returns stdout truncated to 4096 chars or
  a JSON error string. Sanitizes Harmony control-token contamination
  in tool names (defensive vs. open vLLM PR #35906).
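
For reviewers, a minimal sketch of the schema shape and the name sanitizer
described above. It is a simplified illustration rather than the module's
code: sketch_schema and sanitize_name are hypothetical names, and
find_replace_chain mirrors the unit-test fixture.

```python
import inspect

# Same annotation-to-JSON-Schema mapping idea as tools_api._TYPE_MAP.
TYPE_MAP = {
    int: "integer", float: "number", bool: "boolean", str: "string",
    list: "array", tuple: "array", dict: "object",
}


def find_replace_chain(s: str, pairs: list) -> str:
    """Apply a chain of (old, new) replacements to a string."""
    for old, new in pairs:
        s = s.replace(old, new)
    return s


def sketch_schema(fn):
    """Hypothetical, simplified stand-in for the per-function schema builder."""
    properties, required = {}, []
    for pname, param in inspect.signature(fn).parameters.items():
        if param.kind in (
            inspect.Parameter.VAR_POSITIONAL,
            inspect.Parameter.VAR_KEYWORD,
        ):
            return None  # *args / **kwargs: no meaningful schema
        # Unannotated or unrecognised annotations fall back to "string".
        properties[pname] = {"type": TYPE_MAP.get(param.annotation, "string")}
        if param.default is inspect.Parameter.empty:
            required.append(pname)  # only parameters without defaults
    return {
        "type": "function",
        "function": {
            "name": fn.__name__,
            "description": inspect.getdoc(fn) or "",
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": required,
            },
        },
    }


def sanitize_name(name: str) -> str:
    """Drop everything from the first Harmony control token onward."""
    return name.split("<|", 1)[0].strip()


schema = sketch_schema(find_replace_chain)
clean = sanitize_name("reverse_str<|channel|>commentary")
```

The sketch skips the exec of toolbox source and the top-k frequency sort;
it only shows what a single tool entry looks like once a callable is in hand.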

Made-with: Cursor
---
 .../baselines/trove/tests/test_tools_api.py   | 163 +++++++++++++++++
 symbolic_agent/baselines/trove/tools_api.py   | 170 ++++++++++++++++++
 2 files changed, 333 insertions(+)
 create mode 100644 symbolic_agent/baselines/trove/tests/test_tools_api.py
 create mode 100644 symbolic_agent/baselines/trove/tools_api.py

diff --git a/symbolic_agent/baselines/trove/tests/test_tools_api.py b/symbolic_agent/baselines/trove/tests/test_tools_api.py
new file mode 100644
index 00000000..8fc9d671
--- /dev/null
+++ b/symbolic_agent/baselines/trove/tests/test_tools_api.py
@@ -0,0 +1,163 @@
+"""Unit tests for tools_api.toolbox_to_openai_tools and dispatch_tool_call."""
+
+import json
+from types import SimpleNamespace
+
+from symbolic_agent.baselines.trove.toolbox import TroVEToolbox
+from symbolic_agent.baselines.trove.tools_api import (
+    dispatch_tool_call,
+    toolbox_to_openai_tools,
+)
+
+
+def _make_toolbox_with(func_src: str, name: str, docstr: str = "") -> TroVEToolbox:
+    tb = TroVEToolbox()
+    tb.add(
+        {
+            "name": name,
+            "docstr": docstr,
+            "signature": f"def {name}(...)",
+            "function": func_src,
+            "type": "function",
+        },
+        example_idx=0,
+    )
+    return tb
+
+
+def _tool_call(name: str, args: dict, call_id: str = "call_1"):
+    return SimpleNamespace(
+        id=call_id,
+        function=SimpleNamespace(name=name, arguments=json.dumps(args)),
+    )
+
+
+# ---------------------------------------------------------------------------
+# toolbox_to_openai_tools
+# ---------------------------------------------------------------------------
+
+def test_schema_basic_function():
+    src = (
+        "def find_replace_chain(s: str, pairs: list) -> str:\n"
+        '    """Apply a chain of (old, new) replacements to a string."""\n'
+        "    for old, new in pairs:\n"
+        "        s = s.replace(old, new)\n"
+        "    return s\n"
+    )
+    tb = _make_toolbox_with(src, "find_replace_chain", docstr="Apply a chain of (old, new) replacements to a string.")
+    tools = toolbox_to_openai_tools(tb, topk=10)
+    assert len(tools) == 1
+    fn = tools[0]
+    assert fn["type"] == "function"
+    assert fn["function"]["name"] == "find_replace_chain"
+    assert fn["function"]["description"] == "Apply a chain of (old, new) replacements to a string."
+    params = fn["function"]["parameters"]
+    assert params["type"] == "object"
+    assert set(params["properties"].keys()) == {"s", "pairs"}
+    assert params["properties"]["s"]["type"] == "string"
+    assert params["properties"]["pairs"]["type"] == "array"
+    assert set(params["required"]) == {"s", "pairs"}
+
+
+def test_schema_unannotated_falls_back_to_string():
+    src = (
+        "def f(x):\n"
+        "    return x\n"
+    )
+    tb = _make_toolbox_with(src, "f")
+    tools = toolbox_to_openai_tools(tb, topk=10)
+    assert tools[0]["function"]["parameters"]["properties"]["x"]["type"] == "string"
+
+
+def test_schema_skips_varargs_kwargs():
+    src = (
+        "def f(*args, **kwargs):\n"
+        "    return args\n"
+    )
+    tb = _make_toolbox_with(src, "f")
+    tools = toolbox_to_openai_tools(tb, topk=10)
+    assert tools == []
+
+
+def test_schema_required_excludes_defaults():
+    src = (
+        "def f(x: int, y: int = 5):\n"
+        "    return x + y\n"
+    )
+    tb = _make_toolbox_with(src, "f")
+    tools = toolbox_to_openai_tools(tb, topk=10)
+    params = tools[0]["function"]["parameters"]
+    assert params["required"] == ["x"]
+    assert params["properties"]["y"]["type"] == "integer"
+
+
+def test_schema_topk_respects_frequency():
+    tb = TroVEToolbox()
+    for n, freq in [("a", 3), ("b", 2), ("c", 1)]:
+        tb.add(
+            {
+                "name": n,
+                "docstr": "",
+                "signature": f"def {n}()",
+                "function": f"def {n}():\n    return 0\n",
+                "type": "function",
+            },
+            example_idx=0,
+        )
+        for _ in range(freq - 1):
+            tb.update_frequency(n, example_idx=0)
+    tools = toolbox_to_openai_tools(tb, topk=2)
+    assert [t["function"]["name"] for t in tools] == ["a", "b"]
+
+
+def test_schema_empty_toolbox():
+    assert toolbox_to_openai_tools(TroVEToolbox(), topk=10) == []
+
+
+# ---------------------------------------------------------------------------
+# dispatch_tool_call
+# ---------------------------------------------------------------------------
+
+def test_dispatch_runs_function_and_returns_stdout():
+    src = (
+        "def reverse_str(s):\n"
+        "    return s[::-1]\n"
+    )
+    tb = _make_toolbox_with(src, "reverse_str")
+    result = dispatch_tool_call(tb, _tool_call("reverse_str", {"s": "hello"}))
+    assert "olleh" in result
+
+
+def test_dispatch_unknown_tool_returns_error():
+    tb = TroVEToolbox()
+    result = dispatch_tool_call(tb, _tool_call("nonexistent", {}))
+    assert "not in toolbox" in result
+
+
+def test_dispatch_bad_json_returns_error():
+    src = "def f(x):\n    return x\n"
+    tb = _make_toolbox_with(src, "f")
+    bad = SimpleNamespace(
+        id="x",
+        function=SimpleNamespace(name="f", arguments="{not json"),
+    )
+    result = dispatch_tool_call(tb, bad)
+    assert "argument JSON parse failed" in result
+
+
+def test_dispatch_sanitizes_harmony_contamination():
+    src = "def reverse_str(s):\n    return s[::-1]\n"
+    tb = _make_toolbox_with(src, "reverse_str")
+    tc = _tool_call("reverse_str<|channel|>commentary", {"s": "abc"})
+    result = dispatch_tool_call(tb, tc)
+    assert "cba" in result
+
+
+def test_dispatch_truncates_long_output():
+    src = (
+        "def long_output(n):\n"
+        "    return 'x' * n\n"
+    )
+    tb = _make_toolbox_with(src, "long_output")
+    result = dispatch_tool_call(tb, _tool_call("long_output", {"n": 10000}))
+    assert len(result) <= 4096 + 100  # +slack for repr quotes and truncation marker
diff --git a/symbolic_agent/baselines/trove/tools_api.py b/symbolic_agent/baselines/trove/tools_api.py
new file mode 100644
index 00000000..fc093d5f
--- /dev/null
+++ b/symbolic_agent/baselines/trove/tools_api.py
@@ -0,0 +1,170 @@
+"""Translate the TroVE toolbox into OpenAI Chat Completions tool schemas
+and dispatch tool calls back through the executor.
+
+This module is the bridge between TroVE's in-memory toolbox and vLLM's
+native tool-calling protocol. It is invoked only from the IMPORT-with-tools
+controller branch.
+"""
+
+from __future__ import annotations
+
+import inspect
+import json
+import logging
+from typing import Any
+
+from .executor import run_solution
+from .toolbox import TroVEToolbox
+
+logger = logging.getLogger(__name__)
+
+_MAX_RESULT_CHARS = 4096
+
+# Type inference: Python annotation -> JSON Schema type.
+_TYPE_MAP = {
+    int: "integer",
+    float: "number",
+    bool: "boolean",
+    str: "string",
+    list: "array",
+    tuple: "array",
+    dict: "object",
+}
+
+
+def _infer_type(annotation: Any) -> str:
+    if annotation is inspect.Parameter.empty:
+        return "string"
+    # Plain types (int, str, etc.)
+    if annotation in _TYPE_MAP:
+        return _TYPE_MAP[annotation]
+    # typing.List, typing.Dict, etc. — fall through to string if unrecognised.
+    origin = getattr(annotation, "__origin__", None)
+    if origin in _TYPE_MAP:
+        return _TYPE_MAP[origin]
+    return "string"
+
+
+def _function_to_schema(name: str, fn: Any, docstr: str) -> dict | None:
+    """
+    Build one OpenAI tool dict from a callable. Returns None if the signature
+    cannot be introspected or the function takes *args / **kwargs.
+    """
+    try:
+        sig = inspect.signature(fn)
+    except (TypeError, ValueError) as exc:
+        logger.debug("Could not introspect %s: %s", name, exc)
+        return None
+
+    properties: dict = {}
+    required: list = []
+
+    for pname, param in sig.parameters.items():
+        if param.kind in (
+            inspect.Parameter.VAR_POSITIONAL,
+            inspect.Parameter.VAR_KEYWORD,
+        ):
+            logger.debug("Skipping %s — has *args/**kwargs", name)
+            return None
+        prop: dict = {"type": _infer_type(param.annotation)}
+        if param.default is not inspect.Parameter.empty:
+            if isinstance(param.default, (int, float, bool, str)):
+                prop["default"] = param.default
+        else:
+            required.append(pname)
+        properties[pname] = prop
+
+    return {
+        "type": "function",
+        "function": {
+            "name": name,
+            "description": docstr or "",
+            "parameters": {
+                "type": "object",
+                "properties": properties,
+                "required": required,
+            },
+        },
+    }
+
+
+def toolbox_to_openai_tools(toolbox: TroVEToolbox, topk: int = 10) -> list:
+    """
+    Convert the top-k toolbox functions (by frequency) into OpenAI Chat
+    Completions tool dicts.
+
+    Functions with *args / **kwargs are silently excluded.
+    Returns [] when the toolbox is empty or its source fails to exec.
+    """
+    entries = toolbox.snapshot()
+    if not entries:
+        return []
+    entries.sort(key=lambda e: -int(e.get("frequency", 0)))
+    selected = entries[:topk]
+
+    namespace: dict = {}
+    try:
+        # compile(..., dont_inherit=True) so this module's `from __future__ import
+        # annotations` is not applied to the toolbox source; we need real types in
+        # `__annotations__` for inspect.signature() / _infer_type.
+        _code = compile(
+            toolbox.get_full_code(), "<trove-toolbox>", "exec", dont_inherit=True
+        )
+        exec(_code, namespace)
+    except Exception as exc:
+        logger.warning("Could not exec toolbox source for schema generation: %s", exc)
+        return []
+
+    tools: list = []
+    for entry in selected:
+        name = entry.get("name", "")
+        if not name or name not in namespace:
+            continue
+        fn = namespace[name]
+        schema = _function_to_schema(name, fn, entry.get("docstr", ""))
+        if schema is not None:
+            tools.append(schema)
+    return tools
+
+
+def _sanitize_name(name: str) -> str:
+    """Defensive workaround for vLLM PR #35906 (Harmony control tokens
+    leaking into tool names like `reverse_str<|channel|>commentary`)."""
+    return name.split("<|", 1)[0].strip()
+
+
+def _truncate(s: str, limit: int = _MAX_RESULT_CHARS) -> str:
+    if len(s) <= limit:
+        return s
+    return s[:limit] + f"\n... [truncated {len(s) - limit} chars]"
+
+
+def dispatch_tool_call(toolbox: TroVEToolbox, tool_call) -> str:
+    """
+    Resolve `tool_call` against the toolbox, run it via the sandbox executor,
+    and return the captured stdout (truncated to 4096 chars) or an error
+    message string. Always returns a string — never raises.
+    """
+    name = _sanitize_name(getattr(tool_call.function, "name", "") or "")
+    if not name:
+        return json.dumps({"error": "tool_call has no function name"})
+    if name not in {e["name"] for e in toolbox.snapshot()}:
+        return json.dumps({"error": f"tool '{name}' not in toolbox"})
+
+    raw_args = getattr(tool_call.function, "arguments", "") or "{}"
+    try:
+        args = json.loads(raw_args)
+        if not isinstance(args, dict):
+            return json.dumps({"error": f"argument JSON parse failed: expected object, got {type(args).__name__}"})
+    except json.JSONDecodeError as exc:
+        return json.dumps({"error": f"argument JSON parse failed: {exc}"})
+
+    call_expr = f"print(repr({name}(**{args!r})))"
+    is_ok, output = run_solution(
+        solution_code=call_expr,
+        tools_code="",
+        toolbox_code=toolbox.get_full_code(),
+    )
+    if not is_ok:
+        return json.dumps({"error": "execution failed", "stderr": _truncate(output)})
+    return _truncate(output)

From 06116b12e33c3ec80f5fa2268985c0b98793705f Mon Sep 17 00:00:00 2001
From: mathuryash5 <mathuryash5@gmail.com>
Date: Sat, 25 Apr 2026 18:15:07 -0400
Subject: [PATCH 07/24] fix(trove): correct misleading 'stderr' key in
 tools_api error payload

executor.run_solution returns proc.stdout.strip(), not stderr. Rename the
JSON error key from 'stderr' to 'stdout' so the field name matches what is
actually being returned. Caught in code-quality review for Task 4.

Made-with: Cursor
---
 symbolic_agent/baselines/trove/tools_api.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/symbolic_agent/baselines/trove/tools_api.py b/symbolic_agent/baselines/trove/tools_api.py
index fc093d5f..c0edc151 100644
--- a/symbolic_agent/baselines/trove/tools_api.py
+++ b/symbolic_agent/baselines/trove/tools_api.py
@@ -166,5 +166,5 @@ def dispatch_tool_call(toolbox: TroVEToolbox, tool_call) -> str:
         toolbox_code=toolbox.get_full_code(),
     )
     if not is_ok:
-        return json.dumps({"error": "execution failed", "stderr": _truncate(output)})
+        return json.dumps({"error": "execution failed", "stdout": _truncate(output)})
     return _truncate(output)

From 6ee331cb02e90d4ef7dc399b8bc1095a522401d5 Mon Sep 17 00:00:00 2001
From: mathuryash5 <mathuryash5@gmail.com>
Date: Sat, 25 Apr 2026 18:17:02 -0400
Subject: [PATCH 08/24] feat(trove): add TroVELLMClient.chat_with_tools for
 native tool calls

Multi-turn loop that handles tool_calls returned by gpt-oss/vLLM:
appends assistant message + tool result messages until the model returns
no tool_calls or max_tool_iters is reached. Records each call as
{name, args_preview, result_preview, ok} for downstream telemetry.
Reuses the existing 3-attempt retry, debug logging, and token accounting.

Anthropic backend raises NotImplementedError as a defensive guard;
controllers branch on self.backend == "openai" before calling.
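
A rough sketch of the loop shape, assuming a scripted fake_model in place of
the real chat.completions.create call. fake_model, run_loop, and the lambda
callback are hypothetical names for illustration; retries, token accounting,
and debug logging are left out.

```python
import json
from types import SimpleNamespace


def fake_model(convo):
    """Hypothetical stand-in for the model: requests one tool call, then
    answers in plain text once a tool result is in the conversation."""
    if any(m["role"] == "tool" for m in convo):
        return SimpleNamespace(content="final answer", tool_calls=[])
    call = SimpleNamespace(
        id="call_1",
        function=SimpleNamespace(
            name="reverse_str", arguments=json.dumps({"s": "abc"})
        ),
    )
    return SimpleNamespace(content="", tool_calls=[call])


def run_loop(messages, on_tool_call, max_tool_iters=8):
    """Simplified tool loop: stop on a reply without tool calls or at the cap."""
    convo, recorded = list(messages), []
    for _ in range(max_tool_iters):
        msg = fake_model(convo)
        if not msg.tool_calls:
            return {
                "final_text": msg.content,
                "tool_calls": recorded,
                "stopped_reason": "no_tool_calls",
            }
        # Echo the assistant turn so each tool result can reference its call id.
        convo.append({
            "role": "assistant",
            "content": msg.content,
            "tool_calls": [
                {
                    "id": tc.id,
                    "type": "function",
                    "function": {
                        "name": tc.function.name,
                        "arguments": tc.function.arguments,
                    },
                }
                for tc in msg.tool_calls
            ],
        })
        for tc in msg.tool_calls:
            result = on_tool_call(tc)
            recorded.append(
                {"name": tc.function.name, "result_preview": result[:200]}
            )
            convo.append(
                {"role": "tool", "tool_call_id": tc.id, "content": result}
            )
    return {"final_text": "", "tool_calls": recorded, "stopped_reason": "max_iters"}


out = run_loop(
    [{"role": "user", "content": "Reverse 'abc'."}],
    on_tool_call=lambda tc: repr(json.loads(tc.function.arguments)["s"][::-1]),
)
```

In the sketch the cap returns an empty final_text, whereas the method itself
keeps the last assistant content when it stops on max_iters.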

Made-with: Cursor
---
 symbolic_agent/baselines/trove/llm.py | 170 +++++++++++++++++++++++++-
 1 file changed, 169 insertions(+), 1 deletion(-)

diff --git a/symbolic_agent/baselines/trove/llm.py b/symbolic_agent/baselines/trove/llm.py
index ec98472f..dda158eb 100644
--- a/symbolic_agent/baselines/trove/llm.py
+++ b/symbolic_agent/baselines/trove/llm.py
@@ -16,7 +16,7 @@
 import os
 import time
 from datetime import datetime, timezone
-from typing import Dict, List, Optional
+from typing import Any, Callable, Dict, List, Optional
 
 logger = logging.getLogger(__name__)
 
@@ -219,6 +219,174 @@ def _call_openai(self, prompt: str, model: str, max_tokens: int, tag: str) -> st
         logger.warning("All OpenAI retries exhausted (tag=%s): %s", tag, last_exc)
         return ""
 
+    # ------------------------------------------------------------------
+    # Native tool calling (OpenAI/vLLM only)
+    # ------------------------------------------------------------------
+
+    def chat_with_tools(
+        self,
+        messages: List[Dict[str, Any]],
+        tools: List[Dict[str, Any]],
+        model: str,
+        max_tokens: int = DEFAULT_MAX_TOKENS,
+        max_tool_iters: int = 8,
+        on_tool_call: Optional[Callable[[Any], str]] = None,
+        tag: str = "",
+    ) -> Dict[str, Any]:
+        """
+        Multi-turn chat completion that supports native OpenAI tool calls.
+
+        Returns
+        -------
+        {
+            "final_text":     str,         # message.content (or reasoning_content fallback)
+            "tool_calls":     list[dict],  # ordered, each {name, args_preview, result_preview, ok}
+            "iterations":     int,         # number of round-trips actually used
+            "stopped_reason": str,         # "no_tool_calls" | "max_iters" | "error"
+        }
+
+        The caller is responsible for providing `on_tool_call(tc) -> str`,
+        which is invoked for every tool_call returned by the model. The
+        return value (already a string) is sent back as the tool message.
+
+        Anthropic backend is not supported — this method exists for the
+        OpenAI/vLLM tool-calling flow only. It raises NotImplementedError
+        on Anthropic as a defensive guard; controllers must check
+        `self.backend == "openai"` before calling.
+        """
+        if self.backend != "openai":
+            raise NotImplementedError("chat_with_tools requires the openai backend")
+
+        if on_tool_call is None:
+            raise ValueError("chat_with_tools requires an on_tool_call callback")
+
+        recorded_calls: List[Dict[str, Any]] = []
+        convo: List[Dict[str, Any]] = list(messages)
+        iterations = 0
+        final_text = ""
+        stopped_reason = "no_tool_calls"
+
+        for it in range(max_tool_iters + 1):
+            iterations = it + 1
+            iter_tag = f"{tag}_iter{it}" if tag else f"iter{it}"
+            response = None
+            last_exc = None
+
+            for attempt in range(3):
+                try:
+                    response = self._client.chat.completions.create(
+                        model=model,
+                        max_tokens=max_tokens,
+                        messages=convo,
+                        tools=tools,
+                        tool_choice="auto",
+                    )
+                    break
+                except Exception as exc:
+                    last_exc = exc
+                    if getattr(exc, "status_code", None) == 400:
+                        logger.warning(
+                            "OpenAI chat_with_tools 400 (tag=%s): %s", iter_tag, exc
+                        )
+                        self._record(iter_tag, model, json.dumps(convo)[:2000], "", max_tokens, {})
+                        return {
+                            "final_text": "",
+                            "tool_calls": recorded_calls,
+                            "iterations": iterations,
+                            "stopped_reason": "error",
+                        }
+                    if attempt < 2:
+                        wait = 5 * (2 ** attempt)
+                        logger.warning(
+                            "chat_with_tools failed (attempt %d/3, tag=%s): %s. Retrying in %ds.",
+                            attempt + 1, iter_tag, exc, wait,
+                        )
+                        time.sleep(wait)
+
+            if response is None:
+                logger.warning("All chat_with_tools retries exhausted (tag=%s): %s", iter_tag, last_exc)
+                stopped_reason = "error"
+                break
+
+            msg = response.choices[0].message
+            content = msg.content or getattr(msg, "reasoning_content", "") or ""
+            tool_calls = getattr(msg, "tool_calls", None) or []
+
+            u = getattr(response, "usage", None)
+            details = getattr(u, "completion_tokens_details", None)
+            usage = {
+                "input_tokens": getattr(u, "prompt_tokens", 0) or 0,
+                "output_tokens": getattr(u, "completion_tokens", 0) or 0,
+                "reasoning_tokens": getattr(details, "reasoning_tokens", 0) or 0,
+            }
+            self._record(
+                iter_tag,
+                model,
+                json.dumps(convo)[:2000],
+                json.dumps({"content": content, "tool_calls_count": len(tool_calls)}),
+                max_tokens,
+                usage,
+            )
+
+            if not tool_calls:
+                final_text = content
+                stopped_reason = "no_tool_calls"
+                break
+
+            assistant_msg: Dict[str, Any] = {
+                "role": "assistant",
+                "content": content,
+                "tool_calls": [
+                    {
+                        "id": tc.id,
+                        "type": "function",
+                        "function": {
+                            "name": tc.function.name,
+                            "arguments": tc.function.arguments,
+                        },
+                    }
+                    for tc in tool_calls
+                ],
+            }
+            convo.append(assistant_msg)
+
+            for tc in tool_calls:
+                try:
+                    result = on_tool_call(tc)
+                    ok = True
+                except Exception as exc:
+                    result = json.dumps({"error": f"on_tool_call raised: {exc}"})
+                    ok = False
+                args_preview = (tc.function.arguments or "")[:200]
+                result_preview = (result or "")[:200]
+                recorded_calls.append(
+                    {
+                        "name": tc.function.name,
+                        "args_preview": args_preview,
+                        "result_preview": result_preview,
+                        "ok": ok,
+                    }
+                )
+                convo.append(
+                    {
+                        "role": "tool",
+                        "tool_call_id": tc.id,
+                        "content": result,
+                    }
+                )
+
+            if it >= max_tool_iters - 1:
+                stopped_reason = "max_iters"
+                final_text = content
+                break
+
+        return {
+            "final_text": final_text,
+            "tool_calls": recorded_calls,
+            "iterations": iterations,
+            "stopped_reason": stopped_reason,
+        }
+
     # ------------------------------------------------------------------
     # Logging
     # ------------------------------------------------------------------

From ace60481438d65937314b032b39dbc28c7262ec4 Mon Sep 17 00:00:00 2001
From: mathuryash5 <mathuryash5@gmail.com>
Date: Sat, 25 Apr 2026 18:22:49 -0400
Subject: [PATCH 09/24] feat(trove): controller branch for native IMPORT tool
 calling

- Add task_family, selection, max_tool_iters, and tool_schema_topk params to
  TroVEController.__init__; change the trim_C default from 0.5 to 1.0.
- IMPORT branch dispatches to _generate_import_with_tools when toolbox
  is non-empty and backend is openai; otherwise falls back to legacy
  text-based IMPORT.
- _generate_import_with_tools builds K multi-turn trajectories via
  TroVELLMClient.chat_with_tools, parses **Solution** strictly for
  pbebench, and runs the result through the executor.
- _update_library credits frequency by unique tool_call.function.name
  for the native path; legacy path still credits parsed functions.
- _make_result emits won_mode, import_eligible, import_was_winner,
  tool_calls, tool_call_count, tools_called, actually_called,
  trove_stopped_reason as passive telemetry.
- _select_best honors selection="consistency" or "reward" (default).
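
The IMPORT dispatch rule in the second bullet can be sketched as below.
choose_import_path is a hypothetical helper for illustration and the returned
labels are not identifiers from the controller; the real branch also samples
K candidates and runs _select_best.

```python
def choose_import_path(toolbox_str: str, backend: str) -> str:
    """Return which IMPORT path would run; labels are illustrative only."""
    if not toolbox_str:
        # Nothing to import yet: IMPORT yields an empty placeholder candidate.
        return "empty_toolbox"
    if backend == "openai":
        # Non-empty toolbox on the OpenAI/vLLM backend: native tool calling.
        return "native_tools"
    # Any other backend (e.g. Anthropic): legacy text-based IMPORT prompt.
    return "legacy_text"


paths = [
    choose_import_path("", "openai"),
    choose_import_path("def f(): ...", "openai"),
    choose_import_path("def f(): ...", "anthropic"),
]
```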

Made-with: Cursor
---
 symbolic_agent/baselines/trove/controller.py | 192 +++++++++++++++----
 1 file changed, 156 insertions(+), 36 deletions(-)

diff --git a/symbolic_agent/baselines/trove/controller.py b/symbolic_agent/baselines/trove/controller.py
index d11d8b23..173c3837 100644
--- a/symbolic_agent/baselines/trove/controller.py
+++ b/symbolic_agent/baselines/trove/controller.py
@@ -37,10 +37,17 @@
 from collections import Counter
 from typing import Callable, Dict, List, Optional
 
+from . import tools_api
 from .executor import run_solution
 from .llm import TroVELLMClient
-from .parse import count_ast_nodes, parse_response
-from .prompts import build_create_prompt, build_import_prompt, build_skip_prompt, get_question
+from .parse import count_ast_nodes, imported_callsites, parse_response
+from .prompts import (
+    build_create_prompt,
+    build_import_prompt,
+    build_import_with_tools_prompt,
+    build_skip_prompt,
+    get_question,
+)
 from .toolbox import TroVEToolbox
 
 logger = logging.getLogger(__name__)
@@ -83,18 +90,26 @@ def __init__(
         debug_dir: Optional[str] = None,
         k: int = DEFAULT_K,
         trim_every: int = DEFAULT_TRIM_EVERY,
-        trim_C: float = 0.5,
+        trim_C: float = 1.0,
         temperature: float = 0.3,
         top_p: float = 0.95,
+        task_family: str = "default",
+        selection: str = "reward",
+        max_tool_iters: int = 8,
+        tool_schema_topk: int = 10,
     ):
         self.model = model
         self.k = k
         self.trim_every = trim_every
         self.trim_C = trim_C
+        self.task_family = task_family
+        self.selection = selection
+        self.max_tool_iters = max_tool_iters
+        self.tool_schema_topk = tool_schema_topk
 
-        backend = "openai" if base_url else "anthropic"
+        self.backend = "openai" if base_url else "anthropic"
         self.llm = TroVELLMClient(
-            backend=backend,
+            backend=self.backend,
             base_url=base_url,
             api_key=api_key,
             temperature=temperature,
@@ -252,33 +267,52 @@ def _multi_way_generation(
         toolbox_str = self.toolbox.format_toolbox()
 
         # --- IMPORT mode ---
-        import_candidates = []
-        if toolbox_str:
+        toolbox_nonempty = bool(toolbox_str)
+        use_tools_branch = toolbox_nonempty and self.backend == "openai"
+
+        if use_tools_branch:
+            import_candidates = self._generate_import_with_tools(
+                question, example_idx, reward_fn=reward_fn, entry=entry
+            )
+            best_import_idx, best_import_score = self._select_best(
+                import_candidates, reward_fn=reward_fn, entry=entry
+            )
+            best_import = import_candidates[best_import_idx]
+            best_import["_reward_score"] = best_import_score
+        elif toolbox_nonempty:
+            # Legacy text-based IMPORT (Anthropic or unforeseen non-OpenAI path).
+            import_candidates = []
             for _ in range(self.k):
-                prompt = build_import_prompt(question, toolbox_str)
+                prompt = build_import_prompt(question, toolbox_str, task_family=self.task_family)
                 raw = self.llm.call(prompt, self.model, max_tokens=DEFAULT_MAX_TOKENS, tag="trove_import")
-                parsed = parse_response(raw)
+                parsed = parse_response(raw, task_family=self.task_family)
                 is_ok, out = run_solution(
                     parsed["solution_code"],
                     parsed["tools_code"],
                     self.toolbox.get_full_code(),
                 )
-                import_candidates.append({**parsed, "is_success": is_ok, "exec_output": out})
+                import_candidates.append(
+                    {**parsed, "is_success": is_ok, "exec_output": out, "tool_calls": [], "stopped_reason": "legacy"}
+                )
             best_import_idx, best_import_score = self._select_best(
                 import_candidates, reward_fn=reward_fn, entry=entry
             )
             best_import = import_candidates[best_import_idx]
             best_import["_reward_score"] = best_import_score
         else:
-            best_import = {"solution_code": "", "tools_code": "", "functions": [],
-                           "is_success": False, "exec_output": "", "_reward_score": None}
+            best_import = {
+                "solution_code": "", "tools_code": "", "functions": [],
+                "is_success": False, "exec_output": "",
+                "tool_calls": [], "stopped_reason": "empty_toolbox",
+                "_reward_score": None,
+            }
 
         # --- CREATE mode ---
         create_candidates = []
         for _ in range(self.k):
-            prompt = build_create_prompt(question)
+            prompt = build_create_prompt(question, task_family=self.task_family)
             raw = self.llm.call(prompt, self.model, max_tokens=DEFAULT_MAX_TOKENS, tag="trove_create")
-            parsed = parse_response(raw)
+            parsed = parse_response(raw, task_family=self.task_family)
             is_ok, out = run_solution(
                 parsed["solution_code"],
                 parsed["tools_code"],
@@ -294,9 +328,9 @@ def _multi_way_generation(
         # --- SKIP mode ---
         skip_candidates = []
         for _ in range(self.k):
-            prompt = build_skip_prompt(question)
+            prompt = build_skip_prompt(question, task_family=self.task_family)
             raw = self.llm.call(prompt, self.model, max_tokens=DEFAULT_MAX_TOKENS, tag="trove_skip")
-            parsed = parse_response(raw)
+            parsed = parse_response(raw, task_family=self.task_family)
             is_ok, out = run_solution(
                 parsed["solution_code"],
                 parsed["tools_code"],
@@ -334,6 +368,54 @@ def _multi_way_generation(
         )
         return winning_mode, best_resp, best_score
 
+    def _generate_import_with_tools(
+        self,
+        question: str,
+        example_idx: int,
+        reward_fn: Optional[Callable] = None,
+        entry: Optional[dict] = None,
+    ) -> List[dict]:
+        """
+        IMPORT-mode generation using native OpenAI tool calling.
+        Builds K trajectories; each trajectory may invoke toolbox functions
+        via tool_calls during the multi-turn loop. Returns K candidate dicts
+        compatible with _select_best.
+        """
+        prompt = build_import_with_tools_prompt(question, task_family=self.task_family)
+        tools_schema = tools_api.toolbox_to_openai_tools(self.toolbox, topk=self.tool_schema_topk)
+
+        candidates: List[dict] = []
+        for i in range(self.k):
+            tag = f"trove_import_t{example_idx}_{i}"
+            messages = [{"role": "user", "content": prompt}]
+            on_tc = lambda tc: tools_api.dispatch_tool_call(self.toolbox, tc)
+            traj = self.llm.chat_with_tools(
+                messages=messages,
+                tools=tools_schema,
+                model=self.model,
+                max_tokens=DEFAULT_MAX_TOKENS,
+                max_tool_iters=self.max_tool_iters,
+                on_tool_call=on_tc,
+                tag=tag,
+            )
+            parsed = parse_response(traj["final_text"], task_family=self.task_family)
+            is_ok, out = run_solution(
+                parsed["solution_code"],
+                parsed["tools_code"],
+                self.toolbox.get_full_code(),
+            )
+            candidates.append(
+                {
+                    **parsed,
+                    "is_success": is_ok,
+                    "exec_output": out,
+                    "tool_calls": traj["tool_calls"],
+                    "stopped_reason": traj["stopped_reason"],
+                    "iterations": traj["iterations"],
+                }
+            )
+        return candidates
+
     def _select_best(
         self,
         candidates: List[dict],
@@ -344,18 +426,15 @@ def _select_best(
         Select the best candidate from a list of response dicts.
 
         Returns (best_index, score_or_None) where score is (reward, message)
-        when reward-based selection is used, or None for majority-vote mode.
-
-        Two selection strategies:
-        1. Reward-based (when reward_fn + entry provided):
-           Score all K candidates with reward_fn; pick highest reward,
-           tiebreak by minimum AST node count (simplest solution).
-           This is reliable for PBEBench (program lists rarely match exactly
-           as strings) and equally good for reasoning_gym.
-        2. Majority-vote fallback (original TroVE algorithm):
-           Filter successes → majority vote on stdout → min AST tiebreak.
-           Used when no reward function is available (e.g. bare solve()).
+        when reward-based selection is used, or None otherwise.
+
+        Selection strategy is governed by self.selection:
+          - "reward" (default): reward-based when reward_fn+entry provided,
+            falls back to consistency when not.
+          - "consistency": original TroVE majority-vote algorithm.
         """
+        if self.selection == "consistency":
+            return self._select_best_by_consistency(candidates), None
         if reward_fn is not None and entry is not None:
             return self._select_best_by_reward(candidates, reward_fn, entry)
         return self._select_best_by_consistency(candidates), None
@@ -419,13 +498,25 @@ def _select_best_by_consistency(self, candidates: List[dict]) -> int:
     def _update_library(self, mode: str, resp: dict, example_idx: int) -> None:
         """Update toolbox based on winning mode (faithful to run_trove.py)."""
         if mode == "import":
-            # IMPORT: credit existing functions that were used
-            for func_dict in resp.get("functions", []):
-                name = func_dict.get("name", "")
-                if name:
-                    self.toolbox.update_frequency(name, example_idx)
+            tool_calls = resp.get("tool_calls") or []
+            if tool_calls:
+                # Native tool-calling path: credit by unique tool_call.function.name
+                # (defensive: sanitize and let toolbox.update_frequency filter unknowns).
+                unique_names = {
+                    tc["name"].split("<|", 1)[0].strip()
+                    for tc in tool_calls
+                    if tc.get("name")
+                }
+                for name in unique_names:
+                    if name:
+                        self.toolbox.update_frequency(name, example_idx)
+            else:
+                # Legacy text-based IMPORT: credit functions parsed from **Tools**.
+                for func_dict in resp.get("functions", []):
+                    name = func_dict.get("name", "")
+                    if name:
+                        self.toolbox.update_frequency(name, example_idx)
         elif mode == "create" and resp.get("is_success"):
-            # CREATE: add new functions only when execution succeeded
             for func_dict in resp.get("functions", []):
                 self.toolbox.add(func_dict, example_idx)
 
@@ -447,8 +538,29 @@ def _make_result(
     ) -> dict:
         """
         Build a result dict compatible with main.py's _print_result() and
-        _append_task_output().
+        _append_task_output(). Adds passive TroVE telemetry fields.
         """
+        tool_calls = best_resp.get("tool_calls") or []
+        tools_called = sorted({
+            tc["name"].split("<|", 1)[0].strip()
+            for tc in tool_calls
+            if tc.get("name")
+        })
+        candidate_names = {e["name"] for e in self.toolbox.snapshot()}
+        actually_called = sorted(
+            imported_callsites(
+                solution_code=best_resp.get("solution_code", ""),
+                tools_code=best_resp.get("tools_code", ""),
+                candidate_names=candidate_names,
+            )
+        )
+        import_eligible = len(self.toolbox) > 0  # state AFTER this task's update
+        # Note: import_eligible reflects the current toolbox state after
+        # _update_library has already run for this task. The analyzer should
+        # interpret this as "a non-empty toolbox existed at some point during
+        # this task's processing". For pre-task eligibility, infer from
+        # toolbox snapshots in adjacent tasks.
+
         return {
             "task_type": task_type,
             "original_prompt": str(task_input),
@@ -464,7 +576,7 @@ def _make_result(
             ],
             "solution": best_resp.get("solution_code", ""),
             "library_snapshot": self.toolbox.snapshot(),
-            "cost_summary": {},  # TroVE has no cost model
+            "cost_summary": {},
             "final_output": {
                 "answer": output,
                 "explanation": f"TroVE mode={best_mode}",
@@ -475,6 +587,14 @@ def _make_result(
             "reward_history": [],
             "best_reward": None,
             "final_reward": None,
-            # Cached score from reward-based selection; consumed and removed by solve_with_reward.
             "_best_reward_score": best_reward_score,
+            # TroVE native-tool-calling telemetry
+            "won_mode": best_mode,
+            "import_eligible": import_eligible,
+            "import_was_winner": best_mode == "import",
+            "tool_calls": tool_calls,
+            "tool_call_count": len(tool_calls),
+            "tools_called": tools_called,
+            "actually_called": actually_called,
+            "trove_stopped_reason": best_resp.get("stopped_reason", ""),
         }

From 5f1ff88b2ebbf022c8569653a69b49a93ccbb374 Mon Sep 17 00:00:00 2001
From: mathuryash5 <mathuryash5@gmail.com>
Date: Sat, 25 Apr 2026 18:34:32 -0400
Subject: [PATCH 10/24] docs(trove): align TroVEController class docstring with
 new params

Update the class-level Parameters block to:
- Reflect trim_C default of 1.0 (matches __init__).
- Document task_family, selection, max_tool_iters, tool_schema_topk.
- Note that base_url governs which backend is used and that native
  tool-calling IMPORT requires the openai backend.

Made-with: Cursor
---
 symbolic_agent/baselines/trove/controller.py | 20 ++++++++++++++++++--
 1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/symbolic_agent/baselines/trove/controller.py b/symbolic_agent/baselines/trove/controller.py
index 173c3837..149f2a28 100644
--- a/symbolic_agent/baselines/trove/controller.py
+++ b/symbolic_agent/baselines/trove/controller.py
@@ -68,18 +68,34 @@ class TroVEController:
     model : str
         LLM model identifier.
     base_url : str, optional
-        For OpenAI-compatible (vLLM) backends.
+        For OpenAI-compatible (vLLM) backends. When set, ``self.backend`` is
+        ``"openai"``; otherwise ``"anthropic"``. Native tool-calling IMPORT
+        requires the openai backend.
     debug_dir : str, optional
     k : int
         Number of samples per mode (paper default: 5).
     trim_every : int
         Trim toolbox every N tasks (paper default: 500).
     trim_C : float
-        Trimming threshold multiplier: threshold = C·log₂₀(n). Default: 0.5.
+        Trimming threshold multiplier: threshold = C·log₂₀(n). Default: 1.0
+        (matches the original TroVE implementation).
     temperature : float
         Sampling temperature. Default: 0.3 (TroVE paper).
     top_p : float
         Nucleus sampling top-p. Default: 0.95 (TroVE paper).
+    task_family : str
+        Prompt/parsing family. ``"default"`` (generic) or ``"pbebench"``
+        (PBEBench-shaped few-shots; strict ``**Solution**`` parsing).
+    selection : str
+        Candidate selection strategy. ``"reward"`` (default) uses the
+        reward function when available and falls back to consistency;
+        ``"consistency"`` always uses the original TroVE majority-vote.
+    max_tool_iters : int
+        Maximum tool-call rounds per IMPORT trajectory in the native
+        tool-calling path. Default: 8.
+    tool_schema_topk : int
+        Number of top-frequency toolbox functions exposed as OpenAI tool
+        schemas in the native IMPORT path. Default: 10.
     """
 
     def __init__(

From d8a76a4000f35b30afe47e6d1a6c65e8340e289f Mon Sep 17 00:00:00 2001
From: mathuryash5 <mathuryash5@gmail.com>
Date: Sat, 25 Apr 2026 18:35:39 -0400
Subject: [PATCH 11/24] feat(trove): CLI flags --trove-selection and
 --trove-task-family

- --trove-selection {reward,consistency} (default: reward).
- --trove-task-family {default,pbebench} (default: default). Plumbed
  through to TroVEController; PBEBench runs should pass --trove-task-family
  pbebench to enable PBEBench-shaped few-shots and strict **Solution**
  parsing.

Made-with: Cursor
---
 main.py | 24 +++++++++++++++++++++++-
 1 file changed, 23 insertions(+), 1 deletion(-)
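
Reviewer note: the 'reward' strategy can be pictured with the minimal
sketch below. It is an illustration under stated assumptions, not the
controller's code: `select_by_reward` and its one-argument `reward_fn`
are hypothetical stand-ins (the real `_select_best_by_reward` also takes
`entry` and returns a `(reward, message)` score), and `count_ast_nodes`
is assumed to count nodes via `ast.walk`.

```python
import ast
from typing import Callable, List, Tuple


def count_ast_nodes(code: str) -> int:
    """Total AST node count; unparsable code is treated as maximally complex."""
    try:
        return sum(1 for _ in ast.walk(ast.parse(code)))
    except SyntaxError:
        return 10**9


def select_by_reward(
    candidates: List[dict],
    reward_fn: Callable[[dict], float],  # hypothetical one-argument reward
) -> Tuple[int, float]:
    """Pick the highest-reward candidate; break ties by fewest AST nodes."""
    rewards = [reward_fn(c) for c in candidates]
    nodes = [count_ast_nodes(c.get("solution_code", "")) for c in candidates]
    # max() keeps the first index among exact ties, so the result is deterministic.
    best = max(range(len(candidates)), key=lambda i: (rewards[i], -nodes[i]))
    return best, rewards[best]
```

With two candidates that earn the same reward, the one whose solution
parses to fewer AST nodes wins, which mirrors the "simplest solution"
tie-break described in the help text.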

diff --git a/main.py b/main.py
index f04aff88..3bfbc6b3 100644
--- a/main.py
+++ b/main.py
@@ -808,6 +808,23 @@ def main() -> None:
         help="[TroVE] Trim low-frequency toolbox functions every N tasks. "
              "Paper default: 500. Set to 9999 to disable for small datasets. (default: 500)",
     )
+    parser.add_argument(
+        "--trove-selection",
+        choices=["reward", "consistency"],
+        default="reward",
+        help="[TroVE] Candidate selection strategy. 'reward' uses the per-task "
+             "reward function with AST tie-breaking, falling back to consistency "
+             "when no reward function is available. 'consistency' uses the "
+             "original TroVE majority-vote algorithm. (default: reward)",
+    )
+    parser.add_argument(
+        "--trove-task-family",
+        choices=["default", "pbebench"],
+        default="default",
+        help="[TroVE] Task family for prompt selection and parser strictness. "
+             "'pbebench' uses PBEBench-shaped few-shots and strict **Solution** "
+             "parsing (no fallback to any python block). (default: default)",
+    )
     # ReGAL-specific flags
     parser.add_argument(
         "--regal-train-file",
@@ -1007,8 +1024,13 @@ def main() -> None:
             debug_dir=args.debug_dir,
             k=args.trove_k,
             trim_every=args.trove_trim_every,
+            task_family=args.trove_task_family,
+            selection=args.trove_selection,
+        )
+        logger.info(
+            "Framework: TroVE (k=%d, trim_every=%d, task_family=%s, selection=%s)",
+            args.trove_k, args.trove_trim_every, args.trove_task_family, args.trove_selection,
         )
-        logger.info("Framework: TroVE (k=%d, trim_every=%d)", args.trove_k, args.trove_trim_every)
     elif args.framework == "regal":
         from pathlib import Path as _Path
         controller = ReGALController(

From a19309b93c285283b039d45322f31e4df420daa5 Mon Sep 17 00:00:00 2001
From: mathuryash5 <mathuryash5@gmail.com>
Date: Sat, 25 Apr 2026 18:36:31 -0400
Subject: [PATCH 12/24] chore(launcher): enable native tool calling for
 gpt-oss-120b vLLM server

Add three flags required for OpenAI-compatible tool calling on gpt-oss
served by vLLM >= v0.16.0:
  --enable-auto-tool-choice
  --tool-call-parser openai
  --reasoning-parser openai_gptoss

Without these flags, the controller's chat_with_tools loop sees no
tool_calls in the response and degrades to no-tool behavior.

Made-with: Cursor
---
 scripts/launch_vllm_gpt_oss_120b.sh | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/scripts/launch_vllm_gpt_oss_120b.sh b/scripts/launch_vllm_gpt_oss_120b.sh
index 74b10dac..5ae5216c 100644
--- a/scripts/launch_vllm_gpt_oss_120b.sh
+++ b/scripts/launch_vllm_gpt_oss_120b.sh
@@ -7,6 +7,11 @@ export TMPDIR=/tmp/$USER-tmp
 
 ts=$(date +%Y%m%d_%H%M%S)
 
+# Required vLLM tool-calling flags (vLLM >= v0.16.0 for PR #28729):
+#   --enable-auto-tool-choice  enables tool_choice="auto"
+#   --tool-call-parser openai  parses gpt-oss Harmony commentary channel
+#   --reasoning-parser openai_gptoss  routes analysis-channel content into
+#                                     message.reasoning_content
 nohup python -m vllm.entrypoints.openai.api_server \
   --model "openai/gpt-oss-120b" \
   --tokenizer "openai/gpt-oss-120b" \
@@ -14,4 +19,7 @@ nohup python -m vllm.entrypoints.openai.api_server \
   --port ${1} \
   --gpu-memory-utilization 0.95 \
   --tensor-parallel-size 2 \
-  > vllm_logs/vllm_${1}_${ts}.log 2>&1 & echo $! > vllm_logs/vllm_${1}_${ts}.pid
\ No newline at end of file
+  --enable-auto-tool-choice \
+  --tool-call-parser openai \
+  --reasoning-parser openai_gptoss \
+  > vllm_logs/vllm_${1}_${ts}.log 2>&1 & echo $! > vllm_logs/vllm_${1}_${ts}.pid

From 8c32e0c4bbbbc2cb1919785dde48863832d0ef69 Mon Sep 17 00:00:00 2001
From: mathuryash5 <mathuryash5@gmail.com>
Date: Sat, 25 Apr 2026 18:37:12 -0400
Subject: [PATCH 13/24] feat(trove): add analyze_trove_run.py for post-hoc
 telemetry reports

Reads a TroVE JSONL output and reports overall accuracy, final toolbox
size, per-mode wins, IMPORT-mode tool-use breakdown (>=1 call rate,
mean calls/task, success rate), and the top-10 most-called toolbox
functions. Sanitizes Harmony control-token contamination in tool names
when aggregating.

Made-with: Cursor
---
 scripts/analyze_trove_run.py | 103 +++++++++++++++++++++++++++++++++++
 1 file changed, 103 insertions(+)
 create mode 100755 scripts/analyze_trove_run.py
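
Reviewer note: the name sanitization used when aggregating can be
sketched in isolation as below. The helper names `sanitize_tool_name`
and `count_tool_names` are hypothetical (the script inlines the same
expression), and the contaminated name in the usage note is an invented
example of what leaked Harmony control tokens might look like.

```python
from collections import Counter
from typing import Iterable, Optional

HARMONY_MARKER = "<|"  # prefix shared by Harmony control tokens


def sanitize_tool_name(raw: Optional[str]) -> str:
    """Drop everything from the first '<|' onward, then trim whitespace."""
    return (raw or "").split(HARMONY_MARKER, 1)[0].strip()


def count_tool_names(tool_calls: Iterable[dict]) -> Counter:
    """Count calls per sanitized tool name, skipping empty names."""
    counter: Counter = Counter()
    for tc in tool_calls:
        name = sanitize_tool_name(tc.get("name"))
        if name:
            counter[name] += 1
    return counter
```

Under this sketch, a hypothetical leaked name such as
`find_replace_chain<|channel|>commentary` is counted under
`find_replace_chain`, so contaminated and clean calls to the same
function aggregate together.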

diff --git a/scripts/analyze_trove_run.py b/scripts/analyze_trove_run.py
new file mode 100755
index 00000000..0fe2758e
--- /dev/null
+++ b/scripts/analyze_trove_run.py
@@ -0,0 +1,103 @@
+#!/usr/bin/env python3
+"""Post-hoc analysis of a TroVE run JSONL output.
+
+Reads the per-task JSONL file produced by main.py --output-file and reports:
+  - Overall accuracy
+  - Final toolbox size
+  - Per-mode wins
+  - IMPORT-mode tool-use breakdown
+  - Top-10 most-called toolbox functions
+
+Usage:
+    python scripts/analyze_trove_run.py path/to/results.jsonl
+"""
+
+from __future__ import annotations
+
+import argparse
+import json
+import sys
+from collections import Counter
+from pathlib import Path
+
+
+def _load_rows(path: Path) -> list[dict]:
+    rows = []
+    with path.open() as f:
+        for lineno, line in enumerate(f, 1):
+            line = line.strip()
+            if not line:
+                continue
+            try:
+                rows.append(json.loads(line))
+            except json.JSONDecodeError as exc:
+                print(f"warning: line {lineno} is not valid JSON: {exc}", file=sys.stderr)
+    return rows
+
+
+def _result_dict(row: dict) -> dict:
+    """Tolerant accessor: results are nested under 'result' in main.py's output."""
+    return row.get("result") or row
+
+
+def main() -> None:
+    parser = argparse.ArgumentParser(description=__doc__)
+    parser.add_argument("path", type=Path, help="Path to the TroVE results JSONL file")
+    args = parser.parse_args()
+
+    rows = _load_rows(args.path)
+    if not rows:
+        print("ERROR: no rows loaded", file=sys.stderr)
+        sys.exit(1)
+
+    n = len(rows)
+    results = [_result_dict(r) for r in rows]
+
+    solved = sum(1 for r in results if r.get("solved"))
+    print(f"=== Run summary: {args.path.name} ===")
+    print(f"Tasks: {n}")
+    print(f"Solved: {solved}/{n} ({100 * solved / n:.1f}%)")
+
+    last_snapshot = results[-1].get("library_snapshot") or []
+    print(f"Final toolbox size: {len(last_snapshot)}")
+
+    mode_counter = Counter(r.get("won_mode", "?") for r in results)
+    print(f"Mode wins: {dict(mode_counter)}")
+
+    import_eligible = [r for r in results if r.get("import_eligible")]
+    if not import_eligible:
+        print("No IMPORT-eligible tasks observed.")
+    else:
+        with_calls = [r for r in import_eligible if (r.get("tool_call_count") or 0) >= 1]
+        n_eligible = len(import_eligible)
+        n_with = len(with_calls)
+        mean_calls = (
+            sum((r.get("tool_call_count") or 0) for r in import_eligible) / n_eligible
+        )
+        all_calls = [tc for r in import_eligible for tc in (r.get("tool_calls") or [])]
+        n_calls_total = len(all_calls)
+        n_calls_ok = sum(1 for tc in all_calls if tc.get("ok"))
+        success_rate = (100 * n_calls_ok / n_calls_total) if n_calls_total else 0.0
+        print(
+            f"IMPORT-eligible tasks: {n_eligible}\n"
+            f"  Tasks with >=1 tool call: {n_with}/{n_eligible} ({100 * n_with / n_eligible:.1f}%)\n"
+            f"  Mean tool calls / task:   {mean_calls:.2f}\n"
+            f"  Tool-call success rate:   {n_calls_ok}/{n_calls_total} ({success_rate:.1f}%)"
+        )
+
+    name_counter: Counter = Counter()
+    for r in results:
+        for tc in r.get("tool_calls") or []:
+            name = (tc.get("name") or "").split("<|", 1)[0].strip()
+            if name:
+                name_counter[name] += 1
+    if name_counter:
+        print("Top-10 most-called toolbox functions:")
+        for name, cnt in name_counter.most_common(10):
+            print(f"  {cnt:4d}  {name}")
+    else:
+        print("No tool calls recorded in this run.")
+
+
+if __name__ == "__main__":
+    main()

From ff6a6d89cca05c79ea5825b1c3985e5572c20b18 Mon Sep 17 00:00:00 2001
From: mathuryash5 <mathuryash5@gmail.com>
Date: Sat, 25 Apr 2026 18:38:22 -0400
Subject: [PATCH 14/24] docs(trove): rewrite deviations.md for native tool
 calling

Document algorithmic deviations (native OpenAI tool calling for IMPORT,
reward-based selection by default, PBEBench-shaped few-shots, strict
**Solution** parsing for pbebench), faithful elements (3-mode generation,
K-sampling, AST tie-break, C*log_20(n) trimming with C=1.0), and
infrastructural patches (JSONL checkpointing, reasoning_content
fallback, 60s executor timeout, defensive <|-truncation sanitizer).

Includes vLLM version requirement (>= v0.16.0 for PR #28729) and the
backend coverage caveat (smoke run is vLLM-served gpt-oss only).

Made-with: Cursor
---
 .../baselines/trove/docs/deviations.md        | 205 +++++++-----------
 1 file changed, 83 insertions(+), 122 deletions(-)
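
Reviewer note: the trimming threshold `C·log_20(n)` cited in the rewritten
document can be checked numerically with the sketch below. The function
names are hypothetical, and the strict less-than comparison against a
per-function usage count is an assumption about how the threshold is
applied, not something this patch confirms.

```python
import math
from typing import Dict, List


def trim_threshold(n_tasks: int, c: float = 1.0) -> float:
    """Threshold C * log_20(n) for n processed tasks."""
    if n_tasks < 1:
        raise ValueError("n_tasks must be at least 1")
    return c * math.log(n_tasks, 20)


def functions_to_trim(frequencies: Dict[str, int], n_tasks: int, c: float = 1.0) -> List[str]:
    """Names whose usage count falls below the threshold (assumed strict <)."""
    threshold = trim_threshold(n_tasks, c)
    return sorted(name for name, freq in frequencies.items() if freq < threshold)
```

For n = 400 tasks and C = 1.0 the threshold is about 2, since 20 squared
is 400; under the assumed comparison, a function used once would be
trimmed while one used five times would be kept.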

diff --git a/symbolic_agent/baselines/trove/docs/deviations.md b/symbolic_agent/baselines/trove/docs/deviations.md
index 06d4c346..5ce60482 100644
--- a/symbolic_agent/baselines/trove/docs/deviations.md
+++ b/symbolic_agent/baselines/trove/docs/deviations.md
@@ -1,122 +1,83 @@
-# TroVE Baseline — Deviations from the Original Paper
-
-This document records all intentional and unavoidable deviations between our
-reimplementation (`symbolic_agent/baselines/trove/`) and the original TroVE
-codebase (`original_baseline_repos/trove/`).
-
----
-
-## 1. Chat API instead of Local Model Completion
-
-**Original:** TroVE uses a HuggingFace `transformers.pipeline` with a locally
-loaded model (e.g. CodeLlama-7b-Instruct) in **completion** mode. The prompt
-is a plain string prefix; the model generates continuation text.
-
-**Ours:** We use Anthropic's Messages API or an OpenAI-compatible chat API
-(vLLM). The prompt is sent as a `user` message; the model generates a reply
-that includes the **Solution** and **Tools** blocks.
-
-**Impact:** Minimal. The prompt structure (ending with `**Solution**`) signals
-to chat models what to generate, and empirically they comply. No JSON mode is
-used (`TroVELLMClient` vs the main `LLMClient`).
-
----
-
-## 2. Domain-Generic Few-Shot Examples
-
-**Original:** TroVE uses domain-specific few-shot examples for each task
-(TabMWP coin-collection table examples, MATH algebra examples, etc.)
-
-**Ours:** We use generic string-manipulation examples that apply to both
-PBEBench and ReasoningGym string tasks (replace_char, extract_digits,
-lowercase examples). Domain-specific examples for other task families
-should be added to `prompts.py` as needed.
-
-**Impact:** May slightly reduce self-consistency accuracy for tasks where the
-original examples provide strong in-context guidance. The structural format
-is preserved exactly.
-
----
-
-## 3. K Calls Rather Than Batched n=K
-
-**Original:** TroVE passes `num_return_sequences=K` to the HuggingFace
-pipeline, which generates K sequences in one forward pass.
-
-**Ours:** We call the LLM API K times independently (temperature sampling).
-The Anthropic API does not support `n` parameter; the OpenAI-compatible API
-does but we call separately for simplicity and identical code paths.
-
-**Impact:** K API calls instead of 1; slightly slower but statistically
-equivalent since each call is an independent sample.
-
----
-
-## 4. AST Node Count Instead of AST Depth Sum
-
-**Original:** TroVE tie-breaks by `sum(depth of each AST expression node)`
-across the solution (referenced in §3.2 and Appendix B).
-
-**Ours:** `count_ast_nodes()` counts total AST nodes via `ast.walk()`.
-Total nodes is monotonically related to total expression depth: simpler
-programs have fewer nodes AND lower total depth. The tie-breaking effect
-is identical in practice.
-
-**Impact:** Negligible. Both metrics rank programs by complexity; the ranking
-rarely differs for programs with the same stdout.
-
----
-
-## 5. No Re-Generation of Trimmed Examples
-
-**Original:** After trimming the toolbox, `run_trove.py` re-generates
-solutions for all affected examples using IMPORT|SKIP (not CREATE), then
-reports updated accuracy.
-
-**Ours:** We record the set of affected task indices in the trim log but do
-not replay them. This is because we process tasks in a single stream and do
-not store the original task inputs for re-processing. For a complete
-faithful comparison, task inputs should be saved and re-processed on trim.
-
-**Impact:** In practice, trimming only fires after 500 tasks with the default
-setting. For our 100-task pilot runs, trimming is disabled by setting
-`--trove-trim-every 9999`.
-
----
-
-## 6. Reward Loop Compatibility Wrapper
-
-**Original:** TroVE has no concept of a reward function or iterative
-refinement loop. It is one-shot per example.
-
-**Ours:** `solve_with_reward()` wraps `solve()` for compatibility with
-`main.py`'s `--default-reward` and `--max-reward-iters` flags. No retry
-loop is performed; the reward is computed once and stored in `reward_history`
-for eval script compatibility.
-
-**Impact:** None on TroVE's actual behavior. Only affects output format.
-
----
-
-## 7. `trim_every` Default Differs for Small Runs
-
-**Original:** Default `--trim_steps=500` (trimming every 500 examples).
-For a 100-task dataset this fires 0 times.
-
-**Ours:** Same default (500), but users running small pilots should pass
-`--trove-trim-every 9999` to make it explicit that no trimming happens.
-
-**Impact:** None unless running >500 tasks.
-
----
-
-## Summary Table
-
-| Aspect | Original | Ours | Impact |
-|--------|----------|------|--------|
-| LLM backend | Local HF model (completion) | Chat API (messages) | Minimal |
-| Few-shot examples | Domain-specific (TabMWP/MATH) | Generic string-manipulation | Minor |
-| K sampling | Batched (n=K in one call) | K independent API calls | Latency only |
-| Complexity metric | Sum of AST expression depths | Total AST node count | Negligible |
-| Trim replay | Re-generates affected examples | Records but does not replay | Evaluation accuracy |
-| Reward loop | Not in original | Wrapper for main.py compat | None |
+# TroVE Implementation: Deviations and Faithful Elements
+
+This document tracks how this port differs from — and where it stays
+faithful to — the original TroVE algorithm
+([Wang et al., 2024](https://arxiv.org/abs/2401.12869),
+[zorazrw/trove](https://github.com/zorazrw/trove)).
+
+## 1. Algorithmic deviations
+
+### 1.1 Native OpenAI tool calling for IMPORT mode
+The original TroVE shows the model a `**Toolbox**` markdown block
+listing top-k function signatures and asks it to write a `**Solution**`
+plus `**Tools**` block referencing those functions by name. We replace
+this for the IMPORT mode (when `backend == "openai"` and the toolbox is
+non-empty) with **native OpenAI tool calling**: the toolbox is exposed
+via the `tools=[...]` parameter of `chat.completions.create`, the model
+emits structured `tool_calls` during its reasoning, and `dispatch_tool_call`
+runs each one in the sandboxed executor and returns the stdout. This
+makes function usage observable and credit-able from the trajectory
+itself.
+
+### 1.2 Reward-based candidate selection (default)
+The paper uses self-consistency (majority vote on stdout, AST tie-break)
+to pick the best of K samples per mode. We default to **reward-based
+selection**: every candidate is scored by the per-task reward function,
+with ties broken by minimum AST node count. This is more reliable on
+PBEBench, where program-list outputs rarely match exactly as strings.
+The original self-consistency selector remains available via `--trove-selection consistency`.
+
+### 1.3 PBEBench-shaped few-shot examples
+For `task_family="pbebench"` we replace the generic CREATE / SKIP / IMPORT
+example pairs with PBEBench-shaped pairs that demonstrate `replace()`
+chains and a small reusable helper (`find_replace_chain`). The legacy
+default examples remain for `task_family="default"`.
+
+### 1.4 Strict **Solution** parsing for PBEBench
+The legacy parser falls back to "first ```python``` block anywhere" when
+no `**Solution**` block is present. For `task_family="pbebench"` this
+fallback is disabled, preventing CoT scratchpad from being accidentally
+promoted to the answer.
+
+## 2. Faithful elements
+
+- 3-mode generation (IMPORT, CREATE, SKIP).
+- K samples per mode (default K=5, as in the paper).
+- AST tie-breaking by node count (simplest solution wins).
+- Periodic toolbox trimming with threshold `C·log_{20}(n)`, default
+  `C=1.0`, matching the original implementation.
+- Frequency-based top-k retrieval for the toolbox view.
+- Dict-keyed toolbox structure mirroring `utils/code.py`.
+- Library updates: IMPORT credits frequency, CREATE adds new functions
+  on success, SKIP makes no library changes.
+
+## 3. Infrastructural patches
+
+- **JSONL-per-task checkpointing** via `--output-file`, with crash
+  resumption.
+- **`reasoning_content` fallback** in `_call_openai` for `gpt-oss` Harmony
+  channel splits where the answer text lives in `message.reasoning_content`.
+- **Executor timeout 60s** (vs. 10s in earlier versions of this port),
+  closer to the original's ~100s.
+- **`<|`-truncation sanitizer** in `dispatch_tool_call` and
+  `_update_library`. Defensive workaround for Harmony control-token
+  leakage into tool names, which the still-open vLLM
+  [PR #35906](https://github.com/vllm-project/vllm/pull/35906) addresses.
+  Once that PR lands upstream, the sanitizer becomes a no-op but stays in place.
+
+## 4. Backend coverage caveat
+
+Anthropic backend code paths exist and are exercised by CREATE / SKIP and
+the legacy text-based IMPORT fallback, but **the smoke run and reported
+numbers are vLLM-served `gpt-oss` only**. IMPORT-with-tools requires
+the OpenAI/vLLM backend and is the only path we test end-to-end.
+
+## 5. vLLM version requirement
+
+- Minimum vLLM: **v0.16.0** (branch-cut 2026-02-08).
+- Required upstream change: [PR #28729](https://github.com/vllm-project/vllm/pull/28729)
+  ("Multiple fixes for gpt-oss Chat Completion prompting"), merged
+  2025-12-12. v0.16.0 is the first stable release branch-cut after the merge.
+- Known open caveat: [PR #35906](https://github.com/vllm-project/vllm/pull/35906)
+  ("Sanitize leaked Harmony control tokens"), still open as of late
+  March 2026 — see §3 for the sanitizer mitigation.

From ab7b7a326d00397025858bc4f904a31cf8405a25 Mon Sep 17 00:00:00 2001
From: mathuryash5 <mathuryash5@gmail.com>
Date: Sat, 25 Apr 2026 18:41:54 -0400
Subject: [PATCH 15/24] fix(trove): persist TroVE telemetry through
 _append_task_output

The TroVE controller emits passive telemetry (won_mode, import_eligible,
import_was_winner, tool_calls, tool_call_count, tools_called,
actually_called, trove_stopped_reason, library_snapshot) on the in-memory
result dict, but main._append_task_output was dropping all of it before
the JSONL was written. scripts/analyze_trove_run.py would then read
empty/missing fields and report misleading numbers (e.g. all won_mode
as '?', 'No IMPORT-eligible tasks' on healthy runs).

Pass these keys through verbatim when present. Keys are absent on
non-TroVE runs, so other frameworks (ssl_bcr, regal, react_mem, etc.)
are unaffected.

Made-with: Cursor
---
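A minimal sketch of the passthrough, for reviewers who want to poke at the
behaviour outside `main.py`. `build_record` and the `task_index` field are
hypothetical stand-ins for the record-building step of `_append_task_output`.
Only the telemetry key names are taken from this patch.

```python
import json

# Key names copied from the commit message above.
TROVE_TELEMETRY_KEYS = (
    "won_mode",
    "import_eligible",
    "import_was_winner",
    "tool_calls",
    "tool_call_count",
    "tools_called",
    "actually_called",
    "trove_stopped_reason",
    "library_snapshot",
)


def build_record(result: dict, task_index: int) -> dict:
    """Hypothetical reduction of the record built before the JSONL write."""
    record = {
        "task_index": task_index,
        "token_usage": result.get("token_usage", {}),
    }
    # Membership test, not truthiness: 0 and False are valid telemetry values.
    for key in TROVE_TELEMETRY_KEYS:
        if key in result:
            record[key] = result[key]
    return record


trove_row = build_record(
    {"won_mode": "import", "import_eligible": False, "tool_call_count": 0}, 0
)
other_row = build_record({"token_usage": {"input": 3}}, 1)
print(json.dumps(trove_row))
print(json.dumps(other_row))
```

Testing `key in result` rather than `result.get(key)` matters here, because a
task with zero tool calls or `import_eligible == False` should still be written.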
 main.py | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/main.py b/main.py
index 3bfbc6b3..ec2f5d36 100644
--- a/main.py
+++ b/main.py
@@ -156,6 +156,22 @@ def _append_task_output(result: dict, task_index: int, output_file: str) -> None
         "token_usage": result.get("token_usage", {}),
         "agent_messages": result.get("agent_messages", []),
     }
+    # TroVE telemetry: passthrough when present so scripts/analyze_trove_run.py
+    # (and any other post-hoc analyzer) can read per-task tool-use stats and the
+    # final library state from the JSONL. Keys are absent on non-TroVE runs.
+    for key in (
+        "won_mode",
+        "import_eligible",
+        "import_was_winner",
+        "tool_calls",
+        "tool_call_count",
+        "tools_called",
+        "actually_called",
+        "trove_stopped_reason",
+        "library_snapshot",
+    ):
+        if key in result:
+            record[key] = result[key]
     Path(output_file).parent.mkdir(parents=True, exist_ok=True)
     with open(output_file, "a", encoding="utf-8") as f:
         f.write(json.dumps(record, default=str) + "\n")

From ce75297309f666090a902040217912bba06788b0 Mon Sep 17 00:00:00 2001
From: mathuryash5 <mathuryash5@gmail.com>
Date: Sat, 25 Apr 2026 19:46:29 -0400
Subject: [PATCH 16/24] feat(trove): add notebooks/run_trove_pbebench.ipynb
 runpod runner

End-to-end Jupyter notebook for the PBEBench-Lite smoke run on RunPod:
launches vLLM with the native tool-calling flags, polls /v1/models
until ready, runs main.py with --framework trove
--trove-task-family pbebench --trove-selection reward against the
50-task lite_pilot_tasks.jsonl split, then invokes
scripts/analyze_trove_run.py and previews telemetry. Defaults to
gpt-oss-20b on a single A100/H100; flip MODEL and TENSOR_PARALLEL
for 120b.

Made-with: Cursor
---
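The readiness poll in the notebook can be factored into a small helper that
is testable without a server. This is a sketch under assumptions:
`wait_until_ready` and its parameters are hypothetical names, the probe is
injected instead of calling `/v1/models` directly, and only `OSError` (which
covers refused connections and `urllib.error.URLError`) is treated as "not
ready yet", whereas the notebook cell catches `Exception`.

```python
import time
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    timeout_s: float,
    poll_s: float,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call probe() until it returns True or timeout_s elapses."""
    deadline = clock() + timeout_s
    while clock() < deadline:
        try:
            if probe():
                return True
        except OSError:
            # Connection refused, timeout, etc.: the server is not up yet.
            pass
        sleep(poll_s)
    return False


attempts = []


def fake_probe() -> bool:
    """Stand-in for an HTTP GET: fails twice, then succeeds."""
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionRefusedError("server not up yet")
    return True


ok = wait_until_ready(fake_probe, timeout_s=5.0, poll_s=0.0)
timed_out = wait_until_ready(lambda: False, timeout_s=0.0, poll_s=0.0)
```

The notebook additionally checks `vllm_proc.poll()` on every iteration so a
crashed server fails fast. That check is omitted from this sketch.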
 notebooks/run_trove_pbebench.ipynb | 334 +++++++++++++++++++++++++++++
 1 file changed, 334 insertions(+)
 create mode 100644 notebooks/run_trove_pbebench.ipynb

diff --git a/notebooks/run_trove_pbebench.ipynb b/notebooks/run_trove_pbebench.ipynb
new file mode 100644
index 00000000..e6736960
--- /dev/null
+++ b/notebooks/run_trove_pbebench.ipynb
@@ -0,0 +1,334 @@
+{
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "# TroVE × PBEBench-Lite — RunPod runner\n",
+        "\n",
+        "End-to-end notebook to:\n",
+        "\n",
+        "1. Check GPU and install dependencies\n",
+        "2. Launch a local vLLM server (with native tool-calling flags)\n",
+        "3. Wait for it to be healthy\n",
+        "4. Run TroVE on PBEBench-Lite with reward-based selection\n",
+        "5. Analyze the JSONL output\n",
+        "\n",
+        "## Pod sizing\n",
+        "\n",
+        "| Model           | Recommended GPU                | Tensor parallel |\n",
+        "|-----------------|--------------------------------|-----------------|\n",
+        "| `gpt-oss-20b`   | 1× A100 80 GB or 1× H100        | 1               |\n",
+        "| `gpt-oss-120b`  | 2× H100 / A100 80 GB           | 2               |\n",
+        "\n",
+        "## Before you start\n",
+        "\n",
+        "- Run this notebook from a Jupyter kernel **inside the pod**, with the repo at `/workspace/pbe/symbolic-library-agent` (or wherever you cloned it). Adjust `REPO_ROOT` in the next cell if needed.\n",
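+        "- Most cells are safe to re-run. The exception is the vLLM launch cell in section 3, which starts a new server process each time.\n",
+        "- The cleanup cell at the bottom stops the vLLM process. If you re-run cells out of order and end up with a stale server, run the cleanup cell before relaunching."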
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## 1. Configuration"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {},
+      "source": [
+        "from pathlib import Path\n",
+        "import os\n",
+        "\n",
+        "# Pick the model variant. 20b fits on a single A100/H100; 120b needs TP=2.\n",
+        "MODEL = \"openai/gpt-oss-20b\"          # or \"openai/gpt-oss-120b\"\n",
+        "TENSOR_PARALLEL = 1                    # set to 2 for 120b\n",
+        "\n",
+        "PORT = 8000\n",
+        "BASE_URL = f\"http://localhost:{PORT}/v1\"\n",
+        "\n",
+        "# Repo root — change if your clone lives elsewhere on the pod.\n",
+        "REPO_ROOT = Path(os.environ.get(\"REPO_ROOT\", \"/workspace/pbe/symbolic-library-agent\"))\n",
+        "if not REPO_ROOT.exists():\n",
+        "    REPO_ROOT = Path.cwd().parent if Path.cwd().name == \"notebooks\" else Path.cwd()\n",
+        "assert (REPO_ROOT / \"main.py\").exists(), f\"Could not find main.py under {REPO_ROOT}\"\n",
+        "os.chdir(REPO_ROOT)\n",
+        "\n",
+        "# Tasks file. Two PBEBench-Lite options ship with the repo:\n",
+        "#   - lite_pilot_tasks.jsonl    : 50-task pilot split (smoke-run default)\n",
+        "#   - lite_tasks_full_og.jsonl  : full Lite split (1008 tasks)\n",
+        "TASKS_FILE = REPO_ROOT / \"data/pbebench/lite_pilot_tasks.jsonl\"\n",
+        "MAX_PROGRAMS = 5     # PBEBench convention for the lite split\n",
+        "\n",
+        "OUT_DIR = REPO_ROOT / \"outputs\"\n",
+        "OUT_FILE = OUT_DIR / \"trove_pbebench_lite_smoke.jsonl\"\n",
+        "DEBUG_DIR = REPO_ROOT / \"debug_trove_pbebench\"\n",
+        "VLLM_LOGS = REPO_ROOT / \"vllm_logs\"\n",
+        "OUT_DIR.mkdir(parents=True, exist_ok=True)\n",
+        "DEBUG_DIR.mkdir(parents=True, exist_ok=True)\n",
+        "VLLM_LOGS.mkdir(parents=True, exist_ok=True)\n",
+        "\n",
+        "print(f\"REPO_ROOT  : {REPO_ROOT}\")\n",
+        "print(f\"MODEL      : {MODEL}  (TP={TENSOR_PARALLEL})\")\n",
+        "print(f\"BASE_URL   : {BASE_URL}\")\n",
+        "print(f\"TASKS_FILE : {TASKS_FILE}  (exists={TASKS_FILE.exists()})\")\n",
+        "print(f\"OUT_FILE   : {OUT_FILE}\")"
+      ],
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## 2. GPU & dependency check"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {},
+      "source": [
+        "!nvidia-smi --query-gpu=index,name,memory.total,memory.free --format=csv"
+      ],
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "metadata": {},
+      "source": [
+        "# Install repo deps + vLLM. Re-running is a no-op if everything's already there.\n",
+        "!pip install -q -U pip wheel\n",
+        "!pip install -q -r requirements.txt 2>&1 | tail -5\n",
+        "!pip install -q -U \"vllm>=0.16.0\" 2>&1 | tail -5\n",
+        "import vllm\n",
+        "print(\"vllm version:\", vllm.__version__)"
+      ],
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## 3. Launch vLLM in the background\n",
+        "\n",
+        "Required flags for `gpt-oss` native tool calling (vLLM ≥ v0.16.0):\n",
+        "\n",
+        "- `--enable-auto-tool-choice`\n",
+        "- `--tool-call-parser openai`\n",
+        "- `--reasoning-parser openai_gptoss`"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {},
+      "source": [
+        "import os, subprocess, time, datetime\n",
+        "\n",
+        "ts = datetime.datetime.now().strftime(\"%Y%m%d_%H%M%S\")\n",
+        "log_path = VLLM_LOGS / f\"vllm_{PORT}_{ts}.log\"\n",
+        "pid_path = VLLM_LOGS / f\"vllm_{PORT}_{ts}.pid\"\n",
+        "\n",
+        "user = os.environ.get(\"USER\", \"runpod\")\n",
+        "for d in (f\"/tmp/{user}-tiktoken-cache\", f\"/tmp/{user}-tmp\"):\n",
+        "    Path(d).mkdir(parents=True, exist_ok=True)\n",
+        "    os.chmod(d, 0o700)\n",
+        "os.environ[\"TIKTOKEN_CACHE_DIR\"] = f\"/tmp/{user}-tiktoken-cache\"\n",
+        "os.environ[\"TMPDIR\"] = f\"/tmp/{user}-tmp\"\n",
+        "\n",
+        "cmd = [\n",
+        "    \"python\", \"-m\", \"vllm.entrypoints.openai.api_server\",\n",
+        "    \"--model\", MODEL,\n",
+        "    \"--tokenizer\", MODEL,\n",
+        "    \"--dtype\", \"auto\",\n",
+        "    \"--port\", str(PORT),\n",
+        "    \"--gpu-memory-utilization\", \"0.95\",\n",
+        "    \"--tensor-parallel-size\", str(TENSOR_PARALLEL),\n",
+        "    \"--enable-auto-tool-choice\",\n",
+        "    \"--tool-call-parser\", \"openai\",\n",
+        "    \"--reasoning-parser\", \"openai_gptoss\",\n",
+        "]\n",
+        "\n",
+        "log_fh = open(log_path, \"w\")\n",
+        "vllm_proc = subprocess.Popen(cmd, stdout=log_fh, stderr=subprocess.STDOUT)\n",
+        "pid_path.write_text(str(vllm_proc.pid))\n",
+        "print(f\"vLLM started — pid {vllm_proc.pid}\")\n",
+        "print(f\"log : {log_path}\")\n",
+        "print(f\"pid : {pid_path}\")"
+      ],
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "metadata": {},
+      "source": [
+        "# Wait for the OpenAI-compatible /v1/models endpoint to respond.\n",
+        "# 20b cold-start is ~1–2 min; 120b can be 5–10 min on first launch.\n",
+        "import urllib.request, json, time\n",
+        "\n",
+        "READY_TIMEOUT_S = 900   # 15 min\n",
+        "POLL_S = 5\n",
+        "\n",
+        "deadline = time.time() + READY_TIMEOUT_S\n",
+        "ready = False\n",
+        "while time.time() < deadline:\n",
+        "    if vllm_proc.poll() is not None:\n",
+        "        print(\"vLLM exited unexpectedly. Tail of log:\")\n",
+        "        print(log_path.read_text()[-4000:])\n",
+        "        raise RuntimeError(\"vLLM died during startup\")\n",
+        "    try:\n",
+        "        with urllib.request.urlopen(f\"{BASE_URL}/models\", timeout=2) as resp:\n",
+        "            data = json.loads(resp.read())\n",
+        "            print(\"Ready. /v1/models response:\")\n",
+        "            print(json.dumps(data, indent=2)[:600])\n",
+        "            ready = True\n",
+        "            break\n",
+        "    except Exception:\n",
+        "        elapsed = int(READY_TIMEOUT_S - (deadline - time.time()))\n",
+        "        print(f\"\\rwaiting for vLLM... {elapsed}s elapsed\", end=\"\", flush=True)\n",
+        "        time.sleep(POLL_S)\n",
+        "\n",
+        "if not ready:\n",
+        "    print(\"\\nTimed out. Tail of log:\")\n",
+        "    print(log_path.read_text()[-4000:])\n",
+        "    raise RuntimeError(\"vLLM never became ready\")"
+      ],
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## 4. Run TroVE on PBEBench-Lite (smoke run)\n",
+        "\n",
+        "Defaults below match the design:\n",
+        "\n",
+        "- `--trove-task-family pbebench` — strict `**Solution**` parsing + PBEBench few-shots\n",
+        "- `--trove-selection reward` — reward-based candidate selection (AST tie-break)\n",
+        "- `--trove-k 5` — paper default samples per mode\n",
+        "- `--trove-trim-every 9999` — effectively disable periodic trimming for a 50-task smoke\n",
+        "- `--default-reward pbebench` — PBEBench verifier"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {},
+      "source": [
+        "import subprocess, sys\n",
+        "\n",
+        "os.environ[\"VLLM_API_KEY\"] = os.environ.get(\"VLLM_API_KEY\", \"EMPTY\")\n",
+        "\n",
+        "cmd = [\n",
+        "    sys.executable, \"main.py\",\n",
+        "    \"--framework\",        \"trove\",\n",
+        "    \"--backend\",          \"vllm\",\n",
+        "    \"--base-url\",         BASE_URL,\n",
+        "    \"--model\",            MODEL,\n",
+        "    \"--trove-task-family\", \"pbebench\",\n",
+        "    \"--trove-selection\",   \"reward\",\n",
+        "    \"--trove-k\",           \"5\",\n",
+        "    \"--trove-trim-every\",  \"9999\",\n",
+        "    \"--default-reward\",    \"pbebench\",\n",
+        "    \"--max-programs\",      str(MAX_PROGRAMS),\n",
+        "    \"--tasks-file\",        str(TASKS_FILE),\n",
+        "    \"--output-file\",       str(OUT_FILE),\n",
+        "    \"--debug-dir\",         str(DEBUG_DIR),\n",
+        "]\n",
+        "\n",
+        "print(\" \".join(cmd))\n",
+        "print()\n",
+        "\n",
+        "# Stream stdout/stderr live.\n",
+        "proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True, bufsize=1)\n",
+        "try:\n",
+        "    for line in proc.stdout:\n",
+        "        print(line, end=\"\")\n",
+        "finally:\n",
+        "    rc = proc.wait()\n",
+        "print(f\"\\nmain.py exited with {rc}\")"
+      ],
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## 5. Analyze the JSONL output"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {},
+      "source": [
+        "!python scripts/analyze_trove_run.py \"{OUT_FILE}\""
+      ],
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "metadata": {},
+      "source": [
+        "# Quick peek at one row to confirm telemetry made it through.\n",
+        "import json\n",
+        "with open(OUT_FILE) as f:\n",
+        "    first = json.loads(next(f))\n",
+        "print(\"keys:\", sorted(first.keys()))\n",
+        "for k in (\"won_mode\", \"import_eligible\", \"tool_call_count\", \"trove_stopped_reason\"):\n",
+        "    print(f\"  {k:24s} = {first.get(k)}\")\n",
+        "print(f\"  library_snapshot size  = {len(first.get('library_snapshot', []))}\")"
+      ],
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## 6. Cleanup — stop vLLM\n",
+        "\n",
+        "Run this when you're done so the GPU is freed for the next experiment."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {},
+      "source": [
+        "import signal, subprocess\n",
+        "if vllm_proc.poll() is None:\n",
+        "    vllm_proc.send_signal(signal.SIGINT)\n",
+        "    try:\n",
+        "        vllm_proc.wait(timeout=15)\n",
+        "    except subprocess.TimeoutExpired:\n",
+        "        vllm_proc.kill()\n",
+        "        vllm_proc.wait()\n",
+        "    print(\"vLLM stopped.\")\n",
+        "else:\n",
+        "    print(\"vLLM was not running.\")\n",
+        "log_fh.close()"
+      ],
+      "execution_count": null,
+      "outputs": []
+    }
+  ],
+  "metadata": {
+    "kernelspec": {
+      "display_name": "Python 3",
+      "language": "python",
+      "name": "python3"
+    },
+    "language_info": {
+      "name": "python",
+      "pygments_lexer": "ipython3",
+      "version": "3.10"
+    }
+  },
+  "nbformat": 4,
+  "nbformat_minor": 5
+}
\ No newline at end of file

From e7897d4d8a6784ef59c324ba16ef508ac8b7c656 Mon Sep 17 00:00:00 2001
From: mathuryash5 <mathuryash5@gmail.com>
Date: Sat, 25 Apr 2026 19:54:27 -0400
Subject: [PATCH 17/24] chore(trove): target gpt-oss-20b for the TroVE smoke
 run

We are only running the TroVE PBEBench smoke on gpt-oss-20b. Add a
20b-specific vLLM launcher (TP=1 + the three tool-calling flags),
retarget scripts/run_trove_vllm.sh at 20b + the new TroVE flags
(--trove-task-family pbebench, --trove-selection reward,
--max-programs 5, lite_pilot_tasks.jsonl, port 8000), and simplify
the runpod notebook to a 20b-only configuration. The 120b launcher
remains in place for the other (non-TroVE) baselines that still use it.

Made-with: Cursor
---
 notebooks/run_trove_pbebench.ipynb | 23 ++++++++-------
 scripts/launch_vllm_gpt_oss_20b.sh | 25 +++++++++++++++++
 scripts/run_trove_vllm.sh          | 45 ++++++++++++++++++++----------
 3 files changed, 67 insertions(+), 26 deletions(-)
 create mode 100755 scripts/launch_vllm_gpt_oss_20b.sh

diff --git a/notebooks/run_trove_pbebench.ipynb b/notebooks/run_trove_pbebench.ipynb
index e6736960..e4648348 100644
--- a/notebooks/run_trove_pbebench.ipynb
+++ b/notebooks/run_trove_pbebench.ipynb
@@ -4,7 +4,7 @@
       "cell_type": "markdown",
       "metadata": {},
       "source": [
-        "# TroVE × PBEBench-Lite — RunPod runner\n",
+        "# TroVE × PBEBench-Lite — RunPod runner (`gpt-oss-20b`)\n",
         "\n",
         "End-to-end notebook to:\n",
         "\n",
@@ -16,10 +16,7 @@
         "\n",
         "## Pod sizing\n",
         "\n",
-        "| Model           | Recommended GPU                | Tensor parallel |\n",
-        "|-----------------|--------------------------------|-----------------|\n",
-        "| `gpt-oss-20b`   | 1× A100 80 GB or 1× H100        | 1               |\n",
-        "| `gpt-oss-120b`  | 2× H100 / A100 80 GB           | 2               |\n",
+        "`openai/gpt-oss-20b` runs comfortably on a single **A100 80 GB** or **H100** with `--tensor-parallel-size 1`. A100 40 GB will OOM at default settings.\n",
         "\n",
         "## Before you start\n",
         "\n",
@@ -42,9 +39,8 @@
         "from pathlib import Path\n",
         "import os\n",
         "\n",
-        "# Pick the model variant. 20b fits on a single A100/H100; 120b needs TP=2.\n",
-        "MODEL = \"openai/gpt-oss-20b\"          # or \"openai/gpt-oss-120b\"\n",
-        "TENSOR_PARALLEL = 1                    # set to 2 for 120b\n",
+        "MODEL = \"openai/gpt-oss-20b\"\n",
+        "TENSOR_PARALLEL = 1\n",
         "\n",
         "PORT = 8000\n",
         "BASE_URL = f\"http://localhost:{PORT}/v1\"\n",
@@ -77,7 +73,8 @@
         "print(f\"OUT_FILE   : {OUT_FILE}\")"
       ],
       "execution_count": null,
-      "outputs": []
+      "outputs": [],
+      "id": "ce204af4"
     },
     {
       "cell_type": "markdown",
@@ -167,10 +164,11 @@
       "metadata": {},
       "source": [
         "# Wait for the OpenAI-compatible /v1/models endpoint to respond.\n",
-        "# 20b cold-start is ~1–2 min; 120b can be 5–10 min on first launch.\n",
+        "# gpt-oss-20b cold-start (model download + load) is typically 1–3 min on a\n",
+        "# fresh pod; subsequent launches are seconds once the weights are cached.\n",
         "import urllib.request, json, time\n",
         "\n",
-        "READY_TIMEOUT_S = 900   # 15 min\n",
+        "READY_TIMEOUT_S = 600   # 10 min\n",
         "POLL_S = 5\n",
         "\n",
         "deadline = time.time() + READY_TIMEOUT_S\n",
@@ -253,7 +251,8 @@
         "print(f\"\\nmain.py exited with {rc}\")"
       ],
       "execution_count": null,
-      "outputs": []
+      "outputs": [],
+      "id": "500ee1a6"
     },
     {
       "cell_type": "markdown",
diff --git a/scripts/launch_vllm_gpt_oss_20b.sh b/scripts/launch_vllm_gpt_oss_20b.sh
new file mode 100755
index 00000000..37d6e131
--- /dev/null
+++ b/scripts/launch_vllm_gpt_oss_20b.sh
@@ -0,0 +1,25 @@
+#!/bin/bash
+
+mkdir -p /tmp/$USER-tiktoken-cache /tmp/$USER-tmp vllm_logs
+chmod 700 /tmp/$USER-tiktoken-cache /tmp/$USER-tmp
+export TIKTOKEN_CACHE_DIR=/tmp/$USER-tiktoken-cache
+export TMPDIR=/tmp/$USER-tmp
+
+ts=$(date +%Y%m%d_%H%M%S)
+
+# Required vLLM tool-calling flags (vLLM >= v0.16.0 for PR #28729):
+#   --enable-auto-tool-choice  enables tool_choice="auto"
+#   --tool-call-parser openai  parses gpt-oss Harmony commentary channel
+#   --reasoning-parser openai_gptoss  routes analysis-channel content into
+#                                     message.reasoning_content
+nohup python -m vllm.entrypoints.openai.api_server \
+  --model "openai/gpt-oss-20b" \
+  --tokenizer "openai/gpt-oss-20b" \
+  --dtype auto \
+  --port ${1} \
+  --gpu-memory-utilization 0.95 \
+  --tensor-parallel-size 1 \
+  --enable-auto-tool-choice \
+  --tool-call-parser openai \
+  --reasoning-parser openai_gptoss \
+  > vllm_logs/vllm_${1}_${ts}.log 2>&1 & echo $! > vllm_logs/vllm_${1}_${ts}.pid
diff --git a/scripts/run_trove_vllm.sh b/scripts/run_trove_vllm.sh
index 54c27932..280baa23 100755
--- a/scripts/run_trove_vllm.sh
+++ b/scripts/run_trove_vllm.sh
@@ -1,27 +1,44 @@
 #!/usr/bin/env bash
-# Run TroVE baseline against a local vLLM server.
+# Run TroVE baseline against a local vLLM server (gpt-oss-20b).
 # Usage: bash scripts/run_trove_vllm.sh
 #
-# For small datasets (≤100 tasks), --trove-trim-every is set high to disable
+# Defaults to PBEBench-Lite pilot (50 tasks). Override TASKS_FILE or pass
+# extra --flags through the trailing "$@".
+#
+# For small datasets (<=100 tasks), --trove-trim-every is set high to disable
 # trimming (the library never gets large enough for it to matter).
-# Set --trove-k 1 for a cheaper run without self-consistency sampling.
+# Set --trove-k 1 for a cheaper run without per-mode K-sampling.
 
 set -euo pipefail
 
 cd "$(dirname "${BASH_SOURCE[0]}")/.."
 
-export PORT=8002
+export PORT="${PORT:-8000}"
 export VLLM_API_KEY="${VLLM_API_KEY:-EMPTY}"
 mkdir -p outputs
 
+TASKS_FILE="${TASKS_FILE:-data/pbebench/lite_pilot_tasks.jsonl}"
+OUT_FILE="${OUT_FILE:-outputs/trove_pbebench_lite_pilot.jsonl}"
+
+echo "Tasks  : ${TASKS_FILE}"
+echo "Output : ${OUT_FILE}"
+echo "Port   : ${PORT}"
+
 python main.py \
-  --framework      trove \
-  --tasks-file     data/pbebench/lite_tasks_full.jsonl \
-  --base-url       "http://localhost:${PORT}/v1" \
-  --model          "openai/gpt-oss-120b" \
-  --trove-k        5 \
-  --trove-trim-every 9999 \
-  --default-reward pbebench \
-  --output-file    outputs/pbebench_lite_full_trove.jsonl \
-  --debug-dir      debug_trove \
-  --stats
+  --framework         trove \
+  --tasks-file        "${TASKS_FILE}" \
+  --base-url          "http://localhost:${PORT}/v1" \
+  --model             "openai/gpt-oss-20b" \
+  --trove-task-family pbebench \
+  --trove-selection   reward \
+  --trove-k           5 \
+  --trove-trim-every  9999 \
+  --default-reward    pbebench \
+  --max-programs      5 \
+  --output-file       "${OUT_FILE}" \
+  --debug-dir         debug_trove \
+  --stats \
+  "$@"
+
+echo "Done. Output: ${OUT_FILE}"
+echo "Analyze with: python scripts/analyze_trove_run.py ${OUT_FILE}"

From 4ce48acb55d53e2ad4ecc7b4a2536ccac1047705 Mon Sep 17 00:00:00 2001
From: mathuryash5 <mathuryash5@gmail.com>
Date: Sat, 25 Apr 2026 19:58:24 -0400
Subject: [PATCH 18/24] chore(trove-notebook): add tail_vllm_log helper and
 mirror run output to disk

Adds a tail_vllm_log() cell so the latest vllm_logs/vllm_*.log can be
spot-checked during a long run, and tees the TroVE run cell's stdout
into outputs/trove_pbebench_lite_smoke_<ts>.log so logs survive a
disconnected browser session.

Made-with: Cursor
---
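The tee pattern added to the run cell, pulled out as a function so it can be
tried against a trivial child process. `run_and_tee` is a hypothetical name.
The body follows the cell in this patch, with `-u` passed to the child so its
output is not held back by buffering.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_and_tee(cmd: list[str], log_path: Path) -> int:
    """Stream a child's stdout+stderr to this process and to log_path."""
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        bufsize=1,  # line-buffered on the parent side
    )
    try:
        with open(log_path, "w", encoding="utf-8") as logfh:
            for line in proc.stdout:
                print(line, end="")
                logfh.write(line)
                logfh.flush()  # keep the file current if the session drops
    finally:
        rc = proc.wait()
    return rc


with tempfile.TemporaryDirectory() as tmp:
    demo_log = Path(tmp) / "demo.log"
    child = [sys.executable, "-u", "-c", "print('hello'); print('world')"]
    rc = run_and_tee(child, demo_log)
    logged = demo_log.read_text(encoding="utf-8")
```

Flushing after every line costs a little throughput but means the log on disk
is never more than one line behind the cell output.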
 notebooks/run_trove_pbebench.ipynb | 49 +++++++++++++++++++++++++-----
 1 file changed, 41 insertions(+), 8 deletions(-)

diff --git a/notebooks/run_trove_pbebench.ipynb b/notebooks/run_trove_pbebench.ipynb
index e4648348..83c585f9 100644
--- a/notebooks/run_trove_pbebench.ipynb
+++ b/notebooks/run_trove_pbebench.ipynb
@@ -196,7 +196,31 @@
         "    raise RuntimeError(\"vLLM never became ready\")"
       ],
       "execution_count": null,
-      "outputs": []
+      "outputs": [],
+      "id": "b985cb11"
+    },
+    {
+      "cell_type": "code",
+      "metadata": {},
+      "source": [
+        "# Optional: peek at the most recent vLLM server log. Re-run this cell any time\n",
+        "# (during or after the TroVE run) to spot-check throughput / GPU memory / errors.\n",
+        "def tail_vllm_log(n: int = 80) -> None:\n",
+        "    logs = sorted(VLLM_LOGS.glob(\"vllm_*.log\"))\n",
+        "    if not logs:\n",
+        "        print(\"No vllm logs found yet.\")\n",
+        "        return\n",
+        "    latest = logs[-1]\n",
+        "    text = latest.read_text(errors=\"replace\")\n",
+        "    lines = text.splitlines()\n",
+        "    print(f\"=== {latest.name}  (last {min(n, len(lines))} of {len(lines)} lines) ===\")\n",
+        "    print(\"\\n\".join(lines[-n:]))\n",
+        "\n",
+        "tail_vllm_log(60)"
+      ],
+      "execution_count": null,
+      "outputs": [],
+      "id": "e1bec107"
     },
     {
       "cell_type": "markdown",
@@ -217,12 +241,12 @@
       "cell_type": "code",
       "metadata": {},
       "source": [
-        "import subprocess, sys\n",
+        "import subprocess, sys, datetime\n",
         "\n",
         "os.environ[\"VLLM_API_KEY\"] = os.environ.get(\"VLLM_API_KEY\", \"EMPTY\")\n",
         "\n",
         "cmd = [\n",
-        "    sys.executable, \"main.py\",\n",
+        "    sys.executable, \"-u\", \"main.py\",\n",
         "    \"--framework\",        \"trove\",\n",
         "    \"--backend\",          \"vllm\",\n",
         "    \"--base-url\",         BASE_URL,\n",
@@ -238,17 +262,26 @@
         "    \"--debug-dir\",         str(DEBUG_DIR),\n",
         "]\n",
         "\n",
+        "# Mirror stdout+stderr to a log file as well as the cell output. This keeps a\n",
+        "# durable record if the browser tab disconnects mid-run, and makes it trivial\n",
+        "# to grep for telemetry across runs.\n",
+        "RUN_TS = datetime.datetime.now().strftime(\"%Y%m%d_%H%M%S\")\n",
+        "RUN_LOG = OUT_DIR / f\"trove_pbebench_lite_smoke_{RUN_TS}.log\"\n",
+        "\n",
         "print(\" \".join(cmd))\n",
-        "print()\n",
+        "print(f\"\\nMirroring stdout to: {RUN_LOG}\\n\")\n",
         "\n",
-        "# Stream stdout/stderr live.\n",
         "proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True, bufsize=1)\n",
         "try:\n",
-        "    for line in proc.stdout:\n",
-        "        print(line, end=\"\")\n",
+        "    with open(RUN_LOG, \"w\", encoding=\"utf-8\") as logfh:\n",
+        "        for line in proc.stdout:\n",
+        "            print(line, end=\"\")\n",
+        "            logfh.write(line)\n",
+        "            logfh.flush()\n",
         "finally:\n",
         "    rc = proc.wait()\n",
-        "print(f\"\\nmain.py exited with {rc}\")"
+        "print(f\"\\nmain.py exited with {rc}\")\n",
+        "print(f\"Full log: {RUN_LOG}\")"
       ],
       "execution_count": null,
       "outputs": [],

From 64a930eac799ddbc06f1dcd905f5fed009c3170d Mon Sep 17 00:00:00 2001
From: mathuryash5 <mathuryash5@gmail.com>
Date: Sat, 25 Apr 2026 23:14:32 -0400
Subject: [PATCH 19/24] fix(trove): read vLLM gpt-oss responses from reasoning
 field

vLLM exposes gpt-oss text in message.reasoning when content is empty, so
TroVE was parsing empty generations and producing blank solutions. Add a
shared extractor and regression test for the OpenAI/vLLM response shape.

Made-with: Cursor
---
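The regression test in this patch covers only the `reasoning` attribute. The
sketch below restates the extractor so it runs on its own and walks the full
lookup order, including the `model_extra` path. `message_text` mirrors
`_message_text` from the diff. The sample messages are made up for
illustration.

```python
from types import SimpleNamespace
from typing import Any


def message_text(msg: Any) -> str:
    """Standalone copy of the extractor's lookup order, for illustration."""
    content = getattr(msg, "content", None)
    if content:
        return content
    for field in ("reasoning_content", "reasoning"):
        value = getattr(msg, field, None)
        if value:
            return value
    extra = getattr(msg, "model_extra", None) or {}
    if isinstance(extra, dict):
        for field in ("reasoning_content", "reasoning"):
            value = extra.get(field)
            if value:
                return value
    return ""


cases = {
    "content wins": SimpleNamespace(content="answer", reasoning="thinking"),
    "reasoning_content first": SimpleNamespace(
        content="", reasoning_content="rc", reasoning="r"
    ),
    "model_extra fallback": SimpleNamespace(
        content=None, model_extra={"reasoning": "from extra"}
    ),
    "nothing usable": SimpleNamespace(content=None),
}
results = {label: message_text(msg) for label, msg in cases.items()}
for label, text in results.items():
    print(f"{label:24s} -> {text!r}")
```

One consequence worth keeping in mind: when `content` is empty, whatever sits
in the reasoning field is handed to the parser as if it were the answer.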
 symbolic_agent/baselines/trove/llm.py         | 22 ++++++++++-
 .../trove/tests/test_llm_openai_response.py   | 37 +++++++++++++++++++
 2 files changed, 57 insertions(+), 2 deletions(-)
 create mode 100644 symbolic_agent/baselines/trove/tests/test_llm_openai_response.py

diff --git a/symbolic_agent/baselines/trove/llm.py b/symbolic_agent/baselines/trove/llm.py
index dda158eb..49ea2c35 100644
--- a/symbolic_agent/baselines/trove/llm.py
+++ b/symbolic_agent/baselines/trove/llm.py
@@ -26,6 +26,24 @@
 DEFAULT_MAX_TOKENS = 512
 
 
+def _message_text(msg: Any) -> str:
+    """Return visible text from OpenAI/vLLM chat message variants."""
+    content = getattr(msg, "content", None)
+    if content:
+        return content
+    for field in ("reasoning_content", "reasoning"):
+        value = getattr(msg, field, None)
+        if value:
+            return value
+    extra = getattr(msg, "model_extra", None) or {}
+    if isinstance(extra, dict):
+        for field in ("reasoning_content", "reasoning"):
+            value = extra.get(field)
+            if value:
+                return value
+    return ""
+
+
 class TroVELLMClient:
     """
     Backend-agnostic plain-text LLM client for TroVE generation.
@@ -190,7 +208,7 @@ def _call_openai(self, prompt: str, model: str, max_tokens: int, tag: str) -> st
                     # No response_format — TroVE uses free-form text
                 )
                 msg = response.choices[0].message
-                raw = msg.content or getattr(msg, "reasoning_content", "") or ""
+                raw = _message_text(msg)
                 u = getattr(response, "usage", None)
                 details = getattr(u, "completion_tokens_details", None)
                 usage = {
@@ -309,7 +327,7 @@ def chat_with_tools(
                 break
 
             msg = response.choices[0].message
-            content = msg.content or getattr(msg, "reasoning_content", "") or ""
+            content = _message_text(msg)
             tool_calls = getattr(msg, "tool_calls", None) or []
 
             u = getattr(response, "usage", None)
diff --git a/symbolic_agent/baselines/trove/tests/test_llm_openai_response.py b/symbolic_agent/baselines/trove/tests/test_llm_openai_response.py
new file mode 100644
index 00000000..8b193417
--- /dev/null
+++ b/symbolic_agent/baselines/trove/tests/test_llm_openai_response.py
@@ -0,0 +1,37 @@
+"""Unit tests for TroVELLMClient OpenAI/vLLM response extraction."""
+
+from types import SimpleNamespace
+
+from symbolic_agent.baselines.trove.llm import TroVELLMClient
+
+
+class _FakeCompletions:
+    def create(self, **kwargs):
+        msg = SimpleNamespace(content="", reasoning="**Solution**\n```python\nprint('ok')\n```")
+        usage = SimpleNamespace(prompt_tokens=1, completion_tokens=2, completion_tokens_details=None)
+        return SimpleNamespace(choices=[SimpleNamespace(message=msg)], usage=usage)
+
+
+class _FakeClient:
+    def __init__(self):
+        self.chat = SimpleNamespace(completions=_FakeCompletions())
+
+
+def _client_with_fake_openai_response():
+    client = object.__new__(TroVELLMClient)
+    client.backend = "openai"
+    client._client = _FakeClient()
+    client._task_log = []
+    client._task_tokens = {"input": 0, "output": 0, "reasoning": 0}
+    client._session_tokens = {"input": 0, "output": 0, "reasoning": 0}
+    client._debug_dir = None
+    return client
+
+
+def test_openai_call_reads_vllm_reasoning_field_when_content_empty():
+    client = _client_with_fake_openai_response()
+
+    raw = client._call_openai("prompt", "openai/gpt-oss-20b", 128, "tag")
+
+    assert "print('ok')" in raw
+    assert "print('ok')" in client.get_task_log()[0]["response"]["content"]

From aff5962c1a75bace916a95307f73b87d0b8ccc23 Mon Sep 17 00:00:00 2001
From: mathuryash5 <mathuryash5@gmail.com>
Date: Sat, 25 Apr 2026 23:46:14 -0400
Subject: [PATCH 20/24] fix(trove): make PBEBench prompts print replace program
 lists

PBEBench rewards parse stdout as a list of replace() call strings, but
the TroVE few-shots were demonstrating transformed output strings. Update
PBEBench CREATE, SKIP, and IMPORT-with-tools examples and add prompt
regression tests.

Made-with: Cursor
---
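To make the expected stdout shape concrete, here is one way a printed program
list could be parsed and replayed. This is a sketch, not the PBEBench
verifier, whose parsing code is not part of this patch. `parse_program` and
`apply_program` are hypothetical helpers, and only the
`["replace('a', 'b')", ...]` format is taken from the prompts.

```python
import ast
import re

_REPLACE_RE = re.compile(r"^replace\((?P<args>.*)\)$")


def parse_program(stdout: str) -> list[tuple[str, str]]:
    """Turn printed "replace('x', 'y')" strings into (old, new) pairs."""
    calls = ast.literal_eval(stdout.strip())
    pairs = []
    for call in calls:
        match = _REPLACE_RE.match(call.strip())
        if match is None:
            raise ValueError(f"not a replace() call: {call!r}")
        # Evaluate the two arguments as a literal tuple of strings.
        old, new = ast.literal_eval(f"({match.group('args')})")
        pairs.append((old, new))
    return pairs


def apply_program(text: str, pairs: list[tuple[str, str]]) -> str:
    """Apply each replacement in order, as a chain of str.replace calls."""
    for old, new in pairs:
        text = text.replace(old, new)
    return text


# What a Solution block would print under the updated few-shots.
stdout = str(["replace(' ', '_')", "replace('h', 'H')"]) + "\n"
pairs = parse_program(stdout)
transformed = apply_program("hello world", pairs)
```

Printing the transformed strings instead, as the old few-shots did, would fail
at the first `ast.literal_eval` or at the `replace(...)` match in a parser of
this shape.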
 symbolic_agent/baselines/trove/prompts.py     | 34 ++++++++-----------
 .../trove/tests/test_prompts_pbebench.py      | 33 ++++++++++++++++++
 2 files changed, 48 insertions(+), 19 deletions(-)
 create mode 100644 symbolic_agent/baselines/trove/tests/test_prompts_pbebench.py

diff --git a/symbolic_agent/baselines/trove/prompts.py b/symbolic_agent/baselines/trove/prompts.py
index 78be7add..0e000b69 100644
--- a/symbolic_agent/baselines/trove/prompts.py
+++ b/symbolic_agent/baselines/trove/prompts.py
@@ -26,9 +26,12 @@
     "Your answer is whatever gets printed to stdout when the Solution code runs."
 )
 
-# PBEBench prompts model the desired format directly via the few-shot example,
-# so no override string is needed.
-_FORMAT_OVERRIDE_PBEBENCH = ""
+_FORMAT_OVERRIDE_PBEBENCH = (
+    "\nIMPORTANT: For PBEBench, the answer printed by the **Solution** block "
+    "must be a Python list of replace() call strings, such as "
+    "[\"replace('a', 'b')\", \"replace('cd', 'ef')\"]. Do not print the "
+    "transformed output strings."
+)
 
 
 def _format_override(task_family: str) -> str:
@@ -136,8 +139,9 @@ def build_import_prompt(question: str, toolbox_str: str, task_family: str = "def
     "You task is to produce a list of replace() calls that transforms each "
     "input into its expected output for a Programming-by-Example task.\n"
     "You have a set of helper functions available as tools. Call any of them "
-    "to test ideas or compute intermediate results; the final answer must be "
-    "produced as a Python program in the **Solution** block."
+    "to test ideas or compute intermediate results; the final **Solution** "
+    "block must print the program sequence as a Python list of replace() call "
+    "strings, not the transformed outputs."
 )
 
 _IMPORT_WITH_TOOLS_EXAMPLE_DEFAULT = """\
@@ -167,8 +171,8 @@ def build_import_prompt(question: str, toolbox_str: str, task_family: str = "def
 
 **Solution**
 ```python
-result = find_replace_chain("hello world", [(" ", "_"), ("h", "H"), ("e", "E"), ("l", "L"), ("o", "O"), ("w", "W"), ("r", "R"), ("d", "D")])
-print(result)
+programs = ["replace(' ', '_')", "replace('h', 'H')", "replace('e', 'E')", "replace('l', 'L')", "replace('o', 'O')", "replace('w', 'W')", "replace('r', 'R')", "replace('d', 'D')"]
+print(programs)
 ```"""
 
 _IMPORT_WITH_TOOLS_TASK_TEMPLATE = """\
@@ -245,8 +249,8 @@ def apply_substitutions(strings, substitutions):
 
 **Solution**
 ```python
-result = find_replace_chain("hello world", [(" ", "_"), ("h", "H"), ("e", "E"), ("l", "L"), ("o", "O"), ("w", "W"), ("r", "R"), ("d", "D")])
-print(result)
+programs = ["replace(' ', '_')", "replace('h', 'H')", "replace('e', 'E')", "replace('l', 'L')", "replace('o', 'O')", "replace('w', 'W')", "replace('r', 'R')", "replace('d', 'D')"]
+print(programs)
 ```
 **Tools**
 ```python
@@ -311,16 +315,8 @@ def build_create_prompt(question: str, task_family: str = "default") -> str:
 
 **Solution**
 ```python
-s = "hello world"
-s = s.replace(" ", "_")
-s = s.replace("h", "H")
-s = s.replace("e", "E")
-s = s.replace("l", "L")
-s = s.replace("o", "O")
-s = s.replace("w", "W")
-s = s.replace("r", "R")
-s = s.replace("d", "D")
-print(s)
+programs = ["replace(' ', '_')", "replace('h', 'H')", "replace('e', 'E')", "replace('l', 'L')", "replace('o', 'O')", "replace('w', 'W')", "replace('r', 'R')", "replace('d', 'D')"]
+print(programs)
 ```
 **Tools**
 ```python
diff --git a/symbolic_agent/baselines/trove/tests/test_prompts_pbebench.py b/symbolic_agent/baselines/trove/tests/test_prompts_pbebench.py
new file mode 100644
index 00000000..4b5523de
--- /dev/null
+++ b/symbolic_agent/baselines/trove/tests/test_prompts_pbebench.py
@@ -0,0 +1,33 @@
+"""Regression tests for PBEBench-shaped TroVE prompts."""
+
+from symbolic_agent.baselines.trove.prompts import (
+    build_create_prompt,
+    build_import_with_tools_prompt,
+    build_skip_prompt,
+)
+
+
+def _assert_pbebench_prompt_prints_program_sequence(prompt: str) -> None:
+    assert "print(programs)" in prompt
+    assert "\"replace(' ', '_')\"" in prompt
+    assert "\"replace('h', 'H')\"" in prompt
+    assert "print(result)" not in prompt
+    assert "print(s)" not in prompt
+
+
+def test_pbebench_create_prompt_models_replace_program_list_stdout():
+    prompt = build_create_prompt("Task", task_family="pbebench")
+
+    _assert_pbebench_prompt_prints_program_sequence(prompt)
+
+
+def test_pbebench_skip_prompt_models_replace_program_list_stdout():
+    prompt = build_skip_prompt("Task", task_family="pbebench")
+
+    _assert_pbebench_prompt_prints_program_sequence(prompt)
+
+
+def test_pbebench_import_with_tools_prompt_models_replace_program_list_stdout():
+    prompt = build_import_with_tools_prompt("Task", task_family="pbebench")
+
+    _assert_pbebench_prompt_prints_program_sequence(prompt)

From 83528d578ed841a70b0699567f1950d930bb6eab Mon Sep 17 00:00:00 2001
From: mathuryash5 <mathuryash5@gmail.com>
Date: Sun, 26 Apr 2026 00:09:34 -0400
Subject: [PATCH 21/24] fix(trove): prefer reusable candidates on reward ties

Reward-based PBEBench selection was choosing tiny direct solutions over equally correct CREATE/IMPORT candidates that populate or use the toolbox. Prefer reusable functions and tool calls on reward ties, then fall back to smallest AST, and make PBEBench CREATE prompts require a helper in **Tools**.

Made-with: Cursor
---
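Note: the ordering described in the commit message (reward first, then
reuse, then smallest AST) can be read as a lexicographic sort key. The
sketch below is an illustration under assumptions, not the controller
code: `count_nodes`, `selection_key`, and `pick_best` are hypothetical
names, rewards are assumed to be precomputed, and the treatment of
unparsable code is a guess at what a node counter might do.

```python
import ast


def count_nodes(code: str) -> int:
    """Hypothetical stand-in for an AST-size helper."""
    try:
        return sum(1 for _ in ast.walk(ast.parse(code)))
    except SyntaxError:
        # Assumption: unparsable code is treated as very large.
        return 10**9


def selection_key(reward: float, candidate: dict) -> tuple:
    """Higher reward wins, then higher reuse, then smaller AST."""
    functions = candidate.get("functions") or []
    tool_calls = candidate.get("tool_calls") or []
    tool_names = {
        tc["name"] for tc in tool_calls if isinstance(tc, dict) and tc.get("name")
    }
    reuse = len(functions) + len(tool_names)
    # Negate the AST size so that max() prefers the smaller program.
    return (reward, reuse, -count_nodes(candidate.get("solution_code", "")))


def pick_best(candidates: list[dict], rewards: list[float]) -> int:
    """Return the index of the preferred candidate."""
    return max(
        range(len(candidates)),
        key=lambda i: selection_key(rewards[i], candidates[i]),
    )
```

Like the loop in the diff, `max` keeps the earliest candidate when every
component ties. The sketch omits the control-token stripping that the
real `_reuse_signal` applies to tool names.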
 symbolic_agent/baselines/trove/controller.py  | 26 +++++-
 symbolic_agent/baselines/trove/prompts.py     | 15 +++-
 .../trove/tests/test_controller_selection.py  | 86 +++++++++++++++++++
 .../trove/tests/test_prompts_pbebench.py      |  2 +
 4 files changed, 127 insertions(+), 2 deletions(-)
 create mode 100644 symbolic_agent/baselines/trove/tests/test_controller_selection.py

diff --git a/symbolic_agent/baselines/trove/controller.py b/symbolic_agent/baselines/trove/controller.py
index 149f2a28..d64c638c 100644
--- a/symbolic_agent/baselines/trove/controller.py
+++ b/symbolic_agent/baselines/trove/controller.py
@@ -464,6 +464,7 @@ def _select_best_by_reward(
         """Reward-based candidate selection. Returns (best_index, (reward, message))."""
         best_idx = 0
         best_reward = -1.0
+        best_reuse = -1
         best_ast = float("inf")
         best_message = ""
         for i, c in enumerate(candidates):
@@ -475,13 +476,36 @@ def _select_best_by_reward(
                 logger.debug("Reward scoring error for candidate %d: %s", i, exc)
                 score, msg = 0.0, str(exc)
             ast_size = count_ast_nodes(c.get("solution_code", ""))
-            if score > best_reward or (score == best_reward and ast_size < best_ast):
+            reuse_signal = self._reuse_signal(c)
+            if (
+                score > best_reward
+                or (
+                    score == best_reward
+                    and (
+                        reuse_signal > best_reuse
+                        or (reuse_signal == best_reuse and ast_size < best_ast)
+                    )
+                )
+            ):
                 best_idx = i
                 best_reward = score
+                best_reuse = reuse_signal
                 best_ast = ast_size
                 best_message = msg
         return best_idx, (best_reward, best_message)
 
+    @staticmethod
+    def _reuse_signal(candidate: dict) -> int:
+        """Tie-break signal: candidate function count plus distinct tools called."""
+        functions = candidate.get("functions") or []
+        tool_calls = candidate.get("tool_calls") or []
+        unique_tool_names = {
+            (tc.get("name") or "").split("<|", 1)[0].strip()
+            for tc in tool_calls
+            if isinstance(tc, dict) and tc.get("name")
+        }
+        return len(functions) + len({name for name in unique_tool_names if name})
+
     def _select_best_by_consistency(self, candidates: List[dict]) -> int:
         """
         Original TroVE self-consistency selection (majority vote on stdout).
diff --git a/symbolic_agent/baselines/trove/prompts.py b/symbolic_agent/baselines/trove/prompts.py
index 0e000b69..43922948 100644
--- a/symbolic_agent/baselines/trove/prompts.py
+++ b/symbolic_agent/baselines/trove/prompts.py
@@ -214,6 +214,14 @@ def build_import_with_tools_prompt(question: str, task_family: str = "default")
     "if you believe the function can be reused to solve other questions."
 )
 
+_CREATE_INSTRUCTION_PBEBENCH = (
+    "Your task is to write Python program solutions to the given questions.\n"
+    "In CREATE mode, you must define at least one reusable helper function "
+    "inside a **Tools** code block. The **Solution** block should use or "
+    "accompany that helper as appropriate, but the printed answer must remain "
+    "a Python list of replace() call strings."
+)
+
 _CREATE_EXAMPLE_DEFAULT = """\
 ## Example
 **Question**
@@ -272,7 +280,12 @@ def find_replace_chain(s, pairs):
 
 def build_create_prompt(question: str, task_family: str = "default") -> str:
     """Build the CREATE-mode prompt for a single task."""
-    instruction = _CREATE_INSTRUCTION_DEFAULT + _format_override(task_family)
+    create_instruction = (
+        _CREATE_INSTRUCTION_PBEBENCH
+        if task_family == "pbebench"
+        else _CREATE_INSTRUCTION_DEFAULT
+    )
+    instruction = create_instruction + _format_override(task_family)
     example = _CREATE_EXAMPLE_PBEBENCH if task_family == "pbebench" else _CREATE_EXAMPLE_DEFAULT
     return (
         instruction
diff --git a/symbolic_agent/baselines/trove/tests/test_controller_selection.py b/symbolic_agent/baselines/trove/tests/test_controller_selection.py
new file mode 100644
index 00000000..10d1e2f8
--- /dev/null
+++ b/symbolic_agent/baselines/trove/tests/test_controller_selection.py
@@ -0,0 +1,86 @@
+"""Unit tests for TroVE candidate selection."""
+
+from symbolic_agent.baselines.trove.controller import TroVEController
+
+
+def _reward(output, is_success, entry):
+    return {"value": 1.0 if is_success else 0.0, "message": ""}
+
+
+def _controller():
+    controller = object.__new__(TroVEController)
+    controller.selection = "reward"
+    return controller
+
+
+def test_reward_tie_prefers_candidate_that_adds_reusable_functions():
+    candidates = [
+        {
+            "solution_code": "programs = [\"replace('a','b')\"]\nprint(programs)",
+            "exec_output": "[\"replace('a','b')\"]",
+            "is_success": True,
+            "functions": [],
+        },
+        {
+            "solution_code": (
+                "programs = infer_programs(['a'], ['b'])\n"
+                "print(programs)\n"
+                "def helper_for_ast_size():\n"
+                "    return 1\n"
+            ),
+            "exec_output": "[\"replace('a','b')\"]",
+            "is_success": True,
+            "functions": [{"name": "infer_programs"}],
+        },
+    ]
+
+    idx, score = _controller()._select_best_by_reward(candidates, _reward, {})
+
+    assert idx == 1
+    assert score == (1.0, "")
+
+
+def test_reward_tie_prefers_candidate_that_called_import_tools():
+    candidates = [
+        {
+            "solution_code": "programs = [\"replace('a','b')\"]\nprint(programs)",
+            "exec_output": "[\"replace('a','b')\"]",
+            "is_success": True,
+            "functions": [],
+            "tool_calls": [],
+        },
+        {
+            "solution_code": "programs = infer_programs(['a'], ['b'])\nprint(programs)",
+            "exec_output": "[\"replace('a','b')\"]",
+            "is_success": True,
+            "functions": [],
+            "tool_calls": [{"name": "infer_programs"}],
+        },
+    ]
+
+    idx, score = _controller()._select_best_by_reward(candidates, _reward, {})
+
+    assert idx == 1
+    assert score == (1.0, "")
+
+
+def test_reward_tie_uses_smallest_ast_when_reuse_signal_matches():
+    candidates = [
+        {
+            "solution_code": "x = 1\ny = 2\nprograms = [\"replace('a','b')\"]\nprint(programs)",
+            "exec_output": "[\"replace('a','b')\"]",
+            "is_success": True,
+            "functions": [],
+        },
+        {
+            "solution_code": "programs = [\"replace('a','b')\"]\nprint(programs)",
+            "exec_output": "[\"replace('a','b')\"]",
+            "is_success": True,
+            "functions": [],
+        },
+    ]
+
+    idx, score = _controller()._select_best_by_reward(candidates, _reward, {})
+
+    assert idx == 1
+    assert score == (1.0, "")
diff --git a/symbolic_agent/baselines/trove/tests/test_prompts_pbebench.py b/symbolic_agent/baselines/trove/tests/test_prompts_pbebench.py
index 4b5523de..09cc64d3 100644
--- a/symbolic_agent/baselines/trove/tests/test_prompts_pbebench.py
+++ b/symbolic_agent/baselines/trove/tests/test_prompts_pbebench.py
@@ -19,6 +19,8 @@ def test_pbebench_create_prompt_models_replace_program_list_stdout():
     prompt = build_create_prompt("Task", task_family="pbebench")
 
     _assert_pbebench_prompt_prints_program_sequence(prompt)
+    assert "must define at least one reusable helper function" in prompt
+    assert "**Tools**" in prompt
 
 
 def test_pbebench_skip_prompt_models_replace_program_list_stdout():

From 94bc0d107db5489c80dda7af6bc170b24665fca5 Mon Sep 17 00:00:00 2001
From: mathuryash5 <mathuryash5@gmail.com>
Date: Mon, 27 Apr 2026 16:38:53 -0400
Subject: [PATCH 22/24] fix(trove): encourage reusable PBEBench helpers

Made-with: Cursor
---
 .../baselines/trove/docs/running.md           | 149 ++++++++++++++++++
 symbolic_agent/baselines/trove/prompts.py     |  32 +++-
 .../trove/tests/test_prompts_pbebench.py      |  14 ++
 3 files changed, 189 insertions(+), 6 deletions(-)
 create mode 100644 symbolic_agent/baselines/trove/docs/running.md

diff --git a/symbolic_agent/baselines/trove/docs/running.md b/symbolic_agent/baselines/trove/docs/running.md
new file mode 100644
index 00000000..704ab74b
--- /dev/null
+++ b/symbolic_agent/baselines/trove/docs/running.md
@@ -0,0 +1,149 @@
+# Running TroVE on PBEBench-Lite
+
+This guide covers launching the TroVE baseline against `openai/gpt-oss-20b`
+served by vLLM. There are two paths:
+
+- **Notebook (recommended on RunPod)** — `notebooks/run_trove_pbebench.ipynb`
+  drives the whole flow (env setup → vLLM launch → TroVE run → analysis) from
+  one place and mirrors logs to disk.
+- **Shell scripts** — for SSH / tmux workflows where a notebook is awkward.
+
+Both paths assume an L40S/H100-class GPU with ≥40 GB VRAM and ≥40 GB free disk
+for the model cache.
+
+---
+
+## 0. Prerequisites
+
+- `vLLM >= 0.16.0` — earlier versions do not ship the gpt-oss reasoning parser
+  or auto tool-choice support.
+- `typing_extensions >= 4.12.2` — older versions break vLLM startup with
+  `cannot import name 'TypeIs' from typing_extensions`.
+- `huggingface_hub` with a working transfer backend. If `xet` errors during
+  download, set `HF_HUB_DISABLE_XET=1`.
+- `HF_HOME` pointed at a persistent volume (e.g. `/workspace/hf-cache`) so the
+  model is not re-downloaded across container restarts.
+
+Quick install / repair on a fresh container:
+
+```bash
+python -m pip install -U "typing_extensions>=4.12.2" \
+                        "huggingface_hub[hf_transfer]" hf_xet
+```
+
+---
+
+## 1. Notebook path (RunPod)
+
+```bash
+git clone <repo-url> /workspace/symbolic-library-agent
+cd /workspace/symbolic-library-agent
+jupyter lab --ip=0.0.0.0 --port=8888 --no-browser --allow-root
+```
+
+Then open `notebooks/run_trove_pbebench.ipynb` and run the cells top-to-bottom:
+
+1. **Env / cache setup** — sets `HF_HOME=/workspace/hf-cache` and disables xet.
+2. **`pip install` cell** — refreshes `typing_extensions` and HF transfer.
+3. **Launch vLLM** — backgrounds `scripts/launch_vllm_gpt_oss_20b.sh 8000` and
+   tails `vllm_logs/`.
+4. **Wait for server ready** — polls `/v1/models` until 200 OK.
+5. **`tail_vllm_log(60)` helper** — re-runnable cell for spot-checking the
+   server log at any time.
+6. **Run TroVE** — `subprocess.Popen` of `main.py` with the PBEBench-Lite
+   pilot tasks. Stdout is mirrored to `outputs/trove_pbebench_lite_smoke_<ts>.log`
+   on disk in addition to the cell output, so you can SSH in and `tail -f` the
+   run from another shell.
+7. **Analyze** — calls `scripts/analyze_trove_run.py` on the output JSONL.
+
+If the notebook cell stops responding, do **not** `pkill -f "main.py"` —
+that pattern can match the vLLM process tree on some images. Instead:
+
+```bash
+ps -ef | awk '/python .*[m]ain\.py/ && /--framework/ && /trove/ {print $2}' \
+       | xargs -r kill
+```
+
+---
+
+## 2. Shell-script path
+
+Two scripts; run them in two terminals (or one tmux session with two panes).
+
+### 2a. Launch vLLM
+
+```bash
+cd /workspace/symbolic-library-agent
+mkdir -p vllm_logs
+bash scripts/launch_vllm_gpt_oss_20b.sh 8000
+# logs: vllm_logs/vllm_8000_<timestamp>.log
+# pid : vllm_logs/vllm_8000_<timestamp>.pid
+```
+
+The script forwards three flags that are required for our IMPORT-with-tools
+branch to work:
+
+- `--enable-auto-tool-choice`
+- `--tool-call-parser openai`
+- `--reasoning-parser openai_gptoss`
+
+Wait for `Application startup complete` in the log before continuing.
+
+### 2b. Run TroVE
+
+```bash
+PORT=8000 bash scripts/run_trove_vllm.sh
+```
+
+Defaults (overridable via env vars or trailing flags):
+
+| Env var      | Default                                   |
+| ------------ | ----------------------------------------- |
+| `PORT`       | `8000`                                    |
+| `TASKS_FILE` | `data/pbebench/lite_pilot_tasks.jsonl`    |
+| `OUT_FILE`   | `outputs/trove_pbebench_lite_pilot.jsonl` |
+
+Pass through any extra `main.py` flag, e.g.:
+
+```bash
+PORT=8000 bash scripts/run_trove_vllm.sh --num-tasks 10  # quick sanity run
+```
+
+### 2c. Analyze
+
+```bash
+python scripts/analyze_trove_run.py outputs/trove_pbebench_lite_pilot.jsonl
+```
+
+Reports overall accuracy, final toolbox size, per-mode wins, IMPORT-mode
+tool-call success rate, and the top-10 most-called toolbox functions.
+
+---
+
+## 3. Key flags (cheat sheet)
+
+The TroVE-specific flags on `main.py` matter most:
+
+| Flag                  | Default      | Purpose                                                 |
+| --------------------- | ------------ | ------------------------------------------------------- |
+| `--framework`         | —            | Set to `trove`                                          |
+| `--trove-task-family` | `default`    | Set to `pbebench` to enable PBEBench few-shots & parser |
+| `--trove-selection`   | `reward`     | `reward` (PBEBench) or `consistency` (original TroVE)   |
+| `--trove-k`           | `5`          | Candidates per mode (1 disables sampling)               |
+| `--trove-trim-every`  | `100`        | Set high (`9999`) for ≤100-task pilots                  |
+| `--default-reward`    | —            | Set to `pbebench` for the PBEBench verifier             |
+| `--max-programs`      | `5`          | PBEBench program-list length cap                        |
+
+---
+
+## 4. Resuming and cleanup
+
+- Resume: re-run the same command. `main.py` checkpoints to the output
+  JSONL; if both the JSONL and `--debug-dir` are intact, it skips task
+  indices that are already completed.
+- Force-restart: delete the output JSONL before running.
+- vLLM cleanup:
+  ```bash
+  cat vllm_logs/vllm_8000_*.pid 2>/dev/null | xargs -r kill 2>/dev/null || true
+  pkill -f vllm.entrypoints.openai.api_server  # safe — only matches vLLM
+  ```
diff --git a/symbolic_agent/baselines/trove/prompts.py b/symbolic_agent/baselines/trove/prompts.py
index 43922948..60d6770e 100644
--- a/symbolic_agent/baselines/trove/prompts.py
+++ b/symbolic_agent/baselines/trove/prompts.py
@@ -219,7 +219,12 @@ def build_import_with_tools_prompt(question: str, task_family: str = "default")
     "In CREATE mode, you must define at least one reusable helper function "
     "inside a **Tools** code block. The **Solution** block should use or "
     "accompany that helper as appropriate, but the printed answer must remain "
-    "a Python list of replace() call strings."
+    "a Python list of replace() call strings.\n"
+    "Prefer general helpers that any PBEBench task could reuse (e.g. parsing a "
+    "replace() call string, applying a candidate program list to inputs, or "
+    "scoring a program list against input/output pairs). If a helper that "
+    "already exists in the toolbox would solve this question, reuse it via "
+    "IMPORT mode instead of defining a near-duplicate here."
 )
 
 _CREATE_EXAMPLE_DEFAULT = """\
@@ -262,11 +267,26 @@ def apply_substitutions(strings, substitutions):
 ```
 **Tools**
 ```python
-def find_replace_chain(s, pairs):
-    \"\"\"Apply a chain of (old, new) replacements to a string.\"\"\"
-    for old, new in pairs:
-        s = s.replace(old, new)
-    return s
+import ast
+
+def parse_replace_call(call_str):
+    \"\"\"Parse a 'replace(old, new)' string into an (old, new) tuple of literals.\"\"\"
+    expr = ast.parse(call_str.strip(), mode="eval").body
+    old = ast.literal_eval(expr.args[0])
+    new = ast.literal_eval(expr.args[1])
+    return old, new
+
+def score_programs(programs, examples):
+    \"\"\"Return the fraction of (input, output) examples that `programs` reproduces.\"\"\"
+    pairs = [parse_replace_call(p) for p in programs]
+    correct = 0
+    for inp, expected in examples:
+        s = inp
+        for old, new in pairs:
+            s = s.replace(old, new)
+        if s == expected:
+            correct += 1
+    return correct / len(examples) if examples else 0.0
 ```"""
 
 _CREATE_TASK_TEMPLATE = """\
diff --git a/symbolic_agent/baselines/trove/tests/test_prompts_pbebench.py b/symbolic_agent/baselines/trove/tests/test_prompts_pbebench.py
index 09cc64d3..d4fcc8d3 100644
--- a/symbolic_agent/baselines/trove/tests/test_prompts_pbebench.py
+++ b/symbolic_agent/baselines/trove/tests/test_prompts_pbebench.py
@@ -23,6 +23,20 @@ def test_pbebench_create_prompt_models_replace_program_list_stdout():
     assert "**Tools**" in prompt
 
 
+def test_pbebench_create_prompt_uses_pbebench_specific_helpers():
+    prompt = build_create_prompt("Task", task_family="pbebench")
+
+    assert "def parse_replace_call" in prompt
+    assert "def score_programs" in prompt
+    assert "def find_replace_chain" not in prompt
+
+
+def test_pbebench_create_prompt_warns_against_duplicating_existing_tools():
+    prompt = build_create_prompt("Task", task_family="pbebench")
+
+    assert "already exists" in prompt or "duplicate" in prompt.lower()
+
+
 def test_pbebench_skip_prompt_models_replace_program_list_stdout():
     prompt = build_skip_prompt("Task", task_family="pbebench")
 

From 4352d31519d042ceac9c8324cce0375979ffaf69 Mon Sep 17 00:00:00 2001
From: mathuryash5 <mathuryash5@gmail.com>
Date: Wed, 29 Apr 2026 12:22:41 -0400
Subject: [PATCH 23/24] fix(trove): show PBEBench helper signatures in CREATE
 prompt

Made-with: Cursor
---
 .../baselines/trove/docs/deviations.md        |  6 ++-
 symbolic_agent/baselines/trove/prompts.py     | 41 ++++++++-----------
 .../trove/tests/test_prompts_pbebench.py      | 12 ++++--
 3 files changed, 30 insertions(+), 29 deletions(-)

diff --git a/symbolic_agent/baselines/trove/docs/deviations.md b/symbolic_agent/baselines/trove/docs/deviations.md
index 5ce60482..fda9d359 100644
--- a/symbolic_agent/baselines/trove/docs/deviations.md
+++ b/symbolic_agent/baselines/trove/docs/deviations.md
@@ -30,8 +30,10 @@ self-consistency selector remains available via `--trove-selection consistency`.
 ### 1.3 PBEBench-shaped few-shot examples
 For `task_family="pbebench"` we replace the generic CREATE / SKIP / IMPORT
 example pairs with PBEBench-shaped pairs that demonstrate `replace()`
-chains and a small reusable helper (`find_replace_chain`). The legacy
-default examples remain for `task_family="default"`.
+chains. CREATE mode also shows signature-only examples of reusable helper
+shapes (apply, score, search, prune, debug, end-to-end solve) instead of
+full function definitions, to reduce anchoring on a single copied helper.
+The legacy default examples remain for `task_family="default"`.
 
 ### 1.4 Strict **Solution** parsing for PBEBench
 The legacy parser falls back to "first ```python``` block anywhere" when
diff --git a/symbolic_agent/baselines/trove/prompts.py b/symbolic_agent/baselines/trove/prompts.py
index 60d6770e..7058cae2 100644
--- a/symbolic_agent/baselines/trove/prompts.py
+++ b/symbolic_agent/baselines/trove/prompts.py
@@ -224,7 +224,10 @@ def build_import_with_tools_prompt(question: str, task_family: str = "default")
     "replace() call string, applying a candidate program list to inputs, or "
     "scoring a program list against input/output pairs). If a helper that "
     "already exists in the toolbox would solve this question, reuse it via "
-    "IMPORT mode instead of defining a near-duplicate here."
+    "IMPORT mode instead of defining a near-duplicate here.\n"
+    "The helper signatures below are examples of useful tool shapes, not "
+    "definitions to copy. If you create a helper, implement the complete "
+    "function body in **Tools**."
 )
 
 _CREATE_EXAMPLE_DEFAULT = """\
@@ -255,6 +258,18 @@ def apply_substitutions(strings, substitutions):
 ```"""
 
 _CREATE_EXAMPLE_PBEBENCH = """\
+## Reusable helper signatures
+These are example shapes for reusable PBEBench tools. Do not copy `...` stubs
+as real tools; implement complete helpers when you decide to create one.
+```python
+def apply_programs(s, programs): ...
+def score_programs(programs, examples): ...
+def search_candidate_programs(examples, max_programs=5): ...
+def prune_search_state(partial_programs, examples): ...
+def debug_program_failure(programs, examples): ...
+def solve_examples(examples, max_programs=5): ...
+```
+
 ## Example
 **Question**
 Produce a sequence of replace() calls that transforms "hello world" into
@@ -265,29 +280,7 @@ def apply_substitutions(strings, substitutions):
 programs = ["replace(' ', '_')", "replace('h', 'H')", "replace('e', 'E')", "replace('l', 'L')", "replace('o', 'O')", "replace('w', 'W')", "replace('r', 'R')", "replace('d', 'D')"]
 print(programs)
 ```
-**Tools**
-```python
-import ast
-
-def parse_replace_call(call_str):
-    \"\"\"Parse a 'replace(old, new)' string into an (old, new) tuple of literals.\"\"\"
-    expr = ast.parse(call_str.strip(), mode="eval").body
-    old = ast.literal_eval(expr.args[0])
-    new = ast.literal_eval(expr.args[1])
-    return old, new
-
-def score_programs(programs, examples):
-    \"\"\"Return the fraction of (input, output) examples that `programs` reproduces.\"\"\"
-    pairs = [parse_replace_call(p) for p in programs]
-    correct = 0
-    for inp, expected in examples:
-        s = inp
-        for old, new in pairs:
-            s = s.replace(old, new)
-        if s == expected:
-            correct += 1
-    return correct / len(examples) if examples else 0.0
-```"""
+"""
 
 _CREATE_TASK_TEMPLATE = """\
 ## Task
diff --git a/symbolic_agent/baselines/trove/tests/test_prompts_pbebench.py b/symbolic_agent/baselines/trove/tests/test_prompts_pbebench.py
index d4fcc8d3..9d0685ad 100644
--- a/symbolic_agent/baselines/trove/tests/test_prompts_pbebench.py
+++ b/symbolic_agent/baselines/trove/tests/test_prompts_pbebench.py
@@ -23,12 +23,18 @@ def test_pbebench_create_prompt_models_replace_program_list_stdout():
     assert "**Tools**" in prompt
 
 
-def test_pbebench_create_prompt_uses_pbebench_specific_helpers():
+def test_pbebench_create_prompt_suggests_pbebench_helper_signatures():
     prompt = build_create_prompt("Task", task_family="pbebench")
 
-    assert "def parse_replace_call" in prompt
-    assert "def score_programs" in prompt
+    assert "Reusable helper signatures" in prompt
+    assert "def apply_programs(s, programs): ..." in prompt
+    assert "def score_programs(programs, examples): ..." in prompt
+    assert "def search_candidate_programs(examples, max_programs=5): ..." in prompt
+    assert "def debug_program_failure(programs, examples): ..." in prompt
     assert "def find_replace_chain" not in prompt
+    assert "import ast" not in prompt
+    assert "ast.parse" not in prompt
+    assert "return correct / len(examples)" not in prompt
 
 
 def test_pbebench_create_prompt_warns_against_duplicating_existing_tools():

From 51edc0c6f5375a7a6499d5316abcfe8b0bf37c0e Mon Sep 17 00:00:00 2001
From: mathuryash5 <mathuryash5@gmail.com>
Date: Thu, 30 Apr 2026 13:40:38 -0400
Subject: [PATCH 24/24] chore(trove): remove superpowers planning docs

Made-with: Cursor
---
 .../2026-04-25-trove-native-tool-calling.md   | 2274 -----------------
 ...-04-25-trove-native-tool-calling-design.md |  374 ---
 2 files changed, 2648 deletions(-)
 delete mode 100644 docs/superpowers/plans/2026-04-25-trove-native-tool-calling.md
 delete mode 100644 docs/superpowers/specs/2026-04-25-trove-native-tool-calling-design.md

diff --git a/docs/superpowers/plans/2026-04-25-trove-native-tool-calling.md b/docs/superpowers/plans/2026-04-25-trove-native-tool-calling.md
deleted file mode 100644
index 76ecb582..00000000
--- a/docs/superpowers/plans/2026-04-25-trove-native-tool-calling.md
+++ /dev/null
@@ -1,2274 +0,0 @@
-# TroVE Native Tool Calling Implementation Plan
-
-> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
-
-**Goal:** Adapt the existing TroVE port so that the IMPORT mode uses native OpenAI tool calling (vLLM-served `gpt-oss`) while CREATE / SKIP / selection / trimming remain faithful to the paper, then run a 50-task PBEBench smoke and report numbers.
-
-**Architecture:** Keep `_multi_way_generation` unchanged for CREATE/SKIP. Replace the IMPORT branch (when toolbox non-empty AND backend is OpenAI) with a multi-turn loop that (a) translates top-k toolbox functions into OpenAI tool schemas, (b) lets the model emit `tool_calls` that are executed in a sandboxed subprocess, and (c) returns the final assistant text + recorded tool-call trajectory. Frequency credit comes from unique `tool_call.function.name` entries, not parsed `from toolbox import`. All other invariants (K-sampling, reward-based selection, AST tie-break, `C·log_{20}(n)` trimming) are unchanged.
-
-**Tech Stack:** Python 3.11, OpenAI Python SDK against a vLLM ≥ v0.16.0 endpoint serving `openai/gpt-oss-20b` (or `120b`), `subprocess`-based executor, `inspect` + `ast` from stdlib.
-
-**Spec:** [docs/superpowers/specs/2026-04-25-trove-native-tool-calling-design.md](../specs/2026-04-25-trove-native-tool-calling-design.md)
-
----
-
-## File Structure
-
-| File | Status | Purpose |
-|---|---|---|
-| `symbolic_agent/baselines/trove/toolbox.py` | Modify | Trim coefficient `C=1.0` |
-| `symbolic_agent/baselines/trove/executor.py` | Modify | `DEFAULT_TIMEOUT=60` |
-| `symbolic_agent/baselines/trove/llm.py` | Modify | `reasoning_content` fallback in `_call_openai`; new `chat_with_tools` method |
-| `symbolic_agent/baselines/trove/parse.py` | Modify | `imported_callsites` helper; `task_family` parameter on `parse_response` |
-| `symbolic_agent/baselines/trove/prompts.py` | Modify | PBEBench-shaped few-shots; `build_import_with_tools_prompt`; `task_family` dispatch |
-| `symbolic_agent/baselines/trove/controller.py` | Modify | IMPORT-with-tools branch; telemetry fields; `task_family` + `selection` params |
-| `symbolic_agent/baselines/trove/tools_api.py` | Create | `toolbox_to_openai_tools`; `dispatch_tool_call` |
-| `symbolic_agent/baselines/trove/docs/deviations.md` | Create | Algorithmic deviations / faithful elements / infra patches |
-| `symbolic_agent/baselines/trove/tests/__init__.py` | Create | Marker file for the new tests package |
-| `symbolic_agent/baselines/trove/tests/test_tools_api.py` | Create | Unit tests for schema generation + dispatcher |
-| `symbolic_agent/baselines/trove/tests/test_parse_callsites.py` | Create | Unit tests for `imported_callsites` |
-| `main.py` | Modify | `--trove-selection` and `--trove-task-family` flags |
-| `scripts/launch_vllm_gpt_oss_120b.sh` | Modify | Add three vLLM tool-calling flags |
-| `scripts/analyze_trove_run.py` | Create | Post-hoc analysis of TroVE JSONL output |
-
----
-
-## Task 1: Quick infrastructure patches (trim C, executor timeout, reasoning_content fallback)
-
-**Files:**
-- Modify: `symbolic_agent/baselines/trove/toolbox.py:117`
-- Modify: `symbolic_agent/baselines/trove/executor.py:19`
-- Modify: `symbolic_agent/baselines/trove/llm.py:192`
-
-These are three small, independent changes. They share one commit because each is too small to warrant its own and all three are infrastructure fixes.
-
-- [ ] **Step 1.1: Update trim coefficient default**
-
-In `symbolic_agent/baselines/trove/toolbox.py`, change the default of `trim`:
-
-```python
-def trim(self, n_processed: int, C: float = 1.0) -> set:
-    """
-    Remove functions whose frequency is below the threshold
-        C * log_{20}(n_processed)
-    and return the set of example indices that had used those functions.
-
-    Faithful to trim_library() in run_trove.py:
-        threshold = math.log(n, 20)   # log base 20
-    C defaults to 1.0, matching the original implementation (C·log_{20}(n)).
-    Note: the original uses log base-20 not base-10; we keep base-20.
-    """
-```
-
-- [ ] **Step 1.2: Update executor timeout default**
-
-In `symbolic_agent/baselines/trove/executor.py`, change the constant:
-
-```python
-DEFAULT_TIMEOUT = 60  # seconds — generous for PBEBench replace() chains and multi-turn dispatch
-```
-
-- [ ] **Step 1.3: Add reasoning_content fallback in `_call_openai`**
-
-In `symbolic_agent/baselines/trove/llm.py`, replace the line that reads `raw = response.choices[0].message.content or ""` with:
-
-```python
-                msg = response.choices[0].message
-                raw = msg.content or getattr(msg, "reasoning_content", "") or ""
-```
-
-Context (the surrounding `try` block stays unchanged):
-
-```python
-                response = self._client.chat.completions.create(
-                    model=model,
-                    max_tokens=max_tokens,
-                    messages=messages,
-                )
-                msg = response.choices[0].message
-                raw = msg.content or getattr(msg, "reasoning_content", "") or ""
-                u = getattr(response, "usage", None)
-```
-
-- [ ] **Step 1.4: Sanity-check the changes**
-
-Run: `python -c "from symbolic_agent.baselines.trove.toolbox import TroVEToolbox; from symbolic_agent.baselines.trove.executor import DEFAULT_TIMEOUT; import inspect; print(inspect.signature(TroVEToolbox.trim).parameters['C'].default, DEFAULT_TIMEOUT)"`
-
-Expected: `1.0 60`
-
-- [ ] **Step 1.5: Commit**
-
-```bash
-git add symbolic_agent/baselines/trove/toolbox.py symbolic_agent/baselines/trove/executor.py symbolic_agent/baselines/trove/llm.py
-git commit -m "$(cat <<'EOF'
-fix(trove): infra patches for native tool calling
-
-- toolbox.trim default C=1.0 (matches original TroVE)
-- executor DEFAULT_TIMEOUT=60s (PBEBench + multi-turn headroom)
-- llm._call_openai falls back to message.reasoning_content when
-  message.content is empty (gpt-oss Harmony channel split)
-EOF
-)"
-```
-
----
-
-## Task 2: `parse.imported_callsites` helper + `task_family` parameter
-
-**Files:**
-- Modify: `symbolic_agent/baselines/trove/parse.py:86,106-114`
-- Create: `symbolic_agent/baselines/trove/tests/__init__.py`
-- Create: `symbolic_agent/baselines/trove/tests/test_parse_callsites.py`
-
-- [ ] **Step 2.1: Create the tests package marker**
-
-Create `symbolic_agent/baselines/trove/tests/__init__.py` as an empty file.
-
-- [ ] **Step 2.2: Write the failing test for `imported_callsites`**
-
-Create `symbolic_agent/baselines/trove/tests/test_parse_callsites.py`:
-
-```python
-"""Unit tests for parse.imported_callsites and parse_response(task_family=)."""
-
-from symbolic_agent.baselines.trove.parse import imported_callsites, parse_response
-
-
-# ---------------------------------------------------------------------------
-# imported_callsites
-# ---------------------------------------------------------------------------
-
-def test_callsites_bare_name():
-    code = "result = find_replace_chain(s, [('a', 'b')])\nprint(result)"
-    assert imported_callsites(code, tools_code="", candidate_names={"find_replace_chain", "other"}) == {"find_replace_chain"}
-
-
-def test_callsites_attribute_access():
-    code = "result = toolbox.find_replace_chain(s, pairs)\nprint(result)"
-    assert imported_callsites(code, tools_code="", candidate_names={"find_replace_chain"}) == {"find_replace_chain"}
-
-
-def test_callsites_no_match():
-    code = "print(s.replace('a', 'b'))"
-    assert imported_callsites(code, tools_code="", candidate_names={"find_replace_chain"}) == set()
-
-
-def test_callsites_multiple_calls_same_name_dedup():
-    code = "x = f(1)\ny = f(2)\nprint(x, y)"
-    assert imported_callsites(code, tools_code="", candidate_names={"f", "g"}) == {"f"}
-
-
-def test_callsites_syntax_error_returns_empty():
-    code = "this is not valid python ::"
-    assert imported_callsites(code, tools_code="", candidate_names={"f"}) == set()
-
-
-def test_callsites_empty_inputs():
-    assert imported_callsites("", "", set()) == set()
-    assert imported_callsites("print(1)", "", set()) == set()
-
-
-# ---------------------------------------------------------------------------
-# parse_response(task_family=)
-# ---------------------------------------------------------------------------
-
-def test_parse_response_pbebench_strict_no_solution_block():
-    text = "Here is some reasoning.\n```python\nprint('answer')\n```\n"
-    out = parse_response(text, task_family="pbebench")
-    assert out["solution_code"] == ""
-
-
-def test_parse_response_pbebench_with_solution_block():
-    text = "**Solution**\n```python\nprint('answer')\n```\n"
-    out = parse_response(text, task_family="pbebench")
-    assert out["solution_code"] == "print('answer')"
-
-
-def test_parse_response_default_falls_back_to_any_python_block():
-    text = "Here is some reasoning.\n```python\nprint('answer')\n```\n"
-    out = parse_response(text, task_family="default")
-    assert "print('answer')" in out["solution_code"]
-
-
-def test_parse_response_default_call_signature_unchanged():
-    text = "**Solution**\n```python\nprint('answer')\n```\n"
-    out = parse_response(text)
-    assert out["solution_code"] == "print('answer')"
-```
-
-- [ ] **Step 2.3: Run the tests to confirm they fail**
-
-Run: `python -m pytest symbolic_agent/baselines/trove/tests/test_parse_callsites.py -v`
-
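-Expected: a collection error. The module-level import of `imported_callsites` raises ImportError (the function does not exist yet), so pytest runs no tests. Once that import resolves, the `parse_response(text, task_family=...)` tests would still fail with a TypeError for the unknown keyword argument.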
-
-- [ ] **Step 2.4: Implement `imported_callsites` and add `task_family` to `parse_response`**
-
-In `symbolic_agent/baselines/trove/parse.py`, add the helper at the end of the AST section (after `count_ast_nodes`):
-
-```python
-def imported_callsites(
-    solution_code: str,
-    tools_code: str,
-    candidate_names: set,
-) -> set:
-    """
-    Return the subset of `candidate_names` that appear as call-sites in
-    `solution_code`. Used for the `actually_called` telemetry field.
-
-    Detects two callee shapes:
-      - bare Name:        find_replace_chain(...)
-      - Attribute(name):  toolbox.find_replace_chain(...)
-
-    `tools_code` is currently unused (kept in the signature so callers can
-    pass through the **Tools** block context if we later want to filter by
-    what was actually imported).
-
-    Returns an empty set on empty input or SyntaxError.
-    """
-    if not solution_code or not candidate_names:
-        return set()
-    try:
-        tree = ast.parse(solution_code)
-    except SyntaxError:
-        return set()
-    found: set = set()
-    for node in ast.walk(tree):
-        if not isinstance(node, ast.Call):
-            continue
-        func = node.func
-        if isinstance(func, ast.Name) and func.id in candidate_names:
-            found.add(func.id)
-        elif isinstance(func, ast.Attribute) and func.attr in candidate_names:
-            found.add(func.attr)
-    return found
-```
-
-Then modify `parse_response` (around line 86) to accept `task_family`:
-
-```python
-def parse_response(text: str, task_family: str = "default") -> dict:
-    """
-    Parse a TroVE-format LLM response.
-
-    Returns
-    -------
-    {
-        "solution_code": str,         # code inside **Solution** block
-        "tools_code":    str,         # code inside **Tools** block
-        "functions":     list[dict],  # parsed tool dicts from the Tools block
-    }
-
-    task_family
-    -----------
-    "default": if no **Solution** block is found, falls back to the first
-    ```python``` block anywhere (legacy behaviour).
-    "pbebench": no fallback. Strict **Solution**-block-only parsing avoids
-    accidentally promoting CoT scratchpad to the answer.
-    """
-    solution_code = _extract_code_block(text, "Solution") or ""
-    tools_code = _extract_code_block(text, "Tools") or ""
-
-    if not solution_code and task_family != "pbebench":
-        raw = _extract_any_python_block(text)
-        if raw:
-            solution_code = _make_executable(raw)
-
-    functions = parse_tools_in_chunk(tools_code) if tools_code else []
-    return {
-        "solution_code": solution_code,
-        "tools_code": tools_code,
-        "functions": functions,
-    }
-```
-
-- [ ] **Step 2.5: Run the tests to confirm they pass**
-
-Run: `python -m pytest symbolic_agent/baselines/trove/tests/test_parse_callsites.py -v`
-
-Expected: 10 passed.
-
-- [ ] **Step 2.6: Commit**
-
-```bash
-git add symbolic_agent/baselines/trove/parse.py symbolic_agent/baselines/trove/tests/__init__.py symbolic_agent/baselines/trove/tests/test_parse_callsites.py
-git commit -m "$(cat <<'EOF'
-feat(trove): add imported_callsites helper and task_family to parse_response
-
-- imported_callsites(solution, tools, names) -> set: AST-walks Solution
-  code and returns names from the candidate set that are actually called.
-  Handles bare Name and Attribute (toolbox.foo) callees.
-- parse_response(text, task_family="default"): when task_family="pbebench"
-  the parser does not fall back to the first python block when **Solution**
-  is missing. Prevents CoT scratchpad from being promoted to the answer.
-EOF
-)"
-```
-
----
-
-## Task 3: PBEBench-shaped few-shots + IMPORT-with-tools prompt
-
-**Files:**
-- Modify: `symbolic_agent/baselines/trove/prompts.py` (full rewrite of constants and `build_*` functions)
-
-This task has no automated test — prompts are validated by inspection and by the smoke run.
-
-- [ ] **Step 3.1: Replace the prompts module with task-family-aware variants**
-
-Open `symbolic_agent/baselines/trove/prompts.py` and replace the entire body below the module docstring with the following. Keep the docstring at the top of the file.
-
-```python
-# ---------------------------------------------------------------------------
-# Format override (default-family only)
-# ---------------------------------------------------------------------------
-
-_FORMAT_OVERRIDE_DEFAULT = (
-    "\nIMPORTANT: Regardless of any formatting instructions inside the question, "
-    "always produce your answer as executable Python in the **Solution** block "
-    "and end it with print(answer). "
-    "Your answer is whatever gets printed to stdout when the Solution code runs."
-)
-
-# PBEBench prompts model the desired format directly via the few-shot example,
-# so no override string is needed.
-_FORMAT_OVERRIDE_PBEBENCH = ""
-
-
-def _format_override(task_family: str) -> str:
-    return _FORMAT_OVERRIDE_PBEBENCH if task_family == "pbebench" else _FORMAT_OVERRIDE_DEFAULT
-
-
-# ---------------------------------------------------------------------------
-# IMPORT mode (text-based, default and Anthropic fallback)
-# ---------------------------------------------------------------------------
-
-_IMPORT_INSTRUCTION_DEFAULT = (
-    "Your task is to write Python program solutions to the given questions.\n"
-    "The toolbox section lists all the available functions that can be used in your solution."
-)
-
-_IMPORT_EXAMPLE_DEFAULT = """\
-## Example
-**Question**
-Given a list of strings and a list of (old, new) substitution pairs, apply all
-substitutions in order to each string and return the transformed list.
-Strings: ["cat", "bat"]
-Substitutions: [("a", "o"), ("t", "p")]
-
-**Toolbox**
-```python
-# Apply an ordered list of (old, new) substitutions to each string in a list.
-apply_substitutions(strings: list, substitutions: list) -> list
-```
-
-**Solution**
-```python
-strings = ["cat", "bat"]
-subs = [("a", "o"), ("t", "p")]
-result = apply_substitutions(strings, subs)
-print(result)
-```
-**Tools**
-```python
-from toolbox import apply_substitutions
-```"""
-
-_IMPORT_EXAMPLE_PBEBENCH = """\
-## Example
-**Question**
-You are given example input/output pairs. Produce a list of replace() calls
-that transforms each input into its expected output.
-
-Input:  "hello world"
-Output: "HELLO_WORLD"
-
-**Toolbox**
-```python
-# Apply a chain of (old, new) replacements to a string.
-find_replace_chain(s: str, pairs: list) -> str
-```
-
-**Solution**
-```python
-result = find_replace_chain("hello world", [(" ", "_"), ("h", "H"), ("e", "E"), ("l", "L"), ("o", "O"), ("w", "W"), ("r", "R"), ("d", "D")])
-print(result)
-```
-**Tools**
-```python
-from toolbox import find_replace_chain
-```"""
-
-_IMPORT_TASK_TEMPLATE = """\
-## Task
-**Question**
-{question}
-
-**Toolbox**
-{toolbox}
-
-**Solution**
-"""
-
-
-def build_import_prompt(question: str, toolbox_str: str, task_family: str = "default") -> str:
-    """Build the text-based IMPORT-mode prompt (used for Anthropic and as fallback)."""
-    instruction = _IMPORT_INSTRUCTION_DEFAULT + _format_override(task_family)
-    example = _IMPORT_EXAMPLE_PBEBENCH if task_family == "pbebench" else _IMPORT_EXAMPLE_DEFAULT
-    return (
-        instruction
-        + "\n\n\n"
-        + example
-        + "\n\n\n"
-        + _IMPORT_TASK_TEMPLATE.format(question=question, toolbox=toolbox_str)
-    )
-
-
-# ---------------------------------------------------------------------------
-# IMPORT-with-tools mode (native OpenAI tool calling; no **Toolbox** block)
-# ---------------------------------------------------------------------------
-
-_IMPORT_WITH_TOOLS_INSTRUCTION_DEFAULT = (
-    "Your task is to write Python program solutions to the given questions.\n"
-    "You have a set of helper functions available as tools. Call any of them "
-    "when they help you solve the question; otherwise solve directly. After "
-    "you have computed the answer, output it as executable Python in a "
-    "**Solution** block and end with print(answer)."
-)
-
-_IMPORT_WITH_TOOLS_INSTRUCTION_PBEBENCH = (
-    "Your task is to produce a list of replace() calls that transforms each "
-    "input into its expected output for a Programming-by-Example task.\n"
-    "You have a set of helper functions available as tools. Call any of them "
-    "to test ideas or compute intermediate results; the final answer must be "
-    "produced as a Python program in the **Solution** block."
-)
-
-_IMPORT_WITH_TOOLS_EXAMPLE_DEFAULT = """\
-## Example
-**Question**
-Apply substitutions [("a","o"),("t","p")] to ["cat","bat"] and return the list.
-
-(After optionally calling `apply_substitutions` as a tool to confirm,
-the assistant produces:)
-
-**Solution**
-```python
-strings = ["cat", "bat"]
-subs = [("a", "o"), ("t", "p")]
-result = apply_substitutions(strings, subs)
-print(result)
-```"""
-
-_IMPORT_WITH_TOOLS_EXAMPLE_PBEBENCH = """\
-## Example
-**Question**
-Produce a sequence of replace() calls that transforms "hello world" into
-"HELLO_WORLD".
-
-(After optionally calling `find_replace_chain` as a tool to verify a
-candidate sequence, the assistant produces:)
-
-**Solution**
-```python
-result = find_replace_chain("hello world", [(" ", "_"), ("h", "H"), ("e", "E"), ("l", "L"), ("o", "O"), ("w", "W"), ("r", "R"), ("d", "D")])
-print(result)
-```"""
-
-_IMPORT_WITH_TOOLS_TASK_TEMPLATE = """\
-## Task
-**Question**
-{question}
-
-**Solution**
-"""
-
-
-def build_import_with_tools_prompt(question: str, task_family: str = "default") -> str:
-    """
-    Build the IMPORT-with-tools prompt. The toolbox is NOT shown as text — it
-    is conveyed via the OpenAI tools=[...] parameter on the chat completion call.
-    """
-    if task_family == "pbebench":
-        instruction = _IMPORT_WITH_TOOLS_INSTRUCTION_PBEBENCH
-        example = _IMPORT_WITH_TOOLS_EXAMPLE_PBEBENCH
-    else:
-        instruction = _IMPORT_WITH_TOOLS_INSTRUCTION_DEFAULT
-        example = _IMPORT_WITH_TOOLS_EXAMPLE_DEFAULT
-    return (
-        instruction
-        + "\n\n\n"
-        + example
-        + "\n\n\n"
-        + _IMPORT_WITH_TOOLS_TASK_TEMPLATE.format(question=question)
-    )
-
-
-# ---------------------------------------------------------------------------
-# CREATE mode
-# ---------------------------------------------------------------------------
-
-_CREATE_INSTRUCTION_DEFAULT = (
-    "Your task is to write Python program solutions to the given questions.\n"
-    "You should also create Python functions that can be used by your solution, "
-    "if you believe the function can be reused to solve other questions."
-)
-
-_CREATE_EXAMPLE_DEFAULT = """\
-## Example
-**Question**
-Given a list of strings and a list of (old, new) substitution pairs, apply all
-substitutions in order to each string and return the transformed list.
-Strings: ["hello", "world"]
-Substitutions: [("l", "r"), ("o", "0")]
-
-**Solution**
-```python
-strings = ["hello", "world"]
-subs = [("l", "r"), ("o", "0")]
-result = apply_substitutions(strings, subs)
-print(result)
-```
-**Tools**
-```python
-def apply_substitutions(strings, substitutions):
-    \"\"\"Apply an ordered list of (old, new) substitutions to each string in a list.\"\"\"
-    out = []
-    for s in strings:
-        for old, new in substitutions:
-            s = s.replace(old, new)
-        out.append(s)
-    return out
-```"""
-
-_CREATE_EXAMPLE_PBEBENCH = """\
-## Example
-**Question**
-Produce a sequence of replace() calls that transforms "hello world" into
-"HELLO_WORLD".
-
-**Solution**
-```python
-result = find_replace_chain("hello world", [(" ", "_"), ("h", "H"), ("e", "E"), ("l", "L"), ("o", "O"), ("w", "W"), ("r", "R"), ("d", "D")])
-print(result)
-```
-**Tools**
-```python
-def find_replace_chain(s, pairs):
-    \"\"\"Apply a chain of (old, new) replacements to a string.\"\"\"
-    for old, new in pairs:
-        s = s.replace(old, new)
-    return s
-```"""
-
-_CREATE_TASK_TEMPLATE = """\
-## Task
-**Question**
-{question}
-
-**Solution**
-"""
-
-
-def build_create_prompt(question: str, task_family: str = "default") -> str:
-    """Build the CREATE-mode prompt for a single task."""
-    instruction = _CREATE_INSTRUCTION_DEFAULT + _format_override(task_family)
-    example = _CREATE_EXAMPLE_PBEBENCH if task_family == "pbebench" else _CREATE_EXAMPLE_DEFAULT
-    return (
-        instruction
-        + "\n\n\n"
-        + example
-        + "\n\n\n"
-        + _CREATE_TASK_TEMPLATE.format(question=question)
-    )
-
-
-# ---------------------------------------------------------------------------
-# SKIP mode
-# ---------------------------------------------------------------------------
-
-_SKIP_INSTRUCTION_DEFAULT = (
-    "Your task is to write Python program solutions to the given questions."
-)
-
-_SKIP_EXAMPLE_DEFAULT = """\
-## Example
-**Question**
-Given the list of strings ["Hello", "World"], convert each to lowercase and
-return the resulting list.
-
-**Solution**
-```python
-strings = ["Hello", "World"]
-result = [s.lower() for s in strings]
-print(result)
-```
-**Tools**
-```python
-```"""
-
-_SKIP_EXAMPLE_PBEBENCH = """\
-## Example
-**Question**
-Produce a sequence of replace() calls that transforms "hello world" into
-"HELLO_WORLD".
-
-**Solution**
-```python
-s = "hello world"
-s = s.replace(" ", "_")
-s = s.replace("h", "H")
-s = s.replace("e", "E")
-s = s.replace("l", "L")
-s = s.replace("o", "O")
-s = s.replace("w", "W")
-s = s.replace("r", "R")
-s = s.replace("d", "D")
-print(s)
-```
-**Tools**
-```python
-```"""
-
-_SKIP_TASK_TEMPLATE = """\
-## Task
-**Question**
-{question}
-
-**Solution**
-"""
-
-
-def build_skip_prompt(question: str, task_family: str = "default") -> str:
-    """Build the SKIP-mode prompt for a single task."""
-    instruction = _SKIP_INSTRUCTION_DEFAULT + _format_override(task_family)
-    example = _SKIP_EXAMPLE_PBEBENCH if task_family == "pbebench" else _SKIP_EXAMPLE_DEFAULT
-    return (
-        instruction
-        + "\n\n\n"
-        + example
-        + "\n\n\n"
-        + _SKIP_TASK_TEMPLATE.format(question=question)
-    )
-
-
-def get_question(task_input: dict) -> str:
-    """
-    Extract the question/prompt string from a task_input dict.
-
-    Priority: question > prompt > task > str(task_input).
-    """
-    for key in ("question", "prompt", "task"):
-        val = task_input.get(key)
-        if val and isinstance(val, str) and val.strip():
-            return val.strip()
-    return str(task_input)
-```
-
-- [ ] **Step 3.2: Smoke-test the new prompts compile and dispatch correctly**
-
-Run: `python -c "from symbolic_agent.baselines.trove.prompts import build_import_prompt, build_create_prompt, build_skip_prompt, build_import_with_tools_prompt; print('--IMPORT default--'); print(build_import_prompt('Q?', 'TB')[:200]); print('--IMPORT pbebench--'); print(build_import_prompt('Q?', 'TB', task_family='pbebench')[:200]); print('--IMPORT_WITH_TOOLS pbebench--'); print(build_import_with_tools_prompt('Q?', task_family='pbebench')[:200])"`
-
-Expected: three short prompt previews and no exceptions. The default IMPORT preview contains the `IMPORTANT:` line; neither pbebench preview does.
-
-- [ ] **Step 3.3: Commit**
-
-```bash
-git add symbolic_agent/baselines/trove/prompts.py
-git commit -m "$(cat <<'EOF'
-feat(trove): PBEBench-shaped few-shots and IMPORT-with-tools prompt
-
-- Add task_family parameter to all build_* prompt builders.
-- Add _CREATE_EXAMPLE_PBEBENCH and _SKIP_EXAMPLE_PBEBENCH demonstrating
-  replace()-chain solutions and a find_replace_chain helper.
-- Add build_import_with_tools_prompt for native tool calling: no
-  **Toolbox** markdown block (toolbox is conveyed via tools=[...]).
-- _FORMAT_OVERRIDE is empty for task_family="pbebench" (the example
-  models the desired format directly).
-EOF
-)"
-```
-
----
-
-## Task 4: New `tools_api.py` (toolbox -> OpenAI schemas, dispatcher)
-
-**Files:**
-- Create: `symbolic_agent/baselines/trove/tools_api.py`
-- Create: `symbolic_agent/baselines/trove/tests/test_tools_api.py`
-
-- [ ] **Step 4.1: Write the failing tests**
-
-Create `symbolic_agent/baselines/trove/tests/test_tools_api.py`:
-
-```python
-"""Unit tests for tools_api.toolbox_to_openai_tools and dispatch_tool_call."""
-
-import json
-from types import SimpleNamespace
-
-from symbolic_agent.baselines.trove.toolbox import TroVEToolbox
-from symbolic_agent.baselines.trove.tools_api import (
-    dispatch_tool_call,
-    toolbox_to_openai_tools,
-)
-
-
-def _make_toolbox_with(func_src: str, name: str, docstr: str = "") -> TroVEToolbox:
-    tb = TroVEToolbox()
-    tb.add(
-        {
-            "name": name,
-            "docstr": docstr,
-            "signature": f"def {name}(...)",
-            "function": func_src,
-            "type": "function",
-        },
-        example_idx=0,
-    )
-    return tb
-
-
-def _tool_call(name: str, args: dict, call_id: str = "call_1"):
-    return SimpleNamespace(
-        id=call_id,
-        function=SimpleNamespace(name=name, arguments=json.dumps(args)),
-    )
-
-
-# ---------------------------------------------------------------------------
-# toolbox_to_openai_tools
-# ---------------------------------------------------------------------------
-
-def test_schema_basic_function():
-    src = (
-        "def find_replace_chain(s: str, pairs: list) -> str:\n"
-        "    \"\"\"Apply a chain of (old, new) replacements to a string.\"\"\"\n"
-        "    for old, new in pairs:\n"
-        "        s = s.replace(old, new)\n"
-        "    return s\n"
-    )
-    tb = _make_toolbox_with(src, "find_replace_chain", docstr="Apply a chain of (old, new) replacements to a string.")
-    tools = toolbox_to_openai_tools(tb, topk=10)
-    assert len(tools) == 1
-    fn = tools[0]
-    assert fn["type"] == "function"
-    assert fn["function"]["name"] == "find_replace_chain"
-    assert fn["function"]["description"] == "Apply a chain of (old, new) replacements to a string."
-    params = fn["function"]["parameters"]
-    assert params["type"] == "object"
-    assert set(params["properties"].keys()) == {"s", "pairs"}
-    assert params["properties"]["s"]["type"] == "string"
-    assert params["properties"]["pairs"]["type"] == "array"
-    assert set(params["required"]) == {"s", "pairs"}
-
-
-def test_schema_unannotated_falls_back_to_string():
-    src = (
-        "def f(x):\n"
-        "    return x\n"
-    )
-    tb = _make_toolbox_with(src, "f")
-    tools = toolbox_to_openai_tools(tb, topk=10)
-    assert tools[0]["function"]["parameters"]["properties"]["x"]["type"] == "string"
-
-
-def test_schema_skips_varargs_kwargs():
-    src = (
-        "def f(*args, **kwargs):\n"
-        "    return args\n"
-    )
-    tb = _make_toolbox_with(src, "f")
-    tools = toolbox_to_openai_tools(tb, topk=10)
-    assert tools == []
-
-
-def test_schema_required_excludes_defaults():
-    src = (
-        "def f(x: int, y: int = 5):\n"
-        "    return x + y\n"
-    )
-    tb = _make_toolbox_with(src, "f")
-    tools = toolbox_to_openai_tools(tb, topk=10)
-    params = tools[0]["function"]["parameters"]
-    assert params["required"] == ["x"]
-    assert params["properties"]["y"]["type"] == "integer"
-
-
-def test_schema_topk_respects_frequency():
-    tb = TroVEToolbox()
-    for n, freq in [("a", 3), ("b", 2), ("c", 1)]:
-        tb.add(
-            {
-                "name": n,
-                "docstr": "",
-                "signature": f"def {n}()",
-                "function": f"def {n}():\n    return 0\n",
-                "type": "function",
-            },
-            example_idx=0,
-        )
-        for _ in range(freq - 1):
-            tb.update_frequency(n, example_idx=0)
-    tools = toolbox_to_openai_tools(tb, topk=2)
-    assert [t["function"]["name"] for t in tools] == ["a", "b"]
-
-
-def test_schema_empty_toolbox():
-    assert toolbox_to_openai_tools(TroVEToolbox(), topk=10) == []
-
-
-# ---------------------------------------------------------------------------
-# dispatch_tool_call
-# ---------------------------------------------------------------------------
-
-def test_dispatch_runs_function_and_returns_stdout():
-    src = (
-        "def reverse_str(s):\n"
-        "    return s[::-1]\n"
-    )
-    tb = _make_toolbox_with(src, "reverse_str")
-    result = dispatch_tool_call(tb, _tool_call("reverse_str", {"s": "hello"}))
-    assert "olleh" in result
-
-
-def test_dispatch_unknown_tool_returns_error():
-    tb = TroVEToolbox()
-    result = dispatch_tool_call(tb, _tool_call("nonexistent", {}))
-    assert "not in toolbox" in result
-
-
-def test_dispatch_bad_json_returns_error():
-    src = "def f(x):\n    return x\n"
-    tb = _make_toolbox_with(src, "f")
-    bad = SimpleNamespace(
-        id="x",
-        function=SimpleNamespace(name="f", arguments="{not json"),
-    )
-    result = dispatch_tool_call(tb, bad)
-    assert "argument JSON parse failed" in result
-
-
-def test_dispatch_sanitizes_harmony_contamination():
-    src = "def reverse_str(s):\n    return s[::-1]\n"
-    tb = _make_toolbox_with(src, "reverse_str")
-    tc = _tool_call("reverse_str<|channel|>commentary", {"s": "abc"})
-    result = dispatch_tool_call(tb, tc)
-    assert "cba" in result
-
-
-def test_dispatch_truncates_long_output():
-    src = (
-        "def long_output(n):\n"
-        "    return 'x' * n\n"
-    )
-    tb = _make_toolbox_with(src, "long_output")
-    result = dispatch_tool_call(tb, _tool_call("long_output", {"n": 10000}))
-    assert len(result) <= 4096 + 100  # +slack for repr quotes and truncation marker
-```
-
-- [ ] **Step 4.2: Run the tests to confirm they fail**
-
-Run: `python -m pytest symbolic_agent/baselines/trove/tests/test_tools_api.py -v`
-
-Expected: ImportError on `tools_api` module.
-
-- [ ] **Step 4.3: Create the `tools_api.py` module**
-
-Create `symbolic_agent/baselines/trove/tools_api.py`:
-
-```python
-"""Translate the TroVE toolbox into OpenAI Chat Completions tool schemas
-and dispatch tool calls back through the executor.
-
-This module is the bridge between TroVE's in-memory toolbox and vLLM's
-native tool-calling protocol. It is invoked only from the IMPORT-with-tools
-controller branch.
-"""
-
-from __future__ import annotations
-
-import inspect
-import json
-import logging
-from typing import Any
-
-from .executor import run_solution
-from .toolbox import TroVEToolbox
-
-logger = logging.getLogger(__name__)
-
-_MAX_RESULT_CHARS = 4096
-
-# Type inference: Python annotation -> JSON Schema type.
-_TYPE_MAP = {
-    int: "integer",
-    float: "number",
-    bool: "boolean",
-    str: "string",
-    list: "array",
-    tuple: "array",
-    dict: "object",
-}
-
-
-def _infer_type(annotation: Any) -> str:
-    if annotation is inspect.Parameter.empty:
-        return "string"
-    # Plain types (int, str, etc.)
-    if annotation in _TYPE_MAP:
-        return _TYPE_MAP[annotation]
-    # typing.List, typing.Dict, etc. — fall through to string if unrecognised.
-    origin = getattr(annotation, "__origin__", None)
-    if origin in _TYPE_MAP:
-        return _TYPE_MAP[origin]
-    return "string"
-
-
-def _function_to_schema(name: str, fn: Any, docstr: str) -> dict | None:
-    """
-    Build one OpenAI tool dict from a callable. Returns None if the function
-    has *args or **kwargs (we cannot generate a meaningful schema).
-    """
-    try:
-        sig = inspect.signature(fn)
-    except (TypeError, ValueError) as exc:
-        logger.debug("Could not introspect %s: %s", name, exc)
-        return None
-
-    properties: dict = {}
-    required: list = []
-
-    for pname, param in sig.parameters.items():
-        if param.kind in (
-            inspect.Parameter.VAR_POSITIONAL,
-            inspect.Parameter.VAR_KEYWORD,
-        ):
-            logger.debug("Skipping %s — has *args/**kwargs", name)
-            return None
-        prop: dict = {"type": _infer_type(param.annotation)}
-        if param.default is not inspect.Parameter.empty:
-            if isinstance(param.default, (int, float, bool, str)):
-                prop["default"] = param.default
-        else:
-            required.append(pname)
-        properties[pname] = prop
-
-    return {
-        "type": "function",
-        "function": {
-            "name": name,
-            "description": docstr or "",
-            "parameters": {
-                "type": "object",
-                "properties": properties,
-                "required": required,
-            },
-        },
-    }
-
-
-def toolbox_to_openai_tools(toolbox: TroVEToolbox, topk: int = 10) -> list:
-    """
-    Convert the top-k toolbox functions (by frequency) into OpenAI Chat
-    Completions tool dicts.
-
-    Functions with *args / **kwargs are silently excluded.
-    Returns [] when the toolbox is empty.
-    """
-    entries = toolbox.snapshot()
-    if not entries:
-        return []
-    entries.sort(key=lambda e: -int(e.get("frequency", 0)))
-    selected = entries[:topk]
-
-    namespace: dict = {}
-    try:
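-        # dont_inherit=True: exec() on a source string would otherwise inherit this
-        # module's `from __future__ import annotations`, turning the toolbox
-        # functions' annotations into strings that _infer_type cannot map.
-        code = compile(toolbox.get_full_code(), "<toolbox>", "exec", dont_inherit=True)
-        exec(code, namespace)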
-    except Exception as exc:
-        logger.warning("Could not exec toolbox source for schema generation: %s", exc)
-        return []
-
-    tools: list = []
-    for entry in selected:
-        name = entry.get("name", "")
-        if not name or name not in namespace:
-            continue
-        fn = namespace[name]
-        schema = _function_to_schema(name, fn, entry.get("docstr", ""))
-        if schema is not None:
-            tools.append(schema)
-    return tools
-
-
-def _sanitize_name(name: str) -> str:
-    """Defensive workaround for vLLM PR #35906 (Harmony control tokens
-    leaking into tool names like `reverse_str<|channel|>commentary`)."""
-    return name.split("<|", 1)[0].strip()
-
-
-def _truncate(s: str, limit: int = _MAX_RESULT_CHARS) -> str:
-    if len(s) <= limit:
-        return s
-    return s[:limit] + f"\n... [truncated {len(s) - limit} chars]"
-
-
-def dispatch_tool_call(toolbox: TroVEToolbox, tool_call) -> str:
-    """
-    Resolve `tool_call` against the toolbox, run it via the sandbox executor,
-    and return the captured stdout (truncated to 4096 chars) or an error
-    message string. Always returns a string — never raises.
-    """
-    name = _sanitize_name(getattr(tool_call.function, "name", "") or "")
-    if not name:
-        return json.dumps({"error": "tool_call has no function name"})
-    if name not in {e["name"] for e in toolbox.snapshot()}:
-        return json.dumps({"error": f"tool '{name}' not in toolbox"})
-
-    raw_args = getattr(tool_call.function, "arguments", "") or "{}"
-    try:
-        args = json.loads(raw_args)
-        if not isinstance(args, dict):
-            return json.dumps({"error": f"argument JSON parse failed: expected object, got {type(args).__name__}"})
-    except json.JSONDecodeError as exc:
-        return json.dumps({"error": f"argument JSON parse failed: {exc}"})
-
-    call_expr = f"print(repr({name}(**{args!r})))"
-    is_ok, output = run_solution(
-        solution_code=call_expr,
-        tools_code="",
-        toolbox_code=toolbox.get_full_code(),
-    )
-    if not is_ok:
-        return json.dumps({"error": "execution failed", "stderr": _truncate(output)})
-    return _truncate(output)
-```
-
-- [ ] **Step 4.4: Run the tests to confirm they pass**
-
-Run: `python -m pytest symbolic_agent/baselines/trove/tests/test_tools_api.py -v`
-
-Expected: 10 passed.
-
-- [ ] **Step 4.5: Commit**
-
-```bash
-git add symbolic_agent/baselines/trove/tools_api.py symbolic_agent/baselines/trove/tests/test_tools_api.py
-git commit -m "$(cat <<'EOF'
-feat(trove): add tools_api for native OpenAI tool calling
-
-- toolbox_to_openai_tools(toolbox, topk=10): converts top-k toolbox
-  functions into OpenAI Chat Completions tool schemas. Infers parameter
-  types from inspect.signature; functions with *args/**kwargs are
-  silently excluded.
-- dispatch_tool_call(toolbox, tool_call): runs the requested function
-  in the sandbox executor, returns stdout truncated to 4096 chars or
-  a JSON error string. Sanitizes Harmony control-token contamination
-  in tool names (defensive vs. open vLLM PR #35906).
-EOF
-)"
-```
-
----
-
-## Task 5: `chat_with_tools` method on `TroVELLMClient`
-
-**Files:**
-- Modify: `symbolic_agent/baselines/trove/llm.py` (add new method, no signature changes to existing methods)
-
-This task has no automated test — the multi-turn loop is validated by the controller-level integration plus the smoke run.
-
-- [ ] **Step 5.1: Add `chat_with_tools` to `TroVELLMClient`**
-
-In `symbolic_agent/baselines/trove/llm.py`, make sure the `typing` import near the top includes `Callable` (some of these names may already be imported):
-
-```python
-from typing import Any, Callable, Dict, List, Optional
-```
-
-Then add the new method to the `TroVELLMClient` class (insert after `_call_openai`, before `_record`):
-
-```python
-    # ------------------------------------------------------------------
-    # Native tool calling (OpenAI/vLLM only)
-    # ------------------------------------------------------------------
-
-    def chat_with_tools(
-        self,
-        messages: List[Dict[str, Any]],
-        tools: List[Dict[str, Any]],
-        model: str,
-        max_tokens: int = DEFAULT_MAX_TOKENS,
-        max_tool_iters: int = 8,
-        on_tool_call: Optional[Callable[[Any], str]] = None,
-        tag: str = "",
-    ) -> Dict[str, Any]:
-        """
-        Multi-turn chat completion that supports native OpenAI tool calls.
-
-        Returns
-        -------
-        {
-            "final_text":     str,         # message.content (or reasoning_content fallback)
-            "tool_calls":     list[dict],  # ordered, each {name, args_preview, result_preview, ok}
-            "iterations":     int,         # number of round-trips actually used
-            "stopped_reason": str,         # "no_tool_calls" | "max_iters" | "error"
-        }
-
-        The caller is responsible for providing `on_tool_call(tc) -> str`,
-        which is invoked for every tool_call returned by the model. The
-        return value (already a string) is sent back as the tool message.
-
-        Anthropic backend is not supported — this method exists for the
-        OpenAI/vLLM tool-calling flow only. It raises NotImplementedError
-        on Anthropic as a defensive guard; controllers must check
-        `self.backend == "openai"` before calling.
-        """
-        if self.backend != "openai":
-            raise NotImplementedError("chat_with_tools requires the openai backend")
-
-        if on_tool_call is None:
-            raise ValueError("chat_with_tools requires an on_tool_call callback")
-
-        recorded_calls: List[Dict[str, Any]] = []
-        convo: List[Dict[str, Any]] = list(messages)
-        iterations = 0
-        final_text = ""
-        stopped_reason = "no_tool_calls"
-
-        for it in range(max_tool_iters + 1):
-            iterations = it + 1
-            iter_tag = f"{tag}_iter{it}" if tag else f"iter{it}"
-            response = None
-            last_exc = None
-
-            for attempt in range(3):
-                try:
-                    response = self._client.chat.completions.create(
-                        model=model,
-                        max_tokens=max_tokens,
-                        messages=convo,
-                        tools=tools,
-                        tool_choice="auto",
-                    )
-                    break
-                except Exception as exc:
-                    last_exc = exc
-                    if getattr(exc, "status_code", None) == 400:
-                        logger.warning(
-                            "OpenAI chat_with_tools 400 (tag=%s): %s", iter_tag, exc
-                        )
-                        self._record(iter_tag, model, json.dumps(convo)[:2000], "", max_tokens, {})
-                        return {
-                            "final_text": "",
-                            "tool_calls": recorded_calls,
-                            "iterations": iterations,
-                            "stopped_reason": "error",
-                        }
-                    if attempt < 2:
-                        wait = 5 * (2 ** attempt)
-                        logger.warning(
-                            "chat_with_tools failed (attempt %d/3, tag=%s): %s. Retrying in %ds.",
-                            attempt + 1, iter_tag, exc, wait,
-                        )
-                        time.sleep(wait)
-
-            if response is None:
-                logger.warning("All chat_with_tools retries exhausted (tag=%s): %s", iter_tag, last_exc)
-                stopped_reason = "error"
-                break
-
-            msg = response.choices[0].message
-            content = msg.content or getattr(msg, "reasoning_content", "") or ""
-            tool_calls = getattr(msg, "tool_calls", None) or []
-
-            u = getattr(response, "usage", None)
-            details = getattr(u, "completion_tokens_details", None)
-            usage = {
-                "input_tokens": getattr(u, "prompt_tokens", 0) or 0,
-                "output_tokens": getattr(u, "completion_tokens", 0) or 0,
-                "reasoning_tokens": (getattr(details, "reasoning_tokens", 0) or 0) if details else 0,
-            }
-            self._record(
-                iter_tag,
-                model,
-                json.dumps(convo)[:2000],
-                json.dumps({"content": content, "tool_calls_count": len(tool_calls)}),
-                max_tokens,
-                usage,
-            )
-
-            if not tool_calls:
-                final_text = content
-                stopped_reason = "no_tool_calls"
-                break
-
-            assistant_msg: Dict[str, Any] = {
-                "role": "assistant",
-                "content": content,
-                "tool_calls": [
-                    {
-                        "id": tc.id,
-                        "type": "function",
-                        "function": {
-                            "name": tc.function.name,
-                            "arguments": tc.function.arguments,
-                        },
-                    }
-                    for tc in tool_calls
-                ],
-            }
-            convo.append(assistant_msg)
-
-            for tc in tool_calls:
-                try:
-                    result = on_tool_call(tc)
-                    ok = True
-                except Exception as exc:
-                    result = json.dumps({"error": f"on_tool_call raised: {exc}"})
-                    ok = False
-                args_preview = (tc.function.arguments or "")[:200]
-                result_preview = (result or "")[:200]
-                recorded_calls.append(
-                    {
-                        "name": tc.function.name,
-                        "args_preview": args_preview,
-                        "result_preview": result_preview,
-                        "ok": ok,
-                    }
-                )
-                convo.append(
-                    {
-                        "role": "tool",
-                        "tool_call_id": tc.id,
-                        "content": result,
-                    }
-                )
-
-            if it >= max_tool_iters - 1:
-                stopped_reason = "max_iters"
-                final_text = content
-                break
-
-        return {
-            "final_text": final_text,
-            "tool_calls": recorded_calls,
-            "iterations": iterations,
-            "stopped_reason": stopped_reason,
-        }
-```
-
-- [ ] **Step 5.2: Smoke-test the method does not break import**
-
-Run: `python -c "from symbolic_agent.baselines.trove.llm import TroVELLMClient; print(hasattr(TroVELLMClient, 'chat_with_tools'))"`
-
-Expected: `True`.
-
-- [ ] **Step 5.3: Smoke-test the Anthropic guard fires**
-
-Run: `python -c "from symbolic_agent.baselines.trove.llm import TroVELLMClient; c = TroVELLMClient(backend='anthropic', api_key='unused')
-try:
-    c.chat_with_tools([], [], model='x', on_tool_call=lambda x: '')
-    print('no exception (BUG)')
-except NotImplementedError as e:
-    print('guard fires:', e)"`
-
-Expected: `guard fires: chat_with_tools requires the openai backend`.
-
-- [ ] **Step 5.4: Commit**
-
-```bash
-git add symbolic_agent/baselines/trove/llm.py
-git commit -m "$(cat <<'EOF'
-feat(trove): add TroVELLMClient.chat_with_tools for native tool calls
-
-Multi-turn loop that handles tool_calls returned by gpt-oss/vLLM:
-appends assistant message + tool result messages until the model returns
-no tool_calls or max_tool_iters is reached. Records each call as
-{name, args_preview, result_preview, ok} for downstream telemetry.
-Reuses the existing 3-attempt retry, debug logging, and token accounting.
-
-Anthropic backend raises NotImplementedError as a defensive guard;
-controllers branch on self.backend == "openai" before calling.
-EOF
-)"
-```
-
----
-
-## Task 6: Controller IMPORT-with-tools branch + telemetry fields
-
-**Files:**
-- Modify: `symbolic_agent/baselines/trove/controller.py`
-
-- [ ] **Step 6.1: Update imports and `__init__` signature**
-
-In `symbolic_agent/baselines/trove/controller.py`, replace the imports block at the top (currently lines 36-44) with:
-
-```python
-import logging
-from collections import Counter
-from typing import Callable, Dict, List, Optional
-
-from . import tools_api
-from .executor import run_solution
-from .llm import TroVELLMClient
-from .parse import count_ast_nodes, imported_callsites, parse_response
-from .prompts import (
-    build_create_prompt,
-    build_import_prompt,
-    build_import_with_tools_prompt,
-    build_skip_prompt,
-    get_question,
-)
-from .toolbox import TroVEToolbox
-```
-
-Then update `TroVEController.__init__` (currently around lines 78-105) to accept the new parameters:
-
-```python
-    def __init__(
-        self,
-        api_key: Optional[str] = None,
-        model: str = "claude-sonnet-4-5",
-        base_url: Optional[str] = None,
-        debug_dir: Optional[str] = None,
-        k: int = DEFAULT_K,
-        trim_every: int = DEFAULT_TRIM_EVERY,
-        trim_C: float = 1.0,
-        temperature: float = 0.3,
-        top_p: float = 0.95,
-        task_family: str = "default",
-        selection: str = "reward",
-        max_tool_iters: int = 8,
-        tool_schema_topk: int = 10,
-    ):
-        self.model = model
-        self.k = k
-        self.trim_every = trim_every
-        self.trim_C = trim_C
-        self.task_family = task_family
-        self.selection = selection
-        self.max_tool_iters = max_tool_iters
-        self.tool_schema_topk = tool_schema_topk
-
-        self.backend = "openai" if base_url else "anthropic"
-        self.llm = TroVELLMClient(
-            backend=self.backend,
-            base_url=base_url,
-            api_key=api_key,
-            temperature=temperature,
-            top_p=top_p,
-            debug_dir=debug_dir,
-        )
-        self.toolbox = TroVEToolbox()
-        self._n_processed: int = 0
-```
-
-(Note: the `trim_C` default is now 1.0 to match the toolbox change in Task 1, so controllers that rely on the default get the new behavior.)
-
-- [ ] **Step 6.2: Update existing build_* call-sites to pass `task_family`**
-
-In `_multi_way_generation`, find each call to the legacy `build_import_prompt(question, toolbox_str)`, to `build_create_prompt(question)`, and to `build_skip_prompt(question)`, and replace them, respectively, with:
-
-```python
-                prompt = build_import_prompt(question, toolbox_str, task_family=self.task_family)
-```
-
-```python
-            prompt = build_create_prompt(question, task_family=self.task_family)
-```
-
-```python
-            prompt = build_skip_prompt(question, task_family=self.task_family)
-```
-
-Also update `parse_response(raw)` calls to `parse_response(raw, task_family=self.task_family)`.
-
-- [ ] **Step 6.3: Insert the IMPORT-with-tools branch in `_multi_way_generation`**
-
-Locate the `# --- IMPORT mode ---` section (currently around lines 254-274). Replace it with:
-
-```python
-        # --- IMPORT mode ---
-        toolbox_nonempty = bool(toolbox_str)
-        use_tools_branch = toolbox_nonempty and self.backend == "openai"
-
-        if use_tools_branch:
-            import_candidates = self._generate_import_with_tools(
-                question, example_idx, reward_fn=reward_fn, entry=entry
-            )
-            best_import_idx, best_import_score = self._select_best(
-                import_candidates, reward_fn=reward_fn, entry=entry
-            )
-            best_import = import_candidates[best_import_idx]
-            best_import["_reward_score"] = best_import_score
-        elif toolbox_nonempty:
-            # Legacy text-based IMPORT (Anthropic or unforeseen non-OpenAI path).
-            import_candidates = []
-            for _ in range(self.k):
-                prompt = build_import_prompt(question, toolbox_str, task_family=self.task_family)
-                raw = self.llm.call(prompt, self.model, max_tokens=DEFAULT_MAX_TOKENS, tag="trove_import")
-                parsed = parse_response(raw, task_family=self.task_family)
-                is_ok, out = run_solution(
-                    parsed["solution_code"],
-                    parsed["tools_code"],
-                    self.toolbox.get_full_code(),
-                )
-                import_candidates.append(
-                    {**parsed, "is_success": is_ok, "exec_output": out, "tool_calls": [], "stopped_reason": "legacy"}
-                )
-            best_import_idx, best_import_score = self._select_best(
-                import_candidates, reward_fn=reward_fn, entry=entry
-            )
-            best_import = import_candidates[best_import_idx]
-            best_import["_reward_score"] = best_import_score
-        else:
-            best_import = {
-                "solution_code": "", "tools_code": "", "functions": [],
-                "is_success": False, "exec_output": "",
-                "tool_calls": [], "stopped_reason": "empty_toolbox",
-                "_reward_score": None,
-            }
-```
-
-- [ ] **Step 6.4: Add the `_generate_import_with_tools` method**
-
-Insert this new method into the `TroVEController` class, after `_multi_way_generation`:
-
-```python
-    def _generate_import_with_tools(
-        self,
-        question: str,
-        example_idx: int,
-        reward_fn: Optional[Callable] = None,
-        entry: Optional[dict] = None,
-    ) -> List[dict]:
-        """
-        IMPORT-mode generation using native OpenAI tool calling.
-        Builds K trajectories; each trajectory may invoke toolbox functions
-        via tool_calls during the multi-turn loop. Returns K candidate dicts
-        compatible with _select_best.
-        """
-        prompt = build_import_with_tools_prompt(question, task_family=self.task_family)
-        tools_schema = tools_api.toolbox_to_openai_tools(self.toolbox, topk=self.tool_schema_topk)
-
-        candidates: List[dict] = []
-        for i in range(self.k):
-            tag = f"trove_import_t{example_idx}_{i}"
-            messages = [{"role": "user", "content": prompt}]
-            on_tc = lambda tc: tools_api.dispatch_tool_call(self.toolbox, tc)
-            traj = self.llm.chat_with_tools(
-                messages=messages,
-                tools=tools_schema,
-                model=self.model,
-                max_tokens=DEFAULT_MAX_TOKENS,
-                max_tool_iters=self.max_tool_iters,
-                on_tool_call=on_tc,
-                tag=tag,
-            )
-            parsed = parse_response(traj["final_text"], task_family=self.task_family)
-            is_ok, out = run_solution(
-                parsed["solution_code"],
-                parsed["tools_code"],
-                self.toolbox.get_full_code(),
-            )
-            candidates.append(
-                {
-                    **parsed,
-                    "is_success": is_ok,
-                    "exec_output": out,
-                    "tool_calls": traj["tool_calls"],
-                    "stopped_reason": traj["stopped_reason"],
-                    "iterations": traj["iterations"],
-                }
-            )
-        return candidates
-```
-
-- [ ] **Step 6.5: Wire `selection="consistency"` to the existing consistency selector**
-
-Replace `_select_best` (currently around lines 337-361) with:
-
-```python
-    def _select_best(
-        self,
-        candidates: List[dict],
-        reward_fn: Optional[Callable] = None,
-        entry: Optional[dict] = None,
-    ):
-        """
-        Select the best candidate from a list of response dicts.
-
-        Returns (best_index, score_or_None) where score is (reward, message)
-        when reward-based selection is used, or None otherwise.
-
-        Selection strategy is governed by self.selection:
-          - "reward" (default): reward-based when reward_fn+entry provided,
-            falls back to consistency when not.
-          - "consistency": original TroVE majority-vote algorithm.
-        """
-        if self.selection == "consistency":
-            return self._select_best_by_consistency(candidates), None
-        if reward_fn is not None and entry is not None:
-            return self._select_best_by_reward(candidates, reward_fn, entry)
-        return self._select_best_by_consistency(candidates), None
-```
-
-- [ ] **Step 6.6: Update `_update_library` to credit frequency from tool_calls**
-
-Replace `_update_library` (currently around lines 419-432) with:
-
-```python
-    def _update_library(self, mode: str, resp: dict, example_idx: int) -> None:
-        """Update toolbox based on winning mode (faithful to run_trove.py)."""
-        if mode == "import":
-            tool_calls = resp.get("tool_calls") or []
-            if tool_calls:
-                # Native tool-calling path: credit by unique tool_call.function.name
-                # (defensive: sanitize and let toolbox.update_frequency filter unknowns).
-                unique_names = {
-                    tc["name"].split("<|", 1)[0].strip()
-                    for tc in tool_calls
-                    if tc.get("name")
-                }
-                for name in unique_names:
-                    if name:
-                        self.toolbox.update_frequency(name, example_idx)
-            else:
-                # Legacy text-based IMPORT: credit functions parsed from **Tools**.
-                for func_dict in resp.get("functions", []):
-                    name = func_dict.get("name", "")
-                    if name:
-                        self.toolbox.update_frequency(name, example_idx)
-        elif mode == "create" and resp.get("is_success"):
-            for func_dict in resp.get("functions", []):
-                self.toolbox.add(func_dict, example_idx)
-
-        # SKIP: no library changes
-```
-
-- [ ] **Step 6.7: Add telemetry fields to `_make_result`**
-
-Replace `_make_result` (currently around lines 438-480) with:
-
-```python
-    def _make_result(
-        self,
-        task_input: dict,
-        task_type: str,
-        best_mode: str,
-        best_resp: dict,
-        is_success: bool,
-        output: str,
-        best_reward_score=None,
-    ) -> dict:
-        """
-        Build a result dict compatible with main.py's _print_result() and
-        _append_task_output(). Adds passive TroVE telemetry fields.
-        """
-        tool_calls = best_resp.get("tool_calls") or []
-        tools_called = sorted({
-            tc["name"].split("<|", 1)[0].strip()
-            for tc in tool_calls
-            if tc.get("name")
-        })
-        candidate_names = {e["name"] for e in self.toolbox.snapshot()}
-        actually_called = sorted(
-            imported_callsites(
-                solution_code=best_resp.get("solution_code", ""),
-                tools_code=best_resp.get("tools_code", ""),
-                candidate_names=candidate_names,
-            )
-        )
-        import_eligible = len(self.toolbox) > 0  # state AFTER this task's update
-        # Note: import_eligible reflects the current toolbox state after
-        # _update_library has already run for this task. The analyzer should
-        # interpret this as "a non-empty toolbox existed at some point during
-        # this task's processing". For pre-task eligibility, infer from
-        # toolbox snapshots in adjacent tasks.
-
-        return {
-            "task_type": task_type,
-            "original_prompt": str(task_input),
-            "solved": is_success,
-            "steps": 1,
-            "trace": [
-                {
-                    "step": 0,
-                    "agent": "trove",
-                    "action": best_mode,
-                    "is_success": is_success,
-                }
-            ],
-            "solution": best_resp.get("solution_code", ""),
-            "library_snapshot": self.toolbox.snapshot(),
-            "cost_summary": {},
-            "final_output": {
-                "answer": output,
-                "explanation": f"TroVE mode={best_mode}",
-                "confidence": "high" if is_success else "low",
-                "execution_result": output,
-            },
-            "agent_messages": self.llm.get_task_log(),
-            "reward_history": [],
-            "best_reward": None,
-            "final_reward": None,
-            "_best_reward_score": best_reward_score,
-            # TroVE native-tool-calling telemetry
-            "won_mode": best_mode,
-            "import_eligible": import_eligible,
-            "import_was_winner": best_mode == "import",
-            "tool_calls": tool_calls,
-            "tool_call_count": len(tool_calls),
-            "tools_called": tools_called,
-            "actually_called": actually_called,
-            "trove_stopped_reason": best_resp.get("stopped_reason", ""),
-        }
-```
-
-- [ ] **Step 6.8: Sanity-check the controller imports and constructs**
-
-Run: `python -c "from symbolic_agent.baselines.trove.controller import TroVEController; c = TroVEController(api_key='unused', model='x', task_family='pbebench', selection='reward'); print(c.task_family, c.selection, c.backend, c.max_tool_iters, c.tool_schema_topk)"`
-
-Expected: `pbebench reward anthropic 8 10`.
-
-- [ ] **Step 6.9: Run all tests to confirm no regressions**
-
-Run: `python -m pytest symbolic_agent/baselines/trove/tests/ -v`
-
-Expected: 16 passed (6 from parse_callsites + 10 from tools_api).
-
-If the count differs, verify it against the tests that were actually added in the earlier tasks.
-
-- [ ] **Step 6.10: Commit**
-
-```bash
-git add symbolic_agent/baselines/trove/controller.py
-git commit -m "$(cat <<'EOF'
-feat(trove): controller branch for native IMPORT tool calling
-
-- Add task_family and selection params to TroVEController.__init__.
-- IMPORT branch dispatches to _generate_import_with_tools when toolbox
-  is non-empty and backend is openai; otherwise falls back to legacy
-  text-based IMPORT.
-- _generate_import_with_tools builds K multi-turn trajectories via
-  TroVELLMClient.chat_with_tools, parses **Solution** strictly for
-  pbebench, and runs the result through the executor.
-- _update_library credits frequency by unique tool_call.function.name
-  for the native path; legacy path still credits parsed functions.
-- _make_result emits won_mode, import_eligible, import_was_winner,
-  tool_calls, tool_call_count, tools_called, actually_called,
-  trove_stopped_reason as passive telemetry.
-- _select_best honors selection="consistency" or "reward" (default).
-EOF
-)"
-```
-
----
-
-## Task 7: `main.py` CLI flags (`--trove-selection`, `--trove-task-family`)
-
-**Files:**
-- Modify: `main.py:794-810` (add new flags) and `main.py:1002-1011` (pass through to controller)
-
-- [ ] **Step 7.1: Add the two new argparse flags**
-
-In `main.py`, after the existing `--trove-trim-every` argument (around line 810), insert:
-
-```python
-    parser.add_argument(
-        "--trove-selection",
-        choices=["reward", "consistency"],
-        default="reward",
-        help="[TroVE] Candidate selection strategy. 'reward' (default) uses "
-             "the per-task reward function with AST tie-breaking. "
-             "'consistency' uses the original TroVE majority-vote algorithm. "
-             "(default: reward)",
-    )
-    parser.add_argument(
-        "--trove-task-family",
-        choices=["default", "pbebench"],
-        default="default",
-        help="[TroVE] Task family for prompt selection and parser strictness. "
-             "'pbebench' uses PBEBench-shaped few-shots and strict **Solution** "
-             "parsing (no fallback to any python block). (default: default)",
-    )
-```
-
-- [ ] **Step 7.2: Plumb the flags into the `TroVEController` constructor**
-
-Find the `elif args.framework == "trove":` block (around line 1002) and replace the `controller = TroVEController(...)` call with:
-
-```python
-    elif args.framework == "trove":
-        controller = TroVEController(
-            api_key=api_key,
-            model=model,
-            base_url=base_url,
-            debug_dir=args.debug_dir,
-            k=args.trove_k,
-            trim_every=args.trove_trim_every,
-            task_family=args.trove_task_family,
-            selection=args.trove_selection,
-        )
-        logger.info(
-            "Framework: TroVE (k=%d, trim_every=%d, task_family=%s, selection=%s)",
-            args.trove_k, args.trove_trim_every, args.trove_task_family, args.trove_selection,
-        )
-```
-
-- [ ] **Step 7.3: Sanity-check the CLI parses both flags**
-
-Run: `python main.py --help 2>&1 | grep -E "trove-selection|trove-task-family"`
-
-Expected: two lines, one for each new flag, both showing the choices and defaults.
-
-- [ ] **Step 7.4: Sanity-check controller wires through**
-
-Construct an empty tasks file so the run finishes immediately after parsing args:
-
-```bash
-echo '[]' > /tmp/_pbebench_empty.json
-VLLM_API_KEY=EMPTY python main.py \
-  --framework trove \
-  --trove-task-family pbebench \
-  --trove-selection reward \
-  --tasks-file /tmp/_pbebench_empty.json \
-  --model openai/gpt-oss-20b \
-  --backend vllm \
-  --base-url http://localhost:8000/v1 \
-  2>&1 | grep -E "Framework: TroVE|ERROR" | head -5
-```
-
-Expected: `Framework: TroVE (k=5, trim_every=500, task_family=pbebench, selection=reward)`, followed by an `ERROR: no records found` from the loader. Together, these confirm that the flags parsed and the controller was constructed.
-
-- [ ] **Step 7.5: Commit**
-
-```bash
-git add main.py
-git commit -m "$(cat <<'EOF'
-feat(trove): CLI flags --trove-selection and --trove-task-family
-
-- --trove-selection {reward,consistency} (default: reward).
-- --trove-task-family {default,pbebench} (default: default). Plumbed
-  through to TroVEController; PBEBench runs should pass --trove-task-family
-  pbebench to enable PBEBench-shaped few-shots and strict **Solution**
-  parsing.
-EOF
-)"
-```
-
----
-
-## Task 8: Update vLLM launcher script with tool-calling flags
-
-**Files:**
-- Modify: `scripts/launch_vllm_gpt_oss_120b.sh`
-
-- [ ] **Step 8.1: Add the three vLLM flags**
-
-Replace the body of `scripts/launch_vllm_gpt_oss_120b.sh` with:
-
-```bash
-#!/bin/bash
-
-mkdir -p /tmp/$USER-tiktoken-cache /tmp/$USER-tmp vllm_logs  # vllm_logs must exist for the log/pid redirects below
-chmod 700 /tmp/$USER-tiktoken-cache /tmp/$USER-tmp
-export TIKTOKEN_CACHE_DIR=/tmp/$USER-tiktoken-cache
-export TMPDIR=/tmp/$USER-tmp
-
-ts=$(date +%Y%m%d_%H%M%S)
-
-# Required vLLM tool-calling flags (vLLM >= v0.16.0 for PR #28729):
-#   --enable-auto-tool-choice  enables tool_choice="auto"
-#   --tool-call-parser openai  parses gpt-oss Harmony commentary channel
-#   --reasoning-parser openai_gptoss  routes analysis-channel content into
-#                                     message.reasoning_content
-nohup python -m vllm.entrypoints.openai.api_server \
-  --model "openai/gpt-oss-120b" \
-  --tokenizer "openai/gpt-oss-120b" \
-  --dtype auto \
-  --port ${1} \
-  --gpu-memory-utilization 0.95 \
-  --tensor-parallel-size 2 \
-  --enable-auto-tool-choice \
-  --tool-call-parser openai \
-  --reasoning-parser openai_gptoss \
-  > vllm_logs/vllm_${1}_${ts}.log 2>&1 & echo $! > vllm_logs/vllm_${1}_${ts}.pid
-```
-
-- [ ] **Step 8.2: Lint the script**
-
-Run: `bash -n scripts/launch_vllm_gpt_oss_120b.sh && echo OK`
-
-Expected: `OK`.
-
-- [ ] **Step 8.3: Commit**
-
-```bash
-git add scripts/launch_vllm_gpt_oss_120b.sh
-git commit -m "$(cat <<'EOF'
-chore(launcher): enable native tool calling for gpt-oss-120b vLLM server
-
-Add three flags required for OpenAI-compatible tool calling on gpt-oss
-served by vLLM >= v0.16.0:
-  --enable-auto-tool-choice
-  --tool-call-parser openai
-  --reasoning-parser openai_gptoss
-
-Without these the controller's chat_with_tools loop sees no tool_calls
-in the response and degrades to no-tool behavior.
-EOF
-)"
-```
-
----
-
-## Task 9: `scripts/analyze_trove_run.py`
-
-**Files:**
-- Create: `scripts/analyze_trove_run.py`
-
-- [ ] **Step 9.1: Create the analysis script**
-
-Create `scripts/analyze_trove_run.py`:
-
-```python
-#!/usr/bin/env python3
-"""Post-hoc analysis of a TroVE run JSONL output.
-
-Reads the per-task JSONL file produced by main.py --output-file and reports:
-  - Overall accuracy
-  - Final toolbox size
-  - Per-mode wins
-  - IMPORT-mode tool-use breakdown
-  - Top-10 most-called toolbox functions
-
-Usage:
-    python scripts/analyze_trove_run.py path/to/results.jsonl
-"""
-
-from __future__ import annotations
-
-import argparse
-import json
-import sys
-from collections import Counter
-from pathlib import Path
-
-
-def _load_rows(path: Path) -> list[dict]:
-    rows = []
-    with path.open() as f:
-        for lineno, line in enumerate(f, 1):
-            line = line.strip()
-            if not line:
-                continue
-            try:
-                rows.append(json.loads(line))
-            except json.JSONDecodeError as exc:
-                print(f"warning: line {lineno} is not valid JSON: {exc}", file=sys.stderr)
-    return rows
-
-
-def _result_dict(row: dict) -> dict:
-    """Tolerant accessor: results are nested under 'result' in main.py's output."""
-    return row.get("result") or row
-
-
-def main() -> None:
-    parser = argparse.ArgumentParser(description=__doc__)
-    parser.add_argument("path", type=Path, help="Path to the TroVE results JSONL file")
-    args = parser.parse_args()
-
-    rows = _load_rows(args.path)
-    if not rows:
-        print("ERROR: no rows loaded", file=sys.stderr)
-        sys.exit(1)
-
-    n = len(rows)
-    results = [_result_dict(r) for r in rows]
-
-    # Overall accuracy
-    solved = sum(1 for r in results if r.get("solved"))
-    print(f"=== Run summary: {args.path.name} ===")
-    print(f"Tasks: {n}")
-    print(f"Solved: {solved}/{n} ({100 * solved / n:.1f}%)")
-
-    # Final toolbox size — take the snapshot from the last row.
-    last_snapshot = results[-1].get("library_snapshot") or []
-    print(f"Final toolbox size: {len(last_snapshot)}")
-
-    # Per-mode wins
-    mode_counter = Counter(r.get("won_mode", "?") for r in results)
-    print(f"Mode wins: {dict(mode_counter)}")
-
-    # IMPORT-mode tool-use breakdown
-    import_eligible = [r for r in results if r.get("import_eligible")]
-    if not import_eligible:
-        print("No IMPORT-eligible tasks observed.")
-    else:
-        with_calls = [r for r in import_eligible if (r.get("tool_call_count") or 0) >= 1]
-        n_eligible = len(import_eligible)
-        n_with = len(with_calls)
-        mean_calls = (
-            sum((r.get("tool_call_count") or 0) for r in import_eligible) / n_eligible
-        )
-        all_calls = [tc for r in import_eligible for tc in (r.get("tool_calls") or [])]
-        n_calls_total = len(all_calls)
-        n_calls_ok = sum(1 for tc in all_calls if tc.get("ok"))
-        success_rate = (100 * n_calls_ok / n_calls_total) if n_calls_total else 0.0
-        print(
-            f"IMPORT-eligible tasks: {n_eligible}\n"
-            f"  Tasks with >=1 tool call: {n_with}/{n_eligible} ({100 * n_with / n_eligible:.1f}%)\n"
-            f"  Mean tool calls / task:   {mean_calls:.2f}\n"
-            f"  Tool-call success rate:   {n_calls_ok}/{n_calls_total} ({success_rate:.1f}%)"
-        )
-
-    # Top-10 most-called functions
-    name_counter: Counter = Counter()
-    for r in results:
-        for tc in r.get("tool_calls") or []:
-            name = (tc.get("name") or "").split("<|", 1)[0].strip()
-            if name:
-                name_counter[name] += 1
-    if name_counter:
-        print("Top-10 most-called toolbox functions:")
-        for name, cnt in name_counter.most_common(10):
-            print(f"  {cnt:4d}  {name}")
-    else:
-        print("No tool calls recorded in this run.")
-
-
-if __name__ == "__main__":
-    main()
-```
-
-- [ ] **Step 9.2: Make the script executable and lint-check**
-
-Run: `chmod +x scripts/analyze_trove_run.py && python -c "import ast; ast.parse(open('scripts/analyze_trove_run.py').read())" && echo OK`
-
-Expected: `OK`.
-
-- [ ] **Step 9.3: Smoke-test on synthetic data**
-
-Run:
-
-```bash
-python -c "
-import json, tempfile, subprocess
-rows = [
-    {'result': {'solved': True,  'won_mode': 'import', 'import_eligible': True,  'tool_call_count': 2, 'tool_calls': [{'name':'find_replace_chain','ok':True},{'name':'find_replace_chain','ok':True}], 'library_snapshot':[{'name':'find_replace_chain'}]}},
-    {'result': {'solved': False, 'won_mode': 'create', 'import_eligible': False, 'tool_call_count': 0, 'tool_calls': [], 'library_snapshot':[{'name':'find_replace_chain'}]}},
-]
-with tempfile.NamedTemporaryFile('w', suffix='.jsonl', delete=False) as f:
-    for r in rows: f.write(json.dumps(r) + '\n')
-    p = f.name
-print(subprocess.check_output(['python','scripts/analyze_trove_run.py', p]).decode())
-"
-```
-
-Expected output contains `Solved: 1/2 (50.0%)`, `Final toolbox size: 1`, `Mode wins: {'import': 1, 'create': 1}`, `IMPORT-eligible tasks: 1`, `Tool-call success rate: 2/2 (100.0%)`, and a row `2  find_replace_chain` in the top-10.
-
-- [ ] **Step 9.4: Commit**
-
-```bash
-git add scripts/analyze_trove_run.py
-git commit -m "$(cat <<'EOF'
-feat(trove): add analyze_trove_run.py for post-hoc telemetry reports
-
-Reads a TroVE JSONL output and reports overall accuracy, final toolbox
-size, per-mode wins, IMPORT-mode tool-use breakdown (>=1 call rate,
-mean calls/task, success rate), and the top-10 most-called toolbox
-functions. Sanitizes Harmony control-token contamination in tool names
-when aggregating.
-EOF
-)"
-```
-
----
-
-## Task 10: Rewrite `docs/deviations.md`
-
-**Files:**
-- Create: `symbolic_agent/baselines/trove/docs/deviations.md`
-
-- [ ] **Step 10.1: Create the directory and the deviations doc**
-
-Create `symbolic_agent/baselines/trove/docs/deviations.md` (create the `docs/` directory first if it does not exist):
-
-```markdown
-# TroVE Implementation: Deviations and Faithful Elements
-
-This document tracks how this port differs from — and where it stays
-faithful to — the original TroVE algorithm
-([Wang et al., 2024](https://arxiv.org/abs/2401.12869),
-[zorazrw/trove](https://github.com/zorazrw/trove)).
-
-## 1. Algorithmic deviations
-
-### 1.1 Native OpenAI tool calling for IMPORT mode
-The original TroVE shows the model a `**Toolbox**` markdown block
-listing top-k function signatures and asks it to write a `**Solution**`
-plus `**Tools**` block referencing those functions by name. We replace
-this for the IMPORT mode (when `backend == "openai"` and the toolbox is
-non-empty) with **native OpenAI tool calling**: the toolbox is exposed
-via the `tools=[...]` parameter of `chat.completions.create`, the model
-emits structured `tool_calls` during its reasoning, and `dispatch_tool_call`
-runs each one in the sandboxed executor and returns the stdout. This
-makes function usage observable and credit-able from the trajectory
-itself.
-
-### 1.2 Reward-based candidate selection (default)
-The paper uses self-consistency (majority vote on stdout, AST tie-break)
-to pick the best of K samples per mode. We default to **reward-based
-selection**: every candidate is scored by the per-task reward function,
-ties broken by minimum AST node count. This is more reliable on
-PBEBench (program-list outputs rarely tie as strings). The original
-self-consistency selector remains available via `--trove-selection consistency`.
-
-### 1.3 PBEBench-shaped few-shot examples
-For `task_family="pbebench"` we replace the generic CREATE / SKIP / IMPORT
-example pairs with PBEBench-shaped pairs that demonstrate `replace()`
-chains and a small reusable helper (`find_replace_chain`). The legacy
-default examples remain for `task_family="default"`.
-
-### 1.4 Strict **Solution** parsing for PBEBench
-The legacy parser falls back to "first ```python``` block anywhere" when
-no `**Solution**` block is present. For `task_family="pbebench"` this
-fallback is disabled, preventing CoT scratchpad from being accidentally
-promoted to the answer.
-
-## 2. Faithful elements
-
-- 3-mode generation (IMPORT, CREATE, SKIP).
-- K samples per mode (default K=5, paper).
-- AST-tie-breaking by node count (simplest solution wins).
-- Periodic toolbox trimming with threshold `C·log_{20}(n)`, default
-  `C=1.0`, matching the original implementation.
-- Frequency-based top-k retrieval for the toolbox view.
-- Dict-keyed toolbox structure mirroring `utils/code.py`.
-- Library updates: IMPORT credits frequency, CREATE adds new functions
-  on success, SKIP makes no library changes.
-
-## 3. Infrastructural patches
-
-- **JSONL-per-task checkpointing** via `--output-file`, with crash
-  resumption.
-- **`reasoning_content` fallback** in `_call_openai` for `gpt-oss` Harmony
-  channel splits where the answer text lives in `message.reasoning_content`.
-- **Executor timeout 60s** (vs. 10s in earlier versions of this port),
-  closer to the original's ~100s.
-- **`<|`-truncation sanitizer** in `dispatch_tool_call` and
-  `_update_library`. Defensive workaround for the open vLLM
-  [PR #35906](https://github.com/vllm-project/vllm/pull/35906) covering
-  Harmony control-token leakage into tool names. When that PR lands
-  upstream the sanitizer becomes a no-op and is left in place.
-
-## 4. Backend coverage caveat
-
-Anthropic backend code paths exist and are exercised by CREATE / SKIP and
-the legacy text-based IMPORT fallback, but **the smoke run and reported
-numbers are vLLM-served `gpt-oss` only**. IMPORT-with-tools requires
-the OpenAI/vLLM backend and is the only path we test end-to-end.
-
-## 5. vLLM version requirement
-
-- Minimum vLLM: **v0.16.0** (branch-cut 2026-02-08).
-- Required upstream change: [PR #28729](https://github.com/vllm-project/vllm/pull/28729)
-  ("Multiple fixes for gpt-oss Chat Completion prompting"), merged
-  2025-12-12. v0.16.0 is the first stable release branch-cut after the merge.
-- Known open caveat: [PR #35906](https://github.com/vllm-project/vllm/pull/35906)
-  ("Sanitize leaked Harmony control tokens"), still open as of late
-  March 2026 — see §3 for the sanitizer mitigation.
-```
-
-- [ ] **Step 10.2: Verify the file renders**
-
-Run: `head -20 symbolic_agent/baselines/trove/docs/deviations.md`
-
-Expected: the first 20 lines print, starting with the title `# TroVE Implementation: Deviations and Faithful Elements`.
-
-- [ ] **Step 10.3: Commit**
-
-```bash
-git add symbolic_agent/baselines/trove/docs/deviations.md
-git commit -m "$(cat <<'EOF'
-docs(trove): rewrite deviations.md for native tool calling
-
-Document algorithmic deviations (native OpenAI tool calling for IMPORT,
-reward-based selection by default, PBEBench-shaped few-shots, strict
-**Solution** parsing for pbebench), faithful elements (3-mode generation,
-K-sampling, AST tie-break, C*log_20(n) trimming with C=1.0), and
-infrastructural patches (JSONL checkpointing, reasoning_content
-fallback, 60s executor timeout, defensive <|-truncation sanitizer).
-
-Includes vLLM version requirement (>= v0.16.0 for PR #28729) and the
-backend coverage caveat (smoke run is vLLM-served gpt-oss only).
-EOF
-)"
-```
-
----
-
-## Task 11: Pre-flight sanity check + 50-task smoke run + report
-
-**Files:** none modified. This is the validation task.
-
-- [ ] **Step 11.1: Re-launch vLLM with the new flags**
-
-The existing launcher is named `launch_vllm_gpt_oss_120b.sh` but the spec calls for `gpt-oss-20b`. Two options — pick one:
-
-(a) **Smoke on 120b directly** (no script change beyond Task 8). Run:
-
-```bash
-bash scripts/launch_vllm_gpt_oss_120b.sh 8000
-```
-
-Then in Tasks 11.2 and 11.4, replace `--model openai/gpt-oss-20b` with `--model openai/gpt-oss-120b`.
-
-(b) **Smoke on 20b** (one-line edit). In `scripts/launch_vllm_gpt_oss_120b.sh`, change `openai/gpt-oss-120b` → `openai/gpt-oss-20b` for both `--model` and `--tokenizer`, and lower `--tensor-parallel-size 2` → `--tensor-parallel-size 1` (20b fits on one GPU). Then:
-
-```bash
-bash scripts/launch_vllm_gpt_oss_120b.sh 8000
-```
-
-(Do not commit the edit — restore the file before the final commit, or rename the script if you want the 20b variant kept.)
-
-Then wait 60–120 seconds and confirm the server is up:
-
-Run: `curl -sS http://localhost:8000/v1/models | head -5`
-
-Expected: a JSON response listing the model you launched.
-
-- [ ] **Step 11.2: Pre-flight: one-task smoke**
-
-Run a single task to verify the tool-calling round-trip works end-to-end. The codebase has no `--num-tasks` flag, so we slice the first row out of the 50-task PBEBench-Lite file:
-
-```bash
-mkdir -p outputs/trove_pbebench_preflight
-head -n 1 data/pbebench/lite_pilot_tasks.jsonl > /tmp/_pbebench_one.jsonl
-VLLM_API_KEY=EMPTY python main.py \
-  --framework trove \
-  --tasks-file /tmp/_pbebench_one.jsonl \
-  --output-file outputs/trove_pbebench_preflight/results.jsonl \
-  --model openai/gpt-oss-20b \
-  --backend vllm \
-  --base-url http://localhost:8000/v1 \
-  --trove-task-family pbebench \
-  --trove-selection reward \
-  --trove-k 3 \
-  --trove-trim-every 9999 \
-  --max-tokens 4096 \
-  --debug-dir outputs/trove_pbebench_preflight/debug
-```
-
-Expected: the run completes without crashing. The output file should contain one row.
-
-- [ ] **Step 11.3: Verify the tool-calling pre-flight check**
-
-This task starts with an empty toolbox so the IMPORT-with-tools branch will not run. Inspect the most recent debug-dir log file with `trove_create` or `trove_skip` in the name and confirm it contains a non-empty response:
-
-Run: `ls -t outputs/trove_pbebench_preflight/debug/trove_run_*/0001_*.json | head -1 | xargs python -c "import json,sys; d=json.load(open(sys.argv[1])); print('content length:', len(d['response']['content']))"`
-
-Expected: non-zero content length. If zero, the `reasoning_content` fallback (Task 1.3) is not engaging — debug before proceeding.
-
-- [ ] **Step 11.4: Run the 50-task smoke**
-
-`data/pbebench/lite_pilot_tasks.jsonl` is exactly 50 PBEBench-Lite tasks with per-task `reward: pbebench`, so no slicing or `--default-reward` flag is required.
-
-```bash
-mkdir -p outputs/trove_pbebench_smoke
-VLLM_API_KEY=EMPTY python main.py \
-  --framework trove \
-  --tasks-file data/pbebench/lite_pilot_tasks.jsonl \
-  --output-file outputs/trove_pbebench_smoke/results.jsonl \
-  --model openai/gpt-oss-20b \
-  --backend vllm \
-  --base-url http://localhost:8000/v1 \
-  --trove-task-family pbebench \
-  --trove-selection reward \
-  --trove-k 3 \
-  --trove-trim-every 9999 \
-  --max-tokens 4096 \
-  --debug-dir outputs/trove_pbebench_smoke/debug
-```
-
-Expected: ~30–60 minutes wall-clock on local vLLM. Run completes without crashes. Auto-resume from checkpoint is supported by `--output-file` if the run is interrupted.
-
-- [ ] **Step 11.5: Run the analysis script and capture the report**
-
-Run: `python scripts/analyze_trove_run.py outputs/trove_pbebench_smoke/results.jsonl | tee outputs/trove_pbebench_smoke/report.txt`
-
-Expected: the report shows accuracy, toolbox size, mode wins, IMPORT-mode tool-use breakdown, and top-10 functions.
-
-- [ ] **Step 11.6: Report numbers to the user (no prompt iteration)**
-
-Per the spec's "done criteria", report the contents of `outputs/trove_pbebench_smoke/report.txt` plus a short narrative paragraph noting any anomalies (e.g. `<|channel|>` contamination from PR #35906, `max_iters` stops, JSON-arg parse failures).
-
-**No prompt iteration. No threshold tuning. The numbers are what they are.**
-
----
-
-## Self-Review
-
-### 1. Spec coverage
-
-| Spec section | Implementing task |
-|---|---|
-| §3 Architecture overview | Tasks 1–8 collectively |
-| §4 Data flow for IMPORT-with-tools | Tasks 4–6 |
-| §5.1 New `tools_api.py` | Task 4 |
-| §5.2 `_call_openai` reasoning fallback | Task 1 |
-| §5.2 `chat_with_tools` method | Task 5 |
-| §5.3 Controller `__init__` params, IMPORT branch, `_update_library`, `_make_result` | Task 6 |
-| §5.4 `imported_callsites`, `task_family` in parse_response | Task 2 |
-| §5.5 PBEBench prompts and IMPORT-with-tools prompt | Task 3 |
-| §5.6 Trim `C=1.0` | Task 1 |
-| §5.7 Executor timeout 60s | Task 1 |
-| §5.8 main.py CLI flags | Task 7 |
-| §5.9 vLLM launcher flags | Task 8 |
-| §5.10 `analyze_trove_run.py` | Task 9 |
-| §5.11 deviations.md rewrite | Task 10 |
-| §6 Telemetry fields | Task 6.7 |
-| §7 Implementation defaults | Tasks 4–6 |
-| §8 Smoke run + done criteria | Task 11 |
-
-All sections accounted for.
-
-### 2. Placeholder scan
-
-No `TBD`, `TODO`, `implement later`, "appropriate", "various", or "fill in details" in any task. All test code is fully written (not "write tests for the above"). All file paths are exact. All commit messages are pre-written.
-
-### 3. Type and signature consistency
-
-- `imported_callsites(solution_code, tools_code, candidate_names)` — defined in Task 2, called in Task 6.7 with matching kwargs.
-- `toolbox_to_openai_tools(toolbox, topk=10)` — defined in Task 4, called in Task 6.4.
-- `dispatch_tool_call(toolbox, tool_call) -> str` — defined in Task 4, called via the `on_tc` closure in Task 6.4.
-- `chat_with_tools(messages, tools, model, max_tokens, max_tool_iters, on_tool_call, tag)` — defined in Task 5, called in Task 6.4 with matching kwargs.
-- `build_import_with_tools_prompt(question, task_family)` — defined in Task 3, called in Task 6.4.
-- `build_import_prompt(question, toolbox_str, task_family)` — extended in Task 3, called in Task 6.3.
-- `parse_response(text, task_family)` — extended in Task 2, called in Tasks 6.3 and 6.4.
-- `TroVEController(__init__)` new params (`task_family`, `selection`, `max_tool_iters`, `tool_schema_topk`) — defined in Task 6.1, passed in Task 7.2 (only `task_family` and `selection` from CLI; the other two use defaults, which matches the spec's defaults table).
-
-All consistent.
-
-### 4. Plan quirks worth noting to the executor
-
-- Task 11.4 relies on `data/pbebench/lite_pilot_tasks.jsonl` having at least 50 tasks. If it has fewer, swap to whichever PBEBench-Lite tasks file is available (the spec calls for "50 PBEBench-Lite tasks"; the exact filename is not load-bearing).
-- Task 11.5 `tee` output captures the report for the user-facing message in 11.6.
-- The `import_eligible` field in `_make_result` is computed *after* `_update_library` runs for the current task. The doc-comment in Task 6.7 explains the consequence; the analyzer in Task 9 doesn't depend on the pre-task value.
-- Task 6.5's `_select_best` change wraps the existing reward/consistency selectors. When `selection="consistency"` is set, the `reward_fn` and `entry` arguments are ignored — that is intentional and matches the user's choice to keep both flags as opt-ins.
-
----
-
-## Execution Handoff
-
-Plan complete and saved to `docs/superpowers/plans/2026-04-25-trove-native-tool-calling.md`. Two execution options:
-
-**1. Subagent-Driven (recommended)** — I dispatch a fresh subagent per task, review between tasks, fast iteration.
-
-**2. Inline Execution** — Execute tasks in this session using executing-plans, batch execution with checkpoints.
-
-Which approach?
diff --git a/docs/superpowers/specs/2026-04-25-trove-native-tool-calling-design.md b/docs/superpowers/specs/2026-04-25-trove-native-tool-calling-design.md
deleted file mode 100644
index ff0fc0a1..00000000
--- a/docs/superpowers/specs/2026-04-25-trove-native-tool-calling-design.md
+++ /dev/null
@@ -1,374 +0,0 @@
-# TroVE Native Tool Calling — Design Spec
-
-**Date:** 2026-04-25
-**Branch:** `trove_baseline`
-**Status:** Approved (sectional review complete; self-review applied)
-
----
-
-## 1. Problem statement
-
-Our existing TroVE port (`symbolic_agent/baselines/trove/*`) faithfully implements the original 3-mode generation (IMPORT / CREATE / SKIP) via free-form text prompts. When run on PBEBench with `gpt-oss-20b` / `gpt-oss-120b` served via vLLM, two failure modes are observed:
-
-1. **The toolbox is populated but never used.** The model emits `**Toolbox**` and `**Solution**` blocks that ignore previously-induced functions even when those functions match the task family. CoT shows the model "discovering" the same primitive sequence repeatedly.
-2. **CoT and final code are decoupled.** Even when the prompt names a toolbox helper, the model's reasoning channel does not interleave concrete invocations of that helper — calls only appear (if at all) in the final code, with no per-call signal we can audit.
-
-The user requirement is: **the model's chain-of-thought should interleave with concrete function calls into the induced toolbox**. The mechanism for this is **native OpenAI tool calling**: the toolbox is exposed via the `tools=[...]` parameter of `chat.completions.create`, and the model emits structured `tool_calls` during its reasoning that vLLM dispatches back to us. Each tool call is real, auditable, and credited toward toolbox frequency.
-
-This spec adapts the IMPORT mode of TroVE to use that mechanism, while keeping CREATE and SKIP modes text-based and preserving the rest of the algorithm faithful to the paper.
-
----
-
-## 2. Goals and non-goals
-
-### Goals
-
-- IMPORT-mode trajectories are produced by a multi-turn loop where `gpt-oss` calls toolbox functions natively via `tool_calls`.
-- Frequency credit reflects what the model actually called, not what appeared in text.
-- The 3-way generation (IMPORT, CREATE, SKIP), K-sampling, reward-based candidate selection, AST tie-breaking, and `C·log_{20}(n)` trimming all remain faithful to the original TroVE algorithm.
-- The smoke run produces enough telemetry (per-task `tool_calls` lists, per-mode wins, function-frequency table) to attribute any accuracy delta vs. the no-toolbox baseline to actual tool usage.
-- "Done" = code complete + 50-task PBEBench-Lite smoke run on `gpt-oss-20b` + numbers reported. **No prompt iteration to chase performance targets.**
-
-### Non-goals
-
-- We do not change CREATE or SKIP mode generation. They remain single-shot text completions exactly as today.
-- We do not pre-seed the toolbox.
-- We do not change reward semantics, the PBEBench harness, or the executor's I/O contract.
-- We do not chase a specific accuracy target. Per the original TroVE methodology, we report what the algorithm produces.
-- We do not test or report Anthropic backend numbers — only vLLM-served `gpt-oss`.
-
----
-
-## 3. Architecture overview
-
-```mermaid
-flowchart TD
-    Task[PBEBench task] --> Controller[TroVEController._multi_way_generation]
-    Controller --> ImportBranch{toolbox non-empty<br/>AND backend == openai?}
-    Controller --> Create[CREATE mode<br/>K text-only completions]
-    Controller --> Skip[SKIP mode<br/>K text-only completions]
-
-    ImportBranch -->|yes| ImportTools[_generate_import_with_tools<br/>K multi-turn tool-calling trajectories]
-    ImportBranch -->|no, Anthropic or empty| LegacyImport[legacy text-based IMPORT<br/>defensive fallback path]
-
-    ImportTools --> ToolsApi[tools_api.toolbox_to_openai_tools<br/>top-k toolbox -> OpenAI tool schemas]
-    ImportTools --> ChatLoop[llm.chat_with_tools<br/>multi-turn loop, max_tool_iters=8]
-    ChatLoop --> Vllm[vLLM /v1/chat/completions<br/>--tool-call-parser openai<br/>--reasoning-parser openai_gptoss]
-    Vllm --> Dispatcher[tools_api.dispatch_tool_call<br/>sandbox execute via executor.run_solution]
-    Dispatcher --> ChatLoop
-
-    ImportTools --> ImportCands[K IMPORT candidates<br/>final assistant text + tool_call trajectory]
-    Create --> CreateCands[K CREATE candidates]
-    Skip --> SkipCands[K SKIP candidates]
-    LegacyImport --> ImportCands
-
-    ImportCands --> Pick[_select_best_by_reward<br/>tie-break by AST node count]
-    CreateCands --> Pick
-    SkipCands --> Pick
-
-    Pick --> Library[_update_library<br/>credit frequency from tool_calls]
-    Library --> Trim[periodic toolbox.trim<br/>C * log_20 n_processed, C=1.0]
-```
-
-**One-line summary:** Only the IMPORT branch changes. Everything else (CREATE, SKIP, K-sampling, selection, library updates, trimming) stays where the existing port already has it.
-
----
-
-## 4. Data flow for IMPORT-with-tools
-
-```mermaid
-sequenceDiagram
-    participant Ctrl as TroVEController
-    participant Tools as tools_api
-    participant LLM as TroVELLMClient.chat_with_tools
-    participant vLLM
-    participant Exec as executor.run_solution
-
-    Ctrl->>Tools: toolbox_to_openai_tools(toolbox, topk=10)
-    Tools-->>Ctrl: tools_schema (list[dict])
-    Ctrl->>LLM: chat_with_tools(messages, tools_schema, model, max_tool_iters=8)
-    loop iter 1..N (N <= 8)
-        LLM->>vLLM: chat.completions.create(messages, tools=tools_schema)
-        vLLM-->>LLM: assistant message (content + reasoning_content + tool_calls)
-        alt tool_calls present
-            LLM->>Tools: dispatch_tool_call(toolbox, tool_call)
-            Tools->>Exec: run_solution(toolbox_src + call_expr, task_inputs)
-            Exec-->>Tools: stdout (truncated to 4096 chars) or error
-            Tools-->>LLM: tool result string
-            LLM->>LLM: append assistant + tool messages, loop
-        else no tool_calls
-            LLM-->>Ctrl: trajectory (final text + recorded tool_calls)
-        end
-    end
-    Ctrl->>Ctrl: parse **Solution** block from final text
-    Ctrl->>Ctrl: credit frequency by unique tool_call.function.name
-```
-
----
-
-## 5. Components
-
-### 5.1 New file: `symbolic_agent/baselines/trove/tools_api.py`
-
-Two pure functions; no state.
-
-**`toolbox_to_openai_tools(toolbox: TroVEToolbox, topk: int = 10) -> list[dict]`**
-
-- Selects the top-k entries by frequency (matching the existing `format_toolbox(topk=10)` view).
-- For each entry, executes the toolbox source via `exec(toolbox.get_full_code(), namespace)` into a fresh dict, then reads `inspect.signature(namespace[fn_name])` to enumerate parameters and annotations.
-- Builds an OpenAI `chat.completions` tool dict:
-  ```json
-  {
-    "type": "function",
-    "function": {
-      "name": "<fn_name>",
-      "description": "<docstring or empty string>",
-      "parameters": {
-        "type": "object",
-        "properties": {"<param>": {"type": "<inferred>"}, ...},
-        "required": [<all params without defaults>]
-      }
-    }
-  }
-  ```
-- Type inference: `int → integer`, `float → number`, `bool → boolean`, `list/tuple → array`, `dict → object`, anything else (or unannotated) → `string`. Numeric and string defaults pass through to the schema as `default`; any other default value is omitted.
-- Functions with `*args` or `**kwargs` are excluded from the tool list (we cannot generate a meaningful schema; this is rare for induced TroVE helpers and is logged to the debug dir for inspection).
-
-**`dispatch_tool_call(toolbox: TroVEToolbox, tool_call) -> str`**
-
-- Sanitizes the tool name: `name = tool_call.function.name.split("<|", 1)[0]`. This is a defensive 2-line workaround for the open vLLM bug tracked by PR #35906 (Harmony control tokens leaking into tool names like `find_replace_chain<|channel|>commentary`). If/when #35906 lands upstream, this becomes a no-op.
-- If `name` is not in `toolbox`, returns the JSON string `{"error": "tool '<name>' not in toolbox"}` (the model can recover).
-- Parses `tool_call.function.arguments` as JSON; on parse error returns `{"error": "argument JSON parse failed: <msg>"}`.
-- Builds a one-liner call expression: `print(repr(<name>(**<args>)))`.
-- Runs `executor.run_solution` with `toolbox.get_full_code() + "\n" + call_expr` and `inputs={}`. (PBEBench task inputs are not needed at the function-call level — the model passes inputs as arguments.)
-- Returns the captured stdout truncated to **4096 characters** (Unicode code points, not UTF-8 bytes, so truncation never splits a multi-byte character), or the error message on non-zero exit.
-
-### 5.2 Modify: `symbolic_agent/baselines/trove/llm.py`
-
-**`TroVELLMClient._call_openai`**
-
-- After reading `response.choices[0].message.content`, fall back to `getattr(response.choices[0].message, "reasoning_content", "")` when `content` is empty/None. This handles `gpt-oss` Harmony channel splits where the answer lands in the reasoning channel for non-tool-calling text completions (CREATE, SKIP, and legacy IMPORT). No change to the function signature.
-
-**New method: `TroVELLMClient.chat_with_tools(messages, tools, model, max_tokens, max_tool_iters=8, on_tool_call, tag) -> dict`**
-
-- Returns `{"final_text": str, "tool_calls": list[dict], "iterations": int, "stopped_reason": str}`.
-  - `final_text` is the assistant message content from the final iteration (`""` if none).
-  - `tool_calls` is the ordered list of recorded calls, each `{"name": str, "args_preview": str (≤200 chars), "result_preview": str (≤200 chars), "ok": bool}`.
-- Implements the multi-turn loop:
-  1. Append the user message.
-  2. POST `chat.completions.create(model, messages, tools, tool_choice="auto", max_tokens)`.
-  3. If `message.tool_calls` is empty: record `final_text` (with `reasoning_content` fallback) and return.
-  4. Otherwise: append the assistant message verbatim, then for each `tool_call` invoke `on_tool_call(tool_call)` (the controller passes a closure that calls `tools_api.dispatch_tool_call`). Append a `{"role": "tool", "tool_call_id": ..., "content": <result>}` message per call.
-  5. Increment iteration counter; if `iterations >= max_tool_iters`, stop with `stopped_reason="max_iters"` and return what we have.
-- Defensive guard: raises `NotImplementedError("chat_with_tools requires the openai backend")` on `self.backend == "anthropic"`. This guard is **never tripped in normal flow** because the controller branches on `self.backend == "openai"` before calling. It exists only to fail loudly if a future caller invokes the method directly.
-- Uses the same 3-attempt retry, the same per-call debug logging (writing one JSON file per LLM round-trip into the existing `_debug_dir` with the tag suffixed by `_iter{n}`), and the same token accounting as `_call_openai`.
-
-### 5.3 Modify: `symbolic_agent/baselines/trove/controller.py`
-
-**`__init__`** — add two parameters:
-- `task_family: str = "default"` — passed through to `prompts.build_*_prompt` and `parse.parse_response`.
-- `selection: str = "reward"` — `"reward"` (default) uses the existing `_select_best_by_reward`; `"consistency"` uses the existing `_select_best_by_consistency`.
-
-**`_multi_way_generation`** — change the IMPORT branch only:
-- If `self.backend == "openai"` AND `len(self.toolbox) > 0`: call new `_generate_import_with_tools(task, K)`.
-- Else: call the existing legacy text-based IMPORT path. (Anthropic and empty-toolbox both fall through here; the latter is correct because there are no tools to expose anyway.)
-- CREATE and SKIP branches: unchanged.
-
-**New method: `_generate_import_with_tools(task, K) -> list[Candidate]`**
-
-- Builds the IMPORT-with-tools prompt via `prompts.build_import_with_tools_prompt(task, task_family=self.task_family)` (no `**Toolbox**` markdown — the toolbox is conveyed via the `tools=[...]` parameter).
-- Builds the tool schema once per task: `tools_schema = tools_api.toolbox_to_openai_tools(self.toolbox, topk=10)`.
-- For `i in range(K)`, calls `self.llm.chat_with_tools(...)` with the tag `f"trove_import_{task.id}_{i}"`.
-- Each returned trajectory becomes one Candidate. Solution code is parsed from the final text via `parse.parse_response(final_text, task_family="pbebench")` (strict `**Solution**` block; no fallback to "any python block").
-- Empty `final_text` → empty solution code → reward=0 → naturally loses in selection.
-
-**`_update_library`** — for `mode == "import"`, credit frequency by **unique `tool_call.function.name`** entries in the trajectory:
-- `unique_names = {sanitize(tc["name"]) for tc in trajectory.tool_calls}` where `sanitize` is the same `<|`-truncation used in `dispatch_tool_call` (defensive symmetry).
-- For each name, call `self.toolbox.update_frequency(name, example_idx)`. Names not present in the toolbox are silently no-ops thanks to the existing filter at `toolbox.py:68` — hallucinated tool names contribute nothing to frequency. Real tool calls (names matching a toolbox entry) get one credit per task per unique name.
-
-**`_make_result`** — emit passive telemetry fields per task. Add to the result dict (no behavior changes):
-- `won_mode: "import" | "create" | "skip"`
-- `import_eligible: bool` (true iff toolbox was non-empty when the task ran)
-- `import_was_winner: bool`
-- `tool_calls: list[{name, args_preview, result_preview, ok}]` (only populated when the IMPORT-with-tools path ran)
-- `tool_call_count: int`
-- `tools_called: list[str]` (unique names actually called)
-- `actually_called: list[str]` (functions from `toolbox` that appear as call-sites in the AST of the winning `**Solution**` code; computed via `parse.imported_callsites`)
-
-### 5.4 Modify: `symbolic_agent/baselines/trove/parse.py`
-
-**New helper: `imported_callsites(solution_code: str, tools_code: str, candidate_names: set[str]) -> set[str]`**
-- AST-walks `solution_code`, returns the subset of `candidate_names` that appear as `Call` targets (handles bare `Name` and `Attribute` callees like `toolbox.find_replace_chain`).
-- Used by `_make_result.actually_called`.
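-
-A minimal sketch of the AST walk, assuming only the behaviour described above. The function name carries a `_sketch` suffix to mark it as illustrative, and it omits the `tools_code` parameter because the spec does not say how that argument is used.
-
-```python
-import ast
-
-
-def imported_callsites_sketch(solution_code: str, candidate_names: set[str]) -> set[str]:
-    """Return the candidate names that appear as call targets in solution_code."""
-    try:
-        tree = ast.parse(solution_code)
-    except SyntaxError:
-        # Unparseable solutions contribute no call-sites (an assumption of this sketch).
-        return set()
-    found = set()
-    for node in ast.walk(tree):
-        if not isinstance(node, ast.Call):
-            continue
-        func = node.func
-        if isinstance(func, ast.Name):  # bare call: helper(...)
-            name = func.id
-        elif isinstance(func, ast.Attribute):  # attribute call: toolbox.helper(...)
-            name = func.attr
-        else:
-            continue
-        if name in candidate_names:
-            found.add(name)
-    return found
-
-
-demo = (
-    "def solve(s):\n"
-    "    t = toolbox.find_replace_chain(s, [('a', 'b')])\n"
-    "    return t.upper()\n"
-)
-called = imported_callsites_sketch(demo, {"find_replace_chain", "swap_chars"})
-```
-
-Matching attribute calls by attribute name alone means that a method with the same name on an unrelated object would also count; whether that is acceptable for telemetry is a design choice this sketch does not settle.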
-
-**Modify `parse_response`** — add `task_family: str = "default"` parameter:
-- For `task_family == "pbebench"`, do not fall back to `_extract_any_python_block` if the `**Solution**` block is missing — return empty solution code instead. This enforces strict format adherence and prevents the parser from accidentally promoting CoT scratchpad to the answer.
-- For all other families, behavior is unchanged.
-
-### 5.5 Modify: `symbolic_agent/baselines/trove/prompts.py`
-
-- Add PBEBench-shaped few-shot examples: `_CREATE_EXAMPLE_PBEBENCH` and `_SKIP_EXAMPLE_PBEBENCH`. Each demonstrates a sequence of `replace()` operations and (in CREATE's case) a small reusable helper such as `find_replace_chain(s, pairs)` so the model has a concrete pattern to imitate.
-- Add **`_IMPORT_INSTRUCTION_FOR_TOOLS`** and **`_IMPORT_EXAMPLE_FOR_TOOLS`**: the prompt for IMPORT-with-tools mode. These do *not* include a `**Toolbox**` markdown block (the toolbox is conveyed via the `tools=[...]` parameter). They instruct the model to use the available tools when helpful and to produce a final answer in a `**Solution**` block.
-- Add **`build_import_with_tools_prompt(task, task_family)`** and refactor `build_import_prompt`, `build_create_prompt`, `build_skip_prompt` to accept `task_family` and dispatch to the appropriate example set.
-- Make `_FORMAT_OVERRIDE` conditional: empty string for `task_family == "pbebench"` (the new PBEBench examples model the desired format directly); existing override for other families.
-
-### 5.6 Modify: `symbolic_agent/baselines/trove/toolbox.py`
-
-- `TroVEToolbox.trim`: change default `C: float = 0.5` → `C: float = 1.0` to match the original TroVE implementation.
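-
-To illustrate what the default change does, here is a hedged sketch of a `C * log_20(n)` trimming rule. It assumes that `n` is the number of tasks processed so far and that functions whose usage count falls below the threshold are dropped; `trim_threshold` and `trim_sketch` are hypothetical names, not the `TroVEToolbox` API.
-
-```python
-import math
-
-
-def trim_threshold(n: int, C: float = 1.0) -> float:
-    """Usage threshold C * log_20(n); requires n >= 1."""
-    return C * math.log(n, 20)
-
-
-def trim_sketch(frequencies: dict[str, int], n: int, C: float = 1.0) -> dict[str, int]:
-    """Keep only the functions whose usage count reaches the threshold."""
-    threshold = trim_threshold(n, C)
-    return {name: freq for name, freq in frequencies.items() if freq >= threshold}
-
-
-# Hypothetical usage counts after n = 8000 tasks, where log_20(8000) is about 3.
-freqs = {"a": 4, "b": 2, "c": 1}
-kept_new_default = trim_sketch(freqs, n=8000, C=1.0)  # threshold about 3.0
-kept_old_default = trim_sketch(freqs, n=8000, C=0.5)  # threshold about 1.5
-```
-
-Under these assumptions the larger coefficient trims more aggressively: function `b` survives with `C=0.5` but is removed with `C=1.0`.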
-
-### 5.7 Modify: `symbolic_agent/baselines/trove/executor.py`
-
-- `DEFAULT_TIMEOUT = 10` → `DEFAULT_TIMEOUT = 60`. Closer to the original TroVE's ~100s; gives PBEBench's `replace()`-chain solutions and the multi-turn tool dispatch enough headroom on local vLLM.
-
-### 5.8 Modify: `main.py`
-
-- Add CLI flag `--trove-selection {reward,consistency}` with `default="reward"`. Plumb to `TroVEController(selection=args.trove_selection)`.
-- When `--dataset pbebench` is specified, pass `task_family="pbebench"` to the controller. Otherwise pass `"default"`.
-
-### 5.9 Modify: `scripts/launch_vllm_gpt_oss_120b.sh`
-
-Add three flags to the `vllm.entrypoints.openai.api_server` invocation:
-- `--enable-auto-tool-choice` — enables `tool_choice="auto"` to actually fire tool calls.
-- `--tool-call-parser openai` — the parser that knows how to extract `tool_calls` from the `gpt-oss` Harmony commentary channel.
-- `--reasoning-parser openai_gptoss` — routes Harmony analysis-channel content into `message.reasoning_content` rather than dropping it.
-
-### 5.10 New file: `scripts/analyze_trove_run.py`
-
-Read a TroVE JSONL output and print:
-- Overall accuracy (pass rate).
-- Final toolbox size.
-- Per-mode wins (counts of `won_mode == "import"`, `"create"`, `"skip"`).
-- IMPORT-mode behavior breakdown:
-  - Tasks with `import_eligible == True` and `tool_call_count >= 1` (rate).
-  - Mean `tool_call_count` across IMPORT-eligible tasks.
-  - Tool-call success rate: fraction of `tool_calls` entries with `ok == True`.
-- Top-10 most-called toolbox functions (by total call count across the run).
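-
-The IMPORT-related aggregates listed above could be computed roughly as below. This is a sketch over already-parsed rows (one dict per JSONL line); it uses only the telemetry fields named in this spec and leaves out accuracy and toolbox size, whose row fields are not specified here. `summarize` is a hypothetical helper name.
-
-```python
-from collections import Counter
-
-
-def summarize(rows: list[dict]) -> dict:
-    """Aggregate per-task telemetry rows into run-level statistics."""
-    eligible = [r for r in rows if r.get("import_eligible")]
-    used_tools = [r for r in eligible if r.get("tool_call_count", 0) >= 1]
-    all_calls = [tc for r in rows for tc in r.get("tool_calls", [])]
-    ok_calls = sum(1 for tc in all_calls if tc.get("ok"))
-    total_eligible_calls = sum(r.get("tool_call_count", 0) for r in eligible)
-    return {
-        "mode_wins": dict(Counter(r["won_mode"] for r in rows)),
-        "tool_use_rate": len(used_tools) / len(eligible) if eligible else 0.0,
-        "mean_tool_calls": total_eligible_calls / len(eligible) if eligible else 0.0,
-        "tool_call_success_rate": ok_calls / len(all_calls) if all_calls else 0.0,
-        "top_functions": Counter(tc["name"] for tc in all_calls).most_common(10),
-    }
-
-
-# Three hypothetical rows: one pre-toolbox task and two IMPORT-eligible tasks.
-rows = [
-    {"won_mode": "create", "import_eligible": False, "tool_call_count": 0, "tool_calls": []},
-    {"won_mode": "import", "import_eligible": True, "tool_call_count": 2,
-     "tool_calls": [{"name": "f", "ok": True}, {"name": "f", "ok": False}]},
-    {"won_mode": "skip", "import_eligible": True, "tool_call_count": 0, "tool_calls": []},
-]
-summary = summarize(rows)
-```
-
-In a real script the rows would come from `json.loads` applied to each line of the results file.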
-
-### 5.11 Rewrite: `symbolic_agent/baselines/trove/docs/deviations.md`
-
-(Create the path if it does not exist.) Four sections:
-
-1. **Algorithmic deviations:**
-   - Native OpenAI tool calling for IMPORT mode (replaces the original text-based "model selects from `**Toolbox**` markdown" mechanism).
-   - Reward-based candidate selection by default (vs. self-consistency in the paper); self-consistency available via `--trove-selection consistency`.
-   - PBEBench-shaped few-shot examples in CREATE and SKIP prompts.
-
-2. **Faithful elements:** 3-mode generation, K-sampling per mode, AST-tie-breaking by node count, `C·log_{20}(n)` periodic trimming with `C=1.0`, frequency-based top-k retrieval for the toolbox view, dict-keyed toolbox structure mirroring `utils/code.py`.
-
-3. **Infrastructural patches:** JSONL-per-task checkpointing, `reasoning_content` fallback in `_call_openai`, executor timeout 60s, defensive `<|`-truncation sanitizer in the tool-call dispatcher (workaround for open vLLM PR #35906 covering Harmony control-token leakage).
-
-4. **Backend coverage caveat:** Anthropic backend code paths are still present and exercised by CREATE / SKIP / legacy IMPORT, but the smoke run and reported numbers are vLLM-served `gpt-oss` only. IMPORT-with-tools requires the OpenAI/vLLM backend.
-
----
-
-## 6. Telemetry to be collected
-
-Per task (in the JSONL row):
-
-| Field | Type | Source |
-|---|---|---|
-| `won_mode` | string | controller `_make_result` |
-| `import_eligible` | bool | `len(toolbox) > 0` at task start |
-| `import_was_winner` | bool | `won_mode == "import"` |
-| `tool_calls` | list[dict] | `chat_with_tools` recorded list |
-| `tool_call_count` | int | `len(tool_calls)` |
-| `tools_called` | list[str] | unique names from `tool_calls` |
-| `actually_called` | list[str] | `parse.imported_callsites(winning_solution, ...)` |
-
-Per run (computed by `scripts/analyze_trove_run.py`):
-
-- Overall accuracy
-- Final toolbox size
-- Mode-win histogram
-- IMPORT-mode tool-use rate, mean calls/task, success rate
-- Top-10 most-called functions
-
----
-
-## 7. Implementation defaults
-
-| Choice | Value | Rationale |
-|---|---|---|
-| `K` (samples per mode) | 3 | Matches existing controller; matches paper |
-| Tool schema top-k | 10 | Matches existing `format_toolbox(topk=10)` |
-| `max_tool_iters` | 8 | Allows multi-step compositions; bounded for safety |
-| Tool result truncation | 4096 characters | Counted in characters rather than bytes, so truncation never splits a codepoint and the result stays JSON-serializable |
-| Trim coefficient `C` | 1.0 | Matches the original TroVE `λ = log_{20}(n)` |
-| Executor timeout | 60s | PBEBench `replace()`-chains + multi-turn dispatch |
-| Selection default | `reward` | Existing PBEBench reward signal is reliable |
-| Tool name sanitization | `name.split("<\|", 1)[0]` | Defensive vs. open vLLM PR #35906 |
-
----
-
-## 8. Smoke run
-
-**Command (filled when ready to execute):**
-
-```bash
-# Launch vLLM (after script is updated with the three new flags)
-bash scripts/launch_vllm_gpt_oss_120b.sh 8000
-
-# Run TroVE on 50 PBEBench-Lite tasks with gpt-oss-20b
-python main.py \
-  --dataset pbebench \
-  --baseline trove \
-  --model gpt-oss-20b \
-  --backend openai \
-  --base-url http://localhost:8000/v1 \
-  --num-tasks 50 \
-  --trove-selection reward \
-  --debug-dir ./outputs/trove_pbebench_smoke
-
-# Analyze
-python scripts/analyze_trove_run.py outputs/trove_pbebench_smoke/results.jsonl
-```
-
-**Pre-flight check.** Before kicking off the full 50-task run, run a one-task smoke and verify:
-1. The OpenAI client request payload contains `tools=[...]` with at least one entry once the toolbox has been populated.
-2. The first response with a non-empty toolbox returns at least one `tool_call` from vLLM (visible in the debug log JSON for that round-trip).
-
-If `message.tool_calls` is `None` or missing on a non-empty-toolbox task, **verify all three vLLM flags (`--enable-auto-tool-choice`, `--tool-call-parser openai`, `--reasoning-parser openai_gptoss`) are present in the launcher script**, restart vLLM, and re-run the pre-flight check before proceeding.
-
-**Done criteria.**
-
-- All code changes merged on `trove_baseline`.
-- Smoke run completes without crashes.
-- Reported numbers (in plain text or a brief markdown summary):
-  - Overall accuracy (pass rate over 50 tasks)
-  - Final toolbox size
-  - Mode-win counts
-  - IMPORT tool-use rate among IMPORT-eligible tasks
-  - Top-10 most-called functions
-  - A short narrative of any anomalies observed (e.g. `<|channel|>` contamination from PR #35906, `max_iters` stops, JSON-arg parse failures).
-
-We **do not** iterate on prompts, schemas, or thresholds to chase a target number. The numbers are what they are.
-
----
-
-## 9. vLLM version requirement and known caveats
-
-- **Minimum vLLM:** v0.16.0 (branch-cut 2026-02-08). Latest as of writing is v0.20.0.
-- **Required upstream change:** PR #28729 ("Multiple fixes for gpt-oss Chat Completion prompting"), merged 2025-12-12 by `@chaunceyjiang`. Without this, multi-turn tool-call flows fail to round-trip the analysis/commentary channels correctly. v0.16.0 is the first stable release branch-cut after the merge.
-- **Known open caveat:** PR #35906 ("Sanitize leaked Harmony control tokens in tool names and recipients") is **still open** as of late March 2026. Symptoms when this hits us: tool names contain Harmony tags, e.g. `find_replace_chain<|channel|>commentary`. Mitigation: the `<|`-truncation sanitizer in `dispatch_tool_call` and `_update_library`. If/when #35906 lands upstream, the sanitizer becomes a no-op and we leave it in place.
-
----
-
-## 10. Cost envelope (smoke run upper bound)
-
-Per baseline task (no IMPORT-with-tools branch, e.g. the first ~10 tasks before the toolbox is populated): K=3 for each of CREATE and SKIP = 6 single-shot calls, plus 3 legacy IMPORT calls (a no-op when the toolbox is empty, but the call is still made) = 9 round-trips.
-
-Per IMPORT-eligible task (~40 of 50): K=3 multi-turn IMPORT trajectories × (up to 8 tool iterations + 1 final no-tool turn) = up to 27 calls; plus 6 for CREATE and SKIP = up to 33 round-trips.
-
-Total upper bound: 40·33 + 10·9 = **1410 round-trips** for the 50-task smoke. Acceptable for local vLLM.
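-
-The arithmetic behind this bound, written out as a small check. The constants come from the defaults in section 7 and the 40/10 task split estimated above; the variable names are illustrative.
-
-```python
-K = 3                # samples per mode
-MAX_TOOL_ITERS = 8   # tool-calling iterations per IMPORT trajectory
-ELIGIBLE_TASKS = 40  # estimated tasks that run with a non-empty toolbox
-BASELINE_TASKS = 10  # estimated tasks that run before the toolbox is populated
-
-# CREATE and SKIP: K single-shot calls each.
-create_skip_calls = 2 * K
-
-# IMPORT-with-tools: each trajectory may use every iteration plus one final no-tool turn.
-import_with_tools_calls = K * (MAX_TOOL_ITERS + 1)
-
-# Legacy IMPORT on an empty toolbox: K single-shot calls.
-legacy_import_calls = K
-
-per_eligible_task = import_with_tools_calls + create_skip_calls
-per_baseline_task = create_skip_calls + legacy_import_calls
-total_round_trips = ELIGIBLE_TASKS * per_eligible_task + BASELINE_TASKS * per_baseline_task
-```
-
-This evaluates to 33 round-trips per IMPORT-eligible task, 9 per baseline task, and 1410 in total.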
-
----
-
-## 11. Out of scope (explicit)
-
-- Any change to PBEBench dataset/loader/scoring.
-- Any change to CREATE or SKIP generation paths.
-- Pre-seeding the toolbox.
-- Toolbox persistence across runs.
-- Any change to reward semantics.
-- Any per-task or per-prompt iteration after the smoke run lands.
-- Anthropic backend smoke runs.