From c647358ec9bc4bbacd1121419739362fd2f7dbb4 Mon Sep 17 00:00:00 2001
From: mathuryash5
Date: Sat, 25 Apr 2026 17:19:42 -0400
Subject: [PATCH 01/24] docs(trove): add native tool calling design spec

Specifies the architecture for adapting TroVE's IMPORT mode to use native
OpenAI tool calling with vLLM-served gpt-oss models. CREATE and SKIP remain
text-based; reward selection, K-sampling, and trimming stay faithful to the
paper. Includes telemetry plan, vLLM version requirements (>= v0.16.0 for
PR #28729), defensive sanitizer for the open Harmony control-token leakage
bug (PR #35906), and the smoke-run done criteria.

Made-with: Cursor
---
 ...-04-25-trove-native-tool-calling-design.md | 374 ++++++++++++++++++
 1 file changed, 374 insertions(+)
 create mode 100644 docs/superpowers/specs/2026-04-25-trove-native-tool-calling-design.md

diff --git a/docs/superpowers/specs/2026-04-25-trove-native-tool-calling-design.md b/docs/superpowers/specs/2026-04-25-trove-native-tool-calling-design.md
new file mode 100644
index 00000000..ff0fc0a1
--- /dev/null
+++ b/docs/superpowers/specs/2026-04-25-trove-native-tool-calling-design.md
@@ -0,0 +1,374 @@
+# TroVE Native Tool Calling — Design Spec
+
+**Date:** 2026-04-25
+**Branch:** `trove_baseline`
+**Status:** Approved (sectional review complete; self-review applied)
+
+---
+
+## 1. Problem statement
+
+Our existing TroVE port (`symbolic_agent/baselines/trove/*`) faithfully implements the original 3-mode generation (IMPORT / CREATE / SKIP) via free-form text prompts. When run on PBEBench with `gpt-oss-20b` / `gpt-oss-120b` served via vLLM, two failure modes are observed:
+
+1. **The toolbox is populated but never used.** The model emits `**Toolbox**` and `**Solution**` blocks that ignore previously-induced functions even when those functions match the task family. CoT shows the model "discovering" the same primitive sequence repeatedly.
+2. **CoT and final code are decoupled.** Even when the prompt names a toolbox helper, the model's reasoning channel does not interleave concrete invocations of that helper — calls only appear (if at all) in the final code, with no per-call signal we can audit.
+
+The user requirement is: **the model's chain-of-thought should interleave with concrete function calls into the induced toolbox**. The mechanism for this is **native OpenAI tool calling**: the toolbox is exposed via the `tools=[...]` parameter of `chat.completions.create`, and the model emits structured `tool_calls` during its reasoning that vLLM dispatches back to us. Each tool call is real, auditable, and credited toward toolbox frequency.
+
+This spec adapts the IMPORT mode of TroVE to use that mechanism, while keeping CREATE and SKIP modes text-based and preserving the rest of the algorithm faithful to the paper.
+
+---
+
+## 2. Goals and non-goals
+
+### Goals
+
+- IMPORT-mode trajectories are produced by a multi-turn loop where `gpt-oss` calls toolbox functions natively via `tool_calls`.
+- Frequency credit reflects what the model actually called, not what appeared in text.
+- The 3-way generation (IMPORT, CREATE, SKIP), K-sampling, reward-based candidate selection, AST tie-breaking, and `C·log_{20}(n)` trimming all remain faithful to the original TroVE algorithm.
+- The smoke run produces enough telemetry (per-task `tool_calls` lists, per-mode wins, function-frequency table) to attribute any accuracy delta vs. the no-toolbox baseline to actual tool usage.
+- "Done" = code complete + 50-task PBEBench-Lite smoke run on `gpt-oss-20b` + numbers reported. **No prompt iteration to chase performance targets.**
+
+### Non-goals
+
+- We do not change CREATE or SKIP mode generation. They remain single-shot text completions exactly as today.
+- We do not pre-seed the toolbox.
+- We do not change reward semantics, the PBEBench harness, or the executor's I/O contract.
+- We do not chase a specific accuracy target. Per the original TroVE methodology, we report what the algorithm produces.
+- We do not test or report Anthropic backend numbers — only vLLM-served `gpt-oss`.
+
+---
+
+## 3. Architecture overview
+
+```mermaid
+flowchart TD
+    Task[PBEBench task] --> Controller[TroVEController._multi_way_generation]
+    Controller --> ImportBranch{toolbox non-empty<br/>AND backend == openai?}
+    Controller --> Create[CREATE mode<br/>K text-only completions]
+    Controller --> Skip[SKIP mode<br/>K text-only completions]
+
+    ImportBranch -->|yes| ImportTools[_generate_import_with_tools<br/>K multi-turn tool-calling trajectories]
+    ImportBranch -->|no, Anthropic or empty| LegacyImport[legacy text-based IMPORT<br/>defensive fallback path]
+
+    ImportTools --> ToolsApi[tools_api.toolbox_to_openai_tools<br/>top-k toolbox -> OpenAI tool schemas]
+    ImportTools --> ChatLoop[llm.chat_with_tools<br/>multi-turn loop, max_tool_iters=8]
+    ChatLoop --> Vllm[vLLM /v1/chat/completions<br/>--tool-call-parser openai<br/>--reasoning-parser openai_gptoss]
+    Vllm --> Dispatcher[tools_api.dispatch_tool_call<br/>sandbox execute via executor.run_solution]
+    Dispatcher --> ChatLoop
+
+    ImportTools --> ImportCands[K IMPORT candidates<br/>final assistant text + tool_call trajectory]
+    Create --> CreateCands[K CREATE candidates]
+    Skip --> SkipCands[K SKIP candidates]
+    LegacyImport --> ImportCands
+
+    ImportCands --> Pick[_select_best_by_reward<br/>tie-break by AST node count]
+    CreateCands --> Pick
+    SkipCands --> Pick
+
+    Pick --> Library[_update_library<br/>credit frequency from tool_calls]
+    Library --> Trim[periodic toolbox.trim<br/>C * log_20 n_processed, C=1.0]
+```
+
+**One-line summary:** Only the IMPORT branch changes. Everything else (CREATE, SKIP, K-sampling, selection, library updates, trimming) stays where the existing port already has it.
+
+---
+
+## 4. Data flow for IMPORT-with-tools
+
+```mermaid
+sequenceDiagram
+    participant Ctrl as TroVEController
+    participant Tools as tools_api
+    participant LLM as TroVELLMClient.chat_with_tools
+    participant vLLM
+    participant Exec as executor.run_solution
+
+    Ctrl->>Tools: toolbox_to_openai_tools(toolbox, topk=10)
+    Tools-->>Ctrl: tools_schema (list[dict])
+    Ctrl->>LLM: chat_with_tools(messages, tools_schema, model, max_tool_iters=8)
+    loop iter 1..N (N <= 8)
+        LLM->>vLLM: chat.completions.create(messages, tools=tools_schema)
+        vLLM-->>LLM: assistant message (content + reasoning_content + tool_calls)
+        alt tool_calls present
+            LLM->>Tools: dispatch_tool_call(toolbox, tool_call)
+            Tools->>Exec: run_solution(toolbox_src + call_expr, inputs={})
+            Exec-->>Tools: stdout (truncated to 4096 chars) or error
+            Tools-->>LLM: tool result string
+            LLM->>LLM: append assistant + tool messages, loop
+        else no tool_calls
+            LLM-->>Ctrl: trajectory (final text + recorded tool_calls)
+        end
+    end
+    Ctrl->>Ctrl: parse **Solution** block from final text
+    Ctrl->>Ctrl: credit frequency by unique tool_call.function.name
+```
+
+---
+
+## 5. Components
+
+### 5.1 New file: `symbolic_agent/baselines/trove/tools_api.py`
+
+Two pure functions; no state.
+
+**`toolbox_to_openai_tools(toolbox: TroVEToolbox, topk: int = 10) -> list[dict]`**
+
+- Selects the top-k entries by frequency (matching the existing `format_toolbox(topk=10)` view).
+- For each entry, executes the toolbox source via `exec(toolbox.get_full_code(), namespace)` into a fresh dict, then reads `inspect.signature(namespace[fn_name])` to enumerate parameters and annotations.
+- Builds an OpenAI `chat.completions` tool dict:
+  ```json
+  {
+    "type": "function",
+    "function": {
+      "name": "",
+      "description": "",
+      "parameters": {
+        "type": "object",
+        "properties": {"": {"type": ""}, ...},
+        "required": []
+      }
+    }
+  }
+  ```
+- Type inference: `int → integer`, `float → number`, `bool → boolean`, `list/tuple → array`, `dict → object`, anything else (or unannotated) → `string`. Numeric and string defaults: pass through to the schema as `default`. Anything else: omit the default.
+- Functions with `*args` or `**kwargs` are excluded from the tool list (we cannot generate a meaningful schema; this is rare for induced TroVE helpers and is logged to the debug dir for inspection).
+
+**`dispatch_tool_call(toolbox: TroVEToolbox, tool_call) -> str`**
+
+- Sanitizes the tool name: `name = tool_call.function.name.split("<|", 1)[0]`. This is a defensive 2-line workaround for the open vLLM bug tracked by PR #35906 (Harmony control tokens leaking into tool names like `find_replace_chain<|channel|>commentary`). If/when #35906 lands upstream, this becomes a no-op.
+- If `name` is not in `toolbox`, returns the JSON string `{"error": "tool '' not in toolbox"}` (the model can recover).
+- Parses `tool_call.function.arguments` as JSON; on parse error returns `{"error": "argument JSON parse failed: "}`.
+- Builds a one-liner call expression: `print(repr((**)))`.
+- Runs `executor.run_solution` with `toolbox.get_full_code() + "\n" + call_expr` and `inputs={}`. (PBEBench task inputs are not needed at the function-call level — the model passes inputs as arguments.)
+- Returns the captured stdout truncated to **4096 characters** (Unicode code points, not bytes, so truncation can never split a multi-byte character), or the error message on non-zero exit.
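
The schema rules and the name sanitizer above can be sketched as follows. This is a minimal illustration, not the port's actual `tools_api` code: `function_to_tool` and `sanitize_tool_name` are hypothetical names, `required` is assumed to list the parameters without defaults, and `bool` defaults are assumed not to count as "numeric".

```python
import inspect

# Annotation -> JSON Schema type, following the mapping listed above.
_JSON_TYPES = {int: "integer", float: "number", bool: "boolean",
               list: "array", tuple: "array", dict: "object"}


def function_to_tool(fn):
    """Return an OpenAI-style tool dict for `fn`, or None if it takes *args/**kwargs."""
    properties, required = {}, []
    for name, param in inspect.signature(fn).parameters.items():
        if param.kind in (param.VAR_POSITIONAL, param.VAR_KEYWORD):
            return None  # no meaningful schema; the caller would skip and log it
        # Unannotated or unrecognised annotations fall back to "string".
        prop = {"type": _JSON_TYPES.get(param.annotation, "string")}
        if param.default is inspect.Parameter.empty:
            required.append(name)  # assumption: no default means required
        elif isinstance(param.default, (int, float, str)) and not isinstance(param.default, bool):
            prop["default"] = param.default  # numeric and string defaults only
        properties[name] = prop
    return {
        "type": "function",
        "function": {
            "name": fn.__name__,
            "description": inspect.getdoc(fn) or "",
            "parameters": {"type": "object", "properties": properties, "required": required},
        },
    }


def sanitize_tool_name(name):
    """Drop everything from the first Harmony control token ('<|') onward."""
    return name.split("<|", 1)[0]
```

Parameterised annotations such as `list[str]` are not in the lookup table, so under this sketch they map to `string`; whether the real implementation unwraps them is not stated in the spec.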
+
+### 5.2 Modify: `symbolic_agent/baselines/trove/llm.py`
+
+**`TroVELLMClient._call_openai`**
+
+- After reading `response.choices[0].message.content`, fall back to `getattr(response.choices[0].message, "reasoning_content", "")` when `content` is empty/None. This handles `gpt-oss` Harmony channel splits where the answer lands in the reasoning channel for non-tool-calling text completions (CREATE, SKIP, and legacy IMPORT). No change to the function signature.
+
+**New method: `TroVELLMClient.chat_with_tools(messages, tools, model, max_tokens, max_tool_iters=8, on_tool_call, tag) -> dict`**
+
+- Returns `{"final_text": str, "tool_calls": list[dict], "iterations": int, "stopped_reason": str}`.
+  - `final_text` is the assistant message content from the final iteration (`""` if none).
+  - `tool_calls` is the ordered list of recorded calls, each `{"name": str, "args_preview": str (≤200 chars), "result_preview": str (≤200 chars), "ok": bool}`.
+- Implements the multi-turn loop:
+  1. Append the user message.
+  2. POST `chat.completions.create(model, messages, tools, tool_choice="auto", max_tokens)`.
+  3. If `message.tool_calls` is empty: record `final_text` (with `reasoning_content` fallback) and return.
+  4. Otherwise: append the assistant message verbatim, then for each `tool_call` invoke `on_tool_call(tool_call)` (the controller passes a closure that calls `tools_api.dispatch_tool_call`). Append a `{"role": "tool", "tool_call_id": ..., "content": }` message per call.
+  5. Increment iteration counter; if `iterations >= max_tool_iters`, stop with `stopped_reason="max_iters"` and return what we have.
+- Defensive guard: raises `NotImplementedError("chat_with_tools requires the openai backend")` on `self.backend == "anthropic"`. This guard is **never tripped in normal flow** because the controller branches on `self.backend == "openai"` before calling. It exists only to fail loudly if a future caller invokes the method directly.
+- Uses the same 3-attempt retry, the same per-call debug logging (writing one JSON file per LLM round-trip into the existing `_debug_dir` with the tag suffixed by `_iter{n}`), and the same token accounting as `_call_openai`.
+
+### 5.3 Modify: `symbolic_agent/baselines/trove/controller.py`
+
+**`__init__`** — add two parameters:
+- `task_family: str = "default"` — passed through to `prompts.build_*_prompt` and `parse.parse_response`.
+- `selection: str = "reward"` — `"reward"` (default) uses the existing `_select_best_by_reward`; `"consistency"` uses the existing `_select_best_by_consistency`.
+
+**`_multi_way_generation`** — change the IMPORT branch only:
+- If `self.backend == "openai"` AND `len(self.toolbox) > 0`: call new `_generate_import_with_tools(task, K)`.
+- Else: call the existing legacy text-based IMPORT path. (Anthropic and empty-toolbox both fall through here; the latter is correct because there are no tools to expose anyway.)
+- CREATE and SKIP branches: unchanged.
+
+**New method: `_generate_import_with_tools(task, K) -> list[Candidate]`**
+
+- Builds the IMPORT-with-tools prompt via `prompts.build_import_with_tools_prompt(task, task_family=self.task_family)` (no `**Toolbox**` markdown — the toolbox is conveyed via the `tools=[...]` parameter).
+- Builds the tool schema once per task: `tools_schema = tools_api.toolbox_to_openai_tools(self.toolbox, topk=10)`.
+- For `i in range(K)`, calls `self.llm.chat_with_tools(...)` with the tag `f"trove_import_{task.id}_{i}"`.
+- Each returned trajectory becomes one Candidate. Solution code is parsed from the final text via `parse.parse_response(final_text, task_family="pbebench")` (strict `**Solution**` block; no fallback to "any python block").
+- Empty `final_text` → empty solution code → reward=0 → naturally loses in selection.
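
The `chat_with_tools` loop from section 5.2, which this method calls K times, can be sketched roughly as below. It is an illustration under stated assumptions rather than the client's real code: `create` is a hypothetical stand-in for `chat.completions.create` that returns one assistant message, retries and debug logging are omitted, `"final"` as the normal `stopped_reason` and the error-prefix test behind `ok` are both guesses, and the scripted replies exist only to make the sketch runnable without a server.

```python
import json
from types import SimpleNamespace


def chat_with_tools_sketch(create, messages, tools, on_tool_call, max_tool_iters=8):
    """Multi-turn tool loop; `create(messages, tools)` returns one assistant message."""
    messages = list(messages)  # do not mutate the caller's list
    recorded, iterations = [], 0
    while iterations < max_tool_iters:
        msg = create(messages, tools)
        iterations += 1
        if not msg.tool_calls:
            # No tool calls: this is the final answer (with reasoning fallback).
            text = msg.content or getattr(msg, "reasoning_content", "") or ""
            return {"final_text": text, "tool_calls": recorded,
                    "iterations": iterations, "stopped_reason": "final"}
        messages.append({"role": "assistant", "content": msg.content,
                         "tool_calls": msg.tool_calls})
        for tc in msg.tool_calls:
            result = on_tool_call(tc)
            recorded.append({
                "name": tc.function.name,
                "args_preview": tc.function.arguments[:200],
                "result_preview": result[:200],
                "ok": not result.startswith('{"error"'),  # assumed success test
            })
            messages.append({"role": "tool", "tool_call_id": tc.id, "content": result})
    return {"final_text": "", "tool_calls": recorded,
            "iterations": iterations, "stopped_reason": "max_iters"}


# Usage with a scripted stand-in for the model: one tool call, then a final answer.
def fake_call(name, args):
    return SimpleNamespace(id="call_0",
                           function=SimpleNamespace(name=name, arguments=json.dumps(args)))


replies = iter([
    SimpleNamespace(content=None, tool_calls=[fake_call("shout", {"s": "ab"})]),
    SimpleNamespace(content="**Solution**", tool_calls=None),
])
trajectory = chat_with_tools_sketch(
    create=lambda messages, tools: next(replies),
    messages=[{"role": "user", "content": "task"}],
    tools=[],
    on_tool_call=lambda tc: json.loads(tc.function.arguments)["s"].upper(),
)
print(trajectory["stopped_reason"], trajectory["iterations"])
```

Injecting `create` and `on_tool_call` keeps the loop testable offline; in the port the former would wrap the OpenAI client and the latter would be the controller's closure over `tools_api.dispatch_tool_call`.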
+
+**`_update_library`** — for `mode == "import"`, credit frequency by **unique `tool_call.function.name`** entries in the trajectory:
+- `unique_names = {sanitize(tc["name"]) for tc in trajectory.tool_calls}` where `sanitize` is the same `<|`-truncation used in `dispatch_tool_call` (defensive symmetry).
+- For each name, call `self.toolbox.update_frequency(name, example_idx)`. Names not present in the toolbox are silently no-ops thanks to the existing filter at `toolbox.py:68` — hallucinated tool names contribute nothing to frequency. Real tool calls (names matching a toolbox entry) get one credit per task per unique name.
+
+**`_make_result`** — emit passive telemetry fields per task. Add to the result dict (no behavior changes):
+- `won_mode: "import" | "create" | "skip"`
+- `import_eligible: bool` (true iff toolbox was non-empty when the task ran)
+- `import_was_winner: bool`
+- `tool_calls: list[{name, args_preview, result_preview, ok}]` (only populated when the IMPORT-with-tools path ran)
+- `tool_call_count: int`
+- `tools_called: list[str]` (unique names actually called)
+- `actually_called: list[str]` (functions from `toolbox` that appear as call-sites in the AST of the winning `**Solution**` code; computed via `parse.imported_callsites`)
+
+### 5.4 Modify: `symbolic_agent/baselines/trove/parse.py`
+
+**New helper: `imported_callsites(solution_code: str, tools_code: str, candidate_names: set[str]) -> set[str]`**
+- AST-walks `solution_code`, returns the subset of `candidate_names` that appear as `Call` targets (handles bare `Name` and `Attribute` callees like `toolbox.find_replace_chain`).
+- Used by `_make_result.actually_called`.
+
+**Modify `parse_response`** — add `task_family: str = "default"` parameter:
+- For `task_family == "pbebench"`, do not fall back to `_extract_any_python_block` if the `**Solution**` block is missing — return empty solution code instead. This enforces strict format adherence and prevents the parser from accidentally promoting CoT scratchpad to the answer.
+- For all other families, behavior is unchanged.
+
+### 5.5 Modify: `symbolic_agent/baselines/trove/prompts.py`
+
+- Add PBEBench-shaped few-shot examples: `_CREATE_EXAMPLE_PBEBENCH` and `_SKIP_EXAMPLE_PBEBENCH`. Each demonstrates a sequence of `replace()` operations and (in CREATE's case) a small reusable helper such as `find_replace_chain(s, pairs)` so the model has a concrete pattern to imitate.
+- Add **`_IMPORT_INSTRUCTION_FOR_TOOLS`** and **`_IMPORT_EXAMPLE_FOR_TOOLS`**: the prompt for IMPORT-with-tools mode. These do *not* include a `**Toolbox**` markdown block (the toolbox is conveyed via the `tools=[...]` parameter). They instruct the model to use the available tools when helpful and to produce a final answer in a `**Solution**` block.
+- Add **`build_import_with_tools_prompt(task, task_family)`** and refactor `build_import_prompt`, `build_create_prompt`, `build_skip_prompt` to accept `task_family` and dispatch to the appropriate example set.
+- Make `_FORMAT_OVERRIDE` conditional: empty string for `task_family == "pbebench"` (the new PBEBench examples model the desired format directly); existing override for other families.
+
+### 5.6 Modify: `symbolic_agent/baselines/trove/toolbox.py`
+
+- `TroVEToolbox.trim`: change default `C: float = 0.5` → `C: float = 1.0` to match the original TroVE implementation.
+
+### 5.7 Modify: `symbolic_agent/baselines/trove/executor.py`
+
+- `DEFAULT_TIMEOUT = 10` → `DEFAULT_TIMEOUT = 60`. Closer to the original TroVE's ~100s; gives PBEBench's `replace()`-chain solutions and the multi-turn tool dispatch enough headroom on local vLLM.
+
+### 5.8 Modify: `main.py`
+
+- Add CLI flag `--trove-selection {reward,consistency}` with `default="reward"`. Plumb to `TroVEController(selection=args.trove_selection)`.
+- When `--dataset pbebench` is specified, pass `task_family="pbebench"` to the controller. Otherwise pass `"default"`.
+
+### 5.9 Modify: `scripts/launch_vllm_gpt_oss_120b.sh`
+
+Add three flags to the `vllm.entrypoints.openai.api_server` invocation:
+- `--enable-auto-tool-choice` — enables `tool_choice="auto"` to actually fire tool calls.
+- `--tool-call-parser openai` — the parser that knows how to extract `tool_calls` from the `gpt-oss` Harmony commentary channel.
+- `--reasoning-parser openai_gptoss` — routes Harmony analysis-channel content into `message.reasoning_content` rather than dropping it.
+
+### 5.10 New file: `scripts/analyze_trove_run.py`
+
+Read a TroVE JSONL output and print:
+- Overall accuracy (pass rate).
+- Final toolbox size.
+- Per-mode wins (counts of `won_mode == "import"`, `"create"`, `"skip"`).
+- IMPORT-mode behavior breakdown:
+  - Tasks with `import_eligible == True` and `tool_call_count >= 1` (rate).
+  - Mean `tool_call_count` across IMPORT-eligible tasks.
+  - Tool-call success rate: fraction of `tool_calls` entries with `ok == True`.
+- Top-10 most-called toolbox functions (by total call count across the run).
+
+### 5.11 Rewrite: `symbolic_agent/baselines/trove/docs/deviations.md`
+
+(Path may need creation if it doesn't exist.) Four sections:
+
+1. **Algorithmic deviations:**
+   - Native OpenAI tool calling for IMPORT mode (replaces the original text-based "model selects from `**Toolbox**` markdown" mechanism).
+   - Reward-based candidate selection by default (vs. self-consistency in the paper); self-consistency available via `--trove-selection consistency`.
+   - PBEBench-shaped few-shot examples in CREATE and SKIP prompts.
+
+2. **Faithful elements:** 3-mode generation, K-sampling per mode, AST-tie-breaking by node count, `C·log_{20}(n)` periodic trimming with `C=1.0`, frequency-based top-k retrieval for the toolbox view, dict-keyed toolbox structure mirroring `utils/code.py`.
+
+3. **Infrastructural patches:** JSONL-per-task checkpointing, `reasoning_content` fallback in `_call_openai`, executor timeout 60s, defensive `<|`-truncation sanitizer in the tool-call dispatcher (workaround for open vLLM PR #35906 covering Harmony control-token leakage).
+
+4. **Backend coverage caveat:** Anthropic backend code paths are still present and exercised by CREATE / SKIP / legacy IMPORT, but the smoke run and reported numbers are vLLM-served `gpt-oss` only. IMPORT-with-tools requires the OpenAI/vLLM backend.
+
+---
+
+## 6. Telemetry to be collected
+
+Per task (in the JSONL row):
+
+| Field | Type | Source |
+|---|---|---|
+| `won_mode` | string | controller `_make_result` |
+| `import_eligible` | bool | `len(toolbox) > 0` at task start |
+| `import_was_winner` | bool | `won_mode == "import"` |
+| `tool_calls` | list[dict] | `chat_with_tools` recorded list |
+| `tool_call_count` | int | `len(tool_calls)` |
+| `tools_called` | list[str] | unique names from `tool_calls` |
+| `actually_called` | list[str] | `parse.imported_callsites(winning_solution, ...)` |
+
+Per run (computed by `scripts/analyze_trove_run.py`):
+
+- Overall accuracy
+- Final toolbox size
+- Mode-win histogram
+- IMPORT-mode tool-use rate, mean calls/task, success rate
+- Top-10 most-called functions
+
+---
+
+## 7. Implementation defaults
+
+| Choice | Value | Rationale |
+|---|---|---|
+| `K` (samples per mode) | 3 | Matches existing controller; matches paper |
+| Tool schema top-k | 10 | Matches existing `format_toolbox(topk=10)` |
+| `max_tool_iters` | 8 | Allows multi-step compositions; bounded for safety |
+| Tool result truncation | 4096 characters | Avoids truncating mid-codepoint; safe for JSON |
+| Trim coefficient `C` | 1.0 | Matches the original TroVE `λ = log_{20}(n)` |
+| Executor timeout | 60s | PBEBench `replace()`-chains + multi-turn dispatch |
+| Selection default | `reward` | Existing PBEBench reward signal is reliable |
+| Tool name sanitization | `name.split("<\|", 1)[0]` | Defensive vs. open vLLM PR #35906 |
+
+---
+
+## 8. Smoke run
+
+**Command (filled when ready to execute):**
+
+```bash
+# Launch vLLM (after script is updated with the three new flags)
+bash scripts/launch_vllm_gpt_oss_120b.sh 8000
+
+# Run TroVE on 50 PBEBench-Lite tasks with gpt-oss-20b
+python main.py \
+  --dataset pbebench \
+  --baseline trove \
+  --model gpt-oss-20b \
+  --backend openai \
+  --base-url http://localhost:8000/v1 \
+  --num-tasks 50 \
+  --trove-selection reward \
+  --debug-dir ./outputs/trove_pbebench_smoke
+
+# Analyze
+python scripts/analyze_trove_run.py outputs/trove_pbebench_smoke/results.jsonl
+```
+
+**Pre-flight check.** Before kicking off the full 50-task run, run a single one-task smoke and verify:
+1. The OpenAI client request payload contains `tools=[...]` with at least one entry once the toolbox has been populated.
+2. The first response with a non-empty toolbox returns at least one `tool_call` from vLLM (visible in the debug log JSON for that round-trip).
+
+If `message.tool_calls` is None or missing on a non-empty-toolbox task, **verify all three vLLM flags (`--enable-auto-tool-choice`, `--tool-call-parser openai`, `--reasoning-parser openai_gptoss`) are present in the launcher script**, restart vLLM, and re-run the sanity check before proceeding.
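
Part of the pre-flight check above could be automated along these lines. This is only a sketch: `missing_vllm_flags` and `has_tool_calls` are hypothetical helpers, the flag check assumes the space-separated `--flag value` form rather than `--flag=value`, and the message is assumed to be available as a plain dict (the layout of the debug log JSON is not specified here).

```python
REQUIRED_FLAGS = (
    "--enable-auto-tool-choice",
    "--tool-call-parser openai",
    "--reasoning-parser openai_gptoss",
)


def missing_vllm_flags(launcher_text):
    """Return the required flags that do not appear in the launcher script text."""
    # Join backslash-continued lines and collapse whitespace, then pad with
    # spaces so "--tool-call-parser openai" does not match "openai_something".
    tokens = launcher_text.replace("\\\n", " ").split()
    normalized = " " + " ".join(tokens) + " "
    return [flag for flag in REQUIRED_FLAGS if f" {flag} " not in normalized]


def has_tool_calls(message):
    """True if an assistant message dict carries at least one tool call."""
    return bool(message.get("tool_calls"))


# Example: a launcher that is still missing the reasoning parser flag.
script = (
    "python -m vllm.entrypoints.openai.api_server \\\n"
    "  --enable-auto-tool-choice \\\n"
    "  --tool-call-parser openai\n"
)
print(missing_vllm_flags(script))
```

A non-empty result from `missing_vllm_flags` would point at the launcher script before any time is spent inspecting model output.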
+
+**Done criteria.**
+
+- All code changes merged on `trove_baseline`.
+- Smoke run completes without crashes.
+- Reported numbers (in plain text or a brief markdown summary):
+  - Overall accuracy (pass rate over 50 tasks)
+  - Final toolbox size
+  - Mode-win counts
+  - IMPORT tool-use rate among IMPORT-eligible tasks
+  - Top-10 most-called functions
+  - A short narrative of any anomalies observed (e.g. `<|channel|>` contamination from PR #35906, `max_iters` stops, JSON-arg parse failures).
+
+We **do not** iterate on prompts, schemas, or thresholds to chase a target number. The numbers are what they are.
+
+---
+
+## 9. vLLM version requirement and known caveats
+
+- **Minimum vLLM:** v0.16.0 (branch-cut 2026-02-08). Latest as of writing is v0.20.0.
+- **Required upstream change:** PR #28729 ("Multiple fixes for gpt-oss Chat Completion prompting"), merged 2025-12-12 by `@chaunceyjiang`. Without this, multi-turn tool-call flows fail to round-trip the analysis/commentary channels correctly. v0.16.0 is the first stable release branch-cut after the merge.
+- **Known open caveat:** PR #35906 ("Sanitize leaked Harmony control tokens in tool names and recipients") is **still open** as of late March 2026. Symptoms when this hits us: tool names contain Harmony tags, e.g. `find_replace_chain<|channel|>commentary`. Mitigation: the `<|`-truncation sanitizer in `dispatch_tool_call` and `_update_library`. If/when #35906 lands upstream, the sanitizer becomes a no-op and we leave it in place.
+
+---
+
+## 10. Cost envelope (smoke run upper bound)
+
+Per task baseline (no IMPORT branch, e.g. first ~10 tasks before the toolbox is populated): K=3 across CREATE and SKIP only = 6 single-shot calls + 3 legacy IMPORT (no-op when toolbox empty, but the call is still made) = 9 round-trips.
+
+Per IMPORT-eligible task (~40 of 50): K=3 multi-turn IMPORT trajectories × up to 8 iterations each + 1 final no-tool turn = up to 27 calls; plus 6 for CREATE and SKIP = up to 33 round-trips.
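
The per-task bounds above, and the run total they imply, can be re-derived with a few lines of arithmetic. The variable names are illustrative only, and the 40/10 task split is the approximate one assumed in the text.

```python
K = 3                # samples per mode
MAX_TOOL_ITERS = 8   # tool-calling iterations per IMPORT trajectory

# Empty toolbox: K single-shot calls for each of CREATE, SKIP, and legacy IMPORT.
baseline_round_trips = 3 * K

# Non-empty toolbox: each IMPORT trajectory may use every tool iteration plus
# one final no-tool turn; CREATE and SKIP stay single-shot.
import_round_trips = K * (MAX_TOOL_ITERS + 1)
eligible_round_trips = import_round_trips + 2 * K

# Assumed split from the text: about 40 IMPORT-eligible tasks and 10 baseline tasks.
total_round_trips = 40 * eligible_round_trips + 10 * baseline_round_trips

print(baseline_round_trips, import_round_trips, eligible_round_trips, total_round_trips)
```

Changing `K` or `MAX_TOOL_ITERS` here gives the corresponding envelope for other settings without redoing the sums by hand.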
+
+Total upper bound: 40·33 + 10·9 = **1410 round-trips** for the 50-task smoke. Acceptable for local vLLM.
+
+---
+
+## 11. Out of scope (explicit)
+
+- Any change to PBEBench dataset/loader/scoring.
+- Any change to CREATE or SKIP generation paths.
+- Pre-seeding the toolbox.
+- Toolbox persistence across runs.
+- Any change to reward semantics.
+- Any per-task or per-prompt iteration after the smoke run lands.
+- Anthropic backend smoke runs.

From 0f872ac5278c4fd2730263901ca699e653a2791e Mon Sep 17 00:00:00 2001
From: mathuryash5
Date: Sat, 25 Apr 2026 17:51:15 -0400
Subject: [PATCH 02/24] docs(trove): add native tool calling implementation plan

Step-by-step plan for implementing the design spec
(2026-04-25-trove-native-tool-calling-design.md). Eleven tasks covering
infra patches, tools_api module, chat_with_tools, controller branch,
prompts, CLI flags, vLLM launcher, analyzer script, deviations doc, and the
50-task PBEBench-Lite smoke run + report.

Made-with: Cursor
---
 .../2026-04-25-trove-native-tool-calling.md | 2274 +++++++++++++++++
 1 file changed, 2274 insertions(+)
 create mode 100644 docs/superpowers/plans/2026-04-25-trove-native-tool-calling.md

diff --git a/docs/superpowers/plans/2026-04-25-trove-native-tool-calling.md b/docs/superpowers/plans/2026-04-25-trove-native-tool-calling.md
new file mode 100644
index 00000000..76ecb582
--- /dev/null
+++ b/docs/superpowers/plans/2026-04-25-trove-native-tool-calling.md
@@ -0,0 +1,2274 @@
+# TroVE Native Tool Calling Implementation Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** Adapt the existing TroVE port so that the IMPORT mode uses native OpenAI tool calling (vLLM-served `gpt-oss`) while CREATE / SKIP / selection / trimming remain faithful to the paper, then run a 50-task PBEBench smoke and report numbers.
+
+**Architecture:** Keep `_multi_way_generation` unchanged for CREATE/SKIP. Replace the IMPORT branch (when toolbox non-empty AND backend is OpenAI) with a multi-turn loop that (a) translates top-k toolbox functions into OpenAI tool schemas, (b) lets the model emit `tool_calls` that are executed in a sandboxed subprocess, and (c) returns the final assistant text + recorded tool-call trajectory. Frequency credit comes from unique `tool_call.function.name` entries, not parsed `from toolbox import`. All other invariants (K-sampling, reward-based selection, AST tie-break, `C·log_{20}(n)` trimming) are unchanged.
+
+**Tech Stack:** Python 3.11, OpenAI Python SDK against a vLLM ≥ v0.16.0 endpoint serving `openai/gpt-oss-20b` (or `120b`), `subprocess`-based executor, `inspect` + `ast` from stdlib.
+
+**Spec:** [docs/superpowers/specs/2026-04-25-trove-native-tool-calling-design.md](../specs/2026-04-25-trove-native-tool-calling-design.md)
+
+---
+
+## File Structure
+
+| File | Status | Purpose |
+|---|---|---|
+| `symbolic_agent/baselines/trove/toolbox.py` | Modify | Trim coefficient `C=1.0` |
+| `symbolic_agent/baselines/trove/executor.py` | Modify | `DEFAULT_TIMEOUT=60` |
+| `symbolic_agent/baselines/trove/llm.py` | Modify | `reasoning_content` fallback in `_call_openai`; new `chat_with_tools` method |
+| `symbolic_agent/baselines/trove/parse.py` | Modify | `imported_callsites` helper; `task_family` parameter on `parse_response` |
+| `symbolic_agent/baselines/trove/prompts.py` | Modify | PBEBench-shaped few-shots; `build_import_with_tools_prompt`; `task_family` dispatch |
+| `symbolic_agent/baselines/trove/controller.py` | Modify | IMPORT-with-tools branch; telemetry fields; `task_family` + `selection` params |
+| `symbolic_agent/baselines/trove/tools_api.py` | Create | `toolbox_to_openai_tools`; `dispatch_tool_call` |
+| `symbolic_agent/baselines/trove/docs/deviations.md` | Create | Algorithmic deviations / faithful elements / infra patches |
+| `symbolic_agent/baselines/trove/tests/__init__.py` | Create | Marker file for the new tests package |
+| `symbolic_agent/baselines/trove/tests/test_tools_api.py` | Create | Unit tests for schema generation + dispatcher |
+| `symbolic_agent/baselines/trove/tests/test_parse_callsites.py` | Create | Unit tests for `imported_callsites` |
+| `main.py` | Modify | `--trove-selection` and `--trove-task-family` flags |
+| `scripts/launch_vllm_gpt_oss_120b.sh` | Modify | Add three vLLM tool-calling flags |
+| `scripts/analyze_trove_run.py` | Create | Post-hoc analysis of TroVE JSONL output |
+
+---
+
+## Task 1: Quick infrastructure patches (trim C, executor timeout, reasoning_content fallback)
+
+**Files:**
+- Modify: `symbolic_agent/baselines/trove/toolbox.py:117`
+- Modify: `symbolic_agent/baselines/trove/executor.py:19`
+- Modify: `symbolic_agent/baselines/trove/llm.py:192`
+
+These are three independent one-line changes. Bundling them since each is too small to warrant its own commit and they're all on the "infrastructure" axis.
+
+- [ ] **Step 1.1: Update trim coefficient default**
+
+In `symbolic_agent/baselines/trove/toolbox.py`, change the default of `trim`:
+
+```python
+def trim(self, n_processed: int, C: float = 1.0) -> set:
+    """
+    Remove functions whose frequency is below the threshold
+        C * log_{20}(n_processed)
+    and return the set of example indices that had used those functions.
+
+    Faithful to trim_library() in run_trove.py:
+        threshold = math.log(n, 20) # log base 20
+    C defaults to 1.0, matching the original implementation (C·log_{20}(n)).
+    Note: the original uses log base-20 not base-10; we keep base-20.
+    """
+```
+
+- [ ] **Step 1.2: Update executor timeout default**
+
+In `symbolic_agent/baselines/trove/executor.py`, change the constant:
+
+```python
+DEFAULT_TIMEOUT = 60 # seconds — generous for PBEBench replace() chains and multi-turn dispatch
+```
+
+- [ ] **Step 1.3: Add reasoning_content fallback in `_call_openai`**
+
+In `symbolic_agent/baselines/trove/llm.py`, replace the line that reads `raw = response.choices[0].message.content or ""` with:
+
+```python
+                msg = response.choices[0].message
+                raw = msg.content or getattr(msg, "reasoning_content", "") or ""
+```
+
+Context (the surrounding `try` block stays unchanged):
+
+```python
+                response = self._client.chat.completions.create(
+                    model=model,
+                    max_tokens=max_tokens,
+                    messages=messages,
+                )
+                msg = response.choices[0].message
+                raw = msg.content or getattr(msg, "reasoning_content", "") or ""
+                u = getattr(response, "usage", None)
+```
+
+- [ ] **Step 1.4: Sanity-check the changes**
+
+Run: `python -c "from symbolic_agent.baselines.trove.toolbox import TroVEToolbox; from symbolic_agent.baselines.trove.executor import DEFAULT_TIMEOUT; import inspect; print(inspect.signature(TroVEToolbox.trim).parameters['C'].default, DEFAULT_TIMEOUT)"`
+
+Expected: `1.0 60`
+
+- [ ] **Step 1.5: Commit**
+
+```bash
+git add symbolic_agent/baselines/trove/toolbox.py symbolic_agent/baselines/trove/executor.py symbolic_agent/baselines/trove/llm.py
+git commit -m "$(cat <<'EOF'
+fix(trove): infra patches for native tool calling
+
+- toolbox.trim default C=1.0 (matches original TroVE)
+- executor DEFAULT_TIMEOUT=60s (PBEBench + multi-turn headroom)
+- llm._call_openai falls back to message.reasoning_content when
+  message.content is empty (gpt-oss Harmony channel split)
+EOF
+)"
+```
+
+---
+
+## Task 2: `parse.imported_callsites` helper + `task_family` parameter
+
+**Files:**
+- Modify: `symbolic_agent/baselines/trove/parse.py:86,106-114`
+- Create: `symbolic_agent/baselines/trove/tests/__init__.py`
+- Create: `symbolic_agent/baselines/trove/tests/test_parse_callsites.py`
+
+- [ ] **Step 2.1: Create the tests package marker**
+
+Create `symbolic_agent/baselines/trove/tests/__init__.py` as an empty file.
+
+- [ ] **Step 2.2: Write the failing test for `imported_callsites`**
+
+Create `symbolic_agent/baselines/trove/tests/test_parse_callsites.py`:
+
+```python
+"""Unit tests for parse.imported_callsites and parse_response(task_family=)."""
+
+from symbolic_agent.baselines.trove.parse import imported_callsites, parse_response
+
+
+# ---------------------------------------------------------------------------
+# imported_callsites
+# ---------------------------------------------------------------------------
+
+def test_callsites_bare_name():
+    code = "result = find_replace_chain(s, [('a', 'b')])\nprint(result)"
+    assert imported_callsites(code, tools_code="", candidate_names={"find_replace_chain", "other"}) == {"find_replace_chain"}
+
+
+def test_callsites_attribute_access():
+    code = "result = toolbox.find_replace_chain(s, pairs)\nprint(result)"
+    assert imported_callsites(code, tools_code="", candidate_names={"find_replace_chain"}) == {"find_replace_chain"}
+
+
+def test_callsites_no_match():
+    code = "print(s.replace('a', 'b'))"
+    assert imported_callsites(code, tools_code="", candidate_names={"find_replace_chain"}) == set()
+
+
+def test_callsites_multiple_calls_same_name_dedup():
+    code = "x = f(1)\ny = f(2)\nprint(x, y)"
+    assert imported_callsites(code, tools_code="", candidate_names={"f", "g"}) == {"f"}
+
+
+def test_callsites_syntax_error_returns_empty():
+    code = "this is not valid python ::"
+    assert imported_callsites(code, tools_code="", candidate_names={"f"}) == set()
+
+
+def test_callsites_empty_inputs():
+    assert imported_callsites("", "", set()) == set()
+    assert imported_callsites("print(1)", "", set()) == set()
+
+
+# ---------------------------------------------------------------------------
+# parse_response(task_family=)
+# ---------------------------------------------------------------------------
+
+def test_parse_response_pbebench_strict_no_solution_block():
+    text = "Here is some reasoning.\n```python\nprint('answer')\n```\n"
+    out = parse_response(text, task_family="pbebench")
+    assert out["solution_code"] == ""
+
+
+def test_parse_response_pbebench_with_solution_block():
+    text = "**Solution**\n```python\nprint('answer')\n```\n"
+    out = parse_response(text, task_family="pbebench")
+    assert out["solution_code"] == "print('answer')"
+
+
+def test_parse_response_default_falls_back_to_any_python_block():
+    text = "Here is some reasoning.\n```python\nprint('answer')\n```\n"
+    out = parse_response(text, task_family="default")
+    assert "print('answer')" in out["solution_code"]
+
+
+def test_parse_response_default_call_signature_unchanged():
+    text = "**Solution**\n```python\nprint('answer')\n```\n"
+    out = parse_response(text)
+    assert out["solution_code"] == "print('answer')"
+```
+
+- [ ] **Step 2.3: Run the tests to confirm they fail**
+
+Run: `python -m pytest symbolic_agent/baselines/trove/tests/test_parse_callsites.py -v`
+
+Expected: collection fails with an ImportError on `imported_callsites` (the function does not exist yet), so no tests run. The `parse_response(text, task_family=...)` tests would likewise fail on the unknown kwarg until Step 2.4 lands.
+
+- [ ] **Step 2.4: Implement `imported_callsites` and add `task_family` to `parse_response`**
+
+In `symbolic_agent/baselines/trove/parse.py`, add the helper at the end of the AST section (after `count_ast_nodes`):
+
+```python
+def imported_callsites(
+    solution_code: str,
+    tools_code: str,
+    candidate_names: set,
+) -> set:
+    """
+    Return the subset of `candidate_names` that appear as call-sites in
+    `solution_code`. Used for the `actually_called` telemetry field.
+
+    Detects two callee shapes:
+      - bare Name: find_replace_chain(...)
+      - Attribute(name): toolbox.find_replace_chain(...)
+ + `tools_code` is currently unused (kept in the signature so callers can + pass through the **Tools** block context if we later want to filter by + what was actually imported). + + Returns an empty set on empty input or SyntaxError. + """ + if not solution_code or not candidate_names: + return set() + try: + tree = ast.parse(solution_code) + except SyntaxError: + return set() + found: set = set() + for node in ast.walk(tree): + if not isinstance(node, ast.Call): + continue + func = node.func + if isinstance(func, ast.Name) and func.id in candidate_names: + found.add(func.id) + elif isinstance(func, ast.Attribute) and func.attr in candidate_names: + found.add(func.attr) + return found +``` + +Then modify `parse_response` (around line 86) to accept `task_family`: + +```python +def parse_response(text: str, task_family: str = "default") -> dict: + """ + Parse a TroVE-format LLM response. + + Returns + ------- + { + "solution_code": str, # code inside **Solution** block + "tools_code": str, # code inside **Tools** block + "functions": list[dict], # parsed tool dicts from the Tools block + } + + task_family + ----------- + "default": if no **Solution** block is found, falls back to the first + ```python``` block anywhere (legacy behaviour). + "pbebench": no fallback. Strict **Solution**-block-only parsing avoids + accidentally promoting CoT scratchpad to the answer. 
+ """ + solution_code = _extract_code_block(text, "Solution") or "" + tools_code = _extract_code_block(text, "Tools") or "" + + if not solution_code and task_family != "pbebench": + raw = _extract_any_python_block(text) + if raw: + solution_code = _make_executable(raw) + + functions = parse_tools_in_chunk(tools_code) if tools_code else [] + return { + "solution_code": solution_code, + "tools_code": tools_code, + "functions": functions, + } +``` + +- [ ] **Step 2.5: Run the tests to confirm they pass** + +Run: `python -m pytest symbolic_agent/baselines/trove/tests/test_parse_callsites.py -v` + +Expected: 10 passed. + +- [ ] **Step 2.6: Commit** + +```bash +git add symbolic_agent/baselines/trove/parse.py symbolic_agent/baselines/trove/tests/__init__.py symbolic_agent/baselines/trove/tests/test_parse_callsites.py +git commit -m "$(cat <<'EOF' +feat(trove): add imported_callsites helper and task_family to parse_response + +- imported_callsites(solution, tools, names) -> set: AST-walks Solution + code and returns names from the candidate set that are actually called. + Handles bare Name and Attribute (toolbox.foo) callees. +- parse_response(text, task_family="default"): when task_family="pbebench" + the parser does not fall back to the first python block when **Solution** + is missing. Prevents CoT scratchpad from being promoted to the answer. +EOF +)" +``` + +--- + +## Task 3: PBEBench-shaped few-shots + IMPORT-with-tools prompt + +**Files:** +- Modify: `symbolic_agent/baselines/trove/prompts.py` (full rewrite of constants and `build_*` functions) + +This task has no automated test — prompts are validated by inspection and by the smoke run. + +- [ ] **Step 3.1: Replace the prompts module with task-family-aware variants** + +Open `symbolic_agent/baselines/trove/prompts.py` and replace the entire body below the module docstring with the following. Keep the docstring at the top of the file. 
+ +```python +# --------------------------------------------------------------------------- +# Format override (default-family only) +# --------------------------------------------------------------------------- + +_FORMAT_OVERRIDE_DEFAULT = ( + "\nIMPORTANT: Regardless of any formatting instructions inside the question, " + "always produce your answer as executable Python in the **Solution** block " + "and end it with print(answer). " + "Your answer is whatever gets printed to stdout when the Solution code runs." +) + +# PBEBench prompts model the desired format directly via the few-shot example, +# so no override string is needed. +_FORMAT_OVERRIDE_PBEBENCH = "" + + +def _format_override(task_family: str) -> str: + return _FORMAT_OVERRIDE_PBEBENCH if task_family == "pbebench" else _FORMAT_OVERRIDE_DEFAULT + + +# --------------------------------------------------------------------------- +# IMPORT mode (text-based, default and Anthropic fallback) +# --------------------------------------------------------------------------- + +_IMPORT_INSTRUCTION_DEFAULT = ( + "Your task is to write Python program solutions to the given questions.\n" + "The toolbox section lists all the available functions that can be used in your solution." +) + +_IMPORT_EXAMPLE_DEFAULT = """\ +## Example +**Question** +Given a list of strings and a list of (old, new) substitution pairs, apply all +substitutions in order to each string and return the transformed list. +Strings: ["cat", "bat"] +Substitutions: [("a", "o"), ("t", "p")] + +**Toolbox** +```python +# Apply an ordered list of (old, new) substitutions to each string in a list. 
+apply_substitutions(strings: list, substitutions: list) -> list +``` + +**Solution** +```python +strings = ["cat", "bat"] +subs = [("a", "o"), ("t", "p")] +result = apply_substitutions(strings, subs) +print(result) +``` +**Tools** +```python +from toolbox import apply_substitutions +```""" + +_IMPORT_EXAMPLE_PBEBENCH = """\ +## Example +**Question** +You are given example input/output pairs. Produce a list of replace() calls +that transforms each input into its expected output. + +Input: "hello world" +Output: "HELLO_WORLD" + +**Toolbox** +```python +# Apply a chain of (old, new) replacements to a string. +find_replace_chain(s: str, pairs: list) -> str +``` + +**Solution** +```python +result = find_replace_chain("hello world", [(" ", "_"), ("h", "H"), ("e", "E"), ("l", "L"), ("o", "O"), ("w", "W"), ("r", "R"), ("d", "D")]) +print(result) +``` +**Tools** +```python +from toolbox import find_replace_chain +```""" + +_IMPORT_TASK_TEMPLATE = """\ +## Task +**Question** +{question} + +**Toolbox** +{toolbox} + +**Solution** +""" + + +def build_import_prompt(question: str, toolbox_str: str, task_family: str = "default") -> str: + """Build the text-based IMPORT-mode prompt (used for Anthropic and as fallback).""" + instruction = _IMPORT_INSTRUCTION_DEFAULT + _format_override(task_family) + example = _IMPORT_EXAMPLE_PBEBENCH if task_family == "pbebench" else _IMPORT_EXAMPLE_DEFAULT + return ( + instruction + + "\n\n\n" + + example + + "\n\n\n" + + _IMPORT_TASK_TEMPLATE.format(question=question, toolbox=toolbox_str) + ) + + +# --------------------------------------------------------------------------- +# IMPORT-with-tools mode (native OpenAI tool calling; no **Toolbox** block) +# --------------------------------------------------------------------------- + +_IMPORT_WITH_TOOLS_INSTRUCTION_DEFAULT = ( + "Your task is to write Python program solutions to the given questions.\n" + "You have a set of helper functions available as tools. 
Call any of them " + "when they help you solve the question; otherwise solve directly. After " + "you have computed the answer, output it as executable Python in a " + "**Solution** block and end with print(answer)." +) + +_IMPORT_WITH_TOOLS_INSTRUCTION_PBEBENCH = ( + "Your task is to produce a list of replace() calls that transforms each " + "input into its expected output for a Programming-by-Example task.\n" + "You have a set of helper functions available as tools. Call any of them " + "to test ideas or compute intermediate results; the final answer must be " + "produced as a Python program in the **Solution** block." +) + +_IMPORT_WITH_TOOLS_EXAMPLE_DEFAULT = """\ +## Example +**Question** +Apply substitutions [("a","o"),("t","p")] to ["cat","bat"] and return the list. + +(After optionally calling `apply_substitutions` as a tool to confirm, +the assistant produces:) + +**Solution** +```python +strings = ["cat", "bat"] +subs = [("a", "o"), ("t", "p")] +result = apply_substitutions(strings, subs) +print(result) +```""" + +_IMPORT_WITH_TOOLS_EXAMPLE_PBEBENCH = """\ +## Example +**Question** +Produce a sequence of replace() calls that transforms "hello world" into +"HELLO_WORLD". + +(After optionally calling `find_replace_chain` as a tool to verify a +candidate sequence, the assistant produces:) + +**Solution** +```python +result = find_replace_chain("hello world", [(" ", "_"), ("h", "H"), ("e", "E"), ("l", "L"), ("o", "O"), ("w", "W"), ("r", "R"), ("d", "D")]) +print(result) +```""" + +_IMPORT_WITH_TOOLS_TASK_TEMPLATE = """\ +## Task +**Question** +{question} + +**Solution** +""" + + +def build_import_with_tools_prompt(question: str, task_family: str = "default") -> str: + """ + Build the IMPORT-with-tools prompt. The toolbox is NOT shown as text; it + is conveyed via the OpenAI tools=[...] parameter on the chat completion call. 
+ """ + if task_family == "pbebench": + instruction = _IMPORT_WITH_TOOLS_INSTRUCTION_PBEBENCH + example = _IMPORT_WITH_TOOLS_EXAMPLE_PBEBENCH + else: + instruction = _IMPORT_WITH_TOOLS_INSTRUCTION_DEFAULT + example = _IMPORT_WITH_TOOLS_EXAMPLE_DEFAULT + return ( + instruction + + "\n\n\n" + + example + + "\n\n\n" + + _IMPORT_WITH_TOOLS_TASK_TEMPLATE.format(question=question) + ) + + +# --------------------------------------------------------------------------- +# CREATE mode +# --------------------------------------------------------------------------- + +_CREATE_INSTRUCTION_DEFAULT = ( + "Your task is to write Python program solutions to the given questions.\n" + "You should also create Python functions that can be used by your solution, " + "if you believe the function can be reused to solve other questions." +) + +_CREATE_EXAMPLE_DEFAULT = """\ +## Example +**Question** +Given a list of strings and a list of (old, new) substitution pairs, apply all +substitutions in order to each string and return the transformed list. +Strings: ["hello", "world"] +Substitutions: [("l", "r"), ("o", "0")] + +**Solution** +```python +strings = ["hello", "world"] +subs = [("l", "r"), ("o", "0")] +result = apply_substitutions(strings, subs) +print(result) +``` +**Tools** +```python +def apply_substitutions(strings, substitutions): + \"\"\"Apply an ordered list of (old, new) substitutions to each string in a list.\"\"\" + out = [] + for s in strings: + for old, new in substitutions: + s = s.replace(old, new) + out.append(s) + return out +```""" + +_CREATE_EXAMPLE_PBEBENCH = """\ +## Example +**Question** +Produce a sequence of replace() calls that transforms "hello world" into +"HELLO_WORLD". 
+ +**Solution** +```python +result = find_replace_chain("hello world", [(" ", "_"), ("h", "H"), ("e", "E"), ("l", "L"), ("o", "O"), ("w", "W"), ("r", "R"), ("d", "D")]) +print(result) +``` +**Tools** +```python +def find_replace_chain(s, pairs): + \"\"\"Apply a chain of (old, new) replacements to a string.\"\"\" + for old, new in pairs: + s = s.replace(old, new) + return s +```""" + +_CREATE_TASK_TEMPLATE = """\ +## Task +**Question** +{question} + +**Solution** +""" + + +def build_create_prompt(question: str, task_family: str = "default") -> str: + """Build the CREATE-mode prompt for a single task.""" + instruction = _CREATE_INSTRUCTION_DEFAULT + _format_override(task_family) + example = _CREATE_EXAMPLE_PBEBENCH if task_family == "pbebench" else _CREATE_EXAMPLE_DEFAULT + return ( + instruction + + "\n\n\n" + + example + + "\n\n\n" + + _CREATE_TASK_TEMPLATE.format(question=question) + ) + + +# --------------------------------------------------------------------------- +# SKIP mode +# --------------------------------------------------------------------------- + +_SKIP_INSTRUCTION_DEFAULT = ( + "Your task is to write Python program solutions to the given questions." +) + +_SKIP_EXAMPLE_DEFAULT = """\ +## Example +**Question** +Given the list of strings ["Hello", "World"], convert each to lowercase and +return the resulting list. + +**Solution** +```python +strings = ["Hello", "World"] +result = [s.lower() for s in strings] +print(result) +``` +**Tools** +```python +```""" + +_SKIP_EXAMPLE_PBEBENCH = """\ +## Example +**Question** +Produce a sequence of replace() calls that transforms "hello world" into +"HELLO_WORLD". 
+ +**Solution** +```python +s = "hello world" +s = s.replace(" ", "_") +s = s.replace("h", "H") +s = s.replace("e", "E") +s = s.replace("l", "L") +s = s.replace("o", "O") +s = s.replace("w", "W") +s = s.replace("r", "R") +s = s.replace("d", "D") +print(s) +``` +**Tools** +```python +```""" + +_SKIP_TASK_TEMPLATE = """\ +## Task +**Question** +{question} + +**Solution** +""" + + +def build_skip_prompt(question: str, task_family: str = "default") -> str: + """Build the SKIP-mode prompt for a single task.""" + instruction = _SKIP_INSTRUCTION_DEFAULT + _format_override(task_family) + example = _SKIP_EXAMPLE_PBEBENCH if task_family == "pbebench" else _SKIP_EXAMPLE_DEFAULT + return ( + instruction + + "\n\n\n" + + example + + "\n\n\n" + + _SKIP_TASK_TEMPLATE.format(question=question) + ) + + +def get_question(task_input: dict) -> str: + """ + Extract the question/prompt string from a task_input dict. + + Priority: question > prompt > task > str(task_input). + """ + for key in ("question", "prompt", "task"): + val = task_input.get(key) + if val and isinstance(val, str) and val.strip(): + return val.strip() + return str(task_input) +``` + +- [ ] **Step 3.2: Smoke-test the new prompts compile and dispatch correctly** + +Run: `python -c "from symbolic_agent.baselines.trove.prompts import build_import_prompt, build_create_prompt, build_skip_prompt, build_import_with_tools_prompt; print('--IMPORT default--'); print(build_import_prompt('Q?', 'TB')[:200]); print('--IMPORT pbebench--'); print(build_import_prompt('Q?', 'TB', task_family='pbebench')[:200]); print('--IMPORT_WITH_TOOLS pbebench--'); print(build_import_with_tools_prompt('Q?', task_family='pbebench')[:200])"` + +Expected: three short prompt previews, no exceptions, no `IMPORTANT:` line in the pbebench variant. 
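The inspection in Step 3.2 can also be phrased as assertions. The sketch below is a minimal, self-contained illustration: `check_prompt` is a hypothetical helper (not part of `prompts.py`), and `sample` is a hand-written stand-in for what `build_import_with_tools_prompt('Q?', task_family='pbebench')` is expected to look like, so the block runs without the package installed.

```python
from typing import List


def check_prompt(prompt: str, required: List[str], forbidden: List[str]) -> List[str]:
    """Return a list of problems found in `prompt`; an empty list means it passes."""
    problems = []
    for marker in required:
        if marker not in prompt:
            problems.append(f"missing {marker!r}")
    for marker in forbidden:
        if marker in prompt:
            problems.append(f"unexpected {marker!r}")
    return problems


# Hand-written stand-in with the same section layout as the pbebench
# IMPORT-with-tools prompt (instruction, example, task); not real builder output.
sample = (
    "Instruction text.\n\n\n"
    "## Example\n**Question**\n...\n\n**Solution**\n...\n\n\n"
    "## Task\n**Question**\nQ?\n\n**Solution**\n"
)

# The pbebench tool-calling prompt should carry no override line and no text toolbox.
problems = check_prompt(
    sample,
    required=["## Example", "## Task", "**Solution**"],
    forbidden=["IMPORTANT:", "**Toolbox**"],
)
print(problems)
```

Against the real module, `sample` would be replaced by the builder's return value; the marker lists are assumptions drawn from the templates in Step 3.1.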
+ +- [ ] **Step 3.3: Commit** + +```bash +git add symbolic_agent/baselines/trove/prompts.py +git commit -m "$(cat <<'EOF' +feat(trove): PBEBench-shaped few-shots and IMPORT-with-tools prompt + +- Add task_family parameter to all build_* prompt builders. +- Add _CREATE_EXAMPLE_PBEBENCH and _SKIP_EXAMPLE_PBEBENCH demonstrating + replace()-chain solutions and a find_replace_chain helper. +- Add build_import_with_tools_prompt for native tool calling: no + **Toolbox** markdown block (toolbox is conveyed via tools=[...]). +- _FORMAT_OVERRIDE is empty for task_family="pbebench" (the example + models the desired format directly). +EOF +)" +``` + +--- + +## Task 4: New `tools_api.py` (toolbox -> OpenAI schemas, dispatcher) + +**Files:** +- Create: `symbolic_agent/baselines/trove/tools_api.py` +- Create: `symbolic_agent/baselines/trove/tests/test_tools_api.py` + +- [ ] **Step 4.1: Write the failing tests** + +Create `symbolic_agent/baselines/trove/tests/test_tools_api.py`: + +```python +"""Unit tests for tools_api.toolbox_to_openai_tools and dispatch_tool_call.""" + +import json +from types import SimpleNamespace + +from symbolic_agent.baselines.trove.toolbox import TroVEToolbox +from symbolic_agent.baselines.trove.tools_api import ( + dispatch_tool_call, + toolbox_to_openai_tools, +) + + +def _make_toolbox_with(func_src: str, name: str, docstr: str = "") -> TroVEToolbox: + tb = TroVEToolbox() + tb.add( + { + "name": name, + "docstr": docstr, + "signature": f"def {name}(...)", + "function": func_src, + "type": "function", + }, + example_idx=0, + ) + return tb + + +def _tool_call(name: str, args: dict, call_id: str = "call_1"): + return SimpleNamespace( + id=call_id, + function=SimpleNamespace(name=name, arguments=json.dumps(args)), + ) + + +# --------------------------------------------------------------------------- +# toolbox_to_openai_tools +# --------------------------------------------------------------------------- + +def test_schema_basic_function(): + src = ( + "def 
find_replace_chain(s: str, pairs: list) -> str:\n" + " \"\"\"Apply a chain of (old, new) replacements to a string.\"\"\"\n" + " for old, new in pairs:\n" + " s = s.replace(old, new)\n" + " return s\n" + ) + tb = _make_toolbox_with(src, "find_replace_chain", docstr="Apply a chain of (old, new) replacements to a string.") + tools = toolbox_to_openai_tools(tb, topk=10) + assert len(tools) == 1 + fn = tools[0] + assert fn["type"] == "function" + assert fn["function"]["name"] == "find_replace_chain" + assert fn["function"]["description"] == "Apply a chain of (old, new) replacements to a string." + params = fn["function"]["parameters"] + assert params["type"] == "object" + assert set(params["properties"].keys()) == {"s", "pairs"} + assert params["properties"]["s"]["type"] == "string" + assert params["properties"]["pairs"]["type"] == "array" + assert set(params["required"]) == {"s", "pairs"} + + +def test_schema_unannotated_falls_back_to_string(): + src = ( + "def f(x):\n" + " return x\n" + ) + tb = _make_toolbox_with(src, "f") + tools = toolbox_to_openai_tools(tb, topk=10) + assert tools[0]["function"]["parameters"]["properties"]["x"]["type"] == "string" + + +def test_schema_skips_varargs_kwargs(): + src = ( + "def f(*args, **kwargs):\n" + " return args\n" + ) + tb = _make_toolbox_with(src, "f") + tools = toolbox_to_openai_tools(tb, topk=10) + assert tools == [] + + +def test_schema_required_excludes_defaults(): + src = ( + "def f(x: int, y: int = 5):\n" + " return x + y\n" + ) + tb = _make_toolbox_with(src, "f") + tools = toolbox_to_openai_tools(tb, topk=10) + params = tools[0]["function"]["parameters"] + assert params["required"] == ["x"] + assert params["properties"]["y"]["type"] == "integer" + + +def test_schema_topk_respects_frequency(): + tb = TroVEToolbox() + for n, freq in [("a", 3), ("b", 2), ("c", 1)]: + tb.add( + { + "name": n, + "docstr": "", + "signature": f"def {n}()", + "function": f"def {n}():\n return 0\n", + "type": "function", + }, + example_idx=0, + ) 
+ for _ in range(freq - 1): + tb.update_frequency(n, example_idx=0) + tools = toolbox_to_openai_tools(tb, topk=2) + assert [t["function"]["name"] for t in tools] == ["a", "b"] + + +def test_schema_empty_toolbox(): + assert toolbox_to_openai_tools(TroVEToolbox(), topk=10) == [] + + +# --------------------------------------------------------------------------- +# dispatch_tool_call +# --------------------------------------------------------------------------- + +def test_dispatch_runs_function_and_returns_stdout(): + src = ( + "def reverse_str(s):\n" + " return s[::-1]\n" + ) + tb = _make_toolbox_with(src, "reverse_str") + result = dispatch_tool_call(tb, _tool_call("reverse_str", {"s": "hello"})) + assert "olleh" in result + + +def test_dispatch_unknown_tool_returns_error(): + tb = TroVEToolbox() + result = dispatch_tool_call(tb, _tool_call("nonexistent", {})) + assert "not in toolbox" in result + + +def test_dispatch_bad_json_returns_error(): + src = "def f(x):\n return x\n" + tb = _make_toolbox_with(src, "f") + bad = SimpleNamespace( + id="x", + function=SimpleNamespace(name="f", arguments="{not json"), + ) + result = dispatch_tool_call(tb, bad) + assert "argument JSON parse failed" in result + + +def test_dispatch_sanitizes_harmony_contamination(): + src = "def reverse_str(s):\n return s[::-1]\n" + tb = _make_toolbox_with(src, "reverse_str") + tc = _tool_call("reverse_str<|channel|>commentary", {"s": "abc"}) + result = dispatch_tool_call(tb, tc) + assert "cba" in result + + +def test_dispatch_truncates_long_output(): + src = ( + "def long_output(n):\n" + " return 'x' * n\n" + ) + tb = _make_toolbox_with(src, "long_output") + result = dispatch_tool_call(tb, _tool_call("long_output", {"n": 10000})) + assert len(result) <= 4096 + 100 # +slack for repr quotes and truncation marker +``` + +- [ ] **Step 4.2: Run the tests to confirm they fail** + +Run: `python -m pytest symbolic_agent/baselines/trove/tests/test_tools_api.py -v` + +Expected: ImportError on `tools_api` 
module. + +- [ ] **Step 4.3: Create the `tools_api.py` module** + +Create `symbolic_agent/baselines/trove/tools_api.py`: + +```python +"""Translate the TroVE toolbox into OpenAI Chat Completions tool schemas +and dispatch tool calls back through the executor. + +This module is the bridge between TroVE's in-memory toolbox and vLLM's +native tool-calling protocol. It is invoked only from the IMPORT-with-tools +controller branch. +""" + +from __future__ import annotations + +import inspect +import json +import logging +from typing import Any + +from .executor import run_solution +from .toolbox import TroVEToolbox + +logger = logging.getLogger(__name__) + +_MAX_RESULT_CHARS = 4096 + +# Type inference: Python annotation -> JSON Schema type. +_TYPE_MAP = { + int: "integer", + float: "number", + bool: "boolean", + str: "string", + list: "array", + tuple: "array", + dict: "object", +} + + +def _infer_type(annotation: Any) -> str: + if annotation is inspect.Parameter.empty: + return "string" + # Plain types (int, str, etc.) + if annotation in _TYPE_MAP: + return _TYPE_MAP[annotation] + # typing.List, typing.Dict, etc. — fall through to string if unrecognised. + origin = getattr(annotation, "__origin__", None) + if origin in _TYPE_MAP: + return _TYPE_MAP[origin] + return "string" + + +def _function_to_schema(name: str, fn: Any, docstr: str) -> dict | None: + """ + Build one OpenAI tool dict from a callable. Returns None if the function + has *args or **kwargs (we cannot generate a meaningful schema). 
+ """ + try: + sig = inspect.signature(fn) + except (TypeError, ValueError) as exc: + logger.debug("Could not introspect %s: %s", name, exc) + return None + + properties: dict = {} + required: list = [] + + for pname, param in sig.parameters.items(): + if param.kind in ( + inspect.Parameter.VAR_POSITIONAL, + inspect.Parameter.VAR_KEYWORD, + ): + logger.debug("Skipping %s — has *args/**kwargs", name) + return None + prop: dict = {"type": _infer_type(param.annotation)} + if param.default is not inspect.Parameter.empty: + if isinstance(param.default, (int, float, bool, str)): + prop["default"] = param.default + else: + required.append(pname) + properties[pname] = prop + + return { + "type": "function", + "function": { + "name": name, + "description": docstr or "", + "parameters": { + "type": "object", + "properties": properties, + "required": required, + }, + }, + } + + +def toolbox_to_openai_tools(toolbox: TroVEToolbox, topk: int = 10) -> list: + """ + Convert the top-k toolbox functions (by frequency) into OpenAI Chat + Completions tool dicts. + + Functions with *args / **kwargs are silently excluded. + Returns [] when the toolbox is empty. 
+ """ + entries = toolbox.snapshot() + if not entries: + return [] + entries.sort(key=lambda e: -int(e.get("frequency", 0))) + selected = entries[:topk] + + namespace: dict = {} + try: + exec(toolbox.get_full_code(), namespace) + except Exception as exc: + logger.warning("Could not exec toolbox source for schema generation: %s", exc) + return [] + + tools: list = [] + for entry in selected: + name = entry.get("name", "") + if not name or name not in namespace: + continue + fn = namespace[name] + schema = _function_to_schema(name, fn, entry.get("docstr", "")) + if schema is not None: + tools.append(schema) + return tools + + +def _sanitize_name(name: str) -> str: + """Defensive workaround for vLLM PR #35906 (Harmony control tokens + leaking into tool names like `reverse_str<|channel|>commentary`).""" + return name.split("<|", 1)[0].strip() + + +def _truncate(s: str, limit: int = _MAX_RESULT_CHARS) -> str: + if len(s) <= limit: + return s + return s[:limit] + f"\n... [truncated {len(s) - limit} chars]" + + +def dispatch_tool_call(toolbox: TroVEToolbox, tool_call) -> str: + """ + Resolve `tool_call` against the toolbox, run it via the sandbox executor, + and return the captured stdout (truncated to 4096 chars) or an error + message string. Always returns a string — never raises. 
+ """ + name = _sanitize_name(getattr(tool_call.function, "name", "") or "") + if not name: + return json.dumps({"error": "tool_call has no function name"}) + if name not in {e["name"] for e in toolbox.snapshot()}: + return json.dumps({"error": f"tool '{name}' not in toolbox"}) + + raw_args = getattr(tool_call.function, "arguments", "") or "{}" + try: + args = json.loads(raw_args) + if not isinstance(args, dict): + return json.dumps({"error": f"argument JSON parse failed: expected object, got {type(args).__name__}"}) + except json.JSONDecodeError as exc: + return json.dumps({"error": f"argument JSON parse failed: {exc}"}) + + call_expr = f"print(repr({name}(**{args!r})))" + is_ok, output = run_solution( + solution_code=call_expr, + tools_code="", + toolbox_code=toolbox.get_full_code(), + ) + if not is_ok: + return json.dumps({"error": "execution failed", "stderr": _truncate(output)}) + return _truncate(output) +``` + +- [ ] **Step 4.4: Run the tests to confirm they pass** + +Run: `python -m pytest symbolic_agent/baselines/trove/tests/test_tools_api.py -v` + +Expected: 11 passed. + +- [ ] **Step 4.5: Commit** + +```bash +git add symbolic_agent/baselines/trove/tools_api.py symbolic_agent/baselines/trove/tests/test_tools_api.py +git commit -m "$(cat <<'EOF' +feat(trove): add tools_api for native OpenAI tool calling + +- toolbox_to_openai_tools(toolbox, topk=10): converts top-k toolbox + functions into OpenAI Chat Completions tool schemas. Infers parameter + types from inspect.signature; functions with *args/**kwargs are + silently excluded. +- dispatch_tool_call(toolbox, tool_call): runs the requested function + in the sandbox executor, returns stdout truncated to 4096 chars or + a JSON error string. Sanitizes Harmony control-token contamination + in tool names (defensive vs. open vLLM PR #35906). 
+EOF +)" +``` + +--- + +## Task 5: `chat_with_tools` method on `TroVELLMClient` + +**Files:** +- Modify: `symbolic_agent/baselines/trove/llm.py` (add new method, no signature changes to existing methods) + +This task has no automated test; the multi-turn loop is validated by the controller-level integration plus the smoke run. + +- [ ] **Step 5.1: Add `chat_with_tools` to `TroVELLMClient`** + +In `symbolic_agent/baselines/trove/llm.py`, make sure the `typing` import near the top includes every name below (the existing import may already cover some of them): + +```python +from typing import Any, Callable, Dict, List, Optional +``` + +Then add the new method to the `TroVELLMClient` class (insert after `_call_openai`, before `_record`): + +```python + # ------------------------------------------------------------------ + # Native tool calling (OpenAI/vLLM only) + # ------------------------------------------------------------------ + + def chat_with_tools( + self, + messages: List[Dict[str, Any]], + tools: List[Dict[str, Any]], + model: str, + max_tokens: int = DEFAULT_MAX_TOKENS, + max_tool_iters: int = 8, + on_tool_call: Optional[Callable[[Any], str]] = None, + tag: str = "", + ) -> Dict[str, Any]: + """ + Multi-turn chat completion that supports native OpenAI tool calls. + + Returns + ------- + { + "final_text": str, # message.content (or reasoning_content fallback) + "tool_calls": list[dict], # ordered, each {name, args_preview, result_preview, ok} + "iterations": int, # number of round-trips actually used + "stopped_reason": str, # "no_tool_calls" | "max_iters" | "error" + } + + The caller is responsible for providing `on_tool_call(tc) -> str`, + which is invoked for every tool_call returned by the model. The + return value (already a string) is sent back as the tool message. + + Anthropic backend is not supported; this method exists for the + OpenAI/vLLM tool-calling flow only. 
It raises NotImplementedError + on Anthropic as a defensive guard; controllers must check + `self.backend == "openai"` before calling. + """ + if self.backend != "openai": + raise NotImplementedError("chat_with_tools requires the openai backend") + + if on_tool_call is None: + raise ValueError("chat_with_tools requires an on_tool_call callback") + + recorded_calls: List[Dict[str, Any]] = [] + convo: List[Dict[str, Any]] = list(messages) + iterations = 0 + final_text = "" + stopped_reason = "no_tool_calls" + + for it in range(max_tool_iters + 1): + iterations = it + 1 + iter_tag = f"{tag}_iter{it}" if tag else f"iter{it}" + response = None + last_exc = None + + for attempt in range(3): + try: + response = self._client.chat.completions.create( + model=model, + max_tokens=max_tokens, + messages=convo, + tools=tools, + tool_choice="auto", + ) + break + except Exception as exc: + last_exc = exc + if getattr(exc, "status_code", None) == 400: + logger.warning( + "OpenAI chat_with_tools 400 (tag=%s): %s", iter_tag, exc + ) + self._record(iter_tag, model, json.dumps(convo)[:2000], "", max_tokens, {}) + return { + "final_text": "", + "tool_calls": recorded_calls, + "iterations": iterations, + "stopped_reason": "error", + } + if attempt < 2: + wait = 5 * (2 ** attempt) + logger.warning( + "chat_with_tools failed (attempt %d/3, tag=%s): %s. 
Retrying in %ds.", + attempt + 1, iter_tag, exc, wait, + ) + time.sleep(wait) + + if response is None: + logger.warning("All chat_with_tools retries exhausted (tag=%s): %s", iter_tag, last_exc) + stopped_reason = "error" + break + + msg = response.choices[0].message + content = msg.content or getattr(msg, "reasoning_content", "") or "" + tool_calls = getattr(msg, "tool_calls", None) or [] + + u = getattr(response, "usage", None) + details = getattr(u, "completion_tokens_details", None) + usage = { + "input_tokens": getattr(u, "prompt_tokens", 0) or 0, + "output_tokens": getattr(u, "completion_tokens", 0) or 0, + "reasoning_tokens": (getattr(details, "reasoning_tokens", 0) or 0) if details else 0, + } + self._record( + iter_tag, + model, + json.dumps(convo)[:2000], + json.dumps({"content": content, "tool_calls_count": len(tool_calls)}), + max_tokens, + usage, + ) + + if not tool_calls: + final_text = content + stopped_reason = "no_tool_calls" + break + + # Tool budget spent: stop before dispatching calls whose results + # would never be sent back to the model (and would still be recorded). + if it >= max_tool_iters: + stopped_reason = "max_iters" + final_text = content + break + + assistant_msg: Dict[str, Any] = { + "role": "assistant", + "content": content, + "tool_calls": [ + { + "id": tc.id, + "type": "function", + "function": { + "name": tc.function.name, + "arguments": tc.function.arguments, + }, + } + for tc in tool_calls + ], + } + convo.append(assistant_msg) + + for tc in tool_calls: + try: + result = on_tool_call(tc) + ok = True + except Exception as exc: + result = json.dumps({"error": f"on_tool_call raised: {exc}"}) + ok = False + args_preview = (tc.function.arguments or "")[:200] + result_preview = (result or "")[:200] + recorded_calls.append( + { + "name": tc.function.name, + "args_preview": args_preview, + "result_preview": result_preview, + "ok": ok, + } + ) + convo.append( + { + "role": "tool", + "tool_call_id": tc.id, + "content": result, + } + ) + 
+ return { + "final_text": final_text, + "tool_calls": recorded_calls, + "iterations": iterations, + "stopped_reason": stopped_reason, + } 
+``` + +- [ ] **Step 5.2: Smoke-test the method does not break import** + +Run: `python -c "from symbolic_agent.baselines.trove.llm import TroVELLMClient; print(hasattr(TroVELLMClient, 'chat_with_tools'))"` + +Expected: `True`. + +- [ ] **Step 5.3: Smoke-test the Anthropic guard fires** + +Run: `python -c "from symbolic_agent.baselines.trove.llm import TroVELLMClient; c = TroVELLMClient(backend='anthropic', api_key='unused'); +try: + c.chat_with_tools([], [], model='x', on_tool_call=lambda x: '') + print('no exception (BUG)') +except NotImplementedError as e: + print('guard fires:', e)"` + +Expected: `guard fires: chat_with_tools requires the openai backend`. + +- [ ] **Step 5.4: Commit** + +```bash +git add symbolic_agent/baselines/trove/llm.py +git commit -m "$(cat <<'EOF' +feat(trove): add TroVELLMClient.chat_with_tools for native tool calls + +Multi-turn loop that handles tool_calls returned by gpt-oss/vLLM: +appends assistant message + tool result messages until the model returns +no tool_calls or max_tool_iters is reached. Records each call as +{name, args_preview, result_preview, ok} for downstream telemetry. +Reuses the existing 3-attempt retry, debug logging, and token accounting. + +Anthropic backend raises NotImplementedError as a defensive guard; +controllers branch on self.backend == "openai" before calling. +EOF +)" +``` + +--- + +## Task 6: Controller IMPORT-with-tools branch + telemetry fields + +**Files:** +- Modify: `symbolic_agent/baselines/trove/controller.py` + +- [ ] **Step 6.1: Update imports and `__init__` signature** + +In `symbolic_agent/baselines/trove/controller.py`, replace the imports block at the top (currently lines 36-44) with: + +```python +import logging +from collections import Counter +from typing import Callable, Dict, List, Optional + +from . 
import tools_api +from .executor import run_solution +from .llm import TroVELLMClient +from .parse import count_ast_nodes, imported_callsites, parse_response +from .prompts import ( + build_create_prompt, + build_import_prompt, + build_import_with_tools_prompt, + build_skip_prompt, + get_question, +) +from .toolbox import TroVEToolbox +``` + +Then update `TroVEController.__init__` (currently around lines 78-105) to accept the two new parameters: + +```python + def __init__( + self, + api_key: Optional[str] = None, + model: str = "claude-sonnet-4-5", + base_url: Optional[str] = None, + debug_dir: Optional[str] = None, + k: int = DEFAULT_K, + trim_every: int = DEFAULT_TRIM_EVERY, + trim_C: float = 1.0, + temperature: float = 0.3, + top_p: float = 0.95, + task_family: str = "default", + selection: str = "reward", + max_tool_iters: int = 8, + tool_schema_topk: int = 10, + ): + self.model = model + self.k = k + self.trim_every = trim_every + self.trim_C = trim_C + self.task_family = task_family + self.selection = selection + self.max_tool_iters = max_tool_iters + self.tool_schema_topk = tool_schema_topk + + self.backend = "openai" if base_url else "anthropic" + self.llm = TroVELLMClient( + backend=self.backend, + base_url=base_url, + api_key=api_key, + temperature=temperature, + top_p=top_p, + debug_dir=debug_dir, + ) + self.toolbox = TroVEToolbox() + self._n_processed: int = 0 +``` + +(Note `trim_C` default is now 1.0 to match the toolbox change in Task 1; controllers passing the default get the new behavior.) 
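To make the effect of the `trim_C` default concrete, here is a minimal sketch of the `C·log_{20}(n)` trimming threshold. It assumes `n` is the number of tasks processed so far and that a function survives trimming when its frequency is at least the threshold; both the keep rule and the helper names `trim_threshold` and `functions_to_keep` are illustrative assumptions, not the toolbox's actual API.

```python
import math


def trim_threshold(n_processed: int, c: float = 1.0, base: int = 20) -> float:
    """Return C * log_base(n), the frequency threshold used for trimming."""
    if n_processed < 1:
        raise ValueError("n_processed must be >= 1")
    return c * math.log(n_processed, base)


def functions_to_keep(frequencies: dict, n_processed: int, c: float = 1.0) -> list:
    """Assumed rule: keep functions whose frequency is >= the threshold."""
    threshold = trim_threshold(n_processed, c)
    return sorted(name for name, freq in frequencies.items() if freq >= threshold)


# Hypothetical frequency table after 500 processed tasks.
freqs = {"find_replace_chain": 3, "one_off_helper": 1}
kept = functions_to_keep(freqs, n_processed=500)
print(f"threshold={trim_threshold(500):.3f}, kept={kept}")
```

With `C=1.0` the threshold only exceeds 1 once more than 20 tasks have been processed, so under the assumed keep rule a function called a single time is safe from trimming during the early part of a run.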
+ +- [ ] **Step 6.2: Update existing build_* call-sites to pass `task_family`** + +In `_multi_way_generation`, find each call to `build_create_prompt(question)` and `build_skip_prompt(question)` and the legacy `build_import_prompt(question, toolbox_str)`, replacing them with: + +```python + prompt = build_import_prompt(question, toolbox_str, task_family=self.task_family) +``` + +```python + prompt = build_create_prompt(question, task_family=self.task_family) +``` + +```python + prompt = build_skip_prompt(question, task_family=self.task_family) +``` + +Also update `parse_response(raw)` calls to `parse_response(raw, task_family=self.task_family)`. + +- [ ] **Step 6.3: Insert the IMPORT-with-tools branch in `_multi_way_generation`** + +Locate the `# --- IMPORT mode ---` section (currently around lines 254-274). Replace it with: + +```python + # --- IMPORT mode --- + toolbox_nonempty = bool(toolbox_str) + use_tools_branch = toolbox_nonempty and self.backend == "openai" + + if use_tools_branch: + import_candidates = self._generate_import_with_tools( + question, example_idx, reward_fn=reward_fn, entry=entry + ) + best_import_idx, best_import_score = self._select_best( + import_candidates, reward_fn=reward_fn, entry=entry + ) + best_import = import_candidates[best_import_idx] + best_import["_reward_score"] = best_import_score + elif toolbox_nonempty: + # Legacy text-based IMPORT (Anthropic or unforeseen non-OpenAI path). 
+ import_candidates = [] + for _ in range(self.k): + prompt = build_import_prompt(question, toolbox_str, task_family=self.task_family) + raw = self.llm.call(prompt, self.model, max_tokens=DEFAULT_MAX_TOKENS, tag="trove_import") + parsed = parse_response(raw, task_family=self.task_family) + is_ok, out = run_solution( + parsed["solution_code"], + parsed["tools_code"], + self.toolbox.get_full_code(), + ) + import_candidates.append( + {**parsed, "is_success": is_ok, "exec_output": out, "tool_calls": [], "stopped_reason": "legacy"} + ) + best_import_idx, best_import_score = self._select_best( + import_candidates, reward_fn=reward_fn, entry=entry + ) + best_import = import_candidates[best_import_idx] + best_import["_reward_score"] = best_import_score + else: + best_import = { + "solution_code": "", "tools_code": "", "functions": [], + "is_success": False, "exec_output": "", + "tool_calls": [], "stopped_reason": "empty_toolbox", + "_reward_score": None, + } +``` + +- [ ] **Step 6.4: Add the `_generate_import_with_tools` method** + +Insert this new method into the `TroVEController` class, after `_multi_way_generation`: + +```python + def _generate_import_with_tools( + self, + question: str, + example_idx: int, + reward_fn: Optional[Callable] = None, + entry: Optional[dict] = None, + ) -> List[dict]: + """ + IMPORT-mode generation using native OpenAI tool calling. + Builds K trajectories; each trajectory may invoke toolbox functions + via tool_calls during the multi-turn loop. Returns K candidate dicts + compatible with _select_best. 
+ """ + prompt = build_import_with_tools_prompt(question, task_family=self.task_family) + tools_schema = tools_api.toolbox_to_openai_tools(self.toolbox, topk=self.tool_schema_topk) + + candidates: List[dict] = [] + for i in range(self.k): + tag = f"trove_import_t{example_idx}_{i}" + messages = [{"role": "user", "content": prompt}] + on_tc = lambda tc: tools_api.dispatch_tool_call(self.toolbox, tc) + traj = self.llm.chat_with_tools( + messages=messages, + tools=tools_schema, + model=self.model, + max_tokens=DEFAULT_MAX_TOKENS, + max_tool_iters=self.max_tool_iters, + on_tool_call=on_tc, + tag=tag, + ) + parsed = parse_response(traj["final_text"], task_family=self.task_family) + is_ok, out = run_solution( + parsed["solution_code"], + parsed["tools_code"], + self.toolbox.get_full_code(), + ) + candidates.append( + { + **parsed, + "is_success": is_ok, + "exec_output": out, + "tool_calls": traj["tool_calls"], + "stopped_reason": traj["stopped_reason"], + "iterations": traj["iterations"], + } + ) + return candidates +``` + +- [ ] **Step 6.5: Wire `selection="consistency"` to the existing consistency selector** + +Replace `_select_best` (currently around lines 337-361) with: + +```python + def _select_best( + self, + candidates: List[dict], + reward_fn: Optional[Callable] = None, + entry: Optional[dict] = None, + ): + """ + Select the best candidate from a list of response dicts. + + Returns (best_index, score_or_None) where score is (reward, message) + when reward-based selection is used, or None otherwise. + + Selection strategy is governed by self.selection: + - "reward" (default): reward-based when reward_fn+entry provided, + falls back to consistency when not. + - "consistency": original TroVE majority-vote algorithm. 
+ """ + if self.selection == "consistency": + return self._select_best_by_consistency(candidates), None + if reward_fn is not None and entry is not None: + return self._select_best_by_reward(candidates, reward_fn, entry) + return self._select_best_by_consistency(candidates), None +``` + +- [ ] **Step 6.6: Update `_update_library` to credit frequency from tool_calls** + +Replace `_update_library` (currently around lines 419-432) with: + +```python + def _update_library(self, mode: str, resp: dict, example_idx: int) -> None: + """Update toolbox based on winning mode (faithful to run_trove.py).""" + if mode == "import": + tool_calls = resp.get("tool_calls") or [] + if tool_calls: + # Native tool-calling path: credit by unique tool_call.function.name + # (defensive: sanitize and let toolbox.update_frequency filter unknowns). + unique_names = { + tc["name"].split("<|", 1)[0].strip() + for tc in tool_calls + if tc.get("name") + } + for name in unique_names: + if name: + self.toolbox.update_frequency(name, example_idx) + else: + # Legacy text-based IMPORT: credit functions parsed from **Tools**. + for func_dict in resp.get("functions", []): + name = func_dict.get("name", "") + if name: + self.toolbox.update_frequency(name, example_idx) + elif mode == "create" and resp.get("is_success"): + for func_dict in resp.get("functions", []): + self.toolbox.add(func_dict, example_idx) + + # SKIP: no library changes +``` + +- [ ] **Step 6.7: Add telemetry fields to `_make_result`** + +Replace `_make_result` (currently around lines 438-480) with: + +```python + def _make_result( + self, + task_input: dict, + task_type: str, + best_mode: str, + best_resp: dict, + is_success: bool, + output: str, + best_reward_score=None, + ) -> dict: + """ + Build a result dict compatible with main.py's _print_result() and + _append_task_output(). Adds passive TroVE telemetry fields. 
+ """ + tool_calls = best_resp.get("tool_calls") or [] + tools_called = sorted({ + tc["name"].split("<|", 1)[0].strip() + for tc in tool_calls + if tc.get("name") + }) + candidate_names = {e["name"] for e in self.toolbox.snapshot()} + actually_called = sorted( + imported_callsites( + solution_code=best_resp.get("solution_code", ""), + tools_code=best_resp.get("tools_code", ""), + candidate_names=candidate_names, + ) + ) + import_eligible = len(self.toolbox) > 0 # state AFTER this task's update + # Note: import_eligible reflects the current toolbox state after + # _update_library has already run for this task. The analyzer should + # interpret this as "a non-empty toolbox existed at some point during + # this task's processing". For pre-task eligibility, infer from + # toolbox snapshots in adjacent tasks. + + return { + "task_type": task_type, + "original_prompt": str(task_input), + "solved": is_success, + "steps": 1, + "trace": [ + { + "step": 0, + "agent": "trove", + "action": best_mode, + "is_success": is_success, + } + ], + "solution": best_resp.get("solution_code", ""), + "library_snapshot": self.toolbox.snapshot(), + "cost_summary": {}, + "final_output": { + "answer": output, + "explanation": f"TroVE mode={best_mode}", + "confidence": "high" if is_success else "low", + "execution_result": output, + }, + "agent_messages": self.llm.get_task_log(), + "reward_history": [], + "best_reward": None, + "final_reward": None, + "_best_reward_score": best_reward_score, + # TroVE native-tool-calling telemetry + "won_mode": best_mode, + "import_eligible": import_eligible, + "import_was_winner": best_mode == "import", + "tool_calls": tool_calls, + "tool_call_count": len(tool_calls), + "tools_called": tools_called, + "actually_called": actually_called, + "trove_stopped_reason": best_resp.get("stopped_reason", ""), + } +``` + +- [ ] **Step 6.8: Sanity-check the controller imports and constructs** + +Run: `python -c "from symbolic_agent.baselines.trove.controller import 
TroVEController; c = TroVEController(api_key='unused', model='x', task_family='pbebench', selection='reward'); print(c.task_family, c.selection, c.backend, c.max_tool_iters, c.tool_schema_topk)"` + +Expected: `pbebench reward anthropic 8 10`. + +- [ ] **Step 6.9: Run all tests to confirm no regressions** + +Run: `python -m pytest symbolic_agent/baselines/trove/tests/ -v` + +Expected: 16 passed (6 from parse_callsites + 10 from tools_api). If the count differs, check it against the tests added in the earlier tasks. + +- [ ] **Step 6.10: Commit** + +```bash +git add symbolic_agent/baselines/trove/controller.py +git commit -m "$(cat <<'EOF' +feat(trove): controller branch for native IMPORT tool calling + +- Add task_family, selection, max_tool_iters, and tool_schema_topk params + to TroVEController.__init__. +- IMPORT branch dispatches to _generate_import_with_tools when toolbox + is non-empty and backend is openai; otherwise falls back to legacy + text-based IMPORT. +- _generate_import_with_tools builds K multi-turn trajectories via + TroVELLMClient.chat_with_tools, parses **Solution** strictly for + pbebench, and runs the result through the executor. +- _update_library credits frequency by unique tool_call.function.name + for the native path; legacy path still credits parsed functions. +- _make_result emits won_mode, import_eligible, import_was_winner, + tool_calls, tool_call_count, tools_called, actually_called, + trove_stopped_reason as passive telemetry. +- _select_best honors selection="consistency" or "reward" (default).
+EOF +)" +``` + +--- + +## Task 7: `main.py` CLI flags (`--trove-selection`, `--trove-task-family`) + +**Files:** +- Modify: `main.py:794-810` (add new flags) and `main.py:1002-1011` (pass through to controller) + +- [ ] **Step 7.1: Add the two new argparse flags** + +In `main.py`, after the existing `--trove-trim-every` argument (around line 810), insert: + +```python + parser.add_argument( + "--trove-selection", + choices=["reward", "consistency"], + default="reward", + help="[TroVE] Candidate selection strategy. 'reward' (default) uses " + "the per-task reward function with AST tie-breaking. " + "'consistency' uses the original TroVE majority-vote algorithm. " + "(default: reward)", + ) + parser.add_argument( + "--trove-task-family", + choices=["default", "pbebench"], + default="default", + help="[TroVE] Task family for prompt selection and parser strictness. " + "'pbebench' uses PBEBench-shaped few-shots and strict **Solution** " + "parsing (no fallback to any python block). (default: default)", + ) +``` + +- [ ] **Step 7.2: Plumb the flags into the `TroVEController` constructor** + +Find the `elif args.framework == "trove":` block (around line 1002) and replace the `controller = TroVEController(...)` call with: + +```python + elif args.framework == "trove": + controller = TroVEController( + api_key=api_key, + model=model, + base_url=base_url, + debug_dir=args.debug_dir, + k=args.trove_k, + trim_every=args.trove_trim_every, + task_family=args.trove_task_family, + selection=args.trove_selection, + ) + logger.info( + "Framework: TroVE (k=%d, trim_every=%d, task_family=%s, selection=%s)", + args.trove_k, args.trove_trim_every, args.trove_task_family, args.trove_selection, + ) +``` + +- [ ] **Step 7.3: Sanity-check the CLI parses both flags** + +Run: `python main.py --help 2>&1 | grep -E "trove-selection|trove-task-family"` + +Expected: two lines, one for each new flag, both showing the choices and defaults. 
+ +- [ ] **Step 7.4: Sanity-check controller wires through** + +Construct an empty tasks file so the run finishes immediately after parsing args: + +```bash +echo '[]' > /tmp/_pbebench_empty.json +VLLM_API_KEY=EMPTY python main.py \ + --framework trove \ + --trove-task-family pbebench \ + --trove-selection reward \ + --tasks-file /tmp/_pbebench_empty.json \ + --model openai/gpt-oss-20b \ + --backend vllm \ + --base-url http://localhost:8000/v1 \ + 2>&1 | grep -E "Framework: TroVE|ERROR" | head -5 +``` + +Expected: `Framework: TroVE (k=5, trim_every=500, task_family=pbebench, selection=reward)` then an `ERROR: no records found` from the loader. Both confirm the flags parsed and the controller was constructed. + +- [ ] **Step 7.5: Commit** + +```bash +git add main.py +git commit -m "$(cat <<'EOF' +feat(trove): CLI flags --trove-selection and --trove-task-family + +- --trove-selection {reward,consistency} (default: reward). +- --trove-task-family {default,pbebench} (default: default). Plumbed + through to TroVEController; PBEBench runs should pass --trove-task-family + pbebench to enable PBEBench-shaped few-shots and strict **Solution** + parsing. 
+EOF +)" +``` + +--- + +## Task 8: Update vLLM launcher script with tool-calling flags + +**Files:** +- Modify: `scripts/launch_vllm_gpt_oss_120b.sh` + +- [ ] **Step 8.1: Add the three vLLM flags** + +Replace the body of `scripts/launch_vllm_gpt_oss_120b.sh` with: + +```bash +#!/bin/bash + +mkdir -p /tmp/$USER-tiktoken-cache /tmp/$USER-tmp +chmod 700 /tmp/$USER-tiktoken-cache /tmp/$USER-tmp +export TIKTOKEN_CACHE_DIR=/tmp/$USER-tiktoken-cache +export TMPDIR=/tmp/$USER-tmp + +ts=$(date +%Y%m%d_%H%M%S) + +# Required vLLM tool-calling flags (vLLM >= v0.16.0 for PR #28729): +# --enable-auto-tool-choice enables tool_choice="auto" +# --tool-call-parser openai parses gpt-oss Harmony commentary channel +# --reasoning-parser openai_gptoss routes analysis-channel content into +# message.reasoning_content +nohup python -m vllm.entrypoints.openai.api_server \ + --model "openai/gpt-oss-120b" \ + --tokenizer "openai/gpt-oss-120b" \ + --dtype auto \ + --port ${1} \ + --gpu-memory-utilization 0.95 \ + --tensor-parallel-size 2 \ + --enable-auto-tool-choice \ + --tool-call-parser openai \ + --reasoning-parser openai_gptoss \ + > vllm_logs/vllm_${1}_${ts}.log 2>&1 & echo $! > vllm_logs/vllm_${1}_${ts}.pid +``` + +- [ ] **Step 8.2: Lint the script** + +Run: `bash -n scripts/launch_vllm_gpt_oss_120b.sh && echo OK` + +Expected: `OK`. + +- [ ] **Step 8.3: Commit** + +```bash +git add scripts/launch_vllm_gpt_oss_120b.sh +git commit -m "$(cat <<'EOF' +chore(launcher): enable native tool calling for gpt-oss-120b vLLM server + +Add three flags required for OpenAI-compatible tool calling on gpt-oss +served by vLLM >= v0.16.0: + --enable-auto-tool-choice + --tool-call-parser openai + --reasoning-parser openai_gptoss + +Without these the controller's chat_with_tools loop sees no tool_calls +in the response and degrades to no-tool behavior. 
+EOF +)" +``` + +--- + +## Task 9: `scripts/analyze_trove_run.py` + +**Files:** +- Create: `scripts/analyze_trove_run.py` + +- [ ] **Step 9.1: Create the analysis script** + +Create `scripts/analyze_trove_run.py`: + +```python +#!/usr/bin/env python3 +"""Post-hoc analysis of a TroVE run JSONL output. + +Reads the per-task JSONL file produced by main.py --output-file and reports: + - Overall accuracy + - Final toolbox size + - Per-mode wins + - IMPORT-mode tool-use breakdown + - Top-10 most-called toolbox functions + +Usage: + python scripts/analyze_trove_run.py path/to/results.jsonl +""" + +from __future__ import annotations + +import argparse +import json +import sys +from collections import Counter +from pathlib import Path + + +def _load_rows(path: Path) -> list[dict]: + rows = [] + with path.open() as f: + for lineno, line in enumerate(f, 1): + line = line.strip() + if not line: + continue + try: + rows.append(json.loads(line)) + except json.JSONDecodeError as exc: + print(f"warning: line {lineno} is not valid JSON: {exc}", file=sys.stderr) + return rows + + +def _result_dict(row: dict) -> dict: + """Tolerant accessor: results are nested under 'result' in main.py's output.""" + return row.get("result") or row + + +def main() -> None: + parser = argparse.ArgumentParser(description=__doc__) + parser.add_argument("path", type=Path, help="Path to the TroVE results JSONL file") + args = parser.parse_args() + + rows = _load_rows(args.path) + if not rows: + print("ERROR: no rows loaded", file=sys.stderr) + sys.exit(1) + + n = len(rows) + results = [_result_dict(r) for r in rows] + + # Overall accuracy + solved = sum(1 for r in results if r.get("solved")) + print(f"=== Run summary: {args.path.name} ===") + print(f"Tasks: {n}") + print(f"Solved: {solved}/{n} ({100 * solved / n:.1f}%)") + + # Final toolbox size — take the snapshot from the last row. 
+ last_snapshot = results[-1].get("library_snapshot") or [] + print(f"Final toolbox size: {len(last_snapshot)}") + + # Per-mode wins + mode_counter = Counter(r.get("won_mode", "?") for r in results) + print(f"Mode wins: {dict(mode_counter)}") + + # IMPORT-mode tool-use breakdown + import_eligible = [r for r in results if r.get("import_eligible")] + if not import_eligible: + print("No IMPORT-eligible tasks observed.") + else: + with_calls = [r for r in import_eligible if (r.get("tool_call_count") or 0) >= 1] + n_eligible = len(import_eligible) + n_with = len(with_calls) + mean_calls = ( + sum((r.get("tool_call_count") or 0) for r in import_eligible) / n_eligible + ) + all_calls = [tc for r in import_eligible for tc in (r.get("tool_calls") or [])] + n_calls_total = len(all_calls) + n_calls_ok = sum(1 for tc in all_calls if tc.get("ok")) + success_rate = (100 * n_calls_ok / n_calls_total) if n_calls_total else 0.0 + print( + f"IMPORT-eligible tasks: {n_eligible}\n" + f" Tasks with >=1 tool call: {n_with}/{n_eligible} ({100 * n_with / n_eligible:.1f}%)\n" + f" Mean tool calls / task: {mean_calls:.2f}\n" + f" Tool-call success rate: {n_calls_ok}/{n_calls_total} ({success_rate:.1f}%)" + ) + + # Top-10 most-called functions + name_counter: Counter = Counter() + for r in results: + for tc in r.get("tool_calls") or []: + name = (tc.get("name") or "").split("<|", 1)[0].strip() + if name: + name_counter[name] += 1 + if name_counter: + print("Top-10 most-called toolbox functions:") + for name, cnt in name_counter.most_common(10): + print(f" {cnt:4d} {name}") + else: + print("No tool calls recorded in this run.") + + +if __name__ == "__main__": + main() +``` + +- [ ] **Step 9.2: Make the script executable and lint-check** + +Run: `chmod +x scripts/analyze_trove_run.py && python -c "import ast; ast.parse(open('scripts/analyze_trove_run.py').read())" && echo OK` + +Expected: `OK`. 
+ +- [ ] **Step 9.3: Smoke-test on synthetic data** + +Run: + +```bash +python -c " +import json, tempfile, subprocess +rows = [ + {'result': {'solved': True, 'won_mode': 'import', 'import_eligible': True, 'tool_call_count': 2, 'tool_calls': [{'name':'find_replace_chain','ok':True},{'name':'find_replace_chain','ok':True}], 'library_snapshot':[{'name':'find_replace_chain'}]}}, + {'result': {'solved': False, 'won_mode': 'create', 'import_eligible': False, 'tool_call_count': 0, 'tool_calls': [], 'library_snapshot':[{'name':'find_replace_chain'}]}}, +] +with tempfile.NamedTemporaryFile('w', suffix='.jsonl', delete=False) as f: + for r in rows: f.write(json.dumps(r) + '\n') + p = f.name +print(subprocess.check_output(['python','scripts/analyze_trove_run.py', p]).decode()) +" +``` + +Expected output contains `Solved: 1/2 (50.0%)`, `Final toolbox size: 1`, `Mode wins: {'import': 1, 'create': 1}`, `IMPORT-eligible tasks: 1`, `Tool-call success rate: 2/2 (100.0%)`, and a row `2 find_replace_chain` in the top-10. + +- [ ] **Step 9.4: Commit** + +```bash +git add scripts/analyze_trove_run.py +git commit -m "$(cat <<'EOF' +feat(trove): add analyze_trove_run.py for post-hoc telemetry reports + +Reads a TroVE JSONL output and reports overall accuracy, final toolbox +size, per-mode wins, IMPORT-mode tool-use breakdown (>=1 call rate, +mean calls/task, success rate), and the top-10 most-called toolbox +functions. Sanitizes Harmony control-token contamination in tool names +when aggregating. 
+EOF +)" +``` + +--- + +## Task 10: Rewrite `docs/deviations.md` + +**Files:** +- Create: `symbolic_agent/baselines/trove/docs/deviations.md` + +- [ ] **Step 10.1: Create the directory and the deviations doc** + +Create `symbolic_agent/baselines/trove/docs/deviations.md`: + +```markdown +# TroVE Implementation: Deviations and Faithful Elements + +This document tracks how this port differs from — and where it stays +faithful to — the original TroVE algorithm +([Wang et al., 2024](https://arxiv.org/abs/2401.12869), +[zorazrw/trove](https://github.com/zorazrw/trove)). + +## 1. Algorithmic deviations + +### 1.1 Native OpenAI tool calling for IMPORT mode +The original TroVE shows the model a `**Toolbox**` markdown block +listing top-k function signatures and asks it to write a `**Solution**` +plus `**Tools**` block referencing those functions by name. We replace +this for the IMPORT mode (when `backend == "openai"` and the toolbox is +non-empty) with **native OpenAI tool calling**: the toolbox is exposed +via the `tools=[...]` parameter of `chat.completions.create`, the model +emits structured `tool_calls` during its reasoning, and `dispatch_tool_call` +runs each one in the sandboxed executor and returns the stdout. This +makes function usage observable and credit-able from the trajectory +itself. + +### 1.2 Reward-based candidate selection (default) +The paper uses self-consistency (majority vote on stdout, AST tie-break) +to pick the best of K samples per mode. We default to **reward-based +selection**: every candidate is scored by the per-task reward function, +ties broken by minimum AST node count. This is more reliable on +PBEBench (program-list outputs rarely tie as strings). The original +self-consistency selector remains available via `--trove-selection consistency`. 
+ +### 1.3 PBEBench-shaped few-shot examples +For `task_family="pbebench"` we replace the generic CREATE / SKIP / IMPORT +example pairs with PBEBench-shaped pairs that demonstrate `replace()` +chains and a small reusable helper (`find_replace_chain`). The legacy +default examples remain for `task_family="default"`. + +### 1.4 Strict **Solution** parsing for PBEBench +The legacy parser falls back to "first ```python``` block anywhere" when +no `**Solution**` block is present. For `task_family="pbebench"` this +fallback is disabled, preventing CoT scratchpad from being accidentally +promoted to the answer. + +## 2. Faithful elements + +- 3-mode generation (IMPORT, CREATE, SKIP). +- K samples per mode (default K=5, paper). +- AST-tie-breaking by node count (simplest solution wins). +- Periodic toolbox trimming with threshold `C·log_{20}(n)`, default + `C=1.0`, matching the original implementation. +- Frequency-based top-k retrieval for the toolbox view. +- Dict-keyed toolbox structure mirroring `utils/code.py`. +- Library updates: IMPORT credits frequency, CREATE adds new functions + on success, SKIP makes no library changes. + +## 3. Infrastructural patches + +- **JSONL-per-task checkpointing** via `--output-file`, with crash + resumption. +- **`reasoning_content` fallback** in `_call_openai` for `gpt-oss` Harmony + channel splits where the answer text lives in `message.reasoning_content`. +- **Executor timeout 60s** (vs. 10s in earlier versions of this port), + closer to the original's ~100s. +- **`<|`-truncation sanitizer** in `dispatch_tool_call` and + `_update_library`. Defensive workaround for the open vLLM + [PR #35906](https://github.com/vllm-project/vllm/pull/35906) covering + Harmony control-token leakage into tool names. When that PR lands + upstream the sanitizer becomes a no-op and is left in place. + +## 4. 
Backend coverage caveat + +Anthropic backend code paths exist and are exercised by CREATE / SKIP and +the legacy text-based IMPORT fallback, but **the smoke run and reported +numbers are vLLM-served `gpt-oss` only**. IMPORT-with-tools requires +the OpenAI/vLLM backend and is the only path we test end-to-end. + +## 5. vLLM version requirement + +- Minimum vLLM: **v0.16.0** (branch-cut 2026-02-08). +- Required upstream change: [PR #28729](https://github.com/vllm-project/vllm/pull/28729) + ("Multiple fixes for gpt-oss Chat Completion prompting"), merged + 2025-12-12. v0.16.0 is the first stable release branch-cut after the merge. +- Known open caveat: [PR #35906](https://github.com/vllm-project/vllm/pull/35906) + ("Sanitize leaked Harmony control tokens"), still open as of late + March 2026 — see §3 for the sanitizer mitigation. +``` + +- [ ] **Step 10.2: Verify the file renders** + +Run: `head -20 symbolic_agent/baselines/trove/docs/deviations.md` + +Expected: the document renders with the title on the first line. + +- [ ] **Step 10.3: Commit** + +```bash +git add symbolic_agent/baselines/trove/docs/deviations.md +git commit -m "$(cat <<'EOF' +docs(trove): rewrite deviations.md for native tool calling + +Document algorithmic deviations (native OpenAI tool calling for IMPORT, +reward-based selection by default, PBEBench-shaped few-shots, strict +**Solution** parsing for pbebench), faithful elements (3-mode generation, +K-sampling, AST tie-break, C*log_20(n) trimming with C=1.0), and +infrastructural patches (JSONL checkpointing, reasoning_content +fallback, 60s executor timeout, defensive <|-truncation sanitizer). + +Includes vLLM version requirement (>= v0.16.0 for PR #28729) and the +backend coverage caveat (smoke run is vLLM-served gpt-oss only). +EOF +)" +``` + +--- + +## Task 11: Pre-flight sanity check + 50-task smoke run + report + +**Files:** none modified. This is the validation task. 
+ +- [ ] **Step 11.1: Re-launch vLLM with the new flags** + +The existing launcher is named `launch_vllm_gpt_oss_120b.sh` but the spec calls for `gpt-oss-20b`. Two options — pick one: + +(a) **Smoke on 120b directly** (no script change beyond Task 8). Run: + +```bash +bash scripts/launch_vllm_gpt_oss_120b.sh 8000 +``` + +Then in Steps 11.2 and 11.4, replace `--model openai/gpt-oss-20b` with `--model openai/gpt-oss-120b`. + +(b) **Smoke on 20b** (one-line edit). In `scripts/launch_vllm_gpt_oss_120b.sh`, change `openai/gpt-oss-120b` → `openai/gpt-oss-20b` for both `--model` and `--tokenizer`, and lower `--tensor-parallel-size 2` → `--tensor-parallel-size 1` (20b fits on one GPU). Then: + +```bash +bash scripts/launch_vllm_gpt_oss_120b.sh 8000 +``` + +(Do not commit the edit — restore the file before the final commit, or rename the script if you want the 20b variant kept.) + +Then wait 60–120 seconds and confirm the server is up: + +Run: `curl -sS http://localhost:8000/v1/models | head -5` + +Expected: a JSON response listing the model you launched. + +- [ ] **Step 11.2: Pre-flight: one-task smoke** + +Run a single task to verify the tool-calling round-trip works end-to-end. The codebase has no `--num-tasks` flag, so we slice the first row out of the 50-task PBEBench-Lite file: + +```bash +mkdir -p outputs/trove_pbebench_preflight +head -n 1 data/pbebench/lite_pilot_tasks.jsonl > /tmp/_pbebench_one.jsonl +VLLM_API_KEY=EMPTY python main.py \ + --framework trove \ + --tasks-file /tmp/_pbebench_one.jsonl \ + --output-file outputs/trove_pbebench_preflight/results.jsonl \ + --model openai/gpt-oss-20b \ + --backend vllm \ + --base-url http://localhost:8000/v1 \ + --trove-task-family pbebench \ + --trove-selection reward \ + --trove-k 3 \ + --trove-trim-every 9999 \ + --max-tokens 4096 \ + --debug-dir outputs/trove_pbebench_preflight/debug +``` + +Expected: the run completes without crashing. The output file should contain one row.
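The "one row" expectation can be checked programmatically rather than by eye. The sketch below is one way to do it, assuming the output is JSONL with each record optionally nested under a `result` key (the same tolerance the analysis script applies); `load_preflight_rows` and `describe_row` are hypothetical helper names, not part of the codebase.

```python
import json
from pathlib import Path


def load_preflight_rows(path):
    """Load JSONL rows, unwrapping an optional 'result' envelope."""
    rows = []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        row = json.loads(line)
        rows.append(row.get("result") or row)
    return rows


def describe_row(row):
    """Pull out the telemetry fields most useful for a pre-flight check."""
    return {
        "solved": bool(row.get("solved")),
        "won_mode": row.get("won_mode", "?"),
        "tool_call_count": row.get("tool_call_count") or 0,
        "stopped_reason": row.get("trove_stopped_reason", ""),
    }
```

Calling `load_preflight_rows("outputs/trove_pbebench_preflight/results.jsonl")` should return a list of length 1. Because the first task starts from an empty toolbox, the IMPORT-with-tools branch does not run, so `tool_call_count` is expected to be 0 for this row.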
+ +- [ ] **Step 11.3: Verify the tool-calling pre-flight check** + +This task starts with an empty toolbox so the IMPORT-with-tools branch will not run. Inspect the most recent debug-dir log file with `trove_create` or `trove_skip` in the name and confirm it contains a non-empty response: + +Run: `ls -t outputs/trove_pbebench_preflight/debug/trove_run_*/0001_*.json | head -1 | xargs python -c "import json,sys; d=json.load(open(sys.argv[1])); print('content length:', len(d['response']['content']))"` + +Expected: non-zero content length. If zero, the `reasoning_content` fallback (Task 1.3) is not engaging — debug before proceeding. + +- [ ] **Step 11.4: Run the 50-task smoke** + +`data/pbebench/lite_pilot_tasks.jsonl` is exactly 50 PBEBench-Lite tasks with per-task `reward: pbebench`, so no slicing or `--default-reward` flag is required. + +```bash +mkdir -p outputs/trove_pbebench_smoke +VLLM_API_KEY=EMPTY python main.py \ + --framework trove \ + --tasks-file data/pbebench/lite_pilot_tasks.jsonl \ + --output-file outputs/trove_pbebench_smoke/results.jsonl \ + --model openai/gpt-oss-20b \ + --backend vllm \ + --base-url http://localhost:8000/v1 \ + --trove-task-family pbebench \ + --trove-selection reward \ + --trove-k 3 \ + --trove-trim-every 9999 \ + --max-tokens 4096 \ + --debug-dir outputs/trove_pbebench_smoke/debug +``` + +Expected: ~30–60 minutes wall-clock on local vLLM. Run completes without crashes. Auto-resume from checkpoint is supported by `--output-file` if the run is interrupted. + +- [ ] **Step 11.5: Run the analysis script and capture the report** + +Run: `python scripts/analyze_trove_run.py outputs/trove_pbebench_smoke/results.jsonl | tee outputs/trove_pbebench_smoke/report.txt` + +Expected: the report shows accuracy, toolbox size, mode wins, IMPORT-mode tool-use breakdown, and top-10 functions. 
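When reading the top-10 table, keep in mind that the analyzer truncates each tool name at the first `<|` before counting, so Harmony control-token leakage does not split one function into several rows. The sketch below isolates that truncation step; the helper names and the contaminated sample strings are made up for illustration, and the exact shape of real leakage may differ.

```python
def sanitize_tool_name(raw_name):
    """Drop everything from the first '<|' onward, then strip whitespace."""
    return (raw_name or "").split("<|", 1)[0].strip()


def unique_tool_names(tool_calls):
    """Return the sorted set of sanitized, non-empty tool names."""
    names = {sanitize_tool_name(tc.get("name")) for tc in tool_calls}
    names.discard("")  # a name that was only control tokens sanitizes to ""
    return sorted(names)


# Hypothetical recorded calls, two of them contaminated.
calls = [
    {"name": "find_replace_chain<|channel|>commentary", "ok": True},
    {"name": "find_replace_chain", "ok": True},
    {"name": "<|channel|>", "ok": False},
    {"name": None, "ok": False},
]
print(unique_tool_names(calls))
```

Names that consist only of control tokens collapse to the empty string and are dropped, which is why they never appear in the frequency table.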
+ +- [ ] **Step 11.6: Report numbers to the user (no prompt iteration)** + +Per the spec's "done criteria", report the contents of `outputs/trove_pbebench_smoke/report.txt` plus a short narrative paragraph noting any anomalies (e.g. `<|channel|>` contamination from PR #35906, `max_iters` stops, JSON-arg parse failures). + +**No prompt iteration. No threshold tuning. The numbers are what they are.** + +--- + +## Self-Review + +### 1. Spec coverage + +| Spec section | Implementing task | +|---|---| +| §3 Architecture overview | Tasks 1–8 collectively | +| §4 Data flow for IMPORT-with-tools | Tasks 4–6 | +| §5.1 New `tools_api.py` | Task 4 | +| §5.2 `_call_openai` reasoning fallback | Task 1 | +| §5.2 `chat_with_tools` method | Task 5 | +| §5.3 Controller `__init__` params, IMPORT branch, `_update_library`, `_make_result` | Task 6 | +| §5.4 `imported_callsites`, `task_family` in parse_response | Task 2 | +| §5.5 PBEBench prompts and IMPORT-with-tools prompt | Task 3 | +| §5.6 Trim `C=1.0` | Task 1 | +| §5.7 Executor timeout 60s | Task 1 | +| §5.8 main.py CLI flags | Task 7 | +| §5.9 vLLM launcher flags | Task 8 | +| §5.10 `analyze_trove_run.py` | Task 9 | +| §5.11 deviations.md rewrite | Task 10 | +| §6 Telemetry fields | Task 6.7 | +| §7 Implementation defaults | Tasks 4–6 | +| §8 Smoke run + done criteria | Task 11 | + +All sections accounted for. + +### 2. Placeholder scan + +No `TBD`, `TODO`, `implement later`, "appropriate", "various", or "fill in details" in any task. All test code is fully written (not "write tests for the above"). All file paths are exact. All commit messages are pre-written. + +### 3. Type and signature consistency + +- `imported_callsites(solution_code, tools_code, candidate_names)` — defined in Task 2, called in Task 6.7 with matching kwargs. +- `toolbox_to_openai_tools(toolbox, topk=10)` — defined in Task 4, called in Task 6.4. 
+- `dispatch_tool_call(toolbox, tool_call) -> str` — defined in Task 4, called via the `on_tc` closure in Task 6.4. +- `chat_with_tools(messages, tools, model, max_tokens, max_tool_iters, on_tool_call, tag)` — defined in Task 5, called in Task 6.4 with matching kwargs. +- `build_import_with_tools_prompt(question, task_family)` — defined in Task 3, called in Task 6.4. +- `build_import_prompt(question, toolbox_str, task_family)` — extended in Task 3, called in Task 6.3. +- `parse_response(text, task_family)` — extended in Task 2, called in Tasks 6.3 and 6.4. +- `TroVEController(__init__)` new params (`task_family`, `selection`, `max_tool_iters`, `tool_schema_topk`) — defined in Task 6.1, passed in Task 7.2 (only `task_family` and `selection` from CLI; the other two use defaults, which matches the spec's defaults table). + +All consistent. + +### 4. Plan quirks worth noting to the executor + +- Task 11.4 relies on `data/pbebench/lite_pilot_tasks.jsonl` containing the 50 PBEBench-Lite tasks. If it has fewer, swap to whichever PBEBench-Lite tasks file is available (the spec calls for "50 PBEBench-Lite tasks"; the exact filename is not load-bearing). +- The `tee` in Task 11.5 captures the report for the user-facing message in 11.6. +- The `import_eligible` field in `_make_result` is computed *after* `_update_library` runs for the current task. The doc-comment in Task 6.7 explains the consequence; the analyzer in Task 9 doesn't depend on the pre-task value. +- Task 6.5's `_select_best` change wraps the existing reward/consistency selectors. When `selection="consistency"` is set, the `reward_fn` and `entry` arguments are ignored — that is intentional and matches the user's choice to keep both flags as opt-ins. + +--- + +## Execution Handoff + +Plan complete and saved to `docs/superpowers/plans/2026-04-25-trove-native-tool-calling.md`. Two execution options: + +**1. Subagent-Driven (recommended)** — I dispatch a fresh subagent per task, review between tasks, fast iteration. 
+ +**2. Inline Execution** — Execute tasks in this session using executing-plans, batch execution with checkpoints. + +Which approach? From a80fc2820a03364844c535a4aa06106046a915e2 Mon Sep 17 00:00:00 2001 From: mathuryash5 Date: Sat, 25 Apr 2026 17:54:56 -0400 Subject: [PATCH 03/24] fix(trove): infra patches for native tool calling - toolbox.trim default C=1.0 (matches original TroVE) - executor DEFAULT_TIMEOUT=60s (PBEBench + multi-turn headroom) - llm._call_openai falls back to message.reasoning_content when message.content is empty (gpt-oss Harmony channel split) Made-with: Cursor --- symbolic_agent/baselines/trove/executor.py | 2 +- symbolic_agent/baselines/trove/llm.py | 3 ++- symbolic_agent/baselines/trove/toolbox.py | 4 ++-- 3 files changed, 5 insertions(+), 4 deletions(-) diff --git a/symbolic_agent/baselines/trove/executor.py b/symbolic_agent/baselines/trove/executor.py index cf23471b..1b8717e4 100644 --- a/symbolic_agent/baselines/trove/executor.py +++ b/symbolic_agent/baselines/trove/executor.py @@ -16,7 +16,7 @@ logger = logging.getLogger(__name__) -DEFAULT_TIMEOUT = 10 # seconds, matching TroVE's original +DEFAULT_TIMEOUT = 60 # seconds — generous for PBEBench replace() chains and multi-turn dispatch def run_solution( diff --git a/symbolic_agent/baselines/trove/llm.py b/symbolic_agent/baselines/trove/llm.py index d27f8d28..ec98472f 100644 --- a/symbolic_agent/baselines/trove/llm.py +++ b/symbolic_agent/baselines/trove/llm.py @@ -189,7 +189,8 @@ def _call_openai(self, prompt: str, model: str, max_tokens: int, tag: str) -> st messages=messages, # No response_format — TroVE uses free-form text ) - raw = response.choices[0].message.content or "" + msg = response.choices[0].message + raw = msg.content or getattr(msg, "reasoning_content", "") or "" u = getattr(response, "usage", None) details = getattr(u, "completion_tokens_details", None) usage = { diff --git a/symbolic_agent/baselines/trove/toolbox.py b/symbolic_agent/baselines/trove/toolbox.py index 
9cae9532..617b66ae 100644 --- a/symbolic_agent/baselines/trove/toolbox.py +++ b/symbolic_agent/baselines/trove/toolbox.py @@ -114,7 +114,7 @@ def get_full_code(self) -> str: # Trimming # ------------------------------------------------------------------ - def trim(self, n_processed: int, C: float = 0.5) -> set: + def trim(self, n_processed: int, C: float = 1.0) -> set: """ Remove functions whose frequency is below the threshold C * log_{20}(n_processed) @@ -122,7 +122,7 @@ def trim(self, n_processed: int, C: float = 0.5) -> set: Faithful to trim_library() in run_trove.py: threshold = math.log(n, 20) # log base 20 - C defaults to 0.5, matching the paper (§3.3): λ = ½ · log_{10}(n). + C defaults to 1.0, matching the original implementation (C·log_{20}(n)). Note: the original uses log base-20 not base-10; we keep base-20. """ if n_processed <= 1: From 91cd92f77b392b03378e05eb53789590a7d95bc6 Mon Sep 17 00:00:00 2001 From: mathuryash5 Date: Sat, 25 Apr 2026 17:59:36 -0400 Subject: [PATCH 04/24] feat(trove): add imported_callsites helper and task_family to parse_response - imported_callsites(solution, tools, names) -> set: AST-walks Solution code and returns names from the candidate set that are actually called. Handles bare Name and Attribute (toolbox.foo) callees. - parse_response(text, task_family="default"): when task_family="pbebench" the parser does not fall back to the first python block when **Solution** is missing. Prevents CoT scratchpad from being promoted to the answer. 
Made-with: Cursor --- symbolic_agent/baselines/trove/parse.py | 56 ++++++++++++---- .../baselines/trove/tests/__init__.py | 0 .../trove/tests/test_parse_callsites.py | 65 +++++++++++++++++++ 3 files changed, 110 insertions(+), 11 deletions(-) create mode 100644 symbolic_agent/baselines/trove/tests/__init__.py create mode 100644 symbolic_agent/baselines/trove/tests/test_parse_callsites.py diff --git a/symbolic_agent/baselines/trove/parse.py b/symbolic_agent/baselines/trove/parse.py index 56a90cba..4a53a733 100644 --- a/symbolic_agent/baselines/trove/parse.py +++ b/symbolic_agent/baselines/trove/parse.py @@ -83,7 +83,7 @@ def _make_executable(code: str) -> str: return stripped -def parse_response(text: str) -> dict: +def parse_response(text: str, task_family: str = "default") -> dict: """ Parse a TroVE-format LLM response. @@ -95,20 +95,17 @@ def parse_response(text: str) -> dict: "functions": list[dict], # parsed tool dicts from the Tools block } - Fallback behaviour - ------------------ - Tasks like PBEBench embed their own format instructions (e.g. "output a - **Program Sequence** block") that can override the TroVE **Solution** - header. When no **Solution** block is found we grab the first ```python``` - block in the response and, if it is a bare list/string literal, wrap it - in print() so it can be executed and its stdout captured as the answer. + task_family + ----------- + "default": if no **Solution** block is found, falls back to the first + ```python``` block anywhere (legacy behaviour). + "pbebench": no fallback. Strict **Solution**-block-only parsing avoids + accidentally promoting CoT scratchpad to the answer. """ solution_code = _extract_code_block(text, "Solution") or "" tools_code = _extract_code_block(text, "Tools") or "" - # Fallback: model followed the task's own format (e.g. **Program Sequence**) - # instead of the TroVE **Solution** header. 
- if not solution_code: + if not solution_code and task_family != "pbebench": raw = _extract_any_python_block(text) if raw: solution_code = _make_executable(raw) @@ -267,3 +264,40 @@ def count_ast_nodes(code: str) -> int: return sum(1 for _ in ast.walk(tree)) except SyntaxError: return 99_999 + + +def imported_callsites( + solution_code: str, + tools_code: str, + candidate_names: set, +) -> set: + """ + Return the subset of `candidate_names` that appear as call-sites in + `solution_code`. Used for the `actually_called` telemetry field. + + Detects two callee shapes: + - bare Name: find_replace_chain(...) + - Attribute(name): toolbox.find_replace_chain(...) + + `tools_code` is currently unused (kept in the signature so callers can + pass through the **Tools** block context if we later want to filter by + what was actually imported). + + Returns an empty set on empty input or SyntaxError. + """ + if not solution_code or not candidate_names: + return set() + try: + tree = ast.parse(solution_code) + except SyntaxError: + return set() + found: set = set() + for node in ast.walk(tree): + if not isinstance(node, ast.Call): + continue + func = node.func + if isinstance(func, ast.Name) and func.id in candidate_names: + found.add(func.id) + elif isinstance(func, ast.Attribute) and func.attr in candidate_names: + found.add(func.attr) + return found diff --git a/symbolic_agent/baselines/trove/tests/__init__.py b/symbolic_agent/baselines/trove/tests/__init__.py new file mode 100644 index 00000000..e69de29b diff --git a/symbolic_agent/baselines/trove/tests/test_parse_callsites.py b/symbolic_agent/baselines/trove/tests/test_parse_callsites.py new file mode 100644 index 00000000..3429061b --- /dev/null +++ b/symbolic_agent/baselines/trove/tests/test_parse_callsites.py @@ -0,0 +1,65 @@ +"""Unit tests for parse.imported_callsites and parse_response(task_family=).""" + +from symbolic_agent.baselines.trove.parse import imported_callsites, parse_response + + +# 
--------------------------------------------------------------------------- +# imported_callsites +# --------------------------------------------------------------------------- + +def test_callsites_bare_name(): + code = "result = find_replace_chain(s, [('a', 'b')])\nprint(result)" + assert imported_callsites(code, tools_code="", candidate_names={"find_replace_chain", "other"}) == {"find_replace_chain"} + + +def test_callsites_attribute_access(): + code = "result = toolbox.find_replace_chain(s, pairs)\nprint(result)" + assert imported_callsites(code, tools_code="", candidate_names={"find_replace_chain"}) == {"find_replace_chain"} + + +def test_callsites_no_match(): + code = "print(s.replace('a', 'b'))" + assert imported_callsites(code, tools_code="", candidate_names={"find_replace_chain"}) == set() + + +def test_callsites_multiple_calls_same_name_dedup(): + code = "x = f(1)\ny = f(2)\nprint(x, y)" + assert imported_callsites(code, tools_code="", candidate_names={"f", "g"}) == {"f"} + + +def test_callsites_syntax_error_returns_empty(): + code = "this is not valid python ::" + assert imported_callsites(code, tools_code="", candidate_names={"f"}) == set() + + +def test_callsites_empty_inputs(): + assert imported_callsites("", "", set()) == set() + assert imported_callsites("print(1)", "", set()) == set() + + +# --------------------------------------------------------------------------- +# parse_response(task_family=) +# --------------------------------------------------------------------------- + +def test_parse_response_pbebench_strict_no_solution_block(): + text = "Here is some reasoning.\n```python\nprint('answer')\n```\n" + out = parse_response(text, task_family="pbebench") + assert out["solution_code"] == "" + + +def test_parse_response_pbebench_with_solution_block(): + text = "**Solution**\n```python\nprint('answer')\n```\n" + out = parse_response(text, task_family="pbebench") + assert out["solution_code"] == "print('answer')" + + +def 
test_parse_response_default_falls_back_to_any_python_block(): + text = "Here is some reasoning.\n```python\nprint('answer')\n```\n" + out = parse_response(text, task_family="default") + assert "print('answer')" in out["solution_code"] + + +def test_parse_response_default_call_signature_unchanged(): + text = "**Solution**\n```python\nprint('answer')\n```\n" + out = parse_response(text) + assert out["solution_code"] == "print('answer')" From 7ffddbe3c9276fa192df4c4e9ccb7a8bb2e157bf Mon Sep 17 00:00:00 2001 From: mathuryash5 Date: Sat, 25 Apr 2026 18:03:42 -0400 Subject: [PATCH 05/24] feat(trove): PBEBench-shaped few-shots and IMPORT-with-tools prompt - Add task_family parameter to all build_* prompt builders. - Add _CREATE_EXAMPLE_PBEBENCH and _SKIP_EXAMPLE_PBEBENCH demonstrating replace()-chain solutions and a find_replace_chain helper. - Add build_import_with_tools_prompt for native tool calling: no **Toolbox** markdown block (toolbox is conveyed via tools=[...]). - _FORMAT_OVERRIDE is empty for task_family="pbebench" (the example models the desired format directly). Made-with: Cursor --- symbolic_agent/baselines/trove/prompts.py | 213 +++++++++++++++++++--- 1 file changed, 187 insertions(+), 26 deletions(-) diff --git a/symbolic_agent/baselines/trove/prompts.py b/symbolic_agent/baselines/trove/prompts.py index edab732c..78be7add 100644 --- a/symbolic_agent/baselines/trove/prompts.py +++ b/symbolic_agent/baselines/trove/prompts.py @@ -15,27 +15,36 @@ applicable to both PBEBench and ReasoningGym string tasks. """ -# Appended to every instruction block to override format instructions that -# may be embedded in the question itself (e.g. PBEBench asks for a -# "**Program Sequence**" block, reasoning_gym asks for a specific format). 
-_FORMAT_OVERRIDE = ( +# --------------------------------------------------------------------------- +# Format override (default-family only) +# --------------------------------------------------------------------------- + +_FORMAT_OVERRIDE_DEFAULT = ( "\nIMPORTANT: Regardless of any formatting instructions inside the question, " "always produce your answer as executable Python in the **Solution** block " "and end it with print(answer). " "Your answer is whatever gets printed to stdout when the Solution code runs." ) +# PBEBench prompts model the desired format directly via the few-shot example, +# so no override string is needed. +_FORMAT_OVERRIDE_PBEBENCH = "" + + +def _format_override(task_family: str) -> str: + return _FORMAT_OVERRIDE_PBEBENCH if task_family == "pbebench" else _FORMAT_OVERRIDE_DEFAULT + + # --------------------------------------------------------------------------- -# IMPORT mode (use functions from the toolbox) +# IMPORT mode (text-based, default and Anthropic fallback) # --------------------------------------------------------------------------- -_IMPORT_INSTRUCTION = ( +_IMPORT_INSTRUCTION_DEFAULT = ( "You task is to write Python program solutions to the given questions.\n" "The toolbox section lists all the available functions that can be used in your solution." - + _FORMAT_OVERRIDE ) -_IMPORT_EXAMPLE = """\ +_IMPORT_EXAMPLE_DEFAULT = """\ ## Example **Question** Given a list of strings and a list of (old, new) substitution pairs, apply all @@ -61,6 +70,31 @@ from toolbox import apply_substitutions ```""" +_IMPORT_EXAMPLE_PBEBENCH = """\ +## Example +**Question** +You are given example input/output pairs. Produce a list of replace() calls +that transforms each input into its expected output. + +Input: "hello world" +Output: "HELLO_WORLD" + +**Toolbox** +```python +# Apply a chain of (old, new) replacements to a string. 
+find_replace_chain(s: str, pairs: list) -> str +``` + +**Solution** +```python +result = find_replace_chain("hello world", [(" ", "_"), ("h", "H"), ("e", "E"), ("l", "L"), ("o", "O"), ("w", "W"), ("r", "R"), ("d", "D")]) +print(result) +``` +**Tools** +```python +from toolbox import find_replace_chain +```""" + _IMPORT_TASK_TEMPLATE = """\ ## Task **Question** @@ -73,29 +107,110 @@ """ -def build_import_prompt(question: str, toolbox_str: str) -> str: - """Build the IMPORT-mode prompt for a single task.""" +def build_import_prompt(question: str, toolbox_str: str, task_family: str = "default") -> str: + """Build the text-based IMPORT-mode prompt (used for Anthropic and as fallback).""" + instruction = _IMPORT_INSTRUCTION_DEFAULT + _format_override(task_family) + example = _IMPORT_EXAMPLE_PBEBENCH if task_family == "pbebench" else _IMPORT_EXAMPLE_DEFAULT return ( - _IMPORT_INSTRUCTION + instruction + "\n\n\n" - + _IMPORT_EXAMPLE + + example + "\n\n\n" + _IMPORT_TASK_TEMPLATE.format(question=question, toolbox=toolbox_str) ) # --------------------------------------------------------------------------- -# CREATE mode (create new reusable functions) +# IMPORT-with-tools mode (native OpenAI tool calling; no **Toolbox** block) # --------------------------------------------------------------------------- -_CREATE_INSTRUCTION = ( +_IMPORT_WITH_TOOLS_INSTRUCTION_DEFAULT = ( + "You task is to write Python program solutions to the given questions.\n" + "You have a set of helper functions available as tools. Call any of them " + "when they help you solve the question; otherwise solve directly. After " + "you have computed the answer, output it as executable Python in a " + "**Solution** block and end with print(answer)." +) + +_IMPORT_WITH_TOOLS_INSTRUCTION_PBEBENCH = ( + "You task is to produce a list of replace() calls that transforms each " + "input into its expected output for a Programming-by-Example task.\n" + "You have a set of helper functions available as tools. 
Call any of them " + "to test ideas or compute intermediate results; the final answer must be " + "produced as a Python program in the **Solution** block." +) + +_IMPORT_WITH_TOOLS_EXAMPLE_DEFAULT = """\ +## Example +**Question** +Apply substitutions [("a","o"),("t","p")] to ["cat","bat"] and return the list. + +(After optionally calling `apply_substitutions` as a tool to confirm, +the assistant produces:) + +**Solution** +```python +strings = ["cat", "bat"] +subs = [("a", "o"), ("t", "p")] +result = apply_substitutions(strings, subs) +print(result) +```""" + +_IMPORT_WITH_TOOLS_EXAMPLE_PBEBENCH = """\ +## Example +**Question** +Produce a sequence of replace() calls that transforms "hello world" into +"HELLO_WORLD". + +(After optionally calling `find_replace_chain` as a tool to verify a +candidate sequence, the assistant produces:) + +**Solution** +```python +result = find_replace_chain("hello world", [(" ", "_"), ("h", "H"), ("e", "E"), ("l", "L"), ("o", "O"), ("w", "W"), ("r", "R"), ("d", "D")]) +print(result) +```""" + +_IMPORT_WITH_TOOLS_TASK_TEMPLATE = """\ +## Task +**Question** +{question} + +**Solution** +""" + + +def build_import_with_tools_prompt(question: str, task_family: str = "default") -> str: + """ + Build the IMPORT-with-tools prompt. The toolbox is NOT shown as text — it + is conveyed via the OpenAI tools=[...] parameter on the chat completion call. 
+ """ + if task_family == "pbebench": + instruction = _IMPORT_WITH_TOOLS_INSTRUCTION_PBEBENCH + example = _IMPORT_WITH_TOOLS_EXAMPLE_PBEBENCH + else: + instruction = _IMPORT_WITH_TOOLS_INSTRUCTION_DEFAULT + example = _IMPORT_WITH_TOOLS_EXAMPLE_DEFAULT + return ( + instruction + + "\n\n\n" + + example + + "\n\n\n" + + _IMPORT_WITH_TOOLS_TASK_TEMPLATE.format(question=question) + ) + + +# --------------------------------------------------------------------------- +# CREATE mode +# --------------------------------------------------------------------------- + +_CREATE_INSTRUCTION_DEFAULT = ( "You task is to write Python program solutions to the given questions.\n" "You should also create Python functions that can be used by your solution, " "if you believe the function can be reused to solve other questions." - + _FORMAT_OVERRIDE ) -_CREATE_EXAMPLE = """\ +_CREATE_EXAMPLE_DEFAULT = """\ ## Example **Question** Given a list of strings and a list of (old, new) substitution pairs, apply all @@ -122,6 +237,26 @@ def apply_substitutions(strings, substitutions): return out ```""" +_CREATE_EXAMPLE_PBEBENCH = """\ +## Example +**Question** +Produce a sequence of replace() calls that transforms "hello world" into +"HELLO_WORLD". 
+ +**Solution** +```python +result = find_replace_chain("hello world", [(" ", "_"), ("h", "H"), ("e", "E"), ("l", "L"), ("o", "O"), ("w", "W"), ("r", "R"), ("d", "D")]) +print(result) +``` +**Tools** +```python +def find_replace_chain(s, pairs): + \"\"\"Apply a chain of (old, new) replacements to a string.\"\"\" + for old, new in pairs: + s = s.replace(old, new) + return s +```""" + _CREATE_TASK_TEMPLATE = """\ ## Task **Question** @@ -131,27 +266,28 @@ def apply_substitutions(strings, substitutions): """ -def build_create_prompt(question: str) -> str: +def build_create_prompt(question: str, task_family: str = "default") -> str: """Build the CREATE-mode prompt for a single task.""" + instruction = _CREATE_INSTRUCTION_DEFAULT + _format_override(task_family) + example = _CREATE_EXAMPLE_PBEBENCH if task_family == "pbebench" else _CREATE_EXAMPLE_DEFAULT return ( - _CREATE_INSTRUCTION + instruction + "\n\n\n" - + _CREATE_EXAMPLE + + example + "\n\n\n" + _CREATE_TASK_TEMPLATE.format(question=question) ) # --------------------------------------------------------------------------- -# SKIP mode (inline solution, no new functions) +# SKIP mode # --------------------------------------------------------------------------- -_SKIP_INSTRUCTION = ( +_SKIP_INSTRUCTION_DEFAULT = ( "You task is to write Python program solutions to the given questions." - + _FORMAT_OVERRIDE ) -_SKIP_EXAMPLE = """\ +_SKIP_EXAMPLE_DEFAULT = """\ ## Example **Question** Given the list of strings ["Hello", "World"], convert each to lowercase and @@ -167,6 +303,29 @@ def build_create_prompt(question: str) -> str: ```python ```""" +_SKIP_EXAMPLE_PBEBENCH = """\ +## Example +**Question** +Produce a sequence of replace() calls that transforms "hello world" into +"HELLO_WORLD". 
+ +**Solution** +```python +s = "hello world" +s = s.replace(" ", "_") +s = s.replace("h", "H") +s = s.replace("e", "E") +s = s.replace("l", "L") +s = s.replace("o", "O") +s = s.replace("w", "W") +s = s.replace("r", "R") +s = s.replace("d", "D") +print(s) +``` +**Tools** +```python +```""" + _SKIP_TASK_TEMPLATE = """\ ## Task **Question** @@ -176,12 +335,14 @@ def build_create_prompt(question: str) -> str: """ -def build_skip_prompt(question: str) -> str: +def build_skip_prompt(question: str, task_family: str = "default") -> str: """Build the SKIP-mode prompt for a single task.""" + instruction = _SKIP_INSTRUCTION_DEFAULT + _format_override(task_family) + example = _SKIP_EXAMPLE_PBEBENCH if task_family == "pbebench" else _SKIP_EXAMPLE_DEFAULT return ( - _SKIP_INSTRUCTION + instruction + "\n\n\n" - + _SKIP_EXAMPLE + + example + "\n\n\n" + _SKIP_TASK_TEMPLATE.format(question=question) ) From 5cd4fd33529317d434671a131f2d1c24c2aa04e3 Mon Sep 17 00:00:00 2001 From: mathuryash5 Date: Sat, 25 Apr 2026 18:11:27 -0400 Subject: [PATCH 06/24] feat(trove): add tools_api for native OpenAI tool calling - toolbox_to_openai_tools(toolbox, topk=10): converts top-k toolbox functions into OpenAI Chat Completions tool schemas. Infers parameter types from inspect.signature; functions with *args/**kwargs are silently excluded. - dispatch_tool_call(toolbox, tool_call): runs the requested function in the sandbox executor, returns stdout truncated to 4096 chars or a JSON error string. Sanitizes Harmony control-token contamination in tool names (defensive vs. open vLLM PR #35906). 
Made-with: Cursor --- .../baselines/trove/tests/test_tools_api.py | 163 +++++++++++++++++ symbolic_agent/baselines/trove/tools_api.py | 170 ++++++++++++++++++ 2 files changed, 333 insertions(+) create mode 100644 symbolic_agent/baselines/trove/tests/test_tools_api.py create mode 100644 symbolic_agent/baselines/trove/tools_api.py diff --git a/symbolic_agent/baselines/trove/tests/test_tools_api.py b/symbolic_agent/baselines/trove/tests/test_tools_api.py new file mode 100644 index 00000000..8fc9d671 --- /dev/null +++ b/symbolic_agent/baselines/trove/tests/test_tools_api.py @@ -0,0 +1,163 @@ +"""Unit tests for tools_api.toolbox_to_openai_tools and dispatch_tool_call.""" + +import json +from types import SimpleNamespace + +from symbolic_agent.baselines.trove.toolbox import TroVEToolbox +from symbolic_agent.baselines.trove.tools_api import ( + dispatch_tool_call, + toolbox_to_openai_tools, +) + + +def _make_toolbox_with(func_src: str, name: str, docstr: str = "") -> TroVEToolbox: + tb = TroVEToolbox() + tb.add( + { + "name": name, + "docstr": docstr, + "signature": f"def {name}(...)", + "function": func_src, + "type": "function", + }, + example_idx=0, + ) + return tb + + +def _tool_call(name: str, args: dict, call_id: str = "call_1"): + return SimpleNamespace( + id=call_id, + function=SimpleNamespace(name=name, arguments=json.dumps(args)), + ) + + +# --------------------------------------------------------------------------- +# toolbox_to_openai_tools +# --------------------------------------------------------------------------- + +def test_schema_basic_function(): + src = ( + "def find_replace_chain(s: str, pairs: list) -> str:\n" + ' """Apply a chain of (old, new) replacements to a string."""\n' + " for old, new in pairs:\n" + " s = s.replace(old, new)\n" + " return s\n" + ) + tb = _make_toolbox_with(src, "find_replace_chain", docstr="Apply a chain of (old, new) replacements to a string.") + tools = toolbox_to_openai_tools(tb, topk=10) + assert len(tools) == 1 + fn = 
tools[0] + assert fn["type"] == "function" + assert fn["function"]["name"] == "find_replace_chain" + assert fn["function"]["description"] == "Apply a chain of (old, new) replacements to a string." + params = fn["function"]["parameters"] + assert params["type"] == "object" + assert set(params["properties"].keys()) == {"s", "pairs"} + assert params["properties"]["s"]["type"] == "string" + assert params["properties"]["pairs"]["type"] == "array" + assert set(params["required"]) == {"s", "pairs"} + + +def test_schema_unannotated_falls_back_to_string(): + src = ( + "def f(x):\n" + " return x\n" + ) + tb = _make_toolbox_with(src, "f") + tools = toolbox_to_openai_tools(tb, topk=10) + assert tools[0]["function"]["parameters"]["properties"]["x"]["type"] == "string" + + +def test_schema_skips_varargs_kwargs(): + src = ( + "def f(*args, **kwargs):\n" + " return args\n" + ) + tb = _make_toolbox_with(src, "f") + tools = toolbox_to_openai_tools(tb, topk=10) + assert tools == [] + + +def test_schema_required_excludes_defaults(): + src = ( + "def f(x: int, y: int = 5):\n" + " return x + y\n" + ) + tb = _make_toolbox_with(src, "f") + tools = toolbox_to_openai_tools(tb, topk=10) + params = tools[0]["function"]["parameters"] + assert params["required"] == ["x"] + assert params["properties"]["y"]["type"] == "integer" + + +def test_schema_topk_respects_frequency(): + tb = TroVEToolbox() + for n, freq in [("a", 3), ("b", 2), ("c", 1)]: + tb.add( + { + "name": n, + "docstr": "", + "signature": f"def {n}()", + "function": f"def {n}():\n return 0\n", + "type": "function", + }, + example_idx=0, + ) + for _ in range(freq - 1): + tb.update_frequency(n, example_idx=0) + tools = toolbox_to_openai_tools(tb, topk=2) + assert [t["function"]["name"] for t in tools] == ["a", "b"] + + +def test_schema_empty_toolbox(): + assert toolbox_to_openai_tools(TroVEToolbox(), topk=10) == [] + + +# --------------------------------------------------------------------------- +# dispatch_tool_call +# 
--------------------------------------------------------------------------- + +def test_dispatch_runs_function_and_returns_stdout(): + src = ( + "def reverse_str(s):\n" + " return s[::-1]\n" + ) + tb = _make_toolbox_with(src, "reverse_str") + result = dispatch_tool_call(tb, _tool_call("reverse_str", {"s": "hello"})) + assert "olleh" in result + + +def test_dispatch_unknown_tool_returns_error(): + tb = TroVEToolbox() + result = dispatch_tool_call(tb, _tool_call("nonexistent", {})) + assert "not in toolbox" in result + + +def test_dispatch_bad_json_returns_error(): + src = "def f(x):\n return x\n" + tb = _make_toolbox_with(src, "f") + bad = SimpleNamespace( + id="x", + function=SimpleNamespace(name="f", arguments="{not json"), + ) + result = dispatch_tool_call(tb, bad) + assert "argument JSON parse failed" in result + + +def test_dispatch_sanitizes_harmony_contamination(): + src = "def reverse_str(s):\n return s[::-1]\n" + tb = _make_toolbox_with(src, "reverse_str") + tc = _tool_call("reverse_str<|channel|>commentary", {"s": "abc"}) + result = dispatch_tool_call(tb, tc) + assert "cba" in result + + +def test_dispatch_truncates_long_output(): + src = ( + "def long_output(n):\n" + " return 'x' * n\n" + ) + tb = _make_toolbox_with(src, "long_output") + result = dispatch_tool_call(tb, _tool_call("long_output", {"n": 10000})) + assert len(result) <= 4096 + 100 # +slack for repr quotes and truncation marker diff --git a/symbolic_agent/baselines/trove/tools_api.py b/symbolic_agent/baselines/trove/tools_api.py new file mode 100644 index 00000000..fc093d5f --- /dev/null +++ b/symbolic_agent/baselines/trove/tools_api.py @@ -0,0 +1,170 @@ +"""Translate the TroVE toolbox into OpenAI Chat Completions tool schemas +and dispatch tool calls back through the executor. + +This module is the bridge between TroVE's in-memory toolbox and vLLM's +native tool-calling protocol. It is invoked only from the IMPORT-with-tools +controller branch. 
+""" + +from __future__ import annotations + +import inspect +import json +import logging +from typing import Any + +from .executor import run_solution +from .toolbox import TroVEToolbox + +logger = logging.getLogger(__name__) + +_MAX_RESULT_CHARS = 4096 + +# Type inference: Python annotation -> JSON Schema type. +_TYPE_MAP = { + int: "integer", + float: "number", + bool: "boolean", + str: "string", + list: "array", + tuple: "array", + dict: "object", +} + + +def _infer_type(annotation: Any) -> str: + if annotation is inspect.Parameter.empty: + return "string" + # Plain types (int, str, etc.) + if annotation in _TYPE_MAP: + return _TYPE_MAP[annotation] + # typing.List, typing.Dict, etc. — fall through to string if unrecognised. + origin = getattr(annotation, "__origin__", None) + if origin in _TYPE_MAP: + return _TYPE_MAP[origin] + return "string" + + +def _function_to_schema(name: str, fn: Any, docstr: str) -> dict | None: + """ + Build one OpenAI tool dict from a callable. Returns None if the function + has *args or **kwargs (we cannot generate a meaningful schema). 
+ """ + try: + sig = inspect.signature(fn) + except (TypeError, ValueError) as exc: + logger.debug("Could not introspect %s: %s", name, exc) + return None + + properties: dict = {} + required: list = [] + + for pname, param in sig.parameters.items(): + if param.kind in ( + inspect.Parameter.VAR_POSITIONAL, + inspect.Parameter.VAR_KEYWORD, + ): + logger.debug("Skipping %s — has *args/**kwargs", name) + return None + prop: dict = {"type": _infer_type(param.annotation)} + if param.default is not inspect.Parameter.empty: + if isinstance(param.default, (int, float, bool, str)): + prop["default"] = param.default + else: + required.append(pname) + properties[pname] = prop + + return { + "type": "function", + "function": { + "name": name, + "description": docstr or "", + "parameters": { + "type": "object", + "properties": properties, + "required": required, + }, + }, + } + + +def toolbox_to_openai_tools(toolbox: TroVEToolbox, topk: int = 10) -> list: + """ + Convert the top-k toolbox functions (by frequency) into OpenAI Chat + Completions tool dicts. + + Functions with *args / **kwargs are silently excluded. + Returns [] when the toolbox is empty. + """ + entries = toolbox.snapshot() + if not entries: + return [] + entries.sort(key=lambda e: -int(e.get("frequency", 0))) + selected = entries[:topk] + + namespace: dict = {} + try: + # compile(..., dont_inherit=True) so this module's `from __future__ import + # annotations` is not applied to the toolbox source; we need real types in + # `__annotations__` for inspect.signature() / _infer_type. 
+ _code = compile( + toolbox.get_full_code(), "", "exec", dont_inherit=True + ) + exec(_code, namespace) + except Exception as exc: + logger.warning("Could not exec toolbox source for schema generation: %s", exc) + return [] + + tools: list = [] + for entry in selected: + name = entry.get("name", "") + if not name or name not in namespace: + continue + fn = namespace[name] + schema = _function_to_schema(name, fn, entry.get("docstr", "")) + if schema is not None: + tools.append(schema) + return tools + + +def _sanitize_name(name: str) -> str: + """Defensive workaround for vLLM PR #35906 (Harmony control tokens + leaking into tool names like `reverse_str<|channel|>commentary`).""" + return name.split("<|", 1)[0].strip() + + +def _truncate(s: str, limit: int = _MAX_RESULT_CHARS) -> str: + if len(s) <= limit: + return s + return s[:limit] + f"\n... [truncated {len(s) - limit} chars]" + + +def dispatch_tool_call(toolbox: TroVEToolbox, tool_call) -> str: + """ + Resolve `tool_call` against the toolbox, run it via the sandbox executor, + and return the captured stdout (truncated to 4096 chars) or an error + message string. Always returns a string — never raises. 
+ """ + name = _sanitize_name(getattr(tool_call.function, "name", "") or "") + if not name: + return json.dumps({"error": "tool_call has no function name"}) + if name not in {e["name"] for e in toolbox.snapshot()}: + return json.dumps({"error": f"tool '{name}' not in toolbox"}) + + raw_args = getattr(tool_call.function, "arguments", "") or "{}" + try: + args = json.loads(raw_args) + if not isinstance(args, dict): + return json.dumps({"error": f"argument JSON parse failed: expected object, got {type(args).__name__}"}) + except json.JSONDecodeError as exc: + return json.dumps({"error": f"argument JSON parse failed: {exc}"}) + + call_expr = f"print(repr({name}(**{args!r})))" + is_ok, output = run_solution( + solution_code=call_expr, + tools_code="", + toolbox_code=toolbox.get_full_code(), + ) + if not is_ok: + return json.dumps({"error": "execution failed", "stderr": _truncate(output)}) + return _truncate(output) From 06116b12e33c3ec80f5fa2268985c0b98793705f Mon Sep 17 00:00:00 2001 From: mathuryash5 Date: Sat, 25 Apr 2026 18:15:07 -0400 Subject: [PATCH 07/24] fix(trove): correct misleading 'stderr' key in tools_api error payload executor.run_solution returns proc.stdout.strip(), not stderr. Rename the JSON error key from 'stderr' to 'stdout' so the field name matches what is actually being returned. Caught in code-quality review for Task 4. 
Made-with: Cursor --- symbolic_agent/baselines/trove/tools_api.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/symbolic_agent/baselines/trove/tools_api.py b/symbolic_agent/baselines/trove/tools_api.py index fc093d5f..c0edc151 100644 --- a/symbolic_agent/baselines/trove/tools_api.py +++ b/symbolic_agent/baselines/trove/tools_api.py @@ -166,5 +166,5 @@ def dispatch_tool_call(toolbox: TroVEToolbox, tool_call) -> str: toolbox_code=toolbox.get_full_code(), ) if not is_ok: - return json.dumps({"error": "execution failed", "stderr": _truncate(output)}) + return json.dumps({"error": "execution failed", "stdout": _truncate(output)}) return _truncate(output) From 6ee331cb02e90d4ef7dc399b8bc1095a522401d5 Mon Sep 17 00:00:00 2001 From: mathuryash5 Date: Sat, 25 Apr 2026 18:17:02 -0400 Subject: [PATCH 08/24] feat(trove): add TroVELLMClient.chat_with_tools for native tool calls Multi-turn loop that handles tool_calls returned by gpt-oss/vLLM: appends assistant message + tool result messages until the model returns no tool_calls or max_tool_iters is reached. Records each call as {name, args_preview, result_preview, ok} for downstream telemetry. Reuses the existing 3-attempt retry, debug logging, and token accounting. Anthropic backend raises NotImplementedError as a defensive guard; controllers branch on self.backend == "openai" before calling. 
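The control flow described above can be sketched in miniature. This is an illustration under stated assumptions, not the `chat_with_tools` implementation: `run_tool_loop`, `step`, `dispatch`, `fake_step`, and `fake_dispatch` are hypothetical names, `step` stands in for one chat-completion round-trip so the example runs without a server, and tool calls are plain dicts rather than SDK objects. It also simplifies the `max_iters` case by returning an empty `final_text`.

```python
import json
from typing import Any, Callable, Dict, List


def run_tool_loop(
    step: Callable[[List[Dict[str, Any]]], Dict[str, Any]],
    dispatch: Callable[[Dict[str, Any]], str],
    messages: List[Dict[str, Any]],
    max_tool_iters: int = 8,
) -> Dict[str, Any]:
    """Hypothetical miniature of the multi-turn loop (not the real client)."""
    convo = list(messages)  # copy so the caller's list is not mutated
    recorded: List[Dict[str, Any]] = []
    for _ in range(max_tool_iters):
        reply = step(convo)  # stand-in for one chat-completion round-trip
        calls = reply.get("tool_calls") or []
        if not calls:
            return {
                "final_text": reply.get("content", ""),
                "tool_calls": recorded,
                "stopped_reason": "no_tool_calls",
            }
        # Echo the assistant turn, then append one tool message per call.
        convo.append(
            {"role": "assistant", "content": reply.get("content", ""), "tool_calls": calls}
        )
        for tc in calls:
            try:
                result, ok = dispatch(tc), True
            except Exception as exc:
                result, ok = json.dumps({"error": str(exc)}), False
            recorded.append(
                {
                    "name": tc["name"],
                    "args_preview": tc["arguments"][:200],
                    "result_preview": result[:200],
                    "ok": ok,
                }
            )
            convo.append({"role": "tool", "tool_call_id": tc["id"], "content": result})
    return {"final_text": "", "tool_calls": recorded, "stopped_reason": "max_iters"}


def fake_step(convo):
    """Scripted model: request one tool call, then answer once a result exists."""
    if not any(m["role"] == "tool" for m in convo):
        return {
            "content": "",
            "tool_calls": [
                {"id": "call_1", "name": "reverse_str", "arguments": '{"s": "abc"}'}
            ],
        }
    return {"content": "done", "tool_calls": []}


def fake_dispatch(tc):
    """Scripted tool: reverse the string passed as `s`."""
    return json.loads(tc["arguments"])["s"][::-1]


out = run_tool_loop(fake_step, fake_dispatch, [{"role": "user", "content": "hi"}])
print(out["stopped_reason"], out["tool_calls"][0]["result_preview"])
```

Passing the model call in as `step` keeps the loop testable offline; the real method instead calls the client directly and layers retries and token accounting on top.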
Made-with: Cursor --- symbolic_agent/baselines/trove/llm.py | 170 +++++++++++++++++++++++++- 1 file changed, 169 insertions(+), 1 deletion(-) diff --git a/symbolic_agent/baselines/trove/llm.py b/symbolic_agent/baselines/trove/llm.py index ec98472f..dda158eb 100644 --- a/symbolic_agent/baselines/trove/llm.py +++ b/symbolic_agent/baselines/trove/llm.py @@ -16,7 +16,7 @@ import os import time from datetime import datetime, timezone -from typing import Dict, List, Optional +from typing import Any, Callable, Dict, List, Optional logger = logging.getLogger(__name__) @@ -219,6 +219,174 @@ def _call_openai(self, prompt: str, model: str, max_tokens: int, tag: str) -> st logger.warning("All OpenAI retries exhausted (tag=%s): %s", tag, last_exc) return "" + # ------------------------------------------------------------------ + # Native tool calling (OpenAI/vLLM only) + # ------------------------------------------------------------------ + + def chat_with_tools( + self, + messages: List[Dict[str, Any]], + tools: List[Dict[str, Any]], + model: str, + max_tokens: int = DEFAULT_MAX_TOKENS, + max_tool_iters: int = 8, + on_tool_call: Optional[Callable[[Any], str]] = None, + tag: str = "", + ) -> Dict[str, Any]: + """ + Multi-turn chat completion that supports native OpenAI tool calls. + + Returns + ------- + { + "final_text": str, # message.content (or reasoning_content fallback) + "tool_calls": list[dict], # ordered, each {name, args_preview, result_preview, ok} + "iterations": int, # number of round-trips actually used + "stopped_reason": str, # "no_tool_calls" | "max_iters" | "error" + } + + The caller is responsible for providing `on_tool_call(tc) -> str`, + which is invoked for every tool_call returned by the model. The + return value (already a string) is sent back as the tool message. + + Anthropic backend is not supported — this method exists for the + OpenAI/vLLM tool-calling flow only. 
It raises NotImplementedError + on Anthropic as a defensive guard; controllers must check + `self.backend == "openai"` before calling. + """ + if self.backend != "openai": + raise NotImplementedError("chat_with_tools requires the openai backend") + + if on_tool_call is None: + raise ValueError("chat_with_tools requires an on_tool_call callback") + + recorded_calls: List[Dict[str, Any]] = [] + convo: List[Dict[str, Any]] = list(messages) + iterations = 0 + final_text = "" + stopped_reason = "no_tool_calls" + + for it in range(max_tool_iters + 1): + iterations = it + 1 + iter_tag = f"{tag}_iter{it}" if tag else f"iter{it}" + response = None + last_exc = None + + for attempt in range(3): + try: + response = self._client.chat.completions.create( + model=model, + max_tokens=max_tokens, + messages=convo, + tools=tools, + tool_choice="auto", + ) + break + except Exception as exc: + last_exc = exc + if getattr(exc, "status_code", None) == 400: + logger.warning( + "OpenAI chat_with_tools 400 (tag=%s): %s", iter_tag, exc + ) + self._record(iter_tag, model, json.dumps(convo)[:2000], "", max_tokens, {}) + return { + "final_text": "", + "tool_calls": recorded_calls, + "iterations": iterations, + "stopped_reason": "error", + } + if attempt < 2: + wait = 5 * (2 ** attempt) + logger.warning( + "chat_with_tools failed (attempt %d/3, tag=%s): %s. 
Retrying in %ds.", + attempt + 1, iter_tag, exc, wait, + ) + time.sleep(wait) + + if response is None: + logger.warning("All chat_with_tools retries exhausted (tag=%s): %s", iter_tag, last_exc) + stopped_reason = "error" + break + + msg = response.choices[0].message + content = msg.content or getattr(msg, "reasoning_content", "") or "" + tool_calls = getattr(msg, "tool_calls", None) or [] + + u = getattr(response, "usage", None) + details = getattr(u, "completion_tokens_details", None) + usage = { + "input_tokens": getattr(u, "prompt_tokens", 0) or 0, + "output_tokens": getattr(u, "completion_tokens", 0) or 0, + "reasoning_tokens": getattr(details, "reasoning_tokens", 0) or 0 if details else 0, + } + self._record( + iter_tag, + model, + json.dumps(convo)[:2000], + json.dumps({"content": content, "tool_calls_count": len(tool_calls)}), + max_tokens, + usage, + ) + + if not tool_calls: + final_text = content + stopped_reason = "no_tool_calls" + break + + assistant_msg: Dict[str, Any] = { + "role": "assistant", + "content": content, + "tool_calls": [ + { + "id": tc.id, + "type": "function", + "function": { + "name": tc.function.name, + "arguments": tc.function.arguments, + }, + } + for tc in tool_calls + ], + } + convo.append(assistant_msg) + + for tc in tool_calls: + try: + result = on_tool_call(tc) + ok = True + except Exception as exc: + result = json.dumps({"error": f"on_tool_call raised: {exc}"}) + ok = False + args_preview = (tc.function.arguments or "")[:200] + result_preview = (result or "")[:200] + recorded_calls.append( + { + "name": tc.function.name, + "args_preview": args_preview, + "result_preview": result_preview, + "ok": ok, + } + ) + convo.append( + { + "role": "tool", + "tool_call_id": tc.id, + "content": result, + } + ) + + if it >= max_tool_iters - 1: + stopped_reason = "max_iters" + final_text = content + break + + return { + "final_text": final_text, + "tool_calls": recorded_calls, + "iterations": iterations, + "stopped_reason": stopped_reason, + } 
+ # ------------------------------------------------------------------ # Logging # ------------------------------------------------------------------ From ace60481438d65937314b032b39dbc28c7262ec4 Mon Sep 17 00:00:00 2001 From: mathuryash5 Date: Sat, 25 Apr 2026 18:22:49 -0400 Subject: [PATCH 09/24] feat(trove): controller branch for native IMPORT tool calling - Add task_family and selection params to TroVEController.__init__. - IMPORT branch dispatches to _generate_import_with_tools when toolbox is non-empty and backend is openai; otherwise falls back to legacy text-based IMPORT. - _generate_import_with_tools builds K multi-turn trajectories via TroVELLMClient.chat_with_tools, parses **Solution** strictly for pbebench, and runs the result through the executor. - _update_library credits frequency by unique tool_call.function.name for the native path; legacy path still credits parsed functions. - _make_result emits won_mode, import_eligible, import_was_winner, tool_calls, tool_call_count, tools_called, actually_called, trove_stopped_reason as passive telemetry. - _select_best honors selection="consistency" or "reward" (default). Made-with: Cursor --- symbolic_agent/baselines/trove/controller.py | 192 +++++++++++++++---- 1 file changed, 156 insertions(+), 36 deletions(-) diff --git a/symbolic_agent/baselines/trove/controller.py b/symbolic_agent/baselines/trove/controller.py index d11d8b23..173c3837 100644 --- a/symbolic_agent/baselines/trove/controller.py +++ b/symbolic_agent/baselines/trove/controller.py @@ -37,10 +37,17 @@ from collections import Counter from typing import Callable, Dict, List, Optional +from . 
import tools_api from .executor import run_solution from .llm import TroVELLMClient -from .parse import count_ast_nodes, parse_response -from .prompts import build_create_prompt, build_import_prompt, build_skip_prompt, get_question +from .parse import count_ast_nodes, imported_callsites, parse_response +from .prompts import ( + build_create_prompt, + build_import_prompt, + build_import_with_tools_prompt, + build_skip_prompt, + get_question, +) from .toolbox import TroVEToolbox logger = logging.getLogger(__name__) @@ -83,18 +90,26 @@ def __init__( debug_dir: Optional[str] = None, k: int = DEFAULT_K, trim_every: int = DEFAULT_TRIM_EVERY, - trim_C: float = 0.5, + trim_C: float = 1.0, temperature: float = 0.3, top_p: float = 0.95, + task_family: str = "default", + selection: str = "reward", + max_tool_iters: int = 8, + tool_schema_topk: int = 10, ): self.model = model self.k = k self.trim_every = trim_every self.trim_C = trim_C + self.task_family = task_family + self.selection = selection + self.max_tool_iters = max_tool_iters + self.tool_schema_topk = tool_schema_topk - backend = "openai" if base_url else "anthropic" + self.backend = "openai" if base_url else "anthropic" self.llm = TroVELLMClient( - backend=backend, + backend=self.backend, base_url=base_url, api_key=api_key, temperature=temperature, @@ -252,33 +267,52 @@ def _multi_way_generation( toolbox_str = self.toolbox.format_toolbox() # --- IMPORT mode --- - import_candidates = [] - if toolbox_str: + toolbox_nonempty = bool(toolbox_str) + use_tools_branch = toolbox_nonempty and self.backend == "openai" + + if use_tools_branch: + import_candidates = self._generate_import_with_tools( + question, example_idx, reward_fn=reward_fn, entry=entry + ) + best_import_idx, best_import_score = self._select_best( + import_candidates, reward_fn=reward_fn, entry=entry + ) + best_import = import_candidates[best_import_idx] + best_import["_reward_score"] = best_import_score + elif toolbox_nonempty: + # Legacy text-based IMPORT 
(Anthropic or unforeseen non-OpenAI path). + import_candidates = [] for _ in range(self.k): - prompt = build_import_prompt(question, toolbox_str) + prompt = build_import_prompt(question, toolbox_str, task_family=self.task_family) raw = self.llm.call(prompt, self.model, max_tokens=DEFAULT_MAX_TOKENS, tag="trove_import") - parsed = parse_response(raw) + parsed = parse_response(raw, task_family=self.task_family) is_ok, out = run_solution( parsed["solution_code"], parsed["tools_code"], self.toolbox.get_full_code(), ) - import_candidates.append({**parsed, "is_success": is_ok, "exec_output": out}) + import_candidates.append( + {**parsed, "is_success": is_ok, "exec_output": out, "tool_calls": [], "stopped_reason": "legacy"} + ) best_import_idx, best_import_score = self._select_best( import_candidates, reward_fn=reward_fn, entry=entry ) best_import = import_candidates[best_import_idx] best_import["_reward_score"] = best_import_score else: - best_import = {"solution_code": "", "tools_code": "", "functions": [], - "is_success": False, "exec_output": "", "_reward_score": None} + best_import = { + "solution_code": "", "tools_code": "", "functions": [], + "is_success": False, "exec_output": "", + "tool_calls": [], "stopped_reason": "empty_toolbox", + "_reward_score": None, + } # --- CREATE mode --- create_candidates = [] for _ in range(self.k): - prompt = build_create_prompt(question) + prompt = build_create_prompt(question, task_family=self.task_family) raw = self.llm.call(prompt, self.model, max_tokens=DEFAULT_MAX_TOKENS, tag="trove_create") - parsed = parse_response(raw) + parsed = parse_response(raw, task_family=self.task_family) is_ok, out = run_solution( parsed["solution_code"], parsed["tools_code"], @@ -294,9 +328,9 @@ def _multi_way_generation( # --- SKIP mode --- skip_candidates = [] for _ in range(self.k): - prompt = build_skip_prompt(question) + prompt = build_skip_prompt(question, task_family=self.task_family) raw = self.llm.call(prompt, self.model, 
max_tokens=DEFAULT_MAX_TOKENS, tag="trove_skip") - parsed = parse_response(raw) + parsed = parse_response(raw, task_family=self.task_family) is_ok, out = run_solution( parsed["solution_code"], parsed["tools_code"], @@ -334,6 +368,54 @@ def _multi_way_generation( ) return winning_mode, best_resp, best_score + def _generate_import_with_tools( + self, + question: str, + example_idx: int, + reward_fn: Optional[Callable] = None, + entry: Optional[dict] = None, + ) -> List[dict]: + """ + IMPORT-mode generation using native OpenAI tool calling. + Builds K trajectories; each trajectory may invoke toolbox functions + via tool_calls during the multi-turn loop. Returns K candidate dicts + compatible with _select_best. + """ + prompt = build_import_with_tools_prompt(question, task_family=self.task_family) + tools_schema = tools_api.toolbox_to_openai_tools(self.toolbox, topk=self.tool_schema_topk) + + candidates: List[dict] = [] + for i in range(self.k): + tag = f"trove_import_t{example_idx}_{i}" + messages = [{"role": "user", "content": prompt}] + on_tc = lambda tc: tools_api.dispatch_tool_call(self.toolbox, tc) + traj = self.llm.chat_with_tools( + messages=messages, + tools=tools_schema, + model=self.model, + max_tokens=DEFAULT_MAX_TOKENS, + max_tool_iters=self.max_tool_iters, + on_tool_call=on_tc, + tag=tag, + ) + parsed = parse_response(traj["final_text"], task_family=self.task_family) + is_ok, out = run_solution( + parsed["solution_code"], + parsed["tools_code"], + self.toolbox.get_full_code(), + ) + candidates.append( + { + **parsed, + "is_success": is_ok, + "exec_output": out, + "tool_calls": traj["tool_calls"], + "stopped_reason": traj["stopped_reason"], + "iterations": traj["iterations"], + } + ) + return candidates + def _select_best( self, candidates: List[dict], @@ -344,18 +426,15 @@ def _select_best( Select the best candidate from a list of response dicts. 
Returns (best_index, score_or_None) where score is (reward, message) - when reward-based selection is used, or None for majority-vote mode. - - Two selection strategies: - 1. Reward-based (when reward_fn + entry provided): - Score all K candidates with reward_fn; pick highest reward, - tiebreak by minimum AST node count (simplest solution). - This is reliable for PBEBench (program lists rarely match exactly - as strings) and equally good for reasoning_gym. - 2. Majority-vote fallback (original TroVE algorithm): - Filter successes → majority vote on stdout → min AST tiebreak. - Used when no reward function is available (e.g. bare solve()). + when reward-based selection is used, or None otherwise. + + Selection strategy is governed by self.selection: + - "reward" (default): reward-based when reward_fn+entry provided, + falls back to consistency when not. + - "consistency": original TroVE majority-vote algorithm. """ + if self.selection == "consistency": + return self._select_best_by_consistency(candidates), None if reward_fn is not None and entry is not None: return self._select_best_by_reward(candidates, reward_fn, entry) return self._select_best_by_consistency(candidates), None @@ -419,13 +498,25 @@ def _select_best_by_consistency(self, candidates: List[dict]) -> int: def _update_library(self, mode: str, resp: dict, example_idx: int) -> None: """Update toolbox based on winning mode (faithful to run_trove.py).""" if mode == "import": - # IMPORT: credit existing functions that were used - for func_dict in resp.get("functions", []): - name = func_dict.get("name", "") - if name: - self.toolbox.update_frequency(name, example_idx) + tool_calls = resp.get("tool_calls") or [] + if tool_calls: + # Native tool-calling path: credit by unique tool_call.function.name + # (defensive: sanitize and let toolbox.update_frequency filter unknowns). 
+ unique_names = { + tc["name"].split("<|", 1)[0].strip() + for tc in tool_calls + if tc.get("name") + } + for name in unique_names: + if name: + self.toolbox.update_frequency(name, example_idx) + else: + # Legacy text-based IMPORT: credit functions parsed from **Tools**. + for func_dict in resp.get("functions", []): + name = func_dict.get("name", "") + if name: + self.toolbox.update_frequency(name, example_idx) elif mode == "create" and resp.get("is_success"): - # CREATE: add new functions only when execution succeeded for func_dict in resp.get("functions", []): self.toolbox.add(func_dict, example_idx) @@ -447,8 +538,29 @@ def _make_result( ) -> dict: """ Build a result dict compatible with main.py's _print_result() and - _append_task_output(). + _append_task_output(). Adds passive TroVE telemetry fields. """ + tool_calls = best_resp.get("tool_calls") or [] + tools_called = sorted({ + tc["name"].split("<|", 1)[0].strip() + for tc in tool_calls + if tc.get("name") + }) + candidate_names = {e["name"] for e in self.toolbox.snapshot()} + actually_called = sorted( + imported_callsites( + solution_code=best_resp.get("solution_code", ""), + tools_code=best_resp.get("tools_code", ""), + candidate_names=candidate_names, + ) + ) + import_eligible = len(self.toolbox) > 0 # state AFTER this task's update + # Note: import_eligible reflects the current toolbox state after + # _update_library has already run for this task. The analyzer should + # interpret this as "a non-empty toolbox existed at some point during + # this task's processing". For pre-task eligibility, infer from + # toolbox snapshots in adjacent tasks. 
+ return { "task_type": task_type, "original_prompt": str(task_input), @@ -464,7 +576,7 @@ def _make_result( ], "solution": best_resp.get("solution_code", ""), "library_snapshot": self.toolbox.snapshot(), - "cost_summary": {}, # TroVE has no cost model + "cost_summary": {}, "final_output": { "answer": output, "explanation": f"TroVE mode={best_mode}", @@ -475,6 +587,14 @@ def _make_result( "reward_history": [], "best_reward": None, "final_reward": None, - # Cached score from reward-based selection; consumed and removed by solve_with_reward. "_best_reward_score": best_reward_score, + # TroVE native-tool-calling telemetry + "won_mode": best_mode, + "import_eligible": import_eligible, + "import_was_winner": best_mode == "import", + "tool_calls": tool_calls, + "tool_call_count": len(tool_calls), + "tools_called": tools_called, + "actually_called": actually_called, + "trove_stopped_reason": best_resp.get("stopped_reason", ""), } From 5f1ff88b2ebbf022c8569653a69b49a93ccbb374 Mon Sep 17 00:00:00 2001 From: mathuryash5 Date: Sat, 25 Apr 2026 18:34:32 -0400 Subject: [PATCH 10/24] docs(trove): align TroVEController class docstring with new params Update the class-level Parameters block to: - Reflect trim_C default of 1.0 (matches __init__). - Document task_family, selection, max_tool_iters, tool_schema_topk. - Note that base_url governs which backend is used and that native tool-calling IMPORT requires the openai backend. Made-with: Cursor --- symbolic_agent/baselines/trove/controller.py | 20 ++++++++++++++++++-- 1 file changed, 18 insertions(+), 2 deletions(-) diff --git a/symbolic_agent/baselines/trove/controller.py b/symbolic_agent/baselines/trove/controller.py index 173c3837..149f2a28 100644 --- a/symbolic_agent/baselines/trove/controller.py +++ b/symbolic_agent/baselines/trove/controller.py @@ -68,18 +68,34 @@ class TroVEController: model : str LLM model identifier. base_url : str, optional - For OpenAI-compatible (vLLM) backends. + For OpenAI-compatible (vLLM) backends. 
When set, ``self.backend`` is + ``"openai"``; otherwise ``"anthropic"``. Native tool-calling IMPORT + requires the openai backend. debug_dir : str, optional k : int Number of samples per mode (paper default: 5). trim_every : int Trim toolbox every N tasks (paper default: 500). trim_C : float - Trimming threshold multiplier: threshold = C·log₂₀(n). Default: 0.5. + Trimming threshold multiplier: threshold = C·log₂₀(n). Default: 1.0 + (matches the original TroVE implementation). temperature : float Sampling temperature. Default: 0.3 (TroVE paper). top_p : float Nucleus sampling top-p. Default: 0.95 (TroVE paper). + task_family : str + Prompt/parsing family. ``"default"`` (generic) or ``"pbebench"`` + (PBEBench-shaped few-shots; strict ``**Solution**`` parsing). + selection : str + Candidate selection strategy. ``"reward"`` (default) uses the + reward function when available and falls back to consistency; + ``"consistency"`` always uses the original TroVE majority-vote. + max_tool_iters : int + Maximum tool-call rounds per IMPORT trajectory in the native + tool-calling path. Default: 8. + tool_schema_topk : int + Number of top-frequency toolbox functions exposed as OpenAI tool + schemas in the native IMPORT path. Default: 10. """ def __init__( From d8a76a4000f35b30afe47e6d1a6c65e8340e289f Mon Sep 17 00:00:00 2001 From: mathuryash5 Date: Sat, 25 Apr 2026 18:35:39 -0400 Subject: [PATCH 11/24] feat(trove): CLI flags --trove-selection and --trove-task-family - --trove-selection {reward,consistency} (default: reward). - --trove-task-family {default,pbebench} (default: default). Plumbed through to TroVEController; PBEBench runs should pass --trove-task-family pbebench to enable PBEBench-shaped few-shots and strict **Solution** parsing. 
Made-with: Cursor --- main.py | 24 +++++++++++++++++++++++- 1 file changed, 23 insertions(+), 1 deletion(-) diff --git a/main.py b/main.py index f04aff88..3bfbc6b3 100644 --- a/main.py +++ b/main.py @@ -808,6 +808,23 @@ def main() -> None: help="[TroVE] Trim low-frequency toolbox functions every N tasks. " "Paper default: 500. Set to 9999 to disable for small datasets. (default: 500)", ) + parser.add_argument( + "--trove-selection", + choices=["reward", "consistency"], + default="reward", + help="[TroVE] Candidate selection strategy. 'reward' (default) uses " + "the per-task reward function with AST tie-breaking. " + "'consistency' uses the original TroVE majority-vote algorithm. " + "(default: reward)", + ) + parser.add_argument( + "--trove-task-family", + choices=["default", "pbebench"], + default="default", + help="[TroVE] Task family for prompt selection and parser strictness. " + "'pbebench' uses PBEBench-shaped few-shots and strict **Solution** " + "parsing (no fallback to any python block). 
(default: default)", + ) # ReGAL-specific flags parser.add_argument( "--regal-train-file", @@ -1007,8 +1024,13 @@ def main() -> None: debug_dir=args.debug_dir, k=args.trove_k, trim_every=args.trove_trim_every, + task_family=args.trove_task_family, + selection=args.trove_selection, + ) + logger.info( + "Framework: TroVE (k=%d, trim_every=%d, task_family=%s, selection=%s)", + args.trove_k, args.trove_trim_every, args.trove_task_family, args.trove_selection, ) - logger.info("Framework: TroVE (k=%d, trim_every=%d)", args.trove_k, args.trove_trim_every) elif args.framework == "regal": from pathlib import Path as _Path controller = ReGALController( From a19309b93c285283b039d45322f31e4df420daa5 Mon Sep 17 00:00:00 2001 From: mathuryash5 Date: Sat, 25 Apr 2026 18:36:31 -0400 Subject: [PATCH 12/24] chore(launcher): enable native tool calling for gpt-oss-120b vLLM server Add three flags required for OpenAI-compatible tool calling on gpt-oss served by vLLM >= v0.16.0: --enable-auto-tool-choice --tool-call-parser openai --reasoning-parser openai_gptoss Without these the controller's chat_with_tools loop sees no tool_calls in the response and degrades to no-tool behavior. 
Made-with: Cursor --- scripts/launch_vllm_gpt_oss_120b.sh | 10 +++++++++- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/scripts/launch_vllm_gpt_oss_120b.sh b/scripts/launch_vllm_gpt_oss_120b.sh index 74b10dac..5ae5216c 100644 --- a/scripts/launch_vllm_gpt_oss_120b.sh +++ b/scripts/launch_vllm_gpt_oss_120b.sh @@ -7,6 +7,11 @@ export TMPDIR=/tmp/$USER-tmp ts=$(date +%Y%m%d_%H%M%S) +# Required vLLM tool-calling flags (vLLM >= v0.16.0 for PR #28729): +# --enable-auto-tool-choice enables tool_choice="auto" +# --tool-call-parser openai parses gpt-oss Harmony commentary channel +# --reasoning-parser openai_gptoss routes analysis-channel content into +# message.reasoning_content nohup python -m vllm.entrypoints.openai.api_server \ --model "openai/gpt-oss-120b" \ --tokenizer "openai/gpt-oss-120b" \ @@ -14,4 +19,7 @@ nohup python -m vllm.entrypoints.openai.api_server \ --port ${1} \ --gpu-memory-utilization 0.95 \ --tensor-parallel-size 2 \ - > vllm_logs/vllm_${1}_${ts}.log 2>&1 & echo $! > vllm_logs/vllm_${1}_${ts}.pid \ No newline at end of file + --enable-auto-tool-choice \ + --tool-call-parser openai \ + --reasoning-parser openai_gptoss \ + > vllm_logs/vllm_${1}_${ts}.log 2>&1 & echo $! > vllm_logs/vllm_${1}_${ts}.pid From 8c32e0c4bbbbc2cb1919785dde48863832d0ef69 Mon Sep 17 00:00:00 2001 From: mathuryash5 Date: Sat, 25 Apr 2026 18:37:12 -0400 Subject: [PATCH 13/24] feat(trove): add analyze_trove_run.py for post-hoc telemetry reports Reads a TroVE JSONL output and reports overall accuracy, final toolbox size, per-mode wins, IMPORT-mode tool-use breakdown (>=1 call rate, mean calls/task, success rate), and the top-10 most-called toolbox functions. Sanitizes Harmony control-token contamination in tool names when aggregating. 
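The name sanitization mentioned above can be sketched as follows. The split on `<|` mirrors the expression used in `tools_api._sanitize_name`; the helper names `sanitize_tool_name` and `count_tool_calls` and the sample rows are illustrative assumptions, not part of the script.

```python
from collections import Counter


def sanitize_tool_name(name: str) -> str:
    """Drop anything from the first Harmony control token (`<|...`) onward."""
    return name.split("<|", 1)[0].strip()


def count_tool_calls(results: list) -> Counter:
    """Aggregate per-function call counts across result rows (hypothetical helper)."""
    counter: Counter = Counter()
    for row in results:
        for tc in row.get("tool_calls") or []:
            name = sanitize_tool_name(tc.get("name") or "")
            if name:  # skip calls whose name is empty after sanitizing
                counter[name] += 1
    return counter


# Made-up rows: one contaminated name, one padded name, two rows without calls.
rows = [
    {"tool_calls": [{"name": "reverse_str<|channel|>commentary"}, {"name": "reverse_str"}]},
    {"tool_calls": [{"name": " extract_digits "}]},
    {"tool_calls": None},
    {},
]
counts = count_tool_calls(rows)
print(counts.most_common(10))
```

Sanitizing at aggregation time means a contaminated name and its clean form are counted as the same function, so the top-10 table is not split across spurious variants.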
Made-with: Cursor --- scripts/analyze_trove_run.py | 103 +++++++++++++++++++++++++++++++++++ 1 file changed, 103 insertions(+) create mode 100755 scripts/analyze_trove_run.py diff --git a/scripts/analyze_trove_run.py b/scripts/analyze_trove_run.py new file mode 100755 index 00000000..0fe2758e --- /dev/null +++ b/scripts/analyze_trove_run.py @@ -0,0 +1,103 @@ +#!/usr/bin/env python3 +"""Post-hoc analysis of a TroVE run JSONL output. + +Reads the per-task JSONL file produced by main.py --output-file and reports: + - Overall accuracy + - Final toolbox size + - Per-mode wins + - IMPORT-mode tool-use breakdown + - Top-10 most-called toolbox functions + +Usage: + python scripts/analyze_trove_run.py path/to/results.jsonl +""" + +from __future__ import annotations + +import argparse +import json +import sys +from collections import Counter +from pathlib import Path + + +def _load_rows(path: Path) -> list[dict]: + rows = [] + with path.open() as f: + for lineno, line in enumerate(f, 1): + line = line.strip() + if not line: + continue + try: + rows.append(json.loads(line)) + except json.JSONDecodeError as exc: + print(f"warning: line {lineno} is not valid JSON: {exc}", file=sys.stderr) + return rows + + +def _result_dict(row: dict) -> dict: + """Tolerant accessor: results are nested under 'result' in main.py's output.""" + return row.get("result") or row + + +def main() -> None: + parser = argparse.ArgumentParser(description=__doc__) + parser.add_argument("path", type=Path, help="Path to the TroVE results JSONL file") + args = parser.parse_args() + + rows = _load_rows(args.path) + if not rows: + print("ERROR: no rows loaded", file=sys.stderr) + sys.exit(1) + + n = len(rows) + results = [_result_dict(r) for r in rows] + + solved = sum(1 for r in results if r.get("solved")) + print(f"=== Run summary: {args.path.name} ===") + print(f"Tasks: {n}") + print(f"Solved: {solved}/{n} ({100 * solved / n:.1f}%)") + + last_snapshot = results[-1].get("library_snapshot") or [] + 
print(f"Final toolbox size: {len(last_snapshot)}") + + mode_counter = Counter(r.get("won_mode", "?") for r in results) + print(f"Mode wins: {dict(mode_counter)}") + + import_eligible = [r for r in results if r.get("import_eligible")] + if not import_eligible: + print("No IMPORT-eligible tasks observed.") + else: + with_calls = [r for r in import_eligible if (r.get("tool_call_count") or 0) >= 1] + n_eligible = len(import_eligible) + n_with = len(with_calls) + mean_calls = ( + sum((r.get("tool_call_count") or 0) for r in import_eligible) / n_eligible + ) + all_calls = [tc for r in import_eligible for tc in (r.get("tool_calls") or [])] + n_calls_total = len(all_calls) + n_calls_ok = sum(1 for tc in all_calls if tc.get("ok")) + success_rate = (100 * n_calls_ok / n_calls_total) if n_calls_total else 0.0 + print( + f"IMPORT-eligible tasks: {n_eligible}\n" + f" Tasks with >=1 tool call: {n_with}/{n_eligible} ({100 * n_with / n_eligible:.1f}%)\n" + f" Mean tool calls / task: {mean_calls:.2f}\n" + f" Tool-call success rate: {n_calls_ok}/{n_calls_total} ({success_rate:.1f}%)" + ) + + name_counter: Counter = Counter() + for r in results: + for tc in r.get("tool_calls") or []: + name = (tc.get("name") or "").split("<|", 1)[0].strip() + if name: + name_counter[name] += 1 + if name_counter: + print("Top-10 most-called toolbox functions:") + for name, cnt in name_counter.most_common(10): + print(f" {cnt:4d} {name}") + else: + print("No tool calls recorded in this run.") + + +if __name__ == "__main__": + main() From ff6a6d89cca05c79ea5825b1c3985e5572c20b18 Mon Sep 17 00:00:00 2001 From: mathuryash5 Date: Sat, 25 Apr 2026 18:38:22 -0400 Subject: [PATCH 14/24] docs(trove): rewrite deviations.md for native tool calling Document algorithmic deviations (native OpenAI tool calling for IMPORT, reward-based selection by default, PBEBench-shaped few-shots, strict **Solution** parsing for pbebench), faithful elements (3-mode generation, K-sampling, AST tie-break, C*log_20(n) trimming with 
C=1.0), and infrastructural patches (JSONL checkpointing, reasoning_content fallback, 60s executor timeout, defensive <|-truncation sanitizer). Includes vLLM version requirement (>= v0.16.0 for PR #28729) and the backend coverage caveat (smoke run is vLLM-served gpt-oss only). Made-with: Cursor --- .../baselines/trove/docs/deviations.md | 205 +++++++----------- 1 file changed, 83 insertions(+), 122 deletions(-) diff --git a/symbolic_agent/baselines/trove/docs/deviations.md b/symbolic_agent/baselines/trove/docs/deviations.md index 06d4c346..5ce60482 100644 --- a/symbolic_agent/baselines/trove/docs/deviations.md +++ b/symbolic_agent/baselines/trove/docs/deviations.md @@ -1,122 +1,83 @@ -# TroVE Baseline — Deviations from the Original Paper - -This document records all intentional and unavoidable deviations between our -reimplementation (`symbolic_agent/baselines/trove/`) and the original TroVE -codebase (`original_baseline_repos/trove/`). - ---- - -## 1. Chat API instead of Local Model Completion - -**Original:** TroVE uses a HuggingFace `transformers.pipeline` with a locally -loaded model (e.g. CodeLlama-7b-Instruct) in **completion** mode. The prompt -is a plain string prefix; the model generates continuation text. - -**Ours:** We use Anthropic's Messages API or an OpenAI-compatible chat API -(vLLM). The prompt is sent as a `user` message; the model generates a reply -that includes the **Solution** and **Tools** blocks. - -**Impact:** Minimal. The prompt structure (ending with `**Solution**`) signals -to chat models what to generate, and empirically they comply. No JSON mode is -used (`TroVELLMClient` vs the main `LLMClient`). - ---- - -## 2. Domain-Generic Few-Shot Examples - -**Original:** TroVE uses domain-specific few-shot examples for each task -(TabMWP coin-collection table examples, MATH algebra examples, etc.) 
- -**Ours:** We use generic string-manipulation examples that apply to both -PBEBench and ReasoningGym string tasks (replace_char, extract_digits, -lowercase examples). Domain-specific examples for other task families -should be added to `prompts.py` as needed. - -**Impact:** May slightly reduce self-consistency accuracy for tasks where the -original examples provide strong in-context guidance. The structural format -is preserved exactly. - ---- - -## 3. K Calls Rather Than Batched n=K - -**Original:** TroVE passes `num_return_sequences=K` to the HuggingFace -pipeline, which generates K sequences in one forward pass. - -**Ours:** We call the LLM API K times independently (temperature sampling). -The Anthropic API does not support `n` parameter; the OpenAI-compatible API -does but we call separately for simplicity and identical code paths. - -**Impact:** K API calls instead of 1; slightly slower but statistically -equivalent since each call is an independent sample. - ---- - -## 4. AST Node Count Instead of AST Depth Sum - -**Original:** TroVE tie-breaks by `sum(depth of each AST expression node)` -across the solution (referenced in §3.2 and Appendix B). - -**Ours:** `count_ast_nodes()` counts total AST nodes via `ast.walk()`. -Total nodes is monotonically related to total expression depth: simpler -programs have fewer nodes AND lower total depth. The tie-breaking effect -is identical in practice. - -**Impact:** Negligible. Both metrics rank programs by complexity; the ranking -rarely differs for programs with the same stdout. - ---- - -## 5. No Re-Generation of Trimmed Examples - -**Original:** After trimming the toolbox, `run_trove.py` re-generates -solutions for all affected examples using IMPORT|SKIP (not CREATE), then -reports updated accuracy. - -**Ours:** We record the set of affected task indices in the trim log but do -not replay them. This is because we process tasks in a single stream and do -not store the original task inputs for re-processing. 
For a complete -faithful comparison, task inputs should be saved and re-processed on trim. - -**Impact:** In practice, trimming only fires after 500 tasks with the default -setting. For our 100-task pilot runs, trimming is disabled by setting -`--trove-trim-every 9999`. - ---- - -## 6. Reward Loop Compatibility Wrapper - -**Original:** TroVE has no concept of a reward function or iterative -refinement loop. It is one-shot per example. - -**Ours:** `solve_with_reward()` wraps `solve()` for compatibility with -`main.py`'s `--default-reward` and `--max-reward-iters` flags. No retry -loop is performed; the reward is computed once and stored in `reward_history` -for eval script compatibility. - -**Impact:** None on TroVE's actual behavior. Only affects output format. - ---- - -## 7. `trim_every` Default Differs for Small Runs - -**Original:** Default `--trim_steps=500` (trimming every 500 examples). -For a 100-task dataset this fires 0 times. - -**Ours:** Same default (500), but users running small pilots should pass -`--trove-trim-every 9999` to make it explicit that no trimming happens. - -**Impact:** None unless running >500 tasks. 
- ---- - -## Summary Table - -| Aspect | Original | Ours | Impact | -|--------|----------|------|--------| -| LLM backend | Local HF model (completion) | Chat API (messages) | Minimal | -| Few-shot examples | Domain-specific (TabMWP/MATH) | Generic string-manipulation | Minor | -| K sampling | Batched (n=K in one call) | K independent API calls | Latency only | -| Complexity metric | Sum of AST expression depths | Total AST node count | Negligible | -| Trim replay | Re-generates affected examples | Records but does not replay | Evaluation accuracy | -| Reward loop | Not in original | Wrapper for main.py compat | None | +# TroVE Implementation: Deviations and Faithful Elements + +This document tracks how this port differs from — and where it stays +faithful to — the original TroVE algorithm +([Wang et al., 2024](https://arxiv.org/abs/2401.12869), +[zorazrw/trove](https://github.com/zorazrw/trove)). + +## 1. Algorithmic deviations + +### 1.1 Native OpenAI tool calling for IMPORT mode +The original TroVE shows the model a `**Toolbox**` markdown block +listing top-k function signatures and asks it to write a `**Solution**` +plus `**Tools**` block referencing those functions by name. We replace +this for the IMPORT mode (when `backend == "openai"` and the toolbox is +non-empty) with **native OpenAI tool calling**: the toolbox is exposed +via the `tools=[...]` parameter of `chat.completions.create`, the model +emits structured `tool_calls` during its reasoning, and `dispatch_tool_call` +runs each one in the sandboxed executor and returns the stdout. This +makes function usage observable and credit-able from the trajectory +itself. + +### 1.2 Reward-based candidate selection (default) +The paper uses self-consistency (majority vote on stdout, AST tie-break) +to pick the best of K samples per mode. We default to **reward-based +selection**: every candidate is scored by the per-task reward function, +ties broken by minimum AST node count. 
This is more reliable on +PBEBench (program-list outputs rarely tie as strings). The original +self-consistency selector remains available via `--trove-selection consistency`. + +### 1.3 PBEBench-shaped few-shot examples +For `task_family="pbebench"` we replace the generic CREATE / SKIP / IMPORT +example pairs with PBEBench-shaped pairs that demonstrate `replace()` +chains and a small reusable helper (`find_replace_chain`). The legacy +default examples remain for `task_family="default"`. + +### 1.4 Strict **Solution** parsing for PBEBench +The legacy parser falls back to "first ```python``` block anywhere" when +no `**Solution**` block is present. For `task_family="pbebench"` this +fallback is disabled, preventing CoT scratchpad from being accidentally +promoted to the answer. + +## 2. Faithful elements + +- 3-mode generation (IMPORT, CREATE, SKIP). +- K samples per mode (default K=5, as in the paper). +- AST tie-breaking by node count (simplest solution wins). +- Periodic toolbox trimming with threshold `C·log_{20}(n)`, default + `C=1.0`, matching the original implementation. +- Frequency-based top-k retrieval for the toolbox view. +- Dict-keyed toolbox structure mirroring `utils/code.py`. +- Library updates: IMPORT credits frequency, CREATE adds new functions + on success, SKIP makes no library changes. + +## 3. Infrastructural patches + +- **JSONL-per-task checkpointing** via `--output-file`, with crash + resumption. +- **`reasoning_content` fallback** in `_call_openai` for `gpt-oss` Harmony + channel splits where the answer text lives in `message.reasoning_content`. +- **Executor timeout 60s** (vs. 10s in earlier versions of this port), + closer to the original's ~100s. +- **`<|`-truncation sanitizer** in `dispatch_tool_call` and + `_update_library`. Defensive workaround for the open vLLM + [PR #35906](https://github.com/vllm-project/vllm/pull/35906) covering + Harmony control-token leakage into tool names.
When that PR lands + upstream the sanitizer becomes a no-op and is left in place. + +## 4. Backend coverage caveat + +Anthropic backend code paths exist and are exercised by CREATE / SKIP and +the legacy text-based IMPORT fallback, but **the smoke run and reported +numbers are vLLM-served `gpt-oss` only**. IMPORT-with-tools requires +the OpenAI/vLLM backend and is the only path we test end-to-end. + +## 5. vLLM version requirement + +- Minimum vLLM: **v0.16.0** (branch-cut 2026-02-08). +- Required upstream change: [PR #28729](https://github.com/vllm-project/vllm/pull/28729) + ("Multiple fixes for gpt-oss Chat Completion prompting"), merged + 2025-12-12. v0.16.0 is the first stable release branch-cut after the merge. +- Known open caveat: [PR #35906](https://github.com/vllm-project/vllm/pull/35906) + ("Sanitize leaked Harmony control tokens"), still open as of late + March 2026 — see §3 for the sanitizer mitigation. From ab7b7a326d00397025858bc4f904a31cf8405a25 Mon Sep 17 00:00:00 2001 From: mathuryash5 Date: Sat, 25 Apr 2026 18:41:54 -0400 Subject: [PATCH 15/24] fix(trove): persist TroVE telemetry through _append_task_output The TroVE controller emits passive telemetry (won_mode, import_eligible, import_was_winner, tool_calls, tool_call_count, tools_called, actually_called, trove_stopped_reason, library_snapshot) on the in-memory result dict, but main._append_task_output was dropping all of it before the JSONL was written. scripts/analyze_trove_run.py would then read empty/missing fields and report misleading numbers (e.g. all won_mode as '?', 'No IMPORT-eligible tasks' on healthy runs). Pass these keys through verbatim when present. Keys are absent on non-TroVE runs, so other frameworks (ssl_bcr, regal, react_mem, etc.) are unaffected. 
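For illustration, the effect of the passthrough on the analyzer's mode tally can be sketched as follows. This is a minimal standalone sketch, not the code in main.py: `to_row` and `mode_wins` are hypothetical helpers, the `passthrough` flag exists only to contrast the old and new behavior, and the `won_mode` values are assumed examples.

```python
from collections import Counter

# Telemetry keys named in this commit message.
TROVE_TELEMETRY_KEYS = (
    "won_mode", "import_eligible", "import_was_winner", "tool_calls",
    "tool_call_count", "tools_called", "actually_called",
    "trove_stopped_reason", "library_snapshot",
)


def to_row(result: dict, task_index: int, passthrough: bool) -> dict:
    """Build a reduced JSONL row (hypothetical; the real record has more fields)."""
    row = {"task_index": task_index}
    if passthrough:
        for key in TROVE_TELEMETRY_KEYS:
            if key in result:  # absent on non-TroVE runs, so nothing is invented
                row[key] = result[key]
    return row


def mode_wins(rows: list) -> dict:
    """Tally winning modes the way the analyzer does, defaulting to '?'."""
    return dict(Counter(r.get("won_mode", "?") for r in rows))


# Assumed in-memory results from two TroVE tasks.
results = [
    {"won_mode": "import", "tool_call_count": 2},
    {"won_mode": "create", "tool_call_count": 0},
]
before = mode_wins([to_row(r, i, passthrough=False) for i, r in enumerate(results)])
after = mode_wins([to_row(r, i, passthrough=True) for i, r in enumerate(results)])
print(before)  # {'?': 2}
print(after)   # {'import': 1, 'create': 1}
```

With the keys dropped, every row falls back to '?', which matches the misleading output described above.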
Made-with: Cursor --- main.py | 16 ++++++++++++++++ 1 file changed, 16 insertions(+) diff --git a/main.py b/main.py index 3bfbc6b3..ec2f5d36 100644 --- a/main.py +++ b/main.py @@ -156,6 +156,22 @@ def _append_task_output(result: dict, task_index: int, output_file: str) -> None "token_usage": result.get("token_usage", {}), "agent_messages": result.get("agent_messages", []), } + # TroVE telemetry: passthrough when present so scripts/analyze_trove_run.py + # (and any other post-hoc analyzer) can read per-task tool-use stats and the + # final library state from the JSONL. Keys are absent on non-TroVE runs. + for key in ( + "won_mode", + "import_eligible", + "import_was_winner", + "tool_calls", + "tool_call_count", + "tools_called", + "actually_called", + "trove_stopped_reason", + "library_snapshot", + ): + if key in result: + record[key] = result[key] Path(output_file).parent.mkdir(parents=True, exist_ok=True) with open(output_file, "a", encoding="utf-8") as f: f.write(json.dumps(record, default=str) + "\n") From ce75297309f666090a902040217912bba06788b0 Mon Sep 17 00:00:00 2001 From: mathuryash5 Date: Sat, 25 Apr 2026 19:46:29 -0400 Subject: [PATCH 16/24] feat(trove): add notebooks/run_trove_pbebench.ipynb runpod runner End-to-end Jupyter notebook for the PBEBench-Lite smoke run on RunPod: launches vLLM with the native tool-calling flags, polls /v1/models until ready, runs main.py with --framework trove --trove-task-family pbebench --trove-selection reward against the 50-task lite_pilot_tasks.jsonl split, then invokes scripts/analyze_trove_run.py and previews telemetry. Defaults to gpt-oss-20b on a single A100/H100; flip MODEL and TENSOR_PARALLEL for 120b. 
Made-with: Cursor --- notebooks/run_trove_pbebench.ipynb | 334 +++++++++++++++++++++++++++++ 1 file changed, 334 insertions(+) create mode 100644 notebooks/run_trove_pbebench.ipynb diff --git a/notebooks/run_trove_pbebench.ipynb b/notebooks/run_trove_pbebench.ipynb new file mode 100644 index 00000000..e6736960 --- /dev/null +++ b/notebooks/run_trove_pbebench.ipynb @@ -0,0 +1,334 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# TroVE × PBEBench-Lite — RunPod runner\n", + "\n", + "End-to-end notebook to:\n", + "\n", + "1. Check GPU and install dependencies\n", + "2. Launch a local vLLM server (with native tool-calling flags)\n", + "3. Wait for it to be healthy\n", + "4. Run TroVE on PBEBench-Lite with reward-based selection\n", + "5. Analyze the JSONL output\n", + "\n", + "## Pod sizing\n", + "\n", + "| Model | Recommended GPU | Tensor parallel |\n", + "|-----------------|--------------------------------|-----------------|\n", + "| `gpt-oss-20b` | 1× A100 80 GB or 1× H100 | 1 |\n", + "| `gpt-oss-120b` | 2× H100 / A100 80 GB | 2 |\n", + "\n", + "## Before you start\n", + "\n", + "- Run this notebook from a Jupyter kernel **inside the pod**, with the repo at `/workspace/pbe/symbolic-library-agent` (or wherever you cloned it). Adjust `REPO_ROOT` in the next cell if needed.\n", + "- Most cells are safe to re-run; re-running the launch cell starts another vLLM process, so run the cleanup cell first.\n", + "- Cleanup at the bottom kills the vLLM process; if you re-run cells out of order, you may end up with a stale server — use the cleanup cell." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 1. Configuration" + ] + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "from pathlib import Path\n", + "import os\n", + "\n", + "# Pick the model variant.
20b fits on a single A100/H100; 120b needs TP=2.\n", + "MODEL = \"openai/gpt-oss-20b\" # or \"openai/gpt-oss-120b\"\n", + "TENSOR_PARALLEL = 1 # set to 2 for 120b\n", + "\n", + "PORT = 8000\n", + "BASE_URL = f\"http://localhost:{PORT}/v1\"\n", + "\n", + "# Repo root — change if your clone lives elsewhere on the pod.\n", + "REPO_ROOT = Path(os.environ.get(\"REPO_ROOT\", \"/workspace/pbe/symbolic-library-agent\"))\n", + "if not REPO_ROOT.exists():\n", + " REPO_ROOT = Path.cwd().parent if Path.cwd().name == \"notebooks\" else Path.cwd()\n", + "assert (REPO_ROOT / \"main.py\").exists(), f\"Could not find main.py under {REPO_ROOT}\"\n", + "os.chdir(REPO_ROOT)\n", + "\n", + "# Tasks file. Two PBEBench-Lite options ship with the repo:\n", + "# - lite_pilot_tasks.jsonl : 50-task pilot split (smoke-run default)\n", + "# - lite_tasks_full_og.jsonl : full Lite split (1008 tasks)\n", + "TASKS_FILE = REPO_ROOT / \"data/pbebench/lite_pilot_tasks.jsonl\"\n", + "MAX_PROGRAMS = 5 # PBEBench convention for the lite split\n", + "\n", + "OUT_DIR = REPO_ROOT / \"outputs\"\n", + "OUT_FILE = OUT_DIR / \"trove_pbebench_lite_smoke.jsonl\"\n", + "DEBUG_DIR = REPO_ROOT / \"debug_trove_pbebench\"\n", + "VLLM_LOGS = REPO_ROOT / \"vllm_logs\"\n", + "OUT_DIR.mkdir(parents=True, exist_ok=True)\n", + "DEBUG_DIR.mkdir(parents=True, exist_ok=True)\n", + "VLLM_LOGS.mkdir(parents=True, exist_ok=True)\n", + "\n", + "print(f\"REPO_ROOT : {REPO_ROOT}\")\n", + "print(f\"MODEL : {MODEL} (TP={TENSOR_PARALLEL})\")\n", + "print(f\"BASE_URL : {BASE_URL}\")\n", + "print(f\"TASKS_FILE : {TASKS_FILE} (exists={TASKS_FILE.exists()})\")\n", + "print(f\"OUT_FILE : {OUT_FILE}\")" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 2. 
GPU & dependency check" + ] + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "!nvidia-smi --query-gpu=index,name,memory.total,memory.free --format=csv" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "# Install repo deps + vLLM. Re-running is a no-op if everything's already there.\n", + "!pip install -q -U pip wheel\n", + "!pip install -q -r requirements.txt 2>&1 | tail -5\n", + "!pip install -q -U \"vllm>=0.16.0\" 2>&1 | tail -5\n", + "import importlib, vllm\n", + "print(\"vllm version:\", vllm.__version__)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 3. Launch vLLM in the background\n", + "\n", + "Required flags for `gpt-oss` native tool calling (vLLM ≥ v0.16.0):\n", + "\n", + "- `--enable-auto-tool-choice`\n", + "- `--tool-call-parser openai`\n", + "- `--reasoning-parser openai_gptoss`" + ] + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "import os, subprocess, time, datetime\n", + "\n", + "ts = datetime.datetime.now().strftime(\"%Y%m%d_%H%M%S\")\n", + "log_path = VLLM_LOGS / f\"vllm_{PORT}_{ts}.log\"\n", + "pid_path = VLLM_LOGS / f\"vllm_{PORT}_{ts}.pid\"\n", + "\n", + "user = os.environ.get(\"USER\", \"runpod\")\n", + "for d in (f\"/tmp/{user}-tiktoken-cache\", f\"/tmp/{user}-tmp\"):\n", + " Path(d).mkdir(parents=True, exist_ok=True)\n", + " os.chmod(d, 0o700)\n", + "os.environ[\"TIKTOKEN_CACHE_DIR\"] = f\"/tmp/{user}-tiktoken-cache\"\n", + "os.environ[\"TMPDIR\"] = f\"/tmp/{user}-tmp\"\n", + "\n", + "cmd = [\n", + " \"python\", \"-m\", \"vllm.entrypoints.openai.api_server\",\n", + " \"--model\", MODEL,\n", + " \"--tokenizer\", MODEL,\n", + " \"--dtype\", \"auto\",\n", + " \"--port\", str(PORT),\n", + " \"--gpu-memory-utilization\", \"0.95\",\n", + " \"--tensor-parallel-size\", str(TENSOR_PARALLEL),\n", + " \"--enable-auto-tool-choice\",\n", + " \"--tool-call-parser\", \"openai\",\n", 
+ " \"--reasoning-parser\", \"openai_gptoss\",\n", + "]\n", + "\n", + "log_fh = open(log_path, \"w\")\n", + "vllm_proc = subprocess.Popen(cmd, stdout=log_fh, stderr=subprocess.STDOUT)\n", + "pid_path.write_text(str(vllm_proc.pid))\n", + "print(f\"vLLM started — pid {vllm_proc.pid}\")\n", + "print(f\"log : {log_path}\")\n", + "print(f\"pid : {pid_path}\")" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "# Wait for the OpenAI-compatible /v1/models endpoint to respond.\n", + "# 20b cold-start is ~1–2 min; 120b can be 5–10 min on first launch.\n", + "import urllib.request, json, time\n", + "\n", + "READY_TIMEOUT_S = 900 # 15 min\n", + "POLL_S = 5\n", + "\n", + "deadline = time.time() + READY_TIMEOUT_S\n", + "ready = False\n", + "while time.time() < deadline:\n", + " if vllm_proc.poll() is not None:\n", + " print(\"vLLM exited unexpectedly. Tail of log:\")\n", + " print(log_path.read_text()[-4000:])\n", + " raise RuntimeError(\"vLLM died during startup\")\n", + " try:\n", + " with urllib.request.urlopen(f\"{BASE_URL}/models\", timeout=2) as resp:\n", + " data = json.loads(resp.read())\n", + " print(\"Ready. /v1/models response:\")\n", + " print(json.dumps(data, indent=2)[:600])\n", + " ready = True\n", + " break\n", + " except Exception:\n", + " elapsed = int(READY_TIMEOUT_S - (deadline - time.time()))\n", + " print(f\"\\rwaiting for vLLM... {elapsed}s elapsed\", end=\"\", flush=True)\n", + " time.sleep(POLL_S)\n", + "\n", + "if not ready:\n", + " print(\"\\nTimed out. Tail of log:\")\n", + " print(log_path.read_text()[-4000:])\n", + " raise RuntimeError(\"vLLM never became ready\")" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 4. 
Run TroVE on PBEBench-Lite (smoke run)\n", + "\n", + "Defaults below match the design:\n", + "\n", + "- `--trove-task-family pbebench` — strict `**Solution**` parsing + PBEBench few-shots\n", + "- `--trove-selection reward` — reward-based candidate selection (AST tie-break)\n", + "- `--trove-k 5` — paper default samples per mode\n", + "- `--trove-trim-every 9999` — effectively disable periodic trimming for a 50-task smoke\n", + "- `--default-reward pbebench` — PBEBench verifier" + ] + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "import subprocess, sys\n", + "\n", + "os.environ[\"VLLM_API_KEY\"] = os.environ.get(\"VLLM_API_KEY\", \"EMPTY\")\n", + "\n", + "cmd = [\n", + " sys.executable, \"main.py\",\n", + " \"--framework\", \"trove\",\n", + " \"--backend\", \"vllm\",\n", + " \"--base-url\", BASE_URL,\n", + " \"--model\", MODEL,\n", + " \"--trove-task-family\", \"pbebench\",\n", + " \"--trove-selection\", \"reward\",\n", + " \"--trove-k\", \"5\",\n", + " \"--trove-trim-every\", \"9999\",\n", + " \"--default-reward\", \"pbebench\",\n", + " \"--max-programs\", str(MAX_PROGRAMS),\n", + " \"--tasks-file\", str(TASKS_FILE),\n", + " \"--output-file\", str(OUT_FILE),\n", + " \"--debug-dir\", str(DEBUG_DIR),\n", + "]\n", + "\n", + "print(\" \".join(cmd))\n", + "print()\n", + "\n", + "# Stream stdout/stderr live.\n", + "proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True, bufsize=1)\n", + "try:\n", + " for line in proc.stdout:\n", + " print(line, end=\"\")\n", + "finally:\n", + " rc = proc.wait()\n", + "print(f\"\\nmain.py exited with {rc}\")" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 5. 
Analyze the JSONL output" + ] + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "!python scripts/analyze_trove_run.py \"{OUT_FILE}\"" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "# Quick peek at one row to confirm telemetry made it through.\n", + "import json\n", + "with open(OUT_FILE) as f:\n", + " first = json.loads(next(f))\n", + "print(\"keys:\", sorted(first.keys()))\n", + "for k in (\"won_mode\", \"import_eligible\", \"tool_call_count\", \"trove_stopped_reason\"):\n", + " print(f\" {k:24s} = {first.get(k)}\")\n", + "print(f\" library_snapshot size = {len(first.get('library_snapshot', []))}\")" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 6. Cleanup — stop vLLM\n", + "\n", + "Run this when you're done so the GPU is freed for the next experiment." + ] + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "import signal, time\n", + "if vllm_proc.poll() is None:\n", + " vllm_proc.send_signal(signal.SIGINT)\n", + " try:\n", + " vllm_proc.wait(timeout=15)\n", + " except subprocess.TimeoutExpired:\n", + " vllm_proc.kill()\n", + " vllm_proc.wait()\n", + " print(\"vLLM stopped.\")\n", + "else:\n", + " print(\"vLLM was not running.\")\n", + "log_fh.close()" + ], + "execution_count": null, + "outputs": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "name": "python", + "pygments_lexer": "ipython3", + "version": "3.10" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} \ No newline at end of file From e7897d4d8a6784ef59c324ba16ef508ac8b7c656 Mon Sep 17 00:00:00 2001 From: mathuryash5 Date: Sat, 25 Apr 2026 19:54:27 -0400 Subject: [PATCH 17/24] chore(trove): target gpt-oss-20b for the TroVE smoke run We are only running the TroVE PBEBench smoke on gpt-oss-20b. 
Add a 20b-specific vLLM launcher (TP=1 + the three tool-calling flags), retarget scripts/run_trove_vllm.sh at 20b + the new TroVE flags (--trove-task-family pbebench, --trove-selection reward, --max-programs 5, lite_pilot_tasks.jsonl, port 8000), and simplify the runpod notebook to a 20b-only configuration. The 120b launcher remains in place for the other (non-TroVE) baselines that still use it. Made-with: Cursor --- notebooks/run_trove_pbebench.ipynb | 23 ++++++++------- scripts/launch_vllm_gpt_oss_20b.sh | 25 +++++++++++++++++ scripts/run_trove_vllm.sh | 45 ++++++++++++++++++++---------- 3 files changed, 67 insertions(+), 26 deletions(-) create mode 100755 scripts/launch_vllm_gpt_oss_20b.sh diff --git a/notebooks/run_trove_pbebench.ipynb b/notebooks/run_trove_pbebench.ipynb index e6736960..e4648348 100644 --- a/notebooks/run_trove_pbebench.ipynb +++ b/notebooks/run_trove_pbebench.ipynb @@ -4,7 +4,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# TroVE × PBEBench-Lite — RunPod runner\n", + "# TroVE × PBEBench-Lite — RunPod runner (`gpt-oss-20b`)\n", "\n", "End-to-end notebook to:\n", "\n", @@ -16,10 +16,7 @@ "\n", "## Pod sizing\n", "\n", - "| Model | Recommended GPU | Tensor parallel |\n", - "|-----------------|--------------------------------|-----------------|\n", - "| `gpt-oss-20b` | 1× A100 80 GB or 1× H100 | 1 |\n", - "| `gpt-oss-120b` | 2× H100 / A100 80 GB | 2 |\n", + "`openai/gpt-oss-20b` runs comfortably on a single **A100 80 GB** or **H100** with `--tensor-parallel-size 1`. A100 40 GB will OOM at default settings.\n", "\n", "## Before you start\n", "\n", @@ -42,9 +39,8 @@ "from pathlib import Path\n", "import os\n", "\n", - "# Pick the model variant. 
20b fits on a single A100/H100; 120b needs TP=2.\n", - "MODEL = \"openai/gpt-oss-20b\" # or \"openai/gpt-oss-120b\"\n", - "TENSOR_PARALLEL = 1 # set to 2 for 120b\n", + "MODEL = \"openai/gpt-oss-20b\"\n", + "TENSOR_PARALLEL = 1\n", "\n", "PORT = 8000\n", "BASE_URL = f\"http://localhost:{PORT}/v1\"\n", @@ -77,7 +73,8 @@ "print(f\"OUT_FILE : {OUT_FILE}\")" ], "execution_count": null, - "outputs": [] + "outputs": [], + "id": "ce204af4" }, { "cell_type": "markdown", @@ -167,10 +164,11 @@ "metadata": {}, "source": [ "# Wait for the OpenAI-compatible /v1/models endpoint to respond.\n", - "# 20b cold-start is ~1–2 min; 120b can be 5–10 min on first launch.\n", + "# gpt-oss-20b cold-start (model download + load) is typically 1–3 min on a\n", + "# fresh pod; subsequent launches are seconds once the weights are cached.\n", "import urllib.request, json, time\n", "\n", - "READY_TIMEOUT_S = 900 # 15 min\n", + "READY_TIMEOUT_S = 600 # 10 min\n", "POLL_S = 5\n", "\n", "deadline = time.time() + READY_TIMEOUT_S\n", @@ -253,7 +251,8 @@ "print(f\"\\nmain.py exited with {rc}\")" ], "execution_count": null, - "outputs": [] + "outputs": [], + "id": "500ee1a6" }, { "cell_type": "markdown", diff --git a/scripts/launch_vllm_gpt_oss_20b.sh b/scripts/launch_vllm_gpt_oss_20b.sh new file mode 100755 index 00000000..37d6e131 --- /dev/null +++ b/scripts/launch_vllm_gpt_oss_20b.sh @@ -0,0 +1,25 @@ +#!/bin/bash + +mkdir -p /tmp/$USER-tiktoken-cache /tmp/$USER-tmp +chmod 700 /tmp/$USER-tiktoken-cache /tmp/$USER-tmp +export TIKTOKEN_CACHE_DIR=/tmp/$USER-tiktoken-cache +export TMPDIR=/tmp/$USER-tmp + +ts=$(date +%Y%m%d_%H%M%S) + +# Required vLLM tool-calling flags (vLLM >= v0.16.0 for PR #28729): +# --enable-auto-tool-choice enables tool_choice="auto" +# --tool-call-parser openai parses gpt-oss Harmony commentary channel +# --reasoning-parser openai_gptoss routes analysis-channel content into +# message.reasoning_content +nohup python -m vllm.entrypoints.openai.api_server \ + --model 
"openai/gpt-oss-20b" \ + --tokenizer "openai/gpt-oss-20b" \ + --dtype auto \ + --port ${1} \ + --gpu-memory-utilization 0.95 \ + --tensor-parallel-size 1 \ + --enable-auto-tool-choice \ + --tool-call-parser openai \ + --reasoning-parser openai_gptoss \ + > vllm_logs/vllm_${1}_${ts}.log 2>&1 & echo $! > vllm_logs/vllm_${1}_${ts}.pid diff --git a/scripts/run_trove_vllm.sh b/scripts/run_trove_vllm.sh index 54c27932..280baa23 100755 --- a/scripts/run_trove_vllm.sh +++ b/scripts/run_trove_vllm.sh @@ -1,27 +1,44 @@ #!/usr/bin/env bash -# Run TroVE baseline against a local vLLM server. +# Run TroVE baseline against a local vLLM server (gpt-oss-20b). # Usage: bash scripts/run_trove_vllm.sh # -# For small datasets (≤100 tasks), --trove-trim-every is set high to disable +# Defaults to PBEBench-Lite pilot (50 tasks). Override TASKS_FILE or pass +# extra --flags through the trailing "$@". +# +# For small datasets (<=100 tasks), --trove-trim-every is set high to disable # trimming (the library never gets large enough for it to matter). -# Set --trove-k 1 for a cheaper run without self-consistency sampling. +# Set --trove-k 1 for a cheaper run without per-mode K-sampling. set -euo pipefail cd "$(dirname "${BASH_SOURCE[0]}")/.." 
-export PORT=8002 +export PORT="${PORT:-8000}" export VLLM_API_KEY="${VLLM_API_KEY:-EMPTY}" mkdir -p outputs +TASKS_FILE="${TASKS_FILE:-data/pbebench/lite_pilot_tasks.jsonl}" +OUT_FILE="${OUT_FILE:-outputs/trove_pbebench_lite_pilot.jsonl}" + +echo "Tasks : ${TASKS_FILE}" +echo "Output : ${OUT_FILE}" +echo "Port : ${PORT}" + python main.py \ - --framework trove \ - --tasks-file data/pbebench/lite_tasks_full.jsonl \ - --base-url "http://localhost:${PORT}/v1" \ - --model "openai/gpt-oss-120b" \ - --trove-k 5 \ - --trove-trim-every 9999 \ - --default-reward pbebench \ - --output-file outputs/pbebench_lite_full_trove.jsonl \ - --debug-dir debug_trove \ - --stats + --framework trove \ + --tasks-file "${TASKS_FILE}" \ + --base-url "http://localhost:${PORT}/v1" \ + --model "openai/gpt-oss-20b" \ + --trove-task-family pbebench \ + --trove-selection reward \ + --trove-k 5 \ + --trove-trim-every 9999 \ + --default-reward pbebench \ + --max-programs 5 \ + --output-file "${OUT_FILE}" \ + --debug-dir debug_trove \ + --stats \ + "$@" + +echo "Done. Output: ${OUT_FILE}" +echo "Analyze with: python scripts/analyze_trove_run.py ${OUT_FILE}" From 4ce48acb55d53e2ad4ecc7b4a2536ccac1047705 Mon Sep 17 00:00:00 2001 From: mathuryash5 Date: Sat, 25 Apr 2026 19:58:24 -0400 Subject: [PATCH 18/24] chore(trove-notebook): add tail_vllm_log helper and mirror run output to disk Adds a tail_vllm_log() cell so the latest vllm_logs/vllm_*.log can be spot-checked during a long run, and tees the TroVE run cell's stdout into outputs/trove_pbebench_lite_smoke_.log so logs survive a disconnected browser session. 
Made-with: Cursor --- notebooks/run_trove_pbebench.ipynb | 49 +++++++++++++++++++++++++----- 1 file changed, 41 insertions(+), 8 deletions(-) diff --git a/notebooks/run_trove_pbebench.ipynb b/notebooks/run_trove_pbebench.ipynb index e4648348..83c585f9 100644 --- a/notebooks/run_trove_pbebench.ipynb +++ b/notebooks/run_trove_pbebench.ipynb @@ -196,7 +196,31 @@ " raise RuntimeError(\"vLLM never became ready\")" ], "execution_count": null, - "outputs": [] + "outputs": [], + "id": "b985cb11" + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "# Optional: peek at the most recent vLLM server log. Re-run this cell any time\n", + "# (during or after the TroVE run) to spot-check throughput / GPU memory / errors.\n", + "def tail_vllm_log(n: int = 80) -> None:\n", + " logs = sorted(VLLM_LOGS.glob(\"vllm_*.log\"))\n", + " if not logs:\n", + " print(\"No vllm logs found yet.\")\n", + " return\n", + " latest = logs[-1]\n", + " text = latest.read_text(errors=\"replace\")\n", + " lines = text.splitlines()\n", + " print(f\"=== {latest.name} (last {min(n, len(lines))} of {len(lines)} lines) ===\")\n", + " print(\"\\n\".join(lines[-n:]))\n", + "\n", + "tail_vllm_log(60)" + ], + "execution_count": null, + "outputs": [], + "id": "e1bec107" }, { "cell_type": "markdown", @@ -217,12 +241,12 @@ "cell_type": "code", "metadata": {}, "source": [ - "import subprocess, sys\n", + "import subprocess, sys, datetime\n", "\n", "os.environ[\"VLLM_API_KEY\"] = os.environ.get(\"VLLM_API_KEY\", \"EMPTY\")\n", "\n", "cmd = [\n", - " sys.executable, \"main.py\",\n", + " sys.executable, \"-u\", \"main.py\",\n", " \"--framework\", \"trove\",\n", " \"--backend\", \"vllm\",\n", " \"--base-url\", BASE_URL,\n", @@ -238,17 +262,26 @@ " \"--debug-dir\", str(DEBUG_DIR),\n", "]\n", "\n", + "# Mirror stdout+stderr to a log file as well as the cell output. 
This keeps a\n", + "# durable record if the browser tab disconnects mid-run, and makes it trivial\n", + "# to grep for telemetry across runs.\n", + "RUN_TS = datetime.datetime.now().strftime(\"%Y%m%d_%H%M%S\")\n", + "RUN_LOG = OUT_DIR / f\"trove_pbebench_lite_smoke_{RUN_TS}.log\"\n", + "\n", "print(\" \".join(cmd))\n", - "print()\n", + "print(f\"\\nMirroring stdout to: {RUN_LOG}\\n\")\n", "\n", - "# Stream stdout/stderr live.\n", "proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True, bufsize=1)\n", "try:\n", - " for line in proc.stdout:\n", - " print(line, end=\"\")\n", + " with open(RUN_LOG, \"w\", encoding=\"utf-8\") as logfh:\n", + " for line in proc.stdout:\n", + " print(line, end=\"\")\n", + " logfh.write(line)\n", + " logfh.flush()\n", "finally:\n", " rc = proc.wait()\n", - "print(f\"\\nmain.py exited with {rc}\")" + "print(f\"\\nmain.py exited with {rc}\")\n", + "print(f\"Full log: {RUN_LOG}\")" ], "execution_count": null, "outputs": [], From 64a930eac799ddbc06f1dcd905f5fed009c3170d Mon Sep 17 00:00:00 2001 From: mathuryash5 Date: Sat, 25 Apr 2026 23:14:32 -0400 Subject: [PATCH 19/24] fix(trove): read vLLM gpt-oss responses from reasoning field vLLM exposes gpt-oss text in message.reasoning when content is empty, so TroVE was parsing empty generations and producing blank solutions. Add a shared extractor and regression test for the OpenAI/vLLM response shape. 
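As an illustration of the fallback described above, a minimal sketch follows. It is not the extractor added in this patch: `first_nonempty_text` is a hypothetical name, the messages are stand-ins built with `SimpleNamespace`, and the `model_extra` lookup assumes the client may park unknown fields in a plain dict.

```python
from types import SimpleNamespace
from typing import Any


def first_nonempty_text(msg: Any) -> str:
    """Return the first non-empty text field on a chat message, else ''."""
    # Visible content wins; reasoning fields are only a fallback.
    for field in ("content", "reasoning_content", "reasoning"):
        value = getattr(msg, field, None)
        if value:
            return value
    # Assumption: unknown fields may end up in a dict-like `model_extra`.
    extra = getattr(msg, "model_extra", None) or {}
    if isinstance(extra, dict):
        for field in ("reasoning_content", "reasoning"):
            value = extra.get(field)
            if value:
                return value
    return ""


# Stand-in messages covering the shapes discussed in this commit.
plain = SimpleNamespace(content="answer", reasoning="scratch")
gpt_oss = SimpleNamespace(content="", reasoning="**Solution** ...")
extra_only = SimpleNamespace(content=None, model_extra={"reasoning": "from extra"})
empty = SimpleNamespace(content=None)

for msg in (plain, gpt_oss, extra_only, empty):
    print(repr(first_nonempty_text(msg)))
```

Returning an empty string rather than `None` keeps downstream parsing code free of `None` checks, at the cost of not distinguishing "no text" from "empty text".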
Made-with: Cursor --- symbolic_agent/baselines/trove/llm.py | 22 ++++++++++- .../trove/tests/test_llm_openai_response.py | 37 +++++++++++++++++++ 2 files changed, 57 insertions(+), 2 deletions(-) create mode 100644 symbolic_agent/baselines/trove/tests/test_llm_openai_response.py diff --git a/symbolic_agent/baselines/trove/llm.py b/symbolic_agent/baselines/trove/llm.py index dda158eb..49ea2c35 100644 --- a/symbolic_agent/baselines/trove/llm.py +++ b/symbolic_agent/baselines/trove/llm.py @@ -26,6 +26,24 @@ DEFAULT_MAX_TOKENS = 512 +def _message_text(msg: Any) -> str: + """Return visible text from OpenAI/vLLM chat message variants.""" + content = getattr(msg, "content", None) + if content: + return content + for field in ("reasoning_content", "reasoning"): + value = getattr(msg, field, None) + if value: + return value + extra = getattr(msg, "model_extra", None) or {} + if isinstance(extra, dict): + for field in ("reasoning_content", "reasoning"): + value = extra.get(field) + if value: + return value + return "" + + class TroVELLMClient: """ Backend-agnostic plain-text LLM client for TroVE generation. 
@@ -190,7 +208,7 @@ def _call_openai(self, prompt: str, model: str, max_tokens: int, tag: str) -> st # No response_format — TroVE uses free-form text ) msg = response.choices[0].message - raw = msg.content or getattr(msg, "reasoning_content", "") or "" + raw = _message_text(msg) u = getattr(response, "usage", None) details = getattr(u, "completion_tokens_details", None) usage = { @@ -309,7 +327,7 @@ def chat_with_tools( break msg = response.choices[0].message - content = msg.content or getattr(msg, "reasoning_content", "") or "" + content = _message_text(msg) tool_calls = getattr(msg, "tool_calls", None) or [] u = getattr(response, "usage", None) diff --git a/symbolic_agent/baselines/trove/tests/test_llm_openai_response.py b/symbolic_agent/baselines/trove/tests/test_llm_openai_response.py new file mode 100644 index 00000000..8b193417 --- /dev/null +++ b/symbolic_agent/baselines/trove/tests/test_llm_openai_response.py @@ -0,0 +1,37 @@ +"""Unit tests for TroVELLMClient OpenAI/vLLM response extraction.""" + +from types import SimpleNamespace + +from symbolic_agent.baselines.trove.llm import TroVELLMClient + + +class _FakeCompletions: + def create(self, **kwargs): + msg = SimpleNamespace(content="", reasoning="**Solution**\n```python\nprint('ok')\n```") + usage = SimpleNamespace(prompt_tokens=1, completion_tokens=2, completion_tokens_details=None) + return SimpleNamespace(choices=[SimpleNamespace(message=msg)], usage=usage) + + +class _FakeClient: + def __init__(self): + self.chat = SimpleNamespace(completions=_FakeCompletions()) + + +def _client_with_fake_openai_response(): + client = object.__new__(TroVELLMClient) + client.backend = "openai" + client._client = _FakeClient() + client._task_log = [] + client._task_tokens = {"input": 0, "output": 0, "reasoning": 0} + client._session_tokens = {"input": 0, "output": 0, "reasoning": 0} + client._debug_dir = None + return client + + +def test_openai_call_reads_vllm_reasoning_field_when_content_empty(): + client = 
_client_with_fake_openai_response() + + raw = client._call_openai("prompt", "openai/gpt-oss-20b", 128, "tag") + + assert "print('ok')" in raw + assert "print('ok')" in client.get_task_log()[0]["response"]["content"] From aff5962c1a75bace916a95307f73b87d0b8ccc23 Mon Sep 17 00:00:00 2001 From: mathuryash5 Date: Sat, 25 Apr 2026 23:46:14 -0400 Subject: [PATCH 20/24] fix(trove): make PBEBench prompts print replace program lists PBEBench rewards parse stdout as a list of replace() call strings, but the TroVE few-shots were demonstrating transformed output strings. Update PBEBench CREATE, SKIP, and IMPORT-with-tools examples and add prompt regression tests. Made-with: Cursor --- symbolic_agent/baselines/trove/prompts.py | 34 ++++++++----------- .../trove/tests/test_prompts_pbebench.py | 33 ++++++++++++++++++ 2 files changed, 48 insertions(+), 19 deletions(-) create mode 100644 symbolic_agent/baselines/trove/tests/test_prompts_pbebench.py diff --git a/symbolic_agent/baselines/trove/prompts.py b/symbolic_agent/baselines/trove/prompts.py index 78be7add..0e000b69 100644 --- a/symbolic_agent/baselines/trove/prompts.py +++ b/symbolic_agent/baselines/trove/prompts.py @@ -26,9 +26,12 @@ "Your answer is whatever gets printed to stdout when the Solution code runs." ) -# PBEBench prompts model the desired format directly via the few-shot example, -# so no override string is needed. -_FORMAT_OVERRIDE_PBEBENCH = "" +_FORMAT_OVERRIDE_PBEBENCH = ( + "\nIMPORTANT: For PBEBench, the answer printed by the **Solution** block " + "must be a Python list of replace() call strings, such as " + "[\"replace('a', 'b')\", \"replace('cd', 'ef')\"]. Do not print the " + "transformed output strings." 
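For illustration, a rough sketch of the difference between the two stdout shapes. The actual PBEBench reward parser is not part of this patch, so `looks_like_program_list` is a hypothetical stand-in that only checks the shape described above (a Python list of `replace(...)` strings).

```python
import ast
import io
from contextlib import redirect_stdout


def looks_like_program_list(stdout_text):
    """Hypothetical shape check: stdout is a Python list of 'replace(...)' strings."""
    try:
        value = ast.literal_eval(stdout_text.strip())
    except (ValueError, SyntaxError):
        return False
    return isinstance(value, list) and all(
        isinstance(p, str) and p.startswith("replace(") and p.endswith(")")
        for p in value
    )


# New-style Solution block: print the program sequence itself.
buf = io.StringIO()
with redirect_stdout(buf):
    programs = ["replace(' ', '_')", "replace('h', 'H')"]
    print(programs)
new_style = buf.getvalue()

# Old-style Solution block: print the transformed string instead.
old_style = "hello world".replace(" ", "_").replace("h", "H") + "\n"

print(looks_like_program_list(new_style))
print(looks_like_program_list(old_style))
```

Under this assumed check, only the list-of-programs output passes; the transformed string is not a Python literal and is rejected.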
+) def _format_override(task_family: str) -> str: @@ -136,8 +139,9 @@ def build_import_prompt(question: str, toolbox_str: str, task_family: str = "def "You task is to produce a list of replace() calls that transforms each " "input into its expected output for a Programming-by-Example task.\n" "You have a set of helper functions available as tools. Call any of them " - "to test ideas or compute intermediate results; the final answer must be " - "produced as a Python program in the **Solution** block." + "to test ideas or compute intermediate results; the final **Solution** " + "block must print the program sequence as a Python list of replace() call " + "strings, not the transformed outputs." ) _IMPORT_WITH_TOOLS_EXAMPLE_DEFAULT = """\ @@ -167,8 +171,8 @@ def build_import_prompt(question: str, toolbox_str: str, task_family: str = "def **Solution** ```python -result = find_replace_chain("hello world", [(" ", "_"), ("h", "H"), ("e", "E"), ("l", "L"), ("o", "O"), ("w", "W"), ("r", "R"), ("d", "D")]) -print(result) +programs = ["replace(' ', '_')", "replace('h', 'H')", "replace('e', 'E')", "replace('l', 'L')", "replace('o', 'O')", "replace('w', 'W')", "replace('r', 'R')", "replace('d', 'D')"] +print(programs) ```""" _IMPORT_WITH_TOOLS_TASK_TEMPLATE = """\ @@ -245,8 +249,8 @@ def apply_substitutions(strings, substitutions): **Solution** ```python -result = find_replace_chain("hello world", [(" ", "_"), ("h", "H"), ("e", "E"), ("l", "L"), ("o", "O"), ("w", "W"), ("r", "R"), ("d", "D")]) -print(result) +programs = ["replace(' ', '_')", "replace('h', 'H')", "replace('e', 'E')", "replace('l', 'L')", "replace('o', 'O')", "replace('w', 'W')", "replace('r', 'R')", "replace('d', 'D')"] +print(programs) ``` **Tools** ```python @@ -311,16 +315,8 @@ def build_create_prompt(question: str, task_family: str = "default") -> str: **Solution** ```python -s = "hello world" -s = s.replace(" ", "_") -s = s.replace("h", "H") -s = s.replace("e", "E") -s = s.replace("l", "L") -s = 
s.replace("o", "O") -s = s.replace("w", "W") -s = s.replace("r", "R") -s = s.replace("d", "D") -print(s) +programs = ["replace(' ', '_')", "replace('h', 'H')", "replace('e', 'E')", "replace('l', 'L')", "replace('o', 'O')", "replace('w', 'W')", "replace('r', 'R')", "replace('d', 'D')"] +print(programs) ``` **Tools** ```python diff --git a/symbolic_agent/baselines/trove/tests/test_prompts_pbebench.py b/symbolic_agent/baselines/trove/tests/test_prompts_pbebench.py new file mode 100644 index 00000000..4b5523de --- /dev/null +++ b/symbolic_agent/baselines/trove/tests/test_prompts_pbebench.py @@ -0,0 +1,33 @@ +"""Regression tests for PBEBench-shaped TroVE prompts.""" + +from symbolic_agent.baselines.trove.prompts import ( + build_create_prompt, + build_import_with_tools_prompt, + build_skip_prompt, +) + + +def _assert_pbebench_prompt_prints_program_sequence(prompt: str) -> None: + assert "print(programs)" in prompt + assert "\"replace(' ', '_')\"" in prompt + assert "\"replace('h', 'H')\"" in prompt + assert "print(result)" not in prompt + assert "print(s)" not in prompt + + +def test_pbebench_create_prompt_models_replace_program_list_stdout(): + prompt = build_create_prompt("Task", task_family="pbebench") + + _assert_pbebench_prompt_prints_program_sequence(prompt) + + +def test_pbebench_skip_prompt_models_replace_program_list_stdout(): + prompt = build_skip_prompt("Task", task_family="pbebench") + + _assert_pbebench_prompt_prints_program_sequence(prompt) + + +def test_pbebench_import_with_tools_prompt_models_replace_program_list_stdout(): + prompt = build_import_with_tools_prompt("Task", task_family="pbebench") + + _assert_pbebench_prompt_prints_program_sequence(prompt) From 83528d578ed841a70b0699567f1950d930bb6eab Mon Sep 17 00:00:00 2001 From: mathuryash5 Date: Sun, 26 Apr 2026 00:09:34 -0400 Subject: [PATCH 21/24] fix(trove): prefer reusable candidates on reward ties Reward-based PBEBench selection was choosing tiny direct solutions over equally correct CREATE/IMPORT 
candidates that populate or use the toolbox. Prefer reusable functions and tool calls on reward ties, then fall back to smallest AST, and make PBEBench CREATE prompts require a helper in **Tools**. Made-with: Cursor --- symbolic_agent/baselines/trove/controller.py | 26 +++++- symbolic_agent/baselines/trove/prompts.py | 15 +++- .../trove/tests/test_controller_selection.py | 86 +++++++++++++++++++ .../trove/tests/test_prompts_pbebench.py | 2 + 4 files changed, 127 insertions(+), 2 deletions(-) create mode 100644 symbolic_agent/baselines/trove/tests/test_controller_selection.py diff --git a/symbolic_agent/baselines/trove/controller.py b/symbolic_agent/baselines/trove/controller.py index 149f2a28..d64c638c 100644 --- a/symbolic_agent/baselines/trove/controller.py +++ b/symbolic_agent/baselines/trove/controller.py @@ -464,6 +464,7 @@ def _select_best_by_reward( """Reward-based candidate selection. Returns (best_index, (reward, message)).""" best_idx = 0 best_reward = -1.0 + best_reuse = -1 best_ast = float("inf") best_message = "" for i, c in enumerate(candidates): @@ -475,13 +476,36 @@ def _select_best_by_reward( logger.debug("Reward scoring error for candidate %d: %s", i, exc) score, msg = 0.0, str(exc) ast_size = count_ast_nodes(c.get("solution_code", "")) - if score > best_reward or (score == best_reward and ast_size < best_ast): + reuse_signal = self._reuse_signal(c) + if ( + score > best_reward + or ( + score == best_reward + and ( + reuse_signal > best_reuse + or (reuse_signal == best_reuse and ast_size < best_ast) + ) + ) + ): best_idx = i best_reward = score + best_reuse = reuse_signal best_ast = ast_size best_message = msg return best_idx, (best_reward, best_message) + @staticmethod + def _reuse_signal(candidate: dict) -> int: + """Tie-break signal for candidates that support TroVE's toolbox.""" + functions = candidate.get("functions") or [] + tool_calls = candidate.get("tool_calls") or [] + unique_tool_names = { + (tc.get("name") or "").split("<|", 
1)[0].strip() + for tc in tool_calls + if isinstance(tc, dict) and tc.get("name") + } + return len(functions) + len({name for name in unique_tool_names if name}) + def _select_best_by_consistency(self, candidates: List[dict]) -> int: """ Original TroVE self-consistency selection (majority vote on stdout). diff --git a/symbolic_agent/baselines/trove/prompts.py b/symbolic_agent/baselines/trove/prompts.py index 0e000b69..43922948 100644 --- a/symbolic_agent/baselines/trove/prompts.py +++ b/symbolic_agent/baselines/trove/prompts.py @@ -214,6 +214,14 @@ def build_import_with_tools_prompt(question: str, task_family: str = "default") "if you believe the function can be reused to solve other questions." ) +_CREATE_INSTRUCTION_PBEBENCH = ( + "You task is to write Python program solutions to the given questions.\n" + "In CREATE mode, you must define at least one reusable helper function " + "inside a **Tools** code block. The **Solution** block should use or " + "accompany that helper as appropriate, but the printed answer must remain " + "a Python list of replace() call strings." 
+) + _CREATE_EXAMPLE_DEFAULT = """\ ## Example **Question** @@ -272,7 +280,12 @@ def find_replace_chain(s, pairs): def build_create_prompt(question: str, task_family: str = "default") -> str: """Build the CREATE-mode prompt for a single task.""" - instruction = _CREATE_INSTRUCTION_DEFAULT + _format_override(task_family) + create_instruction = ( + _CREATE_INSTRUCTION_PBEBENCH + if task_family == "pbebench" + else _CREATE_INSTRUCTION_DEFAULT + ) + instruction = create_instruction + _format_override(task_family) example = _CREATE_EXAMPLE_PBEBENCH if task_family == "pbebench" else _CREATE_EXAMPLE_DEFAULT return ( instruction diff --git a/symbolic_agent/baselines/trove/tests/test_controller_selection.py b/symbolic_agent/baselines/trove/tests/test_controller_selection.py new file mode 100644 index 00000000..10d1e2f8 --- /dev/null +++ b/symbolic_agent/baselines/trove/tests/test_controller_selection.py @@ -0,0 +1,86 @@ +"""Unit tests for TroVE candidate selection.""" + +from symbolic_agent.baselines.trove.controller import TroVEController + + +def _reward(output, is_success, entry): + return {"value": 1.0 if is_success else 0.0, "message": ""} + + +def _controller(): + controller = object.__new__(TroVEController) + controller.selection = "reward" + return controller + + +def test_reward_tie_prefers_candidate_that_adds_reusable_functions(): + candidates = [ + { + "solution_code": "programs = [\"replace('a','b')\"]\nprint(programs)", + "exec_output": "[\"replace('a','b')\"]", + "is_success": True, + "functions": [], + }, + { + "solution_code": ( + "programs = infer_programs(['a'], ['b'])\n" + "print(programs)\n" + "def helper_for_ast_size():\n" + " return 1\n" + ), + "exec_output": "[\"replace('a','b')\"]", + "is_success": True, + "functions": [{"name": "infer_programs"}], + }, + ] + + idx, score = _controller()._select_best_by_reward(candidates, _reward, {}) + + assert idx == 1 + assert score == (1.0, "") + + +def 
test_reward_tie_prefers_candidate_that_called_import_tools(): + candidates = [ + { + "solution_code": "programs = [\"replace('a','b')\"]\nprint(programs)", + "exec_output": "[\"replace('a','b')\"]", + "is_success": True, + "functions": [], + "tool_calls": [], + }, + { + "solution_code": "programs = infer_programs(['a'], ['b'])\nprint(programs)", + "exec_output": "[\"replace('a','b')\"]", + "is_success": True, + "functions": [], + "tool_calls": [{"name": "infer_programs"}], + }, + ] + + idx, score = _controller()._select_best_by_reward(candidates, _reward, {}) + + assert idx == 1 + assert score == (1.0, "") + + +def test_reward_tie_uses_smallest_ast_when_reuse_signal_matches(): + candidates = [ + { + "solution_code": "x = 1\ny = 2\nprograms = [\"replace('a','b')\"]\nprint(programs)", + "exec_output": "[\"replace('a','b')\"]", + "is_success": True, + "functions": [], + }, + { + "solution_code": "programs = [\"replace('a','b')\"]\nprint(programs)", + "exec_output": "[\"replace('a','b')\"]", + "is_success": True, + "functions": [], + }, + ] + + idx, score = _controller()._select_best_by_reward(candidates, _reward, {}) + + assert idx == 1 + assert score == (1.0, "") diff --git a/symbolic_agent/baselines/trove/tests/test_prompts_pbebench.py b/symbolic_agent/baselines/trove/tests/test_prompts_pbebench.py index 4b5523de..09cc64d3 100644 --- a/symbolic_agent/baselines/trove/tests/test_prompts_pbebench.py +++ b/symbolic_agent/baselines/trove/tests/test_prompts_pbebench.py @@ -19,6 +19,8 @@ def test_pbebench_create_prompt_models_replace_program_list_stdout(): prompt = build_create_prompt("Task", task_family="pbebench") _assert_pbebench_prompt_prints_program_sequence(prompt) + assert "must define at least one reusable helper function" in prompt + assert "**Tools**" in prompt def test_pbebench_skip_prompt_models_replace_program_list_stdout(): From 94bc0d107db5489c80dda7af6bc170b24665fca5 Mon Sep 17 00:00:00 2001 From: mathuryash5 Date: Mon, 27 Apr 2026 16:38:53 -0400 Subject: 
[PATCH 22/24] fix(trove): encourage reusable PBEBench helpers Made-with: Cursor --- .../baselines/trove/docs/running.md | 149 ++++++++++++++++++ symbolic_agent/baselines/trove/prompts.py | 32 +++- .../trove/tests/test_prompts_pbebench.py | 14 ++ 3 files changed, 189 insertions(+), 6 deletions(-) create mode 100644 symbolic_agent/baselines/trove/docs/running.md diff --git a/symbolic_agent/baselines/trove/docs/running.md b/symbolic_agent/baselines/trove/docs/running.md new file mode 100644 index 00000000..704ab74b --- /dev/null +++ b/symbolic_agent/baselines/trove/docs/running.md @@ -0,0 +1,149 @@ +# Running TroVE on PBEBench-Lite + +This guide covers launching the TroVE baseline against `openai/gpt-oss-20b` +served by vLLM. There are two paths: + +- **Notebook (recommended on RunPod)** — `notebooks/run_trove_pbebench.ipynb` + drives the whole flow (env setup → vLLM launch → TroVE run → analysis) from + one place and mirrors logs to disk. +- **Shell scripts** — for SSH / tmux workflows where a notebook is awkward. + +Both paths assume an L40S/H100-class GPU with ≥40 GB VRAM and ≥40 GB free disk +for the model cache. + +--- + +## 0. Prerequisites + +- `vLLM >= 0.16.0` — earlier versions do not ship the gpt-oss reasoning parser + or auto tool-choice support. +- `typing_extensions >= 4.12.2` — older versions break vLLM startup with + `cannot import name 'TypeIs' from typing_extensions`. +- `huggingface_hub` with a working transfer backend. If `xet` errors during + download, set `HF_HUB_DISABLE_XET=1`. +- `HF_HOME` pointed at a persistent volume (e.g. `/workspace/hf-cache`) so the + model is not re-downloaded across container restarts. + +Quick install / repair on a fresh container: + +```bash +python -m pip install -U "typing_extensions>=4.12.2" \ + "huggingface_hub[hf_transfer]" hf_xet +``` + +--- + +## 1. 
Notebook path (RunPod) + +```bash +git clone /workspace/symbolic-library-agent +cd /workspace/symbolic-library-agent +jupyter lab --ip=0.0.0.0 --port=8888 --no-browser --allow-root +``` + +Then open `notebooks/run_trove_pbebench.ipynb` and run the cells top-to-bottom: + +1. **Env / cache setup** — sets `HF_HOME=/workspace/hf-cache` and disables xet. +2. **`pip install` cell** — refreshes `typing_extensions` and HF transfer. +3. **Launch vLLM** — backgrounds `scripts/launch_vllm_gpt_oss_20b.sh 8000` and + tails `vllm_logs/`. +4. **Wait for server ready** — polls `/v1/models` until 200 OK. +5. **`tail_vllm_log(60)` helper** — re-runnable cell for spot-checking the + server log at any time. +6. **Run TroVE** — `subprocess.Popen` of `main.py` with the PBEBench-Lite + pilot tasks. Stdout is mirrored to `outputs/trove_pbebench_lite_smoke_.log` + on disk in addition to the cell output, so you can SSH in and `tail -f` the + run from another shell. +7. **Analyze** — calls `scripts/analyze_trove_run.py` on the output JSONL. + +If the notebook cell stops responding, do **not** `pkill -f "main.py"` — +that pattern can match the vLLM process tree on some images. Instead: + +```bash +ps -ef | awk '/python .*main.py/ && /--framework/ && /trove/ {print $2}' \ + | xargs -r kill +``` + +--- + +## 2. Shell-script path + +Two scripts; run them in two terminals (or one tmux session with two panes). + +### 2a. Launch vLLM + +```bash +cd /workspace/symbolic-library-agent +mkdir -p vllm_logs +bash scripts/launch_vllm_gpt_oss_20b.sh 8000 +# logs: vllm_logs/vllm_8000_.log +# pid : vllm_logs/vllm_8000_.pid +``` + +The script forwards three flags that are required for our IMPORT-with-tools +branch to work: + +- `--enable-auto-tool-choice` +- `--tool-call-parser openai` +- `--reasoning-parser openai_gptoss` + +Wait for `Application startup complete` in the log before continuing. + +### 2b. 
Run TroVE + +```bash +PORT=8000 bash scripts/run_trove_vllm.sh +``` + +Defaults (overridable via env vars or trailing flags): + +| Env var | Default | +| ------------ | ----------------------------------------- | +| `PORT` | `8000` | +| `TASKS_FILE` | `data/pbebench/lite_pilot_tasks.jsonl` | +| `OUT_FILE` | `outputs/trove_pbebench_lite_pilot.jsonl` | + +Pass through any extra `main.py` flag, e.g.: + +```bash +PORT=8000 bash scripts/run_trove_vllm.sh --num-tasks 10 # quick sanity run +``` + +### 2c. Analyze + +```bash +python scripts/analyze_trove_run.py outputs/trove_pbebench_lite_pilot.jsonl +``` + +Reports overall accuracy, final toolbox size, per-mode wins, IMPORT-mode +tool-call success rate, and the top-10 most-called toolbox functions. + +--- + +## 3. Key flags (cheat sheet) + +The TroVE-specific flags on `main.py` matter most: + +| Flag | Default | Purpose | +| --------------------- | ------------ | ------------------------------------------------------- | +| `--framework` | — | Set to `trove` | +| `--trove-task-family` | `default` | Set to `pbebench` to enable PBEBench few-shots & parser | +| `--trove-selection` | `reward` | `reward` (PBEBench) or `consistency` (original TroVE) | +| `--trove-k` | `5` | Candidates per mode (1 disables sampling) | +| `--trove-trim-every` | `100` | Set high (`9999`) for ≤100-task pilots | +| `--default-reward` | — | Set to `pbebench` for the PBEBench verifier | +| `--max-programs` | `5` | PBEBench program-list length cap | + +--- + +## 4. Resuming and cleanup + +- Resume: just re-run the same command. `main.py` checkpoints to the output + JSONL; if both the JSONL and `--debug-dir` are intact it will skip already- + completed task indices. +- Force-restart: delete the output JSONL before running. 
+- vLLM cleanup: + ```bash + kill "$(cat vllm_logs/vllm_8000_*.pid)" 2>/dev/null || true + pkill -f vllm.entrypoints.openai.api_server # safe — only matches vLLM + ``` diff --git a/symbolic_agent/baselines/trove/prompts.py b/symbolic_agent/baselines/trove/prompts.py index 43922948..60d6770e 100644 --- a/symbolic_agent/baselines/trove/prompts.py +++ b/symbolic_agent/baselines/trove/prompts.py @@ -219,7 +219,12 @@ def build_import_with_tools_prompt(question: str, task_family: str = "default") "In CREATE mode, you must define at least one reusable helper function " "inside a **Tools** code block. The **Solution** block should use or " "accompany that helper as appropriate, but the printed answer must remain " - "a Python list of replace() call strings." + "a Python list of replace() call strings.\n" + "Prefer general helpers that any PBEBench task could reuse (e.g. parsing a " + "replace() call string, applying a candidate program list to inputs, or " + "scoring a program list against input/output pairs). If a helper that " + "already exists in the toolbox would solve this question, reuse it via " + "IMPORT mode instead of defining a near-duplicate here." 
) _CREATE_EXAMPLE_DEFAULT = """\ @@ -262,11 +267,26 @@ def apply_substitutions(strings, substitutions): ``` **Tools** ```python -def find_replace_chain(s, pairs): - \"\"\"Apply a chain of (old, new) replacements to a string.\"\"\" - for old, new in pairs: - s = s.replace(old, new) - return s +import ast + +def parse_replace_call(call_str): + \"\"\"Parse a 'replace(old, new)' string into an (old, new) tuple of literals.\"\"\" + expr = ast.parse(call_str.strip(), mode="eval").body + old = ast.literal_eval(expr.args[0]) + new = ast.literal_eval(expr.args[1]) + return old, new + +def score_programs(programs, examples): + \"\"\"Return the fraction of (input, output) examples that `programs` reproduces.\"\"\" + pairs = [parse_replace_call(p) for p in programs] + correct = 0 + for inp, expected in examples: + s = inp + for old, new in pairs: + s = s.replace(old, new) + if s == expected: + correct += 1 + return correct / len(examples) if examples else 0.0 ```""" _CREATE_TASK_TEMPLATE = """\ diff --git a/symbolic_agent/baselines/trove/tests/test_prompts_pbebench.py b/symbolic_agent/baselines/trove/tests/test_prompts_pbebench.py index 09cc64d3..d4fcc8d3 100644 --- a/symbolic_agent/baselines/trove/tests/test_prompts_pbebench.py +++ b/symbolic_agent/baselines/trove/tests/test_prompts_pbebench.py @@ -23,6 +23,20 @@ def test_pbebench_create_prompt_models_replace_program_list_stdout(): assert "**Tools**" in prompt +def test_pbebench_create_prompt_uses_pbebench_specific_helpers(): + prompt = build_create_prompt("Task", task_family="pbebench") + + assert "def parse_replace_call" in prompt + assert "def score_programs" in prompt + assert "def find_replace_chain" not in prompt + + +def test_pbebench_create_prompt_warns_against_duplicating_existing_tools(): + prompt = build_create_prompt("Task", task_family="pbebench") + + assert "already exists" in prompt or "duplicate" in prompt.lower() + + def test_pbebench_skip_prompt_models_replace_program_list_stdout(): prompt = 
build_skip_prompt("Task", task_family="pbebench") From 4352d31519d042ceac9c8324cce0375979ffaf69 Mon Sep 17 00:00:00 2001 From: mathuryash5 Date: Wed, 29 Apr 2026 12:22:41 -0400 Subject: [PATCH 23/24] fix(trove): show PBEBench helper signatures in CREATE prompt Made-with: Cursor --- .../baselines/trove/docs/deviations.md | 6 ++- symbolic_agent/baselines/trove/prompts.py | 41 ++++++++----------- .../trove/tests/test_prompts_pbebench.py | 12 ++++-- 3 files changed, 30 insertions(+), 29 deletions(-) diff --git a/symbolic_agent/baselines/trove/docs/deviations.md b/symbolic_agent/baselines/trove/docs/deviations.md index 5ce60482..fda9d359 100644 --- a/symbolic_agent/baselines/trove/docs/deviations.md +++ b/symbolic_agent/baselines/trove/docs/deviations.md @@ -30,8 +30,10 @@ self-consistency selector remains available via `--trove-selection consistency`. ### 1.3 PBEBench-shaped few-shot examples For `task_family="pbebench"` we replace the generic CREATE / SKIP / IMPORT example pairs with PBEBench-shaped pairs that demonstrate `replace()` -chains and a small reusable helper (`find_replace_chain`). The legacy -default examples remain for `task_family="default"`. +chains. CREATE mode also shows signature-only examples of reusable helper +shapes (apply, score, search, prune, debug, end-to-end solve) instead of +full function definitions, to reduce anchoring on a single copied helper. +The legacy default examples remain for `task_family="default"`. 
### 1.4 Strict **Solution** parsing for PBEBench The legacy parser falls back to "first ```python``` block anywhere" when diff --git a/symbolic_agent/baselines/trove/prompts.py b/symbolic_agent/baselines/trove/prompts.py index 60d6770e..7058cae2 100644 --- a/symbolic_agent/baselines/trove/prompts.py +++ b/symbolic_agent/baselines/trove/prompts.py @@ -224,7 +224,10 @@ def build_import_with_tools_prompt(question: str, task_family: str = "default") "replace() call string, applying a candidate program list to inputs, or " "scoring a program list against input/output pairs). If a helper that " "already exists in the toolbox would solve this question, reuse it via " - "IMPORT mode instead of defining a near-duplicate here." + "IMPORT mode instead of defining a near-duplicate here.\n" + "The helper signatures below are examples of useful tool shapes, not " + "definitions to copy. If you create a helper, implement the complete " + "function body in **Tools**." ) _CREATE_EXAMPLE_DEFAULT = """\ @@ -255,6 +258,18 @@ def apply_substitutions(strings, substitutions): ```""" _CREATE_EXAMPLE_PBEBENCH = """\ +## Reusable helper signatures +These are example shapes for reusable PBEBench tools. Do not copy `...` stubs +as real tools; implement complete helpers when you decide to create one. +```python +def apply_programs(s, programs): ... +def score_programs(programs, examples): ... +def search_candidate_programs(examples, max_programs=5): ... +def prune_search_state(partial_programs, examples): ... +def debug_program_failure(programs, examples): ... +def solve_examples(examples, max_programs=5): ... 
+``` + ## Example **Question** Produce a sequence of replace() calls that transforms "hello world" into @@ -265,29 +280,7 @@ def apply_substitutions(strings, substitutions): programs = ["replace(' ', '_')", "replace('h', 'H')", "replace('e', 'E')", "replace('l', 'L')", "replace('o', 'O')", "replace('w', 'W')", "replace('r', 'R')", "replace('d', 'D')"] print(programs) ``` -**Tools** -```python -import ast - -def parse_replace_call(call_str): - \"\"\"Parse a 'replace(old, new)' string into an (old, new) tuple of literals.\"\"\" - expr = ast.parse(call_str.strip(), mode="eval").body - old = ast.literal_eval(expr.args[0]) - new = ast.literal_eval(expr.args[1]) - return old, new - -def score_programs(programs, examples): - \"\"\"Return the fraction of (input, output) examples that `programs` reproduces.\"\"\" - pairs = [parse_replace_call(p) for p in programs] - correct = 0 - for inp, expected in examples: - s = inp - for old, new in pairs: - s = s.replace(old, new) - if s == expected: - correct += 1 - return correct / len(examples) if examples else 0.0 -```""" +""" _CREATE_TASK_TEMPLATE = """\ ## Task diff --git a/symbolic_agent/baselines/trove/tests/test_prompts_pbebench.py b/symbolic_agent/baselines/trove/tests/test_prompts_pbebench.py index d4fcc8d3..9d0685ad 100644 --- a/symbolic_agent/baselines/trove/tests/test_prompts_pbebench.py +++ b/symbolic_agent/baselines/trove/tests/test_prompts_pbebench.py @@ -23,12 +23,18 @@ def test_pbebench_create_prompt_models_replace_program_list_stdout(): assert "**Tools**" in prompt -def test_pbebench_create_prompt_uses_pbebench_specific_helpers(): +def test_pbebench_create_prompt_suggests_pbebench_helper_signatures(): prompt = build_create_prompt("Task", task_family="pbebench") - assert "def parse_replace_call" in prompt - assert "def score_programs" in prompt + assert "Reusable helper signatures" in prompt + assert "def apply_programs(s, programs): ..." in prompt + assert "def score_programs(programs, examples): ..." 
in prompt + assert "def search_candidate_programs(examples, max_programs=5): ..." in prompt + assert "def debug_program_failure(programs, examples): ..." in prompt assert "def find_replace_chain" not in prompt + assert "import ast" not in prompt + assert "ast.parse" not in prompt + assert "return correct / len(examples)" not in prompt def test_pbebench_create_prompt_warns_against_duplicating_existing_tools(): From 51edc0c6f5375a7a6499d5316abcfe8b0bf37c0e Mon Sep 17 00:00:00 2001 From: mathuryash5 Date: Thu, 30 Apr 2026 13:40:38 -0400 Subject: [PATCH 24/24] chore(trove): remove superpowers planning docs Made-with: Cursor --- .../2026-04-25-trove-native-tool-calling.md | 2274 ----------------- ...-04-25-trove-native-tool-calling-design.md | 374 --- 2 files changed, 2648 deletions(-) delete mode 100644 docs/superpowers/plans/2026-04-25-trove-native-tool-calling.md delete mode 100644 docs/superpowers/specs/2026-04-25-trove-native-tool-calling-design.md diff --git a/docs/superpowers/plans/2026-04-25-trove-native-tool-calling.md b/docs/superpowers/plans/2026-04-25-trove-native-tool-calling.md deleted file mode 100644 index 76ecb582..00000000 --- a/docs/superpowers/plans/2026-04-25-trove-native-tool-calling.md +++ /dev/null @@ -1,2274 +0,0 @@ -# TroVE Native Tool Calling Implementation Plan - -> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. - -**Goal:** Adapt the existing TroVE port so that the IMPORT mode uses native OpenAI tool calling (vLLM-served `gpt-oss`) while CREATE / SKIP / selection / trimming remain faithful to the paper, then run a 50-task PBEBench smoke and report numbers. - -**Architecture:** Keep `_multi_way_generation` unchanged for CREATE/SKIP. 
Replace the IMPORT branch (when toolbox non-empty AND backend is OpenAI) with a multi-turn loop that (a) translates top-k toolbox functions into OpenAI tool schemas, (b) lets the model emit `tool_calls` that are executed in a sandboxed subprocess, and (c) returns the final assistant text + recorded tool-call trajectory. Frequency credit comes from unique `tool_call.function.name` entries, not parsed `from toolbox import`. All other invariants (K-sampling, reward-based selection, AST tie-break, `C·log_{20}(n)` trimming) are unchanged. - -**Tech Stack:** Python 3.11, OpenAI Python SDK against a vLLM ≥ v0.16.0 endpoint serving `openai/gpt-oss-20b` (or `120b`), `subprocess`-based executor, `inspect` + `ast` from stdlib. - -**Spec:** [docs/superpowers/specs/2026-04-25-trove-native-tool-calling-design.md](../specs/2026-04-25-trove-native-tool-calling-design.md) - ---- - -## File Structure - -| File | Status | Purpose | -|---|---|---| -| `symbolic_agent/baselines/trove/toolbox.py` | Modify | Trim coefficient `C=1.0` | -| `symbolic_agent/baselines/trove/executor.py` | Modify | `DEFAULT_TIMEOUT=60` | -| `symbolic_agent/baselines/trove/llm.py` | Modify | `reasoning_content` fallback in `_call_openai`; new `chat_with_tools` method | -| `symbolic_agent/baselines/trove/parse.py` | Modify | `imported_callsites` helper; `task_family` parameter on `parse_response` | -| `symbolic_agent/baselines/trove/prompts.py` | Modify | PBEBench-shaped few-shots; `build_import_with_tools_prompt`; `task_family` dispatch | -| `symbolic_agent/baselines/trove/controller.py` | Modify | IMPORT-with-tools branch; telemetry fields; `task_family` + `selection` params | -| `symbolic_agent/baselines/trove/tools_api.py` | Create | `toolbox_to_openai_tools`; `dispatch_tool_call` | -| `symbolic_agent/baselines/trove/docs/deviations.md` | Create | Algorithmic deviations / faithful elements / infra patches | -| `symbolic_agent/baselines/trove/tests/__init__.py` | Create | Marker file for the new tests package 
| -| `symbolic_agent/baselines/trove/tests/test_tools_api.py` | Create | Unit tests for schema generation + dispatcher | -| `symbolic_agent/baselines/trove/tests/test_parse_callsites.py` | Create | Unit tests for `imported_callsites` | -| `main.py` | Modify | `--trove-selection` and `--trove-task-family` flags | -| `scripts/launch_vllm_gpt_oss_120b.sh` | Modify | Add three vLLM tool-calling flags | -| `scripts/analyze_trove_run.py` | Create | Post-hoc analysis of TroVE JSONL output | - ---- - -## Task 1: Quick infrastructure patches (trim C, executor timeout, reasoning_content fallback) - -**Files:** -- Modify: `symbolic_agent/baselines/trove/toolbox.py:117` -- Modify: `symbolic_agent/baselines/trove/executor.py:19` -- Modify: `symbolic_agent/baselines/trove/llm.py:192` - -These are three independent one-line changes. Bundling them since each is too small to warrant its own commit and they're all on the "infrastructure" axis. - -- [ ] **Step 1.1: Update trim coefficient default** - -In `symbolic_agent/baselines/trove/toolbox.py`, change the default of `trim`: - -```python -def trim(self, n_processed: int, C: float = 1.0) -> set: - """ - Remove functions whose frequency is below the threshold - C * log_{20}(n_processed) - and return the set of example indices that had used those functions. - - Faithful to trim_library() in run_trove.py: - threshold = math.log(n, 20) # log base 20 - C defaults to 1.0, matching the original implementation (C·log_{20}(n)). - Note: the original uses log base-20 not base-10; we keep base-20. 
- """ -``` - -- [ ] **Step 1.2: Update executor timeout default** - -In `symbolic_agent/baselines/trove/executor.py`, change the constant: - -```python -DEFAULT_TIMEOUT = 60 # seconds — generous for PBEBench replace() chains and multi-turn dispatch -``` - -- [ ] **Step 1.3: Add reasoning_content fallback in `_call_openai`** - -In `symbolic_agent/baselines/trove/llm.py`, replace the line that reads `raw = response.choices[0].message.content or ""` with: - -```python - msg = response.choices[0].message - raw = msg.content or getattr(msg, "reasoning_content", "") or "" -``` - -Context (the surrounding `try` block stays unchanged): - -```python - response = self._client.chat.completions.create( - model=model, - max_tokens=max_tokens, - messages=messages, - ) - msg = response.choices[0].message - raw = msg.content or getattr(msg, "reasoning_content", "") or "" - u = getattr(response, "usage", None) -``` - -- [ ] **Step 1.4: Sanity-check the changes** - -Run: `python -c "from symbolic_agent.baselines.trove.toolbox import TroVEToolbox; from symbolic_agent.baselines.trove.executor import DEFAULT_TIMEOUT; import inspect; print(inspect.signature(TroVEToolbox.trim).parameters['C'].default, DEFAULT_TIMEOUT)"` - -Expected: `1.0 60` - -- [ ] **Step 1.5: Commit** - -```bash -git add symbolic_agent/baselines/trove/toolbox.py symbolic_agent/baselines/trove/executor.py symbolic_agent/baselines/trove/llm.py -git commit -m "$(cat <<'EOF' -fix(trove): infra patches for native tool calling - -- toolbox.trim default C=1.0 (matches original TroVE) -- executor DEFAULT_TIMEOUT=60s (PBEBench + multi-turn headroom) -- llm._call_openai falls back to message.reasoning_content when - message.content is empty (gpt-oss Harmony channel split) -EOF -)" -``` - ---- - -## Task 2: `parse.imported_callsites` helper + `task_family` parameter - -**Files:** -- Modify: `symbolic_agent/baselines/trove/parse.py:86,106-114` -- Create: `symbolic_agent/baselines/trove/tests/__init__.py` -- Create: 
`symbolic_agent/baselines/trove/tests/test_parse_callsites.py` - -- [ ] **Step 2.1: Create the tests package marker** - -Create `symbolic_agent/baselines/trove/tests/__init__.py` as an empty file. - -- [ ] **Step 2.2: Write the failing test for `imported_callsites`** - -Create `symbolic_agent/baselines/trove/tests/test_parse_callsites.py`: - -```python -"""Unit tests for parse.imported_callsites and parse_response(task_family=).""" - -from symbolic_agent.baselines.trove.parse import imported_callsites, parse_response - - -# --------------------------------------------------------------------------- -# imported_callsites -# --------------------------------------------------------------------------- - -def test_callsites_bare_name(): - code = "result = find_replace_chain(s, [('a', 'b')])\nprint(result)" - assert imported_callsites(code, tools_code="", candidate_names={"find_replace_chain", "other"}) == {"find_replace_chain"} - - -def test_callsites_attribute_access(): - code = "result = toolbox.find_replace_chain(s, pairs)\nprint(result)" - assert imported_callsites(code, tools_code="", candidate_names={"find_replace_chain"}) == {"find_replace_chain"} - - -def test_callsites_no_match(): - code = "print(s.replace('a', 'b'))" - assert imported_callsites(code, tools_code="", candidate_names={"find_replace_chain"}) == set() - - -def test_callsites_multiple_calls_same_name_dedup(): - code = "x = f(1)\ny = f(2)\nprint(x, y)" - assert imported_callsites(code, tools_code="", candidate_names={"f", "g"}) == {"f"} - - -def test_callsites_syntax_error_returns_empty(): - code = "this is not valid python ::" - assert imported_callsites(code, tools_code="", candidate_names={"f"}) == set() - - -def test_callsites_empty_inputs(): - assert imported_callsites("", "", set()) == set() - assert imported_callsites("print(1)", "", set()) == set() - - -# --------------------------------------------------------------------------- -# parse_response(task_family=) -# 
--------------------------------------------------------------------------- - -def test_parse_response_pbebench_strict_no_solution_block(): - text = "Here is some reasoning.\n```python\nprint('answer')\n```\n" - out = parse_response(text, task_family="pbebench") - assert out["solution_code"] == "" - - -def test_parse_response_pbebench_with_solution_block(): - text = "**Solution**\n```python\nprint('answer')\n```\n" - out = parse_response(text, task_family="pbebench") - assert out["solution_code"] == "print('answer')" - - -def test_parse_response_default_falls_back_to_any_python_block(): - text = "Here is some reasoning.\n```python\nprint('answer')\n```\n" - out = parse_response(text, task_family="default") - assert "print('answer')" in out["solution_code"] - - -def test_parse_response_default_call_signature_unchanged(): - text = "**Solution**\n```python\nprint('answer')\n```\n" - out = parse_response(text) - assert out["solution_code"] == "print('answer')" -``` - -- [ ] **Step 2.3: Run the tests to confirm they fail** - -Run: `python -m pytest symbolic_agent/baselines/trove/tests/test_parse_callsites.py -v` - -Expected: a collection error, `ImportError` on `imported_callsites` (the function does not exist yet), so no tests run. Once that import resolves, the `parse_response(text, task_family=...)` tests would fail on the unknown kwarg. - -- [ ] **Step 2.4: Implement `imported_callsites` and add `task_family` to `parse_response`** - -In `symbolic_agent/baselines/trove/parse.py`, add the helper at the end of the AST section (after `count_ast_nodes`): - -```python -def imported_callsites( - solution_code: str, - tools_code: str, - candidate_names: set, -) -> set: - """ - Return the subset of `candidate_names` that appear as call-sites in - `solution_code`. Used for the `actually_called` telemetry field. - - Detects two callee shapes: - - bare Name: find_replace_chain(...) - - Attribute(name): toolbox.find_replace_chain(...)
- - `tools_code` is currently unused (kept in the signature so callers can - pass through the **Tools** block context if we later want to filter by - what was actually imported). - - Returns an empty set on empty input or SyntaxError. - """ - if not solution_code or not candidate_names: - return set() - try: - tree = ast.parse(solution_code) - except SyntaxError: - return set() - found: set = set() - for node in ast.walk(tree): - if not isinstance(node, ast.Call): - continue - func = node.func - if isinstance(func, ast.Name) and func.id in candidate_names: - found.add(func.id) - elif isinstance(func, ast.Attribute) and func.attr in candidate_names: - found.add(func.attr) - return found -``` - -Then modify `parse_response` (around line 86) to accept `task_family`: - -```python -def parse_response(text: str, task_family: str = "default") -> dict: - """ - Parse a TroVE-format LLM response. - - Returns - ------- - { - "solution_code": str, # code inside **Solution** block - "tools_code": str, # code inside **Tools** block - "functions": list[dict], # parsed tool dicts from the Tools block - } - - task_family - ----------- - "default": if no **Solution** block is found, falls back to the first - ```python``` block anywhere (legacy behaviour). - "pbebench": no fallback. Strict **Solution**-block-only parsing avoids - accidentally promoting CoT scratchpad to the answer. 
- """ - solution_code = _extract_code_block(text, "Solution") or "" - tools_code = _extract_code_block(text, "Tools") or "" - - if not solution_code and task_family != "pbebench": - raw = _extract_any_python_block(text) - if raw: - solution_code = _make_executable(raw) - - functions = parse_tools_in_chunk(tools_code) if tools_code else [] - return { - "solution_code": solution_code, - "tools_code": tools_code, - "functions": functions, - } -``` - -- [ ] **Step 2.5: Run the tests to confirm they pass** - -Run: `python -m pytest symbolic_agent/baselines/trove/tests/test_parse_callsites.py -v` - -Expected: 10 passed. - -- [ ] **Step 2.6: Commit** - -```bash -git add symbolic_agent/baselines/trove/parse.py symbolic_agent/baselines/trove/tests/__init__.py symbolic_agent/baselines/trove/tests/test_parse_callsites.py -git commit -m "$(cat <<'EOF' -feat(trove): add imported_callsites helper and task_family to parse_response - -- imported_callsites(solution, tools, names) -> set: AST-walks Solution - code and returns names from the candidate set that are actually called. - Handles bare Name and Attribute (toolbox.foo) callees. -- parse_response(text, task_family="default"): when task_family="pbebench" - the parser does not fall back to the first python block when **Solution** - is missing. Prevents CoT scratchpad from being promoted to the answer. -EOF -)" -``` - ---- - -## Task 3: PBEBench-shaped few-shots + IMPORT-with-tools prompt - -**Files:** -- Modify: `symbolic_agent/baselines/trove/prompts.py` (full rewrite of constants and `build_*` functions) - -This task has no automated test — prompts are validated by inspection and by the smoke run. - -- [ ] **Step 3.1: Replace the prompts module with task-family-aware variants** - -Open `symbolic_agent/baselines/trove/prompts.py` and replace the entire body below the module docstring with the following. Keep the docstring at the top of the file. 
- -```python -# --------------------------------------------------------------------------- -# Format override (default-family only) -# --------------------------------------------------------------------------- - -_FORMAT_OVERRIDE_DEFAULT = ( - "\nIMPORTANT: Regardless of any formatting instructions inside the question, " - "always produce your answer as executable Python in the **Solution** block " - "and end it with print(answer). " - "Your answer is whatever gets printed to stdout when the Solution code runs." -) - -# PBEBench prompts model the desired format directly via the few-shot example, -# so no override string is needed. -_FORMAT_OVERRIDE_PBEBENCH = "" - - -def _format_override(task_family: str) -> str: - return _FORMAT_OVERRIDE_PBEBENCH if task_family == "pbebench" else _FORMAT_OVERRIDE_DEFAULT - - -# --------------------------------------------------------------------------- -# IMPORT mode (text-based, default and Anthropic fallback) -# --------------------------------------------------------------------------- - -_IMPORT_INSTRUCTION_DEFAULT = ( - "You task is to write Python program solutions to the given questions.\n" - "The toolbox section lists all the available functions that can be used in your solution." -) - -_IMPORT_EXAMPLE_DEFAULT = """\ -## Example -**Question** -Given a list of strings and a list of (old, new) substitution pairs, apply all -substitutions in order to each string and return the transformed list. -Strings: ["cat", "bat"] -Substitutions: [("a", "o"), ("t", "p")] - -**Toolbox** -```python -# Apply an ordered list of (old, new) substitutions to each string in a list. 
-apply_substitutions(strings: list, substitutions: list) -> list -``` - -**Solution** -```python -strings = ["cat", "bat"] -subs = [("a", "o"), ("t", "p")] -result = apply_substitutions(strings, subs) -print(result) -``` -**Tools** -```python -from toolbox import apply_substitutions -```""" - -_IMPORT_EXAMPLE_PBEBENCH = """\ -## Example -**Question** -You are given example input/output pairs. Produce a list of replace() calls -that transforms each input into its expected output. - -Input: "hello world" -Output: "HELLO_WORLD" - -**Toolbox** -```python -# Apply a chain of (old, new) replacements to a string. -find_replace_chain(s: str, pairs: list) -> str -``` - -**Solution** -```python -result = find_replace_chain("hello world", [(" ", "_"), ("h", "H"), ("e", "E"), ("l", "L"), ("o", "O"), ("w", "W"), ("r", "R"), ("d", "D")]) -print(result) -``` -**Tools** -```python -from toolbox import find_replace_chain -```""" - -_IMPORT_TASK_TEMPLATE = """\ -## Task -**Question** -{question} - -**Toolbox** -{toolbox} - -**Solution** -""" - - -def build_import_prompt(question: str, toolbox_str: str, task_family: str = "default") -> str: - """Build the text-based IMPORT-mode prompt (used for Anthropic and as fallback).""" - instruction = _IMPORT_INSTRUCTION_DEFAULT + _format_override(task_family) - example = _IMPORT_EXAMPLE_PBEBENCH if task_family == "pbebench" else _IMPORT_EXAMPLE_DEFAULT - return ( - instruction - + "\n\n\n" - + example - + "\n\n\n" - + _IMPORT_TASK_TEMPLATE.format(question=question, toolbox=toolbox_str) - ) - - -# --------------------------------------------------------------------------- -# IMPORT-with-tools mode (native OpenAI tool calling; no **Toolbox** block) -# --------------------------------------------------------------------------- - -_IMPORT_WITH_TOOLS_INSTRUCTION_DEFAULT = ( - "You task is to write Python program solutions to the given questions.\n" - "You have a set of helper functions available as tools. 
Call any of them " - "when they help you solve the question; otherwise solve directly. After " - "you have computed the answer, output it as executable Python in a " - "**Solution** block and end with print(answer)." -) - -_IMPORT_WITH_TOOLS_INSTRUCTION_PBEBENCH = ( - "You task is to produce a list of replace() calls that transforms each " - "input into its expected output for a Programming-by-Example task.\n" - "You have a set of helper functions available as tools. Call any of them " - "to test ideas or compute intermediate results; the final answer must be " - "produced as a Python program in the **Solution** block." -) - -_IMPORT_WITH_TOOLS_EXAMPLE_DEFAULT = """\ -## Example -**Question** -Apply substitutions [("a","o"),("t","p")] to ["cat","bat"] and return the list. - -(After optionally calling `apply_substitutions` as a tool to confirm, -the assistant produces:) - -**Solution** -```python -strings = ["cat", "bat"] -subs = [("a", "o"), ("t", "p")] -result = apply_substitutions(strings, subs) -print(result) -```""" - -_IMPORT_WITH_TOOLS_EXAMPLE_PBEBENCH = """\ -## Example -**Question** -Produce a sequence of replace() calls that transforms "hello world" into -"HELLO_WORLD". - -(After optionally calling `find_replace_chain` as a tool to verify a -candidate sequence, the assistant produces:) - -**Solution** -```python -result = find_replace_chain("hello world", [(" ", "_"), ("h", "H"), ("e", "E"), ("l", "L"), ("o", "O"), ("w", "W"), ("r", "R"), ("d", "D")]) -print(result) -```""" - -_IMPORT_WITH_TOOLS_TASK_TEMPLATE = """\ -## Task -**Question** -{question} - -**Solution** -""" - - -def build_import_with_tools_prompt(question: str, task_family: str = "default") -> str: - """ - Build the IMPORT-with-tools prompt. The toolbox is NOT shown as text — it - is conveyed via the OpenAI tools=[...] parameter on the chat completion call. 
- """ - if task_family == "pbebench": - instruction = _IMPORT_WITH_TOOLS_INSTRUCTION_PBEBENCH - example = _IMPORT_WITH_TOOLS_EXAMPLE_PBEBENCH - else: - instruction = _IMPORT_WITH_TOOLS_INSTRUCTION_DEFAULT - example = _IMPORT_WITH_TOOLS_EXAMPLE_DEFAULT - return ( - instruction - + "\n\n\n" - + example - + "\n\n\n" - + _IMPORT_WITH_TOOLS_TASK_TEMPLATE.format(question=question) - ) - - -# --------------------------------------------------------------------------- -# CREATE mode -# --------------------------------------------------------------------------- - -_CREATE_INSTRUCTION_DEFAULT = ( - "You task is to write Python program solutions to the given questions.\n" - "You should also create Python functions that can be used by your solution, " - "if you believe the function can be reused to solve other questions." -) - -_CREATE_EXAMPLE_DEFAULT = """\ -## Example -**Question** -Given a list of strings and a list of (old, new) substitution pairs, apply all -substitutions in order to each string and return the transformed list. -Strings: ["hello", "world"] -Substitutions: [("l", "r"), ("o", "0")] - -**Solution** -```python -strings = ["hello", "world"] -subs = [("l", "r"), ("o", "0")] -result = apply_substitutions(strings, subs) -print(result) -``` -**Tools** -```python -def apply_substitutions(strings, substitutions): - \"\"\"Apply an ordered list of (old, new) substitutions to each string in a list.\"\"\" - out = [] - for s in strings: - for old, new in substitutions: - s = s.replace(old, new) - out.append(s) - return out -```""" - -_CREATE_EXAMPLE_PBEBENCH = """\ -## Example -**Question** -Produce a sequence of replace() calls that transforms "hello world" into -"HELLO_WORLD". 
- -**Solution** -```python -result = find_replace_chain("hello world", [(" ", "_"), ("h", "H"), ("e", "E"), ("l", "L"), ("o", "O"), ("w", "W"), ("r", "R"), ("d", "D")]) -print(result) -``` -**Tools** -```python -def find_replace_chain(s, pairs): - \"\"\"Apply a chain of (old, new) replacements to a string.\"\"\" - for old, new in pairs: - s = s.replace(old, new) - return s -```""" - -_CREATE_TASK_TEMPLATE = """\ -## Task -**Question** -{question} - -**Solution** -""" - - -def build_create_prompt(question: str, task_family: str = "default") -> str: - """Build the CREATE-mode prompt for a single task.""" - instruction = _CREATE_INSTRUCTION_DEFAULT + _format_override(task_family) - example = _CREATE_EXAMPLE_PBEBENCH if task_family == "pbebench" else _CREATE_EXAMPLE_DEFAULT - return ( - instruction - + "\n\n\n" - + example - + "\n\n\n" - + _CREATE_TASK_TEMPLATE.format(question=question) - ) - - -# --------------------------------------------------------------------------- -# SKIP mode -# --------------------------------------------------------------------------- - -_SKIP_INSTRUCTION_DEFAULT = ( - "You task is to write Python program solutions to the given questions." -) - -_SKIP_EXAMPLE_DEFAULT = """\ -## Example -**Question** -Given the list of strings ["Hello", "World"], convert each to lowercase and -return the resulting list. - -**Solution** -```python -strings = ["Hello", "World"] -result = [s.lower() for s in strings] -print(result) -``` -**Tools** -```python -```""" - -_SKIP_EXAMPLE_PBEBENCH = """\ -## Example -**Question** -Produce a sequence of replace() calls that transforms "hello world" into -"HELLO_WORLD". 
- -**Solution** -```python -s = "hello world" -s = s.replace(" ", "_") -s = s.replace("h", "H") -s = s.replace("e", "E") -s = s.replace("l", "L") -s = s.replace("o", "O") -s = s.replace("w", "W") -s = s.replace("r", "R") -s = s.replace("d", "D") -print(s) -``` -**Tools** -```python -```""" - -_SKIP_TASK_TEMPLATE = """\ -## Task -**Question** -{question} - -**Solution** -""" - - -def build_skip_prompt(question: str, task_family: str = "default") -> str: - """Build the SKIP-mode prompt for a single task.""" - instruction = _SKIP_INSTRUCTION_DEFAULT + _format_override(task_family) - example = _SKIP_EXAMPLE_PBEBENCH if task_family == "pbebench" else _SKIP_EXAMPLE_DEFAULT - return ( - instruction - + "\n\n\n" - + example - + "\n\n\n" - + _SKIP_TASK_TEMPLATE.format(question=question) - ) - - -def get_question(task_input: dict) -> str: - """ - Extract the question/prompt string from a task_input dict. - - Priority: question > prompt > task > str(task_input). - """ - for key in ("question", "prompt", "task"): - val = task_input.get(key) - if val and isinstance(val, str) and val.strip(): - return val.strip() - return str(task_input) -``` - -- [ ] **Step 3.2: Smoke-test the new prompts compile and dispatch correctly** - -Run: `python -c "from symbolic_agent.baselines.trove.prompts import build_import_prompt, build_create_prompt, build_skip_prompt, build_import_with_tools_prompt; print('--IMPORT default--'); print(build_import_prompt('Q?', 'TB')[:200]); print('--IMPORT pbebench--'); print(build_import_prompt('Q?', 'TB', task_family='pbebench')[:200]); print('--IMPORT_WITH_TOOLS pbebench--'); print(build_import_with_tools_prompt('Q?', task_family='pbebench')[:200])"` - -Expected: three short prompt previews, no exceptions, no `IMPORTANT:` line in the pbebench variant. 
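
The manual inspection in Step 3.2 could also be written as assertions. The sketch below is a minimal, self-contained illustration and not part of the plan: `build_prompt_stub` and the shortened override text are hypothetical stand-ins that only mirror the `instruction + override`, then task-section shape of the real `build_*` builders, and `format_override` copies the dispatch rule from Step 3.1.

```python
# Hypothetical stand-ins: shortened override text, not the real prompt strings.
FORMAT_OVERRIDE_DEFAULT = "\nIMPORTANT: always answer in the **Solution** block."
FORMAT_OVERRIDE_PBEBENCH = ""  # PBEBench relies on its few-shot example instead


def format_override(task_family: str) -> str:
    """Return the override suffix for the given task family."""
    if task_family == "pbebench":
        return FORMAT_OVERRIDE_PBEBENCH
    return FORMAT_OVERRIDE_DEFAULT


def build_prompt_stub(question: str, task_family: str = "default") -> str:
    """Assumed shape: instruction plus override, then the task section."""
    instruction = "Write a Python solution to the question." + format_override(task_family)
    task = f"## Task\n**Question**\n{question}\n\n**Solution**\n"
    return instruction + "\n\n\n" + task


def check_dispatch() -> None:
    """Assert the properties that Step 3.2 checks by eye."""
    default_prompt = build_prompt_stub("Q?")
    pbebench_prompt = build_prompt_stub("Q?", task_family="pbebench")
    assert "IMPORTANT:" in default_prompt
    assert "IMPORTANT:" not in pbebench_prompt
    assert pbebench_prompt.endswith("**Solution**\n")


check_dispatch()
```

Against the real module, the same three assertions would presumably target `build_import_prompt`, `build_create_prompt`, and `build_skip_prompt`, since those are the builders that append the override.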
- -- [ ] **Step 3.3: Commit** - -```bash -git add symbolic_agent/baselines/trove/prompts.py -git commit -m "$(cat <<'EOF' -feat(trove): PBEBench-shaped few-shots and IMPORT-with-tools prompt - -- Add task_family parameter to all build_* prompt builders. -- Add _CREATE_EXAMPLE_PBEBENCH and _SKIP_EXAMPLE_PBEBENCH demonstrating - replace()-chain solutions and a find_replace_chain helper. -- Add build_import_with_tools_prompt for native tool calling: no - **Toolbox** markdown block (toolbox is conveyed via tools=[...]). -- _FORMAT_OVERRIDE is empty for task_family="pbebench" (the example - models the desired format directly). -EOF -)" -``` - ---- - -## Task 4: New `tools_api.py` (toolbox -> OpenAI schemas, dispatcher) - -**Files:** -- Create: `symbolic_agent/baselines/trove/tools_api.py` -- Create: `symbolic_agent/baselines/trove/tests/test_tools_api.py` - -- [ ] **Step 4.1: Write the failing tests** - -Create `symbolic_agent/baselines/trove/tests/test_tools_api.py`: - -```python -"""Unit tests for tools_api.toolbox_to_openai_tools and dispatch_tool_call.""" - -import json -from types import SimpleNamespace - -from symbolic_agent.baselines.trove.toolbox import TroVEToolbox -from symbolic_agent.baselines.trove.tools_api import ( - dispatch_tool_call, - toolbox_to_openai_tools, -) - - -def _make_toolbox_with(func_src: str, name: str, docstr: str = "") -> TroVEToolbox: - tb = TroVEToolbox() - tb.add( - { - "name": name, - "docstr": docstr, - "signature": f"def {name}(...)", - "function": func_src, - "type": "function", - }, - example_idx=0, - ) - return tb - - -def _tool_call(name: str, args: dict, call_id: str = "call_1"): - return SimpleNamespace( - id=call_id, - function=SimpleNamespace(name=name, arguments=json.dumps(args)), - ) - - -# --------------------------------------------------------------------------- -# toolbox_to_openai_tools -# --------------------------------------------------------------------------- - -def test_schema_basic_function(): - src = ( - "def 
find_replace_chain(s: str, pairs: list) -> str:\n" - " \"\"\"Apply a chain of (old, new) replacements to a string.\"\"\"\n" - " for old, new in pairs:\n" - " s = s.replace(old, new)\n" - " return s\n" - ) - tb = _make_toolbox_with(src, "find_replace_chain", docstr="Apply a chain of (old, new) replacements to a string.") - tools = toolbox_to_openai_tools(tb, topk=10) - assert len(tools) == 1 - fn = tools[0] - assert fn["type"] == "function" - assert fn["function"]["name"] == "find_replace_chain" - assert fn["function"]["description"] == "Apply a chain of (old, new) replacements to a string." - params = fn["function"]["parameters"] - assert params["type"] == "object" - assert set(params["properties"].keys()) == {"s", "pairs"} - assert params["properties"]["s"]["type"] == "string" - assert params["properties"]["pairs"]["type"] == "array" - assert set(params["required"]) == {"s", "pairs"} - - -def test_schema_unannotated_falls_back_to_string(): - src = ( - "def f(x):\n" - " return x\n" - ) - tb = _make_toolbox_with(src, "f") - tools = toolbox_to_openai_tools(tb, topk=10) - assert tools[0]["function"]["parameters"]["properties"]["x"]["type"] == "string" - - -def test_schema_skips_varargs_kwargs(): - src = ( - "def f(*args, **kwargs):\n" - " return args\n" - ) - tb = _make_toolbox_with(src, "f") - tools = toolbox_to_openai_tools(tb, topk=10) - assert tools == [] - - -def test_schema_required_excludes_defaults(): - src = ( - "def f(x: int, y: int = 5):\n" - " return x + y\n" - ) - tb = _make_toolbox_with(src, "f") - tools = toolbox_to_openai_tools(tb, topk=10) - params = tools[0]["function"]["parameters"] - assert params["required"] == ["x"] - assert params["properties"]["y"]["type"] == "integer" - - -def test_schema_topk_respects_frequency(): - tb = TroVEToolbox() - for n, freq in [("a", 3), ("b", 2), ("c", 1)]: - tb.add( - { - "name": n, - "docstr": "", - "signature": f"def {n}()", - "function": f"def {n}():\n return 0\n", - "type": "function", - }, - example_idx=0, - ) 
- for _ in range(freq - 1): - tb.update_frequency(n, example_idx=0) - tools = toolbox_to_openai_tools(tb, topk=2) - assert [t["function"]["name"] for t in tools] == ["a", "b"] - - -def test_schema_empty_toolbox(): - assert toolbox_to_openai_tools(TroVEToolbox(), topk=10) == [] - - -# --------------------------------------------------------------------------- -# dispatch_tool_call -# --------------------------------------------------------------------------- - -def test_dispatch_runs_function_and_returns_stdout(): - src = ( - "def reverse_str(s):\n" - " return s[::-1]\n" - ) - tb = _make_toolbox_with(src, "reverse_str") - result = dispatch_tool_call(tb, _tool_call("reverse_str", {"s": "hello"})) - assert "olleh" in result - - -def test_dispatch_unknown_tool_returns_error(): - tb = TroVEToolbox() - result = dispatch_tool_call(tb, _tool_call("nonexistent", {})) - assert "not in toolbox" in result - - -def test_dispatch_bad_json_returns_error(): - src = "def f(x):\n return x\n" - tb = _make_toolbox_with(src, "f") - bad = SimpleNamespace( - id="x", - function=SimpleNamespace(name="f", arguments="{not json"), - ) - result = dispatch_tool_call(tb, bad) - assert "argument JSON parse failed" in result - - -def test_dispatch_sanitizes_harmony_contamination(): - src = "def reverse_str(s):\n return s[::-1]\n" - tb = _make_toolbox_with(src, "reverse_str") - tc = _tool_call("reverse_str<|channel|>commentary", {"s": "abc"}) - result = dispatch_tool_call(tb, tc) - assert "cba" in result - - -def test_dispatch_truncates_long_output(): - src = ( - "def long_output(n):\n" - " return 'x' * n\n" - ) - tb = _make_toolbox_with(src, "long_output") - result = dispatch_tool_call(tb, _tool_call("long_output", {"n": 10000})) - assert len(result) <= 4096 + 100 # +slack for repr quotes and truncation marker -``` - -- [ ] **Step 4.2: Run the tests to confirm they fail** - -Run: `python -m pytest symbolic_agent/baselines/trove/tests/test_tools_api.py -v` - -Expected: ImportError on `tools_api` 
module. - -- [ ] **Step 4.3: Create the `tools_api.py` module** - -Create `symbolic_agent/baselines/trove/tools_api.py`: - -```python -"""Translate the TroVE toolbox into OpenAI Chat Completions tool schemas -and dispatch tool calls back through the executor. - -This module is the bridge between TroVE's in-memory toolbox and vLLM's -native tool-calling protocol. It is invoked only from the IMPORT-with-tools -controller branch. -""" - -from __future__ import annotations - -import inspect -import json -import logging -from typing import Any - -from .executor import run_solution -from .toolbox import TroVEToolbox - -logger = logging.getLogger(__name__) - -_MAX_RESULT_CHARS = 4096 - -# Type inference: Python annotation -> JSON Schema type. -_TYPE_MAP = { - int: "integer", - float: "number", - bool: "boolean", - str: "string", - list: "array", - tuple: "array", - dict: "object", -} - - -def _infer_type(annotation: Any) -> str: - if annotation is inspect.Parameter.empty: - return "string" - # Plain types (int, str, etc.) - if annotation in _TYPE_MAP: - return _TYPE_MAP[annotation] - # typing.List, typing.Dict, etc. — fall through to string if unrecognised. - origin = getattr(annotation, "__origin__", None) - if origin in _TYPE_MAP: - return _TYPE_MAP[origin] - return "string" - - -def _function_to_schema(name: str, fn: Any, docstr: str) -> dict | None: - """ - Build one OpenAI tool dict from a callable. Returns None if the function - has *args or **kwargs (we cannot generate a meaningful schema). 
- """ - try: - sig = inspect.signature(fn) - except (TypeError, ValueError) as exc: - logger.debug("Could not introspect %s: %s", name, exc) - return None - - properties: dict = {} - required: list = [] - - for pname, param in sig.parameters.items(): - if param.kind in ( - inspect.Parameter.VAR_POSITIONAL, - inspect.Parameter.VAR_KEYWORD, - ): - logger.debug("Skipping %s — has *args/**kwargs", name) - return None - prop: dict = {"type": _infer_type(param.annotation)} - if param.default is not inspect.Parameter.empty: - if isinstance(param.default, (int, float, bool, str)): - prop["default"] = param.default - else: - required.append(pname) - properties[pname] = prop - - return { - "type": "function", - "function": { - "name": name, - "description": docstr or "", - "parameters": { - "type": "object", - "properties": properties, - "required": required, - }, - }, - } - - -def toolbox_to_openai_tools(toolbox: TroVEToolbox, topk: int = 10) -> list: - """ - Convert the top-k toolbox functions (by frequency) into OpenAI Chat - Completions tool dicts. - - Functions with *args / **kwargs are silently excluded. - Returns [] when the toolbox is empty. 
- """ - entries = toolbox.snapshot() - if not entries: - return [] - entries.sort(key=lambda e: -int(e.get("frequency", 0))) - selected = entries[:topk] - - namespace: dict = {} - try: - exec(toolbox.get_full_code(), namespace) - except Exception as exc: - logger.warning("Could not exec toolbox source for schema generation: %s", exc) - return [] - - tools: list = [] - for entry in selected: - name = entry.get("name", "") - if not name or name not in namespace: - continue - fn = namespace[name] - schema = _function_to_schema(name, fn, entry.get("docstr", "")) - if schema is not None: - tools.append(schema) - return tools - - -def _sanitize_name(name: str) -> str: - """Defensive workaround for vLLM PR #35906 (Harmony control tokens - leaking into tool names like `reverse_str<|channel|>commentary`).""" - return name.split("<|", 1)[0].strip() - - -def _truncate(s: str, limit: int = _MAX_RESULT_CHARS) -> str: - if len(s) <= limit: - return s - return s[:limit] + f"\n... [truncated {len(s) - limit} chars]" - - -def dispatch_tool_call(toolbox: TroVEToolbox, tool_call) -> str: - """ - Resolve `tool_call` against the toolbox, run it via the sandbox executor, - and return the captured stdout (truncated to 4096 chars) or an error - message string. Always returns a string — never raises. 
- """ - name = _sanitize_name(getattr(tool_call.function, "name", "") or "") - if not name: - return json.dumps({"error": "tool_call has no function name"}) - if name not in {e["name"] for e in toolbox.snapshot()}: - return json.dumps({"error": f"tool '{name}' not in toolbox"}) - - raw_args = getattr(tool_call.function, "arguments", "") or "{}" - try: - args = json.loads(raw_args) - if not isinstance(args, dict): - return json.dumps({"error": f"argument JSON parse failed: expected object, got {type(args).__name__}"}) - except json.JSONDecodeError as exc: - return json.dumps({"error": f"argument JSON parse failed: {exc}"}) - - call_expr = f"print(repr({name}(**{args!r})))" - is_ok, output = run_solution( - solution_code=call_expr, - tools_code="", - toolbox_code=toolbox.get_full_code(), - ) - if not is_ok: - return json.dumps({"error": "execution failed", "stderr": _truncate(output)}) - return _truncate(output) -``` - -- [ ] **Step 4.4: Run the tests to confirm they pass** - -Run: `python -m pytest symbolic_agent/baselines/trove/tests/test_tools_api.py -v` - -Expected: 11 passed. - -- [ ] **Step 4.5: Commit** - -```bash -git add symbolic_agent/baselines/trove/tools_api.py symbolic_agent/baselines/trove/tests/test_tools_api.py -git commit -m "$(cat <<'EOF' -feat(trove): add tools_api for native OpenAI tool calling - -- toolbox_to_openai_tools(toolbox, topk=10): converts top-k toolbox - functions into OpenAI Chat Completions tool schemas. Infers parameter - types from inspect.signature; functions with *args/**kwargs are - silently excluded. -- dispatch_tool_call(toolbox, tool_call): runs the requested function - in the sandbox executor, returns stdout truncated to 4096 chars or - a JSON error string. Sanitizes Harmony control-token contamination - in tool names (defensive vs. open vLLM PR #35906).
-EOF -)" -``` - ---- - -## Task 5: `chat_with_tools` method on `TroVELLMClient` - -**Files:** -- Modify: `symbolic_agent/baselines/trove/llm.py` (add new method, no signature changes to existing methods) - -This task has no automated test; the multi-turn loop is validated by the controller-level integration plus the smoke run. - -- [ ] **Step 5.1: Add `chat_with_tools` to `TroVELLMClient`** - -In `symbolic_agent/baselines/trove/llm.py`, add the following imports near the top (extend the existing `typing` import if there is one; the method also relies on `json`, `time`, and `logger`, so confirm those are already available in the module): - -```python -from typing import Any, Callable, Dict, List, Optional -``` - -Then add the new method to the `TroVELLMClient` class (insert after `_call_openai`, before `_record`): - -```python - # ------------------------------------------------------------------ - # Native tool calling (OpenAI/vLLM only) - # ------------------------------------------------------------------ - - def chat_with_tools( - self, - messages: List[Dict[str, Any]], - tools: List[Dict[str, Any]], - model: str, - max_tokens: int = DEFAULT_MAX_TOKENS, - max_tool_iters: int = 8, - on_tool_call: Optional[Callable[[Any], str]] = None, - tag: str = "", - ) -> Dict[str, Any]: - """ - Multi-turn chat completion that supports native OpenAI tool calls. - - Returns - ------- - { - "final_text": str, # message.content (or reasoning_content fallback) - "tool_calls": list[dict], # ordered, each {name, args_preview, result_preview, ok} - "iterations": int, # number of round-trips actually used - "stopped_reason": str, # "no_tool_calls" | "max_iters" | "error" - } - - The caller is responsible for providing `on_tool_call(tc) -> str`, - which is invoked for every tool_call returned by the model. The - return value (already a string) is sent back as the tool message. - - Anthropic backend is not supported; this method exists for the - OpenAI/vLLM tool-calling flow only.
It raises NotImplementedError - on Anthropic as a defensive guard; controllers must check - `self.backend == "openai"` before calling. - """ - if self.backend != "openai": - raise NotImplementedError("chat_with_tools requires the openai backend") - - if on_tool_call is None: - raise ValueError("chat_with_tools requires an on_tool_call callback") - - recorded_calls: List[Dict[str, Any]] = [] - convo: List[Dict[str, Any]] = list(messages) - iterations = 0 - final_text = "" - stopped_reason = "no_tool_calls" - - for it in range(max_tool_iters + 1): - iterations = it + 1 - iter_tag = f"{tag}_iter{it}" if tag else f"iter{it}" - response = None - last_exc = None - - for attempt in range(3): - try: - response = self._client.chat.completions.create( - model=model, - max_tokens=max_tokens, - messages=convo, - tools=tools, - tool_choice="auto", - ) - break - except Exception as exc: - last_exc = exc - if getattr(exc, "status_code", None) == 400: - logger.warning( - "OpenAI chat_with_tools 400 (tag=%s): %s", iter_tag, exc - ) - self._record(iter_tag, model, json.dumps(convo)[:2000], "", max_tokens, {}) - return { - "final_text": "", - "tool_calls": recorded_calls, - "iterations": iterations, - "stopped_reason": "error", - } - if attempt < 2: - wait = 5 * (2 ** attempt) - logger.warning( - "chat_with_tools failed (attempt %d/3, tag=%s): %s. 
Retrying in %ds.", - attempt + 1, iter_tag, exc, wait, - ) - time.sleep(wait) - - if response is None: - logger.warning("All chat_with_tools retries exhausted (tag=%s): %s", iter_tag, last_exc) - stopped_reason = "error" - break - - msg = response.choices[0].message - content = msg.content or getattr(msg, "reasoning_content", "") or "" - tool_calls = getattr(msg, "tool_calls", None) or [] - - u = getattr(response, "usage", None) - details = getattr(u, "completion_tokens_details", None) - usage = { - "input_tokens": getattr(u, "prompt_tokens", 0) or 0, - "output_tokens": getattr(u, "completion_tokens", 0) or 0, - "reasoning_tokens": getattr(details, "reasoning_tokens", 0) or 0 if details else 0, - } - self._record( - iter_tag, - model, - json.dumps(convo)[:2000], - json.dumps({"content": content, "tool_calls_count": len(tool_calls)}), - max_tokens, - usage, - ) - - if not tool_calls: - final_text = content - stopped_reason = "no_tool_calls" - break - - assistant_msg: Dict[str, Any] = { - "role": "assistant", - "content": content, - "tool_calls": [ - { - "id": tc.id, - "type": "function", - "function": { - "name": tc.function.name, - "arguments": tc.function.arguments, - }, - } - for tc in tool_calls - ], - } - convo.append(assistant_msg) - - for tc in tool_calls: - try: - result = on_tool_call(tc) - ok = True - except Exception as exc: - result = json.dumps({"error": f"on_tool_call raised: {exc}"}) - ok = False - args_preview = (tc.function.arguments or "")[:200] - result_preview = (result or "")[:200] - recorded_calls.append( - { - "name": tc.function.name, - "args_preview": args_preview, - "result_preview": result_preview, - "ok": ok, - } - ) - convo.append( - { - "role": "tool", - "tool_call_id": tc.id, - "content": result, - } - ) - - if it >= max_tool_iters - 1: - stopped_reason = "max_iters" - final_text = content - break - - return { - "final_text": final_text, - "tool_calls": recorded_calls, - "iterations": iterations, - "stopped_reason": stopped_reason, - } 
-``` - -- [ ] **Step 5.2: Smoke-test the method does not break import** - -Run: `python -c "from symbolic_agent.baselines.trove.llm import TroVELLMClient; print(hasattr(TroVELLMClient, 'chat_with_tools'))"` - -Expected: `True`. - -- [ ] **Step 5.3: Smoke-test the Anthropic guard fires** - -Run: `python -c "from symbolic_agent.baselines.trove.llm import TroVELLMClient; c = TroVELLMClient(backend='anthropic', api_key='unused'); -try: - c.chat_with_tools([], [], model='x', on_tool_call=lambda x: '') - print('no exception (BUG)') -except NotImplementedError as e: - print('guard fires:', e)"` - -Expected: `guard fires: chat_with_tools requires the openai backend`. - -- [ ] **Step 5.4: Commit** - -```bash -git add symbolic_agent/baselines/trove/llm.py -git commit -m "$(cat <<'EOF' -feat(trove): add TroVELLMClient.chat_with_tools for native tool calls - -Multi-turn loop that handles tool_calls returned by gpt-oss/vLLM: -appends assistant message + tool result messages until the model returns -no tool_calls or max_tool_iters is reached. Records each call as -{name, args_preview, result_preview, ok} for downstream telemetry. -Reuses the existing 3-attempt retry, debug logging, and token accounting. - -Anthropic backend raises NotImplementedError as a defensive guard; -controllers branch on self.backend == "openai" before calling. -EOF -)" -``` - ---- - -## Task 6: Controller IMPORT-with-tools branch + telemetry fields - -**Files:** -- Modify: `symbolic_agent/baselines/trove/controller.py` - -- [ ] **Step 6.1: Update imports and `__init__` signature** - -In `symbolic_agent/baselines/trove/controller.py`, replace the imports block at the top (currently lines 36-44) with: - -```python -import logging -from collections import Counter -from typing import Callable, Dict, List, Optional - -from . 
import tools_api -from .executor import run_solution -from .llm import TroVELLMClient -from .parse import count_ast_nodes, imported_callsites, parse_response -from .prompts import ( - build_create_prompt, - build_import_prompt, - build_import_with_tools_prompt, - build_skip_prompt, - get_question, -) -from .toolbox import TroVEToolbox -``` - -Then update `TroVEController.__init__` (currently around lines 78-105) to accept the two new parameters: - -```python - def __init__( - self, - api_key: Optional[str] = None, - model: str = "claude-sonnet-4-5", - base_url: Optional[str] = None, - debug_dir: Optional[str] = None, - k: int = DEFAULT_K, - trim_every: int = DEFAULT_TRIM_EVERY, - trim_C: float = 1.0, - temperature: float = 0.3, - top_p: float = 0.95, - task_family: str = "default", - selection: str = "reward", - max_tool_iters: int = 8, - tool_schema_topk: int = 10, - ): - self.model = model - self.k = k - self.trim_every = trim_every - self.trim_C = trim_C - self.task_family = task_family - self.selection = selection - self.max_tool_iters = max_tool_iters - self.tool_schema_topk = tool_schema_topk - - self.backend = "openai" if base_url else "anthropic" - self.llm = TroVELLMClient( - backend=self.backend, - base_url=base_url, - api_key=api_key, - temperature=temperature, - top_p=top_p, - debug_dir=debug_dir, - ) - self.toolbox = TroVEToolbox() - self._n_processed: int = 0 -``` - -(Note `trim_C` default is now 1.0 to match the toolbox change in Task 1; controllers passing the default get the new behavior.) 
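For orientation, the threshold that `trim_C` scales can be sketched as below. This is a minimal illustration of the `C·log_{20}(n)` rule only, not the toolbox's actual code: the names `trim_threshold` and `survivors` are hypothetical, and the sketch assumes that `n` is the number of tasks processed so far and that functions whose frequency falls below the threshold are dropped.

```python
import math


def trim_threshold(n_processed: int, trim_C: float = 1.0) -> float:
    """Return C * log_20(n), the minimum frequency needed to survive a trim.

    Hypothetical helper: assumes n counts the tasks processed so far.
    """
    if n_processed < 1:
        return 0.0  # nothing processed yet, so nothing can be trimmed
    return trim_C * math.log(n_processed, 20)


def survivors(frequencies: dict, n_processed: int, trim_C: float = 1.0) -> dict:
    """Keep entries whose frequency meets the threshold (assumed semantics)."""
    threshold = trim_threshold(n_processed, trim_C)
    return {name: freq for name, freq in frequencies.items() if freq >= threshold}
```

With `trim_C=1.0` the threshold reaches 1 after 20 tasks and 2 after 400, so a larger `trim_C` prunes rarely used functions sooner.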
- -- [ ] **Step 6.2: Update existing build_* call-sites to pass `task_family`** - -In `_multi_way_generation`, find each call to `build_create_prompt(question)` and `build_skip_prompt(question)` and the legacy `build_import_prompt(question, toolbox_str)`, replacing them with: - -```python - prompt = build_import_prompt(question, toolbox_str, task_family=self.task_family) -``` - -```python - prompt = build_create_prompt(question, task_family=self.task_family) -``` - -```python - prompt = build_skip_prompt(question, task_family=self.task_family) -``` - -Also update `parse_response(raw)` calls to `parse_response(raw, task_family=self.task_family)`. - -- [ ] **Step 6.3: Insert the IMPORT-with-tools branch in `_multi_way_generation`** - -Locate the `# --- IMPORT mode ---` section (currently around lines 254-274). Replace it with: - -```python - # --- IMPORT mode --- - toolbox_nonempty = bool(toolbox_str) - use_tools_branch = toolbox_nonempty and self.backend == "openai" - - if use_tools_branch: - import_candidates = self._generate_import_with_tools( - question, example_idx, reward_fn=reward_fn, entry=entry - ) - best_import_idx, best_import_score = self._select_best( - import_candidates, reward_fn=reward_fn, entry=entry - ) - best_import = import_candidates[best_import_idx] - best_import["_reward_score"] = best_import_score - elif toolbox_nonempty: - # Legacy text-based IMPORT (Anthropic or unforeseen non-OpenAI path). 
- import_candidates = [] - for _ in range(self.k): - prompt = build_import_prompt(question, toolbox_str, task_family=self.task_family) - raw = self.llm.call(prompt, self.model, max_tokens=DEFAULT_MAX_TOKENS, tag="trove_import") - parsed = parse_response(raw, task_family=self.task_family) - is_ok, out = run_solution( - parsed["solution_code"], - parsed["tools_code"], - self.toolbox.get_full_code(), - ) - import_candidates.append( - {**parsed, "is_success": is_ok, "exec_output": out, "tool_calls": [], "stopped_reason": "legacy"} - ) - best_import_idx, best_import_score = self._select_best( - import_candidates, reward_fn=reward_fn, entry=entry - ) - best_import = import_candidates[best_import_idx] - best_import["_reward_score"] = best_import_score - else: - best_import = { - "solution_code": "", "tools_code": "", "functions": [], - "is_success": False, "exec_output": "", - "tool_calls": [], "stopped_reason": "empty_toolbox", - "_reward_score": None, - } -``` - -- [ ] **Step 6.4: Add the `_generate_import_with_tools` method** - -Insert this new method into the `TroVEController` class, after `_multi_way_generation`: - -```python - def _generate_import_with_tools( - self, - question: str, - example_idx: int, - reward_fn: Optional[Callable] = None, - entry: Optional[dict] = None, - ) -> List[dict]: - """ - IMPORT-mode generation using native OpenAI tool calling. - Builds K trajectories; each trajectory may invoke toolbox functions - via tool_calls during the multi-turn loop. Returns K candidate dicts - compatible with _select_best. 
- """ - prompt = build_import_with_tools_prompt(question, task_family=self.task_family) - tools_schema = tools_api.toolbox_to_openai_tools(self.toolbox, topk=self.tool_schema_topk) - - candidates: List[dict] = [] - for i in range(self.k): - tag = f"trove_import_t{example_idx}_{i}" - messages = [{"role": "user", "content": prompt}] - on_tc = lambda tc: tools_api.dispatch_tool_call(self.toolbox, tc) - traj = self.llm.chat_with_tools( - messages=messages, - tools=tools_schema, - model=self.model, - max_tokens=DEFAULT_MAX_TOKENS, - max_tool_iters=self.max_tool_iters, - on_tool_call=on_tc, - tag=tag, - ) - parsed = parse_response(traj["final_text"], task_family=self.task_family) - is_ok, out = run_solution( - parsed["solution_code"], - parsed["tools_code"], - self.toolbox.get_full_code(), - ) - candidates.append( - { - **parsed, - "is_success": is_ok, - "exec_output": out, - "tool_calls": traj["tool_calls"], - "stopped_reason": traj["stopped_reason"], - "iterations": traj["iterations"], - } - ) - return candidates -``` - -- [ ] **Step 6.5: Wire `selection="consistency"` to the existing consistency selector** - -Replace `_select_best` (currently around lines 337-361) with: - -```python - def _select_best( - self, - candidates: List[dict], - reward_fn: Optional[Callable] = None, - entry: Optional[dict] = None, - ): - """ - Select the best candidate from a list of response dicts. - - Returns (best_index, score_or_None) where score is (reward, message) - when reward-based selection is used, or None otherwise. - - Selection strategy is governed by self.selection: - - "reward" (default): reward-based when reward_fn+entry provided, - falls back to consistency when not. - - "consistency": original TroVE majority-vote algorithm. 
- """ - if self.selection == "consistency": - return self._select_best_by_consistency(candidates), None - if reward_fn is not None and entry is not None: - return self._select_best_by_reward(candidates, reward_fn, entry) - return self._select_best_by_consistency(candidates), None -``` - -- [ ] **Step 6.6: Update `_update_library` to credit frequency from tool_calls** - -Replace `_update_library` (currently around lines 419-432) with: - -```python - def _update_library(self, mode: str, resp: dict, example_idx: int) -> None: - """Update toolbox based on winning mode (faithful to run_trove.py).""" - if mode == "import": - tool_calls = resp.get("tool_calls") or [] - if tool_calls: - # Native tool-calling path: credit by unique tool_call.function.name - # (defensive: sanitize and let toolbox.update_frequency filter unknowns). - unique_names = { - tc["name"].split("<|", 1)[0].strip() - for tc in tool_calls - if tc.get("name") - } - for name in unique_names: - if name: - self.toolbox.update_frequency(name, example_idx) - else: - # Legacy text-based IMPORT: credit functions parsed from **Tools**. - for func_dict in resp.get("functions", []): - name = func_dict.get("name", "") - if name: - self.toolbox.update_frequency(name, example_idx) - elif mode == "create" and resp.get("is_success"): - for func_dict in resp.get("functions", []): - self.toolbox.add(func_dict, example_idx) - - # SKIP: no library changes -``` - -- [ ] **Step 6.7: Add telemetry fields to `_make_result`** - -Replace `_make_result` (currently around lines 438-480) with: - -```python - def _make_result( - self, - task_input: dict, - task_type: str, - best_mode: str, - best_resp: dict, - is_success: bool, - output: str, - best_reward_score=None, - ) -> dict: - """ - Build a result dict compatible with main.py's _print_result() and - _append_task_output(). Adds passive TroVE telemetry fields. 
- """ - tool_calls = best_resp.get("tool_calls") or [] - tools_called = sorted({ - tc["name"].split("<|", 1)[0].strip() - for tc in tool_calls - if tc.get("name") - }) - candidate_names = {e["name"] for e in self.toolbox.snapshot()} - actually_called = sorted( - imported_callsites( - solution_code=best_resp.get("solution_code", ""), - tools_code=best_resp.get("tools_code", ""), - candidate_names=candidate_names, - ) - ) - import_eligible = len(self.toolbox) > 0 # state AFTER this task's update - # Note: import_eligible reflects the current toolbox state after - # _update_library has already run for this task. The analyzer should - # interpret this as "a non-empty toolbox existed at some point during - # this task's processing". For pre-task eligibility, infer from - # toolbox snapshots in adjacent tasks. - - return { - "task_type": task_type, - "original_prompt": str(task_input), - "solved": is_success, - "steps": 1, - "trace": [ - { - "step": 0, - "agent": "trove", - "action": best_mode, - "is_success": is_success, - } - ], - "solution": best_resp.get("solution_code", ""), - "library_snapshot": self.toolbox.snapshot(), - "cost_summary": {}, - "final_output": { - "answer": output, - "explanation": f"TroVE mode={best_mode}", - "confidence": "high" if is_success else "low", - "execution_result": output, - }, - "agent_messages": self.llm.get_task_log(), - "reward_history": [], - "best_reward": None, - "final_reward": None, - "_best_reward_score": best_reward_score, - # TroVE native-tool-calling telemetry - "won_mode": best_mode, - "import_eligible": import_eligible, - "import_was_winner": best_mode == "import", - "tool_calls": tool_calls, - "tool_call_count": len(tool_calls), - "tools_called": tools_called, - "actually_called": actually_called, - "trove_stopped_reason": best_resp.get("stopped_reason", ""), - } -``` - -- [ ] **Step 6.8: Sanity-check the controller imports and constructs** - -Run: `python -c "from symbolic_agent.baselines.trove.controller import 
TroVEController; c = TroVEController(api_key='unused', model='x', task_family='pbebench', selection='reward'); print(c.task_family, c.selection, c.backend, c.max_tool_iters, c.tool_schema_topk)"` - -Expected: `pbebench reward anthropic 8 10`. - -- [ ] **Step 6.9: Run all tests to confirm no regressions** - -Run: `python -m pytest symbolic_agent/baselines/trove/tests/ -v` - -Expected: 16 passed (6 from parse_callsites + 10 from tools_api). If the count differs, check it against the tests added in the earlier tasks. - -- [ ] **Step 6.10: Commit** - -```bash -git add symbolic_agent/baselines/trove/controller.py -git commit -m "$(cat <<'EOF' -feat(trove): controller branch for native IMPORT tool calling - -- Add task_family and selection params to TroVEController.__init__. -- IMPORT branch dispatches to _generate_import_with_tools when toolbox - is non-empty and backend is openai; otherwise falls back to legacy - text-based IMPORT. -- _generate_import_with_tools builds K multi-turn trajectories via - TroVELLMClient.chat_with_tools, parses **Solution** strictly for - pbebench, and runs the result through the executor. -- _update_library credits frequency by unique tool_call.function.name - for the native path; legacy path still credits parsed functions. -- _make_result emits won_mode, import_eligible, import_was_winner, - tool_calls, tool_call_count, tools_called, actually_called, - trove_stopped_reason as passive telemetry. -- _select_best honors selection="consistency" or "reward" (default).
-EOF -)" -``` - ---- - -## Task 7: `main.py` CLI flags (`--trove-selection`, `--trove-task-family`) - -**Files:** -- Modify: `main.py:794-810` (add new flags) and `main.py:1002-1011` (pass through to controller) - -- [ ] **Step 7.1: Add the two new argparse flags** - -In `main.py`, after the existing `--trove-trim-every` argument (around line 810), insert: - -```python - parser.add_argument( - "--trove-selection", - choices=["reward", "consistency"], - default="reward", - help="[TroVE] Candidate selection strategy. 'reward' (default) uses " - "the per-task reward function with AST tie-breaking. " - "'consistency' uses the original TroVE majority-vote algorithm. " - "(default: reward)", - ) - parser.add_argument( - "--trove-task-family", - choices=["default", "pbebench"], - default="default", - help="[TroVE] Task family for prompt selection and parser strictness. " - "'pbebench' uses PBEBench-shaped few-shots and strict **Solution** " - "parsing (no fallback to any python block). (default: default)", - ) -``` - -- [ ] **Step 7.2: Plumb the flags into the `TroVEController` constructor** - -Find the `elif args.framework == "trove":` block (around line 1002) and replace the `controller = TroVEController(...)` call with: - -```python - elif args.framework == "trove": - controller = TroVEController( - api_key=api_key, - model=model, - base_url=base_url, - debug_dir=args.debug_dir, - k=args.trove_k, - trim_every=args.trove_trim_every, - task_family=args.trove_task_family, - selection=args.trove_selection, - ) - logger.info( - "Framework: TroVE (k=%d, trim_every=%d, task_family=%s, selection=%s)", - args.trove_k, args.trove_trim_every, args.trove_task_family, args.trove_selection, - ) -``` - -- [ ] **Step 7.3: Sanity-check the CLI parses both flags** - -Run: `python main.py --help 2>&1 | grep -E "trove-selection|trove-task-family"` - -Expected: two lines, one for each new flag, both showing the choices and defaults. 
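If a quicker check than `--help` is useful, the two flags can also be exercised in isolation. The sketch below rebuilds only these two arguments in a standalone parser (the helper name `build_parser` is hypothetical and is not part of `main.py`), so it confirms the choices and defaults without touching the rest of the CLI.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Standalone parser holding only the two TroVE flags (illustrative)."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--trove-selection", choices=["reward", "consistency"], default="reward"
    )
    parser.add_argument(
        "--trove-task-family", choices=["default", "pbebench"], default="default"
    )
    return parser


# With no arguments, both flags fall back to their defaults.
defaults = build_parser().parse_args([])
# Explicit values must be one of the declared choices.
overridden = build_parser().parse_args(
    ["--trove-selection", "consistency", "--trove-task-family", "pbebench"]
)
print(defaults.trove_selection, defaults.trove_task_family)
print(overridden.trove_selection, overridden.trove_task_family)
```

argparse maps the dashes in each flag name to underscores, which is why the controller call reads `args.trove_selection` and `args.trove_task_family`.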
- -- [ ] **Step 7.4: Sanity-check controller wires through** - -Construct an empty tasks file so the run finishes immediately after parsing args: - -```bash -echo '[]' > /tmp/_pbebench_empty.json -VLLM_API_KEY=EMPTY python main.py \ - --framework trove \ - --trove-task-family pbebench \ - --trove-selection reward \ - --tasks-file /tmp/_pbebench_empty.json \ - --model openai/gpt-oss-20b \ - --backend vllm \ - --base-url http://localhost:8000/v1 \ - 2>&1 | grep -E "Framework: TroVE|ERROR" | head -5 -``` - -Expected: `Framework: TroVE (k=5, trim_every=500, task_family=pbebench, selection=reward)` then an `ERROR: no records found` from the loader. Both confirm the flags parsed and the controller was constructed. - -- [ ] **Step 7.5: Commit** - -```bash -git add main.py -git commit -m "$(cat <<'EOF' -feat(trove): CLI flags --trove-selection and --trove-task-family - -- --trove-selection {reward,consistency} (default: reward). -- --trove-task-family {default,pbebench} (default: default). Plumbed - through to TroVEController; PBEBench runs should pass --trove-task-family - pbebench to enable PBEBench-shaped few-shots and strict **Solution** - parsing. 
-EOF -)" -``` - ---- - -## Task 8: Update vLLM launcher script with tool-calling flags - -**Files:** -- Modify: `scripts/launch_vllm_gpt_oss_120b.sh` - -- [ ] **Step 8.1: Add the three vLLM flags** - -Replace the body of `scripts/launch_vllm_gpt_oss_120b.sh` with: - -```bash -#!/bin/bash - -mkdir -p /tmp/$USER-tiktoken-cache /tmp/$USER-tmp -chmod 700 /tmp/$USER-tiktoken-cache /tmp/$USER-tmp -export TIKTOKEN_CACHE_DIR=/tmp/$USER-tiktoken-cache -export TMPDIR=/tmp/$USER-tmp - -ts=$(date +%Y%m%d_%H%M%S) - -# Required vLLM tool-calling flags (vLLM >= v0.16.0 for PR #28729): -# --enable-auto-tool-choice enables tool_choice="auto" -# --tool-call-parser openai parses gpt-oss Harmony commentary channel -# --reasoning-parser openai_gptoss routes analysis-channel content into -# message.reasoning_content -nohup python -m vllm.entrypoints.openai.api_server \ - --model "openai/gpt-oss-120b" \ - --tokenizer "openai/gpt-oss-120b" \ - --dtype auto \ - --port ${1} \ - --gpu-memory-utilization 0.95 \ - --tensor-parallel-size 2 \ - --enable-auto-tool-choice \ - --tool-call-parser openai \ - --reasoning-parser openai_gptoss \ - > vllm_logs/vllm_${1}_${ts}.log 2>&1 & echo $! > vllm_logs/vllm_${1}_${ts}.pid -``` - -- [ ] **Step 8.2: Lint the script** - -Run: `bash -n scripts/launch_vllm_gpt_oss_120b.sh && echo OK` - -Expected: `OK`. - -- [ ] **Step 8.3: Commit** - -```bash -git add scripts/launch_vllm_gpt_oss_120b.sh -git commit -m "$(cat <<'EOF' -chore(launcher): enable native tool calling for gpt-oss-120b vLLM server - -Add three flags required for OpenAI-compatible tool calling on gpt-oss -served by vLLM >= v0.16.0: - --enable-auto-tool-choice - --tool-call-parser openai - --reasoning-parser openai_gptoss - -Without these the controller's chat_with_tools loop sees no tool_calls -in the response and degrades to no-tool behavior. 
-EOF -)" -``` - ---- - -## Task 9: `scripts/analyze_trove_run.py` - -**Files:** -- Create: `scripts/analyze_trove_run.py` - -- [ ] **Step 9.1: Create the analysis script** - -Create `scripts/analyze_trove_run.py`: - -```python -#!/usr/bin/env python3 -"""Post-hoc analysis of a TroVE run JSONL output. - -Reads the per-task JSONL file produced by main.py --output-file and reports: - - Overall accuracy - - Final toolbox size - - Per-mode wins - - IMPORT-mode tool-use breakdown - - Top-10 most-called toolbox functions - -Usage: - python scripts/analyze_trove_run.py path/to/results.jsonl -""" - -from __future__ import annotations - -import argparse -import json -import sys -from collections import Counter -from pathlib import Path - - -def _load_rows(path: Path) -> list[dict]: - rows = [] - with path.open() as f: - for lineno, line in enumerate(f, 1): - line = line.strip() - if not line: - continue - try: - rows.append(json.loads(line)) - except json.JSONDecodeError as exc: - print(f"warning: line {lineno} is not valid JSON: {exc}", file=sys.stderr) - return rows - - -def _result_dict(row: dict) -> dict: - """Tolerant accessor: results are nested under 'result' in main.py's output.""" - return row.get("result") or row - - -def main() -> None: - parser = argparse.ArgumentParser(description=__doc__) - parser.add_argument("path", type=Path, help="Path to the TroVE results JSONL file") - args = parser.parse_args() - - rows = _load_rows(args.path) - if not rows: - print("ERROR: no rows loaded", file=sys.stderr) - sys.exit(1) - - n = len(rows) - results = [_result_dict(r) for r in rows] - - # Overall accuracy - solved = sum(1 for r in results if r.get("solved")) - print(f"=== Run summary: {args.path.name} ===") - print(f"Tasks: {n}") - print(f"Solved: {solved}/{n} ({100 * solved / n:.1f}%)") - - # Final toolbox size — take the snapshot from the last row. 
- last_snapshot = results[-1].get("library_snapshot") or [] - print(f"Final toolbox size: {len(last_snapshot)}") - - # Per-mode wins - mode_counter = Counter(r.get("won_mode", "?") for r in results) - print(f"Mode wins: {dict(mode_counter)}") - - # IMPORT-mode tool-use breakdown - import_eligible = [r for r in results if r.get("import_eligible")] - if not import_eligible: - print("No IMPORT-eligible tasks observed.") - else: - with_calls = [r for r in import_eligible if (r.get("tool_call_count") or 0) >= 1] - n_eligible = len(import_eligible) - n_with = len(with_calls) - mean_calls = ( - sum((r.get("tool_call_count") or 0) for r in import_eligible) / n_eligible - ) - all_calls = [tc for r in import_eligible for tc in (r.get("tool_calls") or [])] - n_calls_total = len(all_calls) - n_calls_ok = sum(1 for tc in all_calls if tc.get("ok")) - success_rate = (100 * n_calls_ok / n_calls_total) if n_calls_total else 0.0 - print( - f"IMPORT-eligible tasks: {n_eligible}\n" - f" Tasks with >=1 tool call: {n_with}/{n_eligible} ({100 * n_with / n_eligible:.1f}%)\n" - f" Mean tool calls / task: {mean_calls:.2f}\n" - f" Tool-call success rate: {n_calls_ok}/{n_calls_total} ({success_rate:.1f}%)" - ) - - # Top-10 most-called functions - name_counter: Counter = Counter() - for r in results: - for tc in r.get("tool_calls") or []: - name = (tc.get("name") or "").split("<|", 1)[0].strip() - if name: - name_counter[name] += 1 - if name_counter: - print("Top-10 most-called toolbox functions:") - for name, cnt in name_counter.most_common(10): - print(f" {cnt:4d} {name}") - else: - print("No tool calls recorded in this run.") - - -if __name__ == "__main__": - main() -``` - -- [ ] **Step 9.2: Make the script executable and lint-check** - -Run: `chmod +x scripts/analyze_trove_run.py && python -c "import ast; ast.parse(open('scripts/analyze_trove_run.py').read())" && echo OK` - -Expected: `OK`. 
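The name aggregation in the analysis script relies on the same `<|` truncation used elsewhere in this plan. A minimal standalone sketch of that step is shown here; the helper names `sanitize_name` and `count_tool_names` are illustrative, and the sample records are made up to mirror the `reverse_str<|channel|>commentary` leak described in the `_sanitize_name` docstring.

```python
from collections import Counter


def sanitize_name(name: str) -> str:
    """Drop anything from the first '<|' onward, then trim whitespace."""
    return name.split("<|", 1)[0].strip()


def count_tool_names(tool_calls: list) -> Counter:
    """Count calls per sanitized tool name, skipping empty names."""
    counter: Counter = Counter()
    for tc in tool_calls:
        name = sanitize_name(tc.get("name") or "")
        if name:
            counter[name] += 1
    return counter


# Made-up records: one clean name, one contaminated name, one empty name.
sample = [
    {"name": "reverse_str<|channel|>commentary"},
    {"name": "reverse_str"},
    {"name": ""},
]
counts = count_tool_names(sample)
print(counts.most_common(10))
```

Both the clean and the contaminated record are credited to `reverse_str`, so a leaked control token does not split one function into two rows of the top-10 table.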
- -- [ ] **Step 9.3: Smoke-test on synthetic data** - -Run: - -```bash -python -c " -import json, tempfile, subprocess -rows = [ - {'result': {'solved': True, 'won_mode': 'import', 'import_eligible': True, 'tool_call_count': 2, 'tool_calls': [{'name':'find_replace_chain','ok':True},{'name':'find_replace_chain','ok':True}], 'library_snapshot':[{'name':'find_replace_chain'}]}}, - {'result': {'solved': False, 'won_mode': 'create', 'import_eligible': False, 'tool_call_count': 0, 'tool_calls': [], 'library_snapshot':[{'name':'find_replace_chain'}]}}, -] -with tempfile.NamedTemporaryFile('w', suffix='.jsonl', delete=False) as f: - for r in rows: f.write(json.dumps(r) + '\n') - p = f.name -print(subprocess.check_output(['python','scripts/analyze_trove_run.py', p]).decode()) -" -``` - -Expected output contains `Solved: 1/2 (50.0%)`, `Final toolbox size: 1`, `Mode wins: {'import': 1, 'create': 1}`, `IMPORT-eligible tasks: 1`, `Tool-call success rate: 2/2 (100.0%)`, and a row `2 find_replace_chain` in the top-10. - -- [ ] **Step 9.4: Commit** - -```bash -git add scripts/analyze_trove_run.py -git commit -m "$(cat <<'EOF' -feat(trove): add analyze_trove_run.py for post-hoc telemetry reports - -Reads a TroVE JSONL output and reports overall accuracy, final toolbox -size, per-mode wins, IMPORT-mode tool-use breakdown (>=1 call rate, -mean calls/task, success rate), and the top-10 most-called toolbox -functions. Sanitizes Harmony control-token contamination in tool names -when aggregating. 
-EOF -)" -``` - ---- - -## Task 10: Rewrite `docs/deviations.md` - -**Files:** -- Create: `symbolic_agent/baselines/trove/docs/deviations.md` - -- [ ] **Step 10.1: Create the directory and the deviations doc** - -Create `symbolic_agent/baselines/trove/docs/deviations.md`: - -```markdown -# TroVE Implementation: Deviations and Faithful Elements - -This document tracks how this port differs from — and where it stays -faithful to — the original TroVE algorithm -([Wang et al., 2024](https://arxiv.org/abs/2401.12869), -[zorazrw/trove](https://github.com/zorazrw/trove)). - -## 1. Algorithmic deviations - -### 1.1 Native OpenAI tool calling for IMPORT mode -The original TroVE shows the model a `**Toolbox**` markdown block -listing top-k function signatures and asks it to write a `**Solution**` -plus `**Tools**` block referencing those functions by name. We replace -this for the IMPORT mode (when `backend == "openai"` and the toolbox is -non-empty) with **native OpenAI tool calling**: the toolbox is exposed -via the `tools=[...]` parameter of `chat.completions.create`, the model -emits structured `tool_calls` during its reasoning, and `dispatch_tool_call` -runs each one in the sandboxed executor and returns the stdout. This -makes function usage observable and credit-able from the trajectory -itself. - -### 1.2 Reward-based candidate selection (default) -The paper uses self-consistency (majority vote on stdout, AST tie-break) -to pick the best of K samples per mode. We default to **reward-based -selection**: every candidate is scored by the per-task reward function, -ties broken by minimum AST node count. This is more reliable on -PBEBench (program-list outputs rarely tie as strings). The original -self-consistency selector remains available via `--trove-selection consistency`. 
- -### 1.3 PBEBench-shaped few-shot examples -For `task_family="pbebench"` we replace the generic CREATE / SKIP / IMPORT -example pairs with PBEBench-shaped pairs that demonstrate `replace()` -chains and a small reusable helper (`find_replace_chain`). The legacy -default examples remain for `task_family="default"`. - -### 1.4 Strict **Solution** parsing for PBEBench -The legacy parser falls back to "first ```python``` block anywhere" when -no `**Solution**` block is present. For `task_family="pbebench"` this -fallback is disabled, preventing CoT scratchpad from being accidentally -promoted to the answer. - -## 2. Faithful elements - -- 3-mode generation (IMPORT, CREATE, SKIP). -- K samples per mode (default K=5, paper). -- AST-tie-breaking by node count (simplest solution wins). -- Periodic toolbox trimming with threshold `C·log_{20}(n)`, default - `C=1.0`, matching the original implementation. -- Frequency-based top-k retrieval for the toolbox view. -- Dict-keyed toolbox structure mirroring `utils/code.py`. -- Library updates: IMPORT credits frequency, CREATE adds new functions - on success, SKIP makes no library changes. - -## 3. Infrastructural patches - -- **JSONL-per-task checkpointing** via `--output-file`, with crash - resumption. -- **`reasoning_content` fallback** in `_call_openai` for `gpt-oss` Harmony - channel splits where the answer text lives in `message.reasoning_content`. -- **Executor timeout 60s** (vs. 10s in earlier versions of this port), - closer to the original's ~100s. -- **`<|`-truncation sanitizer** in `dispatch_tool_call` and - `_update_library`. Defensive workaround for the open vLLM - [PR #35906](https://github.com/vllm-project/vllm/pull/35906) covering - Harmony control-token leakage into tool names. When that PR lands - upstream the sanitizer becomes a no-op and is left in place. - -## 4. 
Backend coverage caveat - -Anthropic backend code paths exist and are exercised by CREATE / SKIP and -the legacy text-based IMPORT fallback, but **the smoke run and reported -numbers are vLLM-served `gpt-oss` only**. IMPORT-with-tools requires -the OpenAI/vLLM backend and is the only path we test end-to-end. - -## 5. vLLM version requirement - -- Minimum vLLM: **v0.16.0** (branch-cut 2026-02-08). -- Required upstream change: [PR #28729](https://github.com/vllm-project/vllm/pull/28729) - ("Multiple fixes for gpt-oss Chat Completion prompting"), merged - 2025-12-12. v0.16.0 is the first stable release branch-cut after the merge. -- Known open caveat: [PR #35906](https://github.com/vllm-project/vllm/pull/35906) - ("Sanitize leaked Harmony control tokens"), still open as of late - March 2026 — see §3 for the sanitizer mitigation. -``` - -- [ ] **Step 10.2: Verify the file renders** - -Run: `head -20 symbolic_agent/baselines/trove/docs/deviations.md` - -Expected: the document renders with the title on the first line. - -- [ ] **Step 10.3: Commit** - -```bash -git add symbolic_agent/baselines/trove/docs/deviations.md -git commit -m "$(cat <<'EOF' -docs(trove): rewrite deviations.md for native tool calling - -Document algorithmic deviations (native OpenAI tool calling for IMPORT, -reward-based selection by default, PBEBench-shaped few-shots, strict -**Solution** parsing for pbebench), faithful elements (3-mode generation, -K-sampling, AST tie-break, C*log_20(n) trimming with C=1.0), and -infrastructural patches (JSONL checkpointing, reasoning_content -fallback, 60s executor timeout, defensive <|-truncation sanitizer). - -Includes vLLM version requirement (>= v0.16.0 for PR #28729) and the -backend coverage caveat (smoke run is vLLM-served gpt-oss only). -EOF -)" -``` - ---- - -## Task 11: Pre-flight sanity check + 50-task smoke run + report - -**Files:** none modified. This is the validation task. 
- -- [ ] **Step 11.1: Re-launch vLLM with the new flags** - -The existing launcher is named `launch_vllm_gpt_oss_120b.sh` but the spec calls for `gpt-oss-20b`. Two options — pick one: - -(a) **Smoke on 120b directly** (no script change beyond Task 8). Run: - -```bash -bash scripts/launch_vllm_gpt_oss_120b.sh 8000 -``` - -Then in Tasks 11.2 and 11.4, replace `--model openai/gpt-oss-20b` with `--model openai/gpt-oss-120b`. - -(b) **Smoke on 20b** (one-line edit). In `scripts/launch_vllm_gpt_oss_120b.sh`, change `openai/gpt-oss-120b` → `openai/gpt-oss-20b` for both `--model` and `--tokenizer`, and lower `--tensor-parallel-size 2` → `--tensor-parallel-size 1` (20b fits on one GPU). Then: - -```bash -bash scripts/launch_vllm_gpt_oss_120b.sh 8000 -``` - -(Do not commit the edit — restore the file before the final commit, or rename the script if you want the 20b variant kept.) - -Then wait 60–120 seconds and confirm the server is up: - -Run: `curl -sS http://localhost:8000/v1/models | head -5` - -Expected: a JSON response listing the model you launched. - -- [ ] **Step 11.2: Pre-flight: one-task smoke** - -Run a single task to verify the tool-calling round-trip works end-to-end. The codebase has no `--num-tasks` flag, so we slice the first row out of the 50-task PBEBench-Lite file: - -```bash -mkdir -p outputs/trove_pbebench_preflight -head -n 1 data/pbebench/lite_pilot_tasks.jsonl > /tmp/_pbebench_one.jsonl -VLLM_API_KEY=EMPTY python main.py \ - --framework trove \ - --tasks-file /tmp/_pbebench_one.jsonl \ - --output-file outputs/trove_pbebench_preflight/results.jsonl \ - --model openai/gpt-oss-20b \ - --backend vllm \ - --base-url http://localhost:8000/v1 \ - --trove-task-family pbebench \ - --trove-selection reward \ - --trove-k 3 \ - --trove-trim-every 9999 \ - --max-tokens 4096 \ - --debug-dir outputs/trove_pbebench_preflight/debug -``` - -Expected: the run completes without crashing. The output file should contain one row. 
- -- [ ] **Step 11.3: Verify the tool-calling pre-flight check** - -This task starts with an empty toolbox so the IMPORT-with-tools branch will not run. Inspect the most recent debug-dir log file with `trove_create` or `trove_skip` in the name and confirm it contains a non-empty response: - -Run: `ls -t outputs/trove_pbebench_preflight/debug/trove_run_*/0001_*.json | head -1 | xargs python -c "import json,sys; d=json.load(open(sys.argv[1])); print('content length:', len(d['response']['content']))"` - -Expected: non-zero content length. If zero, the `reasoning_content` fallback (Task 1.3) is not engaging — debug before proceeding. - -- [ ] **Step 11.4: Run the 50-task smoke** - -`data/pbebench/lite_pilot_tasks.jsonl` is exactly 50 PBEBench-Lite tasks with per-task `reward: pbebench`, so no slicing or `--default-reward` flag is required. - -```bash -mkdir -p outputs/trove_pbebench_smoke -VLLM_API_KEY=EMPTY python main.py \ - --framework trove \ - --tasks-file data/pbebench/lite_pilot_tasks.jsonl \ - --output-file outputs/trove_pbebench_smoke/results.jsonl \ - --model openai/gpt-oss-20b \ - --backend vllm \ - --base-url http://localhost:8000/v1 \ - --trove-task-family pbebench \ - --trove-selection reward \ - --trove-k 3 \ - --trove-trim-every 9999 \ - --max-tokens 4096 \ - --debug-dir outputs/trove_pbebench_smoke/debug -``` - -Expected: ~30–60 minutes wall-clock on local vLLM. Run completes without crashes. Auto-resume from checkpoint is supported by `--output-file` if the run is interrupted. - -- [ ] **Step 11.5: Run the analysis script and capture the report** - -Run: `python scripts/analyze_trove_run.py outputs/trove_pbebench_smoke/results.jsonl | tee outputs/trove_pbebench_smoke/report.txt` - -Expected: the report shows accuracy, toolbox size, mode wins, IMPORT-mode tool-use breakdown, and top-10 functions. 
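For orientation, the aggregates named in the expected report can be sketched as below. This is a minimal illustration, not the contents of `scripts/analyze_trove_run.py`. It uses only the telemetry fields named in the spec (`won_mode`, `import_eligible`, `tool_call_count`, `tool_calls[].name`, `tool_calls[].ok`). The helper names `load_rows` and `summarize` are hypothetical, and accuracy is omitted because the pass/fail field name is not given here.

```python
import json
from collections import Counter


def load_rows(path):
    """Read one JSON object per non-empty line of a JSONL results file."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def summarize(rows):
    """Aggregate per-task telemetry rows into run-level tool-use statistics."""
    eligible = [r for r in rows if r.get("import_eligible")]
    used = [r for r in eligible if r.get("tool_call_count", 0) >= 1]
    calls = [tc for r in rows for tc in (r.get("tool_calls") or [])]
    n_eligible = len(eligible)
    return {
        # How often each mode produced the winning candidate.
        "mode_wins": dict(Counter(r.get("won_mode") for r in rows)),
        # Share of IMPORT-eligible tasks with at least one tool call.
        "import_tool_use_rate": len(used) / n_eligible if n_eligible else 0.0,
        # Mean tool_call_count over IMPORT-eligible tasks.
        "mean_calls_per_eligible": (
            sum(r.get("tool_call_count", 0) for r in eligible) / n_eligible
            if n_eligible else 0.0
        ),
        # Fraction of recorded tool calls flagged ok.
        "tool_call_success_rate": (
            sum(1 for tc in calls if tc.get("ok")) / len(calls) if calls else 0.0
        ),
        # Ten most frequently called toolbox functions across the run.
        "top_functions": Counter(tc["name"] for tc in calls).most_common(10),
    }
```

Calling `summarize(load_rows("outputs/trove_pbebench_smoke/results.jsonl"))` would give a dict that can be compared by eye against `report.txt`.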
- -- [ ] **Step 11.6: Report numbers to the user (no prompt iteration)** - -Per the spec's "done criteria", report the contents of `outputs/trove_pbebench_smoke/report.txt` plus a short narrative paragraph noting any anomalies (e.g. `<|channel|>` contamination from PR #35906, `max_iters` stops, JSON-arg parse failures). - -**No prompt iteration. No threshold tuning. The numbers are what they are.** - ---- - -## Self-Review - -### 1. Spec coverage - -| Spec section | Implementing task | -|---|---| -| §3 Architecture overview | Tasks 1–8 collectively | -| §4 Data flow for IMPORT-with-tools | Tasks 4–6 | -| §5.1 New `tools_api.py` | Task 4 | -| §5.2 `_call_openai` reasoning fallback | Task 1 | -| §5.2 `chat_with_tools` method | Task 5 | -| §5.3 Controller `__init__` params, IMPORT branch, `_update_library`, `_make_result` | Task 6 | -| §5.4 `imported_callsites`, `task_family` in parse_response | Task 2 | -| §5.5 PBEBench prompts and IMPORT-with-tools prompt | Task 3 | -| §5.6 Trim `C=1.0` | Task 1 | -| §5.7 Executor timeout 60s | Task 1 | -| §5.8 main.py CLI flags | Task 7 | -| §5.9 vLLM launcher flags | Task 8 | -| §5.10 `analyze_trove_run.py` | Task 9 | -| §5.11 deviations.md rewrite | Task 10 | -| §6 Telemetry fields | Task 6.7 | -| §7 Implementation defaults | Tasks 4–6 | -| §8 Smoke run + done criteria | Task 11 | - -All sections accounted for. - -### 2. Placeholder scan - -No `TBD`, `TODO`, `implement later`, "appropriate", "various", or "fill in details" in any task. All test code is fully written (not "write tests for the above"). All file paths are exact. All commit messages are pre-written. - -### 3. Type and signature consistency - -- `imported_callsites(solution_code, tools_code, candidate_names)` — defined in Task 2, called in Task 6.7 with matching kwargs. -- `toolbox_to_openai_tools(toolbox, topk=10)` — defined in Task 4, called in Task 6.4. 
-- `dispatch_tool_call(toolbox, tool_call) -> str` — defined in Task 4, called via the `on_tc` closure in Task 6.4. -- `chat_with_tools(messages, tools, model, max_tokens, max_tool_iters, on_tool_call, tag)` — defined in Task 5, called in Task 6.4 with matching kwargs. -- `build_import_with_tools_prompt(question, task_family)` — defined in Task 3, called in Task 6.4. -- `build_import_prompt(question, toolbox_str, task_family)` — extended in Task 3, called in Task 6.3. -- `parse_response(text, task_family)` — extended in Task 2, called in Tasks 6.3 and 6.4. -- `TroVEController(__init__)` new params (`task_family`, `selection`, `max_tool_iters`, `tool_schema_topk`) — defined in Task 6.1, passed in Task 7.2 (only `task_family` and `selection` from CLI; the other two use defaults, which matches the spec's defaults table). - -All consistent. - -### 4. Plan quirks worth noting to the executor - -- Task 11.4 relies on the user's `task_index_25_direct_feedback.json` having at least 50 tasks. If it has fewer, swap to whichever PBEBench-Lite tasks file is available (the spec calls for "50 PBEBench-Lite tasks"; the exact filename is not load-bearing). -- Task 11.5 `tee` output captures the report for the user-facing message in 11.6. -- The `import_eligible` field in `_make_result` is computed *after* `_update_library` runs for the current task. The doc-comment in Task 6.7 explains the consequence; the analyzer in Task 9 doesn't depend on the pre-task value. -- Task 6.5's `_select_best` change wraps the existing reward/consistency selectors. When `selection="consistency"` is set, the `reward_fn` and `entry` arguments are ignored — that is intentional and matches the user's choice to keep both flags as opt-ins. - ---- - -## Execution Handoff - -Plan complete and saved to `docs/superpowers/plans/2026-04-25-trove-native-tool-calling.md`. Two execution options: - -**1. Subagent-Driven (recommended)** — I dispatch a fresh subagent per task, review between tasks, fast iteration. 
- -**2. Inline Execution** — Execute tasks in this session using executing-plans, batch execution with checkpoints. - -Which approach? diff --git a/docs/superpowers/specs/2026-04-25-trove-native-tool-calling-design.md b/docs/superpowers/specs/2026-04-25-trove-native-tool-calling-design.md deleted file mode 100644 index ff0fc0a1..00000000 --- a/docs/superpowers/specs/2026-04-25-trove-native-tool-calling-design.md +++ /dev/null @@ -1,374 +0,0 @@ -# TroVE Native Tool Calling — Design Spec - -**Date:** 2026-04-25 -**Branch:** `trove_baseline` -**Status:** Approved (sectional review complete; self-review applied) - ---- - -## 1. Problem statement - -Our existing TroVE port (`symbolic_agent/baselines/trove/*`) faithfully implements the original 3-mode generation (IMPORT / CREATE / SKIP) via free-form text prompts. When run on PBEBench with `gpt-oss-20b` / `gpt-oss-120b` served via vLLM, two failure modes are observed: - -1. **The toolbox is populated but never used.** The model emits `**Toolbox**` and `**Solution**` blocks that ignore previously-induced functions even when those functions match the task family. CoT shows the model "discovering" the same primitive sequence repeatedly. -2. **CoT and final code are decoupled.** Even when the prompt names a toolbox helper, the model's reasoning channel does not interleave concrete invocations of that helper — calls only appear (if at all) in the final code, with no per-call signal we can audit. - -The user requirement is: **the model's chain-of-thought should interleave with concrete function calls into the induced toolbox**. The mechanism for this is **native OpenAI tool calling**: the toolbox is exposed via the `tools=[...]` parameter of `chat.completions.create`, and the model emits structured `tool_calls` during its reasoning that vLLM dispatches back to us. Each tool call is real, auditable, and credited toward toolbox frequency. 
- -This spec adapts the IMPORT mode of TroVE to use that mechanism, while keeping CREATE and SKIP modes text-based and preserving the rest of the algorithm faithful to the paper. - ---- - -## 2. Goals and non-goals - -### Goals - -- IMPORT-mode trajectories are produced by a multi-turn loop where `gpt-oss` calls toolbox functions natively via `tool_calls`. -- Frequency credit reflects what the model actually called, not what appeared in text. -- The 3-way generation (IMPORT, CREATE, SKIP), K-sampling, reward-based candidate selection, AST tie-breaking, and `C·log_{20}(n)` trimming all remain faithful to the original TroVE algorithm. -- The smoke run produces enough telemetry (per-task `tool_calls` lists, per-mode wins, function-frequency table) to attribute any accuracy delta vs. the no-toolbox baseline to actual tool usage. -- "Done" = code complete + 50-task PBEBench-Lite smoke run on `gpt-oss-20b` + numbers reported. **No prompt iteration to chase performance targets.** - -### Non-goals - -- We do not change CREATE or SKIP mode generation. They remain single-shot text completions exactly as today. -- We do not pre-seed the toolbox. -- We do not change reward semantics, the PBEBench harness, or the executor's I/O contract. -- We do not chase a specific accuracy target. Per the original TroVE methodology, we report what the algorithm produces. -- We do not test or report Anthropic backend numbers — only vLLM-served `gpt-oss`. - ---- - -## 3. Architecture overview - -```mermaid -flowchart TD - Task[PBEBench task] --> Controller[TroVEController._multi_way_generation] - Controller --> ImportBranch{toolbox non-empty
AND backend == openai?} - Controller --> Create[CREATE mode<br/>K text-only completions] - Controller --> Skip[SKIP mode<br/>K text-only completions] - - ImportBranch -->|yes| ImportTools[_generate_import_with_tools<br/>K multi-turn tool-calling trajectories] - ImportBranch -->|no, Anthropic or empty| LegacyImport[legacy text-based IMPORT<br/>defensive fallback path] - - ImportTools --> ToolsApi[tools_api.toolbox_to_openai_tools<br/>top-k toolbox -> OpenAI tool schemas] - ImportTools --> ChatLoop[llm.chat_with_tools<br/>multi-turn loop, max_tool_iters=8] - ChatLoop --> Vllm[vLLM /v1/chat/completions<br/>--tool-call-parser openai<br/>--reasoning-parser openai_gptoss] - Vllm --> Dispatcher[tools_api.dispatch_tool_call<br/>sandbox execute via executor.run_solution] - Dispatcher --> ChatLoop - - ImportTools --> ImportCands[K IMPORT candidates<br/>final assistant text + tool_call trajectory] - Create --> CreateCands[K CREATE candidates] - Skip --> SkipCands[K SKIP candidates] - LegacyImport --> ImportCands - - ImportCands --> Pick[_select_best_by_reward<br/>tie-break by AST node count] - CreateCands --> Pick - SkipCands --> Pick - - Pick --> Library[_update_library<br/>credit frequency from tool_calls] - Library --> Trim[periodic toolbox.trim
C * log_20 n_processed, C=1.0] -``` - -**One-line summary:** Only the IMPORT branch changes. Everything else (CREATE, SKIP, K-sampling, selection, library updates, trimming) stays where the existing port already has it. - ---- - -## 4. Data flow for IMPORT-with-tools - -```mermaid -sequenceDiagram - participant Ctrl as TroVEController - participant Tools as tools_api - participant LLM as TroVELLMClient.chat_with_tools - participant vLLM - participant Exec as executor.run_solution - - Ctrl->>Tools: toolbox_to_openai_tools(toolbox, topk=10) - Tools-->>Ctrl: tools_schema (list[dict]) - Ctrl->>LLM: chat_with_tools(messages, tools_schema, model, max_tool_iters=8) - loop iter 1..N (N <= 8) - LLM->>vLLM: chat.completions.create(messages, tools=tools_schema) - vLLM-->>LLM: assistant message (content + reasoning_content + tool_calls) - alt tool_calls present - LLM->>Tools: dispatch_tool_call(toolbox, tool_call) - Tools->>Exec: run_solution(toolbox_src + call_expr, task_inputs) - Exec-->>Tools: stdout (truncated to 4096 chars) or error - Tools-->>LLM: tool result string - LLM->>LLM: append assistant + tool messages, loop - else no tool_calls - LLM-->>Ctrl: trajectory (final text + recorded tool_calls) - end - end - Ctrl->>Ctrl: parse **Solution** block from final text - Ctrl->>Ctrl: credit frequency by unique tool_call.function.name -``` - ---- - -## 5. Components - -### 5.1 New file: `symbolic_agent/baselines/trove/tools_api.py` - -Two pure functions; no state. - -**`toolbox_to_openai_tools(toolbox: TroVEToolbox, topk: int = 10) -> list[dict]`** - -- Selects the top-k entries by frequency (matching the existing `format_toolbox(topk=10)` view). -- For each entry, executes the toolbox source via `exec(toolbox.get_full_code(), namespace)` into a fresh dict, then reads `inspect.signature(namespace[fn_name])` to enumerate parameters and annotations. 
-- Builds an OpenAI `chat.completions` tool dict: - ```json - { - "type": "function", - "function": { - "name": "", - "description": "", - "parameters": { - "type": "object", - "properties": {"": {"type": ""}, ...}, - "required": [] - } - } - } - ``` -- Type inference: `int → integer`, `float → number`, `bool → boolean`, `list/tuple → array`, `dict → object`, anything else (or unannotated) → `string`. Numeric and string defaults: pass through to the schema as `default`. Anything else: omit the default. -- Functions with `*args` or `**kwargs` are excluded from the tool list (we cannot generate a meaningful schema; this is rare for induced TroVE helpers and is logged to the debug dir for inspection). - -**`dispatch_tool_call(toolbox: TroVEToolbox, tool_call) -> str`** - -- Sanitizes the tool name: `name = tool_call.function.name.split("<|", 1)[0]`. This is a defensive 2-line workaround for the open vLLM bug tracked by PR #35906 (Harmony control tokens leaking into tool names like `find_replace_chain<|channel|>commentary`). If/when #35906 lands upstream, this becomes a no-op. -- If `name` is not in `toolbox`, returns the JSON string `{"error": "tool '' not in toolbox"}` (the model can recover). -- Parses `tool_call.function.arguments` as JSON; on parse error returns `{"error": "argument JSON parse failed: "}`. -- Builds a one-liner call expression: `print(repr((**)))`. -- Runs `executor.run_solution` with `toolbox.get_full_code() + "\n" + call_expr` and `inputs={}`. (PBEBench task inputs are not needed at the function-call level — the model passes inputs as arguments.) -- Returns the captured stdout truncated to **4096 characters** (UTF-8 codepoints, not bytes — simpler to truncate without splitting a codepoint), or the error message on non-zero exit. 
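The schema mapping, name sanitizer, and result truncation described in §5.1 can be sketched roughly as follows. This is an illustration under stated assumptions rather than the actual `tools_api.py`: the helper names `function_to_tool_schema`, `sanitize_tool_name`, and `truncate_result` are hypothetical, the description is assumed to be the first docstring line, and parameters without a default are assumed to be the `required` ones.

```python
import inspect
import json
import typing

# Annotation -> JSON Schema type, per the mapping in §5.1; anything else is "string".
JSON_TYPES = {int: "integer", float: "number", bool: "boolean",
              list: "array", tuple: "array", dict: "object"}
MAX_RESULT_CHARS = 4096  # truncation limit stated in §5.1


def sanitize_tool_name(raw_name: str) -> str:
    """Cut the name at the first '<|' so leaked Harmony control tokens are dropped."""
    return raw_name.split("<|", 1)[0]


def truncate_result(stdout: str, limit: int = MAX_RESULT_CHARS) -> str:
    """Truncate by characters (codepoints), not bytes."""
    return stdout[:limit]


def function_to_tool_schema(fn):
    """Build an OpenAI-style tool dict for fn; return None for *args/**kwargs functions."""
    properties, required = {}, []
    for name, param in inspect.signature(fn).parameters.items():
        if param.kind in (param.VAR_POSITIONAL, param.VAR_KEYWORD):
            return None  # excluded: no meaningful schema can be generated
        # Generics such as list[str] reduce to their origin type (list).
        base = typing.get_origin(param.annotation) or param.annotation
        prop = {"type": JSON_TYPES.get(base, "string")}
        if param.default is inspect.Parameter.empty:
            required.append(name)  # assumption: no default means required
        elif type(param.default) in (int, float, str):
            prop["default"] = param.default  # numeric and string defaults only
        properties[name] = prop
    # Assumption: the tool description is the first line of the docstring.
    description = (inspect.getdoc(fn) or "").split("\n", 1)[0]
    return {"type": "function",
            "function": {"name": fn.__name__, "description": description,
                         "parameters": {"type": "object",
                                        "properties": properties,
                                        "required": required}}}


def find_replace_chain(s: str, pairs: list, count: int = -1) -> str:
    """Apply (old, new) replacements to s in order."""
    for old, new in pairs:
        s = s.replace(old, new, count)
    return s


if __name__ == "__main__":
    print(json.dumps(function_to_tool_schema(find_replace_chain), indent=2))
```

Unlike the spec, which `exec`s the toolbox source into a fresh namespace before inspecting it, the sketch reads the signature straight from a live function object to stay self-contained.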
- -### 5.2 Modify: `symbolic_agent/baselines/trove/llm.py` - -**`TroVELLMClient._call_openai`** - -- After reading `response.choices[0].message.content`, fall back to `getattr(response.choices[0].message, "reasoning_content", "")` when `content` is empty/None. This handles `gpt-oss` Harmony channel splits where the answer lands in the reasoning channel for non-tool-calling text completions (CREATE, SKIP, and legacy IMPORT). No change to the function signature. - -**New method: `TroVELLMClient.chat_with_tools(messages, tools, model, max_tokens, max_tool_iters=8, on_tool_call, tag) -> dict`** - -- Returns `{"final_text": str, "tool_calls": list[dict], "iterations": int, "stopped_reason": str}`. - - `final_text` is the assistant message content from the final iteration (`""` if none). - - `tool_calls` is the ordered list of recorded calls, each `{"name": str, "args_preview": str (≤200 chars), "result_preview": str (≤200 chars), "ok": bool}`. -- Implements the multi-turn loop: - 1. Append the user message. - 2. POST `chat.completions.create(model, messages, tools, tool_choice="auto", max_tokens)`. - 3. If `message.tool_calls` is empty: record `final_text` (with `reasoning_content` fallback) and return. - 4. Otherwise: append the assistant message verbatim, then for each `tool_call` invoke `on_tool_call(tool_call)` (the controller passes a closure that calls `tools_api.dispatch_tool_call`). Append a `{"role": "tool", "tool_call_id": ..., "content": }` message per call. - 5. Increment iteration counter; if `iterations >= max_tool_iters`, stop with `stopped_reason="max_iters"` and return what we have. -- Defensive guard: raises `NotImplementedError("chat_with_tools requires the openai backend")` on `self.backend == "anthropic"`. This guard is **never tripped in normal flow** because the controller branches on `self.backend == "openai"` before calling. It exists only to fail loudly if a future caller invokes the method directly. 
-- Uses the same 3-attempt retry, the same per-call debug logging (writing one JSON file per LLM round-trip into the existing `_debug_dir` with the tag suffixed by `_iter{n}`), and the same token accounting as `_call_openai`. - -### 5.3 Modify: `symbolic_agent/baselines/trove/controller.py` - -**`__init__`** — add two parameters: -- `task_family: str = "default"` — passed through to `prompts.build_*_prompt` and `parse.parse_response`. -- `selection: str = "reward"` — `"reward"` (default) uses the existing `_select_best_by_reward`; `"consistency"` uses the existing `_select_best_by_consistency`. - -**`_multi_way_generation`** — change the IMPORT branch only: -- If `self.backend == "openai"` AND `len(self.toolbox) > 0`: call new `_generate_import_with_tools(task, K)`. -- Else: call the existing legacy text-based IMPORT path. (Anthropic and empty-toolbox both fall through here; the latter is correct because there are no tools to expose anyway.) -- CREATE and SKIP branches: unchanged. - -**New method: `_generate_import_with_tools(task, K) -> list[Candidate]`** - -- Builds the IMPORT-with-tools prompt via `prompts.build_import_with_tools_prompt(task, task_family=self.task_family)` (no `**Toolbox**` markdown — the toolbox is conveyed via the `tools=[...]` parameter). -- Builds the tool schema once per task: `tools_schema = tools_api.toolbox_to_openai_tools(self.toolbox, topk=10)`. -- For `i in range(K)`, calls `self.llm.chat_with_tools(...)` with the tag `f"trove_import_{task.id}_{i}"`. -- Each returned trajectory becomes one Candidate. Solution code is parsed from the final text via `parse.parse_response(final_text, task_family="pbebench")` (strict `**Solution**` block; no fallback to "any python block"). -- Empty `final_text` → empty solution code → reward=0 → naturally loses in selection. 
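The multi-turn loop that `chat_with_tools` runs for each of the K trajectories (§5.2) can be sketched as below. To stay self-contained, the model call is replaced by a plain callable `create(messages)` that returns a dict-shaped assistant message; that stand-in, the function name `run_tool_loop`, the `"final"` stop reason, and the rule used to set `ok` are all assumptions of this sketch. Retries, debug logging, and token accounting are left out.

```python
from typing import Callable


def run_tool_loop(create: Callable[[list], dict],
                  messages: list,
                  on_tool_call: Callable[[dict], str],
                  max_tool_iters: int = 8) -> dict:
    """Call the model, dispatch any tool calls, feed results back, repeat."""
    recorded = []
    iterations = 0
    while iterations < max_tool_iters:
        message = create(messages)  # stand-in for chat.completions.create(...)
        iterations += 1
        tool_calls = message.get("tool_calls") or []
        if not tool_calls:
            # No tool calls: this is the final answer. Fall back to the
            # reasoning channel when the content channel is empty.
            final_text = message.get("content") or message.get("reasoning_content") or ""
            return {"final_text": final_text, "tool_calls": recorded,
                    "iterations": iterations, "stopped_reason": "final"}
        messages.append(message)  # keep the assistant turn verbatim
        for tc in tool_calls:
            result = on_tool_call(tc)
            recorded.append({
                "name": tc["function"]["name"],
                "args_preview": tc["function"]["arguments"][:200],
                "result_preview": result[:200],
                # Assumption: dispatcher failures come back as {"error": ...} JSON.
                "ok": not result.startswith('{"error"'),
            })
            messages.append({"role": "tool", "tool_call_id": tc["id"], "content": result})
    # Iteration budget exhausted while the model was still calling tools.
    return {"final_text": "", "tool_calls": recorded,
            "iterations": iterations, "stopped_reason": "max_iters"}
```

A trajectory that stops on `max_iters` has an empty `final_text`, so it parses to empty solution code and loses in selection, the same as any other empty answer.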
- -**`_update_library`** — for `mode == "import"`, credit frequency by **unique `tool_call.function.name`** entries in the trajectory: -- `unique_names = {sanitize(tc["name"]) for tc in trajectory.tool_calls}` where `sanitize` is the same `<|`-truncation used in `dispatch_tool_call` (defensive symmetry). -- For each name, call `self.toolbox.update_frequency(name, example_idx)`. Names not present in the toolbox are silently no-ops thanks to the existing filter at `toolbox.py:68` — hallucinated tool names contribute nothing to frequency. Real tool calls (names matching a toolbox entry) get one credit per task per unique name. - -**`_make_result`** — emit passive telemetry fields per task. Add to the result dict (no behavior changes): -- `won_mode: "import" | "create" | "skip"` -- `import_eligible: bool` (true iff toolbox was non-empty when the task ran) -- `import_was_winner: bool` -- `tool_calls: list[{name, args_preview, result_preview, ok}]` (only populated when the IMPORT-with-tools path ran) -- `tool_call_count: int` -- `tools_called: list[str]` (unique names actually called) -- `actually_called: list[str]` (functions from `toolbox` that appear as call-sites in the AST of the winning `**Solution**` code; computed via `parse.imported_callsites`) - -### 5.4 Modify: `symbolic_agent/baselines/trove/parse.py` - -**New helper: `imported_callsites(solution_code: str, tools_code: str, candidate_names: set[str]) -> set[str]`** -- AST-walks `solution_code`, returns the subset of `candidate_names` that appear as `Call` targets (handles bare `Name` and `Attribute` callees like `toolbox.find_replace_chain`). -- Used by `_make_result.actually_called`. - -**Modify `parse_response`** — add `task_family: str = "default"` parameter: -- For `task_family == "pbebench"`, do not fall back to `_extract_any_python_block` if the `**Solution**` block is missing — return empty solution code instead. 
This enforces strict format adherence and prevents the parser from accidentally promoting CoT scratchpad to the answer. -- For all other families, behavior is unchanged. - -### 5.5 Modify: `symbolic_agent/baselines/trove/prompts.py` - -- Add PBEBench-shaped few-shot examples: `_CREATE_EXAMPLE_PBEBENCH` and `_SKIP_EXAMPLE_PBEBENCH`. Each demonstrates a sequence of `replace()` operations and (in CREATE's case) a small reusable helper such as `find_replace_chain(s, pairs)` so the model has a concrete pattern to imitate. -- Add **`_IMPORT_INSTRUCTION_FOR_TOOLS`** and **`_IMPORT_EXAMPLE_FOR_TOOLS`**: the prompt for IMPORT-with-tools mode. These do *not* include a `**Toolbox**` markdown block (the toolbox is conveyed via the `tools=[...]` parameter). They instruct the model to use the available tools when helpful and to produce a final answer in a `**Solution**` block. -- Add **`build_import_with_tools_prompt(task, task_family)`** and refactor `build_import_prompt`, `build_create_prompt`, `build_skip_prompt` to accept `task_family` and dispatch to the appropriate example set. -- Make `_FORMAT_OVERRIDE` conditional: empty string for `task_family == "pbebench"` (the new PBEBench examples model the desired format directly); existing override for other families. - -### 5.6 Modify: `symbolic_agent/baselines/trove/toolbox.py` - -- `TroVEToolbox.trim`: change default `C: float = 0.5` → `C: float = 1.0` to match the original TroVE implementation. - -### 5.7 Modify: `symbolic_agent/baselines/trove/executor.py` - -- `DEFAULT_TIMEOUT = 10` → `DEFAULT_TIMEOUT = 60`. Closer to the original TroVE's ~100s; gives PBEBench's `replace()`-chain solutions and the multi-turn tool dispatch enough headroom on local vLLM. - -### 5.8 Modify: `main.py` - -- Add CLI flag `--trove-selection {reward,consistency}` with `default="reward"`. Plumb to `TroVEController(selection=args.trove_selection)`. -- When `--dataset pbebench` is specified, pass `task_family="pbebench"` to the controller. 
Otherwise pass `"default"`. - -### 5.9 Modify: `scripts/launch_vllm_gpt_oss_120b.sh` - -Add three flags to the `vllm.entrypoints.openai.api_server` invocation: -- `--enable-auto-tool-choice` — enables `tool_choice="auto"` to actually fire tool calls. -- `--tool-call-parser openai` — the parser that knows how to extract `tool_calls` from the `gpt-oss` Harmony commentary channel. -- `--reasoning-parser openai_gptoss` — routes Harmony analysis-channel content into `message.reasoning_content` rather than dropping it. - -### 5.10 New file: `scripts/analyze_trove_run.py` - -Read a TroVE JSONL output and print: -- Overall accuracy (pass rate). -- Final toolbox size. -- Per-mode wins (counts of `won_mode == "import"`, `"create"`, `"skip"`). -- IMPORT-mode behavior breakdown: - - Tasks with `import_eligible == True` and `tool_call_count >= 1` (rate). - - Mean `tool_call_count` across IMPORT-eligible tasks. - - Tool-call success rate: fraction of `tool_calls` entries with `ok == True`. -- Top-10 most-called toolbox functions (by total call count across the run). - -### 5.11 Rewrite: `symbolic_agent/baselines/trove/docs/deviations.md` - -(Path may need creation if it doesn't exist.) Four sections: - -1. **Algorithmic deviations:** - - Native OpenAI tool calling for IMPORT mode (replaces the original text-based "model selects from `**Toolbox**` markdown" mechanism). - - Reward-based candidate selection by default (vs. self-consistency in the paper); self-consistency available via `--trove-selection consistency`. - - PBEBench-shaped few-shot examples in CREATE and SKIP prompts. - -2. **Faithful elements:** 3-mode generation, K-sampling per mode, AST-tie-breaking by node count, `C·log_{20}(n)` periodic trimming with `C=1.0`, frequency-based top-k retrieval for the toolbox view, dict-keyed toolbox structure mirroring `utils/code.py`. - -3.
**Infrastructural patches:** JSONL-per-task checkpointing, `reasoning_content` fallback in `_call_openai`, executor timeout 60s, defensive `<|`-truncation sanitizer in the tool-call dispatcher (workaround for open vLLM PR #35906 covering Harmony control-token leakage). - -4. **Backend coverage caveat:** Anthropic backend code paths are still present and exercised by CREATE / SKIP / legacy IMPORT, but the smoke run and reported numbers are vLLM-served `gpt-oss` only. IMPORT-with-tools requires the OpenAI/vLLM backend. - ---- - -## 6. Telemetry to be collected - -Per task (in the JSONL row): - -| Field | Type | Source | -|---|---|---| -| `won_mode` | string | controller `_make_result` | -| `import_eligible` | bool | `len(toolbox) > 0` at task start | -| `import_was_winner` | bool | `won_mode == "import"` | -| `tool_calls` | list[dict] | `chat_with_tools` recorded list | -| `tool_call_count` | int | `len(tool_calls)` | -| `tools_called` | list[str] | unique names from `tool_calls` | -| `actually_called` | list[str] | `parse.imported_callsites(winning_solution, ...)` | - -Per run (computed by `scripts/analyze_trove_run.py`): - -- Overall accuracy -- Final toolbox size -- Mode-win histogram -- IMPORT-mode tool-use rate, mean calls/task, success rate -- Top-10 most-called functions - ---- - -## 7. 
Implementation defaults - -| Choice | Value | Rationale | -|---|---|---| -| `K` (samples per mode) | 3 | Matches existing controller; matches paper | -| Tool schema top-k | 10 | Matches existing `format_toolbox(topk=10)` | -| `max_tool_iters` | 8 | Allows multi-step compositions; bounded for safety | -| Tool result truncation | 4096 characters | Avoids truncating mid-codepoint; safe for JSON | -| Trim coefficient `C` | 1.0 | Matches the original TroVE `λ = log_{20}(n)` | -| Executor timeout | 60s | PBEBench `replace()`-chains + multi-turn dispatch | -| Selection default | `reward` | Existing PBEBench reward signal is reliable | -| Tool name sanitization | `name.split("<\|", 1)[0]` | Defensive vs. open vLLM PR #35906 | - ---- - -## 8. Smoke run - -**Command (filled when ready to execute):** - -```bash -# Launch vLLM (after script is updated with the three new flags) -bash scripts/launch_vllm_gpt_oss_120b.sh 8000 - -# Run TroVE on 50 PBEBench-Lite tasks with gpt-oss-20b -python main.py \ - --dataset pbebench \ - --baseline trove \ - --model gpt-oss-20b \ - --backend openai \ - --base-url http://localhost:8000/v1 \ - --num-tasks 50 \ - --trove-selection reward \ - --debug-dir ./outputs/trove_pbebench_smoke - -# Analyze -python scripts/analyze_trove_run.py outputs/trove_pbebench_smoke/results.jsonl -``` - -**Pre-flight check.** Before kicking off the full 50-task run, run a single one-task smoke and verify: -1. The OpenAI client request payload contains `tools=[...]` with at least one entry once the toolbox has been populated. -2. The first response with a non-empty toolbox returns at least one `tool_call` from vLLM (visible in the debug log JSON for that round-trip). - -If `message.tool_calls` is None or missing on a non-empty-toolbox task, **verify all three vLLM flags (`--enable-auto-tool-choice`, `--tool-call-parser openai`, `--reasoning-parser openai_gptoss`) are present in the launcher script**, restart vLLM, and re-run the sanity check before proceeding. 
- -**Done criteria.** - -- All code changes merged on `trove_baseline`. -- Smoke run completes without crashes. -- Reported numbers (in plain text or a brief markdown summary): - - Overall accuracy (pass rate over 50 tasks) - - Final toolbox size - - Mode-win counts - - IMPORT tool-use rate among IMPORT-eligible tasks - - Top-10 most-called functions - - A short narrative of any anomalies observed (e.g. `<|channel|>` contamination from PR #35906, `max_iters` stops, JSON-arg parse failures). - -We **do not** iterate on prompts, schemas, or thresholds to chase a target number. The numbers are what they are. - ---- - -## 9. vLLM version requirement and known caveats - -- **Minimum vLLM:** v0.16.0 (branch-cut 2026-02-08). Latest as of writing is v0.20.0. -- **Required upstream change:** PR #28729 ("Multiple fixes for gpt-oss Chat Completion prompting"), merged 2025-12-12 by `@chaunceyjiang`. Without this, multi-turn tool-call flows fail to round-trip the analysis/commentary channels correctly. v0.16.0 is the first stable release branch-cut after the merge. -- **Known open caveat:** PR #35906 ("Sanitize leaked Harmony control tokens in tool names and recipients") is **still open** as of late March 2026. Symptoms when this hits us: tool names contain Harmony tags, e.g. `find_replace_chain<|channel|>commentary`. Mitigation: the `<|`-truncation sanitizer in `dispatch_tool_call` and `_update_library`. If/when #35906 lands upstream, the sanitizer becomes a no-op and we leave it in place. - ---- - -## 10. Cost envelope (smoke run upper bound) - -Per task baseline (no IMPORT branch, e.g. first ~10 tasks before the toolbox is populated): K=3 across CREATE and SKIP only = 6 single-shot calls + 3 legacy IMPORT (no-op when toolbox empty, but the call is still made) = 9 round-trips. - -Per IMPORT-eligible task (~40 of 50): K=3 multi-turn IMPORT trajectories × up to 8 iterations each + 1 final no-tool turn = up to 27 calls; plus 6 for CREATE and SKIP = up to 33 round-trips. 
- -Total upper bound: 40·33 + 10·9 = **1410 round-trips** for the 50-task smoke. Acceptable for local vLLM. - ---- - -## 11. Out of scope (explicit) - -- Any change to PBEBench dataset/loader/scoring. -- Any change to CREATE or SKIP generation paths. -- Pre-seeding the toolbox. -- Toolbox persistence across runs. -- Any change to reward semantics. -- Any per-task or per-prompt iteration after the smoke run lands. -- Anthropic backend smoke runs.
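The upper bound in §10 can be reproduced with a few lines of arithmetic. The sketch below only restates the spec's own counting; the function name and the 40/10 split between IMPORT-eligible and empty-toolbox tasks are taken as given estimates, not measurements.

```python
def round_trip_upper_bound(k: int = 3, max_tool_iters: int = 8,
                           eligible_tasks: int = 40, cold_tasks: int = 10) -> int:
    """Worst-case number of LLM round-trips for the smoke run, per §10."""
    text_calls = 2 * k                      # CREATE + SKIP: one call per sample
    cold = text_calls + k                   # plus K legacy IMPORT calls (empty toolbox)
    # Each IMPORT trajectory: up to max_tool_iters tool turns + 1 final no-tool turn.
    warm = k * (max_tool_iters + 1) + text_calls
    return eligible_tasks * warm + cold_tasks * cold


if __name__ == "__main__":
    print(round_trip_upper_bound())
```

With the defaults this evaluates 40·33 + 10·9, matching the 1410 figure in §10.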