feat: programmatic tool calling (the script tool) #50

Merged
agjs merged 4 commits into main from feat/script-tool on Jun 26, 2026
agjs commented Jun 26, 2026

Owner

What

Adds an opt-in `script` tool (`TSFORGE_SCRIPT=1`) — Programmatic Tool Calling. The model writes ONE TypeScript program that calls tools through generated `./tsforge-tools` stubs, collapsing a mechanical multi-step tool chain into a single model turn. Only the script's stdout returns to the model; intermediate results never enter context.

Why

Attacks the one axis the correctness gate is blind to — token + latency cost. Exploration/multi-file work that today costs N model turns (read 8 files, fetch+compare packages, transform-then-write across files) becomes one turn. Inspired by hermes-agent's PTC.

How (no new powers, just ergonomics)

Each stub call POSTs `{tool,args}` to a loopback RPC server that dispatches through the existing `executeTool` chokepoint — so scope, the unified policy, the write-guard, mutation accounting, and the gate all still apply. Same trust level as `run`. Bounded by a wall-clock timeout (kill), a per-script call cap, and output condensing. RPC subset excludes scaffolds/installer/yield and `script` itself (no recursion); token-gated and serialized. Not advertised in plan mode and rejected at dispatch there.
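The server-side checks described above (token gate, exposable subset, call cap, serialization) can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `createScriptRpcGate`, its option names, and the error strings are hypothetical, and `executeTool` is reduced to a `(tool, args) => Promise<string>` stand-in for the real chokepoint.

```typescript
// Hypothetical sketch: names and shapes here are assumptions, not the PR's API.
interface RpcRequest {
  tool: string;
  args: unknown;
}

type RpcResult = { ok: true; output: string } | { ok: false; error: string };

interface GateOptions {
  token: string; // one-time token the stubs must present
  exposable: ReadonlySet<string>; // RPC subset (no `script`, no `yield`, ...)
  maxCalls: number; // per-script call cap
  executeTool: (tool: string, args: unknown) => Promise<string>; // stand-in for the chokepoint
}

function createScriptRpcGate(opts: GateOptions) {
  let calls = 0;
  let tail: Promise<unknown> = Promise.resolve();

  // Returns why a request is refused, or null if it may be dispatched.
  function rejectReason(presentedToken: string, req: RpcRequest): string | null {
    if (presentedToken !== opts.token) return "bad token";
    if (!opts.exposable.has(req.tool)) return `tool not exposed to scripts: ${req.tool}`;
    if (calls >= opts.maxCalls) return "per-script call cap reached";
    return null;
  }

  function handle(presentedToken: string, req: RpcRequest): Promise<RpcResult> {
    const reason = rejectReason(presentedToken, req);
    if (reason !== null) {
      const refused: RpcResult = { ok: false, error: reason };
      return Promise.resolve(refused);
    }
    calls += 1;
    // Chain onto the previous call so two stub calls never run concurrently.
    const run = tail.then(() => opts.executeTool(req.tool, req.args));
    tail = run.catch(() => undefined); // a failed call must not block later ones
    return run.then(
      (output): RpcResult => ({ ok: true, output }),
      (err: unknown): RpcResult => ({ ok: false, error: String(err) }),
    );
  }

  return { rejectReason, handle };
}
```

Because every accepted request is appended to a single promise chain, a script that fires several stub calls with `Promise.all` would still reach `executeTool` one call at a time in this sketch.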

Also adds `TOOL_SPECS` — one source of truth for per-tool flags (`readOnly`, `scriptExposable`); `READ_ONLY_TOOL_NAMES` + the script-exposable subset now derive from it.
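A single-registry layout of this kind might look like the sketch below. The PR only names `TOOL_SPECS`, `readOnly`, `scriptExposable`, and `READ_ONLY_TOOL_NAMES`; the tool names, the flag values, `toolNamesWhere`, and `SCRIPT_EXPOSABLE_TOOL_NAMES` are assumptions made for illustration.

```typescript
// Illustrative only: tool names and flag values are assumptions.
interface ToolSpec {
  readOnly: boolean;
  scriptExposable: boolean;
}

const TOOL_SPECS: Record<string, ToolSpec> = {
  read: { readOnly: true, scriptExposable: true },
  edit: { readOnly: false, scriptExposable: true },
  create: { readOnly: false, scriptExposable: true },
  yield: { readOnly: false, scriptExposable: false },
  script: { readOnly: false, scriptExposable: false }, // excluded: no recursion
};

// Derive a name set from one flag so the sets cannot drift from the registry.
function toolNamesWhere(pred: (spec: ToolSpec) => boolean): ReadonlySet<string> {
  const names = Object.keys(TOOL_SPECS).filter((name) => {
    const spec = TOOL_SPECS[name];
    return spec !== undefined && pred(spec);
  });
  return new Set(names);
}

const READ_ONLY_TOOL_NAMES = toolNamesWhere((spec) => spec.readOnly);
const SCRIPT_EXPOSABLE_TOOL_NAMES = toolNamesWhere((spec) => spec.scriptExposable);
```

Adding a tool then means adding one registry entry; both derived sets pick it up without a second edit.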

Tests

15 new (`script-tool.test.ts` + gating/accounting): stub generation, single-turn batching + call accounting, real in/out-of-scope writes through `executeTool`, plan-mode rejection, call cap, timeout kill, token/recursion/non-exposable guards, serialization, registry equivalence. Full `bun run validate` green (1616 pass).

Eval status (gating this PR)

  • ✅ Real-model smoke (DeepSeek, `fix-regression`, script on): 100% pass, no regression — harness runs end-to-end with the tool.
  • Win-proving A/B (`TSFORGE_FEATURE_VARIANTS=script`, TTSR + tokens at equal pass-rate, on exploration-heavy seeds) — to run before merge. Ships only if the sweep shows a real, non-regressing cost reduction.

🤖 Generated with Claude Code

…ag registry)

Add an opt-in `script` tool (TSFORGE_SCRIPT=1) that lets the model write ONE
TypeScript program calling tools through generated `./tsforge-tools` stubs,
collapsing a mechanical multi-step tool chain into a single model turn. Only the
script's stdout returns to the model — intermediate results never enter context.

Mechanism: doScript writes the stub module + the model's code to a temp dir,
starts a loopback RPC server (Bun.serve on 127.0.0.1, one-time token), and runs
`bun run script.ts`. Each stub call POSTs {tool,args} back to the server, which
dispatches through the EXISTING executeTool chokepoint — so scope, the unified
policy, the write-guard, mutation accounting, and the gate all still apply. The
model gains ergonomics, not new powers (same trust level as `run`).
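On the script side, a generated stub could be as small as the following sketch. The commit message only says each stub POSTs `{tool,args}` to the loopback server; `makeStub`, the `HttpPost` type, and carrying the token in an `authorization` header are assumptions. The HTTP function is injected so the sketch does not depend on a particular runtime; in practice the global `fetch` would likely be passed.

```typescript
// Hypothetical stub factory: names and the header used for the token are assumptions.
interface HttpResponse {
  ok: boolean;
  status: number;
  text(): Promise<string>;
}

type HttpPost = (
  url: string,
  init: { method: "POST"; headers: Record<string, string>; body: string },
) => Promise<HttpResponse>;

function makeStub(tool: string, url: string, token: string, post: HttpPost) {
  return async (args: Record<string, unknown>): Promise<string> => {
    const res = await post(url, {
      method: "POST",
      headers: { "content-type": "application/json", authorization: `Bearer ${token}` },
      body: JSON.stringify({ tool, args }), // the {tool,args} payload from the commit message
    });
    const text = await res.text();
    // Surface server-side refusals (scope, policy, cap) as ordinary exceptions.
    if (!res.ok) throw new Error(`${tool} failed (${res.status}): ${text}`);
    return text;
  };
}

// Usage inside a generated module might look like:
//   export const read = makeStub("read", rpcUrl, rpcToken, fetch);
```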

Bounded by a wall-clock timeout (kill), a per-script tool-call cap, and output
condensing. The RPC subset excludes the scaffolds, the dependency installer,
yield, and `script` itself (no recursion); requests are token-gated and
serialized so concurrent stub calls can't interleave a mutation. Not advertised
in plan mode and rejected at dispatch there, so the "no writes while planning"
guarantee holds.
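The commit message does not say how output condensing works. One plausible approach, shown here purely as an assumption, keeps the head and tail of stdout and replaces the middle with a marker; `condenseOutput` and the marker text are hypothetical.

```typescript
// Hypothetical condenser: keeps the first and last halves of the budget.
function condenseOutput(stdout: string, maxChars: number): string {
  if (stdout.length <= maxChars) return stdout;
  const headLen = Math.ceil(maxChars / 2);
  const tailLen = Math.floor(maxChars / 2);
  const omitted = stdout.length - headLen - tailLen;
  const head = stdout.slice(0, headLen);
  const tail = stdout.slice(stdout.length - tailLen);
  return `${head}\n[... ${omitted} chars omitted ...]\n${tail}`;
}
```

Keeping the tail matters for scripts that print a summary line last, since that line is usually what the model needs.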

Also introduces TOOL_SPECS in agent.constants.ts — one source of truth for
per-tool flags (readOnly, scriptExposable) that READ_ONLY_TOOL_NAMES and the
script-exposable subset now derive from, replacing hand-kept sets.

Tests: 15 new (stub generation, single-turn batching + call accounting, real
in/out-of-scope writes through executeTool, plan-mode rejection, call cap,
timeout kill, token/recursion/non-exposable guards, serialization, registry
equivalence). Full `bun run validate` green (1616 pass).

Eval validation (A/B: TSFORGE_SCRIPT off vs on, TTSR + token cost at
equal-or-better gate pass-rate) to follow — the feature ships only if the sweep
shows a real, non-regressing cost reduction.

gemini-code-assist (bot) left a comment

Code Review

This pull request introduces a new script tool that enables programmatic tool calling by executing a TypeScript program to batch multiple tool calls into a single turn. The feedback highlights two critical issues: first, creating the temporary directory in the system temp folder breaks module resolution for project dependencies and can leak resources on server initialization errors; second, executing multiple writes within a single script turn bypasses the write-guard and touched tracking because the current tracking mechanism only records the last written file.

Comment thread packages/core/src/loop/tools/script-tool.ts Outdated
Comment thread packages/core/src/loop/turn.ts
agjs added 2 commits June 26, 2026 21:07
…r-use

The first A/B showed `script` was neutral-to-negative on create-heavy tasks:
the model reached for it on trivial independent creates (which it already
batches into one turn), adding a cycle. Two changes fix and prove it:

1. Retarget the guidance: use `script` ONLY when the change to many files
   DEPENDS on first reading each file (a read→act loop the model otherwise
   splits into a read turn + an edit turn). Explicitly steer away from
   independent batch-creates/edits. This kills the over-use.

2. Add the `migrate` eval seed — a brownfield codemod (8 services, each edited
   using a tier read from its own header comment) — the read-dependent shape
   where PTC should help.
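To make the read-dependent shape concrete, here is the kind of program a model might submit for a `migrate`-style task. It is a sketch under stated assumptions: the `read` and `edit` functions below are in-memory stand-ins for the generated `./tsforge-tools` stubs (whose real signatures are not shown in this PR), and the `// tier:` header format and file names are invented for illustration.

```typescript
// In-memory stand-ins for the generated stubs (hypothetical signatures).
const files = new Map<string, string>([
  ["services/a.ts", "// tier: gold\nexport const a = 1;\n"],
  ["services/b.ts", "// tier: free\nexport const b = 2;\n"],
]);

async function read(args: { path: string }): Promise<string> {
  const content = files.get(args.path);
  if (content === undefined) throw new Error(`no such file: ${args.path}`);
  return content;
}

async function edit(args: { path: string; content: string }): Promise<void> {
  files.set(args.path, args.content);
}

// The read→act loop: each edit depends on what was just read from that file.
async function migrate(paths: string[]): Promise<string> {
  let changed = 0;
  for (const path of paths) {
    const source = await read({ path });
    const tier = /^\/\/ tier: (\w+)/.exec(source)?.[1];
    if (tier === undefined) continue; // no header: leave the file alone
    await edit({ path, content: `${source}export const TIER = "${tier}";\n` });
    changed += 1;
  }
  // Only this summary would go to stdout, and so back to the model.
  return `migrated ${changed}/${paths.length} services`;
}

migrate(Array.from(files.keys())).then((summary) => console.log(summary));
// prints "migrated 2/2 services"
```

Without the script, the same work takes a turn to read the files and another to issue the edits, which is the split the retargeted guidance is aimed at.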

A/B (DeepSeek, temp 0):
- migrate (read-dependent codemod), pooled n=20/variant:
  off 60% pass, ~3.0 cyc, ~17s — stuck ~40% of runs
  on  95% pass, ~1.7 cyc, ~13s — stuck ~5% of runs   (Fisher p≈0.008)
- simple controls (validators/fixtures/handlers): on == off cycles, equal or
  slightly faster, equal quality — over-use regression gone.

Given a real win on its target shape and no regression elsewhere, flip `script`
to DEFAULT-ON with a `TSFORGE_NO_SCRIPT` kill switch (matching NO_LSP/NO_GIT) —
no opt-in flag for users to think about. It makes no network calls, so
default-on keeps eval sweeps deterministic. The sweep's `script` A/B dimension
now toggles `TSFORGE_NO_SCRIPT` (inverted, like `git`).

Tests updated for default-on (gating: present by default, withheld under
TSFORGE_NO_SCRIPT). Full `bun run validate` green (1616 pass).
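The default-on gating, combined with the plan-mode rule from the first commit, can be sketched as below. Treating `TSFORGE_NO_SCRIPT=1` as the only value that disables the tool, the `"act"` mode name, and the function names are assumptions; the PR does not show how the variable is parsed.

```typescript
// Hypothetical gating helpers: parsing rule and mode names are assumptions.
type Mode = "plan" | "act";
type Env = Record<string, string | undefined>;

function isScriptToolEnabled(env: Env): boolean {
  // Default-on: only an explicit kill switch withholds the tool.
  return env["TSFORGE_NO_SCRIPT"] !== "1";
}

function advertisedTools(baseTools: readonly string[], env: Env, mode: Mode): string[] {
  // Never advertise `script` while planning, so planning stays write-free.
  const withScript = isScriptToolEnabled(env) && mode !== "plan";
  return withScript ? [...baseTools, "script"] : [...baseTools];
}
```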
agjs marked this pull request as ready for review June 26, 2026 20:08

agjs commented Jun 26, 2026

Owner Author

Eval: proven, now default-on

Tuned the tool after the first A/B showed over-use on trivial tasks, then re-measured (DeepSeek, temp 0).

Win — read-dependent multi-file codemod (migrate seed, pooled n=20/variant):

| Variant | Pass | Cycles | Time | Stuck |
| --- | --- | --- | --- | --- |
| script off | 60% | ~3.0 | ~17s | ~40% of runs |
| script on | 95% | ~1.7 | ~13s | ~5% of runs |

Fisher exact p≈0.008. Doing the codemod manually makes the model thrash and stall in roughly 2 of 5 runs; with a script it is reliable, using ~38% fewer cycles and finishing ~23% faster at the same quality.

No regression — simple controls (validators, fixtures, handlers, n=5 each): script on == off on cycles (1.0–1.2), equal-or-slightly-faster, equal quality. The earlier over-use (fixtures on was 2.2 cyc) is gone after the guidance retarget (now 1.2 == baseline).

Decision: flipped script to default-on with a TSFORGE_NO_SCRIPT kill switch (matching NO_LSP/NO_GIT) — no opt-in flag for users. It makes no network calls, so default-on keeps eval sweeps deterministic.

Full bun run validate green (1616 pass). New migrate seed + sweep script A/B dimension included for reproducibility.

…s (PR #50 review)

Two critical issues from Gemini review:

1. The script ran from a temp dir under the system tmpdir, so a script importing
   a project dependency (zod, etc.) failed module resolution, and the dir leaked
   if startRpcServer threw before the try block. Create the temp dir inside
   ctx.cwd (hidden .tsforge-script-* prefix so eslint/tsc ignore it) so Node/Bun
   resolution walks up to the workspace node_modules + relative imports, and move
   the server start inside try so the dir is always cleaned up.

2. runToolCalls tracked a single wrote.path, overwritten per edit/create event —
   so a script that writes N files only recorded the LAST in touched and only
   write-guarded that one; the rest bypassed the write-guard and change-scoped
   rules (test-sibling-required). Collect ALL written in-scope paths in a Set and
   recordTouched + write-guard each.
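The second fix can be illustrated with a small before/after sketch. The `ToolEvent` shape and the function names are hypothetical; only the idea (a single overwritten slot versus a `Set` of all in-scope written paths, each passed to `recordTouched` and the write-guard) comes from the commit message.

```typescript
// Hypothetical event shape; the real runToolCalls types are not shown in the PR.
interface ToolEvent {
  kind: "read" | "edit" | "create";
  path: string;
  inScope: boolean;
}

// Before: one slot, overwritten by each write, so only the last path survives.
function lastWrittenPath(events: readonly ToolEvent[]): string | undefined {
  let wrote: string | undefined;
  events.forEach((e) => {
    if (e.kind !== "read" && e.inScope) wrote = e.path;
  });
  return wrote;
}

// After: every in-scope written path is kept, deduplicated.
function writtenPaths(events: readonly ToolEvent[]): Set<string> {
  const written = new Set<string>();
  events.forEach((e) => {
    if (e.kind !== "read" && e.inScope) written.add(e.path);
  });
  return written;
}

// Each written path is recorded and guarded, not just the final one.
function finishTurn(
  events: readonly ToolEvent[],
  recordTouched: (path: string) => void,
  writeGuard: (path: string) => void,
): void {
  writtenPaths(events).forEach((path) => {
    recordTouched(path);
    writeGuard(path);
  });
}
```

For a script that creates three files, `lastWrittenPath` reports one path while `writtenPaths` reports all three, which is the gap the review flagged.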

Tests: script resolves a workspace node_modules dep; a 3-file script records all
three in touched (state.edits=3) and leaves no temp dir behind. validate green
(1618 pass).
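The first fix (temp dir inside the workspace, always cleaned up) follows a familiar try/finally pattern, sketched here with Node-compatible `fs` calls. `withScriptDir` is a hypothetical helper name; the `.tsforge-script-` prefix is taken from the commit message, and the body callback stands in for starting the RPC server and running the script.

```typescript
import { existsSync, mkdtempSync, rmSync } from "node:fs";
import { join } from "node:path";

// Hypothetical helper: creates a hidden script dir under `cwd` and always removes it.
async function withScriptDir<T>(cwd: string, body: (dir: string) => Promise<T>): Promise<T> {
  // Living under cwd lets module resolution walk up to the workspace node_modules.
  const dir = mkdtempSync(join(cwd, ".tsforge-script-"));
  try {
    // Everything that can throw (server start included) belongs inside the try.
    return await body(dir);
  } finally {
    rmSync(dir, { recursive: true, force: true });
  }
}

// Usage: the directory is removed even when the body fails early.
let scriptDir = "";
withScriptDir(process.cwd(), async (dir) => {
  scriptDir = dir;
  throw new Error("simulated RPC server start failure");
}).catch(() => console.log("cleaned up:", !existsSync(scriptDir)));
```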
agjs merged commit 79d64f2 into main on Jun 26, 2026
8 checks passed
agjs deleted the feat/script-tool branch June 26, 2026 20:30