This plan defines how to test TMP's main hypotheses. It is meant to be updated as prototypes mature.
- TMP reduces agent tool-call count.
- TMP reduces input and output token usage.
- TMP reduces invalid or hallucinated operations.
- TMP reduces time-to-correct-action.
- TMP improves terminal completion relevance.
- TMP improves success rate when tasks span CLI, API, SQL, workflows, and scripts.
- TMP reduces noisy command output while preserving the signal needed for the next action.
Every task should be run in at least two modes:
baseline: agent/user has normal repo access but no TMP map.tmp_assisted: agent/user can use TMP compile, list, resolve, and invoke flows.
Optional future modes:
tmp_completion: terminal completion scenario using TMP candidates.tmp_registry: task begins by installing a registry schema.tmp_generated_schema: task begins with agent-generated schema draft plus verification.tmp_output_policy: task invokes a mapped command and receives an operation-aware output summary.tmp_rtk: task resolves through TMP and compresses CLI output through RTK or an RTK-compatible adapter.tmp_generated_rtk_filter: task uses TMP to draft an RTK-compatible filter for a command RTK does not already support.
Record these fields for each run:
{
"run_id": "2026-06-03T00-00-00Z-example",
"task_id": "cli.parser_tests",
"mode": "tmp_assisted",
"surface": "cli",
"agent": "codex",
"model": "user_selected",
"tool_calls": 0,
"input_tokens": 0,
"output_tokens": 0,
"raw_output_bytes": 0,
"shaped_output_bytes": 0,
"output_signal_preserved": true,
"wall_time_ms": 0,
"success": false,
"invalid_operation_attempted": false,
"user_clarifications": 0,
"failed_attempts": 0,
"notes": ""
}| ID | Surface | Task | Success Criteria |
|---|---|---|---|
cli.unit_tests |
CLI | Run unit tests for a named package or module. | Correct test command runs. |
cli.bin_completion |
Completion | Complete cargo run --bin <TAB>. |
Candidates match workspace binaries. |
api.deploy_staging |
API | Create a staging deployment. | Correct endpoint and approval classification. |
sql.recent_failures |
SQL | Show recent failed jobs. | Valid read-only query with bounded limit. |
sql.block_mutation |
SQL | Try to delete old rows. | TMP classifies as write/destructive and blocks or requests approval. |
workflow.release_candidate |
Workflow | Run release candidate workflow. | Correct ordered workflow selected. |
script.sync_data |
Script | Run data sync for a target environment. | Correct script args and environment selected. |
output.cargo_test |
Output policy | Run tests and summarize failures. | Summary preserves failures, panics, compiler errors, counts, exit status, and raw-output pointer. |
output.tmp_rtk_cargo_test |
Output policy | Resolve tests with TMP and compress command output with RTK. | TMP records operation metadata and RTK reduces shell noise without losing failure signal. |
output.generated_rtk_filter |
Output policy | Generate an RTK-compatible filter for an unsupported local command. | Draft filter includes samples, tests, raw-output pointer, and remains unverified until reviewed. |
output.unsupported_rtk_command |
Output policy | Run a command that RTK does not rewrite. | TMP detects no RTK compressor and chooses generated draft, TMP-native policy, or raw passthrough. |
output.git_status |
Output policy | Summarize repository status. | Summary preserves branch and grouped changed paths without boilerplate. |
output.git_diff |
Output policy | Summarize a large diff. | Summary preserves changed files and bounded relevant hunks with raw-output pointer. |
output.rg |
Output policy | Summarize search results. | Summary preserves match files, counts, and bounded snippets. |
output.logs |
Output policy | Summarize service logs. | Summary preserves recent errors, warnings, service names, timestamps, and trace IDs. |
registry.install_reuse |
Registry | Install and reuse a schema. | Schema installs and resolves a task. |
missing.schema |
Failure | Ask for an unmapped action. | TMP fails closed with no invented operation. |
- Prepare a fixture repo or environment for the scenario.
- Reset generated files and caches.
- Run the baseline mode.
- Record metrics.
- Reset the environment again.
- Run the TMP-assisted mode.
- Record metrics.
- Compare deltas.
Recommended derived metrics:
tool_call_reduction = baseline_tool_calls - tmp_tool_callstoken_reduction = baseline_tokens - tmp_tokensoutput_reduction_ratio = 1 - shaped_output_bytes / raw_output_byteslatency_reduction_ms = baseline_wall_time_ms - tmp_wall_time_msinvalid_operation_delta = baseline_invalid_attempts - tmp_invalid_attempts
For TMP plus RTK scenarios, compare at least three modes:
baseline: raw command output reaches the agent.tmp_output_policy: TMP resolves and shapes output with a built-in policy.tmp_rtk: TMP resolves the operation and delegates CLI output compression to RTK.tmp_generated_rtk_filter: TMP drafts an RTK-compatible filter when RTK does not already support the command.
This tells us whether the combined approach improves both sides of the problem: fewer exploratory calls from TMP and smaller command results from RTK.
These are proposed targets for early validation:
- Reduce tool calls by at least 50% for common CLI resolution tasks.
- Reduce token usage by at least 30% for tasks requiring command discovery.
- Reduce shaped command output by at least 60% for noisy command classes while preserving required failure/status signal.
- Generate RTK-compatible draft filters for unsupported but line-oriented commands with at least one passing sample test.
- Keep invalid operation attempts at zero for verified TMP mappings.
- Achieve completion precision at 5 of at least 90% for dynamic CLI completions.
- Fail closed for 100% of missing-schema tasks.
For each benchmark run, save:
- Agent transcript or terminal log.
- TMP command outputs.
- Raw command output and TMP-shaped output for output-policy scenarios.
- Generated
.tmp/context.mdand.tmp/commands.jsonwhen relevant. - Token/tool-call accounting if available from the agent.
- Final operation invoked.
- Whether the agent needed to request raw output to proceed.
- Whether generated output filters were reviewed, trusted, and verified before use.
- Pass/fail outcome.
- Which agents should be included in the first comparison?
- Which benchmark fixtures should be maintained in this repository?
- Should token counts be collected through agent APIs, logs, or manual estimates?
- Should TMP expose a benchmark command, or should benchmarks stay as external fixtures initially?
- What risk classification vocabulary should be stable before SQL/API benchmarks?
- Which output policies should be built first: tests, diffs, search, status, or logs?
- What threshold defines "signal preserved" for each output policy?
- Should
tmp generate rtkemit.rtk/filters.toml, TMP-native policy JSON, or both? - How should TMP detect that RTK already supports a command: call
rtk rewrite, inspect a registry export, or maintain compatibility metadata?