diff --git a/apps/web/src/content/docs/docs/targets/cli-provider.mdx b/apps/web/src/content/docs/docs/targets/cli-provider.mdx
new file mode 100644
index 00000000..f1eb91e6
--- /dev/null
+++ b/apps/web/src/content/docs/docs/targets/cli-provider.mdx
@@ -0,0 +1,145 @@
+---
+title: CLI Provider
+description: Wrap any shell command as an evaluation target
+sidebar:
+  order: 4
+---
+
+The `cli` provider runs an arbitrary shell command per test case and captures its output as the target's response. It's the escape hatch that lets you evaluate *anything* that exposes a command-line entry point — your own agent, a third-party CLI, a stub that prints a fixed answer, a script that calls an in-house microservice, etc.
+
+Because the contract is "we invoke a command and read a file," almost any useful composition pattern (sanity-checking your grader against a known-good answer, diffing two implementations, driving a batch mode) can be built on top without any new primitives.
+
+## Minimal example
+
+```yaml
+# .agentv/targets.yaml
+targets:
+  - name: my_agent
+    provider: cli
+    command: python agent.py --prompt {PROMPT} --out {OUTPUT_FILE}
+    grader_target: azure-base # required if your evals use LLM graders
+```
+
+Your `agent.py` reads the prompt, writes its response to the path passed as `--out`, and exits `0`. That's it.
+
+## Command contract
+
+Before each test case, AgentV renders the `command` template and spawns it as a shell process. The command has two responsibilities:
+
+1. **Read the input** via one of the placeholders below.
+2. **Write the response to `{OUTPUT_FILE}`** — AgentV reads *that file*, not your stdout.
+
+When the process exits successfully, AgentV parses the contents of `{OUTPUT_FILE}` and treats it as the target's response. Non-zero exits, timeouts, and unreadable output files are surfaced as test errors with the underlying stderr/exit code.
+
+### Template placeholders
+
+Use these in `command`; AgentV substitutes them per test case.
+
+| Placeholder | What it expands to |
+|---|---|
+| `{PROMPT}` | The test case's input text, shell-escaped. |
+| `{PROMPT_FILE}` | Path to a temp file containing the prompt (use this when the input is large enough to blow past shell argv limits). |
+| `{OUTPUT_FILE}` | Path to a temp file the command **must** write to. Deleted after the run unless `keep_temp_files: true`. |
+| `{FILES}` | Space-separated paths of any input files attached to the test case, formatted via `files_format`. |
+| `{EVAL_ID}` | Unique identifier of the current test case — useful for logging or per-case scratch dirs. |
+| `{ATTEMPT}` | Retry attempt number (0 on the first try). |
+
+### Output file format
+
+AgentV tries to parse `{OUTPUT_FILE}` as JSON first. If it parses and contains any of these keys, they're picked up; if it doesn't parse, the entire content is treated as the assistant's message text.
+
+```jsonc
+{
+  "output": [ // preferred: full message array
+    { "role": "assistant", "content": "..." }
+  ],
+  "text": "...", // fallback: plain assistant text
+  "token_usage": { "input": 123, "output": 456, "cached": 0 },
+  "cost_usd": 0.0042,
+  "duration_ms": 1800
+}
+```
+
+For the common case, plain text is fine:
+
+```bash
+echo "Hello, world!" > {OUTPUT_FILE}
+```
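+
+Putting the contract together, here is a minimal sketch of what the `agent.py` from the earlier example could look like. It is illustrative only: the canned `answer` function stands in for your real agent, and the argument names simply mirror the example command.
+
+```python
+# agent.py: hypothetical stand-in for a real agent behind a cli target.
+import argparse
+import json
+
+
+def answer(prompt: str) -> str:
+    # Swap this canned reply for a call to your real model or service.
+    return f"You asked: {prompt}"
+
+
+def main() -> None:
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--prompt", required=True)
+    parser.add_argument("--out", required=True)
+    args = parser.parse_args()
+
+    # Plain text would also satisfy the contract; the JSON shape from
+    # "Output file format" leaves room for metadata like token_usage later.
+    result = {"output": [{"role": "assistant", "content": answer(args.prompt)}]}
+    with open(args.out, "w", encoding="utf-8") as handle:
+        json.dump(result, handle)
+
+
+if __name__ == "__main__":
+    main()
+```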
+
+## Configuration fields
+
+| Field | Type | Required | Default | Description |
+|---|---|---|---|---|
+| `name` | string | yes | — | Target identifier used in eval configs. |
+| `provider` | literal `"cli"` | yes | — | Selects this provider. |
+| `command` | string | yes | — | Shell command template. |
+| `timeout_seconds` | number | no | — | Kill the process if it runs longer than this. |
+| `cwd` | string | no | eval dir | Working directory. Relative paths resolve against the eval file. |
+| `files_format` | string | no | `{path}` | How each entry in `{FILES}` is formatted. Placeholders: `{path}`, `{basename}`. |
+| `verbose` | boolean | no | `false` | Log the rendered command and cwd to stdout. Useful for debugging template substitution. |
+| `keep_temp_files` | boolean | no | `false` | Preserve `{PROMPT_FILE}` / `{OUTPUT_FILE}` after the run — handy while iterating on your command. |
+| `healthcheck` | object | no | — | Pre-run health check (HTTP or command); the eval aborts if it fails. |
+| `workers` | number | no | — | Concurrent test-case executions against this target. |
+| `provider_batching` | boolean | no | `false` | Run all cases in one command invocation — see [Batching](#batching). |
+| `grader_target` | string | no | — | LLM target used by this target's LLM graders. Required if your evals use LLM-based graders. |
+
+## Batching
+
+For targets where spin-up cost dominates per-case work (e.g. loading a model, authenticating), set `provider_batching: true`. AgentV invokes the command *once*, hands it a JSONL stream of cases, and expects a JSONL response keyed by each case's `id`:
+
+```yaml
+targets:
+  - name: batched_agent
+    provider: cli
+    provider_batching: true
+    command: python agent.py --batch-in {PROMPT_FILE} --batch-out {OUTPUT_FILE}
+```
+
+`{PROMPT_FILE}` contains one JSON object per line with an `id` and the case's inputs; your command writes one line per case to `{OUTPUT_FILE}`, each carrying the matching `id` plus the same output shape as the non-batched case.
+
+## Pattern: Oracle validation (sanity-check your grader)
+
+A common question when building a new eval: **"if my grader scores my agent poorly, is the agent wrong or is the grader wrong?"** The classical testing answer is to run a known-correct reference ("the oracle") through the same grader — if a perfect answer doesn't pass, the grader is the bug.
+
+AgentV has no dedicated "oracle" feature because the `cli` provider already composes into one. Declare a second target that prints your known-good answer into `{OUTPUT_FILE}`, run the same eval against it, and assert a perfect score:
+
+```yaml
+# .agentv/targets.yaml
+targets:
+  - name: my_agent
+    provider: cli
+    command: python agent.py --prompt {PROMPT} --out {OUTPUT_FILE}
+    grader_target: azure-base
+
+  - name: oracle
+    provider: cli
+    command: cp fixtures/{EVAL_ID}.expected.txt {OUTPUT_FILE}
+    grader_target: azure-base
+```
+
+```bash
+# While iterating on your grader, run the oracle first.
+# If it doesn't score 100%, fix the grader before trusting any agent results.
+agentv eval my.EVAL.yaml --target oracle
+
+# Then run the real target.
+agentv eval my.EVAL.yaml --target my_agent
+```
+
+A few practical notes:
+
+- `{EVAL_ID}` in the oracle command lets one target serve an entire eval suite — just ship one `fixtures/{EVAL_ID}.expected.txt` per case. Alternatively, read the expected output from wherever your rubric already keeps it.
+- If the oracle doesn't reach 100%, that's the bug. Do not proceed to scoring real agents until it does.
+- If the oracle *does* reach 100%, low scores on real agents are a signal about the agent, not the grader.
+- The same composition works for other meta-tests: a "deliberately wrong" target that should score 0, a "mostly right" target pinned at a known partial score, etc. (a sketch follows below).
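+
+As a sketch of how little such a meta-test costs to declare, a hypothetical "deliberately wrong" control could be one more entry in `.agentv/targets.yaml` (the `always_wrong` name and the echoed sentence are placeholders, not a built-in convention):
+
+```yaml
+targets:
+  - name: always_wrong
+    provider: cli
+    command: echo "This answer is deliberately incorrect." > {OUTPUT_FILE}
+    grader_target: azure-base
+```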
+
+The pattern needs no special config field, no directory convention, and no flag — it's just a second target that happens to know the answer.
+
+## Debugging
+
+When a `cli` target misbehaves:
+
+1. Set `verbose: true` to see the rendered command and cwd.
+2. Set `keep_temp_files: true` and inspect `{PROMPT_FILE}` / `{OUTPUT_FILE}` after the run.
+3. Run the rendered command by hand with those files and check it exits `0` and writes the expected output shape.
+4. If the output looks right but grading is off, check the key names against the format above — a typo such as `output_messages` instead of `output` silently falls back to "treat whole file as plain text."
diff --git a/apps/web/src/content/docs/docs/targets/configuration.mdx b/apps/web/src/content/docs/docs/targets/configuration.mdx
index 92befdd6..e7840864 100644
--- a/apps/web/src/content/docs/docs/targets/configuration.mdx
+++ b/apps/web/src/content/docs/docs/targets/configuration.mdx
@@ -56,7 +56,7 @@ already-exported secrets into `.env`.
 | `pi-coding-agent` | Agent | Pi Coding Agent |
 | `vscode` | Agent | VS Code with Copilot |
 | `vscode-insiders` | Agent | VS Code Insiders |
-| `cli` | Agent | Any CLI command |
+| `cli` | Agent | Any CLI command — see [CLI Provider](/docs/targets/cli-provider) |
 | `mock` | Testing | Mock provider for dry runs |
 
 ## Referencing Targets in Evals
diff --git a/apps/web/src/content/docs/docs/targets/custom-providers.mdx b/apps/web/src/content/docs/docs/targets/custom-providers.mdx
index aa652bd3..737ba57b 100644
--- a/apps/web/src/content/docs/docs/targets/custom-providers.mdx
+++ b/apps/web/src/content/docs/docs/targets/custom-providers.mdx
@@ -2,7 +2,7 @@
 title: Custom Providers (SDK)
 description: Implement native TypeScript providers using the ProviderRegistry API
 sidebar:
-  order: 5
+  order: 6
 ---
 
 Custom providers let you implement evaluation targets in TypeScript instead of shelling out to a CLI command. This is useful when you want to call an HTTP API, use an SDK, or implement custom logic that goes beyond what the CLI provider supports.
diff --git a/apps/web/src/content/docs/docs/targets/retry.mdx b/apps/web/src/content/docs/docs/targets/retry.mdx
index eba64794..b0176952 100644
--- a/apps/web/src/content/docs/docs/targets/retry.mdx
+++ b/apps/web/src/content/docs/docs/targets/retry.mdx
@@ -2,7 +2,7 @@
 title: Retry Configuration
 description: Configure automatic retry with exponential backoff
 sidebar:
-  order: 4
+  order: 5
 ---
 
 Configure automatic retry with exponential backoff for transient failures.