From 412c1c6c0b75299bcb28ec28fcc93fd26afb6d62 Mon Sep 17 00:00:00 2001 From: Christopher Tso Date: Thu, 23 Apr 2026 07:06:25 +0200 Subject: [PATCH 1/2] docs(agentv-bench): clean up stale grader references after #1148 MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Follow-up to #1148. Three small clarifications: - grader.md: fix stale `bench-dir` example (`results/export/` → `results/runs/`) and clarify that `bench-dir` already includes the `` segment. The output path spec in Step 9 assumes this. - grader.md: annotate Field Descriptions to distinguish fields consumed by `pipeline bench` (`score`, `assertions[]`) from fields kept on disk for traceability. Also remove `execution_metrics` and `timing` from the list — #1148 dropped them from the JSON example but the descriptions still referenced them. - SKILL.md Phase 1: add a one-liner on how orchestrators detect which tests need Phase 2 — check whether `/llm_graders/` has any `.json` files. `pipeline input` only populates it for `llm-grader` assertions. Co-Authored-By: Claude Opus 4.7 (1M context) --- plugins/agentv-dev/skills/agentv-bench/SKILL.md | 2 +- .../agentv-dev/skills/agentv-bench/agents/grader.md | 10 +++++++--- 2 files changed, 8 insertions(+), 4 deletions(-) diff --git a/plugins/agentv-dev/skills/agentv-bench/SKILL.md b/plugins/agentv-dev/skills/agentv-bench/SKILL.md index 89d108e7..67df5b62 100644 --- a/plugins/agentv-dev/skills/agentv-bench/SKILL.md +++ b/plugins/agentv-dev/skills/agentv-bench/SKILL.md @@ -193,7 +193,7 @@ This evaluates all deterministic assertions against `response.md` files. Two typ Both types are configured by `pipeline input` into `code_graders/.json` and graded by `pipeline grade`. Results are written to `/code_grader_results/.json`. Alternatively, pass `--grader-type code` to `pipeline run` to run these inline. 
-**Do not dispatch LLM grader subagents for tests that only have `contains`, `regex`, or other built-in assertions** — `pipeline grade` handles them entirely, at zero cost. +**Do not dispatch LLM grader subagents for tests that only have `contains`, `regex`, or other built-in assertions** — `pipeline grade` handles them entirely, at zero cost. To detect which tests need Phase 2, check whether `/llm_graders/` contains any `.json` config files — `pipeline input` only writes there for `llm-grader` assertions. Tests with an empty (or missing) `llm_graders/` directory are done after Phase 1. **Phase 2: LLM grading** (semantic — do NOT skip this phase) diff --git a/plugins/agentv-dev/skills/agentv-bench/agents/grader.md b/plugins/agentv-dev/skills/agentv-bench/agents/grader.md index 1f176b57..d071b969 100644 --- a/plugins/agentv-dev/skills/agentv-bench/agents/grader.md +++ b/plugins/agentv-dev/skills/agentv-bench/agents/grader.md @@ -18,7 +18,7 @@ You are the grader for an AgentV evaluation test case. You have two jobs: **grad - `eval-path`: Path to the eval YAML file - `test-id`: The test case ID - `response-file`: Path to the executor's response (e.g., `response.md`) -- `bench-dir`: Path to the bench run directory (e.g., `.agentv/results/export//`) +- `bench-dir`: Path to the test's parent directory — the run directory qualified by evalset name when the eval.yaml has a `name` field. Example: `.agentv/results/runs////`, or `.agentv/results/runs///` when the eval.yaml has no `name`. The grader writes results under `{bench-dir}/{test-id}/...`. - `timing-file`: Path to `timing.json` (for execution-metrics/latency/cost assertions) ## Process @@ -196,10 +196,14 @@ Do **NOT** write directly to `grading.json` — that file is produced by `agentv ### Field Descriptions +`pipeline bench` consumes only `score` and `assertions[]` from this file when merging into the canonical `grading.json`. 
The remaining fields are preserved on disk for human review and downstream tooling, but do not flow into the merged output. + +**Consumed by `pipeline bench`:** +- **score**: Weighted overall score for this grader (0.0-1.0) - **assertions**: Array of per-assertion results — `text` (assertion description), `passed` (boolean), `evidence` (cited quote or description) + +**Kept for traceability (not merged):** - **summary**: Aggregate stats — `passed`, `failed`, `total`, `pass_rate` (0.0-1.0) -- **execution_metrics**: From executor metrics/timing — tool call counts, output size. Omit if not available. -- **timing**: From `timing-file` — executor and total duration in seconds. Omit if not available. - **claims**: Extracted and verified claims — `claim` (statement), `type` (factual/process/quality), `verified` (boolean), `evidence` - **user_notes_summary**: Issues from executor notes — `uncertainties[]`, `needs_review[]`, `workarounds[]`. Empty arrays if no notes found. - **eval_feedback**: Suggestions for improving the evals — `suggestions[]` (array of `{assertion?, reason}`), `overall` (brief assessment) From d4c8f025c9d71ac0815c3f4c161584646be6dc82 Mon Sep 17 00:00:00 2001 From: Christopher Tso Date: Thu, 23 Apr 2026 08:55:53 +0200 Subject: [PATCH 2/2] fix(pipeline): align subagent-mode suite fallback with CLI mode MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Before this change, `pipeline input` and `pipeline run` resolved the suite (evalset) directory name from `suite.metadata?.name` only. If the eval.yaml had no `name` field, `suite.metadata` was undefined and the artifact layout skipped the suite segment entirely — writing to `//...` directly. CLI mode (`agentv eval`, via `artifact-writer.ts:buildArtifactSubdir` consuming `test.suite`) already falls back through `metadata.name` → eval-file basename (stripping `.eval.yaml`) → `'eval'`. 
This fallback is applied by the loaders (`yaml-parser.ts:317-324`, `jsonl-parser.ts:165`) and attached to every test as `test.suite`. Switch `pipeline input` and `pipeline run` to read `tests[0]?.suite` so subagent-mode runs produce the same `//` layout CLI mode produces. `pipeline bench` and `pipeline grade` consume `manifest.suite` which is written by these two, so they pick up the change automatically — no consumer edits needed. Docs in the agentv-bench skill updated to match: `` is now always present in the artifact tree, not conditional. Regression test covers the no-`name` case — previously resolved to `//`, now resolves to `/no-name//` matching `no-name.eval.yaml`'s basename. Co-Authored-By: Claude Opus 4.7 (1M context) --- apps/cli/src/commands/pipeline/input.ts | 5 ++++- apps/cli/src/commands/pipeline/run.ts | 5 ++++- .../eval/pipeline/fixtures/no-name.eval.yaml | 8 ++++++++ apps/cli/test/commands/eval/pipeline/input.test.ts | 14 ++++++++++++++ .../skills/agentv-bench/agents/grader.md | 2 +- .../agentv-bench/references/subagent-pipeline.md | 4 ++-- 6 files changed, 33 insertions(+), 5 deletions(-) create mode 100644 apps/cli/test/commands/eval/pipeline/fixtures/no-name.eval.yaml diff --git a/apps/cli/src/commands/pipeline/input.ts b/apps/cli/src/commands/pipeline/input.ts index 5f24a87f..2bdbb217 100644 --- a/apps/cli/src/commands/pipeline/input.ts +++ b/apps/cli/src/commands/pipeline/input.ts @@ -134,7 +134,10 @@ export const evalInputCommand = command({ // No targets file found — subagent-as-target mode } - const suiteName = suite.metadata?.name?.trim() ?? ''; + // Use tests[0].suite — loaders (yaml-parser, jsonl-parser) already apply the + // metadata.name → filename-basename → 'eval' fallback. This keeps subagent-mode + // artifact layout aligned with CLI mode (artifact-writer.ts:buildArtifactSubdir). + const suiteName = tests[0]?.suite?.trim() ?? ''; const safeSuiteName = suiteName ? 
suiteName.replace(/[\/\\:*?"<>|]/g, '_') : ''; const testIds: string[] = []; diff --git a/apps/cli/src/commands/pipeline/run.ts b/apps/cli/src/commands/pipeline/run.ts index 5e2758fb..161c69cd 100644 --- a/apps/cli/src/commands/pipeline/run.ts +++ b/apps/cli/src/commands/pipeline/run.ts @@ -158,7 +158,10 @@ export const evalRunCommand = command({ // No targets file — subagent-as-target mode } - const suiteName = suite.metadata?.name?.trim() ?? ''; + // Use tests[0].suite — loaders (yaml-parser, jsonl-parser) already apply the + // metadata.name → filename-basename → 'eval' fallback. This keeps subagent-mode + // artifact layout aligned with CLI mode (artifact-writer.ts:buildArtifactSubdir). + const suiteName = tests[0]?.suite?.trim() ?? ''; const safeSuiteName = suiteName ? suiteName.replace(/[\/\\:*?"<>|]/g, '_') : ''; const testIds: string[] = []; diff --git a/apps/cli/test/commands/eval/pipeline/fixtures/no-name.eval.yaml b/apps/cli/test/commands/eval/pipeline/fixtures/no-name.eval.yaml new file mode 100644 index 00000000..a793e9c2 --- /dev/null +++ b/apps/cli/test/commands/eval/pipeline/fixtures/no-name.eval.yaml @@ -0,0 +1,8 @@ +tests: + - id: test-01 + input: hello world + criteria: Response echoes the input + assertions: + - name: contains_hello + type: contains + value: hello diff --git a/apps/cli/test/commands/eval/pipeline/input.test.ts b/apps/cli/test/commands/eval/pipeline/input.test.ts index c7546e45..cbb4e54a 100644 --- a/apps/cli/test/commands/eval/pipeline/input.test.ts +++ b/apps/cli/test/commands/eval/pipeline/input.test.ts @@ -128,4 +128,18 @@ describe('pipeline input', () => { expect(regexGrader.type).toBe('regex'); expect(regexGrader.value).toBe('h[aeiou]llo'); }); + + it('falls back to eval file basename for suite directory when name is absent', async () => { + const { execa } = await import('execa'); + const noNameEvalPath = join(FIXTURE_DIR, 'no-name.eval.yaml'); + await execa('bun', [CLI_ENTRY, 'pipeline', 'input', noNameEvalPath, '--out', 
OUT_DIR]); + + const input = JSON.parse( + await readFile(join(OUT_DIR, 'no-name', 'test-01', 'input.json'), 'utf8'), + ); + expect(input.input[0].content).toBe('hello world'); + + const manifest = JSON.parse(await readFile(join(OUT_DIR, 'manifest.json'), 'utf8')); + expect(manifest.suite).toBe('no-name'); + }); }); diff --git a/plugins/agentv-dev/skills/agentv-bench/agents/grader.md b/plugins/agentv-dev/skills/agentv-bench/agents/grader.md index d071b969..c34cd879 100644 --- a/plugins/agentv-dev/skills/agentv-bench/agents/grader.md +++ b/plugins/agentv-dev/skills/agentv-bench/agents/grader.md @@ -18,7 +18,7 @@ You are the grader for an AgentV evaluation test case. You have two jobs: **grad - `eval-path`: Path to the eval YAML file - `test-id`: The test case ID - `response-file`: Path to the executor's response (e.g., `response.md`) -- `bench-dir`: Path to the test's parent directory — the run directory qualified by evalset name when the eval.yaml has a `name` field. Example: `.agentv/results/runs////`, or `.agentv/results/runs///` when the eval.yaml has no `name`. The grader writes results under `{bench-dir}/{test-id}/...`. +- `bench-dir`: Path to the test's parent directory — the run directory qualified by evalset name. Example: `.agentv/results/runs////`. The evalset name comes from the eval.yaml `name` field; when absent, it falls back to the eval file's basename (e.g. `my-suite.eval.yaml` → `my-suite`), matching CLI mode. The grader writes results under `{bench-dir}/{test-id}/...`. 
- `timing-file`: Path to `timing.json` (for execution-metrics/latency/cost assertions) ## Process diff --git a/plugins/agentv-dev/skills/agentv-bench/references/subagent-pipeline.md b/plugins/agentv-dev/skills/agentv-bench/references/subagent-pipeline.md index 23994de9..fabff41b 100644 --- a/plugins/agentv-dev/skills/agentv-bench/references/subagent-pipeline.md +++ b/plugins/agentv-dev/skills/agentv-bench/references/subagent-pipeline.md @@ -147,11 +147,11 @@ The path hierarchy mirrors the CLI mode: `` comes from the `name` in the eval.yaml. The target is recorded in `manifest.json` — one run = one target. ``` -.agentv/results/runs// +.agentv/results/runs///
├── manifest.json ← eval metadata, target, test_ids
├── index.jsonl ← per-test scores
├── benchmark.json ← aggregate statistics
-└── / ← from eval.yaml "name" field (omitted if absent)
+└── / ← eval.yaml "name" field, or eval file basename if absent (same as CLI mode)
 └── / ← test case id
 ├── input.json ← test input text + messages
 ├── invoke.json ← target command or agent instructions
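
The suite-name fallback that PATCH 2/2 aligns across modes (`metadata.name` → eval-file basename stripping `.eval.yaml` → `'eval'`, then the same sanitizing `replace()` used in `input.ts`/`run.ts`) can be sketched as follows. This is an illustrative sketch only — `resolveSuiteName` and `safeSuiteName` are hypothetical helper names, not the actual `artifact-writer.ts`/loader functions:

```typescript
import { basename } from "node:path";

// Fallback chain from the patch: metadata.name → eval-file basename
// (with the `.eval.yaml`/`.eval.yml` suffix stripped) → 'eval'.
function resolveSuiteName(metadataName: string | undefined, evalPath: string): string {
  const fromMetadata = metadataName?.trim();
  if (fromMetadata) return fromMetadata;
  const base = basename(evalPath).replace(/\.eval\.ya?ml$/, "");
  return base || "eval";
}

// Same character sanitization applied in input.ts/run.ts before the
// name becomes a directory segment.
function safeSuiteName(name: string): string {
  return name.replace(/[\/\\:*?"<>|]/g, "_");
}
```

Under this sketch, `no-name.eval.yaml` with no `name` field resolves to the `no-name` directory segment, matching the regression test's `manifest.suite` expectation.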