From 412c1c6c0b75299bcb28ec28fcc93fd26afb6d62 Mon Sep 17 00:00:00 2001 From: Christopher Tso Date: Thu, 23 Apr 2026 07:06:25 +0200 Subject: [PATCH 1/2] docs(agentv-bench): clean up stale grader references after #1148 MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Follow-up to #1148. Three small clarifications: - grader.md: fix stale `bench-dir` example (`results/export/` → `results/runs/`) and clarify that `bench-dir` already includes the `` segment. The output path spec in Step 9 assumes this. - grader.md: annotate Field Descriptions to distinguish fields consumed by `pipeline bench` (`score`, `assertions[]`) from fields kept on disk for traceability. Also remove `execution_metrics` and `timing` from the list — #1148 dropped them from the JSON example but the descriptions still referenced them. - SKILL.md Phase 1: add a one-liner on how orchestrators detect which tests need Phase 2 — check whether `/llm_graders/` has any `.json` files. `pipeline input` only populates it for `llm-grader` assertions. Co-Authored-By: Claude Opus 4.7 (1M context) --- plugins/agentv-dev/skills/agentv-bench/SKILL.md | 2 +- .../agentv-dev/skills/agentv-bench/agents/grader.md | 10 +++++++--- 2 files changed, 8 insertions(+), 4 deletions(-) diff --git a/plugins/agentv-dev/skills/agentv-bench/SKILL.md b/plugins/agentv-dev/skills/agentv-bench/SKILL.md index 89d108e7..67df5b62 100644 --- a/plugins/agentv-dev/skills/agentv-bench/SKILL.md +++ b/plugins/agentv-dev/skills/agentv-bench/SKILL.md @@ -193,7 +193,7 @@ This evaluates all deterministic assertions against `response.md` files. Two typ Both types are configured by `pipeline input` into `code_graders/.json` and graded by `pipeline grade`. Results are written to `/code_grader_results/.json`. Alternatively, pass `--grader-type code` to `pipeline run` to run these inline. 
-**Do not dispatch LLM grader subagents for tests that only have `contains`, `regex`, or other built-in assertions** — `pipeline grade` handles them entirely, at zero cost. +**Do not dispatch LLM grader subagents for tests that only have `contains`, `regex`, or other built-in assertions** — `pipeline grade` handles them entirely, at zero cost. To detect which tests need Phase 2, check whether `/llm_graders/` contains any `.json` config files — `pipeline input` only writes there for `llm-grader` assertions. Tests with an empty (or missing) `llm_graders/` directory are done after Phase 1. **Phase 2: LLM grading** (semantic — do NOT skip this phase) diff --git a/plugins/agentv-dev/skills/agentv-bench/agents/grader.md b/plugins/agentv-dev/skills/agentv-bench/agents/grader.md index 1f176b57..d071b969 100644 --- a/plugins/agentv-dev/skills/agentv-bench/agents/grader.md +++ b/plugins/agentv-dev/skills/agentv-bench/agents/grader.md @@ -18,7 +18,7 @@ You are the grader for an AgentV evaluation test case. You have two jobs: **grad - `eval-path`: Path to the eval YAML file - `test-id`: The test case ID - `response-file`: Path to the executor's response (e.g., `response.md`) -- `bench-dir`: Path to the bench run directory (e.g., `.agentv/results/export//`) +- `bench-dir`: Path to the test's parent directory — the run directory qualified by evalset name when the eval.yaml has a `name` field. Example: `.agentv/results/runs////`, or `.agentv/results/runs///` when the eval.yaml has no `name`. The grader writes results under `{bench-dir}/{test-id}/...`. - `timing-file`: Path to `timing.json` (for execution-metrics/latency/cost assertions) ## Process @@ -196,10 +196,14 @@ Do **NOT** write directly to `grading.json` — that file is produced by `agentv ### Field Descriptions +`pipeline bench` consumes only `score` and `assertions[]` from this file when merging into the canonical `grading.json`. 
The remaining fields are preserved on disk for human review and downstream tooling, but do not flow into the merged output. + +**Consumed by `pipeline bench`:** +- **score**: Weighted overall score for this grader (0.0-1.0) - **assertions**: Array of per-assertion results — `text` (assertion description), `passed` (boolean), `evidence` (cited quote or description) + +**Kept for traceability (not merged):** - **summary**: Aggregate stats — `passed`, `failed`, `total`, `pass_rate` (0.0-1.0) -- **execution_metrics**: From executor metrics/timing — tool call counts, output size. Omit if not available. -- **timing**: From `timing-file` — executor and total duration in seconds. Omit if not available. - **claims**: Extracted and verified claims — `claim` (statement), `type` (factual/process/quality), `verified` (boolean), `evidence` - **user_notes_summary**: Issues from executor notes — `uncertainties[]`, `needs_review[]`, `workarounds[]`. Empty arrays if no notes found. - **eval_feedback**: Suggestions for improving the evals — `suggestions[]` (array of `{assertion?, reason}`), `overall` (brief assessment) From d4c8f025c9d71ac0815c3f4c161584646be6dc82 Mon Sep 17 00:00:00 2001 From: Christopher Tso Date: Thu, 23 Apr 2026 08:55:53 +0200 Subject: [PATCH 2/2] fix(pipeline): align subagent-mode suite fallback with CLI mode MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Before this change, `pipeline input` and `pipeline run` resolved the suite (evalset) directory name from `suite.metadata?.name` only. If the eval.yaml had no `name` field, `suite.metadata` was undefined and the artifact layout skipped the suite segment entirely — writing to `//...` directly. CLI mode (`agentv eval`, via `artifact-writer.ts:buildArtifactSubdir` consuming `test.suite`) already falls back through `metadata.name` → eval-file basename (stripping `.eval.yaml`) → `'eval'`. 
This fallback is applied by the loaders (`yaml-parser.ts:317-324`, `jsonl-parser.ts:165`) and attached to every test as `test.suite`. Switch `pipeline input` and `pipeline run` to read `tests[0]?.suite` so subagent-mode runs produce the same `//` layout CLI mode produces. `pipeline bench` and `pipeline grade` consume `manifest.suite` which is written by these two, so they pick up the change automatically — no consumer edits needed. Docs in the agentv-bench skill updated to match: `` is now always present in the artifact tree, not conditional. Regression test covers the no-`name` case — previously resolved to `//`, now resolves to `/no-name//` matching `no-name.eval.yaml`'s basename. Co-Authored-By: Claude Opus 4.7 (1M context) --- apps/cli/src/commands/pipeline/input.ts | 5 ++++- apps/cli/src/commands/pipeline/run.ts | 5 ++++- .../eval/pipeline/fixtures/no-name.eval.yaml | 8 ++++++++ apps/cli/test/commands/eval/pipeline/input.test.ts | 14 ++++++++++++++ .../skills/agentv-bench/agents/grader.md | 2 +- .../agentv-bench/references/subagent-pipeline.md | 4 ++-- 6 files changed, 33 insertions(+), 5 deletions(-) create mode 100644 apps/cli/test/commands/eval/pipeline/fixtures/no-name.eval.yaml diff --git a/apps/cli/src/commands/pipeline/input.ts b/apps/cli/src/commands/pipeline/input.ts index 5f24a87f..2bdbb217 100644 --- a/apps/cli/src/commands/pipeline/input.ts +++ b/apps/cli/src/commands/pipeline/input.ts @@ -134,7 +134,10 @@ export const evalInputCommand = command({ // No targets file found — subagent-as-target mode } - const suiteName = suite.metadata?.name?.trim() ?? ''; + // Use tests[0].suite — loaders (yaml-parser, jsonl-parser) already apply the + // metadata.name → filename-basename → 'eval' fallback. This keeps subagent-mode + // artifact layout aligned with CLI mode (artifact-writer.ts:buildArtifactSubdir). + const suiteName = tests[0]?.suite?.trim() ?? ''; const safeSuiteName = suiteName ? 
suiteName.replace(/[\/\\:*?"<>|]/g, '_') : ''; const testIds: string[] = []; diff --git a/apps/cli/src/commands/pipeline/run.ts b/apps/cli/src/commands/pipeline/run.ts index 5e2758fb..161c69cd 100644 --- a/apps/cli/src/commands/pipeline/run.ts +++ b/apps/cli/src/commands/pipeline/run.ts @@ -158,7 +158,10 @@ export const evalRunCommand = command({ // No targets file — subagent-as-target mode } - const suiteName = suite.metadata?.name?.trim() ?? ''; + // Use tests[0].suite — loaders (yaml-parser, jsonl-parser) already apply the + // metadata.name → filename-basename → 'eval' fallback. This keeps subagent-mode + // artifact layout aligned with CLI mode (artifact-writer.ts:buildArtifactSubdir). + const suiteName = tests[0]?.suite?.trim() ?? ''; const safeSuiteName = suiteName ? suiteName.replace(/[\/\\:*?"<>|]/g, '_') : ''; const testIds: string[] = []; diff --git a/apps/cli/test/commands/eval/pipeline/fixtures/no-name.eval.yaml b/apps/cli/test/commands/eval/pipeline/fixtures/no-name.eval.yaml new file mode 100644 index 00000000..a793e9c2 --- /dev/null +++ b/apps/cli/test/commands/eval/pipeline/fixtures/no-name.eval.yaml @@ -0,0 +1,8 @@ +tests: + - id: test-01 + input: hello world + criteria: Response echoes the input + assertions: + - name: contains_hello + type: contains + value: hello diff --git a/apps/cli/test/commands/eval/pipeline/input.test.ts b/apps/cli/test/commands/eval/pipeline/input.test.ts index c7546e45..cbb4e54a 100644 --- a/apps/cli/test/commands/eval/pipeline/input.test.ts +++ b/apps/cli/test/commands/eval/pipeline/input.test.ts @@ -128,4 +128,18 @@ describe('pipeline input', () => { expect(regexGrader.type).toBe('regex'); expect(regexGrader.value).toBe('h[aeiou]llo'); }); + + it('falls back to eval file basename for suite directory when name is absent', async () => { + const { execa } = await import('execa'); + const noNameEvalPath = join(FIXTURE_DIR, 'no-name.eval.yaml'); + await execa('bun', [CLI_ENTRY, 'pipeline', 'input', noNameEvalPath, '--out', 
OUT_DIR]); + + const input = JSON.parse( + await readFile(join(OUT_DIR, 'no-name', 'test-01', 'input.json'), 'utf8'), + ); + expect(input.input[0].content).toBe('hello world'); + + const manifest = JSON.parse(await readFile(join(OUT_DIR, 'manifest.json'), 'utf8')); + expect(manifest.suite).toBe('no-name'); + }); }); diff --git a/plugins/agentv-dev/skills/agentv-bench/agents/grader.md b/plugins/agentv-dev/skills/agentv-bench/agents/grader.md index d071b969..c34cd879 100644 --- a/plugins/agentv-dev/skills/agentv-bench/agents/grader.md +++ b/plugins/agentv-dev/skills/agentv-bench/agents/grader.md @@ -18,7 +18,7 @@ You are the grader for an AgentV evaluation test case. You have two jobs: **grad - `eval-path`: Path to the eval YAML file - `test-id`: The test case ID - `response-file`: Path to the executor's response (e.g., `response.md`) -- `bench-dir`: Path to the test's parent directory — the run directory qualified by evalset name when the eval.yaml has a `name` field. Example: `.agentv/results/runs////`, or `.agentv/results/runs///` when the eval.yaml has no `name`. The grader writes results under `{bench-dir}/{test-id}/...`. +- `bench-dir`: Path to the test's parent directory — the run directory qualified by evalset name. Example: `.agentv/results/runs////`. The evalset name comes from the eval.yaml `name` field; when absent, it falls back to the eval file's basename (e.g. `my-suite.eval.yaml` → `my-suite`), matching CLI mode. The grader writes results under `{bench-dir}/{test-id}/...`. 
- `timing-file`: Path to `timing.json` (for execution-metrics/latency/cost assertions) ## Process diff --git a/plugins/agentv-dev/skills/agentv-bench/references/subagent-pipeline.md b/plugins/agentv-dev/skills/agentv-bench/references/subagent-pipeline.md index 23994de9..fabff41b 100644 --- a/plugins/agentv-dev/skills/agentv-bench/references/subagent-pipeline.md +++ b/plugins/agentv-dev/skills/agentv-bench/references/subagent-pipeline.md @@ -147,11 +147,11 @@ The path hierarchy mirrors the CLI mode: `` comes from the `name` in the eval.yaml. The target is recorded in `manifest.json` — one run = one target. ``` -.agentv/results/runs// +.agentv/results/runs///
├── manifest.json ← eval metadata, target, test_ids
├── index.jsonl ← per-test scores
├── benchmark.json ← aggregate statistics
-└── / ← from eval.yaml "name" field (omitted if absent)
+└── / ← eval.yaml "name" field, or eval file basename if absent (same as CLI mode)
 └── / ← test case id
 ├── input.json ← test input text + messages
 ├── invoke.json ← target command or agent instructions
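
The suite-name fallback that PATCH 2/2 aligns across modes (`metadata.name` → eval-file basename stripping `.eval.yaml` → `'eval'`, then the same sanitizing `replace()` used in `input.ts`/`run.ts`) can be sketched as follows. This is an illustrative sketch only — `resolveSuiteName` and `safeSuiteName` are hypothetical helper names, not the actual `artifact-writer.ts`/loader functions:

```typescript
import { basename } from "node:path";

// Fallback chain from the patch: metadata.name → eval-file basename
// (with the `.eval.yaml`/`.eval.yml` suffix stripped) → 'eval'.
function resolveSuiteName(metadataName: string | undefined, evalPath: string): string {
  const fromMetadata = metadataName?.trim();
  if (fromMetadata) return fromMetadata;
  const base = basename(evalPath).replace(/\.eval\.ya?ml$/, "");
  return base || "eval";
}

// Same character sanitization applied in input.ts/run.ts before the
// name becomes a directory segment.
function safeSuiteName(name: string): string {
  return name.replace(/[\/\\:*?"<>|]/g, "_");
}
```

Under this sketch, `no-name.eval.yaml` with no `name` field resolves to the `no-name` directory segment, matching the regression test's `manifest.suite` expectation.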