From e3416def745f5683cbef605897fd6f7ab03ecc84 Mon Sep 17 00:00:00 2001 From: Christopher Date: Wed, 15 Apr 2026 12:41:57 +0000 Subject: [PATCH 1/3] refactor(core): rename evaluators to graders Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- AGENTS.md | 34 +- README.md | 2 +- apps/cli/src/commands/eval/artifact-writer.ts | 8 +- .../cli/src/commands/eval/benchmark-writer.ts | 4 +- apps/cli/src/commands/eval/commands/assert.ts | 2 +- apps/cli/src/commands/eval/html-writer.ts | 6 +- apps/cli/src/commands/inspect/score.ts | 40 +-- apps/cli/src/commands/inspect/show.ts | 2 +- apps/cli/src/commands/pipeline/bench.ts | 2 +- apps/cli/src/commands/pipeline/input.ts | 10 +- apps/cli/src/commands/pipeline/run.ts | 8 +- apps/cli/src/commands/results/failures.ts | 2 +- apps/cli/src/commands/results/validate.ts | 2 +- .../commands/eval/artifact-writer.test.ts | 10 +- .../commands/eval/benchmark-writer.test.ts | 4 +- .../results/export-e2e-providers.test.ts | 2 +- apps/cli/test/commands/results/export.test.ts | 2 +- apps/cli/test/commands/results/report.test.ts | 8 +- apps/cli/test/commands/trace/trace.test.ts | 6 +- apps/studio/src/components/EvalDetail.tsx | 10 +- docs/plans/1109-grader-rename.md | 13 + examples/features/README.md | 16 +- .../features/agent-skills-evals/README.md | 4 +- .../multi-provider-skill-trigger.EVAL.yaml | 2 +- examples/features/basic/README.md | 2 +- .../basic/evals/check_python_keywords.py | 4 +- .../features/basic/evals/dataset.eval.yaml | 10 +- examples/features/batch-cli/README.md | 4 +- .../graders/check-batch-cli-output.ts | 2 +- examples/features/benchmark-tooling/README.md | 4 +- examples/features/code-grader-sdk/README.md | 4 +- .../code-grader-with-llm-calls/README.md | 12 +- .../evals/contextual-precision.eval.yaml | 2 +- .../evals/contextual-recall.eval.yaml | 2 +- .../scripts/contextual-precision.ts | 8 +- .../scripts/contextual-recall.ts | 6 +- examples/features/composite/README.md | 10 +- .../composite/evals/dataset.eval.yaml | 4 +- .../composite/prompts/conflict-resolution.md | 4 +- .../scripts/safety-gate-aggregator.js | 2 +- .../copilot-log-eval/.agentv/targets.yaml | 2 +- examples/features/copilot-log-eval/README.md | 10 +- .../evals/skill-trigger.EVAL.yaml | 2 +- .../evals/dataset.eval.yaml | 6 +- .../deterministic-evaluators/README.md | 12 +- .../evals/dataset.eval.yaml | 4 +- .../graders/assertions.ts | 2 +- .../features/document-extraction/README.md | 12 +- .../evals/confusion-metrics.eval.yaml | 2 +- .../evals/field-accuracy.eval.yaml | 8 +- .../document-extraction/fixtures/README.md | 8 +- .../graders/fuzzy_match.ts | 4 +- .../graders/header_confusion_metrics.ts | 2 +- .../graders/line_item_matching.ts | 2 +- .../graders/multi_field_fuzzy.ts | 2 +- .../document-extraction/mock_extractor.ts | 2 +- .../scripts/aggregate_metrics.ts | 10 +- .../execution-metrics/evals/dataset.eval.yaml | 10 +- .../scripts/check-metrics-present.ts | 2 +- .../file-changes/evals/dataset.eval.yaml | 2 +- examples/features/import-claude/README.md | 6 +- .../features/latency-assertions/README.md | 6 +- .../evals/dataset.eval.yaml | 4 +- .../latency-assertions/mock-latency-agent.ts | 2 +- .../multi-turn-conversation-live/README.md | 2 +- .../multi-turn-conversation/README.md | 4 +- examples/features/nlp-metrics/README.md | 4 +- .../nlp-metrics/evals/dataset.eval.yaml | 6 +- .../features/prompt-template-sdk/README.md | 2 +- .../evals/dataset.eval.yaml | 6 +- examples/features/rubric/README.md | 2 +- .../features/rubric/evals/check_syntax.py | 2 +- 
.../features/rubric/evals/dataset.eval.yaml | 16 +- .../evals/dataset.eval.yaml | 6 +- .../tool-evaluation-plugins/README.md | 4 +- .../evals/dataset.eval.yaml | 2 +- .../graders/tool-args-f1.ts | 4 +- .../graders/tool-call-f1.ts | 6 +- .../evals/trace-file-demo.eval.yaml | 2 +- .../evals/dataset.eval.yaml | 12 +- .../tool-trajectory-simple/mock-agent.ts | 4 +- examples/features/trace-evaluation/README.md | 8 +- .../evals/dataset.eval.yaml | 2 +- .../features/weighted-evaluators/README.md | 10 +- .../evals/dataset.eval.yaml | 16 +- .../prompts/experimental-check.md | 6 +- examples/showcase/README.md | 8 +- .../cross-repo-sync/evals/dataset.eval.yaml | 4 +- .../workspace-template/AGENTS.md | 4 +- .../skills/cross-repo-sync.md | 4 +- .../evals/validate_output.py | 4 +- .../showcase/evaluator-conformance/EVAL.yaml | 14 +- .../showcase/evaluator-conformance/README.md | 30 +- .../conformance-check.ts | 22 +- .../evaluators/keyword-grader.ts | 2 +- .../evaluator-conformance/fixtures.yaml | 14 +- examples/showcase/export-screening/README.md | 6 +- .../showcase/multi-model-benchmark/README.md | 8 +- .../evals/benchmark.eval.yaml | 2 +- .../offline-grader-benchmark/README.md | 2 +- .../scripts/score-grader-benchmark.ts | 10 +- .../psychotherapy/evals/validate_output.py | 4 +- .../tool-evaluation-plugins/README.md | 10 +- .../scripts/efficiency-scorer.ts | 2 +- .../scripts/pairwise-tool-compare.ts | 2 +- .../scripts/tool-selection-grader.ts | 4 +- .../tool-eval-demo.eval.yaml | 2 +- packages/core/src/evaluation/assertions.ts | 2 +- packages/core/src/evaluation/baseline.ts | 14 +- packages/core/src/evaluation/evaluate.ts | 12 +- packages/core/src/evaluation/evaluators.ts | 5 - .../core/src/evaluation/evaluators/index.ts | 93 ----- packages/core/src/evaluation/graders.ts | 5 + .../{evaluators => graders}/assertions.ts | 0 .../code-grader.ts} | 14 +- .../{evaluators => graders}/composite.ts | 62 ++-- .../{evaluators => graders}/cost.ts | 20 +- .../execution-metrics.ts | 20 +- .../{evaluators => graders}/field-accuracy.ts | 16 +- packages/core/src/evaluation/graders/index.ts | 83 +++++ .../{evaluators => graders}/inline-assert.ts | 6 +- .../{evaluators => graders}/latency.ts | 20 +- .../llm-grader-prompt.ts | 28 +- .../{evaluators => graders}/llm-grader.ts | 98 +++--- .../prompt-resolution.ts | 8 +- .../{evaluators => graders}/scoring.ts | 0 .../{evaluators => graders}/skill-trigger.ts | 10 +- .../{evaluators => graders}/token-usage.ts | 20 +- .../tool-trajectory.ts | 16 +- .../{evaluators => graders}/types.ts | 22 +- .../evaluation/loaders/agent-skills-parser.ts | 8 +- .../{evaluator-parser.ts => grader-parser.ts} | 102 +++--- .../src/evaluation/loaders/jsonl-parser.ts | 10 +- packages/core/src/evaluation/orchestrator.ts | 128 +++---- .../registry/assertion-discovery.ts | 12 +- ...iltin-evaluators.ts => builtin-graders.ts} | 231 ++++++------ .../evaluation/registry/grader-discovery.ts | 23 +- ...aluator-registry.ts => grader-registry.ts} | 54 +-- .../core/src/evaluation/registry/index.ts | 10 +- .../core/src/evaluation/template-variables.ts | 4 +- packages/core/src/evaluation/trace.ts | 4 +- packages/core/src/evaluation/types.ts | 159 +++++---- .../evaluation/validation/eval-file.schema.ts | 4 +- .../evaluation/validation/eval-validator.ts | 8 +- .../evaluation/validation/prompt-validator.ts | 2 +- packages/core/src/evaluation/yaml-parser.ts | 16 +- .../core/src/import/transcript-provider.ts | 2 +- packages/core/src/index.ts | 23 +- .../core/test/evaluation/baseline.test.ts | 10 +- ...est.ts => 
code-grader-file-backed.test.ts} | 10 +- ...test.ts => code-grader-multimodal.test.ts} | 12 +- .../evaluation/evaluators_variables.test.ts | 26 +- .../test/evaluation/execution-metrics.test.ts | 6 +- .../{evaluators.test.ts => graders.test.ts} | 144 ++++---- .../assertions.test.ts | 2 +- .../composite-threshold.test.ts | 44 +-- .../execution-metrics.test.ts | 98 +++--- .../inline-assert.test.ts | 16 +- .../{evaluators => graders}/negation.test.ts | 4 +- .../prompt-resolution.test.ts | 2 +- .../skill-trigger.test.ts | 48 ++- .../evaluation/llm-grader-multimodal.test.ts | 16 +- ...r-parser.test.ts => grader-parser.test.ts} | 330 +++++++++--------- .../core/test/evaluation/orchestrator.test.ts | 30 +- .../core/test/evaluation/token-usage.test.ts | 12 +- ...test.ts => tool-trajectory-grader.test.ts} | 200 +++++------ packages/eval/README.md | 8 +- packages/eval/src/runtime.ts | 6 - packages/eval/src/target-client.ts | 2 +- .../agentv-dev/skills/agentv-bench/SKILL.md | 18 +- .../skills/agentv-bench/agents/analyzer.md | 48 +-- .../skills/agentv-bench/agents/comparator.md | 22 +- .../references/description-optimization.md | 2 +- .../references/environment-adaptation.md | 2 +- .../agentv-bench/references/eval-yaml-spec.md | 10 +- .../migrating-from-skill-creator.md | 6 +- .../skills/agentv-eval-writer/SKILL.md | 4 +- .../references/rubric-evaluator.md | 4 +- .../skills/agentv-trace-analyst/SKILL.md | 2 +- 179 files changed, 1587 insertions(+), 1616 deletions(-) create mode 100644 docs/plans/1109-grader-rename.md delete mode 100644 packages/core/src/evaluation/evaluators.ts delete mode 100644 packages/core/src/evaluation/evaluators/index.ts create mode 100644 packages/core/src/evaluation/graders.ts rename packages/core/src/evaluation/{evaluators => graders}/assertions.ts (100%) rename packages/core/src/evaluation/{evaluators/code-evaluator.ts => graders/code-grader.ts} (97%) rename packages/core/src/evaluation/{evaluators => graders}/composite.ts (89%) rename packages/core/src/evaluation/{evaluators => graders}/cost.ts (71%) rename packages/core/src/evaluation/{evaluators => graders}/execution-metrics.ts (92%) rename packages/core/src/evaluation/{evaluators => graders}/field-accuracy.ts (96%) create mode 100644 packages/core/src/evaluation/graders/index.ts rename packages/core/src/evaluation/{evaluators => graders}/inline-assert.ts (83%) rename packages/core/src/evaluation/{evaluators => graders}/latency.ts (69%) rename packages/core/src/evaluation/{evaluators => graders}/llm-grader-prompt.ts (88%) rename packages/core/src/evaluation/{evaluators => graders}/llm-grader.ts (94%) rename packages/core/src/evaluation/{evaluators => graders}/prompt-resolution.ts (93%) rename packages/core/src/evaluation/{evaluators => graders}/scoring.ts (100%) rename packages/core/src/evaluation/{evaluators => graders}/skill-trigger.ts (91%) rename packages/core/src/evaluation/{evaluators => graders}/token-usage.ts (81%) rename packages/core/src/evaluation/{evaluators => graders}/tool-trajectory.ts (97%) rename packages/core/src/evaluation/{evaluators => graders}/types.ts (87%) rename packages/core/src/evaluation/loaders/{evaluator-parser.ts => grader-parser.ts} (95%) rename packages/core/src/evaluation/registry/{builtin-evaluators.ts => builtin-graders.ts} (63%) rename packages/core/src/evaluation/registry/{evaluator-registry.ts => grader-registry.ts} (61%) rename packages/core/test/evaluation/{code-evaluator-file-backed.test.ts => code-grader-file-backed.test.ts} (92%) rename 
packages/core/test/evaluation/{code-evaluator-multimodal.test.ts => code-grader-multimodal.test.ts} (97%) rename packages/core/test/evaluation/{evaluators.test.ts => graders.test.ts} (93%) rename packages/core/test/evaluation/{evaluators => graders}/assertions.test.ts (97%) rename packages/core/test/evaluation/{evaluators => graders}/composite-threshold.test.ts (91%) rename packages/core/test/evaluation/{evaluators => graders}/execution-metrics.test.ts (84%) rename packages/core/test/evaluation/{evaluators => graders}/inline-assert.test.ts (82%) rename packages/core/test/evaluation/{evaluators => graders}/negation.test.ts (94%) rename packages/core/test/evaluation/{evaluators => graders}/prompt-resolution.test.ts (97%) rename packages/core/test/evaluation/{evaluators => graders}/skill-trigger.test.ts (84%) rename packages/core/test/evaluation/loaders/{evaluator-parser.test.ts => grader-parser.test.ts} (81%) rename packages/core/test/evaluation/{tool-trajectory-evaluator.test.ts => tool-trajectory-grader.test.ts} (86%) diff --git a/AGENTS.md b/AGENTS.md index 336fb0500..0ec1d5789 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -19,13 +19,13 @@ AgentV's core should remain minimal. Complex or domain-specific logic belongs in **Extension points (prefer these over adding built-ins):** - `code-grader` scripts for custom evaluation logic -- `llm-grader` evaluators with custom prompt files for domain-specific LLM grading +- `llm-grader` graders with custom prompt files for domain-specific LLM grading - CLI wrappers that consume AgentV's JSON/JSONL output for post-processing (aggregation, comparison, reporting) -**Ask yourself:** "Can this be achieved with existing primitives + a plugin or wrapper?" If yes, it should not be a built-in. This includes adding config overrides to existing evaluators — if a niche provider needs custom tool-name matching, that's a code-grader, not a new config field. +**Ask yourself:** "Can this be achieved with existing primitives + a plugin or wrapper?" If yes, it should not be a built-in. This includes adding config overrides to existing graders — if a niche provider needs custom tool-name matching, that's a code-grader, not a new config field. ### 2. Built-ins for Primitives Only -Built-in evaluators provide **universal primitives** that users compose. A primitive is: +Built-in graders provide **universal primitives** that users compose. A primitive is: - Stateless and deterministic - Has a single, clear responsibility - Cannot be trivially composed from other primitives @@ -77,11 +77,11 @@ AI agents are the primary users of AgentV—not humans reading docs. 
Design for ## Project Structure - `packages/core/` - Evaluation engine, providers, grading -   - `src/evaluation/registry/` - Extensible evaluator registry (EvaluatorRegistry, assertion discovery) +   - `src/evaluation/registry/` - Extensible grader registry (GraderRegistry, assertion discovery)   - `src/evaluation/providers/provider-registry.ts` - Provider plugin registry   - `src/evaluation/evaluate.ts` - `evaluate()` programmatic API   - `src/evaluation/config.ts` - `defineConfig()` for typed agentv.config.ts -- `packages/eval/` - Lightweight assertion SDK (`defineAssertion`, `defineCodeJudge`) +- `packages/eval/` - Lightweight assertion SDK (`defineAssertion`, `defineCodeGrader`) - `apps/cli/` - Command-line interface (published as `agentv`) - `src/commands/create/` - Scaffold commands (`agentv create assertion/eval`) - `examples/features/sdk-*` - SDK usage examples (custom assertion, programmatic API, config file) @@ -261,9 +261,9 @@ Tests should be lean and focused on what matters. Follow these principles: - **Regression tests > comprehensive tests.** A test that would have caught the bug is worth more than five tests that exercise happy paths. - **Tests are executable contracts.** When a module's behavioral contract changes, the tests must reflect the new contract — not just the happy path. If you change what a function promises, update its tests to assert the new promise. -### Verifying Evaluator Changes +### Verifying Grader Changes -Unit tests alone are insufficient for evaluator changes. After implementing or modifying evaluators: +Unit tests alone are insufficient for grader changes. After implementing or modifying graders: 1. **Copy `.env` to the worktree** if running in a git worktree (e2e tests need environment variables): ```bash @@ -272,7 +272,7 @@ Unit tests alone are insufficient for evaluator changes. After implementing or m ```powershell Copy-Item D:/path/to/main/.env .env ``` - Do not claim e2e or evaluator verification results unless this preflight has passed. + Do not claim e2e or grader verification results unless this preflight has passed. 2. **Run an actual eval** with a real example file: ```bash @@ -280,13 +280,13 @@ Unit tests alone are insufficient for evaluator changes. After implementing or m ``` 3. **Inspect the results JSONL** to verify: - - The correct evaluator type is invoked (check `scores[].type`) + - The correct grader type is invoked (check `scores[].type`) - Scores are calculated as expected - Assertions array reflects the evaluation logic (each entry has `text`, `passed`, optional `evidence`) 4. **Update baseline files** if output format changes (e.g., type name renames). Baseline files live alongside eval YAML files as `*.baseline.jsonl` and contain expected `scores[].type` values. There are 30+ baseline files across `examples/`. -5. **Note:** `--dry-run` returns schema-valid mock responses (`{}` as output, zeroed `tokenUsage`). Built-in graders will not crash, but scores are meaningless. Use it for testing harness flow, not evaluator logic. +5. **Note:** `--dry-run` returns schema-valid mock responses (`{}` as output, zeroed `tokenUsage`). Built-in graders will not crash, but scores are meaningless. Use it for testing harness flow, not grader logic. ### Completing Work — E2E Checklist Before marking any branch as ready for review, complete this checklist: - **Green (with your changes):** Run the identical scenario with your branch. Confirm the fix or feature works correctly from the end user's perspective. Capture the output.
- **Document both** red and green results in the PR description or comments so reviewers can see the before/after evidence. - For evaluator changes, this means running a real eval (not `--dry-run`) and inspecting the output JSONL. For CLI/UX changes, this means running the CLI command and verifying the console output. + For grader changes, this means running a real eval (not `--dry-run`) and inspecting the output JSONL. For CLI/UX changes, this means running the CLI command and verifying the console output. -4. **Verify no regressions** in areas adjacent to your changes (e.g., if you changed evaluator parsing, run an eval that exercises different evaluator types). +4. **Verify no regressions** in areas adjacent to your changes (e.g., if you changed grader parsing, run an eval that exercises different grader types). -5. **Live eval verification**: For changes affecting scoring, thresholds, or evaluator behavior, run at least one real eval with a live provider (not `--dry-run`) and verify the output JSONL has correct scores, verdicts, and execution status. +5. **Live eval verification**: For changes affecting scoring, thresholds, or grader behavior, run at least one real eval with a live provider (not `--dry-run`) and verify the output JSONL has correct scores, verdicts, and execution status. 6. **Studio UX verification**: For changes affecting config, scoring display, or studio API, use `agent-browser` to verify the studio UI still renders and functions correctly (settings page loads, pass/fail indicators are correct, config saves work). @@ -323,15 +323,15 @@ When making changes to functionality: 1. **Docs site** (`apps/web/src/content/docs/`): Update human-readable documentation on agentv.dev. This is the comprehensive reference. -2. **Skill files** (`plugins/agentv-dev/skills/agentv-eval-builder/`): Update the AI-focused reference card if the change affects YAML schema, evaluator types, or CLI commands. Keep concise — link to docs site for details. +2. **Skill files** (`plugins/agentv-dev/skills/agentv-eval-builder/`): Update the AI-focused reference card if the change affects YAML schema, grader types, or CLI commands. Keep concise — link to docs site for details. 3. **Examples** (`examples/`): Update any example code, scripts, or eval YAML files that exercise the changed functionality. Examples are both documentation and integration tests. 4. **README.md**: Keep minimal. Links point to agentv.dev. -## Evaluator Type System +## Grader Type System -Evaluator types use **kebab-case** everywhere (matching promptfoo convention): +Grader types use **kebab-case** everywhere (matching promptfoo convention): - **YAML config:** `type: llm-grader`, `type: is-json`, `type: execution-metrics` - **Internal TypeScript:** `EvaluatorKind = 'llm-grader' | 'is-json' | ...` @@ -340,7 +340,7 @@ Evaluator types use **kebab-case** everywhere (matching promptfoo convention): **Source of truth:** `EVALUATOR_KIND_VALUES` array in `packages/core/src/evaluation/types.ts` -**Backward compatibility:** Snake_case is accepted in YAML (`llm_judge` → `llm-grader`) via `normalizeEvaluatorType()` in `evaluator-parser.ts`. Single-word types (`contains`, `equals`, `regex`, `latency`, `cost`) have no separator and are unchanged. +**Backward compatibility:** Snake_case is accepted in YAML (`llm_judge` → `llm-grader`) via `normalizeGraderType()` in `grader-parser.ts`. Single-word types (`contains`, `equals`, `regex`, `latency`, `cost`) have no separator and are unchanged. 
**Two type definitions exist:** - `EvaluatorKind` in `packages/core/src/evaluation/types.ts` — internal, canonical diff --git a/README.md b/README.md index 23415f112..fc9df19dc 100644 --- a/README.md +++ b/README.md @@ -107,7 +107,7 @@ console.log(`${summary.passed}/${summary.total} passed`); Full docs at [agentv.dev/docs](https://agentv.dev/docs/getting-started/introduction/). - [Eval files](https://agentv.dev/docs/evaluation/eval-files/) — format and structure -- [Custom evaluators](https://agentv.dev/docs/evaluators/custom-evaluators/) — code graders in any language +- [Custom graders](https://agentv.dev/docs/graders/custom-graders/) — code graders in any language - [Rubrics](https://agentv.dev/docs/evaluation/rubrics/) — structured criteria scoring - [Targets](https://agentv.dev/docs/targets/configuration/) — configure agents and providers - [Compare results](https://agentv.dev/docs/tools/compare/) — A/B testing and regression detection diff --git a/apps/cli/src/commands/eval/artifact-writer.ts b/apps/cli/src/commands/eval/artifact-writer.ts index c7844ccf4..1184cd801 100644 --- a/apps/cli/src/commands/eval/artifact-writer.ts +++ b/apps/cli/src/commands/eval/artifact-writer.ts @@ -4,7 +4,7 @@ import path from 'node:path'; import { DEFAULT_THRESHOLD, type EvaluationResult, - type EvaluatorResult, + type GraderResult, toTranscriptJsonLines, } from '@agentv/core'; import { toSnakeCaseDeep } from '../../utils/case-conversion.js'; @@ -227,9 +227,7 @@ function buildAssertions(result: EvaluationResult): GradingArtifact['assertions' // Build graders list // --------------------------------------------------------------------------- -function buildEvaluators( - scores: readonly EvaluatorResult[] | undefined, -): GradingArtifact['graders'] { +function buildEvaluators(scores: readonly GraderResult[] | undefined): GradingArtifact['graders'] { if (!scores || scores.length === 0) { return undefined; } @@ -370,7 +368,7 @@ export function buildBenchmarkArtifact( runSummary[target] = entry as (typeof runSummary)[string]; } - // Per-evaluator summary across all results + // Per-grader summary across all results const evaluatorScores = new Map(); for (const result of results) { if (result.scores) { diff --git a/apps/cli/src/commands/eval/benchmark-writer.ts b/apps/cli/src/commands/eval/benchmark-writer.ts index 562dd8a87..671de1160 100644 --- a/apps/cli/src/commands/eval/benchmark-writer.ts +++ b/apps/cli/src/commands/eval/benchmark-writer.ts @@ -32,10 +32,10 @@ function computeStats(values: readonly number[]): BenchmarkStats { } /** - * Compute per-test pass_rate from evaluator scores. + * Compute per-test pass_rate from grader scores. * * For each test, pass_rate = count(evaluator.score >= 0.8) / total_evaluators. - * If no per-evaluator scores exist, falls back to the top-level result score + * If no per-grader scores exist, falls back to the top-level result score * with the same threshold (>= 0.8 → 1.0, else 0.0). */ function computePassRate(result: EvaluationResult): number { diff --git a/apps/cli/src/commands/eval/commands/assert.ts b/apps/cli/src/commands/eval/commands/assert.ts index b8e415574..519fbfc84 100644 --- a/apps/cli/src/commands/eval/commands/assert.ts +++ b/apps/cli/src/commands/eval/commands/assert.ts @@ -62,7 +62,7 @@ export const evalAssertCommand = command({ process.exit(1); } - // Build payload matching CodeEvaluator's expected format (snake_case). + // Build payload matching CodeGrader's expected format (snake_case). 
// Include all fields that defineCodeGrader validates as required. const payload = JSON.stringify( { diff --git a/apps/cli/src/commands/eval/html-writer.ts b/apps/cli/src/commands/eval/html-writer.ts index affff458f..1dd17eee4 100644 --- a/apps/cli/src/commands/eval/html-writer.ts +++ b/apps/cli/src/commands/eval/html-writer.ts @@ -500,10 +500,10 @@ const SCRIPT = ` h+='

Output

'+esc(r.output?JSON.stringify(r.output,null,2):"")+"
"; h+=""; - /* evaluator results */ + /* grader results */ if(r.scores&&r.scores.length>0){ - h+="

Evaluator Results

"; - h+=''; + h+="

Grader Results

"; + h+='
GraderScoreStatusAssertions
'; for(var i=0;i=0.5?"pass":"fail"; var evAssertions=ev.assertions||[]; diff --git a/apps/cli/src/commands/inspect/score.ts b/apps/cli/src/commands/inspect/score.ts index da986096c..3abdc9ca8 100644 --- a/apps/cli/src/commands/inspect/score.ts +++ b/apps/cli/src/commands/inspect/score.ts @@ -2,9 +2,9 @@ import { type EvalTest, type EvaluationContext, type EvaluationScore, - type Evaluator, - type EvaluatorConfig, - type EvaluatorDispatchContext, + type Grader, + type GraderConfig, + type GraderDispatchContext, type Message, type Provider, type ProviderRequest, @@ -24,7 +24,7 @@ import { } from './utils.js'; /** - * Evaluator types that work without an LLM provider. + * Grader types that work without an LLM provider. */ const SUPPORTED_TYPES = [ 'contains', @@ -52,7 +52,7 @@ function parseKeyValues(s: string): Record<string, string> { } /** - * Parse an inline evaluator spec string into an EvaluatorConfig. + * Parse an inline grader spec string into a GraderConfig. * * Supported formats: * contains:value * * token-usage:max_total=N,max_input=N,max_output=N * execution-metrics:max_tool_calls=N,max_tokens=N,max_llm_calls=N,... */ -export function parseAssertSpec(spec: string): EvaluatorConfig { +export function parseAssertSpec(spec: string): GraderConfig { const colonIdx = spec.indexOf(':'); // Normalize snake_case to kebab-case for backward compat const type = (colonIdx === -1 ? spec : spec.slice(0, colonIdx)).replace(/_/g, '-'); @@ -73,31 +73,31 @@ export function parseAssertSpec(spec: string): EvaluatorConfig { switch (type) { case 'contains': if (!params) throw new Error('contains requires a value: contains:'); - return { name: 'contains', type: 'contains', value: params } as EvaluatorConfig; + return { name: 'contains', type: 'contains', value: params } as GraderConfig; case 'regex': if (!params) throw new Error('regex requires a pattern: regex:'); - return { name: 'regex', type: 'regex', value: params } as EvaluatorConfig; + return { name: 'regex', type: 'regex', value: params } as GraderConfig; case 'is-json': - return { name: 'is-json', type: 'is-json' } as EvaluatorConfig; + return { name: 'is-json', type: 'is-json' } as GraderConfig; case 'equals': if (!params) throw new Error('equals requires a value: equals:'); - return { name: 'equals', type: 'equals', value: params } as EvaluatorConfig; + return { name: 'equals', type: 'equals', value: params } as GraderConfig; case 'latency': { const threshold = Number(params); if (!params || Number.isNaN(threshold)) throw new Error('latency requires a threshold in ms: latency:'); - return { name: 'latency', type: 'latency', threshold } as EvaluatorConfig; + return { name: 'latency', type: 'latency', threshold } as GraderConfig; } case 'cost': { const budget = Number(params); if (!params || Number.isNaN(budget)) throw new Error('cost requires a budget in USD: cost:'); - return { name: 'cost', type: 'cost', budget } as EvaluatorConfig; + return { name: 'cost', type: 'cost', budget } as GraderConfig; } case 'token-usage': { @@ -106,7 +106,7 @@ export function parseAssertSpec(spec: string): EvaluatorConfig { if (kv.max_total) config.max_total = Number(kv.max_total); if (kv.max_input) config.max_input = Number(kv.max_input); if (kv.max_output) config.max_output = Number(kv.max_output); - return config as EvaluatorConfig; + return config as GraderConfig; } case 'execution-metrics': { @@ -120,12 +120,12 @@ export function parseAssertSpec(spec: string): EvaluatorConfig { if (kv.max_tokens)
config.max_tokens = Number(kv.max_tokens); if (kv.max_cost_usd) config.max_cost_usd = Number(kv.max_cost_usd); if (kv.max_duration_ms) config.max_duration_ms = Number(kv.max_duration_ms); - return config as EvaluatorConfig; + return config as GraderConfig; } default: throw new Error( - `Unsupported evaluator type: "${type}". Supported: ${SUPPORTED_TYPES.join(', ')}`, + `Unsupported grader type: "${type}". Supported: ${SUPPORTED_TYPES.join(', ')}`, ); } } @@ -171,7 +171,7 @@ const stubProvider: Provider = { /** * A no-op evaluator stub used as the required llmGrader in the dispatch context. */ -const stubLlmGrader: Evaluator = { +const stubLlmGrader: Grader = { kind: 'llm-grader', evaluate(): EvaluationScore { throw new Error('trace score does not support LLM-based evaluators'); @@ -189,12 +189,12 @@ interface ScoreResult { async function runScore( results: RawResult[], - evaluatorConfig: EvaluatorConfig, + evaluatorConfig: GraderConfig, testIdFilter?: string, ): Promise { const registry = createBuiltinRegistry(); - const dispatchContext: EvaluatorDispatchContext = { + const dispatchContext: GraderDispatchContext = { llmGrader: stubLlmGrader, registry, }; @@ -308,7 +308,7 @@ export const traceScoreCommand = command({ long: 'assert', short: 'a', description: - 'Evaluator spec: contains:, regex:, is-json, equals:, latency:, cost:, token-usage:, execution-metrics:', + 'Grader spec: contains:, regex:, is-json, equals:, latency:, cost:, token-usage:, execution-metrics:', }), testId: option({ type: optional(string), @@ -324,7 +324,7 @@ export const traceScoreCommand = command({ }, handler: async ({ file, assert: assertSpec, testId, format }) => { // Parse the evaluator spec - let evaluatorConfig: EvaluatorConfig; + let evaluatorConfig: GraderConfig; try { evaluatorConfig = parseAssertSpec(assertSpec); } catch (err) { diff --git a/apps/cli/src/commands/inspect/show.ts b/apps/cli/src/commands/inspect/show.ts index 50e12f7e7..c738a4aad 100644 --- a/apps/cli/src/commands/inspect/show.ts +++ b/apps/cli/src/commands/inspect/show.ts @@ -46,7 +46,7 @@ function renderFlatTrace(result: RawResult): string { } /** - * Render per-evaluator scores inline. + * Render per-grader scores inline. */ function renderScores(scores: { name: string; score: number; type: string }[]): string { return scores diff --git a/apps/cli/src/commands/pipeline/bench.ts b/apps/cli/src/commands/pipeline/bench.ts index 1a57a2db1..7fe4db49f 100644 --- a/apps/cli/src/commands/pipeline/bench.ts +++ b/apps/cli/src/commands/pipeline/bench.ts @@ -28,7 +28,7 @@ interface EvaluatorScore { export const evalBenchCommand = command({ name: 'bench', - description: 'Merge evaluator scores and produce benchmark artifacts', + description: 'Merge grader scores and produce benchmark artifacts', args: { exportDir: positional({ type: string, diff --git a/apps/cli/src/commands/pipeline/input.ts b/apps/cli/src/commands/pipeline/input.ts index caffe39c4..5f24a87f0 100644 --- a/apps/cli/src/commands/pipeline/input.ts +++ b/apps/cli/src/commands/pipeline/input.ts @@ -22,7 +22,7 @@ import { readFile } from 'node:fs/promises'; import { mkdir, writeFile } from 'node:fs/promises'; import { dirname, join, relative, resolve } from 'node:path'; -import type { CodeEvaluatorConfig, EvaluatorConfig, LlmGraderEvaluatorConfig } from '@agentv/core'; +import type { CodeGraderConfig, GraderConfig, LlmGraderConfig } from '@agentv/core'; /** Assertion types that can be graded deterministically without external scripts or LLMs. 
*/ const BUILTIN_ASSERTION_TYPES = new Set([ @@ -212,7 +212,7 @@ export const evalInputCommand = command({ async function writeGraderConfigs( testDir: string, - assertions: readonly EvaluatorConfig[], + assertions: readonly GraderConfig[], evalDir: string, ): Promise { const codeGradersDir = join(testDir, 'code_graders'); @@ -227,7 +227,7 @@ async function writeGraderConfigs( await mkdir(codeGradersDir, { recursive: true }); hasCodeGraders = true; } - const config = assertion as CodeEvaluatorConfig; + const config = assertion as CodeGraderConfig; await writeJson(join(codeGradersDir, `${config.name}.json`), { name: config.name, type: 'code-grader', @@ -241,7 +241,7 @@ async function writeGraderConfigs( await mkdir(llmGradersDir, { recursive: true }); hasLlmGraders = true; } - const config = assertion as LlmGraderEvaluatorConfig; + const config = assertion as LlmGraderConfig; let promptContent = ''; if (config.resolvedPromptPath) { @@ -266,7 +266,7 @@ async function writeGraderConfigs( await mkdir(codeGradersDir, { recursive: true }); hasCodeGraders = true; } - const config = assertion as EvaluatorConfig & { value?: unknown; flags?: string }; + const config = assertion as GraderConfig & { value?: unknown; flags?: string }; await writeJson(join(codeGradersDir, `${config.name}.json`), { name: config.name, type: config.type, diff --git a/apps/cli/src/commands/pipeline/run.ts b/apps/cli/src/commands/pipeline/run.ts index 67235123b..5e2758fb2 100644 --- a/apps/cli/src/commands/pipeline/run.ts +++ b/apps/cli/src/commands/pipeline/run.ts @@ -18,7 +18,7 @@ import { tmpdir } from 'node:os'; import { dirname, join, relative, resolve } from 'node:path'; import { deriveCategory, loadTestSuite } from '@agentv/core'; -import type { CodeEvaluatorConfig, EvaluatorConfig, LlmGraderEvaluatorConfig } from '@agentv/core'; +import type { CodeGraderConfig, GraderConfig, LlmGraderConfig } from '@agentv/core'; import { command, number, oneOf, option, optional, positional, string } from 'cmd-ts'; import { buildDefaultRunDir } from '../eval/result-layout.js'; @@ -391,7 +391,7 @@ async function writeJson(filePath: string, data: unknown): Promise { async function writeGraderConfigs( testDir: string, - assertions: readonly EvaluatorConfig[], + assertions: readonly GraderConfig[], evalDir: string, ): Promise { const codeGradersDir = join(testDir, 'code_graders'); @@ -406,7 +406,7 @@ async function writeGraderConfigs( await mkdir(codeGradersDir, { recursive: true }); hasCodeGraders = true; } - const config = assertion as CodeEvaluatorConfig; + const config = assertion as CodeGraderConfig; await writeJson(join(codeGradersDir, `${config.name}.json`), { name: config.name, command: config.command, @@ -419,7 +419,7 @@ async function writeGraderConfigs( await mkdir(llmGradersDir, { recursive: true }); hasLlmGraders = true; } - const config = assertion as LlmGraderEvaluatorConfig; + const config = assertion as LlmGraderConfig; let promptContent = ''; if (config.resolvedPromptPath) { try { diff --git a/apps/cli/src/commands/results/failures.ts b/apps/cli/src/commands/results/failures.ts index 8b3709bb3..314d66887 100644 --- a/apps/cli/src/commands/results/failures.ts +++ b/apps/cli/src/commands/results/failures.ts @@ -30,7 +30,7 @@ export function formatFailures(results: EvaluationResult[]): FailureEntry[] { evidence: a.evidence, })); - // Fall back to per-evaluator assertions + // Fall back to per-grader assertions if (assertions.length === 0 && r.scores) { assertions = r.scores.flatMap((s) => (s.assertions ?? 
[]).map((a) => ({ diff --git a/apps/cli/src/commands/results/validate.ts b/apps/cli/src/commands/results/validate.ts index 991ffe7df..f993eb328 100644 --- a/apps/cli/src/commands/results/validate.ts +++ b/apps/cli/src/commands/results/validate.ts @@ -144,7 +144,7 @@ function checkIndexJsonl(runDir: string): { diagnostics: Diagnostic[]; entries: if (!entry.scores || !Array.isArray(entry.scores) || entry.scores.length === 0) { diagnostics.push({ severity: 'warning', - message: `index.jsonl line ${i + 1} (${entry.test_id ?? '?'}): missing 'scores[]' array — dashboard may not show per-evaluator breakdown`, + message: `index.jsonl line ${i + 1} (${entry.test_id ?? '?'}): missing 'scores[]' array — dashboard may not show per-grader breakdown`, }); } else { for (let j = 0; j < entry.scores.length; j++) { diff --git a/apps/cli/test/commands/eval/artifact-writer.test.ts b/apps/cli/test/commands/eval/artifact-writer.test.ts index 26cd097ac..b5296127e 100644 --- a/apps/cli/test/commands/eval/artifact-writer.test.ts +++ b/apps/cli/test/commands/eval/artifact-writer.test.ts @@ -2,7 +2,7 @@ import { afterEach, beforeEach, describe, expect, it } from 'bun:test'; import { readFile, readdir, rm } from 'node:fs/promises'; import path from 'node:path'; -import type { EvaluationResult, EvaluatorResult } from '@agentv/core'; +import type { EvaluationResult, GraderResult } from '@agentv/core'; import { type AggregateGradingArtifact, @@ -33,7 +33,7 @@ function makeResult(overrides: Partial = {}): EvaluationResult } as EvaluationResult; } -function makeEvaluatorResult(overrides: Partial = {}): EvaluatorResult { +function makeEvaluatorResult(overrides: Partial = {}): GraderResult { return { name: 'grader-1', type: 'llm-grader', @@ -43,7 +43,7 @@ function makeEvaluatorResult(overrides: Partial = {}): Evaluato { text: 'criterion-b', passed: false }, ], ...overrides, - } as EvaluatorResult; + } as GraderResult; } // --------------------------------------------------------------------------- @@ -99,7 +99,7 @@ describe('buildGradingArtifact', () => { }); }); - it('uses top-level assertions when no evaluator scores', () => { + it('uses top-level assertions when no grader scores', () => { const result = makeResult({ assertions: [ { text: 'ok-1', passed: true }, @@ -267,7 +267,7 @@ describe('buildBenchmarkArtifact', () => { expect(benchmark.run_summary['gpt-4'].time_seconds.stddev).toBe(15); }); - it('includes per-evaluator summary', () => { + it('includes per-grader summary', () => { const results = [ makeResult({ scores: [makeEvaluatorResult({ name: 'quality', type: 'llm-grader', score: 0.9 })], diff --git a/apps/cli/test/commands/eval/benchmark-writer.test.ts b/apps/cli/test/commands/eval/benchmark-writer.test.ts index 1a12bdf9e..56349ecd0 100644 --- a/apps/cli/test/commands/eval/benchmark-writer.test.ts +++ b/apps/cli/test/commands/eval/benchmark-writer.test.ts @@ -18,7 +18,7 @@ function makeResult(overrides: Partial = {}): EvaluationResult } describe('buildBenchmarkJson', () => { - it('computes pass_rate from per-evaluator scores', () => { + it('computes pass_rate from per-grader scores', () => { const results = [ makeResult({ scores: [ @@ -34,7 +34,7 @@ describe('buildBenchmarkJson', () => { expect(benchmark.run_summary.with_skill.pass_rate.stddev).toBe(0); }); - it('falls back to top-level score when no evaluator scores', () => { + it('falls back to top-level score when no grader scores', () => { const results = [makeResult({ score: 0.9 }), makeResult({ score: 0.5 })]; const benchmark = buildBenchmarkJson(results); 
// First passes (>= 0.8 → 1.0), second fails (< 0.8 → 0.0), mean = 0.5 diff --git a/apps/cli/test/commands/results/export-e2e-providers.test.ts b/apps/cli/test/commands/results/export-e2e-providers.test.ts index fafcec1a7..5cd562b44 100644 --- a/apps/cli/test/commands/results/export-e2e-providers.test.ts +++ b/apps/cli/test/commands/results/export-e2e-providers.test.ts @@ -471,7 +471,7 @@ describe('export e2e — multi-provider metrics verification', () => { expect(grading.execution_metrics.tool_calls.Read).toBe(2); expect(grading.execution_metrics.tool_calls.Write).toBe(1); - // Evaluators + // Graders expect(grading.graders).toHaveLength(1); expect(grading.graders?.[0].name).toBe('accuracy'); }); diff --git a/apps/cli/test/commands/results/export.test.ts b/apps/cli/test/commands/results/export.test.ts index 2ef02e04c..75f599e33 100644 --- a/apps/cli/test/commands/results/export.test.ts +++ b/apps/cli/test/commands/results/export.test.ts @@ -323,7 +323,7 @@ describe('results export', () => { ); }); - it('should include per-evaluator summary in benchmark when scores present', async () => { + it('should include per-grader summary in benchmark when scores present', async () => { const outputDir = path.join(tempDir, 'output'); const content = toJsonl(RESULT_FULL, RESULT_PARTIAL); diff --git a/apps/cli/test/commands/results/report.test.ts b/apps/cli/test/commands/results/report.test.ts index e33b5de87..e2040eeea 100644 --- a/apps/cli/test/commands/results/report.test.ts +++ b/apps/cli/test/commands/results/report.test.ts @@ -4,7 +4,7 @@ import { tmpdir } from 'node:os'; import path from 'node:path'; import vm from 'node:vm'; -import type { EvaluationResult, EvaluatorResult } from '@agentv/core'; +import type { EvaluationResult, GraderResult } from '@agentv/core'; import { writeArtifactsFromResults } from '../../../src/commands/eval/artifact-writer.js'; import { @@ -17,8 +17,8 @@ function makeScore( name: string, type: string, score: number, - assertions: EvaluatorResult['assertions'], -): EvaluatorResult { + assertions: GraderResult['assertions'], +): GraderResult { return { name, type, @@ -130,7 +130,7 @@ describe('results report', () => { expect(html).toContain('Assertions'); expect(html).toContain('assertion-badge'); expect(html).not.toContain('Grader Results'); - expect(html).not.toContain('Evaluator Results'); + expect(html).not.toContain('Grader Results'); }); it('emits an inline report script that parses successfully', async () => { diff --git a/apps/cli/test/commands/trace/trace.test.ts b/apps/cli/test/commands/trace/trace.test.ts index 808586cbc..eb4375018 100644 --- a/apps/cli/test/commands/trace/trace.test.ts +++ b/apps/cli/test/commands/trace/trace.test.ts @@ -541,12 +541,12 @@ describe('parseAssertSpec', () => { }); describe('unsupported types', () => { - it('should throw on unknown evaluator type', () => { - expect(() => parseAssertSpec('llm-grader')).toThrow('Unsupported evaluator type'); + it('should throw on unknown grader type', () => { + expect(() => parseAssertSpec('llm-grader')).toThrow('Unsupported grader type'); }); it('should throw on empty spec', () => { - expect(() => parseAssertSpec('')).toThrow('Unsupported evaluator type'); + expect(() => parseAssertSpec('')).toThrow('Unsupported grader type'); }); }); }); diff --git a/apps/studio/src/components/EvalDetail.tsx b/apps/studio/src/components/EvalDetail.tsx index 09c41aa9b..89acf0306 100644 --- a/apps/studio/src/components/EvalDetail.tsx +++ b/apps/studio/src/components/EvalDetail.tsx @@ -4,7 +4,7 @@ * * Layout: compact 
header → tabs → full-height content area. * Scores and assertions are only visible in the Checks tab. - * Assertions are grouped by evaluator name. + * Assertions are grouped by grader name. */ import { useState } from 'react'; @@ -144,7 +144,7 @@ function AssertionCard({ assertion }: { assertion: AssertionEntry }) { } /** - * Checks tab: overall score → per-evaluator scores → assertions → failure reasons. + * Checks tab: overall score → per-grader scores → assertions → failure reasons. * Assertions are grouped by evaluator when per-score assertion data is available. */ function ChecksTab({ result }: { result: EvalResult }) { @@ -201,10 +201,10 @@ function ChecksTab({ result }: { result: EvalResult }) { - {/* Per-evaluator scores */} + {/* Per-grader scores */} {result.scores && result.scores.length > 0 && (
-

Evaluator Scores

+

Grader Scores

{result.scores.map((s, i) => (
@@ -224,7 +224,7 @@ function ChecksTab({ result }: { result: EvalResult }) { {useGrouped ? (
{scoresWithAssertions.map((s, si) => { - const graderLabel = s.name ?? s.type ?? `Evaluator ${si + 1}`; + const graderLabel = s.name ?? s.type ?? `Grader ${si + 1}`; return (

diff --git a/docs/plans/1109-grader-rename.md b/docs/plans/1109-grader-rename.md new file mode 100644 index 000000000..6683027dd --- /dev/null +++ b/docs/plans/1109-grader-rename.md @@ -0,0 +1,13 @@ +Problem: Hard-rename internal AgentV terminology from Evaluator to Grader across core, SDK exports, tests, docs, and UI copy, without changing YAML kind strings or `scores[].type`. + +Implementation plan: +1. Move core evaluator source/tests to `graders/` and rename registry/loader files plus exported TS symbols (`Evaluator` -> `Grader`, `EvaluatorRegistry` -> `GraderRegistry`, etc.). +2. Update dependent packages and applications (`packages/eval`, CLI, Studio) to consume the renamed symbols and user-facing terminology. +3. Sweep examples, docs, plugins, and repo guidance for concept-noun `evaluator` references; add the breaking-change migration notes required by the issue. +4. Run required validation, capture the live eval wire-format check, smoke-check Studio labels, and open/push the draft PR. + +Scope guardrails: +- Keep YAML kind strings unchanged. +- Keep `scores[].type` unchanged. +- Keep `evaluation/` and `evaluate()` unchanged. +- Do not add compatibility aliases for removed `Evaluator*` symbols. diff --git a/examples/features/README.md b/examples/features/README.md index 72d50e39d..1f05d2751 100644 --- a/examples/features/README.md +++ b/examples/features/README.md @@ -9,7 +9,7 @@ Focused examples for specific AgentV capabilities. Find your use case below, the |---------|-------------| | [basic](basic/) | Core schema: input, expected output, file references, multi-turn | | [basic-jsonl](basic-jsonl/) | Load test cases from an external JSONL file | -| [default-evaluators](default-evaluators/) | Apply the same assertions to every test without repeating them | +| [default-graders](default-graders/) | Apply the same assertions to every test without repeating them | --- @@ -17,9 +17,9 @@ Focused examples for specific AgentV capabilities. Find your use case below, the | Example | Description | |---------|-------------| | [rubric](rubric/) | Boolean rubric criteria — pass/fail each with a code grader or LLM check | -| [weighted-evaluators](weighted-evaluators/) | Multiple named `llm-grader` assertions with per-evaluator weights | +| [weighted-graders](weighted-graders/) | Multiple named `llm-grader` assertions with per-grader weights | | [composite](composite/) | Safety gate and weighted aggregation patterns | -| [threshold-evaluator](threshold-evaluator/) | Pass a test if a configurable percentage of sub-evaluators pass | +| [threshold-grader](threshold-grader/) | Pass a test if a configurable percentage of sub-graders pass | | [multi-turn-conversation](multi-turn-conversation/) | Grade a multi-turn conversation with per-turn score breakdowns | | [preprocessors](preprocessors/) | Convert `ContentFile` outputs into grader-readable text before `llm-grader` runs | @@ -30,7 +30,7 @@ Focused examples for specific AgentV capabilities. 
Find your use case below, the |---------|-------------| | [assert](assert/) | Core built-ins: `contains`, `regex`, `is-json`, `equals`, `starts_with`, `ends_with` | | [assert-extended](assert-extended/) | Extended variants: `contains_any`, `icontains`, `icontains_all`, regex flags | -| [deterministic-evaluators](deterministic-evaluators/) | Full showcase of all deterministic assertion types | +| [deterministic-graders](deterministic-graders/) | Full showcase of all deterministic assertion types | | [nlp-metrics](nlp-metrics/) | ROUGE, BLEU, cosine/Jaccard similarity, Levenshtein as code graders | --- @@ -143,8 +143,8 @@ Focused examples for specific AgentV capabilities. Find your use case below, the | [compare](compare/) | Benchmarking | | [composite](composite/) | LLM grading | | [copilot-log-eval](copilot-log-eval/) | Offline evaluation | -| [default-evaluators](default-evaluators/) | Getting started | -| [deterministic-evaluators](deterministic-evaluators/) | Deterministic assertions | +| [default-graders](default-graders/) | Getting started | +| [deterministic-graders](deterministic-graders/) | Deterministic assertions | | [document-extraction](document-extraction/) | Observability & export | | [env-interpolation](env-interpolation/) | Dataset & input | | [eval-assert-demo](eval-assert-demo/) | Custom graders | @@ -169,7 +169,7 @@ Focused examples for specific AgentV capabilities. Find your use case below, the | [sdk-programmatic-api](sdk-programmatic-api/) | TypeScript SDK | | [suite-level-input](suite-level-input/) | Dataset & input | | [suite-level-input-files](suite-level-input-files/) | Dataset & input | -| [threshold-evaluator](threshold-evaluator/) | LLM grading | +| [threshold-grader](threshold-grader/) | LLM grading | | [tool-evaluation-plugins](tool-evaluation-plugins/) | Tool & agent evaluation | | [tool-trajectory-advanced](tool-trajectory-advanced/) | Tool & agent evaluation | | [tool-trajectory-simple](tool-trajectory-simple/) | Tool & agent evaluation | @@ -177,7 +177,7 @@ Focused examples for specific AgentV capabilities. Find your use case below, the | [trace-evaluation](trace-evaluation/) | Tool & agent evaluation | | [trial-output-consistency](trial-output-consistency/) | Benchmarking | | [trials](trials/) | Benchmarking | -| [weighted-evaluators](weighted-evaluators/) | LLM grading | +| [weighted-graders](weighted-graders/) | LLM grading | | [workspace-multi-repo](workspace-multi-repo/) | Workspace & targets | | [workspace-setup-script](workspace-setup-script/) | Workspace & targets | | [workspace-shared-config](workspace-shared-config/) | Workspace & targets | diff --git a/examples/features/agent-skills-evals/README.md b/examples/features/agent-skills-evals/README.md index 6ecb2481c..5469aad6a 100644 --- a/examples/features/agent-skills-evals/README.md +++ b/examples/features/agent-skills-evals/README.md @@ -95,7 +95,7 @@ bun apps/cli/src/cli.ts eval multi-provider-skill-trigger.EVAL.yaml \ --target copilot --targets ../.agentv/targets.yaml ``` -The `skill-trigger` evaluator automatically handles each provider's tool-call format: +The `skill-trigger` grader automatically handles each provider's tool-call format: | Provider | Detection method | |----------|-----------------| @@ -112,4 +112,4 @@ Using skill: Viewing ... ``` -The evaluator scans the entire conversation transcript (not just the first tool call), so a preamble meta-skill like `using-superpowers` firing before `csv-analyzer` still results in a pass. 
+The grader scans the entire conversation transcript (not just the first tool call), so a preamble meta-skill like `using-superpowers` firing before `csv-analyzer` still results in a pass. diff --git a/examples/features/agent-skills-evals/multi-provider-skill-trigger.EVAL.yaml b/examples/features/agent-skills-evals/multi-provider-skill-trigger.EVAL.yaml index 33d26c7bd..0ae04bb25 100644 --- a/examples/features/agent-skills-evals/multi-provider-skill-trigger.EVAL.yaml +++ b/examples/features/agent-skills-evals/multi-provider-skill-trigger.EVAL.yaml @@ -16,7 +16,7 @@ # agentv eval this-file.EVAL.yaml --target copilot --targets ../.agentv/targets.yaml # agentv eval this-file.EVAL.yaml --target codex --targets ../.agentv/targets.yaml # -# The evaluator automatically resolves the correct tool names for each +# The grader automatically resolves the correct tool names for each # provider. No provider-specific config needed in test cases. workspace: diff --git a/examples/features/basic/README.md b/examples/features/basic/README.md index 76220f42d..437ff57e4 100644 --- a/examples/features/basic/README.md +++ b/examples/features/basic/README.md @@ -8,7 +8,7 @@ Demonstrates core AgentV schema features with minimal setup. - File references for content - Conversation threading with multiple messages - Array content format (text + file references) -- Multiple evaluators per test case +- Multiple graders per test case ## Running diff --git a/examples/features/basic/evals/check_python_keywords.py b/examples/features/basic/evals/check_python_keywords.py index c14fd89f1..10a5658d2 100644 --- a/examples/features/basic/evals/check_python_keywords.py +++ b/examples/features/basic/evals/check_python_keywords.py @@ -1,6 +1,6 @@ #!/usr/bin/env python3 """ -Code evaluator script: Check for required Python keywords in generated code. +Code grader script: Check for required Python keywords in generated code. This script demonstrates how to write custom validation logic for eval cases. See ../../README.md for the complete I/O contract specification. @@ -95,7 +95,7 @@ def main(): # Return error result error_result = { "score": 0.0, - "assertions": [{"text": f"Evaluator error: {str(e)}", "passed": False}], + "assertions": [{"text": f"Grader error: {str(e)}", "passed": False}], } print(json.dumps(error_result, indent=2)) sys.exit(1) diff --git a/examples/features/basic/evals/dataset.eval.yaml b/examples/features/basic/evals/dataset.eval.yaml index ab9067a73..7309596c9 100644 --- a/examples/features/basic/evals/dataset.eval.yaml +++ b/examples/features/basic/evals/dataset.eval.yaml @@ -2,7 +2,7 @@ # Demonstrates schema features with real file references and minimal redundancy name: basic -description: Example showing basic features, conversation threading, multiple evaluators +description: Example showing basic features, conversation threading, multiple graders # File-level default target execution: @@ -56,8 +56,8 @@ tests: - Add input validation for edge cases # ========================================== - # Example 2: Advanced features - conversation_id, multiple evaluators - # Demonstrates: conversation threading, execution config, target override, evaluators + # Example 2: Advanced features - conversation_id, multiple graders + # Demonstrates: conversation threading, execution config, target override, graders # Note: Optimization (ACE, etc.) 
is configured separately in opts/*.yaml files # ========================================== - id: code-gen-python-comprehensive @@ -72,10 +72,10 @@ tests: execution: target: llm - # Multiple evaluators - supports both code-based and LLM graders + # Multiple graders - supports both code-based and LLM graders assertions: - name: keyword_check - type: code-grader # Code evaluators handle regex, keywords, linting, etc. + type: code-grader # Code graders handle regex, keywords, linting, etc. command: ["uv", "run", "check_python_keywords.py"] cwd: . # Working directory for script execution - name: code_correctness diff --git a/examples/features/batch-cli/README.md b/examples/features/batch-cli/README.md index eebaaecdb..b4d4ed975 100644 --- a/examples/features/batch-cli/README.md +++ b/examples/features/batch-cli/README.md @@ -10,7 +10,7 @@ This example demonstrates an **external batch runner** pattern for a (synthetic) 3. **Batch processing**: `batch-cli-runner.ts` reads the CSV and applies synthetic AML screening rules, writing **actual responses** as JSONL to a temporary file. Each JSONL record includes `output` with `tool_calls` for trace extraction. -4. **Evaluation**: AgentV compares the actual JSONL output against the ground truth in `evals/dataset.eval.yaml` using evaluators like `code_grader` and `tool_trajectory`. +4. **Evaluation**: AgentV compares the actual JSONL output against the ground truth in `evals/dataset.eval.yaml` using graders like `code_grader` and `tool_trajectory`. ## Batch error handling (missing JSONL id) @@ -45,7 +45,7 @@ The batch runner outputs JSONL records with `output` containing `tool_calls`: } ``` -The `tool_trajectory` evaluator extracts tool calls directly from `output[].tool_calls[]`. This is the primary format - no separate `trace` field is required. +The `tool_trajectory` grader extracts tool calls directly from `output[].tool_calls[]`. This is the primary format - no separate `trace` field is required. ## Files diff --git a/examples/features/batch-cli/graders/check-batch-cli-output.ts b/examples/features/batch-cli/graders/check-batch-cli-output.ts index 4c8787d48..fc50aa518 100644 --- a/examples/features/batch-cli/graders/check-batch-cli-output.ts +++ b/examples/features/batch-cli/graders/check-batch-cli-output.ts @@ -1,6 +1,6 @@ #!/usr/bin/env bun /** - * Batch CLI Output Evaluator - Code Grader + * Batch CLI Output Grader - Code Grader * * Validates that the batch CLI runner produces the expected decision * by comparing candidate output against expected_output or input. diff --git a/examples/features/benchmark-tooling/README.md b/examples/features/benchmark-tooling/README.md index dc336d10e..7660e1b55 100644 --- a/examples/features/benchmark-tooling/README.md +++ b/examples/features/benchmark-tooling/README.md @@ -244,7 +244,7 @@ bun examples/features/benchmark-tooling/scripts/significance-test.ts baseline.js ## benchmark-report -Generates a consolidated benchmark summary across models and metrics from result JSONL files. Produces per-target aggregates (mean, std dev, median, pass rate, 95% CI) and per-metric breakdowns when evaluator-level scores are present. +Generates a consolidated benchmark summary across models and metrics from result JSONL files. Produces per-target aggregates (mean, std dev, median, pass rate, 95% CI) and per-metric breakdowns when grader-level scores are present. 
### Usage @@ -274,7 +274,7 @@ bun examples/features/benchmark-tooling/scripts/benchmark-report.ts ./by-target/ **Per-Target Summary** includes for each model: record count, mean score, standard deviation, median, min, max, pass rate, and 95% confidence interval. -**Per-Target Metric Breakdown** appears when records contain evaluator-level `scores[]` arrays, showing mean and spread for each evaluator (e.g., accuracy, latency) per target. +**Per-Target Metric Breakdown** appears when records contain grader-level `scores[]` arrays, showing mean and spread for each grader (e.g., accuracy, latency) per target. **Machine-readable JSON** output (`--json`) returns a structured `BenchmarkReport` object with `summary`, `per_target`, `per_target_metrics`, and `overall` fields. diff --git a/examples/features/code-grader-sdk/README.md b/examples/features/code-grader-sdk/README.md index 0db2f52a8..5bdb91fbf 100644 --- a/examples/features/code-grader-sdk/README.md +++ b/examples/features/code-grader-sdk/README.md @@ -1,10 +1,10 @@ # Code Grader SDK Helper -Demonstrates how a TypeScript code_grader evaluator can use `defineCodeGrader` from `@agentv/eval` for a declarative, zero-boilerplate approach. +Demonstrates how a TypeScript code_grader script can use `defineCodeGrader` from `@agentv/eval` for a declarative, zero-boilerplate approach. ## Files -- `evals/dataset.eval.yaml`: Example test that uses a code_grader evaluator. +- `evals/dataset.eval.yaml`: Example test that uses a code_grader. - `scripts/verify-attachments.ts`: Code grader script using `defineCodeGrader`. - `evals/example.txt`, `evals/python.instructions.md`: Attachment fixtures. diff --git a/examples/features/code-grader-with-llm-calls/README.md b/examples/features/code-grader-with-llm-calls/README.md index e060a5753..8cefee89b 100644 --- a/examples/features/code-grader-with-llm-calls/README.md +++ b/examples/features/code-grader-with-llm-calls/README.md @@ -1,6 +1,6 @@ # Code Grader with LLM Calls -This example demonstrates how code grader evaluators can make LLM calls through a secure local proxy without needing direct API credentials. +This example demonstrates how code graders can make LLM calls through a secure local proxy without needing direct API credentials. This example implements two RAG metrics: - **Contextual Precision**: Evaluates whether relevant documents are ranked higher - **Contextual Recall**: Evaluates whether retrieval context covers the expected answer ## Contextual Precision This metric evaluates whether your retriever ranks relevant documents higher than irrelevant ones. ### How It Works -1. **Multiple Grader Calls**: For each retrieval node, the evaluator makes an LLM call to determine binary relevance (relevant=1, irrelevant=0) +1. **Multiple Grader Calls**: For each retrieval node, the grader makes an LLM call to determine binary relevance (relevant=1, irrelevant=0) 2. **Weighted Precision**: Calculates precision at each rank position, rewarding relevant nodes that appear earlier 3.
**Final Score**: Average of precision values at relevant positions @@ -145,14 +145,14 @@ The target proxy is designed with security in mind: - Binds to **loopback only** (127.0.0.1) - not accessible from network - Uses **bearer token authentication** - unique per execution - Enforces **max_calls limit** - prevents runaway costs -- **Auto-shutdown** - proxy terminates when evaluator completes +- **Auto-shutdown** - proxy terminates when grader completes ## Configuration -Enable target access by adding a `target` block to your `code_grader` evaluator: +Enable target access by adding a `target` block to your `code_grader` config: ```yaml -evaluators: +graders: - name: contextual_precision type: code-grader command: [bun, run, scripts/contextual-precision.ts] @@ -199,7 +199,7 @@ console.log(`Available targets: ${info.availableTargets.join(', ')}`); ## Target Override -Use different targets for different purposes within the same evaluator: +Use different targets for different purposes within the same grader: ```typescript // Use a coding agent for complex tasks diff --git a/examples/features/code-grader-with-llm-calls/evals/contextual-precision.eval.yaml b/examples/features/code-grader-with-llm-calls/evals/contextual-precision.eval.yaml index 8feff8abc..719f651af 100644 --- a/examples/features/code-grader-with-llm-calls/evals/contextual-precision.eval.yaml +++ b/examples/features/code-grader-with-llm-calls/evals/contextual-precision.eval.yaml @@ -1,4 +1,4 @@ -# Contextual Precision Evaluator +# Contextual Precision Grader # # Evaluates whether relevant retrieval nodes are ranked higher than irrelevant ones. # This metric rewards retrievers that surface relevant content first. diff --git a/examples/features/code-grader-with-llm-calls/evals/contextual-recall.eval.yaml b/examples/features/code-grader-with-llm-calls/evals/contextual-recall.eval.yaml index 52e406fdf..f6c713c0c 100644 --- a/examples/features/code-grader-with-llm-calls/evals/contextual-recall.eval.yaml +++ b/examples/features/code-grader-with-llm-calls/evals/contextual-recall.eval.yaml @@ -1,4 +1,4 @@ -# Contextual Recall Evaluator +# Contextual Recall Grader # # Evaluates whether retrieval context covers all statements in the expected answer. # This metric identifies gaps where retrieval failed to surface needed information. diff --git a/examples/features/code-grader-with-llm-calls/scripts/contextual-precision.ts b/examples/features/code-grader-with-llm-calls/scripts/contextual-precision.ts index 3ce4fc8d8..2c2e0ea91 100644 --- a/examples/features/code-grader-with-llm-calls/scripts/contextual-precision.ts +++ b/examples/features/code-grader-with-llm-calls/scripts/contextual-precision.ts @@ -1,6 +1,6 @@ #!/usr/bin/env bun /** - * Contextual Precision Evaluator + * Contextual Precision Grader * * Implements the Contextual Precision metric for RAG systems. * This metric evaluates whether relevant retrieval nodes are ranked higher @@ -12,7 +12,7 @@ * Retrieval context is extracted from expected_output.tool_calls output, * which represents the expected agent behavior (calling a retrieval tool). * - * Requires `target: { max_calls: N }` in the evaluator YAML config, + * Requires `target: { max_calls: N }` in the grader YAML config, * where N >= number of retrieval context nodes to evaluate.
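+ *
+ * Worked example (a sketch of the metric described in the README): for
+ * ranked nodes [relevant, irrelevant, relevant], precision at the relevant
+ * positions is 1/1 (rank 1) and 2/3 (rank 3), so the final score is
+ * (1 + 2/3) / 2 ≈ 0.83.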
*/ import { createTargetClient, defineCodeGrader } from '@agentv/eval'; @@ -70,7 +70,7 @@ export default defineCodeGrader(async (input) => { score: 0, assertions: [ { - text: 'Target not available - ensure `target` block is configured in evaluator YAML', + text: 'Target not available - ensure `target` block is configured in grader YAML', passed: false, }, ], @@ -94,7 +94,7 @@ Is this node relevant to answering the question? Respond with JSON only: "reasoning": "brief explanation" }`, systemPrompt: - 'You are a precise relevance evaluator for RAG systems. Determine if a retrieved node contains information useful for answering the given question. Output valid JSON only.', + 'You are a precise relevance grader for RAG systems. Determine if a retrieved node contains information useful for answering the given question. Output valid JSON only.', target: 'gemini-llm', // Override: use gemini-llm for relevance checks })); diff --git a/examples/features/code-grader-with-llm-calls/scripts/contextual-recall.ts b/examples/features/code-grader-with-llm-calls/scripts/contextual-recall.ts index 2742d1dc9..403efbc4b 100644 --- a/examples/features/code-grader-with-llm-calls/scripts/contextual-recall.ts +++ b/examples/features/code-grader-with-llm-calls/scripts/contextual-recall.ts @@ -1,6 +1,6 @@ #!/usr/bin/env bun /** - * Contextual Recall Evaluator + * Contextual Recall Grader * * Implements the Contextual Recall metric for RAG systems. * This metric evaluates whether the retrieval context contains enough relevant @@ -16,7 +16,7 @@ * Retrieval context is extracted from expected_output.tool_calls output, * which represents the expected agent behavior (calling a retrieval tool). * - * Requires `target: { max_calls: N }` in the evaluator YAML config, + * Requires `target: { max_calls: N }` in the grader YAML config, * where N >= 2 (one for statement extraction + one for attribution check). */ import { createTargetClient, defineCodeGrader } from '@agentv/eval'; @@ -92,7 +92,7 @@ export default defineCodeGrader(async (input) => { score: 0, assertions: [ { - text: 'Target not available - ensure `target` block is configured in evaluator YAML', + text: 'Target not available - ensure `target` block is configured in grader YAML', passed: false, }, ], diff --git a/examples/features/composite/README.md b/examples/features/composite/README.md index d8e6b38d0..f3ae487e5 100644 --- a/examples/features/composite/README.md +++ b/examples/features/composite/README.md @@ -1,11 +1,11 @@ -# Composite Evaluators +# Composite Graders -Demonstrates composite evaluator patterns for combining multiple evaluation criteria. +Demonstrates composite grader patterns for combining multiple evaluation criteria. 
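+
+A rough sketch of the shape (the `type: composite`, `graders`, and `aggregator` field names here are illustrative, as are the child prompt paths; `evals/dataset.eval.yaml` has the exact schema this example uses):
+
+```yaml
+assertions:
+  - name: description-quality
+    type: composite                 # illustrative type name
+    graders:                        # child graders, scored first
+      - name: conciseness
+        type: llm-grader
+        prompt: ../prompts/conciseness.md   # illustrative path
+      - name: detail
+        type: llm-grader
+        prompt: ../prompts/detail.md        # illustrative path
+    aggregator:                     # reconciles the child scores
+      type: llm-grader
+      prompt: ../prompts/conflict-resolution.md
+```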
## What This Shows -- Combining multiple evaluators in a single test case -- Weighted scoring across evaluators +- Combining multiple graders in a single test case +- Weighted scoring across graders - AND/OR logic patterns - Hierarchical evaluation strategies @@ -18,4 +18,4 @@ bun agentv eval examples/features/composite/evals/dataset.eval.yaml ## Key Files -- `evals/dataset.eval.yaml` - Test cases with composite evaluator patterns +- `evals/dataset.eval.yaml` - Test cases with composite grader patterns diff --git a/examples/features/composite/evals/dataset.eval.yaml b/examples/features/composite/evals/dataset.eval.yaml index f28cc5091..a8bfef655 100644 --- a/examples/features/composite/evals/dataset.eval.yaml +++ b/examples/features/composite/evals/dataset.eval.yaml @@ -1,4 +1,4 @@ -name: composite-evaluator-examples +name: composite-grader-examples # This example demonstrates the new CompositeEvaluator feature @@ -84,7 +84,7 @@ tests: type: llm-grader prompt: ../prompts/conflict-resolution.md - # Example 4: Nested Composite Evaluators + # Example 4: Nested Composite Graders - id: nested-composite input: - role: user diff --git a/examples/features/composite/prompts/conflict-resolution.md b/examples/features/composite/prompts/conflict-resolution.md index 8943471c3..7ccd1b331 100644 --- a/examples/features/composite/prompts/conflict-resolution.md +++ b/examples/features/composite/prompts/conflict-resolution.md @@ -1,7 +1,7 @@ -Review the child evaluator results below: +Review the child grader results below: {{EVALUATOR_RESULTS_JSON}} Since this is a product description, we need both conciseness AND detail. -If both evaluators score highly, give a high score. +If both graders score highly, give a high score. If they conflict (one high, one low), prefer detail slightly over conciseness. Provide a balanced final score and verdict. diff --git a/examples/features/composite/scripts/safety-gate-aggregator.js b/examples/features/composite/scripts/safety-gate-aggregator.js index 225fb974d..027f19b62 100644 --- a/examples/features/composite/scripts/safety-gate-aggregator.js +++ b/examples/features/composite/scripts/safety-gate-aggregator.js @@ -22,7 +22,7 @@ try { let verdict = 'fail'; const assertions = []; - // Helper: extract assertions from sub-evaluator results (supports both old and new format) + // Helper: extract assertions from sub-grader results (supports both old and new format) function extractAssertions(result) { if (Array.isArray(result.assertions)) return result.assertions; const out = []; diff --git a/examples/features/copilot-log-eval/.agentv/targets.yaml b/examples/features/copilot-log-eval/.agentv/targets.yaml index a899b44de..2062c3933 100644 --- a/examples/features/copilot-log-eval/.agentv/targets.yaml +++ b/examples/features/copilot-log-eval/.agentv/targets.yaml @@ -1,6 +1,6 @@ targets: # Passive transcript reader — reads Copilot CLI session transcripts from disk. - # Zero API cost. No grader_target needed for deterministic evaluators. + # Zero API cost. No grader_target needed for deterministic graders. 
# # Usage: # agentv eval evals/skill-trigger.EVAL.yaml --target copilot-log diff --git a/examples/features/copilot-log-eval/README.md b/examples/features/copilot-log-eval/README.md index 7d53385cb..6e890f154 100644 --- a/examples/features/copilot-log-eval/README.md +++ b/examples/features/copilot-log-eval/README.md @@ -1,9 +1,9 @@ # Copilot Log Evaluation Example Demonstrates the `copilot-log` provider reading Copilot CLI session transcripts -from disk with deterministic evaluators. **No LLM API key needed.** +from disk with deterministic graders. **No LLM API key needed.** -Evaluators used: +Graders used: - `skill-trigger` — checks whether a specific skill was invoked - `code-grader` — custom TypeScript grader inspecting the full `Message[]` with tool calls @@ -33,7 +33,7 @@ agentv eval evals/skill-trigger.EVAL.yaml --target copilot-log The `before_all` hook runs `allagents workspace init` to sync the agentv-dev plugin skills into the workspace. The `copilot-log` provider then auto-discovers -the latest session from `~/.copilot/session-state/` and runs all evaluators. +the latest session from `~/.copilot/session-state/` and runs all graders. ## How it works @@ -43,11 +43,11 @@ allagents workspace init (before_all hook) ~/.copilot/session-state/{uuid}/events.jsonl ↓ copilot-log provider (reads from disk) Message[] with tool calls - ├─ skill-trigger evaluator (deterministic) → pass/fail + ├─ skill-trigger grader (deterministic) → pass/fail └─ code-grader (graders/transcript-quality.ts) → pass/fail ``` -## Evaluators +## Graders ### skill-trigger Checks whether the `csv-analyzer` skill was (or was not) invoked. diff --git a/examples/features/copilot-log-eval/evals/skill-trigger.EVAL.yaml b/examples/features/copilot-log-eval/evals/skill-trigger.EVAL.yaml index 81f2ea673..14a080540 100644 --- a/examples/features/copilot-log-eval/evals/skill-trigger.EVAL.yaml +++ b/examples/features/copilot-log-eval/evals/skill-trigger.EVAL.yaml @@ -4,7 +4,7 @@ # 1. skill-trigger — deterministic check for skill invocation # 2. code-grader — custom TypeScript grader inspecting full Message[] # -# No LLM API key needed — all evaluators are deterministic. +# No LLM API key needed — all graders are deterministic. # # Prerequisites: # 1. Have at least one Copilot CLI session in ~/.copilot/session-state/ diff --git a/examples/features/default-evaluators/evals/dataset.eval.yaml b/examples/features/default-evaluators/evals/dataset.eval.yaml index 5e3dd1304..bb4080531 100644 --- a/examples/features/default-evaluators/evals/dataset.eval.yaml +++ b/examples/features/default-evaluators/evals/dataset.eval.yaml @@ -1,8 +1,8 @@ -# Default Evaluators Example +# Default Graders Example # Demonstrates root-level assert that apply to all tests -name: default-evaluators-example -description: Root-level evaluators that automatically apply to every test +name: default-graders-example +description: Root-level graders that automatically apply to every test execution: target: llm diff --git a/examples/features/deterministic-evaluators/README.md b/examples/features/deterministic-evaluators/README.md index 2f2dba44c..bbe0778ea 100644 --- a/examples/features/deterministic-evaluators/README.md +++ b/examples/features/deterministic-evaluators/README.md @@ -1,13 +1,13 @@ -# Deterministic Evaluators +# Deterministic Graders -Demonstrates how a single, parameterised `code_grader` script can replace a family of built-in assertion evaluators (contains, regex, JSON validation, etc.). 
+Demonstrates how a single, parameterised `code_grader` script can replace a family of built-in assertion graders (contains, regex, JSON validation, etc.). ## Why a Code Grader? -AgentV's design philosophy keeps the core minimal. Instead of adding `contains`, `regex`, `is-json` as built-in evaluator types, you write a small code grader and drive it with YAML `config`: +AgentV's design philosophy keeps the core minimal. Instead of adding `contains`, `regex`, `is-json` as built-in grader types, you write a small code grader and drive it with YAML `config`: ```yaml -evaluators: +graders: - name: has-keyword type: code-grader command: ["bun", "run", "../graders/assertions.ts"] @@ -47,7 +47,7 @@ bun run build ```bash # From examples/features bun agentv eval deterministic-evaluators/evals/dataset.eval.yaml --target ``` ## Standalone Test @@ -55,7 +55,7 @@ /index.jsonl [options] Options: - --evaluator Filter to a specific evaluator + --grader Filter to a specific grader --format csv Output as CSV instead of table ``` diff --git a/examples/features/document-extraction/evals/confusion-metrics.eval.yaml b/examples/features/document-extraction/evals/confusion-metrics.eval.yaml index b1f5be5e3..e723a24e5 100644 --- a/examples/features/document-extraction/evals/confusion-metrics.eval.yaml +++ b/examples/features/document-extraction/evals/confusion-metrics.eval.yaml @@ -1,7 +1,7 @@ # Confusion Metrics Dataset # # This dataset demonstrates aggregatable TP/TN/FP/FN metrics across multiple documents. -# All cases use the SAME header_confusion evaluator with the SAME fields, +# All cases use the SAME header_confusion grader with the SAME fields, # enabling cross-document aggregation with fractional precision/recall. # # Use case: Measuring field-level extraction accuracy across a document corpus. diff --git a/examples/features/document-extraction/evals/field-accuracy.eval.yaml b/examples/features/document-extraction/evals/field-accuracy.eval.yaml index 1e4a52710..ee3e7048e 100644 --- a/examples/features/document-extraction/evals/field-accuracy.eval.yaml +++ b/examples/features/document-extraction/evals/field-accuracy.eval.yaml @@ -1,6 +1,6 @@ # Field Accuracy Evaluation Dataset # -# This dataset demonstrates the built-in `field_accuracy` evaluator for per-test-case scoring. +# This dataset demonstrates the built-in `field_accuracy` grader for per-test-case scoring. # Use this pattern when you need simple pass/fail scoring per field. # # For aggregatable TP/TN/FP/FN metrics across documents, see confusion-metrics.yaml instead.
@@ -22,13 +22,13 @@ # invoice-005: ~1.000 (line items extracted correctly) # invoice-006: ~1.000 (greedy matching handles reordered line items) # -description: Field accuracy evaluator patterns (per-test-case scoring) +description: Field accuracy grader patterns (per-test-case scoring) execution: target: mock_extractor assertions: - # Primary evaluator: Correctness via field-level accuracy + # Primary grader: Correctness via field-level accuracy - name: invoice_field_accuracy type: field-accuracy fields: @@ -104,7 +104,7 @@ assertions: # weight: 1.5 # required: false - # Execution metrics evaluators (optional, require provider to report metrics) + # Execution metrics graders (optional, require provider to report metrics) # - name: performance_check # type: latency # threshold: 2000 # max allowed duration in milliseconds diff --git a/examples/features/document-extraction/fixtures/README.md b/examples/features/document-extraction/fixtures/README.md index e5f23e929..bb5b4d18e 100644 --- a/examples/features/document-extraction/fixtures/README.md +++ b/examples/features/document-extraction/fixtures/README.md @@ -1,6 +1,6 @@ # Document Extraction Test Fixtures -This directory contains JSON mock files representing extracted invoice data for testing the field_accuracy evaluator. +This directory contains JSON mock files representing extracted invoice data for testing the field_accuracy grader. ## Files @@ -12,7 +12,7 @@ This directory contains JSON mock files representing extracted invoice data for ## Intentional Variations -These fixtures contain realistic extraction variations to test the evaluator: +These fixtures contain realistic extraction variations to test the grader: - **invoice-002**: Preserves OCR-like formatting ("Acme - Shipping" with hyphen/spaces) - **invoice-003**: Decimal precision preserved (1889.5) to test ±$1 tolerance - **invoice-004**: Missing invoice_number field to test required field penalty @@ -22,7 +22,7 @@ These fixtures contain realistic extraction variations to test the evaluator: These JSON files simulate **already-extracted** invoice data, representing the output of an OCR/extraction system: - Readable and versionable in git - Fast to test and iterate -- Clear demonstration of evaluator features without PDF parsing complexity +- Clear demonstration of grader features without PDF parsing complexity - Focuses on the **evaluation** logic, not document processing ## Real-World Usage @@ -31,6 +31,6 @@ In production, you would: 1. Use actual PDF/image invoices as input 2. Run OCR/extraction tool (Azure Form Recognizer, Tesseract, vision models, etc.) 3. Extract structured JSON data (like these fixtures) -4. Evaluate extracted data against expected values using field_accuracy evaluator +4. Evaluate extracted data against expected values using field_accuracy grader The mock_extractor.ts script simulates this by simply reading these JSON files. diff --git a/examples/features/document-extraction/graders/fuzzy_match.ts b/examples/features/document-extraction/graders/fuzzy_match.ts index cf7e812b7..eff4d3074 100644 --- a/examples/features/document-extraction/graders/fuzzy_match.ts +++ b/examples/features/document-extraction/graders/fuzzy_match.ts @@ -3,12 +3,12 @@ * Fuzzy String Matching code_grader Example * * This script demonstrates how to implement fuzzy string matching as a code_grader - * evaluator. Use this approach for comparing extracted text that may have OCR errors, + * grader. 
Use this approach for comparing extracted text that may have OCR errors, * formatting variations, or minor typos. * * Usage in dataset.eval.yaml: * ```yaml - * evaluators: + * graders: * - name: vendor_name_fuzzy * type: code_grader * script: ["bun", "run", "../graders/fuzzy_match.ts"] diff --git a/examples/features/document-extraction/graders/header_confusion_metrics.ts b/examples/features/document-extraction/graders/header_confusion_metrics.ts index 66e813151..eb9878a6a 100644 --- a/examples/features/document-extraction/graders/header_confusion_metrics.ts +++ b/examples/features/document-extraction/graders/header_confusion_metrics.ts @@ -14,7 +14,7 @@ * * Usage in dataset.eval.yaml: * ```yaml - * evaluators: + * graders: * - name: header_confusion * type: code_grader * script: ["bun", "run", "../graders/header_confusion_metrics.ts"] diff --git a/examples/features/document-extraction/graders/line_item_matching.ts b/examples/features/document-extraction/graders/line_item_matching.ts index 6f9775c4f..751eef55e 100644 --- a/examples/features/document-extraction/graders/line_item_matching.ts +++ b/examples/features/document-extraction/graders/line_item_matching.ts @@ -13,7 +13,7 @@ * * Usage in dataset.eval.yaml: * ```yaml - * evaluators: + * graders: * - name: line_items_matched * type: code_grader * script: ["bun", "run", "../graders/line_item_matching.ts"] diff --git a/examples/features/document-extraction/graders/multi_field_fuzzy.ts b/examples/features/document-extraction/graders/multi_field_fuzzy.ts index 7148d935c..e6e788c2d 100644 --- a/examples/features/document-extraction/graders/multi_field_fuzzy.ts +++ b/examples/features/document-extraction/graders/multi_field_fuzzy.ts @@ -7,7 +7,7 @@ * * Usage in dataset.eval.yaml: * ```yaml - * evaluators: + * graders: * - name: party_names_fuzzy * type: code_grader * script: ["bun", "run", "../graders/multi_field_fuzzy.ts"] diff --git a/examples/features/document-extraction/mock_extractor.ts b/examples/features/document-extraction/mock_extractor.ts index d4a5bf0a4..ec1511e98 100644 --- a/examples/features/document-extraction/mock_extractor.ts +++ b/examples/features/document-extraction/mock_extractor.ts @@ -5,7 +5,7 @@ * Simulates a document extraction system that reads structured data from JSON fixtures. * In a real implementation, this would parse PDFs/images using OCR or vision models. * - * This mock simply reads pre-extracted JSON data to demonstrate the field_accuracy evaluator. + * This mock simply reads pre-extracted JSON data to demonstrate the field_accuracy grader. 
* * Usage: bun run mock_extractor.ts [output-file] */ diff --git a/examples/features/document-extraction/scripts/aggregate_metrics.ts b/examples/features/document-extraction/scripts/aggregate_metrics.ts index c20e52a44..59eafa70b 100644 --- a/examples/features/document-extraction/scripts/aggregate_metrics.ts +++ b/examples/features/document-extraction/scripts/aggregate_metrics.ts @@ -7,7 +7,7 @@ * * Usage: * bun run scripts/aggregate_metrics.ts .agentv/results/runs//index.jsonl - * bun run scripts/aggregate_metrics.ts .agentv/results/runs//index.jsonl --evaluator header_confusion + * bun run scripts/aggregate_metrics.ts .agentv/results/runs//index.jsonl --grader header_confusion * bun run scripts/aggregate_metrics.ts .agentv/results/runs//index.jsonl --format csv */ @@ -103,7 +103,7 @@ function extractMetricsFromResults( } } - // Process nested evaluator results + // Process nested grader results if (result.scores) { for (const child of result.scores) { processResult(child); @@ -235,13 +235,13 @@ async function main(): Promise { console.log(`Usage: bun run scripts/aggregate_metrics.ts [options] Options: - --evaluator Only aggregate metrics from evaluators with this name + --grader Only aggregate metrics from graders with this name --format Output format: table (default) or csv --help Show this help message Example: bun run scripts/aggregate_metrics.ts .agentv/results/eval-001.jsonl - bun run scripts/aggregate_metrics.ts .agentv/results/runs//index.jsonl --evaluator header_confusion --format csv + bun run scripts/aggregate_metrics.ts .agentv/results/runs//index.jsonl --grader header_confusion --format csv `); process.exit(0); } @@ -251,7 +251,7 @@ Example: let format: 'table' | 'csv' = 'table'; for (let i = 1; i < args.length; i++) { - if (args[i] === '--evaluator' && args[i + 1]) { + if (args[i] === '--grader' && args[i + 1]) { evaluatorFilter = args[++i]; } else if (args[i] === '--format' && args[i + 1]) { format = args[++i] as 'table' | 'csv'; diff --git a/examples/features/execution-metrics/evals/dataset.eval.yaml b/examples/features/execution-metrics/evals/dataset.eval.yaml index 4399f3af5..d31b58304 100644 --- a/examples/features/execution-metrics/evals/dataset.eval.yaml +++ b/examples/features/execution-metrics/evals/dataset.eval.yaml @@ -1,7 +1,7 @@ -# Execution Metrics Evaluator Demo -# Demonstrates the built-in execution_metrics evaluator for declarative threshold-based checks. +# Execution Metrics Grader Demo +# Demonstrates the built-in execution_metrics grader for declarative threshold-based checks. 
# -# The execution_metrics evaluator allows you to set limits on: +# The execution_metrics grader allows you to set limits on: # - max_tool_calls: Maximum number of tool invocations # - max_llm_calls: Maximum number of LLM calls (assistant messages) # - max_tokens: Maximum total tokens (input + output) @@ -16,7 +16,7 @@ # bun agentv eval examples/features/execution-metrics/evals/dataset.eval.yaml --dry-run name: execution-metrics -description: Demonstrates the built-in execution_metrics evaluator +description: Demonstrates the built-in execution_metrics grader # Mock agent that returns realistic execution metrics execution: @@ -67,7 +67,7 @@ tests: # ========================================== # Example 3: Research task with tool trajectory + metrics - # Combines multiple evaluator types + # Combines multiple grader types # ========================================== - id: research-with-metrics diff --git a/examples/features/execution-metrics/scripts/check-metrics-present.ts b/examples/features/execution-metrics/scripts/check-metrics-present.ts index 49f672ccc..1aacdea13 100644 --- a/examples/features/execution-metrics/scripts/check-metrics-present.ts +++ b/examples/features/execution-metrics/scripts/check-metrics-present.ts @@ -6,7 +6,7 @@ * This is a simple sanity check that metrics collection is working. * * Usage in eval YAML: - * evaluators: + * graders: * - name: metrics-present * type: code_grader * script: ["bun", "run", "../scripts/check-metrics-present.ts"] diff --git a/examples/features/file-changes/evals/dataset.eval.yaml b/examples/features/file-changes/evals/dataset.eval.yaml index 3d8db67e2..f2755e714 100644 --- a/examples/features/file-changes/evals/dataset.eval.yaml +++ b/examples/features/file-changes/evals/dataset.eval.yaml @@ -1,6 +1,6 @@ # File changes feature demonstration # Tests that AgentV captures workspace file changes (edits, creates, deletes) -# and passes them to evaluators via the file_changes field. +# and passes them to graders via the file_changes field. # # The workspace-template/ contains: hello.txt, config.json, obsolete.log, src/main.ts # The mock agent: edits hello.txt + config.json, creates src/utils.ts + tests/main.test.ts, diff --git a/examples/features/import-claude/README.md b/examples/features/import-claude/README.md index 11bbca308..134219561 100644 --- a/examples/features/import-claude/README.md +++ b/examples/features/import-claude/README.md @@ -1,9 +1,9 @@ # Import Claude — Offline Transcript Grading Demonstrates importing a Claude Code session transcript and grading it -offline with deterministic evaluators. **No LLM API key needed.** +offline with deterministic graders. **No LLM API key needed.** -Evaluators used: +Graders used: - `code-grader` — custom TypeScript grader inspecting the full `Message[]` with tool calls ## Setup @@ -49,7 +49,7 @@ The import pipeline: 4. Aggregates token usage (last cumulative value per LLM request) 5. Writes a clean `Message[]` JSONL for evaluation -## Evaluators +## Graders ### transcript-quality (code-grader) diff --git a/examples/features/latency-assertions/README.md b/examples/features/latency-assertions/README.md index 81a86b7ec..de6946bd4 100644 --- a/examples/features/latency-assertions/README.md +++ b/examples/features/latency-assertions/README.md @@ -1,17 +1,17 @@ # Per-Step Latency Assertions -This example demonstrates how to use the `max_duration_ms` field in `tool_trajectory` evaluators to validate per-tool-call timing budgets. 
+This example demonstrates how to use the `max_duration_ms` field in `tool_trajectory` graders to validate per-tool-call timing budgets. ## Overview -The `tool_trajectory` evaluator now supports optional latency assertions on individual tool calls. This allows you to catch performance regressions at a granular level rather than only checking total execution time. +The `tool_trajectory` grader now supports optional latency assertions on individual tool calls. This allows you to catch performance regressions at a granular level rather than only checking total execution time. ## Usage Add `max_duration_ms` to any expected tool item: ```yaml -evaluators: +graders: - name: perf-check type: tool-trajectory mode: in_order diff --git a/examples/features/latency-assertions/evals/dataset.eval.yaml b/examples/features/latency-assertions/evals/dataset.eval.yaml index 8f0a01c53..62e121339 100644 --- a/examples/features/latency-assertions/evals/dataset.eval.yaml +++ b/examples/features/latency-assertions/evals/dataset.eval.yaml @@ -1,7 +1,7 @@ # AgentV Per-Step Latency Assertions Demo -# Demonstrates latency assertions in the tool_trajectory evaluator +# Demonstrates latency assertions in the tool_trajectory grader # -# The tool_trajectory evaluator now supports optional `max_duration_ms` assertions +# The tool_trajectory grader now supports optional `max_duration_ms` assertions # on individual tool calls. This allows you to validate that tools complete within # timing budgets to catch performance regressions. # diff --git a/examples/features/latency-assertions/mock-latency-agent.ts b/examples/features/latency-assertions/mock-latency-agent.ts index 535e4d1b0..80c448997 100644 --- a/examples/features/latency-assertions/mock-latency-agent.ts +++ b/examples/features/latency-assertions/mock-latency-agent.ts @@ -3,7 +3,7 @@ * Mock Agent CLI for latency assertion demos. * * Returns tool calls with duration_ms to demonstrate - * per-step latency validation in tool_trajectory evaluator. + * per-step latency validation in tool_trajectory grader. * * Usage: * bun run mock-latency-agent.ts --prompt "..." --output output.json diff --git a/examples/features/multi-turn-conversation-live/README.md b/examples/features/multi-turn-conversation-live/README.md index db0a9bd28..904de87a8 100644 --- a/examples/features/multi-turn-conversation-live/README.md +++ b/examples/features/multi-turn-conversation-live/README.md @@ -6,7 +6,7 @@ This example demonstrates **live turn-by-turn conversation evaluation** where th - `mode: conversation` — enables live turn-by-turn evaluation - `turns[]` — each entry is a user message that generates an LLM call -- Per-turn `assertions` — string shorthand (rubric) and structured evaluators +- Per-turn `assertions` — string shorthand (rubric) and structured graders - `aggregation: mean | min | max` — how turn scores combine - `on_turn_failure: stop | continue` — behavior on assertion failure - Top-level `assertions` — conversation-level grading after all turns diff --git a/examples/features/multi-turn-conversation/README.md b/examples/features/multi-turn-conversation/README.md index ffd39c5d6..773b3b8f4 100644 --- a/examples/features/multi-turn-conversation/README.md +++ b/examples/features/multi-turn-conversation/README.md @@ -8,7 +8,7 @@ Demonstrates evaluating multi-turn conversation quality using composable 1. Multi-turn input with 4+ user/assistant turns where context retention matters 2. Conversation-aware grader prompts that receive the full `{{ input }}` message array 3. 
Per-turn score breakdown via structured `details` -4. Composability: multiple `llm-grader` evaluators combined with deterministic assertions +4. Composability: multiple `llm-grader` checks combined with deterministic assertions ## Grader dimensions @@ -24,7 +24,7 @@ Demonstrates evaluating multi-turn conversation quality using composable bun apps/cli/src/cli.ts eval examples/features/multi-turn-conversation/evals/dataset.eval.yaml ``` -## Creating your own conversation evaluator +## Creating your own conversation grader 1. Create a markdown file in `graders/` 2. Use `{{ input }}` to receive the full conversation message array with roles diff --git a/examples/features/nlp-metrics/README.md b/examples/features/nlp-metrics/README.md index cc33e6176..89ed811c8 100644 --- a/examples/features/nlp-metrics/README.md +++ b/examples/features/nlp-metrics/README.md @@ -1,6 +1,6 @@ # NLP Metrics Examples -Demonstrates how to implement common NLP evaluation metrics as AgentV `code_grader` evaluators — no external dependencies required. +Demonstrates how to implement common NLP evaluation metrics as AgentV `code_grader` plugins — no external dependencies required. ## Graders @@ -46,4 +46,4 @@ Each grader receives the candidate answer and reference text via the `defineCode ## Combining Metrics -The `multi-metric-evaluation` test in `dataset.eval.yaml` shows how to attach multiple evaluators to a single test case. AgentV runs each grader independently and reports all scores. +The `multi-metric-evaluation` test in `dataset.eval.yaml` shows how to attach multiple graders to a single test case. AgentV runs each grader independently and reports all scores. diff --git a/examples/features/nlp-metrics/evals/dataset.eval.yaml b/examples/features/nlp-metrics/evals/dataset.eval.yaml index f75b4d511..7962277ec 100644 --- a/examples/features/nlp-metrics/evals/dataset.eval.yaml +++ b/examples/features/nlp-metrics/evals/dataset.eval.yaml @@ -1,9 +1,9 @@ -# NLP Metrics Evaluator Examples +# NLP Metrics Grader Examples # Demonstrates ROUGE, BLEU, cosine similarity, and Levenshtein distance -# as code_grader evaluators — no external dependencies required. +# as code_grader plugins — no external dependencies required. name: nlp-metrics -description: NLP text-quality metrics using code_grader evaluators +description: NLP text-quality metrics using code_grader plugins execution: target: llm diff --git a/examples/features/prompt-template-sdk/README.md b/examples/features/prompt-template-sdk/README.md index e240a6921..c1c1862cc 100644 --- a/examples/features/prompt-template-sdk/README.md +++ b/examples/features/prompt-template-sdk/README.md @@ -52,6 +52,6 @@ prompt-template-sdk/ evals/ dataset.eval.yaml # Tests using TypeScript prompt prompts/ - custom-evaluator.ts # TypeScript prompt template + custom-grader.ts # TypeScript prompt template README.md ```
description: Demonstrates TypeScript prompt templates for custom LLM grader prompts @@ -27,7 +27,7 @@ tests: type: llm-grader # Executable prompt template using explicit script array (matches code_grader pattern) prompt: - command: [bun, run, ../prompts/custom-evaluator.ts] + command: [bun, run, ../prompts/custom-grader.ts] - id: prompt-template-with-config criteria: The CLI explains async/await correctly. @@ -47,7 +47,7 @@ tests: type: llm-grader # Executable prompt template with config prompt: - command: [bun, run, ../prompts/custom-evaluator.ts] + command: [bun, run, ../prompts/custom-grader.ts] config: rubric: |- - Must mention Promises diff --git a/examples/features/rubric/README.md b/examples/features/rubric/README.md index 06ae65845..bddcaa45a 100644 --- a/examples/features/rubric/README.md +++ b/examples/features/rubric/README.md @@ -1,4 +1,4 @@ -# Rubric Evaluator Example +# Rubric Grader Example Demonstrates rubric-based evaluation with weights, required flags, and auto-generation. diff --git a/examples/features/rubric/evals/check_syntax.py b/examples/features/rubric/evals/check_syntax.py index 1822ffe65..c213c197e 100644 --- a/examples/features/rubric/evals/check_syntax.py +++ b/examples/features/rubric/evals/check_syntax.py @@ -67,7 +67,7 @@ def main(): # System error result = { "score": 0.0, - "assertions": [{"text": f"Evaluator error: {str(e)}", "passed": False}], + "assertions": [{"text": f"Grader error: {str(e)}", "passed": False}], } print(json.dumps(result)) diff --git a/examples/features/rubric/evals/dataset.eval.yaml b/examples/features/rubric/evals/dataset.eval.yaml index bcb629cd7..0f1c4d577 100644 --- a/examples/features/rubric/evals/dataset.eval.yaml +++ b/examples/features/rubric/evals/dataset.eval.yaml @@ -1,8 +1,8 @@ -# AgentV Rubric Evaluator Example +# AgentV Rubric Grader Example # Demonstrates the rubric-based evaluation feature using type: rubrics under assert name: rubric -description: "Example showing rubric evaluator - string shorthand and type: rubrics" +description: "Example showing rubric grader - string shorthand and type: rubrics" execution: target: llm @@ -10,7 +10,7 @@ execution: tests: # ========================================== # Example 1: Simple string rubrics - # Demonstrates: string shorthand in assert (strings default to rubrics evaluator) + # Demonstrates: string shorthand in assert (strings default to rubrics grader) # ========================================== - id: code-explanation-simple @@ -98,8 +98,8 @@ tests: required: false # ========================================== - # Example 3: Multiple evaluators with rubric - # Demonstrates: combining rubric evaluator with other evaluators + # Example 3: Multiple graders with rubric + # Demonstrates: combining rubric grader with other graders # ========================================== - id: code-quality-multi-eval # Baseline note: candidates without type hints/edge handling often score lower (~0.75). @@ -134,7 +134,7 @@ tests: return bool(re.match(pattern, email)) assertions: - # Rubric evaluator for semantic checks + # Rubric grader for semantic checks - type: rubrics criteria: - Uses regular expressions for email validation @@ -142,7 +142,7 @@ tests: - Has docstring documentation - Handles edge cases (None, empty string) - # Additional code evaluator for syntax checking + # Additional code grader for syntax checking - name: python_syntax type: code-grader command: ["uv", "run", "python", "check_syntax.py"] @@ -175,7 +175,7 @@ tests: Extreme weather increases. 
Scientists call for urgent carbon emission cuts and renewable energy adoption. - # No rubrics defined - will use default llm_grader evaluator + # No rubrics defined - will use the default llm_grader # Add assertions directly if you want deterministic checks or explicit rubrics # ========================================== diff --git a/examples/features/threshold-evaluator/evals/dataset.eval.yaml b/examples/features/threshold-evaluator/evals/dataset.eval.yaml index 065064fa0..a8adc1cf2 100644 --- a/examples/features/threshold-evaluator/evals/dataset.eval.yaml +++ b/examples/features/threshold-evaluator/evals/dataset.eval.yaml @@ -1,7 +1,7 @@ -name: threshold-evaluator-example -description: Demonstrates the threshold aggregator — pass if N% of child evaluators pass +name: threshold-grader-example +description: Demonstrates the threshold aggregator — pass if N% of child graders pass -# Demonstrates the threshold aggregator: pass if N% of child evaluators pass. +# Demonstrates the threshold aggregator: pass if N% of child graders pass. # Borderline verdicts count as passing (lenient). execution: diff --git a/examples/features/tool-evaluation-plugins/README.md b/examples/features/tool-evaluation-plugins/README.md index 16d2f9bb8..5107e96cd 100644 --- a/examples/features/tool-evaluation-plugins/README.md +++ b/examples/features/tool-evaluation-plugins/README.md @@ -13,7 +13,7 @@ Computes precision, recall, and F1 by comparing expected tool names against actu - **False positive**: unexpected tool was called ```yaml -evaluators: +graders: - name: tool-f1 type: code-grader command: ["bun", "run", "../graders/tool-call-f1.ts"] @@ -25,7 +25,7 @@ Extends the name-only grader by also validating tool arguments. A call is a hit only if both the name matches AND the required arguments are present (subset match). ```yaml -evaluators: +graders: - name: tool-args-f1 type: code-grader command: ["bun", "run", "../graders/tool-args-f1.ts"] diff --git a/examples/features/tool-evaluation-plugins/evals/dataset.eval.yaml b/examples/features/tool-evaluation-plugins/evals/dataset.eval.yaml index 0413df377..7d3396091 100644 --- a/examples/features/tool-evaluation-plugins/evals/dataset.eval.yaml +++ b/examples/features/tool-evaluation-plugins/evals/dataset.eval.yaml @@ -1,7 +1,7 @@ # Tool-Call F1 Scoring Example # # Demonstrates using code_grader plugins to compute F1 scores over tool calls. -# The graders compare expected tools (from evaluator config) against actual +# The graders compare expected tools (from grader config) against actual # tool calls in the agent's output messages. # # Run: diff --git a/examples/features/tool-evaluation-plugins/graders/tool-args-f1.ts b/examples/features/tool-evaluation-plugins/graders/tool-args-f1.ts index 998b89706..b6f03cedb 100644 --- a/examples/features/tool-evaluation-plugins/graders/tool-args-f1.ts +++ b/examples/features/tool-evaluation-plugins/graders/tool-args-f1.ts @@ -6,14 +6,14 @@ * A tool call is a "hit" only if both the tool name matches AND the * required arguments are present with expected values.
* - * Configuration (via evaluator config in YAML): + * Configuration (via grader config in YAML): * expected_tools: * - tool: "search" * args: { query: "weather tokyo" } # required args (subset match) * - tool: "fetch" # no args check — name-only match * * Usage in eval YAML: - * evaluators: + * graders: * - name: tool-args-f1 * type: code_grader * script: ["bun", "run", "../graders/tool-args-f1.ts"] diff --git a/examples/features/tool-evaluation-plugins/graders/tool-call-f1.ts b/examples/features/tool-evaluation-plugins/graders/tool-call-f1.ts index fbfa5a5c0..7a7b7e44e 100644 --- a/examples/features/tool-evaluation-plugins/graders/tool-call-f1.ts +++ b/examples/features/tool-evaluation-plugins/graders/tool-call-f1.ts @@ -5,7 +5,7 @@ * Computes precision, recall, and F1 score by comparing expected tool calls * against actual tool calls from the agent's output messages. * - * Configuration (via evaluator config in YAML): + * Configuration (via grader config in YAML): * expected_tools: string[] — list of tool names the agent should call * * Why this is a plugin (not built-in): @@ -14,7 +14,7 @@ * - Easy to extend with argument matching (see tool-args-f1.ts) * * Usage in eval YAML: - * evaluators: + * graders: * - name: tool-f1 * type: code_grader * script: ["bun", "run", "../graders/tool-call-f1.ts"] @@ -43,7 +43,7 @@ export default defineCodeGrader(({ output, config, ...rest }) => { score: 0, assertions: [ { - text: 'No expected_tools configured — set expected_tools in evaluator config', + text: 'No expected_tools configured — set expected_tools in grader config', passed: false, }, ], diff --git a/examples/features/tool-trajectory-advanced/evals/trace-file-demo.eval.yaml b/examples/features/tool-trajectory-advanced/evals/trace-file-demo.eval.yaml index a978cee75..4a9793fc4 100644 --- a/examples/features/tool-trajectory-advanced/evals/trace-file-demo.eval.yaml +++ b/examples/features/tool-trajectory-advanced/evals/trace-file-demo.eval.yaml @@ -2,7 +2,7 @@ # Demonstrates how to evaluate a pre-existing static trace file. # # This is useful when you have captured traces from production or other systems -# and want to run AgentV evaluators against them offline. +# and want to run AgentV graders against them offline. # # Setup: # 1. Create examples/features/.env with: diff --git a/examples/features/tool-trajectory-simple/evals/dataset.eval.yaml b/examples/features/tool-trajectory-simple/evals/dataset.eval.yaml index fee5661cc..8c547353d 100644 --- a/examples/features/tool-trajectory-simple/evals/dataset.eval.yaml +++ b/examples/features/tool-trajectory-simple/evals/dataset.eval.yaml @@ -1,7 +1,7 @@ -# AgentV Tool Trajectory Evaluator Demo -# Demonstrates trace events and tool_trajectory evaluator for agent execution tracking +# AgentV Tool Trajectory Grader Demo +# Demonstrates trace events and tool_trajectory grader for agent execution tracking # -# The tool_trajectory evaluator validates that an agent used the expected tools +# The tool_trajectory grader validates that an agent used the expected tools # during execution. It works with trace data returned by agent providers. 
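+#
+# A minimal assertion sketch (the expected-item key names here are
+# illustrative; the examples below in this file show the exact shape):
+#
+#   assertions:
+#     - name: used-search-then-fetch
+#       type: tool-trajectory
+#       mode: in_order
+#       tool_calls:
+#         - tool: search
+#         - tool: fetch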
# # Five trajectory matching modes are available: @@ -11,7 +11,7 @@ # - superset: Every expected tool must be found in actual (extras OK, greedy matching) # - subset: Every actual call must be in the allowed set (expected items reusable) # -# Argument matching modes (per-item `args_match` or evaluator-level `args_match`): +# Argument matching modes (per-item `args_match` or grader-level `args_match`): # - exact: Bidirectional deep equality, no extra keys (default) # - superset: Actual args must contain all expected keys (extras OK) # - subset: Actual args must be a subset of expected (no unexpected keys) @@ -20,7 +20,7 @@ # # Legacy shorthand: # - args: any - validate tool name only, ignore arguments -# Note: For pattern/regex matching, use a code_grader evaluator instead. +# Note: For pattern/regex matching, use a code_grader instead. # # This demo uses a CLI provider (mock-agent.ts) that simulates an agent with tool usage. # The mock agent generates different traces based on the prompt content. @@ -31,7 +31,7 @@ # 2. Run: cd examples/features && npx agentv eval tool-trajectory-simple/evals/dataset.eval.yaml --target mock_agent name: tool-trajectory-simple -description: Tool trajectory evaluator examples for agent execution validation +description: Tool trajectory grader examples for agent execution validation # Use mock_agent CLI target that returns trace data execution: diff --git a/examples/features/tool-trajectory-simple/mock-agent.ts b/examples/features/tool-trajectory-simple/mock-agent.ts index d35ccc65a..3124dbf6b 100644 --- a/examples/features/tool-trajectory-simple/mock-agent.ts +++ b/examples/features/tool-trajectory-simple/mock-agent.ts @@ -1,6 +1,6 @@ #!/usr/bin/env bun /** - * Mock Agent CLI for tool_trajectory evaluator demos. + * Mock Agent CLI for tool_trajectory grader demos. * * This CLI simulates an agent that uses tools and returns trace data. * It demonstrates how real agent providers (codex, vscode) would return @@ -42,7 +42,7 @@ function createToolCall(name: string, input: unknown, id?: string): ToolCall { /** * Generate response based on the prompt content. - * Different prompts trigger different tool sequences to demonstrate various evaluator modes. + * Different prompts trigger different tool sequences to demonstrate various grader modes.
*/ function generateResponse(prompt: string): AgentResponse { const lowerPrompt = prompt.toLowerCase(); diff --git a/examples/features/trace-evaluation/README.md b/examples/features/trace-evaluation/README.md index 2a3a8dc78..4a054a645 100644 --- a/examples/features/trace-evaluation/README.md +++ b/examples/features/trace-evaluation/README.md @@ -39,10 +39,10 @@ bun agentv eval examples/features/trace-evaluation/evals/dataset.eval.yaml --dry ## Patterns ### Threshold validation -Pass configurable limits via `config` in the YAML evaluator block: +Pass configurable limits via `config` in the YAML grader block: ```yaml -evaluators: +graders: - name: span-count type: code-grader command: ["bun", "run", "../graders/span-count.ts"] @@ -55,7 +55,7 @@ evaluators: Check for zero errors and block forbidden tools: ```yaml -evaluators: +graders: - name: error-check type: code-grader command: ["bun", "run", "../graders/error-spans.ts"] @@ -69,7 +69,7 @@ evaluators: Ensure no individual step or total execution exceeds time budgets: ```yaml -evaluators: +graders: - name: duration-check type: code-grader command: ["bun", "run", "../graders/span-duration.ts"] diff --git a/examples/features/trial-output-consistency/evals/dataset.eval.yaml b/examples/features/trial-output-consistency/evals/dataset.eval.yaml index dbd467972..1e381938f 100644 --- a/examples/features/trial-output-consistency/evals/dataset.eval.yaml +++ b/examples/features/trial-output-consistency/evals/dataset.eval.yaml @@ -1,4 +1,4 @@ -# Trial Output Consistency Evaluator +# Trial Output Consistency Grader # Measures how consistent an agent's outputs are across repeated trials # using pairwise cosine similarity (embedding-based or token-overlap fallback). # diff --git a/examples/features/weighted-evaluators/README.md b/examples/features/weighted-evaluators/README.md index bef525b1d..ce0d14007 100644 --- a/examples/features/weighted-evaluators/README.md +++ b/examples/features/weighted-evaluators/README.md @@ -1,10 +1,10 @@ -# Weighted Evaluators +# Weighted Graders -Demonstrates weighted evaluator configurations for prioritizing different evaluation criteria. +Demonstrates weighted grader configurations for prioritizing different evaluation criteria. 
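+
+The core pattern is a per-assertion `weight` (a sketch using this example's own prompts; the 2.0 is illustrative):
+
+```yaml
+assertions:
+  - name: accuracy
+    type: llm-grader
+    prompt: ../prompts/accuracy-check.md
+    weight: 2.0   # counts twice as much in the aggregate score
+  - name: style
+    type: llm-grader
+    prompt: ../prompts/style-evaluation.md
+    weight: 1.0
+  - name: experimental-metric
+    type: llm-grader
+    prompt: ../prompts/experimental-check.md
+    weight: 0     # recorded, but excluded from aggregation
+```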
## What This Shows -- Assigning weights to evaluators +- Assigning weights to graders - Calculating weighted scores - Prioritizing critical vs optional criteria - Balanced vs unbalanced weighting schemes @@ -13,9 +13,9 @@ Demonstrates weighted evaluator configurations for prioritizing different evalua ```bash # From repository root bun agentv eval examples/features/weighted-evaluators/evals/dataset.eval.yaml ``` ## Key Files -- `evals/dataset.eval.yaml` - Test cases with weighted evaluator configurations +- `evals/dataset.eval.yaml` - Test cases with weighted grader configurations diff --git a/examples/features/weighted-evaluators/evals/dataset.eval.yaml b/examples/features/weighted-evaluators/evals/dataset.eval.yaml index dd2f8dfbf..0787b9301 100644 --- a/examples/features/weighted-evaluators/evals/dataset.eval.yaml +++ b/examples/features/weighted-evaluators/evals/dataset.eval.yaml @@ -1,13 +1,13 @@ -name: weighted-evaluators-examples +name: weighted-graders-examples -# This example demonstrates per-evaluator weights for top-level aggregation +# This example demonstrates per-grader weights for top-level aggregation execution: target: llm tests: - # Example 1: Different weights for multiple evaluators - - id: weighted-multi-evaluator + # Example 1: Different weights for multiple graders + - id: weighted-multi-grader input: - role: user content: "Explain the concept of neural networks." @@ -34,8 +34,8 @@ tests: prompt: ../prompts/style-evaluation.md weight: 1.0 - # Example 2: Using weight 0 to exclude an evaluator - - id: experimental-evaluator-disabled + # Example 2: Using weight 0 to exclude a grader + - id: experimental-grader-disabled input: - role: user content: "What is reinforcement learning?" @@ -50,14 +50,14 @@ type: llm-grader prompt: ../prompts/accuracy-check.md weight: 1.0 - # Experimental evaluator - excluded from aggregation with weight 0 + # Experimental grader - excluded from aggregation with weight 0 # Useful for collecting data without affecting the score - name: experimental-metric type: llm-grader prompt: ../prompts/experimental-check.md weight: 0 - # Example 3: Default weights (all evaluators weighted equally) + # Example 3: Default weights (all graders weighted equally) - id: equal-weights-default input: - role: user diff --git a/examples/features/weighted-evaluators/prompts/experimental-check.md b/examples/features/weighted-evaluators/prompts/experimental-check.md index 1ff680238..499051059 100644 --- a/examples/features/weighted-evaluators/prompts/experimental-check.md +++ b/examples/features/weighted-evaluators/prompts/experimental-check.md @@ -1,9 +1,9 @@ # Experimental Metric -An experimental evaluator for collecting additional metrics without affecting scores. +An experimental grader for collecting additional metrics without affecting scores. ## Task -This is an experimental evaluator used to test new evaluation criteria. Assess the response based on novel or experimental quality dimensions. +This is an experimental grader used to test new evaluation criteria. Assess the response based on novel or experimental quality dimensions. ## Input - Question: {{ input }} @@ -16,7 +16,7 @@ Return a JSON object with: - `reasoning`: Experimental observations ## Note -This evaluator has weight 0 and does not affect the final score, but its results are collected for analysis. +This grader has weight 0 and does not affect the final score, but its results are collected for analysis.
## Example ```json diff --git a/examples/showcase/README.md b/examples/showcase/README.md index 01d680ac3..a75567f26 100644 --- a/examples/showcase/README.md +++ b/examples/showcase/README.md @@ -27,7 +27,7 @@ End-to-end real-world evaluation scenarios. Each example is runnable and demonst | Example | Description | |---------|-------------| -| [tool-evaluation-plugins](tool-evaluation-plugins/) | Tool selection correctness, efficiency scoring, and pairwise comparison as code-grader plugins — includes a decision table for when to use plugins vs the built-in `tool_trajectory` evaluator | +| [tool-evaluation-plugins](tool-evaluation-plugins/) | Tool selection correctness, efficiency scoring, and pairwise comparison as code-grader plugins — includes a decision table for when to use plugins vs the built-in `tool_trajectory` grader | --- @@ -39,11 +39,11 @@ End-to-end real-world evaluation scenarios. Each example is runnable and demonst --- -### Verify your evaluators are reliable +### Verify your graders are reliable | Example | Description | |---------|-------------| -| [evaluator-conformance](evaluator-conformance/) | Meta-harness that checks an evaluator for output compatibility and verdict stability, reporting flip rate, mean/variance, and bound violations across repeated runs | +| [evaluator-conformance](evaluator-conformance/) | Meta-harness that checks a grader for output compatibility and verdict stability, reporting flip rate, mean/variance, and bound violations across repeated runs | --- @@ -53,7 +53,7 @@ End-to-end real-world evaluation scenarios. Each example is runnable and demonst |---------|----------| | [cross-repo-sync](cross-repo-sync/) | Code agents & multi-repo workflows | | [cw-incident-triage](cw-incident-triage/) | Classification tasks | -| [evaluator-conformance](evaluator-conformance/) | Evaluator reliability | +| [evaluator-conformance](evaluator-conformance/) | Grader reliability | | [export-screening](export-screening/) | Classification tasks | | [multi-model-benchmark](multi-model-benchmark/) | Weighted LLM panel | | [offline-grader-benchmark](offline-grader-benchmark/) | Weighted LLM panel | diff --git a/examples/showcase/cross-repo-sync/evals/dataset.eval.yaml b/examples/showcase/cross-repo-sync/evals/dataset.eval.yaml index 7177de043..4180c06d5 100644 --- a/examples/showcase/cross-repo-sync/evals/dataset.eval.yaml +++ b/examples/showcase/cross-repo-sync/evals/dataset.eval.yaml @@ -25,7 +25,7 @@ tests: ground_truth: ../evals/ground-truth/eval-spec-v2.diff criteria: >- Update agentevals spec to reflect eval spec v2: add contains/regex/is_json/equals - assert types, required gates for all evaluators, tests-as-string-path. + assert types, required gates for all graders, tests-as-string-path.
input: - role: user content: | @@ -37,7 +37,7 @@ type: code-grader command: ["bash", "../scripts/run-ts.sh", "../scripts/validate-sync.ts"] expected_files_modified: - agentevals/docs/src/content/docs/specification/evaluators.mdx + - agentevals/docs/src/content/docs/specification/graders.mdx - agentevals/docs/src/content/docs/specification/eval-format.mdx expected_keywords: [contains, regex, is_json, equals, required, assert] diff --git a/examples/showcase/cross-repo-sync/workspace-template/AGENTS.md b/examples/showcase/cross-repo-sync/workspace-template/AGENTS.md index b02158d8e..fbf79dc69 100644 --- a/examples/showcase/cross-repo-sync/workspace-template/AGENTS.md +++ b/examples/showcase/cross-repo-sync/workspace-template/AGENTS.md @@ -7,7 +7,7 @@ Open standard spec + Starlight docs site at agentevals.io. Must reflect agentv's (EntityProcess/agentv) capabilities. Key docs paths: -- `docs/src/content/docs/specification/evaluators.mdx` +- `docs/src/content/docs/specification/graders.mdx` - `docs/src/content/docs/specification/eval-format.mdx` - `docs/src/content/docs/specification/evalcase-schema.mdx` - `docs/src/content/docs/patterns/` @@ -22,7 +22,7 @@ Key source paths: - `packages/core/src/evaluation/yaml-parser.ts` ## Sync Rules -- agentv evaluator changes → update `agentevals/docs/src/content/docs/specification/evaluators.mdx` +- agentv grader changes → update `agentevals/docs/src/content/docs/specification/graders.mdx` - agentv schema changes → update `agentevals/docs/src/content/docs/specification/eval-format.mdx` and `evalcase-schema.mdx` - New patterns → update `agentevals/docs/src/content/docs/patterns/` - Preserve existing Starlight/MDX formatting conventions diff --git a/examples/showcase/cross-repo-sync/workspace-template/skills/cross-repo-sync.md b/examples/showcase/cross-repo-sync/workspace-template/skills/cross-repo-sync.md index c4c278426..420e2e105 100644 --- a/examples/showcase/cross-repo-sync/workspace-template/skills/cross-repo-sync.md +++ b/examples/showcase/cross-repo-sync/workspace-template/skills/cross-repo-sync.md @@ -24,8 +24,8 @@ Synchronize agentevals spec docs when agentv implementation changes. - Replace with the new name in prose, code examples, and schema definitions - Update any YAML/JSON examples that show the field -### New evaluator types -- Add to the evaluators list in `evaluators.mdx` +### New grader types +- Add to the graders list in `graders.mdx` - Add configuration schema to `eval-format.mdx` - Include usage example diff --git a/examples/showcase/cw-incident-triage/evals/validate_output.py b/examples/showcase/cw-incident-triage/evals/validate_output.py index 718f9ca20..886a3720a 100644 --- a/examples/showcase/cw-incident-triage/evals/validate_output.py +++ b/examples/showcase/cw-incident-triage/evals/validate_output.py @@ -2,7 +2,7 @@ """ JSON Format Validator for AgentV Validates that the candidate answer is strictly valid JSON with required keys. -Returns score 0.0 if not valid JSON, otherwise passes to next evaluator. +Returns score 0.0 if not valid JSON, otherwise passes to next grader.
""" import json @@ -68,7 +68,7 @@ def validate_json_format(candidate_answer: str, required_keys: list[str]) -> dic def main(): - """Main entry point for AgentV code evaluator.""" + """Main entry point for AgentV code grader.""" # AgentV passes eval data via stdin as JSON try: eval_data = json.load(sys.stdin) diff --git a/examples/showcase/evaluator-conformance/EVAL.yaml b/examples/showcase/evaluator-conformance/EVAL.yaml index 54c6d9ed7..c2827bd21 100644 --- a/examples/showcase/evaluator-conformance/EVAL.yaml +++ b/examples/showcase/evaluator-conformance/EVAL.yaml @@ -1,13 +1,13 @@ -# Evaluator Conformance Demo -# Demonstrates using the keyword-grader evaluator in a standard AgentV eval. +# Grader Conformance Demo +# Demonstrates using the keyword-grader in a standard AgentV eval. # # Run: cd examples/showcase/evaluator-conformance # npx agentv eval EVAL.yaml --dry-run # -# The conformance harness validates this evaluator separately: +# The conformance harness validates this grader separately: # bun run conformance-check.ts -description: Keyword-matching evaluator used for conformance testing demo +description: Keyword-matching grader used for conformance testing demo execution: target: llm tests: assertions: - name: keyword-grader type: code-grader command: ["bun", "run", "evaluators/keyword-grader.ts"] - id: partial-match criteria: "Answer must mention red, blue, and yellow." assertions: - name: keyword-grader type: code-grader command: ["bun", "run", "evaluators/keyword-grader.ts"] diff --git a/examples/showcase/evaluator-conformance/README.md b/examples/showcase/evaluator-conformance/README.md index 79485de5e..dcdac8f7e 100644 --- a/examples/showcase/evaluator-conformance/README.md +++ b/examples/showcase/evaluator-conformance/README.md @@ -1,17 +1,17 @@ -# Evaluator Conformance Harness +# Grader Conformance Harness -A showcase demonstrating how to verify that an evaluator is **compatible** (produces valid output) and **consistent** (produces stable scores across repeated runs). +A showcase demonstrating how to verify that a grader is **compatible** (produces valid output) and **consistent** (produces stable scores across repeated runs). ## Problem -LLM-based and heuristic evaluators can be non-deterministic. Before trusting an evaluator in CI, you need to know: +LLM-based and heuristic graders can be non-deterministic. Before trusting a grader in CI, you need to know: -1. **Compatibility** — Does the evaluator always return valid `{ score, hits, misses }` output? +1. **Compatibility** — Does the grader always return valid `{ score, hits, misses }` output? 2. **Consistency** — Does it produce stable verdicts on unambiguous inputs?
## How It Works -The harness runs an evaluator N times against a labeled fixture dataset: +The harness runs a grader N times against a labeled fixture dataset: | Label | Expectation | |-----------|---------------------------------------------| @@ -28,7 +28,7 @@ It then computes per-fixture metrics: ## Quick Start ```bash -cd examples/showcase/evaluator-conformance +cd examples/showcase/grader-conformance bun install bun run conformance-check.ts ``` @@ -45,8 +45,8 @@ bun run conformance-check.ts ## Example Output ``` - Evaluator Conformance Harness - evaluator: bun run evaluators/keyword-grader.ts + Grader Conformance Harness + grader: bun run graders/keyword-grader.ts fixtures: 9 runs/each: 5 max-flip: 0 @@ -82,7 +82,7 @@ The `--output` flag writes a structured JSON report for programmatic consumption ```json { - "evaluator": ["bun", "run", "evaluators/keyword-grader.ts"], + "grader": ["bun", "run", "graders/keyword-grader.ts"], "total_fixtures": 9, "total_runs": 45, "compatible": true, @@ -104,19 +104,19 @@ The `--output` flag writes a structured JSON report for programmatic consumption } ``` -## Adapting for Your Evaluator +## Adapting for Your Grader -1. Replace `evaluators/keyword-grader.ts` with your evaluator script +1. Replace `graders/keyword-grader.ts` with your grader script 2. Update `fixtures.yaml` with domain-specific test cases 3. Set `score_bounds` on ambiguous fixtures based on acceptable variance -4. Adjust `--max-flip-rate` for LLM-based evaluators (e.g., `0.1` allows 10% flip rate) +4. Adjust `--max-flip-rate` for LLM-based graders (e.g., `0.1` allows 10% flip rate) ## Files | File | Purpose | |--------------------------------|-----------------------------------------------| -| `conformance-check.ts` | Harness script — runs evaluator, validates | +| `conformance-check.ts` | Harness script — runs grader, validates | | `fixtures.yaml` | Labeled fixture dataset | -| `evaluators/keyword-grader.ts` | Sample deterministic evaluator under test | -| `EVAL.yaml` | Standard AgentV eval using the same evaluator | +| `graders/keyword-grader.ts` | Sample deterministic grader under test | +| `EVAL.yaml` | Standard AgentV eval using the same grader | | `package.json` | Dependencies | diff --git a/examples/showcase/evaluator-conformance/conformance-check.ts b/examples/showcase/evaluator-conformance/conformance-check.ts index 9aecdf21b..e1adc3fb4 100644 --- a/examples/showcase/evaluator-conformance/conformance-check.ts +++ b/examples/showcase/evaluator-conformance/conformance-check.ts @@ -1,8 +1,8 @@ #!/usr/bin/env bun /** - * Evaluator Conformance Harness + * Grader Conformance Harness * - * Runs an evaluator N times per fixture and validates: + * Runs a grader N times per fixture and validates: * - Compatibility: output matches CodeGraderResult schema (score, assertions) * - Consistency: flip-rate, agreement, and variance meet thresholds * @@ -35,7 +35,7 @@ interface Fixture { } interface FixtureFile { - evaluator: { script: string[] }; + grader: { script: string[] }; fixtures: Fixture[]; } @@ -72,7 +72,7 @@ interface FixtureReport { } interface ConformanceReport { - evaluator: string[]; + grader: string[]; total_fixtures: number; total_runs: number; compatible: boolean; @@ -96,7 +96,7 @@ const fixturePath = resolve(values.fixture ?? 'fixtures.yaml'); const runs = Number.parseInt(values.runs ?? '5', 10); const maxFlipRate = Number.parseFloat(values['max-flip-rate'] ??
'0'); -// ── Evaluator invocation ──────────────────────────────────────────────── +// ── Grader invocation ──────────────────────────────────────────────── function buildCodeGraderInput(fixture: Fixture): string { // Build a minimal CodeGraderInput in the snake_case wire format @@ -132,7 +132,7 @@ function runEvaluator(script: string[], input: string): Promise proc.on('error', reject); proc.on('close', (code) => { if (code !== 0) { - reject(new Error(`Evaluator exited with code ${code}: ${stderr}`)); + reject(new Error(`Grader exited with code ${code}: ${stderr}`)); return; } try { @@ -224,9 +224,9 @@ async function main(): Promise { const raw = readFileSync(fixturePath, 'utf-8'); const data = parse(raw) as FixtureFile; - const { evaluator, fixtures } = data; - console.log('\n Evaluator Conformance Harness'); - console.log(` evaluator: ${evaluator.script.join(' ')}`); + const { grader, fixtures } = data; + console.log('\n Grader Conformance Harness'); + console.log(` grader: ${grader.script.join(' ')}`); console.log(` fixtures: ${fixtures.length}`); console.log(` runs/each: ${runs}`); console.log(` max-flip: ${maxFlipRate}\n`); @@ -243,7 +243,7 @@ async function main(): Promise { for (let i = 0; i < runs; i++) { try { - const result = await runEvaluator(evaluator.script, input); + const result = await runEvaluator(grader.script, input); const schemaErrors = validateResult(result); if (schemaErrors.length > 0) { compatible = false; @@ -355,7 +355,7 @@ async function main(): Promise { // Write output if (values.output) { const output: ConformanceReport = { - evaluator: evaluator.script, + grader: grader.script, total_fixtures: fixtures.length, total_runs: runs * fixtures.length, compatible: allCompatible, diff --git a/examples/showcase/evaluator-conformance/evaluators/keyword-grader.ts b/examples/showcase/evaluator-conformance/evaluators/keyword-grader.ts index 7612499b1..5f8f1513a 100644 --- a/examples/showcase/evaluator-conformance/evaluators/keyword-grader.ts +++ b/examples/showcase/evaluator-conformance/evaluators/keyword-grader.ts @@ -1,6 +1,6 @@ #!/usr/bin/env bun /** - * Sample evaluator for conformance testing. + * Sample grader for conformance testing. * * Deterministic keyword-matching grader: checks whether expected keywords * appear in the candidate output. 
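 * For example (illustrative, not from the repo): a conforming result printed to
 * stdout might be {"score": 1.0, "assertions": [{"text": "mentions red, blue, and yellow", "passed": true}]}.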
Produces stable scores for unambiguous diff --git a/examples/showcase/evaluator-conformance/fixtures.yaml b/examples/showcase/evaluator-conformance/fixtures.yaml index 03981d9fb..009538fc6 100644 --- a/examples/showcase/evaluator-conformance/fixtures.yaml +++ b/examples/showcase/evaluator-conformance/fixtures.yaml @@ -1,14 +1,14 @@ -# Evaluator Conformance Fixture Dataset +# Grader Conformance Fixture Dataset # -# Each fixture has a label indicating the expected evaluator behavior: -# - pass: Unambiguous pass — evaluator must always score 1.0 -# - fail: Unambiguous fail — evaluator must always score 0.0 +# Each fixture has a label indicating the expected grader behavior: +# - pass: Unambiguous pass — grader must always score 1.0 +# - fail: Unambiguous fail — grader must always score 0.0 # - ambiguous: Allowed to vary within score_bounds # -# The evaluator under test receives: question, criteria, answer, expected_output +# The grader under test receives: question, criteria, answer, expected_output -evaluator: - command: ["bun", "run", "evaluators/keyword-grader.ts"] +grader: + command: ["bun", "run", "graders/keyword-grader.ts"] fixtures: # ── Unambiguous Pass Cases ───────────────────────────────────────────── diff --git a/examples/showcase/export-screening/README.md b/examples/showcase/export-screening/README.md index 059a8b7de..fa16236c4 100644 --- a/examples/showcase/export-screening/README.md +++ b/examples/showcase/export-screening/README.md @@ -14,7 +14,7 @@ Trade compliance teams screen shipments to identify potential dual-use goods req 1. **Multi-class classification** (Low/Medium/High) 2. **Structured JSON output** with reasoning -3. **Code evaluator** for format validation and accuracy checking +3. **Code grader** for format validation and accuracy checking 4. **Wrapper-based metrics** (confusion matrix + precision/recall/F1 + policy-weighted overall) 5. **Multi-sample CI gating** — run eval N times, aggregate results, and gate on aggregated metrics @@ -83,9 +83,9 @@ Each case contains: - **Expected output**: Expert risk assessment (`riskLevel: High|Medium|Low`) - **Outcome description**: Explanation for human reviewers -### 2. Code Evaluator (`validate_risk_output.ts`) +### 2. Code Grader (`validate_risk_output.ts`) -The evaluator: +The grader: 1. Validates JSON format and required fields 2. Extracts AI's `riskLevel` prediction 3. Compares to expected `riskLevel` from `expected_output` diff --git a/examples/showcase/multi-model-benchmark/README.md b/examples/showcase/multi-model-benchmark/README.md index b519a28d4..0767d89ed 100644 --- a/examples/showcase/multi-model-benchmark/README.md +++ b/examples/showcase/multi-model-benchmark/README.md @@ -7,7 +7,7 @@ Demonstrates a complete **multi-model × multi-metric × variability** evaluatio | Feature | How it's used | |---------|---------------| | **Targets matrix** | Every test runs against `copilot`, `claude`, and `gemini-llm` | -| **Weighted evaluators** | Accuracy (3×), completeness (2×), clarity (1×) | +| **Weighted graders** | Accuracy (3×), completeness (2×), clarity (1×) | | **Trials (pass@k)** | 2 trials per test to surface non-determinism | | **Compare workflow** | Side-by-side model comparison from result files | @@ -102,7 +102,7 @@ execution: - gemini-llm # e.g., gemini-flash ``` -### 2. Weighted Evaluators +### 2. Weighted Graders Three LLM graders score each response. 
Weights control their contribution to the aggregate score: @@ -179,7 +179,7 @@ execution: - my_new_model # Add here ``` -### Adding an evaluator +### Adding a grader Add a new grader prompt in `prompts/` and reference it in the eval's `assertions` block: @@ -205,6 +205,6 @@ execution: ## See Also - [`examples/features/matrix-evaluation/`](../../features/matrix-evaluation/) — minimal targets matrix example -- [`examples/features/weighted-evaluators/`](../../features/weighted-evaluators/) — per-evaluator weight patterns +- [`examples/features/weighted-graders/`](../../features/weighted-graders/) — per-grader weight patterns - [`examples/features/trials/`](../../features/trials/) — trial strategy configuration - [`examples/features/compare/`](../../features/compare/) — baseline vs candidate comparison diff --git a/examples/showcase/multi-model-benchmark/evals/benchmark.eval.yaml b/examples/showcase/multi-model-benchmark/evals/benchmark.eval.yaml index 4e6b468cf..9d75be316 100644 --- a/examples/showcase/multi-model-benchmark/evals/benchmark.eval.yaml +++ b/examples/showcase/multi-model-benchmark/evals/benchmark.eval.yaml @@ -2,7 +2,7 @@ # # Demonstrates multi-model × multi-metric × variability workflow: # - Targets matrix: runs every test against multiple models -# - Weighted evaluators: accuracy (3×), completeness (2×), clarity (1×) +# - Weighted graders: accuracy (3×), completeness (2×), clarity (1×) # - Trials: pass@k with 2 trials to measure variability # # Default targets use low-cost models for safe experimentation. diff --git a/examples/showcase/offline-grader-benchmark/README.md b/examples/showcase/offline-grader-benchmark/README.md index 141da6d5c..4ccfec2cc 100644 --- a/examples/showcase/offline-grader-benchmark/README.md +++ b/examples/showcase/offline-grader-benchmark/README.md @@ -4,7 +4,7 @@ A public, offline workflow for benchmarking **grader quality itself** against a It uses existing AgentV primitives: - a `cli` replay target to return the frozen agent output from each sample, -- three `llm-grader` evaluators (each can use a different low-cost target), +- three `llm-grader` graders (each can use a different low-cost target), - a `composite` threshold aggregator for majority vote, - `agentv compare` for A/B grader-setup comparison, - and a small post-processing script that scores the grader panel against human ground truth. diff --git a/examples/showcase/offline-grader-benchmark/scripts/score-grader-benchmark.ts b/examples/showcase/offline-grader-benchmark/scripts/score-grader-benchmark.ts index 58f77d351..c6e7e46d5 100644 --- a/examples/showcase/offline-grader-benchmark/scripts/score-grader-benchmark.ts +++ b/examples/showcase/offline-grader-benchmark/scripts/score-grader-benchmark.ts @@ -30,7 +30,7 @@ type GroundTruth = { }; function usage(): never { - console.error(`Usage: bun score-grader-benchmark.ts --results --eval-set [--label ] [--evaluator ] + console.error(`Usage: bun score-grader-benchmark.ts --results --eval-set [--label ] [--grader ] Reads raw AgentV eval JSONL for a grader panel, resolves a majority verdict from child grader scores, and emits scored JSONL where score=1 means the panel matched human ground truth.
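The majority-vote resolution itself is simple; as a rough sketch, assuming each child grader result carries a pass/fail verdict (helper name and types are hypothetical, not the script's actual code):

```ts
// Resolve a panel verdict by majority over child grader verdicts.
// Strict majority required; a tie resolves to 'fail'.
type Verdict = 'pass' | 'fail';

function majorityVerdict(children: ReadonlyArray<{ verdict: Verdict }>): Verdict {
  const passes = children.filter((c) => c.verdict === 'pass').length;
  return passes * 2 > children.length ? 'pass' : 'fail';
}
```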
@@ -39,7 +39,7 @@ Options: --results Raw AgentV eval output JSONL --eval-set Offline labeled export JSONL used for the eval --label Optional output target label (defaults to input target or results filename) - --evaluator Composite evaluator name to inspect (defaults to first composite / first score group) + --grader Composite grader name to inspect (defaults to first composite / first score group) --help Show this help message `); process.exit(1); @@ -130,7 +130,7 @@ function selectPanel(scores: ScoreRecord[] | undefined, evaluatorName?: string): if (evaluatorName) { const named = scores.find((score) => score.name === evaluatorName); if (!named) { - throw new Error(`Evaluator '${evaluatorName}' not found in scores[]`); + throw new Error(`Grader '${evaluatorName}' not found in scores[]`); } return named; } @@ -158,7 +158,7 @@ if (args.includes('--help')) usage(); const resultsPath = getArg('--results'); const evalSetPath = getArg('--eval-set'); const labelOverride = getArg('--label'); -const evaluatorName = getArg('--evaluator'); +const evaluatorName = getArg('--grader'); if (!resultsPath || !evalSetPath) usage(); @@ -185,7 +185,7 @@ for (const line of rawResults) { const graders = panel.scores ?? []; if (graders.length === 0) { throw new Error( - `Evaluator '${panel.name ?? 'unknown'}' for '${result.test_id}' has no child grader scores`, + `Grader '${panel.name ?? 'unknown'}' for '${result.test_id}' has no child grader scores`, ); } diff --git a/examples/showcase/psychotherapy/evals/validate_output.py b/examples/showcase/psychotherapy/evals/validate_output.py index 48b02ccbc..187d64cab 100644 --- a/examples/showcase/psychotherapy/evals/validate_output.py +++ b/examples/showcase/psychotherapy/evals/validate_output.py @@ -3,7 +3,7 @@ JSON Format Validator for AgentV Validates that the candidate answer is strictly valid JSON with required keys. Auto-detects framework type from expected answer structure. -Returns score 0.0 if not valid JSON, otherwise passes to next evaluator. +Returns score 0.0 if not valid JSON, otherwise passes to next grader. """ import json @@ -216,7 +216,7 @@ def validate_routing_schema(parsed: dict[str, Any]) -> list[str]: def main(): - """Main entry point for AgentV code evaluator.""" + """Main entry point for AgentV code grader.""" # AgentV passes eval data via stdin as JSON try: eval_data = json.load(sys.stdin) diff --git a/examples/showcase/tool-evaluation-plugins/README.md b/examples/showcase/tool-evaluation-plugins/README.md index cbf91eaca..1ee18e7ec 100644 --- a/examples/showcase/tool-evaluation-plugins/README.md +++ b/examples/showcase/tool-evaluation-plugins/README.md @@ -1,6 +1,6 @@ # Tool Evaluation Plugin Patterns -This showcase demonstrates **plugin-based tool evaluation patterns** that complement AgentV's built-in `tool_trajectory` evaluator. These patterns are intentionally implemented as plugins (code graders) rather than built-ins because they involve domain-specific logic or semantic evaluation. +This showcase demonstrates **plugin-based tool evaluation patterns** that complement AgentV's built-in `tool_trajectory` grader. These patterns are intentionally implemented as plugins (code graders) rather than built-ins because they involve domain-specific logic or semantic evaluation. ## When to Use Plugins vs Built-ins @@ -16,12 +16,12 @@ This showcase demonstrates **plugin-based tool evaluation patterns** that comple ## Plugin Examples -### 1. Tool Selection Evaluator (`tool-selection-grader.ts`) +### 1. 
Tool Selection Grader (`tool-selection-grader.ts`) Evaluates whether the agent selected the **right tools** for the task. Uses heuristic matching to assess tool choices against task keywords. ```yaml -evaluators: +graders: - name: tool-selection type: code-grader command: ["bun", "run", "scripts/tool-selection-grader.ts"] @@ -32,7 +32,7 @@ evaluators: Computes efficiency metrics and scores based on configurable thresholds. Demonstrates how to use execution metrics in evaluation. ```yaml -evaluators: +graders: - name: efficiency type: code-grader command: ["bun", "run", "scripts/efficiency-scorer.ts"] @@ -43,7 +43,7 @@ evaluators: Compares two agent responses for tool usage quality with position bias mitigation (runs comparison twice with swapped order). ```yaml -evaluators: +graders: - name: pairwise-compare type: code-grader command: ["bun", "run", "scripts/pairwise-tool-compare.ts"] diff --git a/examples/showcase/tool-evaluation-plugins/scripts/efficiency-scorer.ts b/examples/showcase/tool-evaluation-plugins/scripts/efficiency-scorer.ts index 6d7623f19..55337a013 100644 --- a/examples/showcase/tool-evaluation-plugins/scripts/efficiency-scorer.ts +++ b/examples/showcase/tool-evaluation-plugins/scripts/efficiency-scorer.ts @@ -14,7 +14,7 @@ * - Different projects have different cost/performance tradeoffs * * Usage in eval YAML: - * evaluators: + * graders: * - name: efficiency * type: code_grader * script: ["bun", "run", "scripts/efficiency-scorer.ts"] diff --git a/examples/showcase/tool-evaluation-plugins/scripts/pairwise-tool-compare.ts b/examples/showcase/tool-evaluation-plugins/scripts/pairwise-tool-compare.ts index b610470d5..b3fa30c66 100644 --- a/examples/showcase/tool-evaluation-plugins/scripts/pairwise-tool-compare.ts +++ b/examples/showcase/tool-evaluation-plugins/scripts/pairwise-tool-compare.ts @@ -12,7 +12,7 @@ * - Not all evaluations need comparative assessment * * Usage in eval YAML: - * evaluators: + * graders: * - name: pairwise-compare * type: code_grader * script: ["bun", "run", "scripts/pairwise-tool-compare.ts"] diff --git a/examples/showcase/tool-evaluation-plugins/scripts/tool-selection-grader.ts b/examples/showcase/tool-evaluation-plugins/scripts/tool-selection-grader.ts index e9b694874..8e2131700 100644 --- a/examples/showcase/tool-evaluation-plugins/scripts/tool-selection-grader.ts +++ b/examples/showcase/tool-evaluation-plugins/scripts/tool-selection-grader.ts @@ -1,6 +1,6 @@ #!/usr/bin/env bun /** - * Tool Selection Evaluator - Code Grader Plugin + * Tool Selection Grader - Code Grader Plugin * * Evaluates whether the agent selected the RIGHT tools for the task. 
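 * For example (illustrative only, not from the plugin): a task phrased as
 * "fix the failing unit test" should favor file-read/edit and test-runner
 * tools over a web-search tool.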
* This is a semantic evaluation that requires understanding task requirements @@ -12,7 +12,7 @@ * - Different projects have different tool selection criteria * * Usage in eval YAML: - * evaluators: + * graders: * - name: tool-selection * type: code_grader * script: ["bun", "run", "scripts/tool-selection-grader.ts"] diff --git a/examples/showcase/tool-evaluation-plugins/tool-eval-demo.eval.yaml b/examples/showcase/tool-evaluation-plugins/tool-eval-demo.eval.yaml index 5ba64e4d2..18bc492e9 100644 --- a/examples/showcase/tool-evaluation-plugins/tool-eval-demo.eval.yaml +++ b/examples/showcase/tool-evaluation-plugins/tool-eval-demo.eval.yaml @@ -1,7 +1,7 @@ # Tool Evaluation Plugins Demo # Demonstrates plugin-based (code grader) tool evaluation patterns # -# These patterns complement the built-in tool_trajectory evaluator with +# These patterns complement the built-in tool_trajectory grader with # semantic evaluation capabilities that require domain-specific logic. # # Run: cd examples/showcase/tool-evaluation-plugins diff --git a/packages/core/src/evaluation/assertions.ts b/packages/core/src/evaluation/assertions.ts index f8a12d79d..0a724fe1d 100644 --- a/packages/core/src/evaluation/assertions.ts +++ b/packages/core/src/evaluation/assertions.ts @@ -2,7 +2,7 @@ * Types for inline assertion functions used in the evaluate() API. * * Inline functions are the escape hatch for custom evaluation logic - * that doesn't fit a built-in evaluator type. For built-in assertions + * that doesn't fit a built-in grader type. For built-in assertions * (contains, regex, is-json, etc.), use config objects instead: * * assert: [{ type: 'contains', value: 'hello' }] diff --git a/packages/core/src/evaluation/baseline.ts b/packages/core/src/evaluation/baseline.ts index 941ff4e4b..2801b2dbb 100644 --- a/packages/core/src/evaluation/baseline.ts +++ b/packages/core/src/evaluation/baseline.ts @@ -1,4 +1,4 @@ -import type { EvaluationResult, EvaluatorResult } from './types.js'; +import type { EvaluationResult, GraderResult } from './types.js'; /** * Top-level fields to strip from baseline results. @@ -23,26 +23,26 @@ const STRIPPED_TOP_LEVEL_FIELDS = new Set([ ]); /** - * Fields to strip from evaluator results. + * Fields to strip from grader results. */ const STRIPPED_EVALUATOR_FIELDS = new Set(['rawRequest', 'input']); /** * Trims an evaluator result for baseline storage. * Strips debug/audit fields while preserving scoring data. - * Recursively trims nested evaluator results (for composites). + * Recursively trims nested grader results (for composites). 
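 * For example, a composite result's scores[] children are trimmed with the
 * same field rules, depth-first.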
*/ -function trimEvaluatorResult(result: EvaluatorResult): EvaluatorResult { +function trimEvaluatorResult(result: GraderResult): GraderResult { const trimmed: Record = {}; for (const [key, value] of Object.entries(result)) { if (STRIPPED_EVALUATOR_FIELDS.has(key)) continue; if (key === 'scores' && Array.isArray(value)) { - trimmed[key] = (value as EvaluatorResult[]).map(trimEvaluatorResult); + trimmed[key] = (value as GraderResult[]).map(trimEvaluatorResult); } else { trimmed[key] = value; } } - return trimmed as unknown as EvaluatorResult; + return trimmed as unknown as GraderResult; } /** @@ -57,7 +57,7 @@ export function trimBaselineResult(result: EvaluationResult): EvaluationResult { for (const [key, value] of Object.entries(result)) { if (STRIPPED_TOP_LEVEL_FIELDS.has(key)) continue; if (key === 'scores' && Array.isArray(value)) { - trimmed[key] = (value as EvaluatorResult[]).map(trimEvaluatorResult); + trimmed[key] = (value as GraderResult[]).map(trimEvaluatorResult); } else { trimmed[key] = value; } diff --git a/packages/core/src/evaluation/evaluate.ts b/packages/core/src/evaluation/evaluate.ts index 619bc90d8..1aab6886e 100644 --- a/packages/core/src/evaluation/evaluate.ts +++ b/packages/core/src/evaluation/evaluate.ts @@ -61,17 +61,17 @@ import path from 'node:path'; import { buildDirectoryChain, findGitRoot } from './file-utils.js'; import type { AssertFn } from './assertions.js'; -import { DEFAULT_THRESHOLD } from './evaluators/scoring.js'; +import { DEFAULT_THRESHOLD } from './graders/scoring.js'; import { runEvaluation } from './orchestrator.js'; import { createFunctionProvider } from './providers/function-provider.js'; import { readTargetDefinitions } from './providers/targets-file.js'; import { type ResolvedTarget, resolveTargetDefinition } from './providers/targets.js'; import type { TargetDefinition } from './providers/types.js'; -import { INLINE_ASSERT_FN } from './registry/builtin-evaluators.js'; +import { INLINE_ASSERT_FN } from './registry/builtin-graders.js'; import type { EvalTest, EvaluationResult, - EvaluatorConfig, + GraderConfig, InlineAssertEvaluatorConfig, } from './types.js'; import { loadTests } from './yaml-parser.js'; @@ -309,7 +309,7 @@ export async function evaluate(config: EvalConfig): Promise { }; return Object.assign(base, { [INLINE_ASSERT_FN]: entry as AssertFn, - }) as unknown as EvaluatorConfig; + }) as unknown as GraderConfig; } const a = entry as EvalAssertionInput; const { type: rawType, ...rest } = a; @@ -317,7 +317,7 @@ export async function evaluate(config: EvalConfig): Promise { ...rest, name: a.name ?? `${rawType}_${i}`, type: mapAssertionType(rawType), - } as unknown as EvaluatorConfig; + } as unknown as GraderConfig; }); return { @@ -364,7 +364,7 @@ export async function evaluate(config: EvalConfig): Promise { } /** - * Map user-facing assertion type names to internal evaluator type names. + * Map user-facing assertion type names to internal grader type names. * Handles snake_case to kebab-case normalization (e.g., 'llm_grader' -> 'llm-grader'). */ function mapAssertionType(type: string): string { diff --git a/packages/core/src/evaluation/evaluators.ts b/packages/core/src/evaluation/evaluators.ts deleted file mode 100644 index 6826771ca..000000000 --- a/packages/core/src/evaluation/evaluators.ts +++ /dev/null @@ -1,5 +0,0 @@ -/** - * Re-exports from the evaluators/ directory for backwards compatibility. - * New code should import directly from './evaluators/index.js' or specific modules. 
- */ -export * from './evaluators/index.js'; diff --git a/packages/core/src/evaluation/evaluators/index.ts b/packages/core/src/evaluation/evaluators/index.ts deleted file mode 100644 index c1d01106a..000000000 --- a/packages/core/src/evaluation/evaluators/index.ts +++ /dev/null @@ -1,93 +0,0 @@ -// Types -export type { - ChildEvaluatorResult, - EvaluationContext, - EvaluationScore, - EvaluationVerdict, - Evaluator, - EvaluatorFactory, -} from './types.js'; - -// Scoring utilities -export { - DEFAULT_THRESHOLD, - PASS_THRESHOLD, - clampScore, - deepEqual, - extractJsonBlob, - isNonEmptyString, - negateScore, - parseJsonFromText, - parseJsonSafe, - scoreToVerdict, -} from './scoring.js'; - -// Evaluators -export { CodeEvaluator, executeScript } from './code-evaluator.js'; -export type { CodeEvaluatorOptions } from './code-evaluator.js'; - -export { CompositeEvaluator } from './composite.js'; -export type { CompositeEvaluatorOptions } from './composite.js'; - -export { CostEvaluator } from './cost.js'; -export type { CostEvaluatorOptions } from './cost.js'; - -export { ExecutionMetricsEvaluator } from './execution-metrics.js'; -export type { ExecutionMetricsEvaluatorOptions } from './execution-metrics.js'; - -export { FieldAccuracyEvaluator } from './field-accuracy.js'; -export type { FieldAccuracyEvaluatorOptions } from './field-accuracy.js'; - -export { LatencyEvaluator } from './latency.js'; -export type { LatencyEvaluatorOptions } from './latency.js'; - -export { - LlmGraderEvaluator, - LlmGraderEvaluator as LlmJudgeEvaluator, - buildOutputSchema, - buildRubricOutputSchema, - buildScoreRangeOutputSchema, - calculateRubricScore, - DEFAULT_EVALUATOR_TEMPLATE, - extractImageBlocks, - substituteVariables, - freeformEvaluationSchema, - rubricEvaluationSchema, -} from './llm-grader.js'; -export type { - LlmGraderEvaluatorOptions, - LlmGraderEvaluatorOptions as LlmJudgeEvaluatorOptions, -} from './llm-grader.js'; - -export { SkillTriggerEvaluator } from './skill-trigger.js'; - -export { - assembleLlmGraderPrompt, - assembleLlmGraderPrompt as assembleLlmJudgePrompt, -} from './llm-grader-prompt.js'; -export type { - LlmGraderPromptAssembly, - LlmGraderPromptAssembly as LlmJudgePromptAssembly, -} from './llm-grader-prompt.js'; - -export { TokenUsageEvaluator } from './token-usage.js'; -export type { TokenUsageEvaluatorOptions } from './token-usage.js'; - -export { ToolTrajectoryEvaluator } from './tool-trajectory.js'; -export type { ToolTrajectoryEvaluatorOptions } from './tool-trajectory.js'; - -// Deterministic assertions -export { - runContainsAssertion, - runContainsAnyAssertion, - runContainsAllAssertion, - runIcontainsAssertion, - runIcontainsAnyAssertion, - runIcontainsAllAssertion, - runStartsWithAssertion, - runEndsWithAssertion, - runEqualsAssertion, - runIsJsonAssertion, - runRegexAssertion, -} from './assertions.js'; -export type { AssertionResult } from './assertions.js'; diff --git a/packages/core/src/evaluation/graders.ts b/packages/core/src/evaluation/graders.ts new file mode 100644 index 000000000..614cdf3b8 --- /dev/null +++ b/packages/core/src/evaluation/graders.ts @@ -0,0 +1,5 @@ +/** + * Re-exports from the graders/ directory for backwards compatibility. + * New code should import directly from './graders/index.js' or specific modules.
+ */ +export * from './graders/index.js'; diff --git a/packages/core/src/evaluation/evaluators/assertions.ts b/packages/core/src/evaluation/graders/assertions.ts similarity index 100% rename from packages/core/src/evaluation/evaluators/assertions.ts rename to packages/core/src/evaluation/graders/assertions.ts diff --git a/packages/core/src/evaluation/evaluators/code-evaluator.ts b/packages/core/src/evaluation/graders/code-grader.ts similarity index 97% rename from packages/core/src/evaluation/evaluators/code-evaluator.ts rename to packages/core/src/evaluation/graders/code-grader.ts index 8762cde27..3895a2ffb 100644 --- a/packages/core/src/evaluation/evaluators/code-evaluator.ts +++ b/packages/core/src/evaluation/graders/code-grader.ts @@ -13,7 +13,7 @@ import { type ContentImage, isContentArray } from '../content.js'; import type { AssertionEntry, JsonObject, TargetAccessConfig } from '../types.js'; import { getRepoCheckoutTargets } from '../workspace/repo-checkout.js'; import { clampScore, isNonEmptyString, parseJsonSafe, scoreToVerdict } from './scoring.js'; -import type { EvaluationContext, EvaluationScore, Evaluator } from './types.js'; +import type { EvaluationContext, EvaluationScore, Grader } from './types.js'; /** Threshold in bytes above which output is written to a temp file instead of inlined. */ const FILE_BACKED_OUTPUT_THRESHOLD = 50_000; @@ -95,7 +95,7 @@ export async function materializeContentForGrader( return result; } -export interface CodeEvaluatorOptions { +export interface CodeGraderOptions { readonly command: readonly string[]; /** @deprecated Use `command` instead */ readonly script?: readonly string[]; @@ -107,7 +107,7 @@ export interface CodeEvaluatorOptions { readonly target?: TargetAccessConfig; } -export class CodeEvaluator implements Evaluator { +export class CodeGrader implements Grader { readonly kind = 'code-grader'; private readonly command: readonly string[]; @@ -116,7 +116,7 @@ export class CodeEvaluator implements Evaluator { private readonly config?: Record; private readonly target?: TargetAccessConfig; - constructor(options: CodeEvaluatorOptions) { + constructor(options: CodeGraderOptions) { this.command = options.command ?? options.script ?? []; this.cwd = options.cwd; this.agentTimeoutMs = options.agentTimeoutMs; @@ -263,7 +263,7 @@ export class CodeEvaluator implements Evaluator { // Build evaluator raw request with proxy metadata if used const proxyUsage = getProxyUsage?.(); - const evaluatorRawRequest: JsonObject = { + const graderRawRequest: JsonObject = { command: this.command, ...(this.cwd ? { cwd: this.cwd } : {}), ...(proxyUsage @@ -281,7 +281,7 @@ export class CodeEvaluator implements Evaluator { verdict: scoreToVerdict(score), assertions, expectedAspectCount: assertions.length || 1, - evaluatorRawRequest, + graderRawRequest, ...(details ? { details } : {}), tokenUsage: proxyUsage?.tokenUsage, }; @@ -293,7 +293,7 @@ export class CodeEvaluator implements Evaluator { verdict: 'fail', assertions: [{ text: `Code evaluator failed: ${message}`, passed: false }], expectedAspectCount: 1, - evaluatorRawRequest: { + graderRawRequest: { command: this.command, ...(this.cwd ? 
{ cwd: this.cwd } : {}), ...(proxyUsage diff --git a/packages/core/src/evaluation/evaluators/composite.ts b/packages/core/src/evaluation/graders/composite.ts similarity index 89% rename from packages/core/src/evaluation/evaluators/composite.ts rename to packages/core/src/evaluation/graders/composite.ts index 6d5295fc6..aa4974388 100644 --- a/packages/core/src/evaluation/evaluators/composite.ts +++ b/packages/core/src/evaluation/graders/composite.ts @@ -4,18 +4,18 @@ import { extractLastAssistantContent } from '../providers/types.js'; import type { AssertionEntry, CompositeAggregatorConfig, - CompositeEvaluatorConfig, + CompositeGraderConfig, JsonObject, } from '../types.js'; -import { executeScript } from './code-evaluator.js'; +import { executeScript } from './code-grader.js'; import { buildOutputSchema, freeformEvaluationSchema } from './llm-grader.js'; import { clampScore, parseJsonFromText, parseJsonSafe, scoreToVerdict } from './scoring.js'; import type { - ChildEvaluatorResult, + ChildGraderResult, EvaluationContext, EvaluationScore, - Evaluator, - EvaluatorFactory, + Grader, + GraderFactory, } from './types.js'; interface MemberResult { @@ -27,23 +27,23 @@ interface MemberResult { const DEFAULT_COMPOSITE_AGGREGATOR_PROMPT = `Review the following evaluation results: {{EVALUATOR_RESULTS_JSON}} -Decide the final score and verdict based on all evaluator results. +Decide the final score and verdict based on all grader results. Return a JSON object with: score (0.0-1.0), verdict (pass/fail), and reasoning.`; -export interface CompositeEvaluatorOptions { - readonly config: CompositeEvaluatorConfig; - readonly evaluatorFactory: EvaluatorFactory; +export interface CompositeGraderOptions { + readonly config: CompositeGraderConfig; + readonly evaluatorFactory: GraderFactory; readonly cwd?: string; } -export class CompositeEvaluator implements Evaluator { +export class CompositeGrader implements Grader { readonly kind = 'composite'; - private readonly config: CompositeEvaluatorConfig; - private readonly evaluatorFactory: EvaluatorFactory; + private readonly config: CompositeGraderConfig; + private readonly evaluatorFactory: GraderFactory; private readonly cwd?: string; - constructor(options: CompositeEvaluatorOptions) { + constructor(options: CompositeGraderOptions) { this.config = options.config; this.evaluatorFactory = options.evaluatorFactory; this.cwd = options.cwd; @@ -92,7 +92,7 @@ export class CompositeEvaluator implements Evaluator { let weightedSum = 0; let evaluatedCount = 0; const allAssertions: AssertionEntry[] = []; - const scores: ChildEvaluatorResult[] = []; + const scores: ChildGraderResult[] = []; for (const member of results) { const weight = weights?.[member.id] ?? 1.0; @@ -105,7 +105,7 @@ export class CompositeEvaluator implements Evaluator { weight, verdict: member.result.verdict, assertions: [...member.result.assertions], - evaluatorRawRequest: member.result.evaluatorRawRequest, + graderRawRequest: member.result.graderRawRequest, scores: member.result.scores, details: member.result.details, tokenUsage: member.result.tokenUsage, @@ -131,7 +131,7 @@ export class CompositeEvaluator implements Evaluator { verdict: 'skip' as const, assertions: [{ text: 'All evaluators skipped (infrastructure failure)', passed: false }], expectedAspectCount: 1, - evaluatorRawRequest: { + graderRawRequest: { aggregator: 'weighted_average', ...(weights ? 
{ weights } : {}), }, @@ -146,7 +146,7 @@ export class CompositeEvaluator implements Evaluator { verdict: scoreToVerdict(finalScore), assertions: allAssertions, expectedAspectCount: allAssertions.length || 1, - evaluatorRawRequest: { + graderRawRequest: { aggregator: 'weighted_average', ...(weights ? { weights } : {}), }, @@ -155,7 +155,7 @@ export class CompositeEvaluator implements Evaluator { } private runThreshold(results: readonly MemberResult[], threshold: number): EvaluationScore { - const scores: ChildEvaluatorResult[] = []; + const scores: ChildGraderResult[] = []; const allAssertions: AssertionEntry[] = []; let passingCount = 0; let evaluatedCount = 0; @@ -168,7 +168,7 @@ export class CompositeEvaluator implements Evaluator { score: member.result.score, verdict: member.result.verdict, assertions: [...member.result.assertions], - evaluatorRawRequest: member.result.evaluatorRawRequest, + graderRawRequest: member.result.graderRawRequest, scores: member.result.scores, details: member.result.details, tokenUsage: member.result.tokenUsage, @@ -197,7 +197,7 @@ export class CompositeEvaluator implements Evaluator { verdict: 'skip' as const, assertions: [{ text: 'All evaluators skipped (infrastructure failure)', passed: false }], expectedAspectCount: 1, - evaluatorRawRequest: { + graderRawRequest: { aggregator: 'threshold', threshold, }, @@ -219,7 +219,7 @@ export class CompositeEvaluator implements Evaluator { verdict: pass ? 'pass' : 'fail', assertions: allAssertions, expectedAspectCount: allAssertions.length || 1, - evaluatorRawRequest: { + graderRawRequest: { aggregator: 'threshold', threshold, }, @@ -237,14 +237,14 @@ export class CompositeEvaluator implements Evaluator { const inputPayload = JSON.stringify({ results: resultsObject }, null, 2); // Build child results for output - const scores: ChildEvaluatorResult[] = results.map((member) => ({ + const scores: ChildGraderResult[] = results.map((member) => ({ name: member.id, type: member.type, score: member.result.score, weight: weights?.[member.id] ?? 
1.0, verdict: member.result.verdict, assertions: [...member.result.assertions], - evaluatorRawRequest: member.result.evaluatorRawRequest, + graderRawRequest: member.result.graderRawRequest, scores: member.result.scores, details: member.result.details, })); @@ -278,7 +278,7 @@ export class CompositeEvaluator implements Evaluator { verdict, assertions, expectedAspectCount: assertions.length || 1, - evaluatorRawRequest: { + graderRawRequest: { aggregator: 'code-grader', script: scriptPath, }, @@ -291,7 +291,7 @@ export class CompositeEvaluator implements Evaluator { verdict: 'fail', assertions: [{ text: `Code aggregator failed: ${message}`, passed: false }], expectedAspectCount: 1, - evaluatorRawRequest: { + graderRawRequest: { aggregator: 'code-grader', script: scriptPath, error: message, @@ -315,13 +315,13 @@ export class CompositeEvaluator implements Evaluator { const resultsJson = JSON.stringify(resultsObject, null, 2); // Build child results for output - const scores: ChildEvaluatorResult[] = results.map((member) => ({ + const scores: ChildGraderResult[] = results.map((member) => ({ name: member.id, type: member.type, score: member.result.score, verdict: member.result.verdict, assertions: [...member.result.assertions], - evaluatorRawRequest: member.result.evaluatorRawRequest, + graderRawRequest: member.result.graderRawRequest, scores: member.result.scores, details: member.result.details, })); @@ -332,7 +332,7 @@ export class CompositeEvaluator implements Evaluator { const systemPrompt = buildOutputSchema(); - const evaluatorRawRequest: JsonObject = { + const graderRawRequest: JsonObject = { aggregator: 'llm-grader', userPrompt, systemPrompt, @@ -359,7 +359,7 @@ export class CompositeEvaluator implements Evaluator { verdict: scoreToVerdict(score), assertions, expectedAspectCount: Math.max(assertions.length, 1), - evaluatorRawRequest, + graderRawRequest, scores, }; } @@ -384,7 +384,7 @@ export class CompositeEvaluator implements Evaluator { verdict: scoreToVerdict(score), assertions, expectedAspectCount: Math.max(assertions.length, 1), - evaluatorRawRequest, + graderRawRequest, scores, }; } catch { @@ -393,7 +393,7 @@ export class CompositeEvaluator implements Evaluator { verdict: 'fail', assertions: [{ text: 'LLM aggregator failed', passed: false }], expectedAspectCount: 1, - evaluatorRawRequest, + graderRawRequest, scores, }; } diff --git a/packages/core/src/evaluation/evaluators/cost.ts b/packages/core/src/evaluation/graders/cost.ts similarity index 71% rename from packages/core/src/evaluation/evaluators/cost.ts rename to packages/core/src/evaluation/graders/cost.ts index b810a451f..45615f766 100644 --- a/packages/core/src/evaluation/evaluators/cost.ts +++ b/packages/core/src/evaluation/graders/cost.ts @@ -1,20 +1,20 @@ -import type { CostEvaluatorConfig } from '../types.js'; -import type { EvaluationContext, EvaluationScore, Evaluator } from './types.js'; +import type { CostGraderConfig } from '../types.js'; +import type { EvaluationContext, EvaluationScore, Grader } from './types.js'; -export interface CostEvaluatorOptions { - readonly config: CostEvaluatorConfig; +export interface CostGraderOptions { + readonly config: CostGraderConfig; } /** - * Evaluator that checks execution cost against a budget. + * Grader that checks execution cost against a budget. * Uses costUsd from the evaluation context. 
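 * For example (illustrative values): with budget 0.10, a run with
 * costUsd 0.04 passes, while costUsd 0.15 fails.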
*/ -export class CostEvaluator implements Evaluator { +export class CostGrader implements Grader { readonly kind = 'cost'; - private readonly config: CostEvaluatorConfig; + private readonly config: CostGraderConfig; - constructor(options: CostEvaluatorOptions) { + constructor(options: CostGraderOptions) { this.config = options.config; } @@ -29,7 +29,7 @@ export class CostEvaluator implements Evaluator { verdict: 'fail', assertions: [{ text: 'No cost data available in trace', passed: false }], expectedAspectCount: 1, - evaluatorRawRequest: { + graderRawRequest: { type: 'cost', budget, costUsd: null, @@ -52,7 +52,7 @@ export class CostEvaluator implements Evaluator { : { text: `Cost ${formatCost(costUsd)} > ${formatCost(budget)} budget`, passed: false }, ], expectedAspectCount: 1, - evaluatorRawRequest: { + graderRawRequest: { type: 'cost', budget, costUsd, diff --git a/packages/core/src/evaluation/evaluators/execution-metrics.ts b/packages/core/src/evaluation/graders/execution-metrics.ts similarity index 92% rename from packages/core/src/evaluation/evaluators/execution-metrics.ts rename to packages/core/src/evaluation/graders/execution-metrics.ts index 72c22784b..d540a70b2 100644 --- a/packages/core/src/evaluation/evaluators/execution-metrics.ts +++ b/packages/core/src/evaluation/graders/execution-metrics.ts @@ -1,25 +1,25 @@ import { explorationRatio } from '../trace.js'; -import type { AssertionEntry, ExecutionMetricsEvaluatorConfig } from '../types.js'; +import type { AssertionEntry, ExecutionMetricsGraderConfig } from '../types.js'; import { scoreToVerdict } from './scoring.js'; -import type { EvaluationContext, EvaluationScore, Evaluator } from './types.js'; +import type { EvaluationContext, EvaluationScore, Grader } from './types.js'; -export interface ExecutionMetricsEvaluatorOptions { - readonly config: ExecutionMetricsEvaluatorConfig; +export interface ExecutionMetricsGraderOptions { + readonly config: ExecutionMetricsGraderConfig; } /** - * Evaluator that checks execution metrics against configured thresholds. + * Grader that checks execution metrics against configured thresholds. * Supports multiple threshold types: tool calls, LLM calls, tokens, cost, duration, * and exploration ratio. Only specified thresholds are checked. 
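 * For example (illustrative; exact config keys vary): configuring only a
 * tool-call ceiling and a cost ceiling runs exactly those two checks.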
* * Score is proportional: passed / total assertions */ -export class ExecutionMetricsEvaluator implements Evaluator { +export class ExecutionMetricsGrader implements Grader { readonly kind = 'execution-metrics'; - private readonly config: ExecutionMetricsEvaluatorConfig; + private readonly config: ExecutionMetricsGraderConfig; - constructor(options: ExecutionMetricsEvaluatorOptions) { + constructor(options: ExecutionMetricsGraderOptions) { this.config = options.config; } @@ -46,7 +46,7 @@ export class ExecutionMetricsEvaluator implements Evaluator { verdict: 'fail', assertions: [{ text: 'No trace summary available', passed: false }], expectedAspectCount: 1, - evaluatorRawRequest: { + graderRawRequest: { type: 'execution-metrics', config: this.extractConfiguredThresholds(), actual: null, @@ -188,7 +188,7 @@ export class ExecutionMetricsEvaluator implements Evaluator { verdict: scoreToVerdict(score), assertions, expectedAspectCount: totalChecks || 1, - evaluatorRawRequest: { + graderRawRequest: { type: 'execution-metrics', config: this.extractConfiguredThresholds(), actual: this.filterDefinedMetrics(actualMetrics), diff --git a/packages/core/src/evaluation/evaluators/field-accuracy.ts b/packages/core/src/evaluation/graders/field-accuracy.ts similarity index 96% rename from packages/core/src/evaluation/evaluators/field-accuracy.ts rename to packages/core/src/evaluation/graders/field-accuracy.ts index 5c8a0a28e..70b52e82a 100644 --- a/packages/core/src/evaluation/evaluators/field-accuracy.ts +++ b/packages/core/src/evaluation/graders/field-accuracy.ts @@ -1,11 +1,11 @@ import type { AssertionEntry, - FieldAccuracyEvaluatorConfig, + FieldAccuracyGraderConfig, FieldConfig, JsonObject, } from '../types.js'; import { clampScore, deepEqual, parseJsonFromText, scoreToVerdict } from './scoring.js'; -import type { EvaluationContext, EvaluationScore, Evaluator } from './types.js'; +import type { EvaluationContext, EvaluationScore, Grader } from './types.js'; /** Result from evaluating a single field */ interface FieldResult { @@ -61,20 +61,20 @@ const MONTH_NAMES: Record = { december: 11, }; -export interface FieldAccuracyEvaluatorOptions { - readonly config: FieldAccuracyEvaluatorConfig; +export interface FieldAccuracyGraderOptions { + readonly config: FieldAccuracyGraderConfig; } /** - * FieldAccuracyEvaluator compares extracted structured data against expected values + * FieldAccuracyGrader compares extracted structured data against expected values * with configurable matching strategies (exact, fuzzy, numeric_tolerance, date). 
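 * For example (illustrative): expected '2024-03-01' can match 'March 1, 2024'
 * under the date strategy, and close-but-not-identical strings can pass under
 * fuzzy matching.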
*/ -export class FieldAccuracyEvaluator implements Evaluator { +export class FieldAccuracyGrader implements Grader { readonly kind = 'field-accuracy'; - private readonly config: FieldAccuracyEvaluatorConfig; + private readonly config: FieldAccuracyGraderConfig; - constructor(options: FieldAccuracyEvaluatorOptions) { + constructor(options: FieldAccuracyGraderOptions) { this.config = options.config; } diff --git a/packages/core/src/evaluation/graders/index.ts b/packages/core/src/evaluation/graders/index.ts new file mode 100644 index 000000000..107582aee --- /dev/null +++ b/packages/core/src/evaluation/graders/index.ts @@ -0,0 +1,83 @@ +// Types +export type { + ChildGraderResult, + EvaluationContext, + EvaluationScore, + EvaluationVerdict, + Grader, + GraderFactory, +} from './types.js'; + +// Scoring utilities +export { + DEFAULT_THRESHOLD, + PASS_THRESHOLD, + clampScore, + deepEqual, + extractJsonBlob, + isNonEmptyString, + negateScore, + parseJsonFromText, + parseJsonSafe, + scoreToVerdict, +} from './scoring.js'; + +// Graders +export { CodeGrader, executeScript } from './code-grader.js'; +export type { CodeGraderOptions } from './code-grader.js'; + +export { CompositeGrader } from './composite.js'; +export type { CompositeGraderOptions } from './composite.js'; + +export { CostGrader } from './cost.js'; +export type { CostGraderOptions } from './cost.js'; + +export { ExecutionMetricsGrader } from './execution-metrics.js'; +export type { ExecutionMetricsGraderOptions } from './execution-metrics.js'; + +export { FieldAccuracyGrader } from './field-accuracy.js'; +export type { FieldAccuracyGraderOptions } from './field-accuracy.js'; + +export { LatencyGrader } from './latency.js'; +export type { LatencyGraderOptions } from './latency.js'; + +export { + LlmGrader, + buildOutputSchema, + buildRubricOutputSchema, + buildScoreRangeOutputSchema, + calculateRubricScore, + DEFAULT_GRADER_TEMPLATE, + extractImageBlocks, + substituteVariables, + freeformEvaluationSchema, + rubricEvaluationSchema, +} from './llm-grader.js'; +export type { LlmGraderOptions } from './llm-grader.js'; + +export { SkillTriggerGrader } from './skill-trigger.js'; + +export { assembleLlmGraderPrompt } from './llm-grader-prompt.js'; +export type { LlmGraderPromptAssembly } from './llm-grader-prompt.js'; + +export { TokenUsageGrader } from './token-usage.js'; +export type { TokenUsageGraderOptions } from './token-usage.js'; + +export { ToolTrajectoryGrader } from './tool-trajectory.js'; +export type { ToolTrajectoryGraderOptions } from './tool-trajectory.js'; + +// Deterministic assertions +export { + runContainsAssertion, + runContainsAnyAssertion, + runContainsAllAssertion, + runIcontainsAssertion, + runIcontainsAnyAssertion, + runIcontainsAllAssertion, + runStartsWithAssertion, + runEndsWithAssertion, + runEqualsAssertion, + runIsJsonAssertion, + runRegexAssertion, +} from './assertions.js'; +export type { AssertionResult } from './assertions.js'; diff --git a/packages/core/src/evaluation/evaluators/inline-assert.ts b/packages/core/src/evaluation/graders/inline-assert.ts similarity index 83% rename from packages/core/src/evaluation/evaluators/inline-assert.ts rename to packages/core/src/evaluation/graders/inline-assert.ts index eb996c11a..33e556a88 100644 --- a/packages/core/src/evaluation/evaluators/inline-assert.ts +++ b/packages/core/src/evaluation/graders/inline-assert.ts @@ -1,13 +1,13 @@ import type { AssertFn } from '../assertions.js'; import type { JsonObject } from '../types.js'; import { clampScore, scoreToVerdict 
} from './scoring.js'; -import type { EvaluationContext, EvaluationScore, Evaluator } from './types.js'; +import type { EvaluationContext, EvaluationScore, Grader } from './types.js'; /** - * Evaluator that wraps an inline AssertFn and runs it in-process. + * Grader that wraps an inline AssertFn and runs it in-process. * No subprocess, no stdin/stdout -- just calls the function directly. */ -export class InlineAssertEvaluator implements Evaluator { +export class InlineAssertGrader implements Grader { readonly kind = 'inline-assert'; constructor( diff --git a/packages/core/src/evaluation/evaluators/latency.ts b/packages/core/src/evaluation/graders/latency.ts similarity index 69% rename from packages/core/src/evaluation/evaluators/latency.ts rename to packages/core/src/evaluation/graders/latency.ts index dd8996bdb..74bb195b3 100644 --- a/packages/core/src/evaluation/evaluators/latency.ts +++ b/packages/core/src/evaluation/graders/latency.ts @@ -1,20 +1,20 @@ -import type { LatencyEvaluatorConfig } from '../types.js'; -import type { EvaluationContext, EvaluationScore, Evaluator } from './types.js'; +import type { LatencyGraderConfig } from '../types.js'; +import type { EvaluationContext, EvaluationScore, Grader } from './types.js'; -export interface LatencyEvaluatorOptions { - readonly config: LatencyEvaluatorConfig; +export interface LatencyGraderOptions { + readonly config: LatencyGraderConfig; } /** - * Evaluator that checks execution duration against a threshold. + * Grader that checks execution duration against a threshold. * Uses durationMs from the evaluation context. */ -export class LatencyEvaluator implements Evaluator { +export class LatencyGrader implements Grader { readonly kind = 'latency'; - private readonly config: LatencyEvaluatorConfig; + private readonly config: LatencyGraderConfig; - constructor(options: LatencyEvaluatorOptions) { + constructor(options: LatencyGraderOptions) { this.config = options.config; } @@ -29,7 +29,7 @@ export class LatencyEvaluator implements Evaluator { verdict: 'fail', assertions: [{ text: 'No duration data available in trace', passed: false }], expectedAspectCount: 1, - evaluatorRawRequest: { + graderRawRequest: { type: 'latency', threshold, durationMs: null, @@ -49,7 +49,7 @@ export class LatencyEvaluator implements Evaluator { : { text: `Duration ${durationMs}ms > ${threshold}ms threshold`, passed: false }, ], expectedAspectCount: 1, - evaluatorRawRequest: { + graderRawRequest: { type: 'latency', threshold, durationMs, diff --git a/packages/core/src/evaluation/evaluators/llm-grader-prompt.ts b/packages/core/src/evaluation/graders/llm-grader-prompt.ts similarity index 88% rename from packages/core/src/evaluation/evaluators/llm-grader-prompt.ts rename to packages/core/src/evaluation/graders/llm-grader-prompt.ts index b8d80feff..1cc7774bb 100644 --- a/packages/core/src/evaluation/evaluators/llm-grader-prompt.ts +++ b/packages/core/src/evaluation/graders/llm-grader-prompt.ts @@ -1,9 +1,9 @@ import type { Message } from '../providers/types.js'; import { TEMPLATE_VARIABLES } from '../template-variables.js'; -import type { EvalTest, LlmGraderEvaluatorConfig, RubricItem } from '../types.js'; +import type { EvalTest, LlmGraderConfig, RubricItem } from '../types.js'; import type { PromptInputs } from '../yaml-parser.js'; import { - DEFAULT_EVALUATOR_TEMPLATE, + DEFAULT_GRADER_TEMPLATE, buildOutputSchema, buildRubricOutputSchema, buildScoreRangeOutputSchema, @@ -21,10 +21,10 @@ export function assembleLlmGraderPrompt(input: { evalCase: EvalTest; 
candidate: string; promptInputs: PromptInputs; - evaluatorConfig?: LlmGraderEvaluatorConfig; + evaluatorConfig?: LlmGraderConfig; output?: readonly Message[]; fileChanges?: string; - evaluatorTemplateOverride?: string; + graderTemplateOverride?: string; }): LlmGraderPromptAssembly { const { evalCase, @@ -32,7 +32,7 @@ export function assembleLlmGraderPrompt(input: { promptInputs, evaluatorConfig, fileChanges, - evaluatorTemplateOverride, + graderTemplateOverride, } = input; const rubrics = evaluatorConfig?.rubrics; @@ -46,13 +46,7 @@ export function assembleLlmGraderPrompt(input: { return assembleChecklist(evalCase, candidate, promptInputs, rubrics, fileChanges); } - return assembleFreeform( - evalCase, - candidate, - promptInputs, - fileChanges, - evaluatorTemplateOverride, - ); + return assembleFreeform(evalCase, candidate, promptInputs, fileChanges, graderTemplateOverride); } function assembleFreeform( @@ -60,7 +54,7 @@ function assembleFreeform( candidate: string, promptInputs: PromptInputs, fileChanges?: string, - evaluatorTemplateOverride?: string, + graderTemplateOverride?: string, ): LlmGraderPromptAssembly { const formattedQuestion = promptInputs.question && promptInputs.question.trim().length > 0 @@ -80,11 +74,11 @@ function assembleFreeform( }; const systemPrompt = buildOutputSchema(); - const template = evaluatorTemplateOverride ?? DEFAULT_EVALUATOR_TEMPLATE; + const template = graderTemplateOverride ?? DEFAULT_GRADER_TEMPLATE; let userPrompt = substituteVariables(template, variables); // Append file_changes section to default template only when present - if (fileChanges && !evaluatorTemplateOverride) { + if (fileChanges && !graderTemplateOverride) { userPrompt += `\n\n[[ ## file_changes ## ]]\n${fileChanges}`; } @@ -109,7 +103,7 @@ function assembleChecklist( : evalCase.question; const parts: string[] = [ - 'You are an expert evaluator. Evaluate the candidate answer against each rubric item below.', + 'You are an expert grader. Evaluate the candidate answer against each rubric item below.', '', '[[ ## question ## ]]', formattedQuestion, @@ -163,7 +157,7 @@ function assembleScoreRange( : evalCase.question; const parts: string[] = [ - 'You are an expert evaluator. Score the candidate answer on each criterion below using the provided score ranges.', + 'You are an expert grader. 
Score the candidate answer on each criterion below using the provided score ranges.', 'For each criterion, output an integer score from 0 to 10 based on which score range best matches the answer.', '', '[[ ## question ## ]]', diff --git a/packages/core/src/evaluation/evaluators/llm-grader.ts b/packages/core/src/evaluation/graders/llm-grader.ts similarity index 94% rename from packages/core/src/evaluation/evaluators/llm-grader.ts rename to packages/core/src/evaluation/graders/llm-grader.ts index 6aa9ab955..47812ef47 100644 --- a/packages/core/src/evaluation/evaluators/llm-grader.ts +++ b/packages/core/src/evaluation/graders/llm-grader.ts @@ -16,7 +16,7 @@ import { DEPRECATED_TEMPLATE_VARIABLES, TEMPLATE_VARIABLES } from '../template-v import type { TokenUsage } from '../trace.js'; import type { AssertionEntry, JsonObject, RubricItem } from '../types.js'; import { clampScore, isNonEmptyString, parseJsonFromText, scoreToVerdict } from './scoring.js'; -import type { EvaluationContext, EvaluationScore, Evaluator } from './types.js'; +import type { EvaluationContext, EvaluationScore, Grader } from './types.js'; // --------------------------------------------------------------------------- // Constants for built-in agent mode (filesystem tools) @@ -67,10 +67,10 @@ const BINARY_EXTENSIONS = new Set([ ]); /** - * Default evaluator template for the user prompt (variables will be substituted). - * Custom evaluators can override this via evaluatorTemplate option. + * Default grader template for the user prompt (variables will be substituted). + * Custom graders can override this via graderTemplate option. */ -export const DEFAULT_EVALUATOR_TEMPLATE = `You are an expert evaluator. Your goal is to grade the answer based on how well it achieves the criteria for the original task. +export const DEFAULT_GRADER_TEMPLATE = `You are an expert grader. Your goal is to grade the answer based on how well it achieves the criteria for the original task. Use the reference_answer as a gold standard for a high-quality response (if provided). The reference_answer may be a simple text response, or it may contain a sequence of expected agent messages including tool calls. When it contains multiple messages, the last message represents the final expected answer. The answer does not need to match it verbatim, but should capture the key points and follow the same spirit. @@ -90,13 +90,13 @@ Be concise and focused in your evaluation. Provide succinct, specific feedback r type GraderProviderResolver = (context: EvaluationContext) => Promise; -export interface LlmGraderEvaluatorOptions { +export interface LlmGraderOptions { readonly resolveGraderProvider: GraderProviderResolver; /** @deprecated Use `resolveGraderProvider` instead. */ readonly resolveJudgeProvider?: GraderProviderResolver; readonly maxOutputTokens?: number; readonly temperature?: number; - readonly evaluatorTemplate?: string; + readonly graderTemplate?: string; readonly maxSteps?: number; readonly graderTargetProvider?: Provider; /** @deprecated Use `graderTargetProvider` instead. 
*/ @@ -173,22 +173,22 @@ function resolveContentBasePath(context: EvaluationContext): string | undefined return undefined; } -export class LlmGraderEvaluator implements Evaluator { +export class LlmGrader implements Grader { readonly kind = 'llm-grader'; private readonly resolveGraderProvider: GraderProviderResolver; private readonly maxOutputTokens?: number; private readonly temperature?: number; - private readonly evaluatorTemplate?: string; + private readonly graderTemplate?: string; private readonly maxSteps: number; private readonly graderTargetProvider?: Provider; - constructor(options: LlmGraderEvaluatorOptions) { + constructor(options: LlmGraderOptions) { this.resolveGraderProvider = (options.resolveGraderProvider ?? options.resolveJudgeProvider) as NonNullable; this.maxOutputTokens = options.maxOutputTokens; this.temperature = options.temperature; - this.evaluatorTemplate = options.evaluatorTemplate; + this.graderTemplate = options.graderTemplate; this.maxSteps = Math.min(options.maxSteps ?? DEFAULT_MAX_STEPS, MAX_STEPS_LIMIT); this.graderTargetProvider = options.graderTargetProvider ?? options.judgeTargetProvider; } @@ -282,20 +282,20 @@ export class LlmGraderEvaluator implements Evaluator { const systemPrompt = buildOutputSchema(); // Build user prompt based on custom template or default template - const evaluatorTemplate = - context.evaluatorTemplateOverride ?? this.evaluatorTemplate ?? DEFAULT_EVALUATOR_TEMPLATE; + const graderTemplate = + context.graderTemplateOverride ?? this.graderTemplate ?? DEFAULT_GRADER_TEMPLATE; // Warn once per run when custom templates use deprecated _text variable names - warnDeprecatedTemplateVars(evaluatorTemplate); + warnDeprecatedTemplateVars(graderTemplate); - let userPrompt = substituteVariables(evaluatorTemplate, variables); + let userPrompt = substituteVariables(graderTemplate, variables); // Append file_changes section to default template only when present - if (context.fileChanges && !context.evaluatorTemplateOverride && !this.evaluatorTemplate) { + if (context.fileChanges && !context.graderTemplateOverride && !this.graderTemplate) { userPrompt += `\n\n[[ ## file_changes ## ]]\n${context.fileChanges}`; } - const evaluatorRawRequest: JsonObject = { + const graderRawRequest: JsonObject = { userPrompt, systemPrompt, }; @@ -323,7 +323,7 @@ export class LlmGraderEvaluator implements Evaluator { verdict: scoreToVerdict(score), assertions, expectedAspectCount: Math.max(assertions.length, 1), - evaluatorRawRequest, + graderRawRequest, graderTarget: graderProvider.targetName, details: data.details as JsonObject | undefined, tokenUsage, @@ -339,7 +339,7 @@ export class LlmGraderEvaluator implements Evaluator { verdict: 'skip' as const, assertions: [{ text: `Grader parse failure after 3 attempts: ${message}`, passed: false }], expectedAspectCount: 1, - evaluatorRawRequest, + graderRawRequest, graderTarget: graderProvider.targetName, }; } @@ -366,7 +366,7 @@ export class LlmGraderEvaluator implements Evaluator { const prompt = this.buildRubricPrompt(context, rubrics); const systemPrompt = buildRubricOutputSchema(); - const evaluatorRawRequest: JsonObject = { + const graderRawRequest: JsonObject = { userPrompt: prompt, systemPrompt, }; @@ -391,7 +391,7 @@ export class LlmGraderEvaluator implements Evaluator { verdict, assertions, expectedAspectCount: rubrics.length, - evaluatorRawRequest, + graderRawRequest, graderTarget: graderProvider.targetName, tokenUsage, }; @@ -404,7 +404,7 @@ export class LlmGraderEvaluator implements Evaluator { verdict: 'skip' 
as const, assertions: [{ text: `Grader parse failure after 3 attempts: ${message}`, passed: false }], expectedAspectCount: rubrics.length, - evaluatorRawRequest, + graderRawRequest, graderTarget: graderProvider.targetName, }; } @@ -422,7 +422,7 @@ export class LlmGraderEvaluator implements Evaluator { const prompt = this.buildScoreRangePrompt(context, rubrics); const systemPrompt = buildScoreRangeOutputSchema(); - const evaluatorRawRequest: JsonObject = { + const graderRawRequest: JsonObject = { userPrompt: prompt, systemPrompt, }; @@ -447,7 +447,7 @@ export class LlmGraderEvaluator implements Evaluator { verdict, assertions, expectedAspectCount: rubrics.length, - evaluatorRawRequest, + graderRawRequest, graderTarget: graderProvider.targetName, details, tokenUsage, @@ -461,7 +461,7 @@ export class LlmGraderEvaluator implements Evaluator { verdict: 'skip' as const, assertions: [{ text: `Grader parse failure after 3 attempts: ${message}`, passed: false }], expectedAspectCount: rubrics.length, - evaluatorRawRequest, + graderRawRequest, graderTarget: graderProvider.targetName, }; } @@ -500,7 +500,7 @@ export class LlmGraderEvaluator implements Evaluator { const fsTools = createFilesystemTools(workspacePath); - const evaluatorRawRequest: JsonObject = { + const graderRawRequest: JsonObject = { mode: 'built-in', systemPrompt, userPrompt, @@ -528,7 +528,7 @@ export class LlmGraderEvaluator implements Evaluator { return this.parseAgentResult( text, rubrics, - evaluatorRawRequest, + graderRawRequest, details, graderProvider.targetName, ); @@ -539,7 +539,7 @@ export class LlmGraderEvaluator implements Evaluator { verdict: 'fail', assertions: [{ text: `llm-grader built-in evaluation failed: ${message}`, passed: false }], expectedAspectCount: 1, - evaluatorRawRequest, + graderRawRequest, graderTarget: graderProvider.targetName, details: { mode: 'built-in', error: message }, }; @@ -583,7 +583,7 @@ export class LlmGraderEvaluator implements Evaluator { const workspacePath = context.workspacePath; const prompt = this.buildDelegatedPrompt(context); - const evaluatorRawRequest: JsonObject = { + const graderRawRequest: JsonObject = { mode: modeLabel, grader_target: provider.targetName, prompt, @@ -606,7 +606,7 @@ export class LlmGraderEvaluator implements Evaluator { { text: `llm-grader ${modeLabel} returned no assistant response`, passed: false }, ], expectedAspectCount: 1, - evaluatorRawRequest, + graderRawRequest, graderTarget: provider.targetName, details: { mode: modeLabel, grader_target: provider.targetName }, }; @@ -623,7 +623,7 @@ export class LlmGraderEvaluator implements Evaluator { return this.parseAgentResult( assistantContent, rubrics, - evaluatorRawRequest, + graderRawRequest, details, provider.targetName, ); @@ -636,7 +636,7 @@ export class LlmGraderEvaluator implements Evaluator { { text: `llm-grader ${modeLabel} evaluation failed: ${message}`, passed: false }, ], expectedAspectCount: 1, - evaluatorRawRequest, + graderRawRequest, graderTarget: provider.targetName, details: { mode: modeLabel, @@ -660,7 +660,7 @@ export class LlmGraderEvaluator implements Evaluator { const rubrics = config?.type === 'llm-grader' ? 
config.rubrics : undefined; const parts: string[] = [ - 'You are an expert evaluator with access to the workspace filesystem.', + 'You are an expert grader with access to the workspace filesystem.', 'Use the provided tools to investigate the workspace and verify the criteria are met.', 'Thoroughly examine relevant files before making your assessment.', '', @@ -697,9 +697,9 @@ export class LlmGraderEvaluator implements Evaluator { [TEMPLATE_VARIABLES.EXPECTED_OUTPUT_TEXT]: (context.evalCase.reference_answer ?? '').trim(), }; - if (this.evaluatorTemplate) { - warnDeprecatedTemplateVars(this.evaluatorTemplate); - return substituteVariables(this.evaluatorTemplate, variables); + if (this.graderTemplate) { + warnDeprecatedTemplateVars(this.graderTemplate); + return substituteVariables(this.graderTemplate, variables); } const config = context.evaluator; @@ -759,7 +759,7 @@ export class LlmGraderEvaluator implements Evaluator { const config = context.evaluator; const rubrics = config?.type === 'llm-grader' ? config.rubrics : undefined; - if (this.evaluatorTemplate) { + if (this.graderTemplate) { const variables: Record = { [TEMPLATE_VARIABLES.CRITERIA]: context.evalCase.criteria.trim(), [TEMPLATE_VARIABLES.INPUT]: formattedQuestion.trim(), @@ -771,8 +771,8 @@ export class LlmGraderEvaluator implements Evaluator { [TEMPLATE_VARIABLES.OUTPUT_TEXT]: context.candidate.trim(), [TEMPLATE_VARIABLES.EXPECTED_OUTPUT_TEXT]: (context.evalCase.reference_answer ?? '').trim(), }; - warnDeprecatedTemplateVars(this.evaluatorTemplate); - const customPrompt = substituteVariables(this.evaluatorTemplate, variables); + warnDeprecatedTemplateVars(this.graderTemplate); + const customPrompt = substituteVariables(this.graderTemplate, variables); const outputSchema = rubrics && rubrics.length > 0 ? buildRubricOutputSchema() : buildOutputSchema(); @@ -781,7 +781,7 @@ export class LlmGraderEvaluator implements Evaluator { } const parts: string[] = [ - 'You are an expert evaluator. Investigate the workspace to verify the criteria are met.', + 'You are an expert grader. Investigate the workspace to verify the criteria are met.', '', '[[ ## question ## ]]', formattedQuestion, @@ -828,7 +828,7 @@ export class LlmGraderEvaluator implements Evaluator { private parseAgentResult( text: string, rubrics: readonly RubricItem[] | undefined, - evaluatorRawRequest: JsonObject, + graderRawRequest: JsonObject, details: JsonObject, graderTarget?: string, ): EvaluationScore { @@ -843,7 +843,7 @@ export class LlmGraderEvaluator implements Evaluator { verdict, assertions, expectedAspectCount: rubrics.length, - evaluatorRawRequest, + graderRawRequest, graderTarget, details, }; @@ -860,7 +860,7 @@ export class LlmGraderEvaluator implements Evaluator { verdict: scoreToVerdict(score), assertions, expectedAspectCount: Math.max(assertions.length, 1), - evaluatorRawRequest, + graderRawRequest, graderTarget, details: data.details && Object.keys(data.details).length > 0 @@ -878,7 +878,7 @@ export class LlmGraderEvaluator implements Evaluator { }, ], expectedAspectCount: 1, - evaluatorRawRequest, + graderRawRequest, graderTarget, details, }; @@ -902,7 +902,7 @@ export class LlmGraderEvaluator implements Evaluator { : context.evalCase.question; const parts: string[] = [ - 'You are an expert evaluator. Score the candidate answer on each criterion below using the provided score ranges.', + 'You are an expert grader. 
Score the candidate answer on each criterion below using the provided score ranges.', 'For each criterion, output an integer score from 0 to 10 based on which score range best matches the answer.', '', '[[ ## question ## ]]', @@ -965,7 +965,7 @@ export class LlmGraderEvaluator implements Evaluator { : context.evalCase.question; const parts: string[] = [ - 'You are an expert evaluator. Evaluate the candidate answer against each rubric item below.', + 'You are an expert grader. Evaluate the candidate answer against each rubric item below.', '', '[[ ## question ## ]]', formattedQuestion, @@ -1142,7 +1142,7 @@ export class LlmGraderEvaluator implements Evaluator { /** * Build the mandatory output schema that all evaluators must follow. - * This schema is always appended to the evaluator template. + * This schema is always appended to the grader template. */ export function buildOutputSchema(): string { return [ @@ -1194,7 +1194,7 @@ function sumTokenUsage( } export function buildRubricOutputSchema(): string { - return `You are an expert evaluator. Evaluate the candidate answer against each rubric item. + return `You are an expert grader. Evaluate the candidate answer against each rubric item. You must return a valid JSON object matching this schema: { "checks": [ @@ -1237,7 +1237,7 @@ export function warnDeprecatedTemplateVars(template: string): void { if (used.length > 0) { warnedTemplateStrings.add(template); console.warn( - `${ANSI_YELLOW}⚠ Deprecated template variables detected (they still work but will be removed in a future version):\n ${used.join('\n ')}\n Update your custom evaluator template to use the new names.${ANSI_RESET}`, + `${ANSI_YELLOW}⚠ Deprecated template variables detected (they still work but will be removed in a future version):\n ${used.join('\n ')}\n Update your custom grader template to use the new names.${ANSI_RESET}`, ); } } @@ -1286,7 +1286,7 @@ export function calculateRubricScore( * Build the output schema for score-range rubric evaluation. */ export function buildScoreRangeOutputSchema(): string { - return `You are an expert evaluator. Score the candidate answer on each criterion. + return `You are an expert grader. Score the candidate answer on each criterion. You must return a valid JSON object matching this schema: { "checks": [ diff --git a/packages/core/src/evaluation/evaluators/prompt-resolution.ts b/packages/core/src/evaluation/graders/prompt-resolution.ts similarity index 93% rename from packages/core/src/evaluation/evaluators/prompt-resolution.ts rename to packages/core/src/evaluation/graders/prompt-resolution.ts index 04e9df5dc..4e9bf7a5a 100644 --- a/packages/core/src/evaluation/evaluators/prompt-resolution.ts +++ b/packages/core/src/evaluation/graders/prompt-resolution.ts @@ -1,13 +1,13 @@ /** * Prompt resolution utilities for LLM judge evaluators. * - * Extracted from orchestrator.ts to enable reuse by the evaluator registry. + * Extracted from orchestrator.ts to enable reuse by the grader registry. * * Key behavior: When a user writes `prompt: "some text"` in an assertion, * `resolveCustomPrompt()` returns that text. The caller must then decide * whether the text is a **full template** (contains `{{output}}` etc.) or * **bare criteria** (no template variables). Use `containsTemplateVariables()` - * to distinguish: full templates become `evaluatorTemplateOverride`, while + * to distinguish: full templates become `graderTemplateOverride`, while * bare criteria are injected into the default template's `{{criteria}}` slot. 
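 *
 * A minimal caller-side sketch (the surrounding wiring shown here is
 * illustrative, not part of this module):
 *
 * @example
 * const text = await resolveCustomPrompt(ctx);
 * if (text !== undefined) {
 *   if (containsTemplateVariables(text)) {
 *     // full template: replaces the default grader template wholesale
 *     opts = { ...opts, graderTemplateOverride: text };
 *   } else {
 *     // bare criteria: injected into the default template's {{criteria}} slot
 *     evalCase = { ...evalCase, criteria: text };
 *   }
 * }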
*/ @@ -19,7 +19,7 @@ import type { Message } from '../providers/types.js'; import { VALID_TEMPLATE_VARIABLES } from '../template-variables.js'; import type { TraceSummary } from '../trace.js'; import type { EvalTest, PromptScriptConfig } from '../types.js'; -import { executeScript } from './code-evaluator.js'; +import { executeScript } from './code-grader.js'; export interface ResolveCustomPromptContext { readonly evalCase: EvalTest; @@ -77,7 +77,7 @@ export async function resolveCustomPrompt( /** * Checks whether a prompt string contains any known `{{ variable }}` template * placeholders (e.g. `{{output}}`, `{{input}}`). If it does, the string is a - * full evaluator template and should replace the default template. If not, + * full grader template and should replace the default template. If not, * it's bare criteria text and should be injected into the `{{criteria}}` slot * of the default template. */ diff --git a/packages/core/src/evaluation/evaluators/scoring.ts b/packages/core/src/evaluation/graders/scoring.ts similarity index 100% rename from packages/core/src/evaluation/evaluators/scoring.ts rename to packages/core/src/evaluation/graders/scoring.ts diff --git a/packages/core/src/evaluation/evaluators/skill-trigger.ts b/packages/core/src/evaluation/graders/skill-trigger.ts similarity index 91% rename from packages/core/src/evaluation/evaluators/skill-trigger.ts rename to packages/core/src/evaluation/graders/skill-trigger.ts index 7466c393d..3fc462df4 100644 --- a/packages/core/src/evaluation/evaluators/skill-trigger.ts +++ b/packages/core/src/evaluation/graders/skill-trigger.ts @@ -19,15 +19,15 @@ * names (input.skill, input.file_path) regardless of provider. */ -import type { SkillTriggerEvaluatorConfig } from '../types.js'; -import type { EvaluationContext, EvaluationScore, Evaluator } from './types.js'; +import type { SkillTriggerGraderConfig } from '../types.js'; +import type { EvaluationContext, EvaluationScore, Grader } from './types.js'; -export class SkillTriggerEvaluator implements Evaluator { +export class SkillTriggerGrader implements Grader { readonly kind = 'skill-trigger'; - private readonly config: SkillTriggerEvaluatorConfig; + private readonly config: SkillTriggerGraderConfig; - constructor(config: SkillTriggerEvaluatorConfig) { + constructor(config: SkillTriggerGraderConfig) { this.config = config; } diff --git a/packages/core/src/evaluation/evaluators/token-usage.ts b/packages/core/src/evaluation/graders/token-usage.ts similarity index 81% rename from packages/core/src/evaluation/evaluators/token-usage.ts rename to packages/core/src/evaluation/graders/token-usage.ts index bbd4d6a32..22997d588 100644 --- a/packages/core/src/evaluation/evaluators/token-usage.ts +++ b/packages/core/src/evaluation/graders/token-usage.ts @@ -1,20 +1,20 @@ -import type { AssertionEntry, TokenUsageEvaluatorConfig } from '../types.js'; -import type { EvaluationContext, EvaluationScore, Evaluator } from './types.js'; +import type { AssertionEntry, TokenUsageGraderConfig } from '../types.js'; +import type { EvaluationContext, EvaluationScore, Grader } from './types.js'; -export interface TokenUsageEvaluatorOptions { - readonly config: TokenUsageEvaluatorConfig; +export interface TokenUsageGraderOptions { + readonly config: TokenUsageGraderConfig; } /** - * Evaluator that checks provider-reported token usage against configured limits. + * Grader that checks provider-reported token usage against configured limits. * Uses tokenUsage from the evaluation context. 
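 *
 * A usage sketch; the config field names are assumed from the
 * graderRawRequest keys below (max_total / max_input) and the limits are
 * hypothetical:
 *
 * @example
 * const grader = new TokenUsageGrader({
 *   config: { name: 'token-budget', type: 'token-usage', max_total: 8000, max_input: 4000 },
 * });
 * const score = await grader.evaluate(context); // fails when reported usage exceeds a limit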
*/ -export class TokenUsageEvaluator implements Evaluator { +export class TokenUsageGrader implements Grader { readonly kind = 'token-usage'; - private readonly config: TokenUsageEvaluatorConfig; + private readonly config: TokenUsageGraderConfig; - constructor(options: TokenUsageEvaluatorOptions) { + constructor(options: TokenUsageGraderOptions) { this.config = options.config; } @@ -36,7 +36,7 @@ export class TokenUsageEvaluator implements Evaluator { verdict: 'fail', assertions: [{ text: 'No token usage data available in trace', passed: false }], expectedAspectCount, - evaluatorRawRequest: { + graderRawRequest: { type: 'token-usage', max_total: maxTotal ?? null, max_input: maxInput ?? null, @@ -84,7 +84,7 @@ export class TokenUsageEvaluator implements Evaluator { verdict: passed ? 'pass' : 'fail', assertions, expectedAspectCount, - evaluatorRawRequest: { + graderRawRequest: { type: 'token-usage', max_total: maxTotal ?? null, max_input: maxInput ?? null, diff --git a/packages/core/src/evaluation/evaluators/tool-trajectory.ts b/packages/core/src/evaluation/graders/tool-trajectory.ts similarity index 97% rename from packages/core/src/evaluation/evaluators/tool-trajectory.ts rename to packages/core/src/evaluation/graders/tool-trajectory.ts index cade893a9..0d5454e60 100644 --- a/packages/core/src/evaluation/evaluators/tool-trajectory.ts +++ b/packages/core/src/evaluation/graders/tool-trajectory.ts @@ -1,13 +1,13 @@ import type { Message } from '../providers/types.js'; import type { ArgsMatchMode, - ToolTrajectoryEvaluatorConfig, ToolTrajectoryExpectedItem, + ToolTrajectoryGraderConfig, TraceSummary, } from '../trace.js'; import type { AssertionEntry } from '../types.js'; import { deepEqual, scoreToVerdict } from './scoring.js'; -import type { EvaluationContext, EvaluationScore, Evaluator } from './types.js'; +import type { EvaluationContext, EvaluationScore, Grader } from './types.js'; /** Extracted tool call with optional arguments and timing */ interface ExtractedToolCall { @@ -38,7 +38,7 @@ function getNestedValue(obj: Record, path: string): unknown { */ function resolveArgsMatchMode( item: ToolTrajectoryExpectedItem, - config: ToolTrajectoryEvaluatorConfig, + config: ToolTrajectoryGraderConfig, ): ArgsMatchMode | readonly string[] { return item.argsMatch ?? config.argsMatch ?? 
'exact'; } @@ -149,16 +149,16 @@ function checkLatency( }; } -export interface ToolTrajectoryEvaluatorOptions { - readonly config: ToolTrajectoryEvaluatorConfig; +export interface ToolTrajectoryGraderOptions { + readonly config: ToolTrajectoryGraderConfig; } -export class ToolTrajectoryEvaluator implements Evaluator { +export class ToolTrajectoryGrader implements Grader { readonly kind = 'tool-trajectory'; - private readonly config: ToolTrajectoryEvaluatorConfig; + private readonly config: ToolTrajectoryGraderConfig; - constructor(options: ToolTrajectoryEvaluatorOptions) { + constructor(options: ToolTrajectoryGraderOptions) { this.config = options.config; } diff --git a/packages/core/src/evaluation/evaluators/types.ts b/packages/core/src/evaluation/graders/types.ts similarity index 87% rename from packages/core/src/evaluation/evaluators/types.ts rename to packages/core/src/evaluation/graders/types.ts index 639ccd117..14e4fd445 100644 --- a/packages/core/src/evaluation/evaluators/types.ts +++ b/packages/core/src/evaluation/graders/types.ts @@ -6,7 +6,7 @@ import type { DockerWorkspaceConfig, EvalTest, EvaluationVerdict, - EvaluatorConfig, + GraderConfig, JsonObject, } from '../types.js'; @@ -33,8 +33,8 @@ export interface EvaluationContext { readonly graderProvider?: Provider; /** @deprecated Use `graderProvider` instead */ readonly judgeProvider?: Provider; - readonly evaluatorTemplateOverride?: string; - readonly evaluator?: EvaluatorConfig; + readonly graderTemplateOverride?: string; + readonly evaluator?: GraderConfig; /** Output messages from agent execution (primary source for tool trajectory) */ readonly output?: readonly Message[]; /** Lightweight summary of trace events (if available) */ @@ -68,8 +68,8 @@ export interface EvaluationScore { readonly verdict: EvaluationVerdict; readonly assertions: readonly import('../types.js').AssertionEntry[]; readonly expectedAspectCount: number; - readonly evaluatorRawRequest?: JsonObject; - readonly scores?: readonly ChildEvaluatorResult[]; + readonly graderRawRequest?: JsonObject; + readonly scores?: readonly ChildGraderResult[]; /** Optional structured details from evaluators (e.g., TP/TN/FP/FN counts, alignments, per-turn scores). */ readonly details?: JsonObject; /** Token usage from LLM calls made by this evaluator (optional). */ @@ -78,26 +78,26 @@ export interface EvaluationScore { readonly graderTarget?: string; } -export interface ChildEvaluatorResult { +export interface ChildGraderResult { readonly name: string; readonly type: string; readonly score: number; readonly weight?: number; readonly verdict: EvaluationVerdict; readonly assertions: readonly import('../types.js').AssertionEntry[]; - readonly evaluatorRawRequest?: JsonObject; - readonly scores?: readonly ChildEvaluatorResult[]; + readonly graderRawRequest?: JsonObject; + readonly scores?: readonly ChildGraderResult[]; /** Optional structured details from evaluators (e.g., TP/TN/FP/FN counts, alignments, per-turn scores). */ readonly details?: JsonObject; /** Token usage from LLM calls made by this evaluator (optional). 
*/ readonly tokenUsage?: TokenUsage; } -export interface Evaluator { +export interface Grader { readonly kind: string; evaluate(context: EvaluationContext): Promise | EvaluationScore; } -export interface EvaluatorFactory { - create(config: EvaluatorConfig, context: EvaluationContext): Evaluator; +export interface GraderFactory { + create(config: GraderConfig, context: EvaluationContext): Grader; } diff --git a/packages/core/src/evaluation/loaders/agent-skills-parser.ts b/packages/core/src/evaluation/loaders/agent-skills-parser.ts index 1036a2971..2632375b4 100644 --- a/packages/core/src/evaluation/loaders/agent-skills-parser.ts +++ b/packages/core/src/evaluation/loaders/agent-skills-parser.ts @@ -1,7 +1,7 @@ import { readFile } from 'node:fs/promises'; import path from 'node:path'; -import type { EvalTest, EvaluatorConfig } from '../types.js'; +import type { EvalTest, GraderConfig } from '../types.js'; const ANSI_RED = '\u001b[31m'; const ANSI_RESET = '\u001b[0m'; @@ -44,7 +44,7 @@ export function isAgentSkillsFormat(parsed: unknown): parsed is AgentSkillsEvals * - id (number) → id (string) * - prompt → input: [{role: "user", content: prompt}] * - expected_output → expected_output: [{role: "assistant", content}] as JsonObject[] - * - assertions (string[]) → assertions: EvaluatorConfig[] (each → llm-grader) + * - assertions (string[]) → assertions: GraderConfig[] (each → llm-grader) * - files → metadata.agent_skills_files (resolved by #541) * - skill_name → metadata.skill_name */ @@ -92,10 +92,10 @@ export function parseAgentSkillsEvals( } // Promote assertions → llm-grader evaluators - let assertions: readonly EvaluatorConfig[] | undefined; + let assertions: readonly GraderConfig[] | undefined; if (evalCase.assertions && evalCase.assertions.length > 0) { assertions = evalCase.assertions.map( - (text, i): EvaluatorConfig => ({ + (text, i): GraderConfig => ({ name: `assertion-${i + 1}`, type: 'llm-grader', prompt: text, diff --git a/packages/core/src/evaluation/loaders/evaluator-parser.ts b/packages/core/src/evaluation/loaders/grader-parser.ts similarity index 95% rename from packages/core/src/evaluation/loaders/evaluator-parser.ts rename to packages/core/src/evaluation/loaders/grader-parser.ts index 8a878eee4..de4acf492 100644 --- a/packages/core/src/evaluation/loaders/evaluator-parser.ts +++ b/packages/core/src/evaluation/loaders/grader-parser.ts @@ -4,15 +4,15 @@ import { parse } from 'yaml'; import { normalizePreprocessorType } from '../content-preprocessor.js'; import { interpolateEnv } from '../interpolation.js'; -import type { ToolTrajectoryEvaluatorConfig, ToolTrajectoryExpectedItem } from '../trace.js'; +import type { ToolTrajectoryExpectedItem, ToolTrajectoryGraderConfig } from '../trace.js'; import type { ContentPreprocessorConfig, - EvaluatorConfig, - EvaluatorKind, + GraderConfig, + GraderKind, JsonObject, JsonValue, } from '../types.js'; -import { isEvaluatorKind } from '../types.js'; +import { isGraderKind } from '../types.js'; import { validateCustomPromptContent } from '../validation/prompt-validator.js'; import { resolveFileReference } from './file-resolver.js'; @@ -32,13 +32,13 @@ const MAX_ASSERTION_INCLUDE_DEPTH = 3; const PROMPT_FILE_PREFIX = 'file://'; /** - * Normalize evaluator type names from legacy snake_case to internal kebab-case. + * Normalize grader type names from legacy snake_case to internal kebab-case. 
* Accepts both forms for backward compatibility: * - snake_case: 'llm_grader' -> 'llm-grader' (legacy, still accepted) * - kebab-case: 'llm-grader' -> 'llm-grader' (preferred, passes through) * - single-word: 'contains' -> 'contains' (unchanged) */ -export function normalizeEvaluatorType(type: string): string { +export function normalizeGraderType(type: string): string { return type.replace(/_/g, '-'); } @@ -49,7 +49,7 @@ function isDeprecatedJudgeType(type: string): boolean { /** * Parse evaluators from eval case configuration. */ -export async function parseEvaluators( +export async function parseGraders( rawEvalCase: JsonObject & { readonly execution?: JsonValue; readonly assertions?: JsonValue; @@ -60,7 +60,7 @@ export async function parseEvaluators( searchRoots: readonly string[], evalId: string, defaultPreprocessors?: readonly ContentPreprocessorConfig[], -): Promise { +): Promise { const execution = rawEvalCase.execution; const executionObject = isJsonObject(execution) ? execution : undefined; @@ -78,14 +78,14 @@ export async function parseEvaluators( : (globalExecution?.assertions ?? globalExecution?.assert ?? globalExecution?.evaluators); // deprecated: use assertions // Parse case-level evaluators - const parsedCase = await parseEvaluatorList( + const parsedCase = await parseGraderList( caseEvaluators, searchRoots, evalId, defaultPreprocessors, ); // Parse root-level evaluators (appended after case-level) - const parsedRoot = await parseEvaluatorList( + const parsedRoot = await parseGraderList( rootEvaluators, searchRoots, evalId, @@ -97,7 +97,7 @@ export async function parseEvaluators( } // Case-level evaluators run first, root-level defaults appended - const evaluators: EvaluatorConfig[] = [...(parsedCase ?? []), ...(parsedRoot ?? [])]; + const evaluators: GraderConfig[] = [...(parsedCase ?? []), ...(parsedRoot ?? [])]; return evaluators.length > 0 ? evaluators : undefined; } @@ -204,14 +204,14 @@ async function loadAssertionTemplateEntries( ]; return ( - (await expandEvaluatorEntries(assertions, nestedSearchRoots, evalId, { + (await expandGraderEntries(assertions, nestedSearchRoots, evalId, { depth: nextDepth, chain: [...includeContext.chain, resolved.resolvedPath], })) ?? [] ); } -async function expandEvaluatorEntries( +async function expandGraderEntries( candidateEvaluators: JsonValue | undefined, searchRoots: readonly string[], evalId: string, @@ -245,15 +245,15 @@ async function expandEvaluatorEntries( } /** - * Parse a raw evaluator array into typed EvaluatorConfig objects. + * Parse a raw evaluator array into typed GraderConfig objects. 
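 *
 * An input/output sketch (values hypothetical; snake_case types are
 * normalized to kebab-case via normalizeGraderType):
 *
 * @example
 * // YAML entry:
 * //   - name: style-check
 * //     type: llm_grader
 * //     weight: 0.5
 * // parses to roughly: { name: 'style-check', type: 'llm-grader', weight: 0.5 }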
*/ -async function parseEvaluatorList( +async function parseGraderList( candidateEvaluators: JsonValue | undefined, searchRoots: readonly string[], evalId: string, defaultPreprocessors?: readonly ContentPreprocessorConfig[], -): Promise { - const expandedEvaluators = await expandEvaluatorEntries(candidateEvaluators, searchRoots, evalId); +): Promise { + const expandedEvaluators = await expandGraderEntries(candidateEvaluators, searchRoots, evalId); if (!expandedEvaluators) { return undefined; } @@ -304,7 +304,7 @@ async function parseEvaluatorList( return result; })(); - const evaluators: EvaluatorConfig[] = []; + const evaluators: GraderConfig[] = []; for (const rawEvaluator of processedEvaluators) { if (!isJsonObject(rawEvaluator)) { @@ -315,7 +315,7 @@ async function parseEvaluatorList( const rawName = asString(rawEvaluator.name); const rawType = rawEvaluator.type; // Normalize legacy snake_case YAML type names to internal kebab-case (e.g., 'llm_grader' -> 'llm-grader') - const typeValue = typeof rawType === 'string' ? normalizeEvaluatorType(rawType) : rawType; + const typeValue = typeof rawType === 'string' ? normalizeGraderType(rawType) : rawType; if (typeof typeValue === 'string' && isDeprecatedJudgeType(typeValue)) { logWarning( @@ -325,7 +325,7 @@ async function parseEvaluatorList( } // Unknown types are treated as custom assertion types (resolved via registry discovery) - const isCustomType = typeof typeValue === 'string' && !isEvaluatorKind(typeValue); + const isCustomType = typeof typeValue === 'string' && !isGraderKind(typeValue); if (typeof typeValue !== 'string') { logWarning(`Skipping evaluator with invalid type in '${evalId}'`); continue; @@ -336,7 +336,7 @@ async function parseEvaluatorList( // Auto-generate name from type if not provided const name = rawName ?? - (isCustomType ? typeValue : generateAssertionName(typeValue as EvaluatorKind, rawEvaluator)); + (isCustomType ? typeValue : generateAssertionName(typeValue as GraderKind, rawEvaluator)); if (!name) { logWarning(`Skipping evaluator with missing name in '${evalId}'`); @@ -371,13 +371,13 @@ async function parseEvaluatorList( } evaluators.push({ name, - type: customTypeName as unknown as EvaluatorKind, + type: customTypeName as unknown as GraderKind, ...(weight !== undefined ? { weight } : {}), ...(required !== undefined ? { required } : {}), ...(min_score !== undefined ? { min_score } : {}), ...(negate !== undefined ? { negate } : {}), ...(Object.keys(config).length > 0 ? { config } : {}), - } as EvaluatorConfig); + } as GraderConfig); continue; } @@ -522,7 +522,7 @@ async function parseEvaluatorList( typeof aggregatorType === 'string' ? aggregatorType === 'weighted_average' || aggregatorType === 'threshold' ? 
aggregatorType - : normalizeEvaluatorType(aggregatorType) + : normalizeGraderType(aggregatorType) : aggregatorType; if ( typeof normalizedAggregatorType === 'string' && @@ -545,7 +545,7 @@ async function parseEvaluatorList( continue; } - const expandedMembers = await expandEvaluatorEntries( + const expandedMembers = await expandGraderEntries( rawMembers, searchRoots, `${evalId}:${name}`, @@ -555,7 +555,7 @@ async function parseEvaluatorList( } // Recursively parse member evaluators - const memberEvaluators: EvaluatorConfig[] = []; + const memberEvaluators: GraderConfig[] = []; for (const rawMember of expandedMembers) { if (!isJsonObject(rawMember)) { logWarning(`Skipping invalid member evaluator in composite '${name}' (expected object)`); @@ -565,13 +565,13 @@ async function parseEvaluatorList( const memberName = asString(rawMember.name); const memberType = rawMember.type; - if (!memberName || !isEvaluatorKind(memberType)) { + if (!memberName || !isGraderKind(memberType)) { logWarning(`Skipping member evaluator with invalid name/type in composite '${name}'`); continue; } // Parse member evaluator (reuse existing logic for code, llm-grader, code-grader) - const memberConfigs = await parseEvaluators( + const memberConfigs = await parseGraders( { evaluators: [rawMember] }, undefined, searchRoots, @@ -840,7 +840,7 @@ async function parseEvaluatorList( evalId, ); - const config: ToolTrajectoryEvaluatorConfig = { + const config: ToolTrajectoryGraderConfig = { name, type: 'tool-trajectory', mode, @@ -1219,7 +1219,7 @@ async function parseEvaluatorList( ...(required !== undefined ? { required } : {}), ...(min_score !== undefined ? { min_score } : {}), ...(negate !== undefined ? { negate } : {}), - } as import('../types.js').EvaluatorConfig); + } as import('../types.js').GraderConfig); continue; } @@ -1244,7 +1244,7 @@ async function parseEvaluatorList( ...(required !== undefined ? { required } : {}), ...(min_score !== undefined ? { min_score } : {}), ...(negate !== undefined ? { negate } : {}), - } as import('../types.js').EvaluatorConfig); + } as import('../types.js').GraderConfig); continue; } @@ -1271,7 +1271,7 @@ async function parseEvaluatorList( ...(required !== undefined ? { required } : {}), ...(min_score !== undefined ? { min_score } : {}), ...(negate !== undefined ? { negate } : {}), - } as import('../types.js').EvaluatorConfig); + } as import('../types.js').GraderConfig); continue; } @@ -1296,7 +1296,7 @@ async function parseEvaluatorList( ...(required !== undefined ? { required } : {}), ...(min_score !== undefined ? { min_score } : {}), ...(negate !== undefined ? 
{ negate } : {}), - } as import('../types.js').EvaluatorConfig); + } as import('../types.js').GraderConfig); continue; } @@ -1449,7 +1449,7 @@ async function parseEvaluatorList( ); if (!commandArray) { - throw new Error(`Evaluator '${name}' in '${evalId}': prompt object requires command array`); + throw new Error(`Grader '${name}' in '${evalId}': prompt object requires command array`); } // Resolve the command path (last element is typically the file path) @@ -1461,7 +1461,7 @@ async function parseEvaluatorList( resolvedPromptScript = [...commandArray.slice(0, -1), path.resolve(resolved.resolvedPath)]; } else { throw new Error( - `Evaluator '${name}' in '${evalId}': prompt command file not found: ${resolved.displayPath}`, + `Grader '${name}' in '${evalId}': prompt command file not found: ${resolved.displayPath}`, ); } @@ -1486,11 +1486,11 @@ async function parseEvaluatorList( await validateCustomPromptContent(promptPath); } catch (error) { const message = error instanceof Error ? error.message : String(error); - throw new Error(`Evaluator '${name}' template (${promptPath}): ${message}`); + throw new Error(`Grader '${name}' template (${promptPath}): ${message}`); } } else { throw new Error( - `Evaluator '${name}' in '${evalId}': prompt file not found: ${resolved.displayPath}`, + `Grader '${name}' in '${evalId}': prompt file not found: ${resolved.displayPath}`, ); } } else { @@ -1654,20 +1654,20 @@ export async function parsePreprocessors( return undefined; } if (!Array.isArray(rawValue)) { - throw new Error(`Evaluator '${evaluatorName}' in '${evalId}': preprocessors must be an array`); + throw new Error(`Grader '${evaluatorName}' in '${evalId}': preprocessors must be an array`); } const preprocessors: ContentPreprocessorConfig[] = []; for (const rawEntry of rawValue) { if (!isJsonObject(rawEntry)) { throw new Error( - `Evaluator '${evaluatorName}' in '${evalId}': each preprocessor must be an object`, + `Grader '${evaluatorName}' in '${evalId}': each preprocessor must be an object`, ); } const type = asString(rawEntry.type)?.trim(); if (!type) { - throw new Error(`Evaluator '${evaluatorName}' in '${evalId}': preprocessor.type is required`); + throw new Error(`Grader '${evaluatorName}' in '${evalId}': preprocessor.type is required`); } const command = asStringArray( @@ -1676,7 +1676,7 @@ export async function parsePreprocessors( ); if (!command || command.length === 0) { throw new Error( - `Evaluator '${evaluatorName}' in '${evalId}': preprocessor '${type}' requires command`, + `Grader '${evaluatorName}' in '${evalId}': preprocessor '${type}' requires command`, ); } @@ -1684,7 +1684,7 @@ export async function parsePreprocessors( const resolved = await resolveFileReference(commandPath, searchRoots); if (!resolved.resolvedPath) { throw new Error( - `Evaluator '${evaluatorName}' in '${evalId}': preprocessor command file not found: ${resolved.displayPath}`, + `Grader '${evaluatorName}' in '${evalId}': preprocessor command file not found: ${resolved.displayPath}`, ); } @@ -1737,30 +1737,30 @@ function generateAssertionName(typeValue: string, rawEvaluator: JsonObject): str case 'rubrics': return 'rubrics'; default: - // For all other evaluator types (llm-grader, code-grader, latency, etc.), + // For all other grader types (llm-grader, code-grader, latency, etc.), // use the type name itself as the auto-derived name. return typeValue; } } /** - * Coerce evaluator value to valid EvaluatorKind. + * Coerce evaluator value to valid GraderKind. 
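 *
 * Illustrative behavior (the deprecated-judge example assumes a '-judge'
 * suffixed type, per the error message below):
 *
 * @example
 * coerceEvaluator('llm_grader', 'case-1'); // -> 'llm-grader' (snake_case normalized)
 * coerceEvaluator('llm-grader', 'case-1'); // -> 'llm-grader'
 * coerceEvaluator('made-up', 'case-1');    // -> undefined, with a warning logged
 * coerceEvaluator('llm-judge', 'case-1');  // throws: use 'llm-grader' instead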
*/ export function coerceEvaluator( candidate: JsonValue | undefined, contextId: string, -): EvaluatorKind | undefined { +): GraderKind | undefined { if (typeof candidate !== 'string') { return undefined; } // Normalize legacy snake_case to kebab-case - const normalized = normalizeEvaluatorType(candidate); + const normalized = normalizeGraderType(candidate); if (isDeprecatedJudgeType(normalized)) { throw new Error( `Unsupported grader '${candidate}' in ${contextId}. Use '${normalized.replace('-judge', '-grader')}' instead.`, ); } - if (isEvaluatorKind(normalized)) { + if (isGraderKind(normalized)) { return normalized; } logWarning(`Unknown grader '${candidate}' in ${contextId}, falling back to default`); @@ -1825,7 +1825,7 @@ function isJsonObject(value: unknown): value is JsonObject { */ export function warnUnconsumedCriteria( _criteria: string | undefined, - _evaluators: readonly EvaluatorConfig[] | undefined, + _evaluators: readonly GraderConfig[] | undefined, _testId: string, ): void { return; @@ -1883,7 +1883,7 @@ function parseRequiredAndMinScore( // Keep numeric required for backward compat (orchestrator reads min_score preferentially) result.required = rawRequired; logWarning( - `Evaluator '${evaluatorName}' in '${evalId}': 'required: ${rawRequired}' is deprecated. ` + + `Grader '${evaluatorName}' in '${evalId}': 'required: ${rawRequired}' is deprecated. ` + `Use 'required: true' + 'min_score: ${rawRequired}' instead.`, ); } @@ -2219,11 +2219,11 @@ function parseScoreRanges( * - String shorthand: "Must be polite" -> { id: "rubric-1", outcome: "Must be polite", weight: 1.0, required: true } * - Object form with outcome, weight, required, score_ranges, required_min_score * - * Returns an LlmGraderEvaluatorConfig to prepend to evaluators, or undefined if no valid rubrics. + * Returns an LlmGraderConfig to prepend to evaluators, or undefined if no valid rubrics. */ export function parseInlineRubrics( rawRubrics: readonly unknown[], -): import('../types.js').LlmGraderEvaluatorConfig | undefined { +): import('../types.js').LlmGraderConfig | undefined { const rubricItems = rawRubrics .filter((r): r is JsonObject | string => isJsonObject(r) || typeof r === 'string') .map((rubric, index) => { diff --git a/packages/core/src/evaluation/loaders/jsonl-parser.ts b/packages/core/src/evaluation/loaders/jsonl-parser.ts index 356f35862..0887048fe 100644 --- a/packages/core/src/evaluation/loaders/jsonl-parser.ts +++ b/packages/core/src/evaluation/loaders/jsonl-parser.ts @@ -7,13 +7,13 @@ import { collectResolvedInputFilePaths } from '../input-message-utils.js'; import { interpolateEnv } from '../interpolation.js'; import type { EvalTest, JsonObject, JsonValue, TestMessage } from '../types.js'; import { isJsonObject, isTestMessage } from '../types.js'; +import { buildSearchRoots, fileExists, resolveToAbsolutePath } from './file-resolver.js'; import { coerceEvaluator, - parseEvaluators, + parseGraders, parseInlineRubrics, warnUnconsumedCriteria, -} from './evaluator-parser.js'; -import { buildSearchRoots, fileExists, resolveToAbsolutePath } from './file-resolver.js'; +} from './grader-parser.js'; import { processExpectedMessages, processMessages } from './message-processor.js'; import { resolveExpectedMessages, resolveInputMessages } from './shorthand-expansion.js'; @@ -271,9 +271,9 @@ export async function loadTestsFromJsonl( const mergedExecution = caseExecution ?? globalExecution; const testCaseEvaluatorKind = coerceEvaluator(testCaseConfig.evaluator, id) ?? 
globalEvaluator; - let evaluators: Awaited>; + let evaluators: Awaited>; try { - evaluators = await parseEvaluators( + evaluators = await parseGraders( testCaseConfig, mergedExecution, searchRoots, diff --git a/packages/core/src/evaluation/orchestrator.ts b/packages/core/src/evaluation/orchestrator.ts index cef3a7a6b..2b77d604b 100644 --- a/packages/core/src/evaluation/orchestrator.ts +++ b/packages/core/src/evaluation/orchestrator.ts @@ -8,16 +8,16 @@ import micromatch from 'micromatch'; import pLimit from 'p-limit'; import { getWorkspacePoolRoot } from '../paths.js'; +import { readJsonFile } from './file-utils.js'; import { - type ChildEvaluatorResult, + type ChildGraderResult, DEFAULT_THRESHOLD, type EvaluationScore, - type Evaluator, - LlmGraderEvaluator, + type Grader, + LlmGrader, negateScore, scoreToVerdict, -} from './evaluators.js'; -import { readJsonFile } from './file-utils.js'; +} from './graders.js'; import { createBuiltinProviderRegistry, createProvider } from './providers/index.js'; import { discoverProviders } from './providers/provider-discovery.js'; import { @@ -57,15 +57,15 @@ import type { EvalTest, EvaluationResult, EvaluationVerdict, - EvaluatorConfig, - EvaluatorKind, - EvaluatorResult, ExecutionStatus, FailOnError, FailureStage, + GraderConfig, + GraderKind, + GraderResult, JsonObject, JsonValue, - LlmGraderEvaluatorConfig, + LlmGraderConfig, TestMessage, TestMessageRole, TrialResult, @@ -104,7 +104,7 @@ function classifyQualityStatus(score: number, threshold = DEFAULT_THRESHOLD): Ex } function buildSkippedEvaluatorError( - scores: readonly EvaluatorResult[] | undefined, + scores: readonly GraderResult[] | undefined, ): string | undefined { const skippedScores = scores?.filter((score) => score.verdict === 'skip') ?? []; if (skippedScores.length === 0) { @@ -114,11 +114,11 @@ function buildSkippedEvaluatorError( const messages = skippedScores.map((score) => { const label = score.name || score.type; const assertionMessage = - score.assertions.find((assertion) => !assertion.passed)?.text ?? 'Evaluator skipped'; + score.assertions.find((assertion) => !assertion.passed)?.text ?? 'Grader skipped'; return `${label}: ${assertionMessage}`; }); - return messages.length === 1 ? messages[0] : `Evaluators skipped: ${messages.join(' | ')}`; + return messages.length === 1 ? messages[0] : `Graders skipped: ${messages.join(' | ')}`; } function usesFileReferencePrompt(provider: Provider): boolean { @@ -325,7 +325,7 @@ export interface RunEvalCaseOptions { readonly evalCase: EvalTest; readonly provider: Provider; readonly target: ResolvedTarget; - readonly evaluators: Partial> & { readonly 'llm-grader': Evaluator }; + readonly evaluators: Partial> & { readonly 'llm-grader': Grader }; readonly now?: () => Date; readonly maxRetries?: number; readonly agentTimeoutMs?: number; @@ -355,8 +355,8 @@ export interface RunEvalCaseOptions { readonly suiteWorkspaceFile?: string; /** Real-time observability callbacks passed to the provider */ readonly streamCallbacks?: ProviderStreamCallbacks; - /** Evaluator type registry (with custom assertions discovered) */ - readonly typeRegistry?: import('./registry/evaluator-registry.js').EvaluatorRegistry; + /** Grader type registry (with custom assertions discovered) */ + readonly typeRegistry?: import('./registry/grader-registry.js').GraderRegistry; /** RepoManager instance for repo lifecycle (shared workspace mode) */ readonly repoManager?: RepoManager; /** Directory containing the eval YAML file. Used as default cwd for workspace scripts. 
*/ @@ -391,7 +391,7 @@ export interface RunEvaluationOptions { readonly targets?: readonly TargetDefinition[]; readonly env?: EnvLookup; readonly providerFactory?: (target: ResolvedTarget) => Provider; - readonly evaluators?: Partial>; + readonly evaluators?: Partial>; readonly maxRetries?: number; readonly agentTimeoutMs?: number; readonly cache?: EvaluationCache; @@ -1541,10 +1541,10 @@ async function runBatchEvaluation(options: { readonly evalCases: readonly EvalTest[]; readonly provider: Provider; readonly target: ResolvedTarget; - readonly evaluatorRegistry: Partial> & { - readonly 'llm-grader': Evaluator; + readonly evaluatorRegistry: Partial> & { + readonly 'llm-grader': Grader; }; - readonly typeRegistry: import('./registry/evaluator-registry.js').EvaluatorRegistry; + readonly typeRegistry: import('./registry/grader-registry.js').GraderRegistry; readonly nowFn: () => Date; readonly onProgress?: (event: ProgressEvent) => MaybePromise; readonly onResult?: (result: EvaluationResult) => MaybePromise; @@ -2379,7 +2379,7 @@ export async function runEvalCase(options: RunEvalCaseOptions): Promise> & { readonly 'llm-grader': Evaluator }; - readonly typeRegistry: import('./registry/evaluator-registry.js').EvaluatorRegistry; + readonly evaluators: Partial> & { readonly 'llm-grader': Grader }; + readonly typeRegistry: import('./registry/grader-registry.js').GraderRegistry; readonly promptInputs: PromptInputs; readonly nowFn: () => Date; readonly attempt: number; @@ -2715,7 +2715,7 @@ async function evaluateCandidate(options: { } } - const evaluatorRequest = scores ? undefined : score.evaluatorRawRequest; + const evaluatorRequest = scores ? undefined : score.graderRawRequest; // Only include agent request if it has content (verbose mode adds the input field) const effectiveAgentRequest = agentRequest && Object.keys(agentRequest).length > 0 ? 
agentRequest : undefined; @@ -2758,8 +2758,8 @@ async function runEvaluatorsForCase(options: { readonly candidate: string; readonly target: ResolvedTarget; readonly provider: Provider; - readonly evaluators: Partial> & { readonly 'llm-grader': Evaluator }; - readonly typeRegistry: import('./registry/evaluator-registry.js').EvaluatorRegistry; + readonly evaluators: Partial> & { readonly 'llm-grader': Grader }; + readonly typeRegistry: import('./registry/grader-registry.js').GraderRegistry; readonly attempt: number; readonly promptInputs: PromptInputs; readonly now: Date; @@ -2779,7 +2779,7 @@ async function runEvaluatorsForCase(options: { readonly dockerConfig?: import('./types.js').DockerWorkspaceConfig; readonly threshold?: number; readonly dependencyResults?: Readonly>; -}): Promise<{ score: EvaluationScore; scores?: EvaluatorResult[] }> { +}): Promise<{ score: EvaluationScore; scores?: GraderResult[] }> { const { evalCase, candidate, @@ -2877,7 +2877,7 @@ async function runEvaluatorsForCase(options: { return { score }; } -function buildImplicitLlmGraderConfig(evalCase: EvalTest): LlmGraderEvaluatorConfig | undefined { +function buildImplicitLlmGraderConfig(evalCase: EvalTest): LlmGraderConfig | undefined { if (!evalCase.preprocessors || evalCase.preprocessors.length === 0) { return undefined; } @@ -2891,14 +2891,14 @@ function buildImplicitLlmGraderConfig(evalCase: EvalTest): LlmGraderEvaluatorCon async function runEvaluatorList(options: { readonly evalCase: EvalTest; - readonly evaluators: readonly EvaluatorConfig[]; + readonly evaluators: readonly GraderConfig[]; readonly candidate: string; readonly target: ResolvedTarget; readonly provider: Provider; - readonly evaluatorRegistry: Partial> & { - readonly 'llm-grader': Evaluator; + readonly evaluatorRegistry: Partial> & { + readonly 'llm-grader': Grader; }; - readonly typeRegistry: import('./registry/evaluator-registry.js').EvaluatorRegistry; + readonly typeRegistry: import('./registry/grader-registry.js').GraderRegistry; readonly attempt: number; readonly promptInputs: PromptInputs; readonly now: Date; @@ -2918,7 +2918,7 @@ async function runEvaluatorList(options: { readonly dockerConfig?: import('./types.js').DockerWorkspaceConfig; readonly threshold?: number; readonly dependencyResults?: Readonly>; -}): Promise<{ score: EvaluationScore; scores: EvaluatorResult[] }> { +}): Promise<{ score: EvaluationScore; scores: GraderResult[] }> { const { evalCase, evaluators, @@ -2955,10 +2955,10 @@ async function runEvaluatorList(options: { readonly required?: boolean | number; readonly min_score?: number; }> = []; - const scores: EvaluatorResult[] = []; + const scores: GraderResult[] = []; // Build the evaluation context (shared across all evaluators for this case) - const evalContext: import('./evaluators/types.js').EvaluationContext = { + const evalContext: import('./graders/types.js').EvaluationContext = { evalCase, candidate, target, @@ -2984,7 +2984,7 @@ async function runEvaluatorList(options: { // Build the dispatch context for evaluator factories const evalFileDir = evalCase.file_paths[0] ? 
path.dirname(evalCase.file_paths[0]) : process.cwd(); - const dispatchContext: import('./registry/evaluator-registry.js').EvaluatorDispatchContext = { + const dispatchContext: import('./registry/grader-registry.js').GraderDispatchContext = { graderProvider, targetResolver, availableTargets, @@ -3021,7 +3021,7 @@ async function runEvaluatorList(options: { weight, verdict: score.verdict, assertions: score.assertions, - input: score.evaluatorRawRequest, + input: score.graderRawRequest, target: score.graderTarget, details: score.details, scores: mapChildResults(score.scores), @@ -3037,7 +3037,7 @@ async function runEvaluatorList(options: { score: 0, verdict: 'fail', assertions: [ - { text: `Evaluator '${evaluatorConfig.name}' failed: ${message}`, passed: false }, + { text: `Grader '${evaluatorConfig.name}' failed: ${message}`, passed: false }, ], expectedAspectCount: 1, }; @@ -3060,7 +3060,7 @@ async function runEvaluatorList(options: { verdict: 'fail', assertions: [ { - text: `Evaluator '${evaluatorConfig.name ?? 'unknown'}' failed: ${message}`, + text: `Grader '${evaluatorConfig.name ?? 'unknown'}' failed: ${message}`, passed: false, }, ], @@ -3139,12 +3139,12 @@ function filterEvalCases( } function buildEvaluatorRegistry( - overrides: Partial> | undefined, + overrides: Partial> | undefined, resolveGraderProvider: (target: ResolvedTarget) => Promise, -): Partial> & { readonly 'llm-grader': Evaluator } { +): Partial> & { readonly 'llm-grader': Grader } { const llmGrader = overrides?.['llm-grader'] ?? - new LlmGraderEvaluator({ + new LlmGrader({ resolveGraderProvider: async (context) => { if (context.graderProvider) { return context.graderProvider; @@ -3173,8 +3173,8 @@ async function runConversationMode(options: { readonly evalCase: EvalTest; readonly provider: Provider; readonly target: ResolvedTarget; - readonly evaluators: Partial> & { readonly 'llm-grader': Evaluator }; - readonly typeRegistry: import('./registry/evaluator-registry.js').EvaluatorRegistry; + readonly evaluators: Partial> & { readonly 'llm-grader': Grader }; + readonly typeRegistry: import('./registry/grader-registry.js').GraderRegistry; readonly graderProvider?: Provider; readonly promptInputs: PromptInputs; readonly nowFn: () => Date; @@ -3221,7 +3221,7 @@ async function runConversationMode(options: { history.push({ role: msg.role as ChatMessageRole, content }); } - const turnScores: EvaluatorResult[] = []; + const turnScores: GraderResult[] = []; const allTurnScoreValues: number[] = []; let stopped = false; const caseStartMs = Date.now(); @@ -3234,7 +3234,7 @@ async function runConversationMode(options: { // Turn skipped due to on_turn_failure: stop turnScores.push({ name: `turn-${turnIndex}`, - type: 'rubrics' as EvaluatorKind, + type: 'rubrics' as GraderKind, score: 0, verdict: 'skip' as EvaluationVerdict, assertions: [{ text: 'Skipped due to previous turn failure', passed: false }], @@ -3268,7 +3268,7 @@ async function runConversationMode(options: { const message = error instanceof Error ? 
error.message : String(error); turnScores.push({ name: `turn-${turnIndex}`, - type: 'rubrics' as EvaluatorKind, + type: 'rubrics' as GraderKind, score: 0, verdict: 'fail' as EvaluationVerdict, assertions: [{ text: `Provider error: ${message}`, passed: false }], @@ -3289,7 +3289,7 @@ async function runConversationMode(options: { // No assertions or expected_output — turn scores 1.0 turnScores.push({ name: `turn-${turnIndex}`, - type: 'rubrics' as EvaluatorKind, + type: 'rubrics' as GraderKind, score: 1.0, verdict: 'pass' as EvaluationVerdict, assertions: [], @@ -3346,7 +3346,7 @@ async function runConversationMode(options: { turnScores.push({ name: `turn-${turnIndex}`, - type: 'rubrics' as EvaluatorKind, + type: 'rubrics' as GraderKind, score: turnScore, verdict: scoreToVerdict(turnScore, threshold ?? DEFAULT_THRESHOLD) as EvaluationVerdict, assertions: turnResult.assertions ? [...turnResult.assertions] : [], @@ -3360,7 +3360,7 @@ async function runConversationMode(options: { } // Run conversation-level assertions (top-level assertions on full transcript) - let conversationScores: EvaluatorResult[] = []; + let conversationScores: GraderResult[] = []; if (evalCase.assertions?.length) { const conversationEvalCase: EvalTest = { ...evalCase, @@ -3405,7 +3405,7 @@ async function runConversationMode(options: { conversationScores = [ { name: 'conversation', - type: 'rubrics' as EvaluatorKind, + type: 'rubrics' as GraderKind, score: conversationResult.score, verdict: scoreToVerdict( conversationResult.score, @@ -3480,15 +3480,15 @@ function buildTurnGraderInput(history: readonly ChatMessage[], windowSize?: numb } /** - * Convert per-turn assertions to EvaluatorConfig[]. + * Convert per-turn assertions to GraderConfig[]. * String assertions are grouped into a single rubrics evaluator. * Structured assertions pass through as-is. */ -function buildTurnAssertions(turn: ConversationTurn): EvaluatorConfig[] { +function buildTurnAssertions(turn: ConversationTurn): GraderConfig[] { if (!turn.assertions?.length) return []; const stringCriteria: string[] = []; - const structured: EvaluatorConfig[] = []; + const structured: GraderConfig[] = []; for (const a of turn.assertions) { if (typeof a === 'string') { @@ -3498,21 +3498,21 @@ function buildTurnAssertions(turn: ConversationTurn): EvaluatorConfig[] { } } - const result: EvaluatorConfig[] = []; + const result: GraderConfig[] = []; // Group string assertions into a single llm-grader evaluator with rubrics. // Uses llm-grader (not rubrics) because 'rubrics' is a YAML shorthand resolved by - // the evaluator-parser — at runtime we always dispatch through 'llm-grader'. + // the grader-parser — at runtime we always dispatch through 'llm-grader'. if (stringCriteria.length > 0) { result.push({ name: 'turn-rubrics', - type: 'llm-grader' as EvaluatorKind, + type: 'llm-grader' as GraderKind, rubrics: stringCriteria.map((text, idx) => ({ id: `criterion-${idx + 1}`, outcome: text, weight: 1, })), - } as unknown as EvaluatorConfig); + } as unknown as GraderConfig); } result.push(...structured); @@ -3707,10 +3707,10 @@ function buildResultInput(promptInputs: PromptInputs): EvaluationResult['input'] } /** - * Sum token usage across all evaluator results (including nested children). + * Sum token usage across all grader results (including nested children). * Returns undefined when no evaluator reported token usage. 
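 *
 * A shape sketch (TokenUsage field values omitted; the aggregation walks
 * nested scores recursively):
 *
 * @example
 * // parent grader reports usage, and so does one nested child:
 * //   aggregateEvaluatorTokenUsage(scores) -> a single summed TokenUsage
 * // no grader reports usage at any depth:
 * //   aggregateEvaluatorTokenUsage(scores) -> undefined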
*/ -function aggregateEvaluatorTokenUsage(scores?: readonly EvaluatorResult[]): TokenUsage | undefined { +function aggregateEvaluatorTokenUsage(scores?: readonly GraderResult[]): TokenUsage | undefined { if (!scores || scores.length === 0) return undefined; let hasAny = false; @@ -3721,7 +3721,7 @@ function aggregateEvaluatorTokenUsage(scores?: readonly EvaluatorResult[]): Toke let hasReasoning = false; let hasCached = false; - const visit = (items: readonly EvaluatorResult[]): void => { + const visit = (items: readonly GraderResult[]): void => { for (const item of items) { if (item.tokenUsage) { hasAny = true; @@ -3809,20 +3809,20 @@ function sleep(ms: number, signal?: AbortSignal): Promise { } function mapChildResults( - children?: readonly ChildEvaluatorResult[], -): readonly EvaluatorResult[] | undefined { + children?: readonly ChildGraderResult[], +): readonly GraderResult[] | undefined { if (!children || children.length === 0) { return undefined; } return children.map((child) => ({ name: child.name, - type: child.type as EvaluatorKind, + type: child.type as GraderKind, score: child.score, weight: child.weight, verdict: child.verdict, assertions: child.assertions, - input: child.evaluatorRawRequest, + input: child.graderRawRequest, scores: mapChildResults(child.scores), details: child.details, tokenUsage: child.tokenUsage, diff --git a/packages/core/src/evaluation/registry/assertion-discovery.ts b/packages/core/src/evaluation/registry/assertion-discovery.ts index 3c7e80b9b..7c9a63a56 100644 --- a/packages/core/src/evaluation/registry/assertion-discovery.ts +++ b/packages/core/src/evaluation/registry/assertion-discovery.ts @@ -11,9 +11,9 @@ import path from 'node:path'; import fg from 'fast-glob'; -import { CodeEvaluator } from '../evaluators/code-evaluator.js'; -import type { EvaluatorFactoryFn } from './evaluator-registry.js'; -import type { EvaluatorRegistry } from './evaluator-registry.js'; +import { CodeGrader } from '../graders/code-grader.js'; +import type { GraderFactoryFn } from './grader-registry.js'; +import type { GraderRegistry } from './grader-registry.js'; /** * Discover custom assertion scripts from `.agentv/assertions/` and register @@ -24,7 +24,7 @@ import type { EvaluatorRegistry } from './evaluator-registry.js'; * @returns Names of discovered assertion types */ export async function discoverAssertions( - registry: EvaluatorRegistry, + registry: GraderRegistry, baseDir: string, ): Promise { const patterns = ['*.ts', '*.js', '*.mts', '*.mjs']; @@ -63,8 +63,8 @@ export async function discoverAssertions( continue; } - const factory: EvaluatorFactoryFn = (_config, context) => { - return new CodeEvaluator({ + const factory: GraderFactoryFn = (_config, context) => { + return new CodeGrader({ command: ['bun', 'run', filePath], agentTimeoutMs: context.agentTimeoutMs, }); diff --git a/packages/core/src/evaluation/registry/builtin-evaluators.ts b/packages/core/src/evaluation/registry/builtin-graders.ts similarity index 63% rename from packages/core/src/evaluation/registry/builtin-evaluators.ts rename to packages/core/src/evaluation/registry/builtin-graders.ts index 37b17b77c..b24eb20b0 100644 --- a/packages/core/src/evaluation/registry/builtin-evaluators.ts +++ b/packages/core/src/evaluation/registry/builtin-graders.ts @@ -1,23 +1,23 @@ /** - * Factory functions for all built-in evaluator types. + * Factory functions for all built-in grader types. 
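For orientation, a minimal sketch of how these renamed factories are consumed, using only names visible in this patch (createBuiltinRegistry and the GraderRegistry API). The '@agentv/eval' import path follows the package name mentioned in the registry header later in this patch and is an assumption, as is the exact list of registered type names.

import { createBuiltinRegistry } from '@agentv/eval'; // assumed entry point

// Build a registry pre-populated with every built-in grader factory.
const registry = createBuiltinRegistry();

registry.has('llm-grader'); // true: registered first in createBuiltinRegistry()
registry.list(); // e.g. ['llm-grader', 'code-grader', 'composite', ...] (illustrative)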
* - * Each factory creates an Evaluator instance from an EvaluatorConfig, + * Each factory creates a Grader instance from a GraderConfig, * handling type-specific initialization logic. These are registered into - * the EvaluatorRegistry at startup. + * the GraderRegistry at startup. */ import { - CodeEvaluator, - CompositeEvaluator, - CostEvaluator, - type Evaluator, - ExecutionMetricsEvaluator, - FieldAccuracyEvaluator, - LatencyEvaluator, - LlmGraderEvaluator, - SkillTriggerEvaluator, - TokenUsageEvaluator, - ToolTrajectoryEvaluator, + CodeGrader, + CompositeGrader, + CostGrader, + ExecutionMetricsGrader, + FieldAccuracyGrader, + type Grader, + LatencyGrader, + LlmGrader, + SkillTriggerGrader, + TokenUsageGrader, + ToolTrajectoryGrader, runContainsAllAssertion, runContainsAnyAssertion, runContainsAssertion, @@ -29,43 +29,43 @@ import { runIsJsonAssertion, runRegexAssertion, runStartsWithAssertion, -} from '../evaluators.js'; -import { InlineAssertEvaluator } from '../evaluators/inline-assert.js'; -import { containsTemplateVariables, resolveCustomPrompt } from '../evaluators/prompt-resolution.js'; +} from '../graders.js'; +import { InlineAssertGrader } from '../graders/inline-assert.js'; +import { containsTemplateVariables, resolveCustomPrompt } from '../graders/prompt-resolution.js'; import { isAgentProvider } from '../providers/types.js'; import type { Provider } from '../providers/types.js'; -import type { ToolTrajectoryEvaluatorConfig } from '../trace.js'; +import type { ToolTrajectoryGraderConfig } from '../trace.js'; import type { - CodeEvaluatorConfig, - CompositeEvaluatorConfig, - ContainsAllEvaluatorConfig, - ContainsAnyEvaluatorConfig, - ContainsEvaluatorConfig, - CostEvaluatorConfig, - EndsWithEvaluatorConfig, - EqualsEvaluatorConfig, - EvaluatorConfig, - ExecutionMetricsEvaluatorConfig, - FieldAccuracyEvaluatorConfig, - IcontainsAllEvaluatorConfig, - IcontainsAnyEvaluatorConfig, - IcontainsEvaluatorConfig, - IsJsonEvaluatorConfig, - LatencyEvaluatorConfig, - LlmGraderEvaluatorConfig, - RegexEvaluatorConfig, - SkillTriggerEvaluatorConfig, - StartsWithEvaluatorConfig, - TokenUsageEvaluatorConfig, + CodeGraderConfig, + CompositeGraderConfig, + ContainsAllGraderConfig, + ContainsAnyGraderConfig, + ContainsGraderConfig, + CostGraderConfig, + EndsWithGraderConfig, + EqualsGraderConfig, + ExecutionMetricsGraderConfig, + FieldAccuracyGraderConfig, + GraderConfig, + IcontainsAllGraderConfig, + IcontainsAnyGraderConfig, + IcontainsGraderConfig, + IsJsonGraderConfig, + LatencyGraderConfig, + LlmGraderConfig, + RegexGraderConfig, + SkillTriggerGraderConfig, + StartsWithGraderConfig, + TokenUsageGraderConfig, } from '../types.js'; import { - DeterministicAssertionEvaluator, - type EvaluatorDispatchContext, - type EvaluatorFactoryFn, - EvaluatorRegistry, -} from './evaluator-registry.js'; + DeterministicAssertionGrader, + type GraderDispatchContext, + type GraderFactoryFn, + GraderRegistry, +} from './grader-registry.js'; -/** Symbol for attaching inline AssertFn to EvaluatorConfig objects */ +/** Symbol for attaching inline AssertFn to GraderConfig objects */ export const INLINE_ASSERT_FN = Symbol.for('agentv.inline-assert-fn'); /** @@ -78,8 +78,8 @@ export const INLINE_ASSERT_FN = Symbol.for('agentv.inline-assert-fn'); * - Agent providers (claude-cli, copilot, etc.): delegate mode * - agentv provider: built-in AI SDK agent mode with filesystem tools */ -export const llmGraderFactory: EvaluatorFactoryFn = (config, context) => { - const c = config as LlmGraderEvaluatorConfig; +export
const llmGraderFactory: GraderFactoryFn = (config, context) => { + const c = config as LlmGraderConfig; const { llmGrader, graderProvider, judgeProvider, targetResolver, agentTimeoutMs } = context; let evaluator = llmGrader; @@ -98,7 +98,7 @@ export const llmGraderFactory: EvaluatorFactoryFn = (config, context) => { // Note: agentv uses asLanguageModel() not invoke(), so it's not in AGENT_PROVIDER_KINDS; // check it explicitly here for built-in agent mode. const isAgent = isAgentProvider(graderTargetProvider) || graderTargetProvider.kind === 'agentv'; - evaluator = new LlmGraderEvaluator({ + evaluator = new LlmGrader({ resolveGraderProvider: async (evalContext) => { if (graderTargetProvider) return graderTargetProvider; if (evalContext.graderProvider) return evalContext.graderProvider; @@ -128,7 +128,7 @@ export const llmGraderFactory: EvaluatorFactoryFn = (config, context) => { ); // Determine whether the resolved prompt should replace the entire - // evaluator template or be injected as the {{criteria}} in the default + // grader template or be injected as the {{criteria}} in the default // template. // // Script-based prompts (resolvedPromptScript) and file-based prompts @@ -144,11 +144,11 @@ export const llmGraderFactory: EvaluatorFactoryFn = (config, context) => { const isFromInlinePrompt = !c.resolvedPromptScript?.length && !c.resolvedPromptPath && !c.promptPath; - let evaluatorTemplateOverride: string | undefined; + let graderTemplateOverride: string | undefined; let evalCase = evalContext.evalCase; if (customPrompt) { if (!isFromInlinePrompt || containsTemplateVariables(customPrompt)) { - evaluatorTemplateOverride = customPrompt; + graderTemplateOverride = customPrompt; } else { // Bare inline text — use as criteria in the default template evalCase = { ...evalCase, criteria: customPrompt }; @@ -158,20 +158,17 @@ export const llmGraderFactory: EvaluatorFactoryFn = (config, context) => { return evaluator.evaluate({ ...evalContext, evalCase, - evaluatorTemplateOverride, + graderTemplateOverride, evaluator: c, }); }, }; }; -/** @deprecated Use `llmGraderFactory` instead. */ -export const llmJudgeFactory = llmGraderFactory; - /** Factory for `code-grader` evaluators. */ -export const codeFactory: EvaluatorFactoryFn = (config, context) => { - const c = config as CodeEvaluatorConfig; - return new CodeEvaluator({ +export const codeFactory: GraderFactoryFn = (config, context) => { + const c = config as CodeGraderConfig; + return new CodeGrader({ command: c.command ?? c.script ?? [], cwd: c.resolvedCwd ?? c.cwd, agentTimeoutMs: context.agentTimeoutMs, @@ -181,25 +178,25 @@ export const codeFactory: EvaluatorFactoryFn = (config, context) => { }; /** Factory for `composite` evaluators. */ -export const compositeFactory: EvaluatorFactoryFn = (config, context) => { - const c = config as CompositeEvaluatorConfig; +export const compositeFactory: GraderFactoryFn = (config, context) => { + const c = config as CompositeGraderConfig; const evalFileDir = context.evalFileDir ?? 
process.cwd(); - return new CompositeEvaluator({ + return new CompositeGrader({ config: c, cwd: evalFileDir, evaluatorFactory: { - create: (memberConfig: EvaluatorConfig) => { + create: (memberConfig: GraderConfig) => { const factory = context.registry.get(memberConfig.type); if (!factory) { - throw new Error(`Unsupported evaluator type in composite: ${memberConfig.type}`); + throw new Error(`Unsupported grader type in composite: ${memberConfig.type}`); } // Factory functions may return a promise; for composite sync creation, // we handle the common synchronous cases directly. const result = factory(memberConfig, context); if (result instanceof Promise) { throw new Error( - `Evaluator factory for type "${memberConfig.type}" is async — not supported inside composite members. Use synchronous factories for composite child evaluators.`, + `Grader factory for type "${memberConfig.type}" is async — not supported inside composite members. Use synchronous factories for composite child evaluators.`, ); } return result; @@ -209,50 +206,50 @@ export const compositeFactory: EvaluatorFactoryFn = (config, context) => { }; /** Factory for `tool-trajectory` evaluators. */ -export const toolTrajectoryFactory: EvaluatorFactoryFn = (config) => { - return new ToolTrajectoryEvaluator({ - config: config as ToolTrajectoryEvaluatorConfig, +export const toolTrajectoryFactory: GraderFactoryFn = (config) => { + return new ToolTrajectoryGrader({ + config: config as ToolTrajectoryGraderConfig, }); }; /** Factory for `field-accuracy` evaluators. */ -export const fieldAccuracyFactory: EvaluatorFactoryFn = (config) => { - return new FieldAccuracyEvaluator({ - config: config as FieldAccuracyEvaluatorConfig, +export const fieldAccuracyFactory: GraderFactoryFn = (config) => { + return new FieldAccuracyGrader({ + config: config as FieldAccuracyGraderConfig, }); }; /** Factory for `latency` evaluators. */ -export const latencyFactory: EvaluatorFactoryFn = (config) => { - return new LatencyEvaluator({ config: config as LatencyEvaluatorConfig }); +export const latencyFactory: GraderFactoryFn = (config) => { + return new LatencyGrader({ config: config as LatencyGraderConfig }); }; /** Factory for `cost` evaluators. */ -export const costFactory: EvaluatorFactoryFn = (config) => { - return new CostEvaluator({ config: config as CostEvaluatorConfig }); +export const costFactory: GraderFactoryFn = (config) => { + return new CostGrader({ config: config as CostGraderConfig }); }; /** Factory for `token-usage` evaluators. */ -export const tokenUsageFactory: EvaluatorFactoryFn = (config) => { - return new TokenUsageEvaluator({ config: config as TokenUsageEvaluatorConfig }); +export const tokenUsageFactory: GraderFactoryFn = (config) => { + return new TokenUsageGrader({ config: config as TokenUsageGraderConfig }); }; /** Factory for `execution-metrics` evaluators. */ -export const executionMetricsFactory: EvaluatorFactoryFn = (config) => { - return new ExecutionMetricsEvaluator({ - config: config as ExecutionMetricsEvaluatorConfig, +export const executionMetricsFactory: GraderFactoryFn = (config) => { + return new ExecutionMetricsGrader({ + config: config as ExecutionMetricsGraderConfig, }); }; /** Factory for `skill-trigger` evaluator. 
*/ -export const skillTriggerFactory: EvaluatorFactoryFn = (config) => { - return new SkillTriggerEvaluator(config as SkillTriggerEvaluatorConfig); +export const skillTriggerFactory: GraderFactoryFn = (config) => { + return new SkillTriggerGrader(config as SkillTriggerGraderConfig); }; /** Factory for `contains` deterministic assertion. */ -export const containsFactory: EvaluatorFactoryFn = (config) => { - const c = config as ContainsEvaluatorConfig; - return new DeterministicAssertionEvaluator('contains', (ctx) => { +export const containsFactory: GraderFactoryFn = (config) => { + const c = config as ContainsGraderConfig; + return new DeterministicAssertionGrader('contains', (ctx) => { const result = runContainsAssertion(ctx.candidate, c.value); return { score: result.score, @@ -264,9 +261,9 @@ export const containsFactory: EvaluatorFactoryFn = (config) => { }; /** Factory for `regex` deterministic assertion. */ -export const regexFactory: EvaluatorFactoryFn = (config) => { - const c = config as RegexEvaluatorConfig; - return new DeterministicAssertionEvaluator('regex', (ctx) => { +export const regexFactory: GraderFactoryFn = (config) => { + const c = config as RegexGraderConfig; + return new DeterministicAssertionGrader('regex', (ctx) => { const result = runRegexAssertion(ctx.candidate, c.value, c.flags); return { score: result.score, @@ -278,8 +275,8 @@ export const regexFactory: EvaluatorFactoryFn = (config) => { }; /** Factory for `is-json` deterministic assertion. */ -export const isJsonFactory: EvaluatorFactoryFn = () => { - return new DeterministicAssertionEvaluator('is-json', (ctx) => { +export const isJsonFactory: GraderFactoryFn = () => { + return new DeterministicAssertionGrader('is-json', (ctx) => { const result = runIsJsonAssertion(ctx.candidate); return { score: result.score, @@ -291,9 +288,9 @@ export const isJsonFactory: EvaluatorFactoryFn = () => { }; /** Factory for `equals` deterministic assertion. */ -export const equalsFactory: EvaluatorFactoryFn = (config) => { - const c = config as EqualsEvaluatorConfig; - return new DeterministicAssertionEvaluator('equals', (ctx) => { +export const equalsFactory: GraderFactoryFn = (config) => { + const c = config as EqualsGraderConfig; + return new DeterministicAssertionGrader('equals', (ctx) => { const result = runEqualsAssertion(ctx.candidate, c.value); return { score: result.score, @@ -305,9 +302,9 @@ export const equalsFactory: EvaluatorFactoryFn = (config) => { }; /** Factory for `contains-any` deterministic assertion. */ -export const containsAnyFactory: EvaluatorFactoryFn = (config) => { - const c = config as ContainsAnyEvaluatorConfig; - return new DeterministicAssertionEvaluator('contains-any', (ctx) => { +export const containsAnyFactory: GraderFactoryFn = (config) => { + const c = config as ContainsAnyGraderConfig; + return new DeterministicAssertionGrader('contains-any', (ctx) => { const result = runContainsAnyAssertion(ctx.candidate, c.value); return { score: result.score, @@ -319,9 +316,9 @@ export const containsAnyFactory: EvaluatorFactoryFn = (config) => { }; /** Factory for `contains-all` deterministic assertion. 
*/ -export const containsAllFactory: EvaluatorFactoryFn = (config) => { - const c = config as ContainsAllEvaluatorConfig; - return new DeterministicAssertionEvaluator('contains-all', (ctx) => { +export const containsAllFactory: GraderFactoryFn = (config) => { + const c = config as ContainsAllGraderConfig; + return new DeterministicAssertionGrader('contains-all', (ctx) => { const result = runContainsAllAssertion(ctx.candidate, c.value); return { score: result.score, @@ -333,9 +330,9 @@ export const containsAllFactory: EvaluatorFactoryFn = (config) => { }; /** Factory for `icontains` deterministic assertion. */ -export const icontainsFactory: EvaluatorFactoryFn = (config) => { - const c = config as IcontainsEvaluatorConfig; - return new DeterministicAssertionEvaluator('icontains', (ctx) => { +export const icontainsFactory: GraderFactoryFn = (config) => { + const c = config as IcontainsGraderConfig; + return new DeterministicAssertionGrader('icontains', (ctx) => { const result = runIcontainsAssertion(ctx.candidate, c.value); return { score: result.score, @@ -347,9 +344,9 @@ export const icontainsFactory: EvaluatorFactoryFn = (config) => { }; /** Factory for `icontains-any` deterministic assertion. */ -export const icontainsAnyFactory: EvaluatorFactoryFn = (config) => { - const c = config as IcontainsAnyEvaluatorConfig; - return new DeterministicAssertionEvaluator('icontains-any', (ctx) => { +export const icontainsAnyFactory: GraderFactoryFn = (config) => { + const c = config as IcontainsAnyGraderConfig; + return new DeterministicAssertionGrader('icontains-any', (ctx) => { const result = runIcontainsAnyAssertion(ctx.candidate, c.value); return { score: result.score, @@ -361,9 +358,9 @@ export const icontainsAnyFactory: EvaluatorFactoryFn = (config) => { }; /** Factory for `icontains-all` deterministic assertion. */ -export const icontainsAllFactory: EvaluatorFactoryFn = (config) => { - const c = config as IcontainsAllEvaluatorConfig; - return new DeterministicAssertionEvaluator('icontains-all', (ctx) => { +export const icontainsAllFactory: GraderFactoryFn = (config) => { + const c = config as IcontainsAllGraderConfig; + return new DeterministicAssertionGrader('icontains-all', (ctx) => { const result = runIcontainsAllAssertion(ctx.candidate, c.value); return { score: result.score, @@ -375,9 +372,9 @@ export const icontainsAllFactory: EvaluatorFactoryFn = (config) => { }; /** Factory for `starts-with` deterministic assertion. */ -export const startsWithFactory: EvaluatorFactoryFn = (config) => { - const c = config as StartsWithEvaluatorConfig; - return new DeterministicAssertionEvaluator('starts-with', (ctx) => { +export const startsWithFactory: GraderFactoryFn = (config) => { + const c = config as StartsWithGraderConfig; + return new DeterministicAssertionGrader('starts-with', (ctx) => { const result = runStartsWithAssertion(ctx.candidate, c.value); return { score: result.score, @@ -389,9 +386,9 @@ export const startsWithFactory: EvaluatorFactoryFn = (config) => { }; /** Factory for `ends-with` deterministic assertion. 
*/ -export const endsWithFactory: EvaluatorFactoryFn = (config) => { - const c = config as EndsWithEvaluatorConfig; - return new DeterministicAssertionEvaluator('ends-with', (ctx) => { +export const endsWithFactory: GraderFactoryFn = (config) => { + const c = config as EndsWithGraderConfig; + return new DeterministicAssertionGrader('ends-with', (ctx) => { const result = runEndsWithAssertion(ctx.candidate, c.value); return { score: result.score, @@ -403,10 +400,10 @@ export const endsWithFactory: EvaluatorFactoryFn = (config) => { }; /** - * Create a new EvaluatorRegistry with all built-in evaluator types registered. + * Create a new GraderRegistry with all built-in grader types registered. */ -export function createBuiltinRegistry(): EvaluatorRegistry { - const registry = new EvaluatorRegistry(); +export function createBuiltinRegistry(): GraderRegistry { + const registry = new GraderRegistry(); registry .register('llm-grader', llmGraderFactory) @@ -440,7 +437,7 @@ export function createBuiltinRegistry(): EvaluatorRegistry { `No inline assert function found on config for "${config.name}". Inline assert functions must be attached via INLINE_ASSERT_FN symbol.`, ); } - return new InlineAssertEvaluator(fn, config.name ?? 'inline-assert'); + return new InlineAssertGrader(fn, config.name ?? 'inline-assert'); }); return registry; diff --git a/packages/core/src/evaluation/registry/grader-discovery.ts b/packages/core/src/evaluation/registry/grader-discovery.ts index 90d8890b0..bd721083d 100644 --- a/packages/core/src/evaluation/registry/grader-discovery.ts +++ b/packages/core/src/evaluation/registry/grader-discovery.ts @@ -2,8 +2,8 @@ * Convention-based discovery of custom grader scripts. * * Scans `.agentv/graders/` (and legacy `.agentv/judges/`) for TypeScript/JavaScript - * files and registers them as code-grader evaluators in the registry. The file name - * (without extension) becomes the evaluator type name. + * files and registers them as code graders in the registry. The file name + * (without extension) becomes the grader type name. * * Example: `.agentv/graders/custom-grader.ts` → type "custom-grader" in EVAL.yaml */ @@ -11,20 +11,20 @@ import path from 'node:path'; import fg from 'fast-glob'; -import { CodeEvaluator } from '../evaluators/code-evaluator.js'; -import type { EvaluatorFactoryFn } from './evaluator-registry.js'; -import type { EvaluatorRegistry } from './evaluator-registry.js'; +import { CodeGrader } from '../graders/code-grader.js'; +import type { GraderFactoryFn } from './grader-registry.js'; +import type { GraderRegistry } from './grader-registry.js'; /** * Discover custom grader scripts from `.agentv/graders/` (and legacy `.agentv/judges/`) - * and register them as evaluator types in the registry. + * and register them as grader types in the registry. 
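As a sketch of this convention under stated assumptions: a hypothetical project containing .agentv/graders/word-count.ts (an invented file name) would gain a 'word-count' grader type after discovery. The '@agentv/eval' import path is assumed.

import { createBuiltinRegistry, discoverGraders } from '@agentv/eval';

const registry = createBuiltinRegistry();
// Scans .agentv/graders/ (and legacy .agentv/judges/) below the base dir;
// each matching script is wrapped in a CodeGrader invoked as `bun run <file>`.
const discovered = await discoverGraders(registry, process.cwd()); // in an async/ESM context
// discovered would include 'word-count', usable as a `type` in EVAL.yaml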
* - * @param registry - The evaluator registry to register discovered graders into + * @param registry - The grader registry to register discovered graders into * @param baseDir - The base directory to search from (typically project root or eval file dir) * @returns Names of discovered grader types */ export async function discoverGraders( - registry: EvaluatorRegistry, + registry: GraderRegistry, baseDir: string, ): Promise { const patterns = ['*.ts', '*.js', '*.mts', '*.mjs']; @@ -64,8 +64,8 @@ export async function discoverGraders( continue; } - const factory: EvaluatorFactoryFn = (_config, context) => { - return new CodeEvaluator({ + const factory: GraderFactoryFn = (_config, context) => { + return new CodeGrader({ command: ['bun', 'run', filePath], agentTimeoutMs: context.agentTimeoutMs, }); @@ -77,6 +77,3 @@ export async function discoverGraders( return discoveredTypes; } - -/** @deprecated Use `discoverGraders` instead */ -export const discoverJudges = discoverGraders; diff --git a/packages/core/src/evaluation/registry/evaluator-registry.ts b/packages/core/src/evaluation/registry/grader-registry.ts similarity index 61% rename from packages/core/src/evaluation/registry/evaluator-registry.ts rename to packages/core/src/evaluation/registry/grader-registry.ts index 1f858b16e..5f4a042c7 100644 --- a/packages/core/src/evaluation/registry/evaluator-registry.ts +++ b/packages/core/src/evaluation/registry/grader-registry.ts @@ -1,5 +1,5 @@ /** - * Extensible evaluator registry. + * Extensible grader registry. * * Replaces the hardcoded switch/case dispatch in the orchestrator with * a registry of named factory functions. Built-in evaluators are registered @@ -7,16 +7,16 @@ * `@agentv/eval` or by dropping files in `.agentv/assertions/`. */ -import type { EvaluationContext, EvaluationScore, Evaluator } from '../evaluators/types.js'; -import type { TargetResolver } from '../evaluators/types.js'; +import type { EvaluationContext, EvaluationScore, Grader } from '../graders/types.js'; +import type { TargetResolver } from '../graders/types.js'; import type { Provider } from '../providers/types.js'; -import type { EvaluatorConfig } from '../types.js'; +import type { GraderConfig } from '../types.js'; /** - * Context passed to evaluator factory functions during creation. + * Context passed to grader factory functions during creation. * Contains shared resources needed by evaluator instances. */ -export interface EvaluatorDispatchContext { +export interface GraderDispatchContext { /** Shared LLM grader provider (resolved at suite level) */ readonly graderProvider?: Provider; /** @deprecated Use `graderProvider` instead */ @@ -30,43 +30,43 @@ /** Directory containing the eval file (for composite member resolution) */ readonly evalFileDir?: string; /** Shared LLM grader evaluator instance */ - readonly llmGrader: Evaluator; + readonly llmGrader: Grader; /** @deprecated Use `llmGrader` instead */ - readonly llmJudge?: Evaluator; + readonly llmJudge?: Grader; /** Reference to the registry itself (for composite evaluators that need to create children) */ - readonly registry: EvaluatorRegistry; + readonly registry: GraderRegistry; } /** - * Factory function that creates an Evaluator instance from a config. + * Factory function that creates a Grader instance from a config.
* * Factory functions handle all type-specific initialization logic: * - Reading prompt files for LLM graders * - Resolving script paths for code graders * - Creating adapter evaluators for deterministic assertions */ -export type EvaluatorFactoryFn = ( - config: EvaluatorConfig, - context: EvaluatorDispatchContext, -) => Evaluator | Promise; +export type GraderFactoryFn = ( + config: GraderConfig, + context: GraderDispatchContext, +) => Grader | Promise; /** - * Registry of evaluator factory functions keyed by evaluator type name. + * Registry of grader factory functions keyed by grader type name. * * Built-in evaluators are registered at startup. Custom evaluators can be * registered via the `register()` method or discovered from `.agentv/assertions/`. */ -export class EvaluatorRegistry { - private readonly factories = new Map(); +export class GraderRegistry { + private readonly factories = new Map(); - /** Register a factory function for an evaluator type. */ - register(type: string, factory: EvaluatorFactoryFn): this { + /** Register a factory function for a grader type. */ + register(type: string, factory: GraderFactoryFn): this { this.factories.set(type, factory); return this; } - /** Get the factory function for an evaluator type. */ - get(type: string): EvaluatorFactoryFn | undefined { + /** Get the factory function for a grader type. */ + get(type: string): GraderFactoryFn | undefined { return this.factories.get(type); } @@ -75,20 +75,20 @@ export class EvaluatorRegistry { return this.factories.has(type); } - /** List all registered evaluator type names. */ + /** List all registered grader type names. */ list(): string[] { return [...this.factories.keys()]; } /** * Create an evaluator instance from a config, using the registered factory. - * Throws if no factory is registered for the evaluator type. */ - async create(config: EvaluatorConfig, context: EvaluatorDispatchContext): Promise { + * Throws if no factory is registered for the grader type. */ + async create(config: GraderConfig, context: GraderDispatchContext): Promise { const factory = this.factories.get(config.type); if (!factory) { throw new Error( - `Unknown evaluator type: "${config.type}". Registered types: ${this.list().join(', ')}`, + `Unknown grader type: "${config.type}". Registered types: ${this.list().join(', ')}`, ); } return factory(config, context); @@ -96,10 +96,10 @@ } /** - * Adapter that wraps a synchronous assertion function as an Evaluator. + * Adapter that wraps a synchronous assertion function as a Grader. * Used for deterministic assertions (contains, regex, is-json, equals). */ -export class DeterministicAssertionEvaluator implements Evaluator { +export class DeterministicAssertionGrader implements Grader { readonly kind: string; constructor( diff --git a/packages/core/src/evaluation/registry/index.ts b/packages/core/src/evaluation/registry/index.ts index 0f4f130cc..b738eb7a1 100644 --- a/packages/core/src/evaluation/registry/index.ts +++ b/packages/core/src/evaluation/registry/index.ts @@ -1,10 +1,10 @@ /** - * Evaluator registry — extensible evaluator type dispatch. + * Grader registry — extensible grader type dispatch.
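To make the extension point concrete, a sketch of registering a custom deterministic type. The 'max-length' name and fixed limit are hypothetical, and the returned score shape (score, verdict, assertions) is patterned on the built-in factories rather than a documented contract.

import {
  createBuiltinRegistry,
  DeterministicAssertionGrader,
  type GraderFactoryFn,
} from '@agentv/eval'; // assumed entry point

const maxLengthFactory: GraderFactoryFn = (_config, _context) => {
  const limit = 280; // hypothetical fixed limit for this sketch
  return new DeterministicAssertionGrader('max-length', (ctx) => {
    const passed = ctx.candidate.length <= limit;
    return {
      score: passed ? 1 : 0,
      verdict: passed ? 'pass' : 'fail',
      assertions: [{ text: `output is at most ${limit} characters`, passed }],
    };
  });
};

const registry = createBuiltinRegistry();
registry.register('max-length', maxLengthFactory);
registry.has('max-length'); // true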
* * @module */ -export { EvaluatorRegistry, DeterministicAssertionEvaluator } from './evaluator-registry.js'; -export type { EvaluatorDispatchContext, EvaluatorFactoryFn } from './evaluator-registry.js'; -export { createBuiltinRegistry } from './builtin-evaluators.js'; +export { GraderRegistry, DeterministicAssertionGrader } from './grader-registry.js'; +export type { GraderDispatchContext, GraderFactoryFn } from './grader-registry.js'; +export { createBuiltinRegistry } from './builtin-graders.js'; export { discoverAssertions } from './assertion-discovery.js'; -export { discoverGraders, discoverGraders as discoverJudges } from './grader-discovery.js'; +export { discoverGraders } from './grader-discovery.js'; diff --git a/packages/core/src/evaluation/template-variables.ts b/packages/core/src/evaluation/template-variables.ts index 31d289145..508d837db 100644 --- a/packages/core/src/evaluation/template-variables.ts +++ b/packages/core/src/evaluation/template-variables.ts @@ -1,6 +1,6 @@ /** * Template variable constants for evaluator prompts. - * These variables can be used in custom evaluator templates with {{ variable_name }} syntax. + * These variables can be used in custom grader templates with {{ variable_name }} syntax. * * Primary variables: * - {{ input }} — input as plain text (single-turn) or role-prefixed conversation (multi-turn) @@ -40,7 +40,7 @@ export const VALID_TEMPLATE_VARIABLES = new Set(Object.values(TEMPLATE_V /** * Template variables that are required for meaningful evaluation. - * At least one of these should be present in a custom evaluator template. + * At least one of these should be present in a custom grader template. */ export const REQUIRED_TEMPLATE_VARIABLES = new Set([ TEMPLATE_VARIABLES.OUTPUT, diff --git a/packages/core/src/evaluation/trace.ts b/packages/core/src/evaluation/trace.ts index 65d0aedaf..6639548d8 100644 --- a/packages/core/src/evaluation/trace.ts +++ b/packages/core/src/evaluation/trace.ts @@ -59,7 +59,7 @@ export type ArgsMatchMode = 'exact' | 'ignore' | 'subset' | 'superset'; /** * Configuration for tool-trajectory evaluator. */ -export interface ToolTrajectoryEvaluatorConfig { +export interface ToolTrajectoryGraderConfig { readonly name: string; readonly type: 'tool-trajectory'; /** Matching mode */ @@ -73,7 +73,7 @@ export interface ToolTrajectoryEvaluatorConfig { readonly required?: boolean | number; /** Minimum score (0-1) for this evaluator to pass. Independent of `required` gate. */ readonly min_score?: number; - /** When true, inverts the evaluator score (1 - score) and swaps pass/fail verdict */ + /** When true, inverts the grader score (1 - score) and swaps pass/fail verdict */ readonly negate?: boolean; /** Default argument matching mode for all expected items (defaults to 'exact') */ readonly argsMatch?: ArgsMatchMode | readonly string[]; diff --git a/packages/core/src/evaluation/types.ts b/packages/core/src/evaluation/types.ts index 333a3bd88..6a7739216 100644 --- a/packages/core/src/evaluation/types.ts +++ b/packages/core/src/evaluation/types.ts @@ -1,4 +1,4 @@ -import type { TokenUsage, ToolTrajectoryEvaluatorConfig, TraceSummary } from './trace.js'; +import type { TokenUsage, ToolTrajectoryGraderConfig, TraceSummary } from './trace.js'; /** A single assertion verdict with optional evidence. 
*/ export interface AssertionEntry { @@ -163,7 +163,7 @@ export function isTestMessage(value: unknown): value is TestMessage { return false; } -const EVALUATOR_KIND_VALUES = [ +const GRADER_KIND_VALUES = [ 'code-grader', 'llm-grader', 'rubric', @@ -190,12 +190,12 @@ const EVALUATOR_KIND_VALUES = [ 'inline-assert', ] as const; -export type EvaluatorKind = (typeof EVALUATOR_KIND_VALUES)[number]; +export type GraderKind = (typeof GRADER_KIND_VALUES)[number]; -const EVALUATOR_KIND_SET: ReadonlySet = new Set(EVALUATOR_KIND_VALUES); +const GRADER_KIND_SET: ReadonlySet = new Set(GRADER_KIND_VALUES); -export function isEvaluatorKind(value: unknown): value is EvaluatorKind { - return typeof value === 'string' && EVALUATOR_KIND_SET.has(value); +export function isGraderKind(value: unknown): value is GraderKind { + return typeof value === 'string' && GRADER_KIND_SET.has(value); } /** @@ -361,7 +361,7 @@ export type WorkspaceConfig = { readonly workspaceFileDir?: string; }; -export type CodeEvaluatorConfig = { +export type CodeGraderConfig = { readonly name: string; readonly type: 'code-grader'; readonly command: readonly string[]; @@ -374,7 +374,7 @@ export type CodeEvaluatorConfig = { readonly required?: boolean | number; /** Minimum score (0-1) for this evaluator to pass. Independent of `required` gate. */ readonly min_score?: number; - /** When true, inverts the evaluator score (1 - score) and swaps pass/fail verdict */ + /** When true, inverts the grader score (1 - score) and swaps pass/fail verdict */ readonly negate?: boolean; /** Pass-through configuration for the code-grader (any unrecognized YAML properties) */ readonly config?: JsonObject; @@ -406,7 +406,7 @@ export type ContentPreprocessorConfig = { readonly resolvedCommand?: readonly string[]; }; -export type LlmGraderEvaluatorConfig = { +export type LlmGraderConfig = { readonly name: string; readonly type: 'llm-grader'; /** Text prompt (inline or file path) or executable script config */ @@ -421,7 +421,7 @@ export type LlmGraderEvaluatorConfig = { readonly required?: boolean | number; /** Minimum score (0-1) for this evaluator to pass. Independent of `required` gate. */ readonly min_score?: number; - /** When true, inverts the evaluator score (1 - score) and swaps pass/fail verdict */ + /** When true, inverts the grader score (1 - score) and swaps pass/fail verdict */ readonly negate?: boolean; /** Optional target override for this grader (uses a named LLM target from targets.yaml). */ readonly target?: string; @@ -435,9 +435,6 @@ export type LlmGraderEvaluatorConfig = { readonly preprocessors?: readonly ContentPreprocessorConfig[]; }; -/** @deprecated Use `LlmGraderEvaluatorConfig` instead */ -export type LlmJudgeEvaluatorConfig = LlmGraderEvaluatorConfig; - /** * Score range definition for analytic rubric scoring. * Each range maps an integer score band (0-10) to an outcome description. @@ -496,16 +493,16 @@ export type CompositeAggregatorConfig = } | { readonly type: 'threshold'; readonly threshold: number }; -export type CompositeEvaluatorConfig = { +export type CompositeGraderConfig = { readonly name: string; readonly type: 'composite'; - readonly assertions: readonly EvaluatorConfig[]; + readonly assertions: readonly GraderConfig[]; readonly aggregator: CompositeAggregatorConfig; readonly weight?: number; readonly required?: boolean | number; /** Minimum score (0-1) for this evaluator to pass. Independent of `required` gate. 
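A small usage sketch of the renamed guard, assuming isGraderKind is re-exported from the package entry point; the snake_case input mirrors the normalization the eval validator later in this patch applies before calling it.

import { isGraderKind } from '@agentv/eval';

const raw = 'contains_any';
const normalized = raw.replace(/_/g, '-'); // snake_case to kebab-case, as in eval-validator.ts
if (isGraderKind(normalized)) {
  // normalized is narrowed to GraderKind here ('contains-any');
  // unknown strings fall through so callers can still consult
  // custom discovered grader types separately
}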
*/ readonly min_score?: number; - /** When true, inverts the evaluator score (1 - score) and swaps pass/fail verdict */ + /** When true, inverts the grader score (1 - score) and swaps pass/fail verdict */ readonly negate?: boolean; }; @@ -544,7 +541,7 @@ export type FieldConfig = { /** * Configuration for the field-accuracy evaluator. */ -export type FieldAccuracyEvaluatorConfig = { +export type FieldAccuracyGraderConfig = { readonly name: string; readonly type: 'field-accuracy'; /** Fields to compare between candidate and expected */ @@ -555,7 +552,7 @@ export type FieldAccuracyEvaluatorConfig = { readonly required?: boolean | number; /** Minimum score (0-1) for this evaluator to pass. Independent of `required` gate. */ readonly min_score?: number; - /** When true, inverts the evaluator score (1 - score) and swaps pass/fail verdict */ + /** When true, inverts the grader score (1 - score) and swaps pass/fail verdict */ readonly negate?: boolean; }; @@ -563,7 +560,7 @@ export type FieldAccuracyEvaluatorConfig = { * Configuration for the latency evaluator. * Checks execution duration against a threshold. */ -export type LatencyEvaluatorConfig = { +export type LatencyGraderConfig = { readonly name: string; readonly type: 'latency'; /** Maximum allowed duration in milliseconds */ @@ -572,7 +569,7 @@ export type LatencyEvaluatorConfig = { readonly required?: boolean | number; /** Minimum score (0-1) for this evaluator to pass. Independent of `required` gate. */ readonly min_score?: number; - /** When true, inverts the evaluator score (1 - score) and swaps pass/fail verdict */ + /** When true, inverts the grader score (1 - score) and swaps pass/fail verdict */ readonly negate?: boolean; }; @@ -580,7 +577,7 @@ export type LatencyEvaluatorConfig = { * Configuration for the cost evaluator. * Checks execution cost against a budget. */ -export type CostEvaluatorConfig = { +export type CostGraderConfig = { readonly name: string; readonly type: 'cost'; /** Maximum allowed cost in USD */ @@ -589,7 +586,7 @@ export type CostEvaluatorConfig = { readonly required?: boolean | number; /** Minimum score (0-1) for this evaluator to pass. Independent of `required` gate. */ readonly min_score?: number; - /** When true, inverts the evaluator score (1 - score) and swaps pass/fail verdict */ + /** When true, inverts the grader score (1 - score) and swaps pass/fail verdict */ readonly negate?: boolean; }; @@ -597,7 +594,7 @@ export type CostEvaluatorConfig = { * Configuration for the token-usage evaluator. * Checks provider-reported token usage against configured limits. */ -export type TokenUsageEvaluatorConfig = { +export type TokenUsageGraderConfig = { readonly name: string; readonly type: 'token-usage'; /** Maximum allowed total tokens (input + output + cached, when present) */ @@ -610,7 +607,7 @@ export type TokenUsageEvaluatorConfig = { readonly required?: boolean | number; /** Minimum score (0-1) for this evaluator to pass. Independent of `required` gate. */ readonly min_score?: number; - /** When true, inverts the evaluator score (1 - score) and swaps pass/fail verdict */ + /** When true, inverts the grader score (1 - score) and swaps pass/fail verdict */ readonly negate?: boolean; }; @@ -619,7 +616,7 @@ export type TokenUsageEvaluatorConfig = { * Provides declarative threshold-based checks on execution metrics. * Only specified thresholds are checked; omitted ones are ignored. 
*/ -export type ExecutionMetricsEvaluatorConfig = { +export type ExecutionMetricsGraderConfig = { readonly name: string; readonly type: 'execution-metrics'; /** Maximum allowed number of tool calls */ @@ -640,7 +637,7 @@ export type ExecutionMetricsEvaluatorConfig = { readonly required?: boolean | number; /** Minimum score (0-1) for this evaluator to pass. Independent of `required` gate. */ readonly min_score?: number; - /** When true, inverts the evaluator score (1 - score) and swaps pass/fail verdict */ + /** When true, inverts the grader score (1 - score) and swaps pass/fail verdict */ readonly negate?: boolean; }; @@ -648,7 +645,7 @@ export type ExecutionMetricsEvaluatorConfig = { * Configuration for the contains assertion evaluator. * Checks whether the candidate output contains a specified substring. */ -export type ContainsEvaluatorConfig = { +export type ContainsGraderConfig = { readonly name: string; readonly type: 'contains'; readonly value: string; @@ -656,7 +653,7 @@ export type ContainsEvaluatorConfig = { readonly required?: boolean | number; /** Minimum score (0-1) for this evaluator to pass. Independent of `required` gate. */ readonly min_score?: number; - /** When true, inverts the evaluator score (1 - score) and swaps pass/fail verdict */ + /** When true, inverts the grader score (1 - score) and swaps pass/fail verdict */ readonly negate?: boolean; }; @@ -664,7 +661,7 @@ export type ContainsEvaluatorConfig = { * Configuration for the contains_any assertion evaluator. * Checks whether the candidate output contains ANY of the specified substrings. */ -export type ContainsAnyEvaluatorConfig = { +export type ContainsAnyGraderConfig = { readonly name: string; readonly type: 'contains-any'; readonly value: readonly string[]; @@ -672,7 +669,7 @@ export type ContainsAnyEvaluatorConfig = { readonly required?: boolean | number; /** Minimum score (0-1) for this evaluator to pass. Independent of `required` gate. */ readonly min_score?: number; - /** When true, inverts the evaluator score (1 - score) and swaps pass/fail verdict */ + /** When true, inverts the grader score (1 - score) and swaps pass/fail verdict */ readonly negate?: boolean; }; @@ -680,7 +677,7 @@ export type ContainsAnyEvaluatorConfig = { * Configuration for the contains_all assertion evaluator. * Checks whether the candidate output contains ALL of the specified substrings. */ -export type ContainsAllEvaluatorConfig = { +export type ContainsAllGraderConfig = { readonly name: string; readonly type: 'contains-all'; readonly value: readonly string[]; @@ -688,7 +685,7 @@ export type ContainsAllEvaluatorConfig = { readonly required?: boolean | number; /** Minimum score (0-1) for this evaluator to pass. Independent of `required` gate. */ readonly min_score?: number; - /** When true, inverts the evaluator score (1 - score) and swaps pass/fail verdict */ + /** When true, inverts the grader score (1 - score) and swaps pass/fail verdict */ readonly negate?: boolean; }; @@ -696,7 +693,7 @@ export type ContainsAllEvaluatorConfig = { * Configuration for the icontains assertion evaluator. * Case-insensitive check whether the candidate output contains a specified substring. */ -export type IcontainsEvaluatorConfig = { +export type IcontainsGraderConfig = { readonly name: string; readonly type: 'icontains'; readonly value: string; @@ -704,7 +701,7 @@ export type IcontainsEvaluatorConfig = { readonly required?: boolean | number; /** Minimum score (0-1) for this evaluator to pass. Independent of `required` gate. 
*/ readonly min_score?: number; - /** When true, inverts the evaluator score (1 - score) and swaps pass/fail verdict */ + /** When true, inverts the grader score (1 - score) and swaps pass/fail verdict */ readonly negate?: boolean; }; @@ -712,7 +709,7 @@ export type IcontainsEvaluatorConfig = { * Configuration for the icontains_any assertion evaluator. * Case-insensitive check whether the candidate output contains ANY of the specified substrings. */ -export type IcontainsAnyEvaluatorConfig = { +export type IcontainsAnyGraderConfig = { readonly name: string; readonly type: 'icontains-any'; readonly value: readonly string[]; @@ -720,7 +717,7 @@ export type IcontainsAnyEvaluatorConfig = { readonly required?: boolean | number; /** Minimum score (0-1) for this evaluator to pass. Independent of `required` gate. */ readonly min_score?: number; - /** When true, inverts the evaluator score (1 - score) and swaps pass/fail verdict */ + /** When true, inverts the grader score (1 - score) and swaps pass/fail verdict */ readonly negate?: boolean; }; @@ -728,7 +725,7 @@ export type IcontainsAnyEvaluatorConfig = { * Configuration for the icontains_all assertion evaluator. * Case-insensitive check whether the candidate output contains ALL of the specified substrings. */ -export type IcontainsAllEvaluatorConfig = { +export type IcontainsAllGraderConfig = { readonly name: string; readonly type: 'icontains-all'; readonly value: readonly string[]; @@ -736,7 +733,7 @@ export type IcontainsAllEvaluatorConfig = { readonly required?: boolean | number; /** Minimum score (0-1) for this evaluator to pass. Independent of `required` gate. */ readonly min_score?: number; - /** When true, inverts the evaluator score (1 - score) and swaps pass/fail verdict */ + /** When true, inverts the grader score (1 - score) and swaps pass/fail verdict */ readonly negate?: boolean; }; @@ -744,7 +741,7 @@ export type IcontainsAllEvaluatorConfig = { * Configuration for the starts_with assertion evaluator. * Checks whether the candidate output starts with a specified string (both trimmed). */ -export type StartsWithEvaluatorConfig = { +export type StartsWithGraderConfig = { readonly name: string; readonly type: 'starts-with'; readonly value: string; @@ -752,7 +749,7 @@ export type StartsWithEvaluatorConfig = { readonly required?: boolean | number; /** Minimum score (0-1) for this evaluator to pass. Independent of `required` gate. */ readonly min_score?: number; - /** When true, inverts the evaluator score (1 - score) and swaps pass/fail verdict */ + /** When true, inverts the grader score (1 - score) and swaps pass/fail verdict */ readonly negate?: boolean; }; @@ -760,7 +757,7 @@ export type StartsWithEvaluatorConfig = { * Configuration for the ends_with assertion evaluator. * Checks whether the candidate output ends with a specified string (both trimmed). */ -export type EndsWithEvaluatorConfig = { +export type EndsWithGraderConfig = { readonly name: string; readonly type: 'ends-with'; readonly value: string; @@ -768,7 +765,7 @@ export type EndsWithEvaluatorConfig = { readonly required?: boolean | number; /** Minimum score (0-1) for this evaluator to pass. Independent of `required` gate. */ readonly min_score?: number; - /** When true, inverts the evaluator score (1 - score) and swaps pass/fail verdict */ + /** When true, inverts the grader score (1 - score) and swaps pass/fail verdict */ readonly negate?: boolean; }; @@ -776,7 +773,7 @@ export type EndsWithEvaluatorConfig = { * Configuration for the regex assertion evaluator. 
* Checks whether the candidate output matches a regular expression pattern. */ -export type RegexEvaluatorConfig = { +export type RegexGraderConfig = { readonly name: string; readonly type: 'regex'; readonly value: string; @@ -786,7 +783,7 @@ export type RegexEvaluatorConfig = { readonly required?: boolean | number; /** Minimum score (0-1) for this evaluator to pass. Independent of `required` gate. */ readonly min_score?: number; - /** When true, inverts the evaluator score (1 - score) and swaps pass/fail verdict */ + /** When true, inverts the grader score (1 - score) and swaps pass/fail verdict */ readonly negate?: boolean; }; @@ -794,14 +791,14 @@ export type RegexEvaluatorConfig = { * Configuration for the is_json assertion evaluator. * Checks whether the candidate output is valid JSON. */ -export type IsJsonEvaluatorConfig = { +export type IsJsonGraderConfig = { readonly name: string; readonly type: 'is-json'; readonly weight?: number; readonly required?: boolean | number; /** Minimum score (0-1) for this evaluator to pass. Independent of `required` gate. */ readonly min_score?: number; - /** When true, inverts the evaluator score (1 - score) and swaps pass/fail verdict */ + /** When true, inverts the grader score (1 - score) and swaps pass/fail verdict */ readonly negate?: boolean; }; @@ -809,7 +806,7 @@ export type IsJsonEvaluatorConfig = { * Configuration for the equals assertion evaluator. * Checks whether the candidate output exactly equals a specified string. */ -export type EqualsEvaluatorConfig = { +export type EqualsGraderConfig = { readonly name: string; readonly type: 'equals'; readonly value: string; @@ -817,7 +814,7 @@ export type EqualsEvaluatorConfig = { readonly required?: boolean | number; /** Minimum score (0-1) for this evaluator to pass. Independent of `required` gate. */ readonly min_score?: number; - /** When true, inverts the evaluator score (1 - score) and swaps pass/fail verdict */ + /** When true, inverts the grader score (1 - score) and swaps pass/fail verdict */ readonly negate?: boolean; }; @@ -833,7 +830,7 @@ export type RubricsEvaluatorConfig = { readonly required?: boolean | number; /** Minimum score (0-1) for this evaluator to pass. Independent of `required` gate. */ readonly min_score?: number; - /** When true, inverts the evaluator score (1 - score) and swaps pass/fail verdict */ + /** When true, inverts the grader score (1 - score) and swaps pass/fail verdict */ readonly negate?: boolean; }; @@ -843,7 +840,7 @@ export type RubricsEvaluatorConfig = { * Tool-name resolution is automatic based on the provider kind. * For providers not covered by the built-in mapping, use a code-grader. 
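Taken together, the renamed assertion config types keep their YAML-facing shape. A sketch using only fields visible in these hunks (name, type, value, required, negate); the `flags` field is inferred from the regex factory earlier in this patch, and the example values are invented.

import type { GraderConfig } from '@agentv/eval'; // assumed re-export of the union

const checks: GraderConfig[] = [
  { name: 'valid-json', type: 'is-json', required: true },
  { name: 'semver', type: 'regex', value: '^\\d+\\.\\d+\\.\\d+$', flags: 'm' },
  // negate inverts the score, so this passes only when 'sorry' is absent
  { name: 'no-apology', type: 'icontains', value: 'sorry', negate: true },
];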
*/ -export type SkillTriggerEvaluatorConfig = { +export type SkillTriggerGraderConfig = { readonly name: string; readonly type: 'skill-trigger'; /** The skill name to check for (case-sensitive substring match) */ @@ -871,28 +868,28 @@ export type InlineAssertEvaluatorConfig = { readonly negate?: boolean; }; -export type EvaluatorConfig = - | CodeEvaluatorConfig - | LlmGraderEvaluatorConfig - | CompositeEvaluatorConfig - | ToolTrajectoryEvaluatorConfig - | FieldAccuracyEvaluatorConfig - | LatencyEvaluatorConfig - | CostEvaluatorConfig - | TokenUsageEvaluatorConfig - | ExecutionMetricsEvaluatorConfig - | SkillTriggerEvaluatorConfig - | ContainsEvaluatorConfig - | ContainsAnyEvaluatorConfig - | ContainsAllEvaluatorConfig - | IcontainsEvaluatorConfig - | IcontainsAnyEvaluatorConfig - | IcontainsAllEvaluatorConfig - | StartsWithEvaluatorConfig - | EndsWithEvaluatorConfig - | RegexEvaluatorConfig - | IsJsonEvaluatorConfig - | EqualsEvaluatorConfig +export type GraderConfig = + | CodeGraderConfig + | LlmGraderConfig + | CompositeGraderConfig + | ToolTrajectoryGraderConfig + | FieldAccuracyGraderConfig + | LatencyGraderConfig + | CostGraderConfig + | TokenUsageGraderConfig + | ExecutionMetricsGraderConfig + | SkillTriggerGraderConfig + | ContainsGraderConfig + | ContainsAnyGraderConfig + | ContainsAllGraderConfig + | IcontainsGraderConfig + | IcontainsAnyGraderConfig + | IcontainsAllGraderConfig + | StartsWithGraderConfig + | EndsWithGraderConfig + | RegexGraderConfig + | IsJsonGraderConfig + | EqualsGraderConfig | RubricsEvaluatorConfig | InlineAssertEvaluatorConfig; @@ -906,7 +903,7 @@ export interface ConversationTurn { /** Reference assistant response for grading (NOT carried forward — actual LLM response is used) */ readonly expected_output?: TestMessageContent; /** Per-turn assertions. Strings become rubric criteria via shorthand. */ - readonly assertions?: readonly (string | EvaluatorConfig)[]; + readonly assertions?: readonly (string | GraderConfig)[]; } /** @@ -945,8 +942,8 @@ export interface EvalTest { readonly reference_answer?: string; readonly file_paths: readonly string[]; readonly criteria: string; - readonly evaluator?: EvaluatorKind; - readonly assertions?: readonly EvaluatorConfig[]; + readonly evaluator?: GraderKind; + readonly assertions?: readonly GraderConfig[]; /** Suite-level preprocessors used by the implicit default llm-grader. */ readonly preprocessors?: readonly ContentPreprocessorConfig[]; /** Workspace configuration (merged from suite-level and case-level) */ @@ -1016,7 +1013,7 @@ export interface TrialResult { readonly attempt: number; readonly score: number; readonly verdict: EvaluationVerdict; - readonly scores?: readonly EvaluatorResult[]; + readonly scores?: readonly GraderResult[]; readonly error?: string; readonly costUsd?: number; /** Primary classification for this trial attempt */ @@ -1091,7 +1088,7 @@ export interface ExecutionError { export type FailOnError = boolean; /** - * Evaluator scorecard for a single eval case run. + * Grader scorecard for a single eval case run. 
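Since per-turn string assertions become rubric criteria via the shorthand noted on ConversationTurn above, while structured entries pass through unchanged, a sketch of a mixed list. Only the assertions member is typed here because the turn's other fields sit outside this hunk; the criteria text is invented.

import type { ConversationTurn } from '@agentv/eval'; // assumed re-export

const turnAssertions: ConversationTurn['assertions'] = [
  // string shorthand: grouped into a single llm-grader with rubric criteria
  'acknowledges the refund request',
  'does not promise a specific refund date',
  // structured config: passed through as-is
  { name: 'mentions-policy', type: 'icontains', value: 'policy' },
];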
*/ export interface EvaluationResult { readonly timestamp: string; @@ -1122,7 +1119,7 @@ export interface EvaluationResult { readonly lm?: JsonObject; readonly evaluator?: JsonObject; }; - readonly scores?: readonly EvaluatorResult[]; + readonly scores?: readonly GraderResult[]; readonly error?: string; /** Lightweight summary of the execution trace (always included when available) */ readonly trace?: TraceSummary; @@ -1167,9 +1164,9 @@ export interface EvaluationResult { export type EvaluationVerdict = 'pass' | 'fail' | 'skip'; -export interface EvaluatorResult { +export interface GraderResult { readonly name: string; - readonly type: EvaluatorKind; + readonly type: GraderKind; readonly score: number; readonly weight?: number; readonly verdict?: EvaluationVerdict; @@ -1178,7 +1175,7 @@ export interface EvaluatorResult { readonly input?: JsonObject; /** Target name used for grading (e.g., the LLM provider name). */ readonly target?: string; - readonly scores?: readonly EvaluatorResult[]; + readonly scores?: readonly GraderResult[]; /** Optional structured details from code graders (e.g., TP/TN/FP/FN counts). */ readonly details?: JsonObject; /** Token usage from LLM calls made by this evaluator (optional). */ diff --git a/packages/core/src/evaluation/validation/eval-file.schema.ts b/packages/core/src/evaluation/validation/eval-file.schema.ts index 92f338c83..4cb155d2c 100644 --- a/packages/core/src/evaluation/validation/eval-file.schema.ts +++ b/packages/core/src/evaluation/validation/eval-file.schema.ts @@ -32,7 +32,7 @@ const InputSchema = z.union([z.string(), z.array(MessageSchema)]); const ExpectedOutputSchema = z.union([z.string(), z.record(z.unknown()), z.array(MessageSchema)]); // --------------------------------------------------------------------------- -// Evaluator schemas (YAML input format) +// Grader schemas (YAML input format) // --------------------------------------------------------------------------- /** Common fields shared by all evaluators */ @@ -230,7 +230,7 @@ const RubricsSchema = EvaluatorCommonSchema.extend({ criteria: z.array(RubricItemSchema).min(1), }); -/** Union of all evaluator types */ +/** Union of all grader types */ const EvaluatorSchema = z.union([ CodeGraderSchema, LlmGraderSchema, diff --git a/packages/core/src/evaluation/validation/eval-validator.ts b/packages/core/src/evaluation/validation/eval-validator.ts index f22f82e15..f5e8af962 100644 --- a/packages/core/src/evaluation/validation/eval-validator.ts +++ b/packages/core/src/evaluation/validation/eval-validator.ts @@ -4,14 +4,14 @@ import { parse } from 'yaml'; import { interpolateEnv } from '../interpolation.js'; import { loadCasesFromFile } from '../loaders/case-file-loader.js'; -import { isEvaluatorKind } from '../types.js'; +import { isGraderKind } from '../types.js'; import type { ValidationError, ValidationResult } from './types.js'; type JsonValue = string | number | boolean | null | JsonObject | JsonArray; type JsonObject = { readonly [key: string]: JsonValue }; type JsonArray = readonly JsonValue[]; -/** Assertion evaluator types that require a string `value` field. */ +/** Assertion grader types that require a string `value` field. */ const ASSERTION_TYPES_WITH_STRING_VALUE = new Set([ 'contains', 'icontains', @@ -20,7 +20,7 @@ const ASSERTION_TYPES_WITH_STRING_VALUE = new Set([ 'equals', 'regex', ]); -/** Assertion evaluator types that require a string[] `value` field. */ +/** Assertion grader types that require a string[] `value` field. 
*/ const ASSERTION_TYPES_WITH_ARRAY_VALUE = new Set([ 'contains-any', 'contains-all', @@ -756,7 +756,7 @@ function validateAssertArray( // Normalize snake_case to kebab-case for backward compatibility const typeValue = rawTypeValue.replace(/_/g, '-'); - if (!isEvaluatorKind(typeValue) && !customAssertionTypes.has(typeValue)) { + if (!isGraderKind(typeValue) && !customAssertionTypes.has(typeValue)) { errors.push({ severity: 'warning', filePath, diff --git a/packages/core/src/evaluation/validation/prompt-validator.ts b/packages/core/src/evaluation/validation/prompt-validator.ts index 8f8101809..14ddf363e 100644 --- a/packages/core/src/evaluation/validation/prompt-validator.ts +++ b/packages/core/src/evaluation/validation/prompt-validator.ts @@ -72,7 +72,7 @@ export function validateTemplateVariables(content: string, source: string): void // WARNING: Invalid variables - show warning but continue if (invalidVariables.length > 0) { - const warningMessage = `${ANSI_YELLOW}Warning: Custom evaluator template at ${source} + const warningMessage = `${ANSI_YELLOW}Warning: Custom grader template at ${source} Contains invalid variables: ${invalidVariables.map((v) => `{{ ${v} }}`).join(', ')} Valid variables: ${Array.from(VALID_TEMPLATE_VARIABLES) .map((v) => `{{ ${v} }}`) diff --git a/packages/core/src/evaluation/yaml-parser.ts b/packages/core/src/evaluation/yaml-parser.ts index 569000496..6b6c8f004 100644 --- a/packages/core/src/evaluation/yaml-parser.ts +++ b/packages/core/src/evaluation/yaml-parser.ts @@ -20,14 +20,14 @@ import { extractWorkersFromSuite, loadConfig, } from './loaders/config-loader.js'; +import { buildSearchRoots, resolveToAbsolutePath } from './loaders/file-resolver.js'; import { coerceEvaluator, - parseEvaluators, + parseGraders, parseInlineRubrics, parsePreprocessors, warnUnconsumedCriteria, -} from './loaders/evaluator-parser.js'; -import { buildSearchRoots, resolveToAbsolutePath } from './loaders/file-resolver.js'; +} from './loaders/grader-parser.js'; import { detectFormat, loadTestsFromJsonl } from './loaders/jsonl-parser.js'; import { processExpectedMessages, processMessages } from './loaders/message-processor.js'; import { @@ -42,7 +42,7 @@ import type { ConversationTurn, DockerWorkspaceConfig, EvalTest, - EvaluatorConfig, + GraderConfig, JsonObject, JsonValue, RepoConfig, @@ -482,9 +482,9 @@ async function loadTestsFromYaml( .join(' '); const testCaseEvaluatorKind = coerceEvaluator(testCaseConfig.evaluator, id) ?? 
globalEvaluator; - let evaluators: Awaited<ReturnType<typeof parseEvaluators>>; + let evaluators: Awaited<ReturnType<typeof parseGraders>>; try { - evaluators = await parseEvaluators( + evaluators = await parseGraders( testCaseConfig, globalExecution, searchRoots, @@ -624,12 +624,12 @@ function parseTurns(rawTurns: readonly unknown[]): ConversationTurn[] { const expectedOutput = turn.expected_output as TestMessageContent | undefined; // Parse per-turn assertions (string shorthand or structured evaluator config) - let assertions: (string | EvaluatorConfig)[] | undefined; + let assertions: (string | GraderConfig)[] | undefined; if (Array.isArray(turn.assertions)) { assertions = turn.assertions.map((a: unknown) => { if (typeof a === 'string') return a; // Structured evaluator config — pass through as-is (validated by Zod schema) - return a as EvaluatorConfig; + return a as GraderConfig; }); } diff --git a/packages/core/src/import/transcript-provider.ts b/packages/core/src/import/transcript-provider.ts index 2437c1f3f..376d2f58e 100644 --- a/packages/core/src/import/transcript-provider.ts +++ b/packages/core/src/import/transcript-provider.ts @@ -8,7 +8,7 @@ * 1. Reads a transcript JSONL file (produced by `agentv import`) * 2. Each invocation pops the next line from the transcript * 3. Returns a ProviderResponse with pre-populated output, token usage, etc. - * 4. Evaluators run identically to live eval — they see the same ProviderResponse + * 4. Graders run identically to live eval — they see the same ProviderResponse * * The provider name in results is set to the source provider from the transcript * (e.g., "claude", "codex", "copilot"). diff --git a/packages/core/src/index.ts b/packages/core/src/index.ts index 656fd4c26..3fcae757c 100644 --- a/packages/core/src/index.ts +++ b/packages/core/src/index.ts @@ -23,7 +23,7 @@ export type { } from './evaluation/loaders/eval-yaml-transpiler.js'; export * from './evaluation/file-utils.js'; export * from './evaluation/providers/index.js'; -export * from './evaluation/evaluators.js'; +export * from './evaluation/graders.js'; export * from './evaluation/orchestrator.js'; export { evaluate, @@ -99,14 +99,14 @@ export * from './observability/index.js'; // Registry exports export { - EvaluatorRegistry, - DeterministicAssertionEvaluator, -} from './evaluation/registry/evaluator-registry.js'; + GraderRegistry, + DeterministicAssertionGrader, +} from './evaluation/registry/grader-registry.js'; export type { - EvaluatorDispatchContext, - EvaluatorFactoryFn, -} from './evaluation/registry/evaluator-registry.js'; -export { createBuiltinRegistry } from './evaluation/registry/builtin-evaluators.js'; + GraderDispatchContext, + GraderFactoryFn, +} from './evaluation/registry/grader-registry.js'; +export { createBuiltinRegistry } from './evaluation/registry/builtin-graders.js'; export { discoverAssertions } from './evaluation/registry/assertion-discovery.js'; export { runContainsAssertion, @@ -121,11 +121,8 @@ export { runIsJsonAssertion, runEqualsAssertion, type AssertionResult, -} from './evaluation/evaluators/assertions.js'; -export { - discoverGraders, - discoverGraders as discoverJudges, -} from './evaluation/registry/grader-discovery.js'; +} from './evaluation/graders/assertions.js'; +export { discoverGraders } from './evaluation/registry/grader-discovery.js'; // Import pipeline export * from './import/index.js'; diff --git a/packages/core/test/evaluation/baseline.test.ts b/packages/core/test/evaluation/baseline.test.ts index 2174f1eb4..28ccc0443 100644 --- a/packages/core/test/evaluation/baseline.test.ts +++
b/packages/core/test/evaluation/baseline.test.ts @@ -1,6 +1,6 @@ import { describe, expect, it } from 'vitest'; import { trimBaselineResult } from '../../src/evaluation/baseline.js'; -import type { EvaluationResult, EvaluatorResult } from '../../src/evaluation/types.js'; +import type { EvaluationResult, GraderResult } from '../../src/evaluation/types.js'; function makeFullResult(overrides: Partial<EvaluationResult> = {}): EvaluationResult { return { @@ -34,7 +34,7 @@ function makeFullResult(overrides: Partial<EvaluationResult> = {}): EvaluationRe }; } -function makeEvaluatorResult(overrides: Partial<EvaluatorResult> = {}): EvaluatorResult { +function makeEvaluatorResult(overrides: Partial<GraderResult> = {}): GraderResult { return { name: 'test-evaluator', type: 'llm-grader', @@ -78,7 +78,7 @@ describe('trimBaselineResult', () => { expect(trimmed.error).toBe('something went wrong'); }); - it('trims evaluator results', () => { + it('trims grader results', () => { const evaluatorResult = makeEvaluatorResult(); const full = makeFullResult({ scores: [evaluatorResult] }); const trimmed = trimBaselineResult(full); @@ -97,7 +97,7 @@ describe('trimBaselineResult', () => { expect(er.input).toBeUndefined(); }); - it('recursively trims composite evaluator results', () => { + it('recursively trims composite grader results', () => { const inner = makeEvaluatorResult({ name: 'inner' }); const composite = makeEvaluatorResult({ name: 'composite', @@ -129,7 +129,7 @@ describe('trimBaselineResult', () => { expect(JSON.stringify(full)).toBe(originalJson); }); - it('handles result with no evaluator results', () => { + it('handles result with no grader results', () => { const full = makeFullResult(); const trimmed = trimBaselineResult(full); expect(trimmed.scores).toBeUndefined(); diff --git a/packages/core/test/evaluation/code-evaluator-file-backed.test.ts b/packages/core/test/evaluation/code-grader-file-backed.test.ts similarity index 92% rename from packages/core/test/evaluation/code-evaluator-file-backed.test.ts rename to packages/core/test/evaluation/code-grader-file-backed.test.ts index 6bae33521..459f60118 100644 --- a/packages/core/test/evaluation/code-evaluator-file-backed.test.ts +++ b/packages/core/test/evaluation/code-grader-file-backed.test.ts @@ -4,7 +4,7 @@ import { mkdtemp, rm, writeFile } from 'node:fs/promises'; import { tmpdir } from 'node:os'; import { join } from 'node:path'; -import { CodeEvaluator } from '../../src/evaluation/evaluators/code-evaluator.js'; +import { CodeGrader } from '../../src/evaluation/graders/code-grader.js'; import type { EvalTest } from '../../src/evaluation/types.js'; const baseTestCase: EvalTest = { @@ -49,7 +49,7 @@ async function createScoringGrader(dir: string): Promise<string[]> { return [process.execPath, script]; } -describe('CodeEvaluator file-backed output', () => { +describe('CodeGrader file-backed output', () => { let tmpDir: string; beforeEach(async () => { @@ -64,7 +64,7 @@ const command = await createEchoGrader(tmpDir); const smallOutput = [{ role: 'assistant' as const, content: 'short response' }]; - const evaluator = new CodeEvaluator({ command }); + const evaluator = new CodeGrader({ command }); const result = await evaluator.evaluate({ evalCase: baseTestCase, candidate: 'answer', @@ -81,7 +81,7 @@ const largeContent = 'x'.repeat(60_000); const largeOutput = [{ role: 'assistant' as const, content: largeContent }]; - const evaluator = new CodeEvaluator({ command }); + const evaluator = new CodeGrader({ command }); const
result = await evaluator.evaluate({ evalCase: baseTestCase, candidate: 'answer', @@ -102,7 +102,7 @@ describe('CodeEvaluator file-backed output', () => { const largeContent = 'x'.repeat(60_000); const largeOutput = [{ role: 'assistant' as const, content: largeContent }]; - const evaluator = new CodeEvaluator({ command }); + const evaluator = new CodeGrader({ command }); const result = await evaluator.evaluate({ evalCase: baseTestCase, candidate: 'answer', diff --git a/packages/core/test/evaluation/code-evaluator-multimodal.test.ts b/packages/core/test/evaluation/code-grader-multimodal.test.ts similarity index 97% rename from packages/core/test/evaluation/code-evaluator-multimodal.test.ts rename to packages/core/test/evaluation/code-grader-multimodal.test.ts index 78784e6f2..25f92711d 100644 --- a/packages/core/test/evaluation/code-evaluator-multimodal.test.ts +++ b/packages/core/test/evaluation/code-grader-multimodal.test.ts @@ -4,8 +4,8 @@ import { mkdtemp, rm, writeFile } from 'node:fs/promises'; import { tmpdir } from 'node:os'; import { join } from 'node:path'; -import { materializeContentForGrader } from '../../src/evaluation/evaluators/code-evaluator.js'; -import { CodeEvaluator } from '../../src/evaluation/evaluators/code-evaluator.js'; +import { materializeContentForGrader } from '../../src/evaluation/graders/code-grader.js'; +import { CodeGrader } from '../../src/evaluation/graders/code-grader.js'; import type { EvalTest } from '../../src/evaluation/types.js'; const baseTestCase: EvalTest = { @@ -244,7 +244,7 @@ describe('materializeContentForGrader', () => { }); }); -describe('CodeEvaluator multimodal integration', () => { +describe('CodeGrader multimodal integration', () => { let tmpDir: string; beforeEach(async () => { @@ -259,7 +259,7 @@ describe('CodeEvaluator multimodal integration', () => { const command = await createPayloadEchoGrader(tmpDir); const output = [{ role: 'assistant' as const, content: 'Hello world' }]; - const evaluator = new CodeEvaluator({ command }); + const evaluator = new CodeGrader({ command }); const result = await evaluator.evaluate({ evalCase: baseTestCase, candidate: 'answer', @@ -288,7 +288,7 @@ describe('CodeEvaluator multimodal integration', () => { }, ]; - const evaluator = new CodeEvaluator({ command }); + const evaluator = new CodeGrader({ command }); const result = await evaluator.evaluate({ evalCase: baseTestCase, candidate: 'answer', @@ -324,7 +324,7 @@ describe('CodeEvaluator multimodal integration', () => { }, ]; - const evaluator = new CodeEvaluator({ command }); + const evaluator = new CodeGrader({ command }); await evaluator.evaluate({ evalCase: baseTestCase, candidate: 'answer', diff --git a/packages/core/test/evaluation/evaluators_variables.test.ts b/packages/core/test/evaluation/evaluators_variables.test.ts index d7ea0fdbd..dbd925480 100644 --- a/packages/core/test/evaluation/evaluators_variables.test.ts +++ b/packages/core/test/evaluation/evaluators_variables.test.ts @@ -1,6 +1,6 @@ import { describe, expect, it } from 'bun:test'; -import { LlmGraderEvaluator } from '../../src/evaluation/evaluators.js'; +import { LlmGrader } from '../../src/evaluation/graders.js'; import type { ResolvedTarget } from '../../src/evaluation/providers/targets.js'; import type { Provider, @@ -41,7 +41,7 @@ const baseTarget: ResolvedTarget = { config: { response: '{}' }, }; -describe('LlmGraderEvaluator Variable Substitution', () => { +describe('LlmGrader Variable Substitution', () => { it('substitutes template variables in custom prompt', async () => { 
const formattedQuestion = '@[User]: What is the status?\n\n@[Assistant]: Requesting more info.'; const customPrompt = ` @@ -59,9 +59,9 @@ File Changes: {{file_changes}} }), }); - const evaluator = new LlmGraderEvaluator({ + const evaluator = new LlmGrader({ resolveGraderProvider: async () => graderProvider, - evaluatorTemplate: customPrompt, + graderTemplate: customPrompt, }); const answer = 'Candidate Answer Text'; @@ -111,9 +111,9 @@ Candidate: {{output_text}} }), }); - const evaluator = new LlmGraderEvaluator({ + const evaluator = new LlmGrader({ resolveGraderProvider: async () => graderProvider, - evaluatorTemplate: customPrompt, + graderTemplate: customPrompt, }); await evaluator.evaluate({ @@ -143,9 +143,9 @@ Candidate: {{output_text}} text: JSON.stringify({ score: 0.5, assertions: [] }), }); - const evaluator = new LlmGraderEvaluator({ + const evaluator = new LlmGrader({ resolveGraderProvider: async () => graderProvider, - evaluatorTemplate: customPrompt, + graderTemplate: customPrompt, }); await evaluator.evaluate({ @@ -160,7 +160,7 @@ Candidate: {{output_text}} const request = graderProvider.lastRequest; - // When custom evaluatorTemplate is provided, it goes in user prompt (question) + // When custom graderTemplate is provided, it goes in user prompt (question) expect(request?.question).toContain('Fixed prompt without variables'); // System prompt only contains output schema, not custom template @@ -184,9 +184,9 @@ Candidate: {{ output }} }), }); - const evaluator = new LlmGraderEvaluator({ + const evaluator = new LlmGrader({ resolveGraderProvider: async () => graderProvider, - evaluatorTemplate: customPrompt, + graderTemplate: customPrompt, }); const answer = 'Candidate Answer Text'; @@ -215,7 +215,7 @@ Candidate: {{ output }} }); it('preserves freeform details returned by the grader', async () => { - const evaluator = new LlmGraderEvaluator({ + const evaluator = new LlmGrader({ resolveGraderProvider: async () => undefined, }); @@ -224,7 +224,7 @@ Candidate: {{ output }} parseAgentResult: ( text: string, rubrics: undefined, - evaluatorRawRequest: Record<string, unknown>, + graderRawRequest: Record<string, unknown>, details: Record<string, unknown>, graderTarget?: string, ) => { details?: Record<string, unknown> }; diff --git a/packages/core/test/evaluation/execution-metrics.test.ts b/packages/core/test/evaluation/execution-metrics.test.ts index aa0efecba..83a01e000 100644 --- a/packages/core/test/evaluation/execution-metrics.test.ts +++ b/packages/core/test/evaluation/execution-metrics.test.ts @@ -2,7 +2,7 @@ import { describe, expect, it } from 'bun:test'; import { dirname, join } from 'node:path'; import { fileURLToPath } from 'node:url'; -import { CodeEvaluator } from '../../src/evaluation/evaluators.js'; +import { CodeGrader } from '../../src/evaluation/graders.js'; import type { ResolvedTarget } from '../../src/evaluation/providers/targets.js'; import { type TraceComputeResult, @@ -264,7 +264,7 @@ describe('Code Grader Metrics Integration', () => { const __dirname = dirname(fileURLToPath(import.meta.url)); const script = ['node', join(__dirname, '../fixtures/test-trace-summary.cjs')]; - const evaluator = new CodeEvaluator({ command: script }); + const evaluator = new CodeGrader({ command: script }); const trace: TraceSummary = { eventCount: 3, @@ -303,7 +303,7 @@ describe('Code Grader Metrics Integration', () => { const __dirname = dirname(fileURLToPath(import.meta.url)); const script = ['node', join(__dirname, '../fixtures/test-no-trace-summary.cjs')]; - const evaluator = new CodeEvaluator({ command: script }); + const evaluator = new
CodeGrader({ command: script }); const result = await evaluator.evaluate({ evalCase: baseTestCase, diff --git a/packages/core/test/evaluation/evaluators.test.ts b/packages/core/test/evaluation/graders.test.ts similarity index 93% rename from packages/core/test/evaluation/evaluators.test.ts rename to packages/core/test/evaluation/graders.test.ts index eca0bce73..2f4bb7be8 100644 --- a/packages/core/test/evaluation/evaluators.test.ts +++ b/packages/core/test/evaluation/graders.test.ts @@ -3,20 +3,20 @@ import { dirname, join } from 'node:path'; import { fileURLToPath } from 'node:url'; import { - CodeEvaluator, - CostEvaluator, - FieldAccuracyEvaluator, - LatencyEvaluator, - LlmGraderEvaluator, - TokenUsageEvaluator, -} from '../../src/evaluation/evaluators.js'; + CodeGrader, + CostGrader, + FieldAccuracyGrader, + LatencyGrader, + LlmGrader, + TokenUsageGrader, +} from '../../src/evaluation/graders.js'; import type { ResolvedTarget } from '../../src/evaluation/providers/targets.js'; import type { Provider, ProviderRequest, ProviderResponse, } from '../../src/evaluation/providers/types.js'; -import { llmGraderFactory } from '../../src/evaluation/registry/builtin-evaluators.js'; +import { llmGraderFactory } from '../../src/evaluation/registry/builtin-graders.js'; import type { EvalTest } from '../../src/evaluation/types.js'; /** Helper to create a ProviderResponse with text wrapped in output */ @@ -95,7 +95,7 @@ const baseTarget: ResolvedTarget = { config: { response: '{}' }, }; -describe('LlmGraderEvaluator (llm-grader)', () => { +describe('LlmGrader (llm-grader)', () => { it('parses JSON response and returns evaluation score', async () => { const graderProvider = new StubProvider({ output: [ @@ -112,7 +112,7 @@ describe('LlmGraderEvaluator (llm-grader)', () => { ], }); - const evaluator = new LlmGraderEvaluator({ + const evaluator = new LlmGrader({ resolveGraderProvider: async () => graderProvider, }); @@ -134,7 +134,7 @@ describe('LlmGraderEvaluator (llm-grader)', () => { expect(result.assertions.filter((a) => !a.passed).map((a) => a.text)).toContain( 'Did not mention tests', ); - expect(result.evaluatorRawRequest).toBeDefined(); + expect(result.graderRawRequest).toBeDefined(); }); it('parses JSON from markdown code block', async () => { @@ -154,7 +154,7 @@ describe('LlmGraderEvaluator (llm-grader)', () => { ], }); - const evaluator = new LlmGraderEvaluator({ + const evaluator = new LlmGrader({ resolveGraderProvider: async () => graderProvider, }); @@ -193,7 +193,7 @@ describe('LlmGraderEvaluator (llm-grader)', () => { ], }); - const evaluator = new LlmGraderEvaluator({ + const evaluator = new LlmGrader({ resolveGraderProvider: async () => graderProvider, }); @@ -231,7 +231,7 @@ describe('LlmGraderEvaluator (llm-grader)', () => { ], }); - const evaluator = new LlmGraderEvaluator({ + const evaluator = new LlmGrader({ resolveGraderProvider: async () => graderProvider, }); @@ -277,7 +277,7 @@ describe('LlmGraderEvaluator (llm-grader)', () => { ], }); - const evaluator = new LlmGraderEvaluator({ + const evaluator = new LlmGrader({ resolveGraderProvider: async () => graderProvider, }); @@ -325,9 +325,9 @@ describe('LlmGraderEvaluator (llm-grader)', () => { ], }); - const evaluator = new LlmGraderEvaluator({ + const evaluator = new LlmGrader({ resolveGraderProvider: async () => graderProvider, - evaluatorTemplate: customPrompt, + graderTemplate: customPrompt, }); const result = await evaluator.evaluate({ @@ -350,11 +350,11 @@ describe('LlmGraderEvaluator (llm-grader)', () => { ); 
expect(graderProvider.lastRequest?.systemPrompt).not.toContain(customPrompt); - expect(result.evaluatorRawRequest?.userPrompt).toContain(customPrompt); - expect(result.evaluatorRawRequest?.systemPrompt).toContain( + expect(result.graderRawRequest?.userPrompt).toContain(customPrompt); + expect(result.graderRawRequest?.systemPrompt).toContain( 'You must respond with a single JSON object', ); - expect(result.evaluatorRawRequest?.systemPrompt).not.toContain(customPrompt); + expect(result.graderRawRequest?.systemPrompt).not.toContain(customPrompt); }); it('uses evaluator target overrides when configured', async () => { @@ -383,7 +383,7 @@ describe('LlmGraderEvaluator (llm-grader)', () => { graderProvider: defaultGraderProvider, targetResolver: (targetName) => targetName === 'grader-low-cost-b' ? overrideGraderProvider : undefined, - llmGrader: new LlmGraderEvaluator({ + llmGrader: new LlmGrader({ resolveGraderProvider: async () => defaultGraderProvider, }), registry: {} as never, @@ -419,7 +419,7 @@ describe('LlmGraderEvaluator (llm-grader)', () => { ], }); - const evaluator = new LlmGraderEvaluator({ + const evaluator = new LlmGrader({ resolveGraderProvider: async () => graderProvider, }); @@ -442,7 +442,7 @@ describe('LlmGraderEvaluator (llm-grader)', () => { it('tolerates non-JSON output by falling back to skip', async () => { const graderProvider = new StubProvider(textResponse('Final score: 0.5')); - const evaluator = new LlmGraderEvaluator({ + const evaluator = new LlmGrader({ resolveGraderProvider: async () => graderProvider, }); @@ -478,7 +478,7 @@ describe('LlmGraderEvaluator (llm-grader)', () => { ], }); - const evaluator = new LlmGraderEvaluator({ + const evaluator = new LlmGrader({ resolveGraderProvider: async () => graderProvider, }); @@ -520,7 +520,7 @@ describe('LlmGraderEvaluator (llm-grader)', () => { const graderProvider = new CapturingProvider({ output: [{ role: 'assistant', content: JSON.stringify({ score: 0.65, assertions: [] }) }], }); - const evaluator = new LlmGraderEvaluator({ + const evaluator = new LlmGrader({ resolveGraderProvider: async () => graderProvider, }); @@ -538,15 +538,15 @@ describe('LlmGraderEvaluator (llm-grader)', () => { }); expect(graderProvider.lastRequest?.question).toContain(multiTurnQuestion); - expect(result.evaluatorRawRequest?.userPrompt).toContain('@[Assistant]:'); - expect(result.evaluatorRawRequest?.userPrompt).toContain('@[System]:'); + expect(result.graderRawRequest?.userPrompt).toContain('@[Assistant]:'); + expect(result.graderRawRequest?.userPrompt).toContain('@[System]:'); }); it('keeps single-turn prompts flat when no markers are needed', async () => { const graderProvider = new CapturingProvider({ output: [{ role: 'assistant', content: JSON.stringify({ score: 0.8, assertions: [] }) }], }); - const evaluator = new LlmGraderEvaluator({ + const evaluator = new LlmGrader({ resolveGraderProvider: async () => graderProvider, }); @@ -564,14 +564,14 @@ describe('LlmGraderEvaluator (llm-grader)', () => { expect(graderProvider.lastRequest?.question).toContain(flatQuestion); expect(graderProvider.lastRequest?.question).not.toContain('@[User]:'); - expect(result.evaluatorRawRequest?.userPrompt).toContain(flatQuestion); - expect(result.evaluatorRawRequest?.userPrompt).not.toContain('@[User]:'); + expect(result.graderRawRequest?.userPrompt).toContain(flatQuestion); + expect(result.graderRawRequest?.userPrompt).not.toContain('@[User]:'); }); it('returns skip verdict when rubric mode receives malformed JSON', async () => { const graderProvider = new 
StubProvider(textResponse('not valid json at all')); - const evaluator = new LlmGraderEvaluator({ + const evaluator = new LlmGrader({ resolveGraderProvider: async () => graderProvider, }); @@ -600,7 +600,7 @@ describe('LlmGraderEvaluator (llm-grader)', () => { it('returns skip verdict when score-range rubric mode receives malformed JSON', async () => { const graderProvider = new StubProvider(textResponse('truncated {')); - const evaluator = new LlmGraderEvaluator({ + const evaluator = new LlmGrader({ resolveGraderProvider: async () => graderProvider, }); @@ -642,7 +642,7 @@ describe('LlmGraderEvaluator (llm-grader)', () => { const graderProvider = new StubProvider(textResponse('not valid json at all')); - const evaluator = new LlmGraderEvaluator({ + const evaluator = new LlmGrader({ resolveGraderProvider: async () => graderProvider, }); @@ -688,7 +688,7 @@ describe('LlmGraderEvaluator (llm-grader)', () => { repairedResponse, ]); - const evaluator = new LlmGraderEvaluator({ + const evaluator = new LlmGrader({ resolveGraderProvider: async () => graderProvider, }); @@ -722,7 +722,7 @@ describe('LlmGraderEvaluator (llm-grader)', () => { textResponse('{"score":'), ]); - const evaluator = new LlmGraderEvaluator({ + const evaluator = new LlmGrader({ resolveGraderProvider: async () => graderProvider, }); @@ -745,7 +745,7 @@ describe('LlmGraderEvaluator (llm-grader)', () => { it('keeps skipping on unrecoverable malformed JSON', async () => { const graderProvider = new StubProvider(textResponse('{"score":')); - const evaluator = new LlmGraderEvaluator({ + const evaluator = new LlmGrader({ resolveGraderProvider: async () => graderProvider, }); @@ -764,12 +764,12 @@ describe('LlmGraderEvaluator (llm-grader)', () => { expect(result.assertions[0]?.text).toContain('Grader parse failure'); }); - it('emits stderr warning with default name when evaluator name is not set', async () => { + it('emits stderr warning with default name when grader name is not set', async () => { const warnSpy = spyOn(console, 'warn').mockImplementation(() => {}); const graderProvider = new StubProvider(textResponse('garbage')); - const evaluator = new LlmGraderEvaluator({ + const evaluator = new LlmGrader({ resolveGraderProvider: async () => graderProvider, }); @@ -814,7 +814,7 @@ describe('LlmGraderEvaluator (llm-grader)', () => { }, { graderProvider, - llmGrader: new LlmGraderEvaluator({ + llmGrader: new LlmGrader({ resolveGraderProvider: async () => graderProvider, }), registry: {} as never, @@ -870,7 +870,7 @@ describe('LlmGraderEvaluator (llm-grader)', () => { }, { graderProvider, - llmGrader: new LlmGraderEvaluator({ + llmGrader: new LlmGrader({ resolveGraderProvider: async () => graderProvider, }), registry: {} as never, @@ -895,7 +895,7 @@ describe('LlmGraderEvaluator (llm-grader)', () => { }); }); -describe('CodeEvaluator', () => { +describe('CodeGrader', () => { it('passes required fields to code-grader scripts', async () => { const graderProvider = new StubProvider(textResponse('{}')); @@ -910,7 +910,7 @@ describe('CodeEvaluator', () => { const __dirname = dirname(fileURLToPath(import.meta.url)); const script = ['node', join(__dirname, '../fixtures/test-grader.cjs')]; - const evaluator = new CodeEvaluator({ command: script }); + const evaluator = new CodeGrader({ command: script }); const result = await evaluator.evaluate({ evalCase: evalCaseWithExpectedMessages, @@ -937,7 +937,7 @@ describe('CodeEvaluator', () => { const __dirname = dirname(fileURLToPath(import.meta.url)); const script = ['node', join(__dirname, 
'../fixtures/test-grader-error.cjs')]; - const evaluator = new CodeEvaluator({ command: script }); + const evaluator = new CodeGrader({ command: script }); const result = await evaluator.evaluate({ evalCase: baseTestCase, @@ -961,7 +961,7 @@ describe('CodeEvaluator', () => { const __dirname = dirname(fileURLToPath(import.meta.url)); const script = ['bun', 'run', join(__dirname, '../fixtures/test-define-grader.ts')]; - const evaluator = new CodeEvaluator({ command: script }); + const evaluator = new CodeGrader({ command: script }); const result = await evaluator.evaluate({ evalCase: baseTestCase, @@ -985,7 +985,7 @@ describe('CodeEvaluator', () => { const __dirname = dirname(fileURLToPath(import.meta.url)); const script = ['node', join(__dirname, '../fixtures/test-grader-with-details.cjs')]; - const evaluator = new CodeEvaluator({ command: script }); + const evaluator = new CodeGrader({ command: script }); const result = await evaluator.evaluate({ evalCase: { @@ -1016,7 +1016,7 @@ describe('CodeEvaluator', () => { const __dirname = dirname(fileURLToPath(import.meta.url)); const script = ['node', join(__dirname, '../fixtures/test-grader-workspace.cjs')]; - const evaluator = new CodeEvaluator({ command: script }); + const evaluator = new CodeGrader({ command: script }); const result = await evaluator.evaluate({ evalCase: baseTestCase, @@ -1043,7 +1043,7 @@ describe('CodeEvaluator', () => { const __dirname = dirname(fileURLToPath(import.meta.url)); const script = ['node', join(__dirname, '../fixtures/test-grader.cjs')]; - const evaluator = new CodeEvaluator({ command: script }); + const evaluator = new CodeGrader({ command: script }); const result = await evaluator.evaluate({ evalCase: { @@ -1064,7 +1064,7 @@ describe('CodeEvaluator', () => { }); }); -describe('FieldAccuracyEvaluator', () => { +describe('FieldAccuracyGrader', () => { const baseTestCaseWithExpected: EvalTest = { ...baseTestCase, expected_output: [ @@ -1083,7 +1083,7 @@ describe('FieldAccuracyEvaluator', () => { const graderProvider = new StubProvider(textResponse('{}')); it('evaluates exact match fields correctly', () => { - const evaluator = new FieldAccuracyEvaluator({ + const evaluator = new FieldAccuracyGrader({ config: { name: 'test', type: 'field-accuracy', @@ -1111,7 +1111,7 @@ describe('FieldAccuracyEvaluator', () => { }); it('handles missing required fields', () => { - const evaluator = new FieldAccuracyEvaluator({ + const evaluator = new FieldAccuracyGrader({ config: { name: 'test', type: 'field-accuracy', @@ -1141,7 +1141,7 @@ describe('FieldAccuracyEvaluator', () => { }); it('applies numeric tolerance matching', () => { - const evaluator = new FieldAccuracyEvaluator({ + const evaluator = new FieldAccuracyGrader({ config: { name: 'test', type: 'field-accuracy', @@ -1174,7 +1174,7 @@ describe('FieldAccuracyEvaluator', () => { }); it('fails numeric tolerance when outside range', () => { - const evaluator = new FieldAccuracyEvaluator({ + const evaluator = new FieldAccuracyGrader({ config: { name: 'test', type: 'field-accuracy', @@ -1208,7 +1208,7 @@ describe('FieldAccuracyEvaluator', () => { }); it('applies date matching with format normalization', () => { - const evaluator = new FieldAccuracyEvaluator({ + const evaluator = new FieldAccuracyGrader({ config: { name: 'test', type: 'field-accuracy', @@ -1240,7 +1240,7 @@ describe('FieldAccuracyEvaluator', () => { }); it('respects weighted averaging', () => { - const evaluator = new FieldAccuracyEvaluator({ + const evaluator = new FieldAccuracyGrader({ config: { name: 
'test', type: 'field-accuracy', @@ -1269,7 +1269,7 @@ describe('FieldAccuracyEvaluator', () => { }); it('supports all_or_nothing aggregation', () => { - const evaluator = new FieldAccuracyEvaluator({ + const evaluator = new FieldAccuracyGrader({ config: { name: 'test', type: 'field-accuracy', @@ -1297,7 +1297,7 @@ describe('FieldAccuracyEvaluator', () => { }); it('handles nested field paths', () => { - const evaluator = new FieldAccuracyEvaluator({ + const evaluator = new FieldAccuracyGrader({ config: { name: 'test', type: 'field-accuracy', @@ -1338,7 +1338,7 @@ describe('FieldAccuracyEvaluator', () => { ], }; - const evaluator = new FieldAccuracyEvaluator({ + const evaluator = new FieldAccuracyGrader({ config: { name: 'test', type: 'field-accuracy', @@ -1369,7 +1369,7 @@ describe('FieldAccuracyEvaluator', () => { }); it('returns failure for invalid JSON candidate', () => { - const evaluator = new FieldAccuracyEvaluator({ + const evaluator = new FieldAccuracyGrader({ config: { name: 'test', type: 'field-accuracy', @@ -1393,9 +1393,9 @@ describe('FieldAccuracyEvaluator', () => { }); }); -describe('LatencyEvaluator', () => { +describe('LatencyGrader', () => { it('passes when duration is under threshold', () => { - const evaluator = new LatencyEvaluator({ + const evaluator = new LatencyGrader({ config: { name: 'latency_check', type: 'latency', @@ -1420,7 +1420,7 @@ describe('LatencyEvaluator', () => { }); it('fails when duration exceeds threshold', () => { - const evaluator = new LatencyEvaluator({ + const evaluator = new LatencyGrader({ config: { name: 'latency_check', type: 'latency', @@ -1445,7 +1445,7 @@ describe('LatencyEvaluator', () => { }); it('fails when no duration data available', () => { - const evaluator = new LatencyEvaluator({ + const evaluator = new LatencyGrader({ config: { name: 'latency_check', type: 'latency', @@ -1470,7 +1470,7 @@ describe('LatencyEvaluator', () => { }); it('passes when duration equals threshold exactly', () => { - const evaluator = new LatencyEvaluator({ + const evaluator = new LatencyGrader({ config: { name: 'latency_check', type: 'latency', @@ -1494,9 +1494,9 @@ describe('LatencyEvaluator', () => { }); }); -describe('CostEvaluator', () => { +describe('CostGrader', () => { it('passes when cost is under budget', () => { - const evaluator = new CostEvaluator({ + const evaluator = new CostGrader({ config: { name: 'cost_check', type: 'cost', @@ -1521,7 +1521,7 @@ describe('CostEvaluator', () => { }); it('fails when cost exceeds budget', () => { - const evaluator = new CostEvaluator({ + const evaluator = new CostGrader({ config: { name: 'cost_check', type: 'cost', @@ -1546,7 +1546,7 @@ describe('CostEvaluator', () => { }); it('fails when no cost data available', () => { - const evaluator = new CostEvaluator({ + const evaluator = new CostGrader({ config: { name: 'cost_check', type: 'cost', @@ -1571,7 +1571,7 @@ describe('CostEvaluator', () => { }); it('passes when cost equals budget exactly', () => { - const evaluator = new CostEvaluator({ + const evaluator = new CostGrader({ config: { name: 'cost_check', type: 'cost', @@ -1595,9 +1595,9 @@ describe('CostEvaluator', () => { }); }); -describe('TokenUsageEvaluator', () => { +describe('TokenUsageGrader', () => { it('passes when total tokens are under max_total', () => { - const evaluator = new TokenUsageEvaluator({ + const evaluator = new TokenUsageGrader({ config: { name: 'token_budget', type: 'token-usage', max_total: 1000 }, }); @@ -1623,7 +1623,7 @@ describe('TokenUsageEvaluator', () => { }); it('fails when 
output tokens exceed max_output', () => { - const evaluator = new TokenUsageEvaluator({ + const evaluator = new TokenUsageGrader({ config: { name: 'token_budget', type: 'token-usage', max_output: 100 }, }); @@ -1649,7 +1649,7 @@ }); it('fails when no token usage data available', () => { - const evaluator = new TokenUsageEvaluator({ + const evaluator = new TokenUsageGrader({ config: { name: 'token_budget', type: 'token-usage', max_total: 1000 }, }); diff --git a/packages/core/test/evaluation/evaluators/assertions.test.ts b/packages/core/test/evaluation/graders/assertions.test.ts similarity index 97% rename from packages/core/test/evaluation/evaluators/assertions.test.ts rename to packages/core/test/evaluation/graders/assertions.test.ts index cebe33625..5dfd574fb 100644 --- a/packages/core/test/evaluation/evaluators/assertions.test.ts +++ b/packages/core/test/evaluation/graders/assertions.test.ts @@ -5,7 +5,7 @@ import { runEqualsAssertion, runIsJsonAssertion, runRegexAssertion, -} from '../../../src/evaluation/evaluators/assertions.js'; +} from '../../../src/evaluation/graders/assertions.js'; describe('deterministic assertions', () => { describe('contains', () => { diff --git a/packages/core/test/evaluation/evaluators/composite-threshold.test.ts b/packages/core/test/evaluation/graders/composite-threshold.test.ts similarity index 91% rename from packages/core/test/evaluation/evaluators/composite-threshold.test.ts rename to packages/core/test/evaluation/graders/composite-threshold.test.ts index 1034bcbd0..bbba8942e 100644 --- a/packages/core/test/evaluation/evaluators/composite-threshold.test.ts +++ b/packages/core/test/evaluation/graders/composite-threshold.test.ts @@ -1,14 +1,14 @@ import { describe, expect, it } from 'bun:test'; -import { CompositeEvaluator } from '../../../src/evaluation/evaluators/composite.js'; +import { CompositeGrader } from '../../../src/evaluation/graders/composite.js'; import type { EvaluationContext, EvaluationScore, - Evaluator, - EvaluatorFactory, -} from '../../../src/evaluation/evaluators/types.js'; + Grader, + GraderFactory, +} from '../../../src/evaluation/graders/types.js'; import type { ResolvedTarget } from '../../../src/evaluation/providers/targets.js'; -import type { EvalTest, EvaluatorConfig } from '../../../src/evaluation/types.js'; +import type { EvalTest, GraderConfig } from '../../../src/evaluation/types.js'; const baseTestCase: EvalTest = { id: 'threshold-test', @@ -56,9 +56,9 @@ function makeResult(verdict: 'pass' | 'fail', score: number): EvaluationScore { }; } -function createMockFactory(results: Record<string, EvaluationScore>): EvaluatorFactory { +function createMockFactory(results: Record<string, EvaluationScore>): GraderFactory { return { - create(config: EvaluatorConfig): Evaluator { + create(config: GraderConfig): Grader { return { kind: config.type, evaluate: () => results[config.name], @@ -67,7 +67,7 @@ function createMockFactory(results: Record<string, EvaluationScore>): EvaluatorF } -describe('CompositeEvaluator threshold aggregation', () => { +describe('CompositeGrader threshold aggregation', () => { it('all children pass, threshold 0.5 → pass, score = 1.0', async () => { const factory = createMockFactory({ a: makeResult('pass', 1.0), b: makeResult('pass', 0.9), c: makeResult('pass', 0.7), d: makeResult('pass', 0.8), }); - const evaluator = new CompositeEvaluator({ + const evaluator = new CompositeGrader({ config: { name: 'gate', type: 'composite', @@ -105,7 +105,7 @@ d:
makeResult('fail', 0.1), }); - const evaluator = new CompositeEvaluator({ + const evaluator = new CompositeGrader({ config: { name: 'gate', type: 'composite', @@ -133,7 +133,7 @@ describe('CompositeEvaluator threshold aggregation', () => { d: makeResult('fail', 0.1), }); - const evaluator = new CompositeEvaluator({ + const evaluator = new CompositeGrader({ config: { name: 'gate', type: 'composite', @@ -161,7 +161,7 @@ describe('CompositeEvaluator threshold aggregation', () => { d: makeResult('fail', 0.1), }); - const evaluator = new CompositeEvaluator({ + const evaluator = new CompositeGrader({ config: { name: 'gate', type: 'composite', @@ -187,7 +187,7 @@ describe('CompositeEvaluator threshold aggregation', () => { b: makeResult('fail', 0.1), }); - const evaluator = new CompositeEvaluator({ + const evaluator = new CompositeGrader({ config: { name: 'gate', type: 'composite', @@ -213,7 +213,7 @@ describe('CompositeEvaluator threshold aggregation', () => { d: makeResult('fail', 0.3), }); - const evaluator = new CompositeEvaluator({ + const evaluator = new CompositeGrader({ config: { name: 'gate', type: 'composite', @@ -239,7 +239,7 @@ describe('CompositeEvaluator threshold aggregation', () => { b: makeResult('fail', 0.3), }); - const evaluator = new CompositeEvaluator({ + const evaluator = new CompositeGrader({ config: { name: 'gate', type: 'composite', @@ -258,7 +258,7 @@ describe('CompositeEvaluator threshold aggregation', () => { const results = result.scores as NonNullable<typeof result.scores>; expect(results[0].name).toBe('a'); expect(results[1].name).toBe('b'); - expect(result.evaluatorRawRequest).toEqual({ + expect(result.graderRawRequest).toEqual({ aggregator: 'threshold', threshold: 0.5, }); @@ -276,7 +276,7 @@ function makeSkipResult(): EvaluationScore { }; } -describe('CompositeEvaluator skip-verdict handling', () => { +describe('CompositeGrader skip-verdict handling', () => { it('weighted average: skip-verdict members excluded from average', async () => { const factory = createMockFactory({ a: makeResult('pass', 1.0), b: makeSkipResult(), c: makeResult('pass', 0.8), }); - const evaluator = new CompositeEvaluator({ + const evaluator = new CompositeGrader({ config: { name: 'combo', type: 'composite', @@ -310,7 +310,7 @@ describe('CompositeEvaluator skip-verdict handling', () => { b: makeSkipResult(), }); - const evaluator = new CompositeEvaluator({ + const evaluator = new CompositeGrader({ config: { name: 'combo', type: 'composite', @@ -336,7 +336,7 @@ describe('CompositeEvaluator skip-verdict handling', () => { c: makeResult('fail', 0.2), }); - const evaluator = new CompositeEvaluator({ + const evaluator = new CompositeGrader({ config: { name: 'gate', type: 'composite', @@ -363,7 +363,7 @@ describe('CompositeEvaluator skip-verdict handling', () => { b: makeSkipResult(), }); - const evaluator = new CompositeEvaluator({ + const evaluator = new CompositeGrader({ config: { name: 'gate', type: 'composite', @@ -388,7 +388,7 @@ describe('CompositeEvaluator skip-verdict handling', () => { b: makeSkipResult(), }); - const evaluator = new CompositeEvaluator({ + const evaluator = new CompositeGrader({ config: { name: 'combo', type: 'composite', diff --git a/packages/core/test/evaluation/evaluators/execution-metrics.test.ts b/packages/core/test/evaluation/graders/execution-metrics.test.ts similarity index 84% rename from packages/core/test/evaluation/evaluators/execution-metrics.test.ts rename to
packages/core/test/evaluation/graders/execution-metrics.test.ts index 0a53f6671..33a931dc2 100644 --- a/packages/core/test/evaluation/evaluators/execution-metrics.test.ts +++ b/packages/core/test/evaluation/graders/execution-metrics.test.ts @@ -1,9 +1,9 @@ import { describe, expect, it } from 'bun:test'; -import { ExecutionMetricsEvaluator } from '../../../src/evaluation/evaluators/execution-metrics.js'; +import { ExecutionMetricsGrader } from '../../../src/evaluation/graders/execution-metrics.js'; import type { ResolvedTarget } from '../../../src/evaluation/providers/targets.js'; import type { TokenUsage, TraceSummary } from '../../../src/evaluation/trace.js'; -import type { EvalTest, ExecutionMetricsEvaluatorConfig } from '../../../src/evaluation/types.js'; +import type { EvalTest, ExecutionMetricsGraderConfig } from '../../../src/evaluation/types.js'; const baseTestCase: EvalTest = { id: 'metrics-test', @@ -57,16 +57,16 @@ function createContext(traceAndMetrics?: TraceSummary & Record<string, unknown>) { }; } -describe('ExecutionMetricsEvaluator', () => { +describe('ExecutionMetricsGrader', () => { describe('max_tool_calls', () => { it('passes when tool calls are within limit', () => { - const config: ExecutionMetricsEvaluatorConfig = { + const config: ExecutionMetricsGraderConfig = { name: 'test-metrics', type: 'execution-metrics', max_tool_calls: 10, }; - const evaluator = new ExecutionMetricsEvaluator({ config }); + const evaluator = new ExecutionMetricsGrader({ config }); const result = evaluator.evaluate( createContext({ eventCount: 5, @@ -84,13 +84,13 @@ describe('ExecutionMetricsEvaluator', () => { }); it('fails when tool calls exceed limit', () => { - const config: ExecutionMetricsEvaluatorConfig = { + const config: ExecutionMetricsGraderConfig = { name: 'test-metrics', type: 'execution-metrics', max_tool_calls: 5, }; - const evaluator = new ExecutionMetricsEvaluator({ config }); + const evaluator = new ExecutionMetricsGrader({ config }); const result = evaluator.evaluate( createContext({ eventCount: 10, @@ -110,13 +110,13 @@ describe('ExecutionMetricsEvaluator', () => { describe('max_llm_calls', () => { it('passes when LLM calls are within limit', () => { - const config: ExecutionMetricsEvaluatorConfig = { + const config: ExecutionMetricsGraderConfig = { name: 'test-metrics', type: 'execution-metrics', max_llm_calls: 5, }; - const evaluator = new ExecutionMetricsEvaluator({ config }); + const evaluator = new ExecutionMetricsGrader({ config }); const result = evaluator.evaluate( createContext({ eventCount: 3, @@ -134,13 +134,13 @@ describe('ExecutionMetricsEvaluator', () => { }); it('fails when LLM calls exceed limit', () => { - const config: ExecutionMetricsEvaluatorConfig = { + const config: ExecutionMetricsGraderConfig = { name: 'test-metrics', type: 'execution-metrics', max_llm_calls: 2, }; - const evaluator = new ExecutionMetricsEvaluator({ config }); + const evaluator = new ExecutionMetricsGrader({ config }); const result = evaluator.evaluate( createContext({ eventCount: 3, @@ -160,13 +160,13 @@ describe('ExecutionMetricsEvaluator', () => { describe('max_tokens', () => { it('passes when token usage is within limit', () => { - const config: ExecutionMetricsEvaluatorConfig = { + const config: ExecutionMetricsGraderConfig = { name: 'test-metrics', type: 'execution-metrics', max_tokens: 2000, }; - const evaluator = new ExecutionMetricsEvaluator({ config }); + const evaluator = new ExecutionMetricsGrader({ config }); const result = evaluator.evaluate( createContext({ eventCount: 3, @@ -184,13
+184,13 @@ describe('ExecutionMetricsEvaluator', () => { }); it('fails when token usage exceeds limit', () => { - const config: ExecutionMetricsEvaluatorConfig = { + const config: ExecutionMetricsGraderConfig = { name: 'test-metrics', type: 'execution-metrics', max_tokens: 1000, }; - const evaluator = new ExecutionMetricsEvaluator({ config }); + const evaluator = new ExecutionMetricsGrader({ config }); const result = evaluator.evaluate( createContext({ eventCount: 3, @@ -210,13 +210,13 @@ describe('ExecutionMetricsEvaluator', () => { describe('max_cost_usd', () => { it('passes when cost is within budget', () => { - const config: ExecutionMetricsEvaluatorConfig = { + const config: ExecutionMetricsGraderConfig = { name: 'test-metrics', type: 'execution-metrics', max_cost_usd: 0.1, }; - const evaluator = new ExecutionMetricsEvaluator({ config }); + const evaluator = new ExecutionMetricsGrader({ config }); const result = evaluator.evaluate( createContext({ eventCount: 3, @@ -234,13 +234,13 @@ describe('ExecutionMetricsEvaluator', () => { }); it('fails when cost exceeds budget', () => { - const config: ExecutionMetricsEvaluatorConfig = { + const config: ExecutionMetricsGraderConfig = { name: 'test-metrics', type: 'execution-metrics', max_cost_usd: 0.05, }; - const evaluator = new ExecutionMetricsEvaluator({ config }); + const evaluator = new ExecutionMetricsGrader({ config }); const result = evaluator.evaluate( createContext({ eventCount: 3, @@ -260,13 +260,13 @@ describe('ExecutionMetricsEvaluator', () => { describe('max_duration_ms', () => { it('passes when duration is within limit', () => { - const config: ExecutionMetricsEvaluatorConfig = { + const config: ExecutionMetricsGraderConfig = { name: 'test-metrics', type: 'execution-metrics', max_duration_ms: 5000, }; - const evaluator = new ExecutionMetricsEvaluator({ config }); + const evaluator = new ExecutionMetricsGrader({ config }); const result = evaluator.evaluate( createContext({ eventCount: 3, @@ -284,13 +284,13 @@ describe('ExecutionMetricsEvaluator', () => { }); it('fails when duration exceeds limit', () => { - const config: ExecutionMetricsEvaluatorConfig = { + const config: ExecutionMetricsGraderConfig = { name: 'test-metrics', type: 'execution-metrics', max_duration_ms: 2000, }; - const evaluator = new ExecutionMetricsEvaluator({ config }); + const evaluator = new ExecutionMetricsGrader({ config }); const result = evaluator.evaluate( createContext({ eventCount: 3, @@ -310,14 +310,14 @@ describe('ExecutionMetricsEvaluator', () => { describe('target_exploration_ratio', () => { it('passes when exploration ratio is within tolerance of target', () => { - const config: ExecutionMetricsEvaluatorConfig = { + const config: ExecutionMetricsGraderConfig = { name: 'test-metrics', type: 'execution-metrics', target_exploration_ratio: 0.6, exploration_tolerance: 0.2, }; - const evaluator = new ExecutionMetricsEvaluator({ config }); + const evaluator = new ExecutionMetricsGrader({ config }); const result = evaluator.evaluate( createContext({ eventCount: 10, @@ -334,14 +334,14 @@ describe('ExecutionMetricsEvaluator', () => { }); it('fails when exploration ratio is outside tolerance', () => { - const config: ExecutionMetricsEvaluatorConfig = { + const config: ExecutionMetricsGraderConfig = { name: 'test-metrics', type: 'execution-metrics', target_exploration_ratio: 0.8, exploration_tolerance: 0.1, }; - const evaluator = new ExecutionMetricsEvaluator({ config }); + const evaluator = new ExecutionMetricsGrader({ config }); const result = 
evaluator.evaluate( createContext({ eventCount: 10, @@ -358,14 +358,14 @@ describe('ExecutionMetricsEvaluator', () => { }); it('uses default tolerance of 0.2 when not specified', () => { - const config: ExecutionMetricsEvaluatorConfig = { + const config: ExecutionMetricsGraderConfig = { name: 'test-metrics', type: 'execution-metrics', target_exploration_ratio: 0.5, // No exploration_tolerance specified, should default to 0.2 }; - const evaluator = new ExecutionMetricsEvaluator({ config }); + const evaluator = new ExecutionMetricsGrader({ config }); const result = evaluator.evaluate( createContext({ eventCount: 10, @@ -381,7 +381,7 @@ describe('ExecutionMetricsEvaluator', () => { describe('combined thresholds', () => { it('passes when all specified thresholds are within limits', () => { - const config: ExecutionMetricsEvaluatorConfig = { + const config: ExecutionMetricsGraderConfig = { name: 'test-metrics', type: 'execution-metrics', max_tool_calls: 10, @@ -391,7 +391,7 @@ describe('ExecutionMetricsEvaluator', () => { max_duration_ms: 5000, }; - const evaluator = new ExecutionMetricsEvaluator({ config }); + const evaluator = new ExecutionMetricsGrader({ config }); const result = evaluator.evaluate( createContext({ eventCount: 5, @@ -411,7 +411,7 @@ describe('ExecutionMetricsEvaluator', () => { }); it('calculates proportional score based on passed and failed assertions', () => { - const config: ExecutionMetricsEvaluatorConfig = { + const config: ExecutionMetricsGraderConfig = { name: 'test-metrics', type: 'execution-metrics', max_tool_calls: 10, @@ -421,7 +421,7 @@ describe('ExecutionMetricsEvaluator', () => { max_duration_ms: 5000, }; - const evaluator = new ExecutionMetricsEvaluator({ config }); + const evaluator = new ExecutionMetricsGrader({ config }); const result = evaluator.evaluate( createContext({ eventCount: 5, @@ -444,14 +444,14 @@ describe('ExecutionMetricsEvaluator', () => { describe('omitted thresholds', () => { it('only checks specified thresholds', () => { - const config: ExecutionMetricsEvaluatorConfig = { + const config: ExecutionMetricsGraderConfig = { name: 'test-metrics', type: 'execution-metrics', max_tool_calls: 10, // Other thresholds not specified }; - const evaluator = new ExecutionMetricsEvaluator({ config }); + const evaluator = new ExecutionMetricsGrader({ config }); const result = evaluator.evaluate( createContext({ eventCount: 5, @@ -476,13 +476,13 @@ describe('ExecutionMetricsEvaluator', () => { describe('missing data handling', () => { it('fails when no trace is available', () => { - const config: ExecutionMetricsEvaluatorConfig = { + const config: ExecutionMetricsGraderConfig = { name: 'test-metrics', type: 'execution-metrics', max_tool_calls: 10, }; - const evaluator = new ExecutionMetricsEvaluator({ config }); + const evaluator = new ExecutionMetricsGrader({ config }); const result = evaluator.evaluate(createContext(undefined)); expect(result.score).toBe(0); @@ -493,13 +493,13 @@ describe('ExecutionMetricsEvaluator', () => { }); it('fails threshold check when required metric is missing', () => { - const config: ExecutionMetricsEvaluatorConfig = { + const config: ExecutionMetricsGraderConfig = { name: 'test-metrics', type: 'execution-metrics', max_cost_usd: 0.1, // Checking cost }; - const evaluator = new ExecutionMetricsEvaluator({ config }); + const evaluator = new ExecutionMetricsGrader({ config }); const result = evaluator.evaluate( createContext({ eventCount: 5, @@ -517,13 +517,13 @@ describe('ExecutionMetricsEvaluator', () => { }); it('fails when 
tokenUsage is missing for max_tokens check', () => { - const config: ExecutionMetricsEvaluatorConfig = { + const config: ExecutionMetricsGraderConfig = { name: 'test-metrics', type: 'execution-metrics', max_tokens: 1000, }; - const evaluator = new ExecutionMetricsEvaluator({ config }); + const evaluator = new ExecutionMetricsGrader({ config }); const result = evaluator.evaluate( createContext({ eventCount: 5, @@ -541,13 +541,13 @@ describe('ExecutionMetricsEvaluator', () => { }); it('fails when durationMs is missing for max_duration_ms check', () => { - const config: ExecutionMetricsEvaluatorConfig = { + const config: ExecutionMetricsGraderConfig = { name: 'test-metrics', type: 'execution-metrics', max_duration_ms: 5000, }; - const evaluator = new ExecutionMetricsEvaluator({ config }); + const evaluator = new ExecutionMetricsGrader({ config }); const result = evaluator.evaluate( createContext({ eventCount: 5, @@ -565,13 +565,13 @@ describe('ExecutionMetricsEvaluator', () => { }); it('fails when llmCallCount is missing for max_llm_calls check', () => { - const config: ExecutionMetricsEvaluatorConfig = { + const config: ExecutionMetricsGraderConfig = { name: 'test-metrics', type: 'execution-metrics', max_llm_calls: 5, }; - const evaluator = new ExecutionMetricsEvaluator({ config }); + const evaluator = new ExecutionMetricsGrader({ config }); const result = evaluator.evaluate( createContext({ eventCount: 5, @@ -589,9 +589,9 @@ describe('ExecutionMetricsEvaluator', () => { }); }); - describe('evaluatorRawRequest', () => { + describe('graderRawRequest', () => { it('includes all config values and actual metrics in rawRequest', () => { - const config: ExecutionMetricsEvaluatorConfig = { + const config: ExecutionMetricsGraderConfig = { name: 'test-metrics', type: 'execution-metrics', max_tool_calls: 10, @@ -599,7 +599,7 @@ describe('ExecutionMetricsEvaluator', () => { weight: 2.0, }; - const evaluator = new ExecutionMetricsEvaluator({ config }); + const evaluator = new ExecutionMetricsGrader({ config }); const result = evaluator.evaluate( createContext({ eventCount: 5, @@ -609,7 +609,7 @@ describe('ExecutionMetricsEvaluator', () => { }), ); - expect(result.evaluatorRawRequest).toEqual({ + expect(result.graderRawRequest).toEqual({ type: 'execution-metrics', config: { max_tool_calls: 10, diff --git a/packages/core/test/evaluation/evaluators/inline-assert.test.ts b/packages/core/test/evaluation/graders/inline-assert.test.ts similarity index 82% rename from packages/core/test/evaluation/evaluators/inline-assert.test.ts rename to packages/core/test/evaluation/graders/inline-assert.test.ts index 3efd2b028..360d0385f 100644 --- a/packages/core/test/evaluation/evaluators/inline-assert.test.ts +++ b/packages/core/test/evaluation/graders/inline-assert.test.ts @@ -1,8 +1,8 @@ import { describe, expect, it } from 'vitest'; import type { AssertFn } from '../../../src/evaluation/assertions.js'; -import { InlineAssertEvaluator } from '../../../src/evaluation/evaluators/inline-assert.js'; +import { InlineAssertGrader } from '../../../src/evaluation/graders/inline-assert.js'; -describe('InlineAssertEvaluator', () => { +describe('InlineAssertGrader', () => { const makeContext = (candidate: string, referenceAnswer?: string) => ({ evalCase: { @@ -24,7 +24,7 @@ describe('InlineAssertEvaluator', () => { score: output.includes('hello') ? 
1.0 : 0.0, }); - const evaluator = new InlineAssertEvaluator(fn, 'test-assert'); + const evaluator = new InlineAssertGrader(fn, 'test-assert'); const score = await evaluator.evaluate(makeContext('hello world')); expect(score.score).toBe(1.0); @@ -38,7 +38,7 @@ describe('InlineAssertEvaluator', () => { score: output.includes('goodbye') ? 1.0 : 0.0, }); - const evaluator = new InlineAssertEvaluator(fn, 'fail-assert'); + const evaluator = new InlineAssertGrader(fn, 'fail-assert'); const score = await evaluator.evaluate(makeContext('hello world')); expect(score.score).toBe(0.0); @@ -52,7 +52,7 @@ describe('InlineAssertEvaluator', () => { score: output.length > 0 ? 1.0 : 0.0, }); - const evaluator = new InlineAssertEvaluator(fn, 'async-assert'); + const evaluator = new InlineAssertGrader(fn, 'async-assert'); const score = await evaluator.evaluate(makeContext('some output')); expect(score.score).toBe(1.0); @@ -60,7 +60,7 @@ describe('InlineAssertEvaluator', () => { it('clamps scores to 0-1 range', async () => { const fn: AssertFn = () => ({ name: 'clamped', score: 1.5 }); - const evaluator = new InlineAssertEvaluator(fn, 'clamped'); + const evaluator = new InlineAssertGrader(fn, 'clamped'); const score = await evaluator.evaluate(makeContext('output')); expect(score.score).toBe(1.0); @@ -68,7 +68,7 @@ describe('InlineAssertEvaluator', () => { it('clamps negative scores to 0', async () => { const fn: AssertFn = () => ({ name: 'negative', score: -0.5 }); - const evaluator = new InlineAssertEvaluator(fn, 'negative'); + const evaluator = new InlineAssertGrader(fn, 'negative'); const score = await evaluator.evaluate(makeContext('output')); expect(score.score).toBe(0.0); @@ -81,7 +81,7 @@ describe('InlineAssertEvaluator', () => { score: expectedOutput === 'expected' ? 
1.0 : 0.0, }); - const evaluator = new InlineAssertEvaluator(fn, 'expected-check'); + const evaluator = new InlineAssertGrader(fn, 'expected-check'); const score = await evaluator.evaluate(makeContext('candidate', 'expected')); expect(score.score).toBe(1.0); diff --git a/packages/core/test/evaluation/evaluators/negation.test.ts b/packages/core/test/evaluation/graders/negation.test.ts similarity index 94% rename from packages/core/test/evaluation/evaluators/negation.test.ts rename to packages/core/test/evaluation/graders/negation.test.ts index 41d1ef18b..e5c195542 100644 --- a/packages/core/test/evaluation/evaluators/negation.test.ts +++ b/packages/core/test/evaluation/graders/negation.test.ts @@ -1,7 +1,7 @@ import { describe, expect, it } from 'bun:test'; -import { negateScore } from '../../../src/evaluation/evaluators/scoring.js'; -import type { EvaluationScore } from '../../../src/evaluation/evaluators/types.js'; +import { negateScore } from '../../../src/evaluation/graders/scoring.js'; +import type { EvaluationScore } from '../../../src/evaluation/graders/types.js'; describe('negateScore', () => { it('inverts a passing score to failing', () => { diff --git a/packages/core/test/evaluation/evaluators/prompt-resolution.test.ts b/packages/core/test/evaluation/graders/prompt-resolution.test.ts similarity index 97% rename from packages/core/test/evaluation/evaluators/prompt-resolution.test.ts rename to packages/core/test/evaluation/graders/prompt-resolution.test.ts index 0d94ffe98..02ef7d946 100644 --- a/packages/core/test/evaluation/evaluators/prompt-resolution.test.ts +++ b/packages/core/test/evaluation/graders/prompt-resolution.test.ts @@ -3,7 +3,7 @@ import { describe, expect, it } from 'bun:test'; import { containsTemplateVariables, resolveCustomPrompt, -} from '../../../src/evaluation/evaluators/prompt-resolution.js'; +} from '../../../src/evaluation/graders/prompt-resolution.js'; describe('containsTemplateVariables', () => { it('returns true for template with {{output}}', () => { diff --git a/packages/core/test/evaluation/evaluators/skill-trigger.test.ts b/packages/core/test/evaluation/graders/skill-trigger.test.ts similarity index 84% rename from packages/core/test/evaluation/evaluators/skill-trigger.test.ts rename to packages/core/test/evaluation/graders/skill-trigger.test.ts index 0532204fe..ccb41907b 100644 --- a/packages/core/test/evaluation/evaluators/skill-trigger.test.ts +++ b/packages/core/test/evaluation/graders/skill-trigger.test.ts @@ -1,7 +1,7 @@ import { describe, expect, it } from 'vitest'; -import { SkillTriggerEvaluator } from '../../../src/evaluation/evaluators/skill-trigger.js'; -import type { EvaluationContext } from '../../../src/evaluation/evaluators/types.js'; -import type { SkillTriggerEvaluatorConfig } from '../../../src/evaluation/types.js'; +import { SkillTriggerGrader } from '../../../src/evaluation/graders/skill-trigger.js'; +import type { EvaluationContext } from '../../../src/evaluation/graders/types.js'; +import type { SkillTriggerGraderConfig } from '../../../src/evaluation/types.js'; // biome-ignore lint/suspicious/noExplicitAny: test helper with partial context function makeContext(overrides: Record<string, unknown> = {}): EvaluationContext { @@ -18,9 +18,7 @@ function makeContext(overrides: Record<string, unknown> = {}): EvaluationContext { } as any; } -function makeConfig( - overrides: Partial<SkillTriggerEvaluatorConfig> = {}, -): SkillTriggerEvaluatorConfig { +function makeConfig(overrides: Partial<SkillTriggerGraderConfig> = {}): SkillTriggerGraderConfig { return { name: 'test-trigger', type: 'skill-trigger', @@ -29,10 +27,10 @@
function makeConfig( }; } -describe('SkillTriggerEvaluator', () => { +describe('SkillTriggerGrader', () => { describe('canonical tool names (provider-agnostic)', () => { it('should detect Skill tool with matching skill name', () => { - const evaluator = new SkillTriggerEvaluator(makeConfig()); + const evaluator = new SkillTriggerGrader(makeConfig()); const context = makeContext({ output: [ { @@ -48,7 +46,7 @@ describe('SkillTriggerEvaluator', () => { }); it('should detect Read tool loading skill file via file_path', () => { - const evaluator = new SkillTriggerEvaluator(makeConfig()); + const evaluator = new SkillTriggerGrader(makeConfig()); const context = makeContext({ output: [ { @@ -69,7 +67,7 @@ describe('SkillTriggerEvaluator', () => { }); it('should detect skill via tool output reference', () => { - const evaluator = new SkillTriggerEvaluator(makeConfig()); + const evaluator = new SkillTriggerGrader(makeConfig()); const context = makeContext({ output: [ { @@ -90,7 +88,7 @@ describe('SkillTriggerEvaluator', () => { }); it('should fail when skill name does not match', () => { - const evaluator = new SkillTriggerEvaluator(makeConfig()); + const evaluator = new SkillTriggerGrader(makeConfig()); const context = makeContext({ output: [ { @@ -105,7 +103,7 @@ describe('SkillTriggerEvaluator', () => { }); it('should fail when Read loads non-skill file', () => { - const evaluator = new SkillTriggerEvaluator(makeConfig()); + const evaluator = new SkillTriggerGrader(makeConfig()); const context = makeContext({ output: [ { @@ -120,7 +118,7 @@ describe('SkillTriggerEvaluator', () => { }); it('should fail when only unrelated tools are called', () => { - const evaluator = new SkillTriggerEvaluator(makeConfig()); + const evaluator = new SkillTriggerGrader(makeConfig()); const context = makeContext({ output: [ { @@ -135,7 +133,7 @@ describe('SkillTriggerEvaluator', () => { }); it('should handle no tool calls', () => { - const evaluator = new SkillTriggerEvaluator(makeConfig()); + const evaluator = new SkillTriggerGrader(makeConfig()); const context = makeContext({ output: [{ role: 'assistant', content: 'no tools used' }], }); @@ -146,7 +144,7 @@ describe('SkillTriggerEvaluator', () => { it('should work with any provider kind (provider-agnostic)', () => { for (const kind of ['claude-cli', 'copilot-cli', 'codex', 'pi-cli', 'openai']) { - const evaluator = new SkillTriggerEvaluator(makeConfig()); + const evaluator = new SkillTriggerGrader(makeConfig()); const context = makeContext({ provider: { kind, targetName: 'test' }, output: [ @@ -165,7 +163,7 @@ describe('SkillTriggerEvaluator', () => { describe('should_trigger: false', () => { it('should pass when skill is not triggered', () => { - const evaluator = new SkillTriggerEvaluator(makeConfig({ should_trigger: false })); + const evaluator = new SkillTriggerGrader(makeConfig({ should_trigger: false })); const context = makeContext({ output: [ { @@ -180,7 +178,7 @@ describe('SkillTriggerEvaluator', () => { }); it('should fail when skill is triggered unexpectedly', () => { - const evaluator = new SkillTriggerEvaluator(makeConfig({ should_trigger: false })); + const evaluator = new SkillTriggerGrader(makeConfig({ should_trigger: false })); const context = makeContext({ output: [ { @@ -195,7 +193,7 @@ describe('SkillTriggerEvaluator', () => { }); it('should pass with no tool calls', () => { - const evaluator = new SkillTriggerEvaluator(makeConfig({ should_trigger: false })); + const evaluator = new SkillTriggerGrader(makeConfig({ should_trigger: false })); 
const context = makeContext({ output: [{ role: 'assistant', content: 'no tools used' }], }); @@ -206,7 +204,7 @@ describe('SkillTriggerEvaluator', () => { describe('full transcript scanning', () => { it('should pass when skill triggers after a preamble skill', () => { - const evaluator = new SkillTriggerEvaluator(makeConfig()); + const evaluator = new SkillTriggerGrader(makeConfig()); const context = makeContext({ output: [ { @@ -224,7 +222,7 @@ describe('SkillTriggerEvaluator', () => { }); it('should pass when skill triggers in a later message', () => { - const evaluator = new SkillTriggerEvaluator(makeConfig()); + const evaluator = new SkillTriggerGrader(makeConfig()); const context = makeContext({ output: [ { @@ -244,7 +242,7 @@ describe('SkillTriggerEvaluator', () => { }); it('should fail when target skill never appears anywhere in transcript', () => { - const evaluator = new SkillTriggerEvaluator(makeConfig()); + const evaluator = new SkillTriggerGrader(makeConfig()); const context = makeContext({ output: [ { @@ -262,7 +260,7 @@ describe('SkillTriggerEvaluator', () => { }); it('should pass for should_trigger:false when skill never appears', () => { - const evaluator = new SkillTriggerEvaluator(makeConfig({ should_trigger: false })); + const evaluator = new SkillTriggerGrader(makeConfig({ should_trigger: false })); const context = makeContext({ output: [ { @@ -277,7 +275,7 @@ describe('SkillTriggerEvaluator', () => { }); it('should fail for should_trigger:false when skill appears later', () => { - const evaluator = new SkillTriggerEvaluator(makeConfig({ should_trigger: false })); + const evaluator = new SkillTriggerGrader(makeConfig({ should_trigger: false })); const context = makeContext({ output: [ { @@ -295,7 +293,7 @@ describe('SkillTriggerEvaluator', () => { }); it('should detect skill loaded via Read in .agents/skills path', () => { - const evaluator = new SkillTriggerEvaluator(makeConfig()); + const evaluator = new SkillTriggerGrader(makeConfig()); const context = makeContext({ output: [ { @@ -315,7 +313,7 @@ describe('SkillTriggerEvaluator', () => { }); it('should detect skill loaded via Read in global path', () => { - const evaluator = new SkillTriggerEvaluator(makeConfig()); + const evaluator = new SkillTriggerGrader(makeConfig()); const context = makeContext({ output: [ { diff --git a/packages/core/test/evaluation/llm-grader-multimodal.test.ts b/packages/core/test/evaluation/llm-grader-multimodal.test.ts index 158d10245..7c94014e2 100644 --- a/packages/core/test/evaluation/llm-grader-multimodal.test.ts +++ b/packages/core/test/evaluation/llm-grader-multimodal.test.ts @@ -49,8 +49,8 @@ mock.module('ai', () => { }); // Import AFTER mock is set up -const { extractImageBlocks } = await import('../../src/evaluation/evaluators/llm-grader.js'); -const { LlmGraderEvaluator } = await import('../../src/evaluation/evaluators.js'); +const { extractImageBlocks } = await import('../../src/evaluation/graders/llm-grader.js'); +const { LlmGrader } = await import('../../src/evaluation/graders.js'); // --------------------------------------------------------------------------- // Test helpers @@ -196,7 +196,7 @@ describe('extractImageBlocks', () => { // LLM grader multimodal integration tests // --------------------------------------------------------------------------- -describe('LlmGraderEvaluator multimodal', () => { +describe('LlmGrader multimodal', () => { let tempDir: string | undefined; beforeEach(() => { @@ -213,7 +213,7 @@ describe('LlmGraderEvaluator multimodal', () => { it('sends 
plain text prompt when output has no images', async () => { const provider = createLmProvider(); - const evaluator = new LlmGraderEvaluator({ + const evaluator = new LlmGrader({ resolveGraderProvider: async () => provider, }); @@ -239,7 +239,7 @@ describe('LlmGraderEvaluator multimodal', () => { it('sends multi-part messages when output contains images', async () => { const provider = createLmProvider(); - const evaluator = new LlmGraderEvaluator({ + const evaluator = new LlmGrader({ resolveGraderProvider: async () => provider, }); @@ -292,7 +292,7 @@ describe('LlmGraderEvaluator multimodal', () => { it('appends multiple images from output', async () => { const provider = createLmProvider(); - const evaluator = new LlmGraderEvaluator({ + const evaluator = new LlmGrader({ resolveGraderProvider: async () => provider, }); @@ -331,7 +331,7 @@ describe('LlmGraderEvaluator multimodal', () => { it('ignores images in user/tool messages (only assistant)', async () => { const provider = createLmProvider(); - const evaluator = new LlmGraderEvaluator({ + const evaluator = new LlmGrader({ resolveGraderProvider: async () => provider, }); @@ -381,7 +381,7 @@ console.log('spreadsheet:' + path.basename(payload.original_path));`, ); const provider = createLmProvider(); - const evaluator = new LlmGraderEvaluator({ + const evaluator = new LlmGrader({ resolveGraderProvider: async () => provider, }); diff --git a/packages/core/test/evaluation/loaders/evaluator-parser.test.ts b/packages/core/test/evaluation/loaders/grader-parser.test.ts similarity index 81% rename from packages/core/test/evaluation/loaders/evaluator-parser.test.ts rename to packages/core/test/evaluation/loaders/grader-parser.test.ts index e5cd4e571..de1984798 100644 --- a/packages/core/test/evaluation/loaders/evaluator-parser.test.ts +++ b/packages/core/test/evaluation/loaders/grader-parser.test.ts @@ -3,20 +3,20 @@ import { mkdir, rm, writeFile } from 'node:fs/promises'; import os from 'node:os'; import path from 'node:path'; -import { parseEvaluators } from '../../../src/evaluation/loaders/evaluator-parser.js'; -import type { ToolTrajectoryEvaluatorConfig } from '../../../src/evaluation/trace.js'; +import { parseGraders } from '../../../src/evaluation/loaders/grader-parser.js'; +import type { ToolTrajectoryGraderConfig } from '../../../src/evaluation/trace.js'; import type { - CodeEvaluatorConfig, - CompositeEvaluatorConfig, - ContainsEvaluatorConfig, - EqualsEvaluatorConfig, - IsJsonEvaluatorConfig, - LatencyEvaluatorConfig, - LlmGraderEvaluatorConfig, - RegexEvaluatorConfig, + CodeGraderConfig, + CompositeGraderConfig, + ContainsGraderConfig, + EqualsGraderConfig, + IsJsonGraderConfig, + LatencyGraderConfig, + LlmGraderConfig, + RegexGraderConfig, } from '../../../src/evaluation/types.js'; -describe('parseEvaluators - deterministic assertion types', () => { +describe('parseGraders - deterministic assertion types', () => { let tempDir: string; beforeAll(async () => { @@ -29,7 +29,7 @@ describe('parseEvaluators - deterministic assertion types', () => { }); it('parses type: contains', async () => { - const evaluators = await parseEvaluators( + const evaluators = await parseGraders( { evaluators: [{ name: 'check-denied', type: 'contains', value: 'DENIED' }], }, @@ -39,13 +39,13 @@ describe('parseEvaluators - deterministic assertion types', () => { ); expect(evaluators).toHaveLength(1); expect(evaluators?.[0].type).toBe('contains'); - const config = evaluators?.[0] as ContainsEvaluatorConfig; + const config = evaluators?.[0] as 
ContainsGraderConfig; expect(config.name).toBe('check-denied'); expect(config.value).toBe('DENIED'); }); it('auto-generates name for contains when not provided', async () => { - const evaluators = await parseEvaluators( + const evaluators = await parseGraders( { evaluators: [{ type: 'contains', value: 'DENIED' }], }, @@ -59,7 +59,7 @@ describe('parseEvaluators - deterministic assertion types', () => { }); it('skips contains evaluator with missing value', async () => { - const evaluators = await parseEvaluators( + const evaluators = await parseGraders( { evaluators: [{ name: 'no-value', type: 'contains' }], }, @@ -71,7 +71,7 @@ describe('parseEvaluators - deterministic assertion types', () => { }); it('parses type: contains with weight', async () => { - const evaluators = await parseEvaluators( + const evaluators = await parseGraders( { evaluators: [{ name: 'weighted-contains', type: 'contains', value: 'OK', weight: 2.0 }], }, @@ -80,12 +80,12 @@ describe('parseEvaluators - deterministic assertion types', () => { 'test-1', ); expect(evaluators).toHaveLength(1); - const config = evaluators?.[0] as ContainsEvaluatorConfig; + const config = evaluators?.[0] as ContainsGraderConfig; expect(config.weight).toBe(2.0); }); it('parses type: regex', async () => { - const evaluators = await parseEvaluators( + const evaluators = await parseGraders( { evaluators: [{ name: 'risk-check', type: 'regex', value: 'risk: \\w+' }], }, @@ -95,13 +95,13 @@ describe('parseEvaluators - deterministic assertion types', () => { ); expect(evaluators).toHaveLength(1); expect(evaluators?.[0].type).toBe('regex'); - const config = evaluators?.[0] as RegexEvaluatorConfig; + const config = evaluators?.[0] as RegexGraderConfig; expect(config.name).toBe('risk-check'); expect(config.value).toBe('risk: \\w+'); }); it('auto-generates name for regex when not provided', async () => { - const evaluators = await parseEvaluators( + const evaluators = await parseGraders( { evaluators: [{ type: 'regex', value: '^\\d{3}-\\d{4}$' }], }, @@ -115,7 +115,7 @@ describe('parseEvaluators - deterministic assertion types', () => { }); it('skips regex evaluator with missing value', async () => { - const evaluators = await parseEvaluators( + const evaluators = await parseGraders( { evaluators: [{ name: 'no-pattern', type: 'regex' }], }, @@ -127,7 +127,7 @@ describe('parseEvaluators - deterministic assertion types', () => { }); it('parses type: is-json', async () => { - const evaluators = await parseEvaluators( + const evaluators = await parseGraders( { evaluators: [{ name: 'json-check', type: 'is-json' }], }, @@ -137,12 +137,12 @@ describe('parseEvaluators - deterministic assertion types', () => { ); expect(evaluators).toHaveLength(1); expect(evaluators?.[0].type).toBe('is-json'); - const config = evaluators?.[0] as IsJsonEvaluatorConfig; + const config = evaluators?.[0] as IsJsonGraderConfig; expect(config.name).toBe('json-check'); }); it('auto-generates name for is-json when not provided', async () => { - const evaluators = await parseEvaluators( + const evaluators = await parseGraders( { evaluators: [{ type: 'is-json' }], }, @@ -156,7 +156,7 @@ describe('parseEvaluators - deterministic assertion types', () => { }); it('parses type: is-json with weight', async () => { - const evaluators = await parseEvaluators( + const evaluators = await parseGraders( { evaluators: [{ name: 'json-weighted', type: 'is-json', weight: 0.5 }], }, @@ -165,12 +165,12 @@ describe('parseEvaluators - deterministic assertion types', () => { 'test-1', ); 
expect(evaluators).toHaveLength(1); - const config = evaluators?.[0] as IsJsonEvaluatorConfig; + const config = evaluators?.[0] as IsJsonGraderConfig; expect(config.weight).toBe(0.5); }); it('parses type: equals', async () => { - const evaluators = await parseEvaluators( + const evaluators = await parseGraders( { evaluators: [{ name: 'exact-match', type: 'equals', value: 'DENIED' }], }, @@ -180,13 +180,13 @@ describe('parseEvaluators - deterministic assertion types', () => { ); expect(evaluators).toHaveLength(1); expect(evaluators?.[0].type).toBe('equals'); - const config = evaluators?.[0] as EqualsEvaluatorConfig; + const config = evaluators?.[0] as EqualsGraderConfig; expect(config.name).toBe('exact-match'); expect(config.value).toBe('DENIED'); }); it('auto-generates name for equals when not provided', async () => { - const evaluators = await parseEvaluators( + const evaluators = await parseGraders( { evaluators: [{ type: 'equals', value: 'APPROVED' }], }, @@ -200,7 +200,7 @@ describe('parseEvaluators - deterministic assertion types', () => { }); it('skips equals evaluator with missing value', async () => { - const evaluators = await parseEvaluators( + const evaluators = await parseGraders( { evaluators: [{ name: 'no-value', type: 'equals' }], }, @@ -212,7 +212,7 @@ describe('parseEvaluators - deterministic assertion types', () => { }); it('parses type: rubrics with criteria as llm-grader', async () => { - const evaluators = await parseEvaluators( + const evaluators = await parseGraders( { evaluators: [ { @@ -228,11 +228,11 @@ describe('parseEvaluators - deterministic assertion types', () => { ); expect(evaluators).toHaveLength(1); expect(evaluators?.[0].type).toBe('llm-grader'); - expect((evaluators?.[0] as LlmGraderEvaluatorConfig).rubrics).toHaveLength(1); + expect((evaluators?.[0] as LlmGraderConfig).rubrics).toHaveLength(1); }); it('parses multiple assertion types in one evaluators array', async () => { - const evaluators = await parseEvaluators( + const evaluators = await parseGraders( { evaluators: [ { name: 'c1', type: 'contains', value: 'hello' }, @@ -253,7 +253,7 @@ describe('parseEvaluators - deterministic assertion types', () => { }); }); -describe('parseEvaluators - tool-trajectory', () => { +describe('parseGraders - tool-trajectory', () => { let tempDir: string; beforeAll(async () => { @@ -280,10 +280,10 @@ describe('parseEvaluators - tool-trajectory', () => { ], }; - const evaluators = await parseEvaluators(rawEvalCase, undefined, [tempDir], 'test-case'); + const evaluators = await parseGraders(rawEvalCase, undefined, [tempDir], 'test-case'); expect(evaluators).toHaveLength(1); - const config = evaluators?.[0] as ToolTrajectoryEvaluatorConfig; + const config = evaluators?.[0] as ToolTrajectoryGraderConfig; expect(config.type).toBe('tool-trajectory'); expect(config.name).toBe('tool-usage-check'); expect(config.mode).toBe('any_order'); @@ -303,10 +303,10 @@ describe('parseEvaluators - tool-trajectory', () => { ], }; - const evaluators = await parseEvaluators(rawEvalCase, undefined, [tempDir], 'test-case'); + const evaluators = await parseGraders(rawEvalCase, undefined, [tempDir], 'test-case'); expect(evaluators).toHaveLength(1); - const config = evaluators?.[0] as ToolTrajectoryEvaluatorConfig; + const config = evaluators?.[0] as ToolTrajectoryGraderConfig; expect(config.type).toBe('tool-trajectory'); expect(config.mode).toBe('in_order'); expect(config.expected).toEqual([{ tool: 'search' }, { tool: 'analyze' }, { tool: 'report' }]); @@ -324,10 +324,10 @@ 
describe('parseEvaluators - tool-trajectory', () => { ], }; - const evaluators = await parseEvaluators(rawEvalCase, undefined, [tempDir], 'test-case'); + const evaluators = await parseGraders(rawEvalCase, undefined, [tempDir], 'test-case'); expect(evaluators).toHaveLength(1); - const config = evaluators?.[0] as ToolTrajectoryEvaluatorConfig; + const config = evaluators?.[0] as ToolTrajectoryGraderConfig; expect(config.type).toBe('tool-trajectory'); expect(config.mode).toBe('exact'); expect(config.expected).toEqual([{ tool: 'toolA' }, { tool: 'toolB' }]); @@ -344,7 +344,7 @@ describe('parseEvaluators - tool-trajectory', () => { ], }; - const evaluators = await parseEvaluators(rawEvalCase, undefined, [tempDir], 'test-case'); + const evaluators = await parseGraders(rawEvalCase, undefined, [tempDir], 'test-case'); expect(evaluators).toBeUndefined(); }); @@ -360,7 +360,7 @@ describe('parseEvaluators - tool-trajectory', () => { ], }; - const evaluators = await parseEvaluators(rawEvalCase, undefined, [tempDir], 'test-case'); + const evaluators = await parseGraders(rawEvalCase, undefined, [tempDir], 'test-case'); expect(evaluators).toBeUndefined(); }); @@ -376,7 +376,7 @@ describe('parseEvaluators - tool-trajectory', () => { ], }; - const evaluators = await parseEvaluators(rawEvalCase, undefined, [tempDir], 'test-case'); + const evaluators = await parseGraders(rawEvalCase, undefined, [tempDir], 'test-case'); expect(evaluators).toBeUndefined(); }); @@ -392,7 +392,7 @@ describe('parseEvaluators - tool-trajectory', () => { ], }; - const evaluators = await parseEvaluators(rawEvalCase, undefined, [tempDir], 'test-case'); + const evaluators = await parseGraders(rawEvalCase, undefined, [tempDir], 'test-case'); expect(evaluators).toBeUndefined(); }); @@ -414,10 +414,10 @@ describe('parseEvaluators - tool-trajectory', () => { ], }; - const evaluators = await parseEvaluators(rawEvalCase, undefined, [tempDir], 'test-case'); + const evaluators = await parseGraders(rawEvalCase, undefined, [tempDir], 'test-case'); expect(evaluators).toHaveLength(1); - const config = evaluators?.[0] as ToolTrajectoryEvaluatorConfig; + const config = evaluators?.[0] as ToolTrajectoryGraderConfig; // Should keep valid numbers (including 0), filter out invalid ones expect(config.minimums).toEqual({ validTool: 5, zeroTool: 0 }); }); @@ -439,15 +439,15 @@ describe('parseEvaluators - tool-trajectory', () => { ], }; - const evaluators = await parseEvaluators(rawEvalCase, undefined, [tempDir], 'test-case'); + const evaluators = await parseGraders(rawEvalCase, undefined, [tempDir], 'test-case'); expect(evaluators).toHaveLength(1); - const config = evaluators?.[0] as ToolTrajectoryEvaluatorConfig; + const config = evaluators?.[0] as ToolTrajectoryGraderConfig; expect(config.expected).toEqual([{ tool: 'validTool' }, { tool: 'anotherValid' }]); }); }); -describe('parseEvaluators - code-grader config pass-through', () => { +describe('parseGraders - code-grader config pass-through', () => { let tempDir: string; beforeAll(async () => { @@ -478,10 +478,10 @@ describe('parseEvaluators - code-grader config pass-through', () => { ], }; - const evaluators = await parseEvaluators(rawEvalCase, undefined, [tempDir], 'test-case'); + const evaluators = await parseGraders(rawEvalCase, undefined, [tempDir], 'test-case'); expect(evaluators).toHaveLength(1); - const config = evaluators?.[0] as CodeEvaluatorConfig; + const config = evaluators?.[0] as CodeGraderConfig; expect(config.type).toBe('code-grader'); expect(config.name).toBe('fuzzy-matcher'); 
expect(config.config).toEqual({ @@ -505,10 +505,10 @@ describe('parseEvaluators - code-grader config pass-through', () => { ], }; - const evaluators = await parseEvaluators(rawEvalCase, undefined, [tempDir], 'test-case'); + const evaluators = await parseGraders(rawEvalCase, undefined, [tempDir], 'test-case'); expect(evaluators).toHaveLength(1); - const config = evaluators?.[0] as CodeEvaluatorConfig; + const config = evaluators?.[0] as CodeGraderConfig; expect(config.type).toBe('code-grader'); expect(config.config).toBeUndefined(); }); @@ -527,10 +527,10 @@ describe('parseEvaluators - code-grader config pass-through', () => { ], }; - const evaluators = await parseEvaluators(rawEvalCase, undefined, [tempDir], 'test-case'); + const evaluators = await parseGraders(rawEvalCase, undefined, [tempDir], 'test-case'); expect(evaluators).toHaveLength(1); - const config = evaluators?.[0] as CodeEvaluatorConfig; + const config = evaluators?.[0] as CodeGraderConfig; expect(config.weight).toBe(2.0); expect(config.config).toEqual({ threshold: 0.85 }); }); @@ -546,10 +546,10 @@ describe('parseEvaluators - code-grader config pass-through', () => { ], }; - const evaluators = await parseEvaluators(rawEvalCase, undefined, [tempDir], 'test-case'); + const evaluators = await parseGraders(rawEvalCase, undefined, [tempDir], 'test-case'); expect(evaluators).toHaveLength(1); - const config = evaluators?.[0] as CodeEvaluatorConfig; + const config = evaluators?.[0] as CodeGraderConfig; if (process.platform === 'win32') { expect(config.command).toEqual(['cmd.exe', '/c', './test_script.ts']); } else { @@ -558,10 +558,10 @@ }); }); -describe('parseEvaluators - kebab-case type normalization', () => { +describe('parseGraders - kebab-case type normalization', () => { const tempDir = '/tmp'; - it('normalizes kebab-case evaluator types to snake_case', async () => { + it('normalizes grader types to canonical kebab-case', async () => { const rawEvalCase = { evaluators: [ { @@ -573,11 +573,11 @@ describe('parseEvaluators - kebab-case type normalization', () => { ], }; - const evaluators = await parseEvaluators(rawEvalCase, undefined, [tempDir], 'test-case'); + const evaluators = await parseGraders(rawEvalCase, undefined, [tempDir], 'test-case'); expect(evaluators).toHaveLength(1); expect(evaluators?.[0].type).toBe('llm-grader'); - expect((evaluators?.[0] as LlmGraderEvaluatorConfig).target).toBe('grader-low-cost-a'); + expect((evaluators?.[0] as LlmGraderConfig).target).toBe('grader-low-cost-a'); }); it('accepts code-grader kebab-case as canonical form', async () => { @@ -591,7 +591,7 @@ describe('parseEvaluators - kebab-case type normalization', () => { ], }; - const evaluators = await parseEvaluators(rawEvalCase, undefined, [tempDir], 'test-case'); + const evaluators = await parseGraders(rawEvalCase, undefined, [tempDir], 'test-case'); expect(evaluators).toHaveLength(1); expect(evaluators?.[0].type).toBe('code-grader'); @@ -607,7 +607,7 @@ describe('parseEvaluators - kebab-case type normalization', () => { ], }; - const evaluators = await parseEvaluators(rawEvalCase, undefined, [tempDir], 'test-case'); + const evaluators = await parseGraders(rawEvalCase, undefined, [tempDir], 'test-case'); expect(evaluators).toHaveLength(1); expect(evaluators?.[0].type).toBe('is-json'); @@ -624,7 +624,7 @@ describe('parseEvaluators - kebab-case type normalization', () => { ], }; - const evaluators = await parseEvaluators(rawEvalCase, undefined, [tempDir], 'test-case'); +
const evaluators = await parseGraders(rawEvalCase, undefined, [tempDir], 'test-case'); expect(evaluators).toHaveLength(1); expect(evaluators?.[0].type).toBe('llm-grader'); @@ -641,7 +641,7 @@ describe('parseEvaluators - kebab-case type normalization', () => { ], }; - const evaluators = await parseEvaluators(rawEvalCase, undefined, [tempDir], 'test-case'); + const evaluators = await parseGraders(rawEvalCase, undefined, [tempDir], 'test-case'); expect(evaluators).toBeUndefined(); }); @@ -657,14 +657,14 @@ describe('parseEvaluators - kebab-case type normalization', () => { ], }; - const evaluators = await parseEvaluators(rawEvalCase, undefined, [tempDir], 'test-case'); + const evaluators = await parseGraders(rawEvalCase, undefined, [tempDir], 'test-case'); expect(evaluators).toHaveLength(1); expect(evaluators?.[0].type).toBe('contains'); }); }); -describe('parseEvaluators - score_ranges rubrics', () => { +describe('parseGraders - score_ranges rubrics', () => { it('parses valid score_ranges with min_score', async () => { const rawEvalCase = { evaluators: [ @@ -688,7 +688,7 @@ describe('parseEvaluators - score_ranges rubrics', () => { ], }; - const evaluators = await parseEvaluators(rawEvalCase, undefined, [process.cwd()], 'test-case'); + const evaluators = await parseGraders(rawEvalCase, undefined, [process.cwd()], 'test-case'); expect(evaluators).toHaveLength(1); const config = evaluators?.[0]; @@ -724,7 +724,7 @@ describe('parseEvaluators - score_ranges rubrics', () => { }; await expect( - parseEvaluators(rawEvalCase, undefined, [process.cwd()], 'test-case'), + parseGraders(rawEvalCase, undefined, [process.cwd()], 'test-case'), ).rejects.toThrow(/overlapping/i); }); @@ -748,7 +748,7 @@ describe('parseEvaluators - score_ranges rubrics', () => { }; await expect( - parseEvaluators(rawEvalCase, undefined, [process.cwd()], 'test-case'), + parseGraders(rawEvalCase, undefined, [process.cwd()], 'test-case'), ).rejects.toThrow(/coverage/i); }); @@ -771,7 +771,7 @@ describe('parseEvaluators - score_ranges rubrics', () => { ], }; - const evaluators = await parseEvaluators(rawEvalCase, undefined, [process.cwd()], 'test-case'); + const evaluators = await parseGraders(rawEvalCase, undefined, [process.cwd()], 'test-case'); expect(evaluators).toHaveLength(1); const config = evaluators?.[0]; @@ -784,7 +784,7 @@ describe('parseEvaluators - score_ranges rubrics', () => { }); }); -describe('parseEvaluators - score_ranges shorthand map', () => { +describe('parseGraders - score_ranges shorthand map', () => { it('normalizes shorthand map to correct array format', async () => { const rawEvalCase = { evaluators: [ @@ -808,7 +808,7 @@ describe('parseEvaluators - score_ranges shorthand map', () => { ], }; - const evaluators = await parseEvaluators(rawEvalCase, undefined, [process.cwd()], 'test-case'); + const evaluators = await parseGraders(rawEvalCase, undefined, [process.cwd()], 'test-case'); expect(evaluators).toHaveLength(1); const config = evaluators?.[0]; @@ -860,7 +860,7 @@ describe('parseEvaluators - score_ranges shorthand map', () => { }; await expect( - parseEvaluators(rawEvalCase, undefined, [process.cwd()], 'test-case'), + parseGraders(rawEvalCase, undefined, [process.cwd()], 'test-case'), ).rejects.toThrow(/must start at 0/); }); @@ -885,7 +885,7 @@ describe('parseEvaluators - score_ranges shorthand map', () => { ], }; - const evaluators = await parseEvaluators(rawEvalCase, undefined, [process.cwd()], 'test-case'); + const evaluators = await parseGraders(rawEvalCase, undefined, [process.cwd()], 
'test-case'); expect(evaluators).toHaveLength(1); const config = evaluators?.[0]; @@ -895,7 +895,7 @@ describe('parseEvaluators - score_ranges shorthand map', () => { }); }); -describe('parseEvaluators - token-usage', () => { +describe('parseGraders - token-usage', () => { it('parses token-usage evaluator with limits', async () => { const rawEvalCase = { evaluators: [ @@ -908,7 +908,7 @@ describe('parseEvaluators - token-usage', () => { ], }; - const evaluators = await parseEvaluators(rawEvalCase, undefined, [process.cwd()], 'test-case'); + const evaluators = await parseGraders(rawEvalCase, undefined, [process.cwd()], 'test-case'); expect(evaluators).toHaveLength(1); expect(evaluators?.[0]).toEqual({ @@ -939,7 +939,7 @@ describe('parseEvaluators - token-usage', () => { ], }; - const evaluators = await parseEvaluators(rawEvalCase, globalExecution, [process.cwd()], 'test'); + const evaluators = await parseGraders(rawEvalCase, globalExecution, [process.cwd()], 'test'); expect(evaluators).toHaveLength(1); expect(evaluators?.[0]).toEqual({ @@ -951,7 +951,7 @@ describe('parseEvaluators - token-usage', () => { }); }); -describe('parseEvaluators - execution-metrics', () => { +describe('parseGraders - execution-metrics', () => { it('parses execution-metrics evaluator with all thresholds', async () => { const rawEvalCase = { evaluators: [ @@ -970,7 +970,7 @@ describe('parseEvaluators - execution-metrics', () => { ], }; - const evaluators = await parseEvaluators(rawEvalCase, undefined, [process.cwd()], 'test-case'); + const evaluators = await parseGraders(rawEvalCase, undefined, [process.cwd()], 'test-case'); expect(evaluators).toHaveLength(1); expect(evaluators?.[0]).toEqual({ @@ -998,7 +998,7 @@ describe('parseEvaluators - execution-metrics', () => { ], }; - const evaluators = await parseEvaluators(rawEvalCase, undefined, [process.cwd()], 'test-case'); + const evaluators = await parseGraders(rawEvalCase, undefined, [process.cwd()], 'test-case'); expect(evaluators).toHaveLength(1); expect(evaluators?.[0]).toEqual({ @@ -1023,7 +1023,7 @@ describe('parseEvaluators - execution-metrics', () => { ], }; - const evaluators = await parseEvaluators(rawEvalCase, undefined, [process.cwd()], 'test-case'); + const evaluators = await parseGraders(rawEvalCase, undefined, [process.cwd()], 'test-case'); expect(evaluators).toHaveLength(1); expect(evaluators?.[0]).toEqual({ @@ -1047,7 +1047,7 @@ describe('parseEvaluators - execution-metrics', () => { ], }; - const evaluators = await parseEvaluators(rawEvalCase, undefined, [process.cwd()], 'test-case'); + const evaluators = await parseGraders(rawEvalCase, undefined, [process.cwd()], 'test-case'); expect(evaluators).toBeUndefined(); }); @@ -1063,7 +1063,7 @@ describe('parseEvaluators - execution-metrics', () => { ], }; - const evaluators = await parseEvaluators(rawEvalCase, undefined, [process.cwd()], 'test-case'); + const evaluators = await parseGraders(rawEvalCase, undefined, [process.cwd()], 'test-case'); expect(evaluators).toBeUndefined(); }); @@ -1079,7 +1079,7 @@ describe('parseEvaluators - execution-metrics', () => { ], }; - const evaluators = await parseEvaluators(rawEvalCase, undefined, [process.cwd()], 'test-case'); + const evaluators = await parseGraders(rawEvalCase, undefined, [process.cwd()], 'test-case'); expect(evaluators).toBeUndefined(); }); @@ -1095,7 +1095,7 @@ describe('parseEvaluators - execution-metrics', () => { ], }; - const evaluators = await parseEvaluators(rawEvalCase, undefined, [process.cwd()], 'test-case'); + const evaluators = await 
parseGraders(rawEvalCase, undefined, [process.cwd()], 'test-case'); expect(evaluators).toBeUndefined(); }); @@ -1111,7 +1111,7 @@ describe('parseEvaluators - execution-metrics', () => { ], }; - const evaluators = await parseEvaluators(rawEvalCase, undefined, [process.cwd()], 'test-case'); + const evaluators = await parseGraders(rawEvalCase, undefined, [process.cwd()], 'test-case'); expect(evaluators).toBeUndefined(); }); @@ -1127,7 +1127,7 @@ describe('parseEvaluators - execution-metrics', () => { ], }; - const evaluators = await parseEvaluators(rawEvalCase, undefined, [process.cwd()], 'test-case'); + const evaluators = await parseGraders(rawEvalCase, undefined, [process.cwd()], 'test-case'); expect(evaluators).toHaveLength(1); expect(evaluators?.[0]).toEqual({ @@ -1138,7 +1138,7 @@ describe('parseEvaluators - execution-metrics', () => { }); }); -describe('parseEvaluators - default evaluators merge', () => { +describe('parseGraders - default evaluators merge', () => { it('appends root evaluators after case-level evaluators', async () => { const rawEvalCase = { execution: { @@ -1150,7 +1150,7 @@ describe('parseEvaluators - default evaluators merge', () => { evaluators: [{ name: 'root-eval', type: 'latency', threshold: 5000 }], }; - const evaluators = await parseEvaluators(rawEvalCase, globalExecution, [process.cwd()], 'test'); + const evaluators = await parseGraders(rawEvalCase, globalExecution, [process.cwd()], 'test'); expect(evaluators).toHaveLength(2); expect(evaluators?.[0]).toEqual({ name: 'case-eval', type: 'latency', threshold: 3000 }); @@ -1164,7 +1164,7 @@ describe('parseEvaluators - default evaluators merge', () => { evaluators: [{ name: 'root-eval', type: 'latency', threshold: 5000 }], }; - const evaluators = await parseEvaluators(rawEvalCase, globalExecution, [process.cwd()], 'test'); + const evaluators = await parseGraders(rawEvalCase, globalExecution, [process.cwd()], 'test'); expect(evaluators).toHaveLength(1); expect(evaluators?.[0]).toEqual({ name: 'root-eval', type: 'latency', threshold: 5000 }); @@ -1182,7 +1182,7 @@ describe('parseEvaluators - default evaluators merge', () => { evaluators: [{ name: 'root-eval', type: 'latency', threshold: 5000 }], }; - const evaluators = await parseEvaluators(rawEvalCase, globalExecution, [process.cwd()], 'test'); + const evaluators = await parseGraders(rawEvalCase, globalExecution, [process.cwd()], 'test'); expect(evaluators).toHaveLength(1); expect(evaluators?.[0]).toEqual({ name: 'case-eval', type: 'latency', threshold: 3000 }); @@ -1190,7 +1190,7 @@ describe('parseEvaluators - default evaluators merge', () => { it('returns undefined when no evaluators at any level', async () => { const rawEvalCase = {}; - const evaluators = await parseEvaluators(rawEvalCase, undefined, [process.cwd()], 'test'); + const evaluators = await parseGraders(rawEvalCase, undefined, [process.cwd()], 'test'); expect(evaluators).toBeUndefined(); }); @@ -1203,7 +1203,7 @@ describe('parseEvaluators - default evaluators merge', () => { evaluators: [{ name: 'root-eval', type: 'latency', threshold: 5000 }], }; - const evaluators = await parseEvaluators(rawEvalCase, globalExecution, [process.cwd()], 'test'); + const evaluators = await parseGraders(rawEvalCase, globalExecution, [process.cwd()], 'test'); expect(evaluators).toBeUndefined(); }); @@ -1218,7 +1218,7 @@ describe('parseEvaluators - default evaluators merge', () => { evaluators: [{ name: 'root-eval', type: 'latency', threshold: 5000 }], }; - const evaluators = await parseEvaluators(rawEvalCase, 
globalExecution, [process.cwd()], 'test'); + const evaluators = await parseGraders(rawEvalCase, globalExecution, [process.cwd()], 'test'); expect(evaluators).toHaveLength(1); expect(evaluators?.[0]).toEqual({ name: 'root-eval', type: 'latency', threshold: 5000 }); @@ -1233,7 +1233,7 @@ describe('parseEvaluators - default evaluators merge', () => { evaluators: [{ name: 'root-eval', type: 'latency', threshold: 5000 }], }; - const evaluators = await parseEvaluators(rawEvalCase, globalExecution, [process.cwd()], 'test'); + const evaluators = await parseGraders(rawEvalCase, globalExecution, [process.cwd()], 'test'); expect(evaluators).toHaveLength(2); expect(evaluators?.[0]).toEqual({ name: 'case-eval', type: 'latency', threshold: 3000 }); @@ -1241,7 +1241,7 @@ describe('parseEvaluators - default evaluators merge', () => { }); }); -describe('parseEvaluators - assert field', () => { +describe('parseGraders - assert field', () => { let tempDir: string; beforeAll(async () => { @@ -1254,7 +1254,7 @@ describe('parseEvaluators - assert field', () => { }); it('parses assertions field as evaluators', async () => { - const evaluators = await parseEvaluators( + const evaluators = await parseGraders( { assertions: [{ type: 'contains', value: 'DENIED' }], }, @@ -1267,7 +1267,7 @@ describe('parseEvaluators - assert field', () => { }); it('parses legacy assert field as evaluators (backward compat)', async () => { - const evaluators = await parseEvaluators( + const evaluators = await parseGraders( { assert: [{ type: 'contains', value: 'DENIED' }], }, @@ -1280,7 +1280,7 @@ describe('parseEvaluators - assert field', () => { }); it('assertions takes precedence over execution.evaluators', async () => { - const evaluators = await parseEvaluators( + const evaluators = await parseGraders( { assertions: [{ type: 'contains', value: 'DENIED' }], execution: { @@ -1296,7 +1296,7 @@ describe('parseEvaluators - assert field', () => { }); it('assertions takes precedence over top-level evaluators', async () => { - const evaluators = await parseEvaluators( + const evaluators = await parseGraders( { assertions: [{ type: 'contains', value: 'DENIED' }], evaluators: [{ name: 'latency-check', type: 'latency', threshold: 5000 }], @@ -1310,7 +1310,7 @@ describe('parseEvaluators - assert field', () => { }); it('merges suite-level assertions with test-level assertions', async () => { - const evaluators = await parseEvaluators( + const evaluators = await parseGraders( { assertions: [{ type: 'contains', value: 'DENIED' }], }, @@ -1324,7 +1324,7 @@ describe('parseEvaluators - assert field', () => { }); it('skip_defaults prevents suite-level assertions from being appended', async () => { - const evaluators = await parseEvaluators( + const evaluators = await parseGraders( { assertions: [{ type: 'contains', value: 'DENIED' }], execution: { skip_defaults: true }, @@ -1338,7 +1338,7 @@ describe('parseEvaluators - assert field', () => { }); it('falls back to execution.evaluators when assert is not present', async () => { - const evaluators = await parseEvaluators( + const evaluators = await parseGraders( { execution: { evaluators: [{ name: 'latency-check', type: 'latency', threshold: 5000 }], @@ -1353,7 +1353,7 @@ describe('parseEvaluators - assert field', () => { }); it('suite-level assertions takes precedence over suite-level execution.evaluators', async () => { - const evaluators = await parseEvaluators( + const evaluators = await parseGraders( {}, { assertions: [{ type: 'contains', value: 'HELLO' }], @@ -1367,7 +1367,7 @@ 
describe('parseEvaluators - assert field', () => { }); it('falls back to suite-level execution.evaluators when suite assertions is not present', async () => { - const evaluators = await parseEvaluators( + const evaluators = await parseGraders( {}, { evaluators: [{ name: 'latency-check', type: 'latency', threshold: 5000 }], @@ -1380,7 +1380,7 @@ describe('parseEvaluators - assert field', () => { }); }); -describe('parseEvaluators - assertion templates', () => { +describe('parseGraders - assertion templates', () => { let tempDir: string; beforeAll(async () => { @@ -1409,7 +1409,7 @@ assertions: `, ); - const evaluators = await parseEvaluators( + const evaluators = await parseGraders( { evaluators: [{ include: 'shared' }], }, @@ -1432,7 +1432,7 @@ assertions: `, ); - const evaluators = await parseEvaluators( + const evaluators = await parseGraders( { assertions: [{ type: 'contains', value: 'case-only' }], execution: { skip_defaults: true }, @@ -1467,7 +1467,7 @@ assertions: `, ); - const evaluators = await parseEvaluators( + const evaluators = await parseGraders( { assertions: [{ include: './templates/nested/outer.yaml' }], }, @@ -1505,7 +1505,7 @@ assertions: `, ); - const evaluators = await parseEvaluators( + const evaluators = await parseGraders( { evaluators: [{ include: 'level-a' }], }, @@ -1550,7 +1550,7 @@ assertions: ); await expect( - parseEvaluators( + parseGraders( { evaluators: [{ include: 'depth-a' }], }, @@ -1563,7 +1563,7 @@ assertions: it('throws a clear error when a template is missing', async () => { await expect( - parseEvaluators( + parseGraders( { evaluators: [{ include: 'missing-template' }], }, @@ -1575,7 +1575,7 @@ assertions: }); }); -describe('parseEvaluators - type: rubrics with criteria', () => { +describe('parseGraders - type: rubrics with criteria', () => { let tempDir: string; beforeAll(async () => { @@ -1588,7 +1588,7 @@ describe('parseEvaluators - type: rubrics with criteria', () => { }); it('parses rubrics type with criteria array', async () => { - const evaluators = await parseEvaluators( + const evaluators = await parseGraders( { assertions: [ { @@ -1607,12 +1607,12 @@ describe('parseEvaluators - type: rubrics with criteria', () => { ); expect(evaluators).toHaveLength(1); expect(evaluators?.[0].type).toBe('llm-grader'); - expect((evaluators?.[0] as LlmGraderEvaluatorConfig).rubrics).toHaveLength(2); - expect((evaluators?.[0] as LlmGraderEvaluatorConfig).weight).toBe(4.0); + expect((evaluators?.[0] as LlmGraderConfig).rubrics).toHaveLength(2); + expect((evaluators?.[0] as LlmGraderConfig).weight).toBe(4.0); }); it('auto-generates name for rubrics type', async () => { - const evaluators = await parseEvaluators( + const evaluators = await parseGraders( { assertions: [ { @@ -1631,7 +1631,7 @@ describe('parseEvaluators - type: rubrics with criteria', () => { it('skips rubrics with empty criteria array', async () => { const warnSpy = spyOn(console, 'warn').mockImplementation(() => {}); - const evaluators = await parseEvaluators( + const evaluators = await parseGraders( { assertions: [ { @@ -1653,7 +1653,7 @@ describe('parseEvaluators - type: rubrics with criteria', () => { it('skips rubrics with missing criteria', async () => { const warnSpy = spyOn(console, 'warn').mockImplementation(() => {}); - const evaluators = await parseEvaluators( + const evaluators = await parseGraders( { assertions: [ { @@ -1673,7 +1673,7 @@ describe('parseEvaluators - type: rubrics with criteria', () => { }); it('supports string shorthand in criteria', async () => { - const evaluators 
= await parseEvaluators( + const evaluators = await parseGraders( { assertions: [ { @@ -1687,11 +1687,11 @@ describe('parseEvaluators - type: rubrics with criteria', () => { 'test-1', ); expect(evaluators).toHaveLength(1); - expect((evaluators?.[0] as LlmGraderEvaluatorConfig).rubrics).toHaveLength(2); + expect((evaluators?.[0] as LlmGraderConfig).rubrics).toHaveLength(2); }); }); -describe('parseEvaluators - required field', () => { +describe('parseGraders - required field', () => { let tempDir: string; beforeAll(async () => { @@ -1705,7 +1705,7 @@ describe('parseEvaluators - required field', () => { }); it('parses required: true on contains evaluator', async () => { - const evaluators = await parseEvaluators( + const evaluators = await parseGraders( { evaluators: [{ name: 'check', type: 'contains', value: 'DENIED', required: true }], }, @@ -1714,12 +1714,12 @@ describe('parseEvaluators - required field', () => { 'test-1', ); expect(evaluators).toHaveLength(1); - const config = evaluators?.[0] as ContainsEvaluatorConfig; + const config = evaluators?.[0] as ContainsGraderConfig; expect(config.required).toBe(true); }); it('parses required: 0.6 (numeric threshold) on contains evaluator', async () => { - const evaluators = await parseEvaluators( + const evaluators = await parseGraders( { evaluators: [{ name: 'check', type: 'contains', value: 'DENIED', required: 0.6 }], }, @@ -1728,12 +1728,12 @@ describe('parseEvaluators - required field', () => { 'test-1', ); expect(evaluators).toHaveLength(1); - const config = evaluators?.[0] as ContainsEvaluatorConfig; + const config = evaluators?.[0] as ContainsGraderConfig; expect(config.required).toBe(0.6); }); it('ignores required: false', async () => { - const evaluators = await parseEvaluators( + const evaluators = await parseGraders( { evaluators: [{ name: 'check', type: 'contains', value: 'DENIED', required: false }], }, @@ -1742,12 +1742,12 @@ describe('parseEvaluators - required field', () => { 'test-1', ); expect(evaluators).toHaveLength(1); - const config = evaluators?.[0] as ContainsEvaluatorConfig; + const config = evaluators?.[0] as ContainsGraderConfig; expect(config.required).toBeUndefined(); }); it('parses required on latency evaluator', async () => { - const evaluators = await parseEvaluators( + const evaluators = await parseGraders( { evaluators: [{ name: 'lat', type: 'latency', threshold: 5000, required: true }], }, @@ -1756,12 +1756,12 @@ describe('parseEvaluators - required field', () => { 'test-1', ); expect(evaluators).toHaveLength(1); - const config = evaluators?.[0] as LatencyEvaluatorConfig; + const config = evaluators?.[0] as LatencyGraderConfig; expect(config.required).toBe(true); }); it('parses required on code-grader evaluator', async () => { - const evaluators = await parseEvaluators( + const evaluators = await parseGraders( { evaluators: [ { @@ -1777,12 +1777,12 @@ describe('parseEvaluators - required field', () => { 'test-1', ); expect(evaluators).toHaveLength(1); - const config = evaluators?.[0] as CodeEvaluatorConfig; + const config = evaluators?.[0] as CodeGraderConfig; expect(config.required).toBe(true); }); it('parses required on llm-grader evaluator', async () => { - const evaluators = await parseEvaluators( + const evaluators = await parseGraders( { evaluators: [{ name: 'grader', type: 'llm-grader', required: 0.7 }], }, @@ -1791,12 +1791,12 @@ describe('parseEvaluators - required field', () => { 'test-1', ); expect(evaluators).toHaveLength(1); - const config = evaluators?.[0] as LlmGraderEvaluatorConfig; + const 
config = evaluators?.[0] as LlmGraderConfig; expect(config.required).toBe(0.7); }); it('ignores invalid required values (string, negative, > 1)', async () => { - const evaluators = await parseEvaluators( + const evaluators = await parseGraders( { evaluators: [ { name: 'c1', type: 'contains', value: 'A', required: 'yes' }, @@ -1812,12 +1812,12 @@ describe('parseEvaluators - required field', () => { expect(evaluators).toHaveLength(4); // All invalid required values should be dropped (undefined) for (const config of evaluators ?? []) { - expect((config as ContainsEvaluatorConfig).required).toBeUndefined(); + expect((config as ContainsGraderConfig).required).toBeUndefined(); } }); }); -describe('parseEvaluators - composite assertions field', () => { +describe('parseGraders - composite assertions field', () => { let tempDir: string; beforeAll(async () => { @@ -1833,7 +1833,7 @@ describe('parseEvaluators - composite assertions field', () => { }); it('parses composite with assertions field', async () => { - const evaluators = await parseEvaluators( + const evaluators = await parseGraders( { assertions: [ { @@ -1856,7 +1856,7 @@ describe('parseEvaluators - composite assertions field', () => { }); it('composite still works with evaluators field (backward compat)', async () => { - const evaluators = await parseEvaluators( + const evaluators = await parseGraders( { evaluators: [ { @@ -1879,7 +1879,7 @@ describe('parseEvaluators - composite assertions field', () => { }); it('composite assertions takes precedence over evaluators', async () => { - const evaluators = await parseEvaluators( + const evaluators = await parseGraders( { assertions: [ { @@ -1897,15 +1897,15 @@ describe('parseEvaluators - composite assertions field', () => { ); expect(evaluators).toHaveLength(1); // assertions takes precedence - only 1 inner evaluator - const composite = evaluators?.[0] as CompositeEvaluatorConfig; + const composite = evaluators?.[0] as CompositeGraderConfig; expect(composite.assertions).toHaveLength(1); expect(composite.assertions[0].name).toBe('safety'); }); }); -describe('parseEvaluators - string shorthand in assertions', () => { +describe('parseGraders - string shorthand in assertions', () => { it('treats all-string assertions array as a single rubrics evaluator', async () => { - const evaluators = await parseEvaluators( + const evaluators = await parseGraders( { assertions: [ 'Mentions divide-and-conquer approach', @@ -1921,20 +1921,16 @@ describe('parseEvaluators - string shorthand in assertions', () => { expect(evaluators).toHaveLength(1); const rubrics = evaluators?.[0]; expect(rubrics?.type).toBe('llm-grader'); - expect((rubrics as LlmGraderEvaluatorConfig).rubrics).toHaveLength(3); - expect((rubrics as LlmGraderEvaluatorConfig).rubrics?.[0].outcome).toBe( + expect((rubrics as LlmGraderConfig).rubrics).toHaveLength(3); + expect((rubrics as LlmGraderConfig).rubrics?.[0].outcome).toBe( 'Mentions divide-and-conquer approach', ); - expect((rubrics as LlmGraderEvaluatorConfig).rubrics?.[1].outcome).toBe( - 'Explains partition step', - ); - expect((rubrics as LlmGraderEvaluatorConfig).rubrics?.[2].outcome).toBe( - 'States time complexity', - ); + expect((rubrics as LlmGraderConfig).rubrics?.[1].outcome).toBe('Explains partition step'); + expect((rubrics as LlmGraderConfig).rubrics?.[2].outcome).toBe('States time complexity'); }); it('groups strings into rubrics and preserves object evaluators', async () => { - const evaluators = await parseEvaluators( + const evaluators = await parseGraders( { assertions: [ 
'Mentions divide-and-conquer approach', @@ -1950,8 +1946,8 @@ describe('parseEvaluators - string shorthand in assertions', () => { expect(evaluators).toHaveLength(2); // First: rubrics (at position of first string) expect(evaluators?.[0].type).toBe('llm-grader'); - expect((evaluators?.[0] as LlmGraderEvaluatorConfig).rubrics).toHaveLength(2); - expect((evaluators?.[0] as LlmGraderEvaluatorConfig).rubrics?.[0].outcome).toBe( + expect((evaluators?.[0] as LlmGraderConfig).rubrics).toHaveLength(2); + expect((evaluators?.[0] as LlmGraderConfig).rubrics?.[0].outcome).toBe( 'Mentions divide-and-conquer approach', ); // Second: the contains evaluator @@ -1960,7 +1956,7 @@ describe('parseEvaluators - string shorthand in assertions', () => { }); it('treats a single string as a single-criterion rubrics evaluator', async () => { - const evaluators = await parseEvaluators( + const evaluators = await parseGraders( { assertions: ['Response must be polite'], }, @@ -1971,14 +1967,14 @@ describe('parseEvaluators - string shorthand in assertions', () => { expect(evaluators).toHaveLength(1); expect(evaluators?.[0].type).toBe('llm-grader'); - expect((evaluators?.[0] as LlmGraderEvaluatorConfig).rubrics).toHaveLength(1); - expect((evaluators?.[0] as LlmGraderEvaluatorConfig).rubrics?.[0].outcome).toBe( + expect((evaluators?.[0] as LlmGraderConfig).rubrics).toHaveLength(1); + expect((evaluators?.[0] as LlmGraderConfig).rubrics?.[0].outcome).toBe( 'Response must be polite', ); }); it('ignores all-whitespace strings and produces no rubrics evaluator', async () => { - const evaluators = await parseEvaluators( + const evaluators = await parseGraders( { assertions: [' ', ''], }, @@ -1993,7 +1989,7 @@ describe('parseEvaluators - string shorthand in assertions', () => { it('sets rubrics grader weight = criteria count when mixed with other graders', async () => { // User sees 4 assertions; each should contribute equal weight. // rubrics(w=3) + contains(w=1) → each visible assertion = 1/4. 
- const evaluators = await parseEvaluators( + const evaluators = await parseGraders( { assertions: [ 'Identifies the undefined access', @@ -2008,7 +2004,7 @@ describe('parseEvaluators - string shorthand in assertions', () => { ); expect(evaluators).toHaveLength(2); - const rubrics = evaluators?.[0] as LlmGraderEvaluatorConfig; + const rubrics = evaluators?.[0] as LlmGraderConfig; expect(rubrics.type).toBe('llm-grader'); expect(rubrics.rubrics).toHaveLength(3); expect(rubrics.weight).toBe(3); @@ -2017,7 +2013,7 @@ describe('parseEvaluators - string shorthand in assertions', () => { }); }); -describe('parseEvaluators - file:// prefix prompt resolution', () => { +describe('parseGraders - file:// prefix prompt resolution', () => { let tempDir: string; beforeAll(async () => { @@ -2031,7 +2027,7 @@ describe('parseEvaluators - file:// prefix prompt resolution', () => { }); it('file:// prefix resolves existing file', async () => { - const evaluators = await parseEvaluators( + const evaluators = await parseGraders( { assertions: [{ name: 'quality', type: 'llm-grader', prompt: 'file://grader.md' }], }, @@ -2040,14 +2036,14 @@ describe('parseEvaluators - file:// prefix prompt resolution', () => { 'test-1', ); expect(evaluators).toHaveLength(1); - const config = evaluators?.[0] as LlmGraderEvaluatorConfig; + const config = evaluators?.[0] as LlmGraderConfig; expect(config.promptPath).toBeTruthy(); expect(config.promptPath).toContain('grader.md'); }); it('file:// prefix throws when file not found', async () => { await expect( - parseEvaluators( + parseGraders( { assertions: [{ name: 'missing', type: 'llm-grader', prompt: 'file://nonexistent.md' }], }, @@ -2059,7 +2055,7 @@ describe('parseEvaluators - file:// prefix prompt resolution', () => { }); it('bare path is always treated as inline text even if file exists', async () => { - const evaluators = await parseEvaluators( + const evaluators = await parseGraders( { assertions: [{ name: 'quality', type: 'llm-grader', prompt: 'grader.md' }], }, @@ -2068,7 +2064,7 @@ describe('parseEvaluators - file:// prefix prompt resolution', () => { 'test-1', ); expect(evaluators).toHaveLength(1); - const config = evaluators?.[0] as LlmGraderEvaluatorConfig; + const config = evaluators?.[0] as LlmGraderConfig; // Bare string is inline text — no file resolution, no promptPath expect(config.prompt).toBe('grader.md'); expect(config.promptPath).toBeUndefined(); diff --git a/packages/core/test/evaluation/orchestrator.test.ts b/packages/core/test/evaluation/orchestrator.test.ts index 21e4d986e..70e75bad6 100644 --- a/packages/core/test/evaluation/orchestrator.test.ts +++ b/packages/core/test/evaluation/orchestrator.test.ts @@ -3,7 +3,7 @@ import { mkdtempSync, readFileSync, readdirSync, writeFileSync } from 'node:fs'; import { tmpdir } from 'node:os'; import path from 'node:path'; -import { LlmGraderEvaluator, ToolTrajectoryEvaluator } from '../../src/evaluation/evaluators.js'; +import { LlmGrader, ToolTrajectoryGrader } from '../../src/evaluation/graders.js'; import { type EvaluationCache, runEvalCase, @@ -551,7 +551,7 @@ console.log('spreadsheet: revenue,total\\nQ1,42');`, }); const evaluatorRegistry = { - 'llm-grader': new LlmGraderEvaluator({ + 'llm-grader': new LlmGrader({ resolveGraderProvider: async () => graderProvider, }), }; @@ -871,7 +871,7 @@ describe('runEvalCase trace integration', () => { output, ); - const trajectoryEvaluator = new ToolTrajectoryEvaluator({ + const trajectoryEvaluator = new ToolTrajectoryGrader({ config: { name: 'tool-check', type: 
'tool-trajectory', @@ -911,7 +911,7 @@ describe('runEvalCase trace integration', () => { output: [{ role: 'assistant', content: 'Result' }], }); - const trajectoryEvaluator = new ToolTrajectoryEvaluator({ + const trajectoryEvaluator = new ToolTrajectoryGrader({ config: { name: 'tool-check', type: 'tool-trajectory', @@ -1163,9 +1163,9 @@ Reference: \${ref}\`); let receivedQuestion = ''; const captureGrader = { kind: 'llm-grader' as const, - async evaluate(context: { evalCase: EvalTest; evaluatorTemplateOverride?: string }) { - // The evaluatorTemplateOverride should contain our custom prompt - receivedQuestion = context.evaluatorTemplateOverride ?? ''; + async evaluate(context: { evalCase: EvalTest; graderTemplateOverride?: string }) { + // The graderTemplateOverride should contain our custom prompt + receivedQuestion = context.graderTemplateOverride ?? ''; return { score: 1.0, verdict: 'pass' as const, @@ -1229,8 +1229,8 @@ console.log('Question: ' + question + '\\nAnswer: ' + answer); let receivedPrompt = ''; const captureGrader = { kind: 'llm-grader' as const, - async evaluate(context: { evaluatorTemplateOverride?: string }) { - receivedPrompt = context.evaluatorTemplateOverride ?? ''; + async evaluate(context: { graderTemplateOverride?: string }) { + receivedPrompt = context.graderTemplateOverride ?? ''; return { score: 1.0, verdict: 'pass' as const, @@ -1282,8 +1282,8 @@ console.log('Question: ' + question + '\\nAnswer: ' + answer); let receivedPrompt = ''; const captureGrader = { kind: 'llm-grader' as const, - async evaluate(context: { evaluatorTemplateOverride?: string }) { - receivedPrompt = context.evaluatorTemplateOverride ?? ''; + async evaluate(context: { graderTemplateOverride?: string }) { + receivedPrompt = context.graderTemplateOverride ?? 
''; return { score: 1.0, verdict: 'pass' as const, @@ -1325,7 +1325,7 @@ console.log('Question: ' + question + '\\nAnswer: ' + answer); }); describe('runEvaluation with trials', () => { - // Provider that returns configurable scores via alternating evaluator results + // Provider that returns configurable scores via alternating grader results class MultiCallProvider implements Provider { readonly id = 'multi:mock'; readonly kind = 'mock' as const; @@ -1340,7 +1340,7 @@ describe('runEvaluation with trials', () => { } } - // Evaluator that returns different scores on successive calls + // Grader that returns different scores on successive calls function createScoringEvaluator(scores: number[]) { let callIndex = 0; return { @@ -1957,7 +1957,7 @@ describe('deterministic assertion evaluators in orchestrator', () => { expectedVerdict: 'fail', }, ])( - '$label: $type evaluator scores $expectedScore', + '$label: $type grader scores $expectedScore', async ({ evaluator, output, @@ -2283,7 +2283,7 @@ describe('required gates', () => { }); it('required: true uses 0.8 threshold (llm-grader score below 0.8 triggers gate)', async () => { - // Create an evaluator registry where llm-grader returns 0.7 (below 0.8 threshold) + // Create a grader registry where llm-grader returns 0.7 (below 0.8 threshold) const lowScoreEvaluatorRegistry = { 'llm-grader': { kind: 'llm-grader' as const, diff --git a/packages/core/test/evaluation/token-usage.test.ts b/packages/core/test/evaluation/token-usage.test.ts index 590cb8d44..7480e194f 100644 --- a/packages/core/test/evaluation/token-usage.test.ts +++ b/packages/core/test/evaluation/token-usage.test.ts @@ -9,7 +9,7 @@ import type { ProviderRequest, ProviderResponse, } from '../../src/evaluation/providers/types.js'; -import type { EvaluatorResult } from '../../src/evaluation/types.js'; +import type { GraderResult } from '../../src/evaluation/types.js'; // ─── AI SDK mapResponse ──────────────────────────────────────────────── // The mapResponse function is private, but we can test the public invoke() @@ -17,8 +17,8 @@ import type { EvaluatorResult } from '../../src/evaluation/types.js'; // Instead, we verify the type contracts that mapResponse must satisfy.
describe('token usage type contracts', () => { - it('EvaluatorResult accepts tokenUsage', () => { - const result: EvaluatorResult = { + it('GraderResult accepts tokenUsage', () => { + const result: GraderResult = { name: 'test', type: 'llm-grader', score: 0.9, @@ -28,8 +28,8 @@ describe('token usage type contracts', () => { expect(result.tokenUsage).toEqual({ input: 100, output: 50 }); }); - it('EvaluatorResult tokenUsage is optional', () => { - const result: EvaluatorResult = { + it('GraderResult tokenUsage is optional', () => { + const result: GraderResult = { name: 'test', type: 'llm-grader', score: 0.9, @@ -39,7 +39,7 @@ describe('token usage type contracts', () => { }); it('nested scores carry tokenUsage', () => { - const result: EvaluatorResult = { + const result: GraderResult = { name: 'composite', type: 'composite', score: 0.8, diff --git a/packages/core/test/evaluation/tool-trajectory-evaluator.test.ts b/packages/core/test/evaluation/tool-trajectory-grader.test.ts similarity index 86% rename from packages/core/test/evaluation/tool-trajectory-evaluator.test.ts rename to packages/core/test/evaluation/tool-trajectory-grader.test.ts index caba93608..f7074f8cd 100644 --- a/packages/core/test/evaluation/tool-trajectory-evaluator.test.ts +++ b/packages/core/test/evaluation/tool-trajectory-grader.test.ts @@ -1,10 +1,10 @@ import { describe, expect, it } from 'bun:test'; -import { ToolTrajectoryEvaluator } from '../../src/evaluation/evaluators.js'; -import type { EvaluationContext } from '../../src/evaluation/evaluators.js'; +import { ToolTrajectoryGrader } from '../../src/evaluation/graders.js'; +import type { EvaluationContext } from '../../src/evaluation/graders.js'; import type { ResolvedTarget } from '../../src/evaluation/providers/targets.js'; import type { Message, Provider } from '../../src/evaluation/providers/types.js'; -import type { ToolTrajectoryEvaluatorConfig, TraceSummary } from '../../src/evaluation/trace.js'; +import type { ToolTrajectoryGraderConfig, TraceSummary } from '../../src/evaluation/trace.js'; import { computeTraceSummary } from '../../src/evaluation/trace.js'; import type { EvalTest } from '../../src/evaluation/types.js'; @@ -50,16 +50,16 @@ function createContext(options: { }; } -describe('ToolTrajectoryEvaluator', () => { +describe('ToolTrajectoryGrader', () => { describe('no trace available', () => { it('returns score 0 when no trace is provided', () => { - const config: ToolTrajectoryEvaluatorConfig = { + const config: ToolTrajectoryGraderConfig = { name: 'test', type: 'tool-trajectory', mode: 'any_order', minimums: { search: 1 }, }; - const evaluator = new ToolTrajectoryEvaluator({ config }); + const evaluator = new ToolTrajectoryGrader({ config }); const result = evaluator.evaluate(createContext({})); @@ -86,13 +86,13 @@ describe('ToolTrajectoryEvaluator', () => { ]; const { trace } = computeTraceSummary(output); - const config: ToolTrajectoryEvaluatorConfig = { + const config: ToolTrajectoryGraderConfig = { name: 'test', type: 'tool-trajectory', mode: 'any_order', minimums: { search: 3, analyze: 1 }, }; - const evaluator = new ToolTrajectoryEvaluator({ config }); + const evaluator = new ToolTrajectoryGrader({ config }); const result = evaluator.evaluate( createContext({ @@ -115,13 +115,13 @@ describe('ToolTrajectoryEvaluator', () => { ]; const { trace } = computeTraceSummary(output); - const config: ToolTrajectoryEvaluatorConfig = { + const config: ToolTrajectoryGraderConfig = { name: 'test', type: 'tool-trajectory', mode: 'any_order', minimums: { 
search: 3, analyze: 1 }, }; - const evaluator = new ToolTrajectoryEvaluator({ config }); + const evaluator = new ToolTrajectoryGrader({ config }); const result = evaluator.evaluate( createContext({ @@ -257,13 +257,13 @@ describe('ToolTrajectoryEvaluator', () => { failMissPattern, }) => { it('passes when expected tools are satisfied', () => { - const config: ToolTrajectoryEvaluatorConfig = { + const config: ToolTrajectoryGraderConfig = { name: 'test', type: 'tool-trajectory', mode, expected: passExpected, }; - const evaluator = new ToolTrajectoryEvaluator({ config }); + const evaluator = new ToolTrajectoryGrader({ config }); const result = evaluator.evaluate(createContext({ output: passOutput })); @@ -273,13 +273,13 @@ describe('ToolTrajectoryEvaluator', () => { }); it('fails when expected tools are not satisfied', () => { - const config: ToolTrajectoryEvaluatorConfig = { + const config: ToolTrajectoryGraderConfig = { name: 'test', type: 'tool-trajectory', mode, expected: failExpected, }; - const evaluator = new ToolTrajectoryEvaluator({ config }); + const evaluator = new ToolTrajectoryGrader({ config }); const result = evaluator.evaluate(createContext({ output: failOutput })); @@ -304,13 +304,13 @@ describe('ToolTrajectoryEvaluator', () => { }, ]; - const config: ToolTrajectoryEvaluatorConfig = { + const config: ToolTrajectoryGraderConfig = { name: 'test', type: 'tool-trajectory', mode: 'in_order', expected: [{ tool: 'search' }, { tool: 'analyze' }, { tool: 'report' }], }; - const evaluator = new ToolTrajectoryEvaluator({ config }); + const evaluator = new ToolTrajectoryGrader({ config }); const result = evaluator.evaluate( createContext({ @@ -338,13 +338,13 @@ describe('ToolTrajectoryEvaluator', () => { }, ]; - const config: ToolTrajectoryEvaluatorConfig = { + const config: ToolTrajectoryGraderConfig = { name: 'test', type: 'tool-trajectory', mode: 'in_order', expected: [{ tool: 'search' }, { tool: 'report' }], }; - const evaluator = new ToolTrajectoryEvaluator({ config }); + const evaluator = new ToolTrajectoryGrader({ config }); const result = evaluator.evaluate( createContext({ @@ -371,13 +371,13 @@ describe('ToolTrajectoryEvaluator', () => { }, ]; - const config: ToolTrajectoryEvaluatorConfig = { + const config: ToolTrajectoryGraderConfig = { name: 'test', type: 'tool-trajectory', mode: 'exact', expected: [{ tool: 'search' }, { tool: 'analyze' }], }; - const evaluator = new ToolTrajectoryEvaluator({ config }); + const evaluator = new ToolTrajectoryGrader({ config }); const result = evaluator.evaluate( createContext({ @@ -407,7 +407,7 @@ describe('ToolTrajectoryEvaluator', () => { }, ]; - const config: ToolTrajectoryEvaluatorConfig = { + const config: ToolTrajectoryGraderConfig = { name: 'test', type: 'tool-trajectory', mode: 'exact', @@ -416,7 +416,7 @@ describe('ToolTrajectoryEvaluator', () => { { tool: 'analyze', args: { format: 'json' } }, ], }; - const evaluator = new ToolTrajectoryEvaluator({ config }); + const evaluator = new ToolTrajectoryGrader({ config }); const result = evaluator.evaluate(createContext({ output })); @@ -432,13 +432,13 @@ describe('ToolTrajectoryEvaluator', () => { }, ]; - const config: ToolTrajectoryEvaluatorConfig = { + const config: ToolTrajectoryGraderConfig = { name: 'test', type: 'tool-trajectory', mode: 'exact', expected: [{ tool: 'search', args: { query: 'test', limit: 10 } }], }; - const evaluator = new ToolTrajectoryEvaluator({ config }); + const evaluator = new ToolTrajectoryGrader({ config }); const result = evaluator.evaluate(createContext({ 
output })); @@ -458,13 +458,13 @@ describe('ToolTrajectoryEvaluator', () => { }, ]; - const config: ToolTrajectoryEvaluatorConfig = { + const config: ToolTrajectoryGraderConfig = { name: 'test', type: 'tool-trajectory', mode: 'exact', expected: [{ tool: 'search', args: { query: 'test', limit: 10 } }], }; - const evaluator = new ToolTrajectoryEvaluator({ config }); + const evaluator = new ToolTrajectoryGrader({ config }); const result = evaluator.evaluate(createContext({ output })); @@ -483,13 +483,13 @@ describe('ToolTrajectoryEvaluator', () => { }, ]; - const config: ToolTrajectoryEvaluatorConfig = { + const config: ToolTrajectoryGraderConfig = { name: 'test', type: 'tool-trajectory', mode: 'exact', expected: [{ tool: 'search', args: 'any' }], }; - const evaluator = new ToolTrajectoryEvaluator({ config }); + const evaluator = new ToolTrajectoryGrader({ config }); const result = evaluator.evaluate(createContext({ output })); @@ -505,13 +505,13 @@ describe('ToolTrajectoryEvaluator', () => { }, ]; - const config: ToolTrajectoryEvaluatorConfig = { + const config: ToolTrajectoryGraderConfig = { name: 'test', type: 'tool-trajectory', mode: 'exact', expected: [{ tool: 'search' }], }; - const evaluator = new ToolTrajectoryEvaluator({ config }); + const evaluator = new ToolTrajectoryGrader({ config }); const result = evaluator.evaluate(createContext({ output })); @@ -529,13 +529,13 @@ describe('ToolTrajectoryEvaluator', () => { }, ]; - const config: ToolTrajectoryEvaluatorConfig = { + const config: ToolTrajectoryGraderConfig = { name: 'test', type: 'tool-trajectory', mode: 'exact', expected: [{ tool: 'search', args: { query: 'test', limit: 10 }, argsMatch: 'superset' }], }; - const evaluator = new ToolTrajectoryEvaluator({ config }); + const evaluator = new ToolTrajectoryGrader({ config }); const result = evaluator.evaluate(createContext({ output })); @@ -551,13 +551,13 @@ describe('ToolTrajectoryEvaluator', () => { }, ]; - const config: ToolTrajectoryEvaluatorConfig = { + const config: ToolTrajectoryGraderConfig = { name: 'test', type: 'tool-trajectory', mode: 'exact', expected: [{ tool: 'search', args: { query: 'test', limit: 10 }, argsMatch: 'superset' }], }; - const evaluator = new ToolTrajectoryEvaluator({ config }); + const evaluator = new ToolTrajectoryGrader({ config }); const result = evaluator.evaluate(createContext({ output })); @@ -575,13 +575,13 @@ describe('ToolTrajectoryEvaluator', () => { }, ]; - const config: ToolTrajectoryEvaluatorConfig = { + const config: ToolTrajectoryGraderConfig = { name: 'test', type: 'tool-trajectory', mode: 'exact', expected: [{ tool: 'search', args: { query: 'test', limit: 10 }, argsMatch: 'subset' }], }; - const evaluator = new ToolTrajectoryEvaluator({ config }); + const evaluator = new ToolTrajectoryGrader({ config }); const result = evaluator.evaluate(createContext({ output })); @@ -597,13 +597,13 @@ describe('ToolTrajectoryEvaluator', () => { }, ]; - const config: ToolTrajectoryEvaluatorConfig = { + const config: ToolTrajectoryGraderConfig = { name: 'test', type: 'tool-trajectory', mode: 'exact', expected: [{ tool: 'search', args: { query: 'test' }, argsMatch: 'subset' }], }; - const evaluator = new ToolTrajectoryEvaluator({ config }); + const evaluator = new ToolTrajectoryGrader({ config }); const result = evaluator.evaluate(createContext({ output })); @@ -621,13 +621,13 @@ describe('ToolTrajectoryEvaluator', () => { }, ]; - const config: ToolTrajectoryEvaluatorConfig = { + const config: ToolTrajectoryGraderConfig = { name: 'test', type: 
'tool-trajectory', mode: 'exact', expected: [{ tool: 'search', args: { query: 'test', limit: 10 }, argsMatch: 'ignore' }], }; - const evaluator = new ToolTrajectoryEvaluator({ config }); + const evaluator = new ToolTrajectoryGrader({ config }); const result = evaluator.evaluate(createContext({ output })); @@ -650,7 +650,7 @@ describe('ToolTrajectoryEvaluator', () => { }, ]; - const config: ToolTrajectoryEvaluatorConfig = { + const config: ToolTrajectoryGraderConfig = { name: 'test', type: 'tool-trajectory', mode: 'exact', @@ -662,7 +662,7 @@ describe('ToolTrajectoryEvaluator', () => { }, ], }; - const evaluator = new ToolTrajectoryEvaluator({ config }); + const evaluator = new ToolTrajectoryGrader({ config }); const result = evaluator.evaluate(createContext({ output })); @@ -684,7 +684,7 @@ describe('ToolTrajectoryEvaluator', () => { }, ]; - const config: ToolTrajectoryEvaluatorConfig = { + const config: ToolTrajectoryGraderConfig = { name: 'test', type: 'tool-trajectory', mode: 'exact', @@ -696,7 +696,7 @@ describe('ToolTrajectoryEvaluator', () => { }, ], }; - const evaluator = new ToolTrajectoryEvaluator({ config }); + const evaluator = new ToolTrajectoryGrader({ config }); const result = evaluator.evaluate(createContext({ output })); @@ -720,7 +720,7 @@ describe('ToolTrajectoryEvaluator', () => { }, ]; - const config: ToolTrajectoryEvaluatorConfig = { + const config: ToolTrajectoryGraderConfig = { name: 'test', type: 'tool-trajectory', mode: 'exact', @@ -732,7 +732,7 @@ describe('ToolTrajectoryEvaluator', () => { }, ], }; - const evaluator = new ToolTrajectoryEvaluator({ config }); + const evaluator = new ToolTrajectoryGrader({ config }); const result = evaluator.evaluate(createContext({ output })); @@ -754,7 +754,7 @@ describe('ToolTrajectoryEvaluator', () => { }, ]; - const config: ToolTrajectoryEvaluatorConfig = { + const config: ToolTrajectoryGraderConfig = { name: 'test', type: 'tool-trajectory', mode: 'exact', @@ -767,7 +767,7 @@ describe('ToolTrajectoryEvaluator', () => { }, ], }; - const evaluator = new ToolTrajectoryEvaluator({ config }); + const evaluator = new ToolTrajectoryGrader({ config }); const result = evaluator.evaluate(createContext({ output })); @@ -788,7 +788,7 @@ describe('ToolTrajectoryEvaluator', () => { }, ]; - const config: ToolTrajectoryEvaluatorConfig = { + const config: ToolTrajectoryGraderConfig = { name: 'test', type: 'tool-trajectory', mode: 'exact', @@ -798,7 +798,7 @@ describe('ToolTrajectoryEvaluator', () => { { tool: 'analyze', args: { format: 'json' } }, ], }; - const evaluator = new ToolTrajectoryEvaluator({ config }); + const evaluator = new ToolTrajectoryGrader({ config }); const result = evaluator.evaluate(createContext({ output })); @@ -818,7 +818,7 @@ describe('ToolTrajectoryEvaluator', () => { }, ]; - const config: ToolTrajectoryEvaluatorConfig = { + const config: ToolTrajectoryGraderConfig = { name: 'test', type: 'tool-trajectory', mode: 'exact', @@ -830,7 +830,7 @@ describe('ToolTrajectoryEvaluator', () => { { tool: 'analyze', args: { format: 'json' }, argsMatch: 'exact' }, ], }; - const evaluator = new ToolTrajectoryEvaluator({ config }); + const evaluator = new ToolTrajectoryGrader({ config }); const result = evaluator.evaluate(createContext({ output })); @@ -850,7 +850,7 @@ describe('ToolTrajectoryEvaluator', () => { }, ]; - const config: ToolTrajectoryEvaluatorConfig = { + const config: ToolTrajectoryGraderConfig = { name: 'test', type: 'tool-trajectory', mode: 'exact', @@ -860,7 +860,7 @@ describe('ToolTrajectoryEvaluator', () => { { 
tool: 'analyze', args: { format: 'xml', depth: 1 } }, ], }; - const evaluator = new ToolTrajectoryEvaluator({ config }); + const evaluator = new ToolTrajectoryGrader({ config }); const result = evaluator.evaluate(createContext({ output })); @@ -882,7 +882,7 @@ describe('ToolTrajectoryEvaluator', () => { }, ]; - const config: ToolTrajectoryEvaluatorConfig = { + const config: ToolTrajectoryGraderConfig = { name: 'test', type: 'tool-trajectory', mode: 'in_order', @@ -892,7 +892,7 @@ describe('ToolTrajectoryEvaluator', () => { { tool: 'analyze', args: { format: 'json' } }, ], }; - const evaluator = new ToolTrajectoryEvaluator({ config }); + const evaluator = new ToolTrajectoryGrader({ config }); const result = evaluator.evaluate(createContext({ output })); @@ -913,13 +913,13 @@ describe('ToolTrajectoryEvaluator', () => { }, ]; - const config: ToolTrajectoryEvaluatorConfig = { + const config: ToolTrajectoryGraderConfig = { name: 'test', type: 'tool-trajectory', mode: 'exact', expected: [{ tool: 'search', args: { tags: ['a', 'b', 'c'] } }], }; - const evaluator = new ToolTrajectoryEvaluator({ config }); + const evaluator = new ToolTrajectoryGrader({ config }); const result = evaluator.evaluate(createContext({ output })); @@ -935,13 +935,13 @@ describe('ToolTrajectoryEvaluator', () => { }, ]; - const config: ToolTrajectoryEvaluatorConfig = { + const config: ToolTrajectoryGraderConfig = { name: 'test', type: 'tool-trajectory', mode: 'exact', expected: [{ tool: 'search', args: { tags: ['a', 'b', 'c'] } }], }; - const evaluator = new ToolTrajectoryEvaluator({ config }); + const evaluator = new ToolTrajectoryGrader({ config }); const result = evaluator.evaluate(createContext({ output })); @@ -959,13 +959,13 @@ describe('ToolTrajectoryEvaluator', () => { }, ]; - const config: ToolTrajectoryEvaluatorConfig = { + const config: ToolTrajectoryGraderConfig = { name: 'test', type: 'tool-trajectory', mode: 'exact', expected: [{ tool: 'search', args: {} }], }; - const evaluator = new ToolTrajectoryEvaluator({ config }); + const evaluator = new ToolTrajectoryGrader({ config }); const result = evaluator.evaluate(createContext({ output })); @@ -981,13 +981,13 @@ describe('ToolTrajectoryEvaluator', () => { }, ]; - const config: ToolTrajectoryEvaluatorConfig = { + const config: ToolTrajectoryGraderConfig = { name: 'test', type: 'tool-trajectory', mode: 'exact', expected: [{ tool: 'search', args: { query: 'test' } }], }; - const evaluator = new ToolTrajectoryEvaluator({ config }); + const evaluator = new ToolTrajectoryGrader({ config }); const result = evaluator.evaluate(createContext({ output })); @@ -1010,7 +1010,7 @@ describe('ToolTrajectoryEvaluator', () => { }, ]; - const config: ToolTrajectoryEvaluatorConfig = { + const config: ToolTrajectoryGraderConfig = { name: 'test', type: 'tool-trajectory', mode: 'exact', @@ -1021,7 +1021,7 @@ describe('ToolTrajectoryEvaluator', () => { }, ], }; - const evaluator = new ToolTrajectoryEvaluator({ config }); + const evaluator = new ToolTrajectoryGrader({ config }); const result = evaluator.evaluate(createContext({ output })); @@ -1041,13 +1041,13 @@ describe('ToolTrajectoryEvaluator', () => { }, ]; - const config: ToolTrajectoryEvaluatorConfig = { + const config: ToolTrajectoryGraderConfig = { name: 'test', type: 'tool-trajectory', mode: 'in_order', expected: [{ tool: 'Read', maxDurationMs: 100 }], }; - const evaluator = new ToolTrajectoryEvaluator({ config }); + const evaluator = new ToolTrajectoryGrader({ config }); const result = evaluator.evaluate(createContext({ 
output })); @@ -1068,13 +1068,13 @@ describe('ToolTrajectoryEvaluator', () => { }, ]; - const config: ToolTrajectoryEvaluatorConfig = { + const config: ToolTrajectoryGraderConfig = { name: 'test', type: 'tool-trajectory', mode: 'in_order', expected: [{ tool: 'Read', maxDurationMs: 50 }], }; - const evaluator = new ToolTrajectoryEvaluator({ config }); + const evaluator = new ToolTrajectoryGrader({ config }); const result = evaluator.evaluate(createContext({ output })); @@ -1095,13 +1095,13 @@ describe('ToolTrajectoryEvaluator', () => { }, ]; - const config: ToolTrajectoryEvaluatorConfig = { + const config: ToolTrajectoryGraderConfig = { name: 'test', type: 'tool-trajectory', mode: 'in_order', expected: [{ tool: 'Read', maxDurationMs: 100 }], }; - const evaluator = new ToolTrajectoryEvaluator({ config }); + const evaluator = new ToolTrajectoryGrader({ config }); const result = evaluator.evaluate(createContext({ output })); @@ -1132,7 +1132,7 @@ describe('ToolTrajectoryEvaluator', () => { }, ]; - const config: ToolTrajectoryEvaluatorConfig = { + const config: ToolTrajectoryGraderConfig = { name: 'test', type: 'tool-trajectory', mode: 'in_order', @@ -1142,7 +1142,7 @@ describe('ToolTrajectoryEvaluator', () => { { tool: 'Write', maxDurationMs: 500 }, ], }; - const evaluator = new ToolTrajectoryEvaluator({ config }); + const evaluator = new ToolTrajectoryGrader({ config }); const result = evaluator.evaluate(createContext({ output })); @@ -1175,7 +1175,7 @@ describe('ToolTrajectoryEvaluator', () => { }, ]; - const config: ToolTrajectoryEvaluatorConfig = { + const config: ToolTrajectoryGraderConfig = { name: 'test', type: 'tool-trajectory', mode: 'exact', @@ -1184,7 +1184,7 @@ describe('ToolTrajectoryEvaluator', () => { { tool: 'Edit', maxDurationMs: 500 }, ], }; - const evaluator = new ToolTrajectoryEvaluator({ config }); + const evaluator = new ToolTrajectoryGrader({ config }); const result = evaluator.evaluate(createContext({ output })); @@ -1204,13 +1204,13 @@ describe('ToolTrajectoryEvaluator', () => { }, ]; - const config: ToolTrajectoryEvaluatorConfig = { + const config: ToolTrajectoryGraderConfig = { name: 'test', type: 'tool-trajectory', mode: 'exact', expected: [{ tool: 'Read', maxDurationMs: 100 }], }; - const evaluator = new ToolTrajectoryEvaluator({ config }); + const evaluator = new ToolTrajectoryGrader({ config }); const result = evaluator.evaluate(createContext({ output })); @@ -1231,13 +1231,13 @@ describe('ToolTrajectoryEvaluator', () => { }, ]; - const config: ToolTrajectoryEvaluatorConfig = { + const config: ToolTrajectoryGraderConfig = { name: 'test', type: 'tool-trajectory', mode: 'exact', expected: [{ tool: 'Read', maxDurationMs: 100 }], }; - const evaluator = new ToolTrajectoryEvaluator({ config }); + const evaluator = new ToolTrajectoryGrader({ config }); const result = evaluator.evaluate(createContext({ output })); @@ -1266,13 +1266,13 @@ describe('ToolTrajectoryEvaluator', () => { }, ]; - const config: ToolTrajectoryEvaluatorConfig = { + const config: ToolTrajectoryGraderConfig = { name: 'test', type: 'tool-trajectory', mode: 'exact', expected: [{ tool: 'Read', maxDurationMs: 100 }], }; - const evaluator = new ToolTrajectoryEvaluator({ config }); + const evaluator = new ToolTrajectoryGrader({ config }); const result = evaluator.evaluate(createContext({ output })); @@ -1296,13 +1296,13 @@ describe('ToolTrajectoryEvaluator', () => { }, ]; - const config: ToolTrajectoryEvaluatorConfig = { + const config: ToolTrajectoryGraderConfig = { name: 'test', type: 'tool-trajectory', 
mode: 'exact', expected: [{ tool: 'Read', args: { file_path: 'config.json' }, maxDurationMs: 100 }], }; - const evaluator = new ToolTrajectoryEvaluator({ config }); + const evaluator = new ToolTrajectoryGrader({ config }); const result = evaluator.evaluate(createContext({ output })); @@ -1326,7 +1326,7 @@ describe('ToolTrajectoryEvaluator', () => { }, ]; - const config: ToolTrajectoryEvaluatorConfig = { + const config: ToolTrajectoryGraderConfig = { name: 'test', type: 'tool-trajectory', mode: 'superset', @@ -1335,7 +1335,7 @@ describe('ToolTrajectoryEvaluator', () => { { tool: 'search' }, // Needs a second search call ], }; - const evaluator = new ToolTrajectoryEvaluator({ config }); + const evaluator = new ToolTrajectoryGrader({ config }); const result = evaluator.evaluate(createContext({ output })); @@ -1351,13 +1351,13 @@ describe('ToolTrajectoryEvaluator', () => { }, ]; - const config: ToolTrajectoryEvaluatorConfig = { + const config: ToolTrajectoryGraderConfig = { name: 'test', type: 'tool-trajectory', mode: 'superset', expected: [], }; - const evaluator = new ToolTrajectoryEvaluator({ config }); + const evaluator = new ToolTrajectoryGrader({ config }); const result = evaluator.evaluate(createContext({ output })); @@ -1373,14 +1373,14 @@ describe('ToolTrajectoryEvaluator', () => { }, ]; - const config: ToolTrajectoryEvaluatorConfig = { + const config: ToolTrajectoryGraderConfig = { name: 'test', type: 'tool-trajectory', mode: 'superset', argsMatch: 'superset', expected: [{ tool: 'search', args: { query: 'test' } }], }; - const evaluator = new ToolTrajectoryEvaluator({ config }); + const evaluator = new ToolTrajectoryGrader({ config }); const result = evaluator.evaluate(createContext({ output })); @@ -1402,7 +1402,7 @@ describe('ToolTrajectoryEvaluator', () => { }, ]; - const config: ToolTrajectoryEvaluatorConfig = { + const config: ToolTrajectoryGraderConfig = { name: 'test', type: 'tool-trajectory', mode: 'subset', @@ -1410,7 +1410,7 @@ describe('ToolTrajectoryEvaluator', () => { { tool: 'read' }, // Single expected item allows any number of reads ], }; - const evaluator = new ToolTrajectoryEvaluator({ config }); + const evaluator = new ToolTrajectoryGrader({ config }); const result = evaluator.evaluate(createContext({ output })); @@ -1426,13 +1426,13 @@ describe('ToolTrajectoryEvaluator', () => { }, ]; - const config: ToolTrajectoryEvaluatorConfig = { + const config: ToolTrajectoryGraderConfig = { name: 'test', type: 'tool-trajectory', mode: 'subset', expected: [{ tool: 'read' }, { tool: 'search' }], }; - const evaluator = new ToolTrajectoryEvaluator({ config }); + const evaluator = new ToolTrajectoryGrader({ config }); const result = evaluator.evaluate(createContext({ output })); @@ -1448,13 +1448,13 @@ describe('ToolTrajectoryEvaluator', () => { }, ]; - const config: ToolTrajectoryEvaluatorConfig = { + const config: ToolTrajectoryGraderConfig = { name: 'test', type: 'tool-trajectory', mode: 'subset', expected: [], }; - const evaluator = new ToolTrajectoryEvaluator({ config }); + const evaluator = new ToolTrajectoryGrader({ config }); const result = evaluator.evaluate(createContext({ output })); @@ -1470,13 +1470,13 @@ describe('ToolTrajectoryEvaluator', () => { }, ]; - const config: ToolTrajectoryEvaluatorConfig = { + const config: ToolTrajectoryGraderConfig = { name: 'test', type: 'tool-trajectory', mode: 'subset', expected: [{ tool: 'search', args: { query: 'test' }, argsMatch: 'superset' }], }; - const evaluator = new ToolTrajectoryEvaluator({ config }); + const evaluator = 
new ToolTrajectoryGrader({ config }); const result = evaluator.evaluate(createContext({ output })); @@ -1492,13 +1492,13 @@ describe('ToolTrajectoryEvaluator', () => { }, ]; - const config: ToolTrajectoryEvaluatorConfig = { + const config: ToolTrajectoryGraderConfig = { name: 'test', type: 'tool-trajectory', mode: 'subset', expected: [{ tool: 'search', args: { query: 'test' } }], }; - const evaluator = new ToolTrajectoryEvaluator({ config }); + const evaluator = new ToolTrajectoryGrader({ config }); const result = evaluator.evaluate(createContext({ output })); diff --git a/packages/eval/README.md b/packages/eval/README.md index 120c1276e..2db9802ac 100644 --- a/packages/eval/README.md +++ b/packages/eval/README.md @@ -1,6 +1,6 @@ # @agentv/eval -Evaluation SDK for AgentV - build custom evaluators with zero boilerplate. +Evaluation SDK for AgentV - build custom graders with zero boilerplate. ## Installation @@ -41,12 +41,12 @@ Both functions handle stdin/stdout parsing, snake_case conversion, Zod validatio ## Exports - `defineAssertion(handler)` - Define a custom assertion (pass/fail + optional score) -- `defineCodeGrader(handler)` - Define a code grader evaluator (full score control) +- `defineCodeGrader(handler)` - Define a code grader (full score control) - `definePromptTemplate(handler)` - Define a dynamic prompt template - `AssertionContext`, `AssertionScore` - Assertion types - `CodeGraderInput`, `CodeGraderResult` - Code grader types - `TraceSummary`, `Message`, `ToolCall` - Trace data types -- `createTargetClient()` - LLM target proxy for evaluators +- `createTargetClient()` - LLM target proxy for graders - `z` - Re-exported Zod for custom config schemas ## Documentation @@ -57,7 +57,7 @@ For complete documentation including: - Execution metrics usage - Best practices -See the [Custom Evaluators Guide](../../plugins/agentv-dev/skills/agentv-eval-writer/references/custom-evaluators.md) or run AgentV's `/agentv-eval-builder` skill. +See the [Custom Graders Guide](../../plugins/agentv-dev/skills/agentv-eval-writer/references/custom-graders.md) or run AgentV's `/agentv-eval-builder` skill. ## Repository diff --git a/packages/eval/src/runtime.ts b/packages/eval/src/runtime.ts index 42099dce6..4a404008f 100644 --- a/packages/eval/src/runtime.ts +++ b/packages/eval/src/runtime.ts @@ -106,9 +106,3 @@ export async function runCodeGrader(handler: CodeGraderHandler): Promise { process.exit(1); } } - -// ── Backward-compat aliases (deprecated) ──────────────────────────────────────── -/** @deprecated Use CodeGraderHandler */ -export type CodeJudgeHandler = CodeGraderHandler; -/** @deprecated Use runCodeGrader */ -export const runCodeJudge = runCodeGrader; diff --git a/packages/eval/src/target-client.ts b/packages/eval/src/target-client.ts index 8a5f633fd..e2922c555 100644 --- a/packages/eval/src/target-client.ts +++ b/packages/eval/src/target-client.ts @@ -116,7 +116,7 @@ export class TargetInvocationError extends Error { * * const response = await target.invoke({ * question: `Is this answer correct? Question: ${question}, Expected: ${criteria}`, - * systemPrompt: 'You are an expert evaluator. Respond with JSON: { "correct": true/false }' + * systemPrompt: 'You are an expert grader. Respond with JSON: { "correct": true/false }' * }); * * const result = JSON.parse(response.rawText ?? 
'{}'); diff --git a/plugins/agentv-dev/skills/agentv-bench/SKILL.md b/plugins/agentv-dev/skills/agentv-bench/SKILL.md index 522cd26d0..5ee0657b5 100644 --- a/plugins/agentv-dev/skills/agentv-bench/SKILL.md +++ b/plugins/agentv-dev/skills/agentv-bench/SKILL.md @@ -33,7 +33,7 @@ After the agent is working well, you can also run description optimization to im This skill is used by people across a wide range of familiarity with evaluation tooling. Pay attention to context cues: - "evaluation" and "benchmark" are borderline but OK in most cases -- For "YAML", "evaluator", "assertion", "deterministic judge" — see serious cues from the user that they know what those mean before using them without explanation +- For "YAML", "grader", "assertion", "deterministic judge" — look for clear cues that the user knows what these mean before using them without explanation - Briefly explain terms if in doubt When presenting results, default to summary tables. Offer detail on request. In CI/headless mode, skip interactive prompts and exit with status codes. @@ -48,7 +48,7 @@ Before running or optimizing, understand what you're working with. 2. **Identify success criteria** — what does "good" look like for this agent? What are the edge cases? What would a failure look like? Talk to the user if this isn't clear from the artifacts alone. -3. **Understand the target harness** — which provider runs the agent (Claude, GPT, Copilot CLI, Gemini, custom CLI)? This affects what evaluator types are available and how to run tests. Targets are configured in `.agentv/targets.yaml` (canonical location, searched from the eval file directory upward). Sensitive values like `api_key` must use `${{ ENV_VAR }}` syntax — literal secrets are rejected as a security guardrail. +3. **Understand the target harness** — which provider runs the agent (Claude, GPT, Copilot CLI, Gemini, custom CLI)? This affects what grader types are available and how to run tests. Targets are configured in `.agentv/targets.yaml` (canonical location, searched from the eval file directory upward). Sensitive values like `api_key` must use `${{ ENV_VAR }}` syntax — literal secrets are rejected as a security guardrail. 4. **Challenge assumptions** — if evals already exist, review their quality before running: - Are the test cases testing the right things? @@ -56,7 +56,7 @@ Before running or optimizing, understand what you're working with. - Are there ambiguous or contradictory test cases? - Flag eval issues before proceeding — running bad evals wastes time. -5. **Check integrity** — ensure task prompts (what the agent receives) are not also used as evaluator prompts (how outputs are scored). If a prompt file appears in both locations, note the overlap and optimize only for the task purpose. +5. **Check integrity** — ensure task prompts (what the agent receives) are not also used as grader prompts (how outputs are scored). If a prompt file appears in both locations, note the overlap and optimize only for the task purpose.
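As a concrete illustration of the secret guardrail in step 3, a validation pass of roughly this shape would reject literal secrets while allowing `${{ ENV_VAR }}` references. This is a minimal sketch — the helper name, key list, and error format are assumptions, not AgentV's actual implementation:

```typescript
// Hypothetical sketch of the targets.yaml secret guardrail: allow only
// ${{ ENV_VAR }} references for sensitive keys, reject literal values.
const SENSITIVE_KEYS = new Set(['api_key', 'token', 'secret']);
const ENV_REF = /^\$\{\{\s*[A-Z_][A-Z0-9_]*\s*\}\}$/;

function checkTargetSecrets(target: Record<string, unknown>): string[] {
  const errors: string[] = [];
  for (const [key, value] of Object.entries(target)) {
    if (SENSITIVE_KEYS.has(key) && typeof value === 'string' && !ENV_REF.test(value)) {
      // A literal secret was committed to config — reject it.
      errors.push(`'${key}' must use \${{ ENV_VAR }} syntax, not a literal value`);
    }
  }
  return errors;
}
```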
--- @@ -89,7 +89,7 @@ Multi-skill evaluation is handled naturally via input messages — describe the **evals.json** (skill-creator compatible) — auto-promoted to EVAL-equivalent format: - `prompt` → input messages - `expected_output` → reference answer -- `assertions` → evaluators +- `assertions` → graders - `files[]` paths resolved relative to the evals.json location ```json @@ -112,9 +112,9 @@ Start with 2-3 realistic test cases — the kind of thing a real user would actu Good assertions are objectively verifiable and have descriptive names. Subjective quality ("the output is good") is better evaluated qualitatively — don't force assertions onto things that need human judgment. -**Evaluator types** (cheapest to most expensive): `exact`, `contains`, `regex`, `is-json`, `field-accuracy`, `composite`, `code-grader`, `tool-trajectory`, `llm-grader`. See `references/eval-yaml-spec.md` for full config and grading recipes for each type. +**Grader types** (cheapest to most expensive): `exact`, `contains`, `regex`, `is-json`, `field-accuracy`, `composite`, `code-grader`, `tool-trajectory`, `llm-grader`. See `references/eval-yaml-spec.md` for full config and grading recipes for each type. -Prefer deterministic evaluators over LLM graders whenever possible. If an assertion can be checked with `contains` or `regex`, don't use `llm-grader`. +Prefer deterministic graders over LLM graders whenever possible. If an assertion can be checked with `contains` or `regex`, don't use `llm-grader`. --- @@ -162,7 +162,7 @@ agentv eval --target claude --target gpt --target copilot - **Improving existing**: snapshot the current version before editing (`cp -r /prompt-snapshot/`), use as baseline throughout - **Multi-target**: each target is its own baseline — no need for a separate "without" run -### While runs are in progress, draft evaluators +### While runs are in progress, draft graders Don't just wait for runs to finish — use this time productively. If assertions don't exist yet, draft them now. If they exist, review them and explain what they check to the user. @@ -311,9 +311,9 @@ This is the heart of the loop. You've run the test cases, analyzed the results, ### Evaluation integrity -**Critical**: Only optimize **task prompts** (what the agent receives), never **judge prompts** (how evaluators score outputs). Modifying judge prompts games the evaluation without improving the agent. +**Critical**: Only optimize **task prompts** (what the agent receives), never **judge prompts** (how graders score outputs). Modifying judge prompts games the evaluation without improving the agent. -If a prompt file is referenced in both task input and evaluator configs, optimize for the task purpose only. Document which prompts were modified in the optimization log. +If a prompt file is referenced in both task input and grader configs, optimize for the task purpose only. Document which prompts were modified in the optimization log. 
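To make the integrity check concrete, a helper along these lines (hypothetical — AgentV does not ship this function) would flag prompt files referenced both as task input and in grader configs:

```typescript
// Hypothetical overlap check: given file paths collected from an eval file,
// return any prompt files used both as task input and as grader prompts.
interface PromptUsage {
  inputFiles: string[]; // files referenced by test inputs (task prompts)
  graderFiles: string[]; // files referenced by grader/judge configs
}

function findPromptOverlap(usage: PromptUsage): string[] {
  const graderSet = new Set(usage.graderFiles);
  return usage.inputFiles.filter((file) => graderSet.has(file));
}
```

Any path this returns should be optimized only for its task purpose, per the rule above.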
### The iteration loop diff --git a/plugins/agentv-dev/skills/agentv-bench/agents/analyzer.md b/plugins/agentv-dev/skills/agentv-bench/agents/analyzer.md index d1e363eaf..9f32dab7d 100644 --- a/plugins/agentv-dev/skills/agentv-bench/agents/analyzer.md +++ b/plugins/agentv-dev/skills/agentv-bench/agents/analyzer.md @@ -2,7 +2,7 @@ name: analyzer description: >- Analyze AgentV evaluation results to identify weak assertions, suggest deterministic - upgrades for LLM-grader evaluators, flag cost/quality improvements, and surface + upgrades for LLM graders, flag cost/quality improvements, and surface cross-run benchmark patterns. Use when reviewing eval quality, improving evaluation configs, or triaging flaky/expensive evaluations. model: inherit @@ -22,18 +22,18 @@ You are an eval-quality analyst for AgentV. Your job is to read JSONL evaluation Read every line of the JSONL results file. Each line is a JSON object with: - `test_id`, `suite`, `score`, `assertions`, `reasoning`, `target` -- `scores` (optional): Array of per-evaluator breakdowns with `name`, `type`, `score`, `weight`, `verdict`, `assertions`, `reasoning` +- `scores` (optional): Array of per-grader breakdowns with `name`, `type`, `score`, `weight`, `verdict`, `assertions`, `reasoning` -If `eval-path` is provided, also read the EVAL.yaml to understand evaluator configurations. +If `eval-path` is provided, also read the EVAL.yaml to understand grader configurations. ### Step 2: Deterministic-Upgrade Analysis -For each evaluator entry in `scores` where `type` is `"llm-grader"` or `"rubrics"`, inspect the `reasoning` and `assertions` fields for patterns that indicate a deterministic assertion would suffice: +For each grader entry in `scores` where `type` is `"llm-grader"` or `"rubrics"`, inspect the `reasoning` and `assertions` fields for patterns that indicate a deterministic assertion would suffice: | Signal | Detection | Suggested Upgrade | |--------|-----------|-------------------| | Reasoning cites exact substring match | Reasoning contains phrases like "contains", "includes the text", "mentions [quoted string]" | `type: contains` with `value: ""` | -| Score is always 0.0 or 1.0 across all test cases for this evaluator | Collect scores per evaluator name; if all are binary | `type: equals` or deterministic check — LLM is doing binary work | +| Score is always 0.0 or 1.0 across all test cases for this grader | Collect scores per grader name; if all are binary | `type: equals` or deterministic check — LLM is doing binary work | | Reasoning references JSON validity | "valid JSON", "parseable JSON", "well-formed JSON" | `type: is-json` | | Reasoning references format compliance | "starts with", "begins with", "output starts with [string]" | `type: regex` with `value: "^"` | | Reasoning references ending pattern | "ends with", "output ends with" | `type: regex` with `value: "$"` | @@ -60,23 +60,23 @@ Scan the EVAL.yaml `assertions` entries (if `eval-path` provided) and the `reaso ### Step 4: Cost/Quality Signals -Flag evaluators that are expensive relative to their value: +Flag graders that are expensive relative to their value: | Signal | Detection | Suggestion | |--------|-----------|------------| -| Expensive binary check | LLM-grader evaluator where score is always 0.0 or 1.0 | Replace with deterministic assertion (zero LLM cost) | +| Expensive binary check | LLM-grader where score is always 0.0 or 1.0 | Replace with deterministic assertion (zero LLM cost) | | High-confidence deterministic candidate | LLM-grader 
reasoning or assertions always cite the same substring/pattern | Replace with `contains`/`regex` (zero LLM cost) | -| Redundant evaluators | Two evaluators on the same test with identical scores and similar reasoning | Merge or remove the redundant one | -| Always-pass evaluator | Evaluator scores 1.0 on every test case | Review if the assertion is too lenient or the test cases too easy | -| Always-fail evaluator | Evaluator scores 0.0 on every test case | Review if the assertion is misconfigured or the criteria unrealistic | +| Redundant graders | Two graders on the same test with identical scores and similar reasoning | Merge or remove the redundant one | +| Always-pass grader | Grader scores 1.0 on every test case | Review if the assertion is too lenient or the test cases too easy | +| Always-fail grader | Grader scores 0.0 on every test case | Review if the assertion is misconfigured or the criteria unrealistic | ### Step 5: Multi-Provider Analysis If results contain multiple `target` values: -- Compare scores per evaluator across targets -- Flag evaluators with high variance across providers (> 0.3 score difference) — may indicate provider-sensitive assertions -- Identify evaluators that pass for all providers (potentially too lenient) or fail for all (potentially misconfigured) +- Compare scores per grader across targets +- Flag graders with high variance across providers (> 0.3 score difference) — may indicate provider-sensitive assertions +- Identify graders that pass for all providers (potentially too lenient) or fail for all (potentially misconfigured) ## Output Format @@ -87,32 +87,32 @@ Produce a structured report in this exact format: **Results file:** **Test cases analyzed:** -**Evaluator entries analyzed:** +**Grader entries analyzed:** **Targets:** ### Deterministic-Upgrade Candidates -| # | Test ID | Evaluator | Current Type | Evidence | Suggested Type | Suggested Config | +| # | Test ID | Grader | Current Type | Evidence | Suggested Type | Suggested Config | |---|---------|-----------|-------------|----------|----------------|-----------------| -| 1 | | | llm-grader | | contains | `value: "exact string"` | +| 1 | | | llm-grader | | contains | `value: "exact string"` | ### Weak Assertions -| # | Test ID | Evaluator | Weakness | Current | Suggested Improvement | +| # | Test ID | Grader | Weakness | Current | Suggested Improvement | |---|---------|-----------|----------|---------|----------------------| -| 1 | | | Vague criteria | "Response is good" | Add specific criteria: what makes it "good"? | +| 1 | | | Vague criteria | "Response is good" | Add specific criteria: what makes it "good"? 
| ### Cost/Quality Flags -| # | Test ID | Evaluator | Flag | Detail | Suggestion | +| # | Test ID | Grader | Flag | Detail | Suggestion | |---|---------|-----------|------|--------|------------| -| 1 | | | Always-pass | Score 1.0 on 15/15 tests | Tighten criteria or add harder test cases | +| 1 | | | Always-pass | Score 1.0 on 15/15 tests | Tighten criteria or add harder test cases | ### Summary -- **Deterministic upgrades:** evaluators could be replaced with cheaper deterministic checks +- **Deterministic upgrades:** graders could be replaced with cheaper deterministic checks - **Weak assertions:** assertions need strengthening -- **Cost flags:** evaluators flagged for cost/quality review +- **Cost flags:** graders flagged for cost/quality review - **Estimated savings:** Replacing LLM-grader calls with deterministic checks ``` @@ -120,10 +120,10 @@ If a section has no findings, include the header with "None found." underneath. ## Guidelines -- **Be specific:** Every suggestion must include the test case ID, evaluator name, evidence from the results, and a concrete replacement config. +- **Be specific:** Every suggestion must include the test case ID, grader name, evidence from the results, and a concrete replacement config. - **Be conservative:** Only suggest deterministic upgrades when the pattern is clear and consistent. Partial or ambiguous evidence should be noted but not acted on. - **Prioritize by impact:** Order suggestions by estimated cost savings (`llm-grader` → deterministic saves the most). -- **Handle all evaluator types:** Process `code-grader`, `tool-trajectory`, `llm-grader`, `rubrics`, `composite`, and all deterministic types. Only LLM-based types are candidates for deterministic upgrades. +- **Handle all grader types:** Process `code-grader`, `tool-trajectory`, `llm-grader`, `rubrics`, `composite`, and all deterministic types. Only LLM-based types are candidates for deterministic upgrades. - **Multi-provider awareness:** When results span multiple targets, note if a suggestion applies to all targets or is target-specific. - **No false positives:** It is better to miss a suggestion than to recommend an incorrect upgrade. If unsure, add the finding to a "Needs Review" subsection with your reasoning. diff --git a/plugins/agentv-dev/skills/agentv-bench/agents/comparator.md b/plugins/agentv-dev/skills/agentv-bench/agents/comparator.md index 943e2d3e0..bc840ff30 100644 --- a/plugins/agentv-dev/skills/agentv-bench/agents/comparator.md +++ b/plugins/agentv-dev/skills/agentv-bench/agents/comparator.md @@ -25,7 +25,7 @@ You will receive: - `outputs`: Array of evaluation outputs to compare. Each contains: - `target_id`: The provider/configuration identifier (DO NOT read this during scoring) - `answer`: The candidate response text - - `evaluator_results`: Array of evaluator scores and details (code-grader, tool-trajectory, llm-grader, deterministic) + - `evaluator_results`: Array of grader scores and details (code-grader, tool-trajectory, llm-grader, deterministic) - `workspace_changes`: File changes made during workspace evaluation (if applicable) - `tool_calls`: Tool invocations and results from multi-turn conversations (if applicable) - `conversation`: Full multi-turn conversation history (if applicable) @@ -61,7 +61,7 @@ Assign random labels to outputs. Use the following procedure: ### Phase 2: Dynamic Rubric Generation -Generate task-specific rubrics based on `task_context` and the evaluator types present. 
The rubric has two dimensions: +Generate task-specific rubrics based on `task_context` and the grader types present. The rubric has two dimensions: **Content Rubric** — adapts criteria to the task type: @@ -89,7 +89,7 @@ For each content criterion, define: | Format compliance | 0.2 | Adherence to requested output format (JSON, markdown, code blocks) | | Completeness | 0.2 | All requested sections present, no truncation | -**Evaluator-Specific Scoring** — when evaluator results are present: +**Grader-Specific Scoring** — when grader results are present: - **code-grader**: Factor in pass/fail results, test coverage, assertion hit rates - **tool-trajectory**: Factor in tool call accuracy, sequence correctness, unnecessary tool calls @@ -102,10 +102,10 @@ For each labeled output (A, B, C, ...): 1. **Content score** (1–10): Apply the content rubric criteria with weights 2. **Structure score** (1–10): Apply the structure rubric criteria with weights -3. **Evaluator score** (1–10): Normalize evaluator results to a 1–10 scale. If no evaluator results, omit this dimension. +3. **Grader score** (1–10): Normalize grader results to a 1–10 scale. If no grader results, omit this dimension. 4. **Overall score**: Weighted combination: - - If evaluator results present: `0.5 × content + 0.2 × structure + 0.3 × evaluator` - - If no evaluator results: `0.7 × content + 0.3 × structure` + - If grader results present: `0.5 × content + 0.2 × structure + 0.3 × grader` + - If no grader results: `0.7 × content + 0.3 × structure` For N > 2 outputs, use **round-robin pairwise comparison** to establish ranking: - Compare every pair (A vs B, A vs C, B vs C, ...) @@ -162,7 +162,7 @@ Write the comparison results to `results_file` as JSON: "overall_weights": { "content": , "structure": , - "evaluator": + "grader": } }, "results": [ @@ -172,7 +172,7 @@ Write the comparison results to `results_file` as JSON: "scores": { "content": <1-10>, "structure": <1-10>, - "evaluator": <1-10 or null>, + "grader": <1-10 or null>, "overall": <1-10> }, "content_breakdown": [ @@ -215,7 +215,7 @@ Also produce a human-readable markdown summary: ### Rankings -| Rank | Label | Target | Overall | Content | Structure | Evaluator | +| Rank | Label | Target | Overall | Content | Structure | Grader | |------|-------|--------|---------|---------|-----------|-----------| | 1 | A | | 8.5 | 9.0 | 7.5 | 8.5 | @@ -236,12 +236,12 @@ Also produce a human-readable markdown summary: - **Be evidence-based**: Every score must cite specific evidence from the output. - **Evaluate substance over style**: Correct, complete answers with rough formatting score higher than polished but incorrect answers. - **Handle missing data gracefully**: If an output lacks workspace changes or tool calls but others have them, score what is present — do not penalize for data the target wasn't expected to produce. -- **Respect evaluator signals**: When code-grader or tool-trajectory results exist, they represent objective ground truth. Weight these heavily. +- **Respect grader signals**: When code-grader or tool-trajectory results exist, they represent objective ground truth. Weight these heavily. ## Edge Cases - **Identical outputs**: If two outputs are effectively identical, score them equally and note the duplication. - **Single output**: If only one output is provided, still generate the rubric and score it — this serves as a baseline for future comparisons. 
-- **Missing evaluator results**: If some outputs have evaluator results and others don't, score evaluator dimension only for those that have it. Adjust overall weights accordingly. +- **Missing grader results**: If some outputs have grader results and others don't, score grader dimension only for those that have it. Adjust overall weights accordingly. - **Very long outputs**: Focus scoring on substance and correctness. Length alone is neither a positive nor negative signal. - **Tie in overall scores**: Use pairwise comparison wins as tiebreaker. If still tied, declare a tie and explain the tradeoffs. diff --git a/plugins/agentv-dev/skills/agentv-bench/references/description-optimization.md b/plugins/agentv-dev/skills/agentv-bench/references/description-optimization.md index 1a0134cba..6e90abcc2 100644 --- a/plugins/agentv-dev/skills/agentv-bench/references/description-optimization.md +++ b/plugins/agentv-dev/skills/agentv-bench/references/description-optimization.md @@ -6,7 +6,7 @@ core workflow step. **Provider compatibility**: Description optimization applies to any agent platform with skill-discovery mechanisms — Claude Code, Codex (`.agents/` or `.codex/` folders), Copilot, -and others. The `skill-trigger` evaluator checks whether the agent invoked the right skill, +and others. The `skill-trigger` grader checks whether the agent invoked the right skill, regardless of how discovery works on that platform. ## Step 1: Generate Trigger EVAL.yaml diff --git a/plugins/agentv-dev/skills/agentv-bench/references/environment-adaptation.md b/plugins/agentv-dev/skills/agentv-bench/references/environment-adaptation.md index bcd522cd8..219c9d5f1 100644 --- a/plugins/agentv-dev/skills/agentv-bench/references/environment-adaptation.md +++ b/plugins/agentv-dev/skills/agentv-bench/references/environment-adaptation.md @@ -28,7 +28,7 @@ to any platform with skill-discovery mechanisms. All listed providers support sk ## Unsupported Providers: Use a Code-Grader -The built-in `skill-trigger` evaluator covers Claude, Copilot, Pi, Codex and VS Code out +The built-in `skill-trigger` grader covers Claude, Copilot, Pi, Codex and VS Code out of the box. For providers with different tool-call formats, write a code-grader that inspects the agent's tool call trace. diff --git a/plugins/agentv-dev/skills/agentv-bench/references/eval-yaml-spec.md b/plugins/agentv-dev/skills/agentv-bench/references/eval-yaml-spec.md index a8a3c632b..65b5a7a38 100644 --- a/plugins/agentv-dev/skills/agentv-bench/references/eval-yaml-spec.md +++ b/plugins/agentv-dev/skills/agentv-bench/references/eval-yaml-spec.md @@ -19,7 +19,7 @@ The grader agent uses this to evaluate assertions without the CLI. - `input` (string | Message[], required) — task input. String shorthand expands to `[{role: user, content: "..."}]` - `expected_output` (string | Message[], optional) — reference answer. String shorthand expands to `[{role: assistant, content: "..."}]` - `criteria` (string, optional) — human-readable success criteria -- `assertions` (array, optional) — evaluator assertions +- `assertions` (array, optional) — grader assertions - `conversation_id` (string, optional) — groups related tests - `execution` (object, optional) — per-test execution override @@ -224,7 +224,7 @@ Each line in the results JSONL file is an `EvaluationResult` object. 
In JSONL, f ### Optional fields -- `scores` (array of EvaluatorResult) — per-evaluator breakdown +- `scores` (array of EvaluatorResult) — per-grader breakdown - `input` (Message[]) — input messages - `token_usage` (object: `{prompt_tokens, completion_tokens, total_tokens}`) - `cost_usd` (number) @@ -237,8 +237,8 @@ Each line in the results JSONL file is an `EvaluationResult` object. In JSONL, f ### `scores[]` entries (EvaluatorResult) -- `name` (string) — evaluator name -- `type` (string) — evaluator kind (kebab-case) +- `name` (string) — grader name +- `type` (string) — grader kind (kebab-case) - `score` (number, 0.0-1.0) - `assertions` (array of `{text, passed, evidence?}`) - `weight` (number, optional) @@ -324,7 +324,7 @@ LLM grader results are read from disk at `/llm_grader_results/.js **Output:** - `/grading.json` — merged grading with `graders`, `assertions`, `summary.pass_rate` -- `index.jsonl` — one JSON line per test: `{test_id, score, pass, evaluators: [...]}` +- `index.jsonl` — one JSON line per test: `{test_id, score, pass, graders: [...]}` - `benchmark.json` — aggregate stats: `{metadata: {targets}, run_summary: {: {mean, stddev, n}}}` ### Agent-Mode Workflow diff --git a/plugins/agentv-dev/skills/agentv-bench/references/migrating-from-skill-creator.md b/plugins/agentv-dev/skills/agentv-bench/references/migrating-from-skill-creator.md index 2231efbf6..8d4e841a4 100644 --- a/plugins/agentv-dev/skills/agentv-bench/references/migrating-from-skill-creator.md +++ b/plugins/agentv-dev/skills/agentv-bench/references/migrating-from-skill-creator.md @@ -17,7 +17,7 @@ agentv eval assert --agent-output "..." --agent-input "..." AgentV automatically: - Promotes `prompt` → input messages - Promotes `expected_output` → reference answer -- Converts `assertions` → LLM-grader evaluators +- Converts `assertions` → LLM-grader graders - Resolves `files[]` paths relative to the evals.json directory If you're using the `agentv-bench` skill, it orchestrates these same AgentV commands. Code graders, grading, and artifact generation remain in AgentV core; the skill just orchestrates and summarizes the existing outputs. 
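The per-grader `scores[]` breakdown documented above lends itself to quick offline analysis. A minimal sketch, assuming each JSONL line carries the documented `test_id`, `score`, and optional `scores` fields (schema validation and error handling omitted):

```typescript
import { readFileSync } from 'node:fs';

// Aggregate mean score per grader name from a results JSONL file.
interface GraderScore {
  name: string;
  type: string;
  score: number;
  weight?: number;
}
interface ResultLine {
  test_id: string;
  score: number;
  scores?: GraderScore[];
}

function meanScoreByGrader(jsonlPath: string): Map<string, number> {
  const totals = new Map<string, { sum: number; n: number }>();
  for (const line of readFileSync(jsonlPath, 'utf8').split('\n').filter(Boolean)) {
    const result = JSON.parse(line) as ResultLine;
    for (const entry of result.scores ?? []) {
      const t = totals.get(entry.name) ?? { sum: 0, n: 0 };
      totals.set(entry.name, { sum: t.sum + entry.score, n: t.n + 1 });
    }
  }
  return new Map([...totals].map(([name, t]) => [name, t.sum / t.n]));
}
```

A grader whose mean is pinned at 0.0 or 1.0 across every test is a candidate for the always-fail/always-pass review flags described earlier.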
@@ -29,7 +29,7 @@ Moving from skill-creator's eval loop to AgentV's lifecycle skill gives you: | Capability | skill-creator | AgentV lifecycle skill | |-----------|---------------|----------------------| | Workspace isolation | ❌ | ✅ Clone repos, run setup/teardown scripts | -| Code graders | ❌ | ✅ Python/TypeScript evaluator scripts via `defineCodeGrader()` | +| Code graders | ❌ | ✅ Python/TypeScript grader scripts via `defineCodeGrader()` | | Tool trajectory scoring | ❌ | ✅ Evaluate tool call sequences | | Multi-provider comparison | with-skill vs without-skill | N-way: Claude, GPT, Copilot, Gemini, custom CLI | | Multi-turn evaluation | ❌ | ✅ Conversation tracking with `conversation_id` | @@ -71,7 +71,7 @@ agentv eval eval.yaml EVAL.yaml unlocks: - **Workspace setup/teardown** — clone repos, install dependencies, clean up after tests -- **Code graders** — write evaluators in Python or TypeScript, not just LLM prompts +- **Code graders** — write graders in Python or TypeScript, not just LLM prompts - **Rubric-based grading** — multi-dimensional scoring with weighted criteria - **Retry policies** — automatic retries for flaky tests with configurable backoff - **Test groups** — organize tests by category with shared config diff --git a/plugins/agentv-dev/skills/agentv-eval-writer/SKILL.md b/plugins/agentv-dev/skills/agentv-eval-writer/SKILL.md index da43f12ed..19a0ff385 100644 --- a/plugins/agentv-dev/skills/agentv-eval-writer/SKILL.md +++ b/plugins/agentv-dev/skills/agentv-eval-writer/SKILL.md @@ -367,7 +367,7 @@ Configure via `assertions` array. Multiple graders produce a weighted average sc Contract: stdin JSON -> stdout JSON `{score, assertions: [{text, passed, evidence?}], reasoning}` Input includes: `question`, `criteria`, `answer`, `reference_answer`, `output`, `trace`, `token_usage`, `cost_usd`, `duration_ms`, `start_time`, `end_time`, `file_changes`, `workspace_path`, `config` When a workspace is configured, `workspace_path` is the absolute path to the workspace dir (also available as `AGENTV_WORKSPACE_PATH` env var). Use this for functional grading (e.g., running `npm test` in the workspace). -See docs at https://agentv.dev/evaluators/code-graders/ +See docs at https://agentv.dev/graders/code-graders/ ### llm_grader ```yaml @@ -507,7 +507,7 @@ LLM-judged structured evaluation with weighted criteria. Criteria items support ### rubrics (inline, deprecated) Top-level `rubrics:` field is deprecated. Use `type: rubrics` under `assertions` instead. -See `references/rubric-evaluator.md` for score-range mode and scoring formula. +See `references/rubric-grader.md` for score-range mode and scoring formula. ## Execution Error Tolerance diff --git a/plugins/agentv-dev/skills/agentv-eval-writer/references/rubric-evaluator.md b/plugins/agentv-dev/skills/agentv-eval-writer/references/rubric-evaluator.md index 6d3ece932..2520e9b60 100644 --- a/plugins/agentv-dev/skills/agentv-eval-writer/references/rubric-evaluator.md +++ b/plugins/agentv-dev/skills/agentv-eval-writer/references/rubric-evaluator.md @@ -1,4 +1,4 @@ -# Rubric Evaluator +# Rubric Grader Rubrics are defined as `assertions` entries with `type: rubrics`. They support binary checklist grading and score-range analytic grading. @@ -35,7 +35,7 @@ assertions: Equivalent to the full form with `type: rubrics`. Use the full form only when you need weights, `required: false`, or `score_ranges`. 
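For reference, a minimal code grader built with the SDK's `defineCodeGrader` follows the stdin/stdout contract described above. The check itself is illustrative, and the handler's input shape is assumed from the documented fields (`question`, `answer`, etc.):

```typescript
import { defineCodeGrader } from '@agentv/eval';

// Minimal code grader: pass when the answer contains no TODO placeholders.
// Output shape follows the documented contract:
// { score, assertions: [{ text, passed, evidence? }], reasoning }
defineCodeGrader(async (input) => {
  const answer = String(input.answer ?? '');
  const passed = !answer.includes('TODO');
  return {
    score: passed ? 1 : 0,
    assertions: [{ text: 'Answer contains no TODO placeholders', passed }],
    reasoning: passed
      ? 'No TODO markers found in the answer.'
      : 'Answer still contains TODO markers.',
  };
});
```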
-Mixed strings and objects are supported in `assertions` — strings are grouped into a single rubrics evaluator at the position of the first string:
+Mixed strings and objects are supported in `assertions` — strings are grouped into a single rubrics grader at the position of the first string:
 
 ```yaml
 assertions:
diff --git a/plugins/agentv-dev/skills/agentv-trace-analyst/SKILL.md b/plugins/agentv-dev/skills/agentv-trace-analyst/SKILL.md
index 5cfea496d..6205f85e0 100644
--- a/plugins/agentv-dev/skills/agentv-trace-analyst/SKILL.md
+++ b/plugins/agentv-dev/skills/agentv-trace-analyst/SKILL.md
@@ -61,7 +61,7 @@ For each failing test, examine:
 - **assertions (failed)**: What criteria were not met? (filter for `passed: false`)
 - **trace.tool_calls**: Did the agent use expected tools?
 - **duration_ms**: Did it time out or run too long?
-- **reasoning**: Why did the evaluator score it low?
+- **reasoning**: Why did the grader score it low?
 
 ### 4. Inspect specific tests

From 899e1337a30ca2b7751d39754470a4320640b484 Mon Sep 17 00:00:00 2001
From: Christopher
Date: Wed, 15 Apr 2026 21:24:02 +0000
Subject: [PATCH 2/3] docs: fix grader example links

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
---
 examples/README.md | 10 +++++-----
 .../evals/dataset.eval.yaml | 0
 .../README.md | 0
 .../evals/dataset.eval.baseline.jsonl | 0
 .../evals/dataset.eval.yaml | 0
 .../graders/assertions.ts | 0
 .../package.json | 2 +-
 .../evals/dataset.eval.yaml | 0
 .../README.md | 0
 .../evals/dataset.eval.baseline.jsonl | 0
 .../evals/dataset.eval.yaml | 0
 .../prompts/accuracy-check.md | 0
 .../prompts/clarity-check.md | 0
 .../prompts/completeness-check.md | 0
 .../prompts/correctness-check.md | 0
 .../prompts/experimental-check.md | 0
 .../prompts/quality-evaluation.md | 0
 .../prompts/safety-check.md | 0
 .../prompts/style-evaluation.md | 0
 .../EVAL.yaml | 0
 .../README.md | 0
 .../bun.lock | 2 +-
 .../conformance-check.ts | 8 ++++----
 .../fixtures.yaml | 0
 .../graders}/keyword-grader.ts | 0
 .../package.json | 2 +-
 packages/eval/README.md | 2 +-
 27 files changed, 13 insertions(+), 13 deletions(-)
 rename examples/features/{default-evaluators => default-graders}/evals/dataset.eval.yaml (100%)
 rename examples/features/{deterministic-evaluators => deterministic-graders}/README.md (100%)
 rename examples/features/{deterministic-evaluators => deterministic-graders}/evals/dataset.eval.baseline.jsonl (100%)
 rename examples/features/{deterministic-evaluators => deterministic-graders}/evals/dataset.eval.yaml (100%)
 rename examples/features/{deterministic-evaluators => deterministic-graders}/graders/assertions.ts (100%)
 rename examples/features/{deterministic-evaluators => deterministic-graders}/package.json (68%)
 rename examples/features/{threshold-evaluator => threshold-grader}/evals/dataset.eval.yaml (100%)
 rename examples/features/{weighted-evaluators => weighted-graders}/README.md (100%)
 rename examples/features/{weighted-evaluators => weighted-graders}/evals/dataset.eval.baseline.jsonl (100%)
 rename examples/features/{weighted-evaluators => weighted-graders}/evals/dataset.eval.yaml (100%)
 rename examples/features/{weighted-evaluators => weighted-graders}/prompts/accuracy-check.md (100%)
 rename examples/features/{weighted-evaluators => weighted-graders}/prompts/clarity-check.md (100%)
 rename examples/features/{weighted-evaluators => weighted-graders}/prompts/completeness-check.md (100%)
 rename examples/features/{weighted-evaluators => weighted-graders}/prompts/correctness-check.md (100%)
 rename examples/features/{weighted-evaluators => weighted-graders}/prompts/experimental-check.md (100%)
 rename examples/features/{weighted-evaluators => weighted-graders}/prompts/quality-evaluation.md (100%)
 rename examples/features/{weighted-evaluators => weighted-graders}/prompts/safety-check.md (100%)
 rename examples/features/{weighted-evaluators => weighted-graders}/prompts/style-evaluation.md (100%)
 rename examples/showcase/{evaluator-conformance => grader-conformance}/EVAL.yaml (100%)
 rename examples/showcase/{evaluator-conformance => grader-conformance}/README.md (100%)
 rename examples/showcase/{evaluator-conformance => grader-conformance}/bun.lock (92%)
 rename examples/showcase/{evaluator-conformance => grader-conformance}/conformance-check.ts (98%)
 rename examples/showcase/{evaluator-conformance => grader-conformance}/fixtures.yaml (100%)
 rename examples/showcase/{evaluator-conformance/evaluators => grader-conformance/graders}/keyword-grader.ts (100%)
 rename examples/showcase/{evaluator-conformance => grader-conformance}/package.json (73%)

diff --git a/examples/README.md b/examples/README.md
index 7479668c1..438b1ca88 100644
--- a/examples/README.md
+++ b/examples/README.md
@@ -24,7 +24,7 @@ Examples are organized into two categories:
 
 ```
 examples/
-├── features/ # Feature demonstrations (evaluators, metrics, SDK)
+├── features/ # Feature demonstrations (graders, metrics, SDK)
 └── showcase/ # Real-world use cases and end-to-end demos
 ```
 
@@ -38,15 +38,15 @@ Focused demonstrations of specific AgentV capabilities. Each example includes it
 - [rubric](features/rubric/) - Rubric-based evaluation
 - [tool-trajectory-simple](features/tool-trajectory-simple/) - Tool trajectory validation
 - [tool-trajectory-advanced](features/tool-trajectory-advanced/) - Advanced tool trajectory with expected_output
-- [composite](features/composite/) - Composite evaluator patterns
-- [weighted-evaluators](features/weighted-evaluators/) - Weighted evaluators
+- [composite](features/composite/) - Composite grader patterns
+- [weighted-graders](features/weighted-graders/) - Weighted graders
 - [execution-metrics](features/execution-metrics/) - Metrics tracking (tokens, cost, latency)
 - [code-grader-with-llm-calls](features/code-grader-with-llm-calls/) - Code graders with target proxy for LLM calls
 - [batch-cli](features/batch-cli/) - Batch CLI evaluation
 - [document-extraction](features/document-extraction/) - Document data extraction
 - [local-cli](features/local-cli/) - Local CLI targets
 - [compare](features/compare/) - Baseline comparison
-- [deterministic-evaluators](features/deterministic-evaluators/) - Deterministic assertions (contains, regex, JSON validation)
+- [deterministic-graders](features/deterministic-graders/) - Deterministic assertions (contains, regex, JSON validation)
 - [workspace-setup-script](features/workspace-setup-script/) - Multi-step workspace setup with `before_all` lifecycle hook
 
 ### SDK
@@ -78,7 +78,7 @@ Each example follows this structure:
 example-name/
 ├── evals/
 │   ├── dataset.eval.yaml # Primary eval file
-│   ├── *.ts or *.py # Code evaluators (optional)
+│   ├── *.ts or *.py # Code graders (optional)
 │   └── *.md # LLM grader prompts (optional)
 ├── scripts/ # Helper scripts (optional)
 ├── .agentv/
diff --git a/examples/features/default-evaluators/evals/dataset.eval.yaml b/examples/features/default-graders/evals/dataset.eval.yaml
similarity index 100%
rename from examples/features/default-evaluators/evals/dataset.eval.yaml
rename to examples/features/default-graders/evals/dataset.eval.yaml
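Aside (not part of the patch): the README hunk above documents the per-example layout. For orientation, a minimal `evals/dataset.eval.yaml` following that convention might look like the sketch below — the field names (`tests`, `prompt`, `expected_output`, `assertions`) are extrapolated from the promotion rules quoted earlier in this series, so treat them as assumptions and copy from a real example instead:

```yaml
# Hypothetical minimal eval file for a features/ example.
description: Demonstrates a single string assertion
tests:
  - id: hello-keywords
    prompt: "List three Python keywords"
    expected_output: "def, class, return"
    assertions:
      - "Names at least three valid Python keywords"
```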
diff --git a/examples/features/deterministic-evaluators/README.md b/examples/features/deterministic-graders/README.md
similarity index 100%
rename from examples/features/deterministic-evaluators/README.md
rename to examples/features/deterministic-graders/README.md
diff --git a/examples/features/deterministic-evaluators/evals/dataset.eval.baseline.jsonl b/examples/features/deterministic-graders/evals/dataset.eval.baseline.jsonl
similarity index 100%
rename from examples/features/deterministic-evaluators/evals/dataset.eval.baseline.jsonl
rename to examples/features/deterministic-graders/evals/dataset.eval.baseline.jsonl
diff --git a/examples/features/deterministic-evaluators/evals/dataset.eval.yaml b/examples/features/deterministic-graders/evals/dataset.eval.yaml
similarity index 100%
rename from examples/features/deterministic-evaluators/evals/dataset.eval.yaml
rename to examples/features/deterministic-graders/evals/dataset.eval.yaml
diff --git a/examples/features/deterministic-evaluators/graders/assertions.ts b/examples/features/deterministic-graders/graders/assertions.ts
similarity index 100%
rename from examples/features/deterministic-evaluators/graders/assertions.ts
rename to examples/features/deterministic-graders/graders/assertions.ts
diff --git a/examples/features/deterministic-evaluators/package.json b/examples/features/deterministic-graders/package.json
similarity index 68%
rename from examples/features/deterministic-evaluators/package.json
rename to examples/features/deterministic-graders/package.json
index bbb4ba4a7..d19f937df 100644
--- a/examples/features/deterministic-evaluators/package.json
+++ b/examples/features/deterministic-graders/package.json
@@ -1,5 +1,5 @@
 {
-  "name": "agentv-example-deterministic-evaluators",
+  "name": "agentv-example-deterministic-graders",
   "private": true,
   "type": "module",
   "dependencies": {
diff --git a/examples/features/threshold-evaluator/evals/dataset.eval.yaml b/examples/features/threshold-grader/evals/dataset.eval.yaml
similarity index 100%
rename from examples/features/threshold-evaluator/evals/dataset.eval.yaml
rename to examples/features/threshold-grader/evals/dataset.eval.yaml
diff --git a/examples/features/weighted-evaluators/README.md b/examples/features/weighted-graders/README.md
similarity index 100%
rename from examples/features/weighted-evaluators/README.md
rename to examples/features/weighted-graders/README.md
diff --git a/examples/features/weighted-evaluators/evals/dataset.eval.baseline.jsonl b/examples/features/weighted-graders/evals/dataset.eval.baseline.jsonl
similarity index 100%
rename from examples/features/weighted-evaluators/evals/dataset.eval.baseline.jsonl
rename to examples/features/weighted-graders/evals/dataset.eval.baseline.jsonl
diff --git a/examples/features/weighted-evaluators/evals/dataset.eval.yaml b/examples/features/weighted-graders/evals/dataset.eval.yaml
similarity index 100%
rename from examples/features/weighted-evaluators/evals/dataset.eval.yaml
rename to examples/features/weighted-graders/evals/dataset.eval.yaml
diff --git a/examples/features/weighted-evaluators/prompts/accuracy-check.md b/examples/features/weighted-graders/prompts/accuracy-check.md
similarity index 100%
rename from examples/features/weighted-evaluators/prompts/accuracy-check.md
rename to examples/features/weighted-graders/prompts/accuracy-check.md
diff --git a/examples/features/weighted-evaluators/prompts/clarity-check.md b/examples/features/weighted-graders/prompts/clarity-check.md
similarity index 100%
rename from examples/features/weighted-evaluators/prompts/clarity-check.md
rename to examples/features/weighted-graders/prompts/clarity-check.md
diff --git a/examples/features/weighted-evaluators/prompts/completeness-check.md b/examples/features/weighted-graders/prompts/completeness-check.md
similarity index 100%
rename from examples/features/weighted-evaluators/prompts/completeness-check.md
rename to examples/features/weighted-graders/prompts/completeness-check.md
diff --git a/examples/features/weighted-evaluators/prompts/correctness-check.md b/examples/features/weighted-graders/prompts/correctness-check.md
similarity index 100%
rename from examples/features/weighted-evaluators/prompts/correctness-check.md
rename to examples/features/weighted-graders/prompts/correctness-check.md
diff --git a/examples/features/weighted-evaluators/prompts/experimental-check.md b/examples/features/weighted-graders/prompts/experimental-check.md
similarity index 100%
rename from examples/features/weighted-evaluators/prompts/experimental-check.md
rename to examples/features/weighted-graders/prompts/experimental-check.md
diff --git a/examples/features/weighted-evaluators/prompts/quality-evaluation.md b/examples/features/weighted-graders/prompts/quality-evaluation.md
similarity index 100%
rename from examples/features/weighted-evaluators/prompts/quality-evaluation.md
rename to examples/features/weighted-graders/prompts/quality-evaluation.md
diff --git a/examples/features/weighted-evaluators/prompts/safety-check.md b/examples/features/weighted-graders/prompts/safety-check.md
similarity index 100%
rename from examples/features/weighted-evaluators/prompts/safety-check.md
rename to examples/features/weighted-graders/prompts/safety-check.md
diff --git a/examples/features/weighted-evaluators/prompts/style-evaluation.md b/examples/features/weighted-graders/prompts/style-evaluation.md
similarity index 100%
rename from examples/features/weighted-evaluators/prompts/style-evaluation.md
rename to examples/features/weighted-graders/prompts/style-evaluation.md
diff --git a/examples/showcase/evaluator-conformance/EVAL.yaml b/examples/showcase/grader-conformance/EVAL.yaml
similarity index 100%
rename from examples/showcase/evaluator-conformance/EVAL.yaml
rename to examples/showcase/grader-conformance/EVAL.yaml
diff --git a/examples/showcase/evaluator-conformance/README.md b/examples/showcase/grader-conformance/README.md
similarity index 100%
rename from examples/showcase/evaluator-conformance/README.md
rename to examples/showcase/grader-conformance/README.md
diff --git a/examples/showcase/evaluator-conformance/bun.lock b/examples/showcase/grader-conformance/bun.lock
similarity index 92%
rename from examples/showcase/evaluator-conformance/bun.lock
rename to examples/showcase/grader-conformance/bun.lock
index c3eccdcce..105c2963d 100644
--- a/examples/showcase/evaluator-conformance/bun.lock
+++ b/examples/showcase/grader-conformance/bun.lock
@@ -3,7 +3,7 @@
   "configVersion": 1,
   "workspaces": {
     "": {
-      "name": "agentv-example-evaluator-conformance",
+      "name": "agentv-example-grader-conformance",
       "dependencies": {
         "@agentv/eval": "file:../../../packages/eval",
         "yaml": "^2.7.1",
diff --git a/examples/showcase/evaluator-conformance/conformance-check.ts b/examples/showcase/grader-conformance/conformance-check.ts
similarity index 98%
rename from examples/showcase/evaluator-conformance/conformance-check.ts
rename to examples/showcase/grader-conformance/conformance-check.ts
index e1adc3fb4..613f88aeb 100644
--- a/examples/showcase/evaluator-conformance/conformance-check.ts
+++ b/examples/showcase/grader-conformance/conformance-check.ts
@@ -45,7 +45,7 @@ interface AssertionEntry {
   evidence?: string;
 }
 
-interface EvaluatorResult {
+interface GraderResult {
   score: number;
   assertions?: AssertionEntry[];
   /** @deprecated use assertions */
@@ -112,7 +112,7 @@ function buildCodeGraderInput(fixture: Fixture): string {
   });
 }
 
-function runEvaluator(script: string[], input: string): Promise<EvaluatorResult> {
+function runGrader(script: string[], input: string): Promise<GraderResult> {
   return new Promise((resolve, reject) => {
     const proc = spawn(script[0], script.slice(1), {
       stdio: ['pipe', 'pipe', 'pipe'],
@@ -137,7 +137,7 @@
       }
       try {
         const result = JSON.parse(stdout);
-        resolve(result as EvaluatorResult);
+        resolve(result as GraderResult);
       } catch {
         reject(new Error(`Invalid JSON output: ${stdout}`));
       }
@@ -243,7 +243,7 @@ async function main(): Promise<void> {
 
     for (let i = 0; i < runs; i++) {
       try {
-        const result = await runEvaluator(grader.script, input);
+        const result = await runGrader(grader.script, input);
         const schemaErrors = validateResult(result);
         if (schemaErrors.length > 0) {
           compatible = false;
diff --git a/examples/showcase/evaluator-conformance/fixtures.yaml b/examples/showcase/grader-conformance/fixtures.yaml
similarity index 100%
rename from examples/showcase/evaluator-conformance/fixtures.yaml
rename to examples/showcase/grader-conformance/fixtures.yaml
diff --git a/examples/showcase/evaluator-conformance/evaluators/keyword-grader.ts b/examples/showcase/grader-conformance/graders/keyword-grader.ts
similarity index 100%
rename from examples/showcase/evaluator-conformance/evaluators/keyword-grader.ts
rename to examples/showcase/grader-conformance/graders/keyword-grader.ts
diff --git a/examples/showcase/evaluator-conformance/package.json b/examples/showcase/grader-conformance/package.json
similarity index 73%
rename from examples/showcase/evaluator-conformance/package.json
rename to examples/showcase/grader-conformance/package.json
index 7ac4a2b2c..fdb9cc782 100644
--- a/examples/showcase/evaluator-conformance/package.json
+++ b/examples/showcase/grader-conformance/package.json
@@ -1,5 +1,5 @@
 {
-  "name": "agentv-example-evaluator-conformance",
+  "name": "agentv-example-grader-conformance",
   "private": true,
   "type": "module",
   "dependencies": {
diff --git a/packages/eval/README.md b/packages/eval/README.md
index 2db9802ac..ee9c764fa 100644
--- a/packages/eval/README.md
+++ b/packages/eval/README.md
@@ -57,7 +57,7 @@ For complete documentation including:
 - Execution metrics usage
 - Best practices
 
-See the [Custom Graders Guide](../../plugins/agentv-dev/skills/agentv-eval-writer/references/custom-graders.md) or run AgentV's `/agentv-eval-builder` skill.
+See the [Custom Graders Guide](../../apps/web/src/content/docs/docs/graders/custom-graders.mdx) or run AgentV's `/agentv-eval-builder` skill.
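Aside (not part of the patch): the conformance checker above exercises graders through the stdin/stdout contract documented in PATCH 1/3 (`{score, assertions: [{text, passed, evidence?}], reasoning}`). A minimal TypeScript grader honoring that contract could look like this sketch — the keyword check itself is invented for illustration, and only `answer` from the documented input fields is used:

```ts
// Minimal code grader sketch: read the grading input as JSON on stdin,
// write a {score, assertions, reasoning} result to stdout.
async function main(): Promise<void> {
  const chunks: Buffer[] = [];
  for await (const chunk of process.stdin) {
    chunks.push(chunk as Buffer);
  }
  const input = JSON.parse(Buffer.concat(chunks).toString("utf8"));

  // `answer` is one of the documented input fields; this check is illustrative.
  const answer: string = typeof input.answer === "string" ? input.answer : "";
  const passed = /\bdef\b/.test(answer);

  const result = {
    score: passed ? 1 : 0,
    assertions: [{ text: "answer contains the `def` keyword", passed }],
    reasoning: passed ? "Required keyword present." : "Required keyword missing.",
  };
  process.stdout.write(JSON.stringify(result));
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```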
 ## Repository

From cd2f3cce0dc4e62e2649afc805ea5ff46528cff4 Mon Sep 17 00:00:00 2001
From: Christopher
Date: Wed, 15 Apr 2026 21:29:57 +0000
Subject: [PATCH 3/3] docs: remove issue plan artifact

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
---
 docs/plans/1109-grader-rename.md | 13 -------------
 1 file changed, 13 deletions(-)
 delete mode 100644 docs/plans/1109-grader-rename.md

diff --git a/docs/plans/1109-grader-rename.md b/docs/plans/1109-grader-rename.md
deleted file mode 100644
index 6683027dd..000000000
--- a/docs/plans/1109-grader-rename.md
+++ /dev/null
@@ -1,13 +0,0 @@
-Problem: Hard-rename internal AgentV terminology from Evaluator to Grader across core, SDK exports, tests, docs, and UI copy, without changing YAML kind strings or `scores[].type`.
-
-Implementation plan:
-1. Move core evaluator source/tests to `graders/` and rename registry/loader files plus exported TS symbols (`Evaluator` -> `Grader`, `EvaluatorRegistry` -> `GraderRegistry`, etc.).
-2. Update dependent packages and applications (`packages/eval`, CLI, Studio) to consume the renamed symbols and user-facing terminology.
-3. Sweep examples, docs, plugins, and repo guidance for concept-noun `evaluator` references; add the breaking-change migration notes required by the issue.
-4. Run required validation, capture the live eval wire-format check, smoke-check Studio labels, and open/push the draft PR.
-
-Scope guardrails:
-- Keep YAML kind strings unchanged.
-- Keep `scores[].type` unchanged.
-- Keep `evaluation/` and `evaluate()` unchanged.
-- Do not add compatibility aliases for removed `Evaluator*` symbols.
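Aside (not part of the patch): because the plan removes the `Evaluator*` exports without compatibility aliases, downstream TypeScript has to rename its imports at upgrade time. A hedged before/after sketch — the `@agentv/core` entry point, the constructor call, and the exact symbol names are inferred from the plan text above, not from a verified API listing:

```ts
// Before the rename (now removed):
// import { EvaluatorRegistry } from "@agentv/core";

// After the rename — behavior unchanged, only identifiers move:
import { GraderRegistry } from "@agentv/core";

// Per the plan's guardrails, YAML kind strings and `scores[].type` are
// untouched, so eval files and stored results need no migration.
const registry = new GraderRegistry(); // constructor signature assumed
```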
