diff --git a/docs/ARCHITECTURE.md b/docs/ARCHITECTURE.md index 6d6be1a..4dd66c0 100644 --- a/docs/ARCHITECTURE.md +++ b/docs/ARCHITECTURE.md @@ -2,9 +2,10 @@ This document provides a comprehensive overview of the Weval architecture, detailing the distinct workflows that power the platform and the core components that drive evaluation. -The system is designed around two primary use cases: -1. **The Automated "Public Commons" Workflow**: A continuous integration pipeline that automatically evaluates community-contributed blueprints and updates the public `weval.org` website. -2. **The Interactive "Developer & Sandbox" Workflow**: A set of tools for developers and prompt engineers to create, test, and iterate on blueprints either locally or in a web-based environment. +Weval runs evaluations through two main paths: + +1. **The Automated "Public Commons" Workflow:** A continuous integration pipeline that automatically evaluates community-contributed blueprints and updates the public weval.org website. +2. **The Interactive "Developer & Sandbox" Workflow:** A set of tools for developers and prompt engineers to create, test, and iterate on blueprints either locally or in a web-based environment. ## 1. High-Level Workflows @@ -12,17 +13,16 @@ The following diagrams illustrate the two main operational flows of the platform ### The Automated "Public Commons" Workflow -This workflow describes how community contributions are automatically evaluated and published. - ```mermaid graph TD subgraph "Contribution" A[("fa:fa-user Contributor")] -- Proposes Blueprint --> B{{"fa:fa-github weval/configs
GitHub Repo"}} end - subgraph "Automated Evaluation (Netlify)" - C["fa:fa-clock fn: fetch-and-schedule-evals
(Weekly Cron Job)"] -- Fetches Blueprints --> B - C -- Triggers --> D["fa:fa-cogs fn: execute-evaluation-background"] + subgraph "Automated Evaluation (GitHub Actions cron + Railway)" + C0["fa:fa-clock GitHub Actions Cron
(weekly-eval-check.yml)"] -- POSTs --> C + C["fa:fa-bolt route: fetch-and-schedule-evals
(Railway-hosted Next.js API)"] -- Fetches Blueprints --> B + C -- Triggers (authenticated HTTP) --> D["fa:fa-cogs route: execute-evaluation-background
(same Railway service)"] D -- Runs Core Pipeline --> E[Core: comparison-pipeline-service] E -- Generates Raw Results --> F[("fa:fa-aws S3 Bucket
Raw Results (*_comparison.json)")] D -- Calculates Summaries --> G[("fa:fa-aws S3 Bucket
Aggregate Summaries")] @@ -38,13 +38,11 @@ graph TD class A user; class B,H platform; - class C,D,E,F,G process; + class C0,C,D,E,F,G process; ``` ### The Interactive "Developer & Sandbox" Workflow -This workflow shows the parallel paths for local CLI development and web-based Sandbox use. Both are powered by the same core evaluation engine. - ```mermaid graph LR subgraph "Path A: Local CLI Development" @@ -55,12 +53,12 @@ graph LR subgraph "Path B: Web Sandbox" C[("fa:fa-user Prompt Engineer")] --> D["fa:fa-flask Sandbox UI"] - D -- API Call --> G["Backend API
(/api/sandbox/run)"] - G -- Triggers --> H["fa:fa-cogs fn: execute-sandbox-pipeline-background"] + D -- Start Run --> G["Backend API
(/api/sandbox/run)"] + G -- Authenticated HTTP --> H["fa:fa-cogs route: execute-sandbox-pipeline-background
(same Railway service)"] H -- Runs Core Pipeline --> E - H -- Writes to --> I[("fa:fa-aws S3 Bucket
/sandbox-runs/")] - D -- Polls Status --> G - G -- Reads Status From --> I + H -- Writes to --> I[("fa:fa-aws S3 Bucket
live/sandbox/runs/")] + D -- Polls Status --> G2["Status API
(/api/sandbox/status/[sandboxId])"] + G2 -- Reads Status From --> I end J[("fa:fa-chart-bar Local Dashboard
(pnpm dev)")] -- Reads From --> F & I @@ -72,22 +70,21 @@ graph LR class A,C user; class B,F local; - class D,G,H,I web; + class D,G,G2,H,I web; class E,J core; ``` ## 2. Component Deep Dive -Each component in the diagrams above has a specific role in the ecosystem. - ### Core Services (Shared Logic) -These are the foundational services used across all workflows, ensuring evaluation consistency. -- **`comparison-pipeline-service.ts`**: The central orchestrator that manages a single evaluation run. It takes a configuration, generates model responses, and calls the necessary evaluators. -- **`llm-coverage-evaluator.ts`**: Implements the rubric-based scoring logic. It uses "judge" LLMs to assess responses against the `should` and `should_not` criteria defined in a blueprint. It supports complex rubrics including alternative paths (OR logic), where the best-performing path is selected. -- **`storageService.ts`**: A critical abstraction layer that handles all file I/O, allowing the system to seamlessly read and write from either the local filesystem or a cloud provider like AWS S3. -- **`summaryCalculationUtils.ts`**: Contains the post-processing logic for calculating aggregate metrics like the **Hybrid Score**, model performance drift, and leaderboard rankings. This service operates on completed raw result files. + +- **`comparison-pipeline-service.ts`**: The central orchestrator that manages a single evaluation run. It takes a configuration, generates model responses, and calls the necessary evaluators. +- **`llm-coverage-evaluator.ts`**: Implements the rubric-based scoring logic. It uses "judge" LLMs to assess responses against the `should` and `should_not` criteria defined in a blueprint. It supports complex rubrics including alternative paths (OR logic), where the best-performing path is selected. 
+- **`storageService.ts`**: A critical abstraction layer that handles all file I/O, allowing the system to seamlessly read and write from either the local filesystem or a cloud provider like AWS S3. +- **`summaryCalculationUtils.ts`**: Orchestrates post-processing after a run completes — computing model performance drift and leaderboard rankings, and calling `calculateHybridScore` (defined in `calculationUtils.ts`) to produce the **Hybrid Score**. Operates on completed raw result files. ### Storage Architecture (The `live/` Directory) + All active application data is stored within a single, top-level `live/` directory inside the configured storage provider (either local `.results/` or the S3 bucket). This centralized approach simplifies data management, backup, and restoration. The structure inside `live/` is organized by data type: @@ -102,10 +99,12 @@ graph TD; A --> C["blueprints/"]; A --> D["models/"]; A --> E["sandbox/"]; + A --> W["workshop/"]; + A --> P["pr-evals/"]; B --> B1["homepage_summary.json"]; B --> B2["latest_runs_summary.json"]; - B --> B3["search_index.json"]; + B --> B3["search-index.json"]; C --> C1["[config-id]/"]; C1 --> C2["[run-file].json"]; @@ -113,37 +112,47 @@ graph TD; D --> D1["summaries/"]; D --> D2["cards/"]; + D --> D3["ndeltas/"]; + D --> D4["vibes/"]; + D --> D5["compass/"]; D1 --> D1a["[model-id].json"]; D2 --> D2a["[model-id].json"]; + D3 --> D3a["manifest.json"]; + D4 --> D4a["index.json"]; + D5 --> D5a["index.json"]; + + W --> W1["runs/[workshopId]/[wevalId]/_comparison.json"]; + P --> P1["[prNumber]/[sanitized]/..."]; classDef dir fill:#24292e,stroke:#58a6ff,stroke-width:1px,color:#fff; classDef file fill:#333,stroke:#00c7b7,stroke-width:1px,color:#fff; - class A,B,C,D,E,C1,D1,D2 dir; - class B1,B2,B3,C2,C3,D1a,D2a file; + class A,B,C,D,E,W,P,C1,D1,D2,D3,D4,D5 dir; + class B1,B2,B3,C2,C3,D1a,D2a,D3a,D4a,D5a,W1,P1 file; ``` -- **`live/aggregates/`**: Contains all global, cross-cutting summary files. 
- - `homepage_summary.json`: The main manifest for the website's homepage. - - `latest_runs_summary.json`: A list of the 50 most recent evaluation runs. - - `search_index.json`: The pre-compiled index for the website's search functionality. -- **`live/blueprints/`**: Contains the core evaluation data, organized by each blueprint's unique ID. Each subdirectory contains the raw JSON outputs for every run of that blueprint, plus a `summary.json` of its historical performance. -- **`live/models/`**: Contains data aggregated on a per-model basis. - - `summaries/`: Detailed performance breakdowns for each model across all blueprints. - - `cards/`: The high-level, qualitative "Model Cards" generated for model families. - -- **`live/blueprints/[config-id]/[runLabel]_[timestamp]/`** – *Artefact-Based Run Layout* (Introduced 2025-08-07) - - `core.json` → Lightweight "above-the-fold" payload. Keeps: - - config metadata, promptIds, effectiveModels - - similarityMatrix - - executiveSummary - - thin `llmCoverageScores` (avgCoverageExtent, keyPointsCount, optional stdDev/sampleCount and **lightweight pointAssessments – no text**) - - **Place-holders** for bulky fields (`allFinalAssistantResponses`, `fullConversationHistories`) - - `responses/` → prompt-level final assistant responses split by prompt (`responses/[promptId].json`). - - `coverage/` → per-prompt × model rubric evaluations (`coverage/[promptId]/[modelId].json`). - - `histories/` → per-prompt × model full conversation histories (`histories/[promptId]/[modelId].json`). - - *(Legacy)* `[runLabel]_[timestamp]_comparison.json` – the original monolithic file is still generated for backward compatibility but will be phased out. - - The application fetches `core.json` via `/api/comparison/.../core` to render the page instantly. Detailed data is lazy-loaded on demand from `responses/` and `coverage/` paths, with automatic fallback to the legacy monolithic file when artefacts are missing. 
+- **`live/aggregates/`**: Contains all global, cross-cutting summary files. + - `homepage_summary.json`: The main manifest for the website's homepage. + - `latest_runs_summary.json`: A list of the 50 most recent evaluation runs. + - `search-index.json`: The pre-compiled index for the website's search functionality. +- **`live/blueprints/`**: Contains the core evaluation data, organized by each blueprint's unique ID. Each subdirectory contains the raw JSON outputs for every run of that blueprint, plus a `summary.json` of its historical performance. +- **`live/models/`**: Contains data aggregated on a per-model basis. + - `summaries/`: Detailed performance breakdowns for each model across all blueprints. + - `cards/`: The high-level, qualitative "Model Cards" generated for model families. + - `ndeltas/`: Per-model normalized score deltas (one JSON per model, plus a `manifest.json` index). + - `vibes/`: Pre-computed "vibes" index (`index.json`) consumed by the model-vibes UI. + - `compass/`: Pre-computed capability-compass index (`index.json`) consumed by the compass UI. + +- **`live/blueprints/[config-id]/[runLabel]_[timestamp]/`** – *Artefact-Based Run Layout* (Introduced 2025-08-07) + - `core.json` → Lightweight "above-the-fold" payload. Keeps: + - config metadata, promptIds, effectiveModels + - similarityMatrix + - executiveSummary + - thin `llmCoverageScores` (avgCoverageExtent, keyPointsCount, optional stdDev/sampleCount and **lightweight pointAssessments – no text**) + - **Place-holders** for bulky fields (`allFinalAssistantResponses`, `fullConversationHistories`) + - `responses/` → prompt-level final assistant responses split by prompt (`responses/[promptId].json`). + - `coverage/` → per-prompt × model rubric evaluations (`coverage/[promptId]/[modelId].json`). + - `histories/` → per-prompt × model full conversation histories (`histories/[promptId]/[modelId].json`). 
+ - *(Legacy)* `[runLabel]_[timestamp]_comparison.json` – the original monolithic file is still generated for backward compatibility but will be phased out. #### Fixtures (Optional deterministic responses) @@ -151,17 +160,34 @@ graph TD; - For multi-turn prompts with `assistant: null` placeholders, fixtures can provide a `turns` array to fill those generated assistant turns in order. - `core.json` continues to contain placeholders for responses and histories by design; the concrete texts are persisted under `responses/` and `histories/` regardless of fixtures usage. -- **`live/sandbox/`**: Dedicated, isolated area for temporary data generated by the web-based Sandbox Studio. +- **`live/sandbox/`**: Dedicated, isolated area for temporary data generated by the web-based Sandbox Studio. Sandbox runs land at `live/sandbox/runs/[runId]/{blueprint.yml, status.json, ...}` and are garbage-collected after 7 days by the daily cleanup cron (see below). +- **`live/workshop/`**: Storage for collaborative workshop runs. Each weval is persisted at `live/workshop/runs/[workshopId]/[wevalId]/_comparison.json`. Workshop runs intentionally use the legacy monolithic file format rather than the artefact bundle. +- **`live/pr-evals/`**: Per-PR preview evaluations, written by `execute-pr-evaluation-background`. Layout: `live/pr-evals/[prNumber]/[sanitized-blueprint-id]/...`, otherwise mirroring the `live/blueprints/` artefact layout. ### Automated Workflow Components -These components power the public `weval.org` platform. -- **`fn: fetch-and-schedule-evals`**: A Netlify cron job that runs weekly. It scans the `weval/configs` repository for new or updated blueprints with the `_periodic` tag and triggers evaluation runs for them. -- **`fn: execute-evaluation-background`**: The Netlify background function that performs the actual evaluation for the public site. 
It calls the core services and is responsible for creating both the raw result file and updating the aggregate summary files in S3. + +- **GitHub Actions cron — Weekly evaluation** (`weekly-eval-check.yml`): The actual scheduler. Runs every Sunday at 00:00 UTC and POSTs an authenticated request to `${RAILWAY_APP_URL}/api/internal/fetch-and-schedule-evals` with a configurable batch size. +- **GitHub Actions cron — Daily sandbox cleanup** (`cleanup-sandbox-runs.yml`): Runs every day at 02:00 UTC and POSTs an authenticated request to `${RAILWAY_APP_URL}/api/internal/cleanup-sandbox-runs`, which deletes objects under `live/sandbox/runs/` older than 7 days (`CLEANUP_AGE_DAYS = 7`). +- **`/api/internal/fetch-and-schedule-evals`** (`fetch-and-schedule-evals`): A Next.js API route hosted on Railway. It scans the `weval/configs` repository for new or updated blueprints with the `_periodic` tag and triggers evaluation runs for them by calling `/api/internal/execute-evaluation-background` over authenticated HTTP (via `callBackgroundFunction`). +- **`/api/internal/execute-evaluation-background`** (`execute-evaluation-background`): A long-running Next.js API route handler — also on Railway — that performs the actual evaluation for the public site. It calls the core services and is responsible for creating both the raw result file and updating the aggregate summary files in S3. The caller (`callBackgroundFunction`) awaits a response up to a 30-second timeout; for evaluations that exceed that, the caller's connection is dropped but Railway continues running the handler to completion. Throughout this document, "background route" refers to this pattern: a long-running internal HTTP handler, **not** a Netlify Function. + +#### Other internal background routes + +Beyond the two canonical public-commons routes above, the codebase ships several additional internal background routes under `api/internal/`. 
They follow the same pattern (`callBackgroundFunction` → authenticated `POST` → long-running Next.js handler on Railway) and exist to support adjacent surfaces: + +- **`execute-pr-evaluation-background`**: Runs evaluations for blueprints proposed in a PR; output lands under `live/pr-evals/[prNumber]/...`. +- **`execute-api-evaluation-background`**: Runs evaluations triggered by the public HTTP API. +- **`execute-sandbox-pipeline-background`**: Runs evaluations triggered by the Sandbox Studio (also re-used by the workshop "retry" path); output lands under `live/sandbox/runs/...`. +- **`execute-story-quick-run-background`**: Backs the story / quick-run UX with low-latency single-model evaluations. +- **`generate-pairs-background`**: Populates the pairwise comparison queue (companion to `populatePairwiseQueue` in `pairwise-task-queue-service`). +- **`cleanup-sandbox-runs`**: The daily-cron target listed above. +- **`factcheck`**: Supporting service used by editorial / annotation flows. +- **`demo-external-evaluator`**, **`debug-env`**: Diagnostic / demo endpoints; not part of the production data path. ### Interactive Workflow Components -These components support the developer and sandbox environments. -- **`cli: run-config`**: The main command-line tool for developers. By default, it runs the evaluation pipeline for a local or GitHub-based blueprint and saves the results to the local `/.results/` directory, updating only the per-config summary. When used with the `--update-summaries` flag, it additionally rebuilds platform-wide summaries (homepage leaderboards, model summaries, etc.) using the same logic as the backfill process. -- **Sandbox UI & Backend API**: A full-stack feature within the Next.js app that provides an interactive, browser-based IDE for blueprint creation. It has its own set of API endpoints (`/api/sandbox`, `/api/github`) and a dedicated background function (`fn: execute-sandbox-pipeline-background`) for running evaluations. 
+ +- **`cli: run-config`**: The main command-line tool for developers. By default, it runs the evaluation pipeline for a local or GitHub-based blueprint and saves the results to the local `/.results/` directory, updating only the per-config summary. When used with the `--update-summaries` flag, it additionally rebuilds platform-wide summaries (homepage leaderboards, model summaries, etc.) using the same logic as the backfill process. +- **Sandbox UI & Backend API**: A full-stack feature within the Next.js app that provides an interactive, browser-based IDE for blueprint creation. It has its own set of API endpoints (`/api/sandbox`, `/api/github`) and a dedicated background route handler (`/api/internal/execute-sandbox-pipeline-background`) for running evaluations. ## 3. Deep Dive: The Core Evaluation Pipeline @@ -218,7 +244,7 @@ graph TD; subgraph "Path B: Conceptual Check" O{"Is point text-based?"} P["Prompt Judge LLM
(with response + point)"]:::llm; - Q["Judge classifies on 5-point scale"]:::eval; + Q["Judge classifies on the active ordinal scale
(10-class experimental, FORCE_EXPERIMENTAL=true;
see METHODOLOGY.md §classification scale)"]:::eval; R["Map classification to score"]:::eval; S[("Point Score: 0.0-1.0")]:::score; O -- Yes --> P --> Q --> R --> S; @@ -250,7 +276,7 @@ graph TD; ## 4. Key Architectural Concepts -- **Separation of Raw Data and Summaries**: The core pipeline still produces a monolithic `*_comparison.json` for complete fidelity, *but* the UI now relies on the artefact bundle (`core.json` + `responses/` + `coverage/`) for 95 % of use-cases. High-level metrics like the **Hybrid Score** are *not* in either raw form; they are computed afterward by `summaryCalculationUtils.ts` and saved into summary files (e.g. `homepage_summary.json`). -- **Consistency via Shared Services**: By using the same core services (`comparison-pipeline-service`, `storageService`, etc.) for both the automated Netlify workflow and the manual CLI/Sandbox workflow, the platform ensures that an evaluation produces the same results regardless of how it was triggered. -- **Idempotent, Content-Hashed Runs**: The automated workflow uses a hash of a blueprint's content (including its fully resolved model list) as its `runLabel`. This ensures that identical blueprints are not re-run unnecessarily, saving significant computational resources. -- **Graceful Fallback & Progressive Enhancement**: The Sandbox is a prime example of this design principle. It is fully functional for anonymous users, with all work saved to local storage. Authenticating with GitHub progressively enhances the experience by enabling cloud-based file management and the ability to contribute back to the public commons. \ No newline at end of file +- **Separation of Raw Data and Summaries**: The core pipeline still produces a monolithic `*_comparison.json` for complete fidelity, *but* the UI now relies on the artefact bundle (`core.json` + `responses/` + `coverage/`) for 95 % of use-cases. 
High-level metrics like the **Hybrid Score** are *not* in either raw form; they are computed afterward by `calculateHybridScore` in `calculationUtils.ts` (called via `summaryCalculationUtils.ts`) and saved into summary files (e.g. `homepage_summary.json`). +- **Consistency via Shared Services**: By using the same core services (`comparison-pipeline-service`, `storageService`, etc.) for both the automated cron-driven workflow (GitHub Actions → Railway) and the manual CLI/Sandbox workflow, the platform ensures that an evaluation produces the same results regardless of how it was triggered. +- **Idempotent, Content-Hashed Runs**: The automated workflow uses a hash of a blueprint's content (including its fully resolved model list) as its `runLabel`. This ensures that identical blueprints are not re-run unnecessarily, saving significant computational resources. +- **Graceful Fallback & Progressive Enhancement**: The Sandbox is a prime example of this design principle. It is fully functional for anonymous users, with all work saved to local storage. Authenticating with GitHub progressively enhances the experience by enabling cloud-based file management and the ability to contribute back to the public commons. diff --git a/docs/BLUEPRINT_FORMAT.md b/docs/BLUEPRINT_FORMAT.md index 85d1292..2c579a9 100644 --- a/docs/BLUEPRINT_FORMAT.md +++ b/docs/BLUEPRINT_FORMAT.md @@ -117,7 +117,7 @@ The `evaluationConfig` field allows you to customize how evaluations are perform evaluationConfig: llm-coverage: judges: [...] # Custom judge configuration - useExperimentalScale: true # Use 9-point scale instead of 5-point + useExperimentalScale: true # Opt into the 10-class non-linear scale instead of the legacy 5-class linear scale (note: currently forced on globally; see "Classification Scale" below) ``` #### LLM Coverage Evaluation Options @@ -125,7 +125,7 @@ evaluationConfig: | Field | Type | Description | |---|---|---| | `judges` | `Judge[]` | **(Optional)** Custom judge configuration. 
If omitted, uses the default judges. Each judge is an object with `id`, `model`, and `approach` fields. See below for details. | -| `useExperimentalScale` | `boolean` | **(Optional)** If `true`, uses the experimental 9-point classification scale (0.0, 0.001, 0.125, 0.25, 0.375, 0.5, 0.625, 0.75, 0.875, 1.0) instead of the default 5-point scale (0.0, 0.25, 0.5, 0.75, 1.0). This provides finer granularity in rubric scoring. | +| `useExperimentalScale` | `boolean` | **(Optional)** If `true`, uses the experimental **10-class** classification scale (0.0, 0.001, 0.125, 0.25, 0.375, 0.5, 0.625, 0.75, 0.875, 1.0) instead of the legacy 5-class scale (0.0, 0.25, 0.5, 0.75, 1.0). This provides finer granularity in rubric scoring, with a deliberately non-linear gap between "utterly unmet / contradictory" (0.0) and "merely unmet" (0.001). **Note:** the experimental scale is currently forced on globally via `FORCE_EXPERIMENTAL = true` in `src/cli/evaluators/llm-coverage-evaluator.ts`, so this flag currently has no effect: the 10-class scale is used whether the flag is `true`, `false`, or omitted. See `docs/METHODOLOGY.md` for details. | | `judgeModels` | `string[]` | **(Deprecated)** Legacy field for backwards compatibility. Use `judges` instead. | | `judgeMode` | `'failover' \| 'consensus'` | **(Deprecated)** Legacy field for backwards compatibility. The system now always uses consensus mode across all configured judges. 
| @@ -141,17 +141,22 @@ Each judge in the `judges` array is an object with the following fields: **Default Judges:** -If no custom judges are specified, the system uses these default judges: +If no custom judges are specified, the system uses these default judges (defined as `DEFAULT_JUDGES` in `src/cli/evaluators/llm-coverage-evaluator.ts`): ```yaml judges: - - id: 'holistic-qwen3-30b-a3b-instruct-2507' - model: 'openrouter:qwen/qwen3-30b-a3b-instruct-2507' + - id: 'holistic-gemini-2-5-flash' + model: 'openrouter:google/gemini-2.5-flash' approach: 'holistic' - - id: 'holistic-openai-gpt-oss-120b' - model: 'openrouter:openai/gpt-oss-120b' + - id: 'holistic-gpt-4-1-mini' + model: 'openrouter:openai/gpt-4.1-mini' + approach: 'holistic' + - id: 'holistic-claude-haiku-4-5' + model: 'openrouter:anthropic/claude-haiku-4.5' approach: 'holistic' ``` +The defaults are chosen for **cross-family consensus** (one judge from each of Google, OpenAI, and Anthropic) over **approach diversity** (all three default judges use the `holistic` approach). To get approach diversity, configure custom judges explicitly. + **Example with Custom Judges:** ```yaml @@ -178,7 +183,7 @@ evaluationConfig: **Backup Judge:** -If all configured judges fail to return a valid assessment, the system automatically attempts to use a backup judge (`anthropic:claude-3.5-haiku` with `holistic` approach) to ensure evaluation can complete. This backup is only used when custom judges are not configured. +If any of the primary default judges fail to return a valid assessment, the system automatically attempts to use a backup judge (`openrouter:anthropic/claude-haiku-4.5` with `holistic` approach — defined as `DEFAULT_BACKUP_JUDGE` in `src/cli/evaluators/llm-coverage-evaluator.ts`) to ensure evaluation can complete. This backup is only used when **custom `judges` are not configured** — to preserve user intent, the backup is suppressed when a blueprint specifies its own judges. 
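
The backup-judge rule described above can be sketched as a small predicate. This is a minimal illustration, not the code in `llm-coverage-evaluator.ts`: the function name `shouldRunBackupJudge` and the `BackupDecisionInput` shape are hypothetical, and only the two conditions (at least one primary judge failed, and no custom `judges` configured) come from the text.

```typescript
// Hypothetical input shape; the field names are assumptions made for illustration.
interface BackupDecisionInput {
  hasCustomJudges: boolean;      // true when the blueprint specifies its own `judges`
  totalPrimaryJudges: number;    // number of primary judges that were asked
  successfulJudgements: number;  // number of primary judges that returned a valid assessment
}

// Returns true when the backup judge should be attempted.
function shouldRunBackupJudge(input: BackupDecisionInput): boolean {
  // Custom judge sets are left untouched to preserve user intent.
  if (input.hasCustomJudges) {
    return false;
  }
  // At least one primary default judge failed to return a valid assessment.
  return input.successfulJudgements < input.totalPrimaryJudges;
}

// Example: one of three default judges failed, so the backup is attempted.
const useBackup = shouldRunBackupJudge({
  hasCustomJudges: false,
  totalPrimaryJudges: 3,
  successfulJudgements: 2,
});
console.log(useBackup); // true
```

Keeping the decision in one pure function makes the two conditions easy to test in isolation from any judge calls.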
### Model Configuration diff --git a/docs/INTER_AGREEMENT_PLAN.md b/docs/INTER_AGREEMENT_PLAN.md index 97fea18..4d0ee1f 100644 --- a/docs/INTER_AGREEMENT_PLAN.md +++ b/docs/INTER_AGREEMENT_PLAN.md @@ -392,7 +392,7 @@ llmCoverageScores[promptData.promptId][modelId] = { No additional changes needed! The `judgeAgreement` field will be: - Stored in result JSON files (automatic via type extension) - Available in API responses (automatic via type extension) -- Persisted to Netlify Blobs (automatic via serialization) +- Persisted via the storage abstraction in `src/lib/storageService.ts` (S3 in production, local FS in development; automatic via serialization) --- diff --git a/docs/METHODOLOGY.md b/docs/METHODOLOGY.md index 92f362f..f0ef8bf 100644 --- a/docs/METHODOLOGY.md +++ b/docs/METHODOLOGY.md @@ -54,35 +54,53 @@ evaluationConfig: - { model: 'openrouter:google/gemini-pro-1.5', approach: 'prompt-aware' } ``` -If no `judges` are specified, the system uses a default set designed to provide a balanced evaluation: -1. **`prompt-aware` approach (default):** A judge sees the response, the criterion, and the original user prompt. This allows the judge to consider the criterion in the context of the user's request. -2. **`holistic` approach (default):** A judge sees the response, the criterion, the user prompt, and *all other criteria* in the rubric. This provides the richest context, allowing the judge to assess the point as part of a whole, which can be useful for identifying redundancy or assessing trade-offs. +If no `judges` are specified, the system uses `DEFAULT_JUDGES` (defined in `src/cli/evaluators/llm-coverage-evaluator.ts`) — a fixed set of **three holistic judges across three model families**, chosen for cross-family diversity to mitigate single-vendor bias: -The **`standard`** approach (criterion-only) remains supported and can be configured explicitly if desired, but it is not part of the current default set. 
A backup judge is also used to improve robustness when primary judges fail. +1. `holistic-gemini-2-5-flash` → `openrouter:google/gemini-2.5-flash` +2. `holistic-gpt-4-1-mini` → `openrouter:openai/gpt-4.1-mini` +3. `holistic-claude-haiku-4-5` → `openrouter:anthropic/claude-haiku-4.5` + +All three default judges use the **`holistic`** approach (sees the response, the criterion, the user prompt, and *all other criteria* in the rubric). The platform's three approaches — **`standard`** (criterion-only), **`prompt-aware`** (response + criterion + prompt), and **`holistic`** (response + criterion + prompt + full rubric) — all remain supported and can be mixed in custom blueprint judge configs. They are simply not part of the current default set, which prioritizes cross-family consensus over approach diversity. A backup judge (Claude Haiku 4.5, holistic) is also used to improve robustness when primary judges fail. #### Judge Prompting and Classification A specific, structured prompt is used to elicit a judgment for each individual point in the rubric, tailored to the judge's `approach`. * **System Prompt Persona**: The judge is instructed to act as an "expert evaluator and examiner" and to adhere strictly to the task and output format. -* **Task Definition**: The judge is presented with the model's response (``) and a single criterion (``) and is asked to classify the degree to which the criterion is present in the text according to a 5-point scale. Depending on the `approach`, the original `` and the full `` may also be included for context. -* **The 5-Point Scale**: The judge must choose one of the following five classes: - * `CLASS_UNMET`: The criterion is not met. - * `CLASS_PARTIALLY_MET`: The criterion is partially met. - * `CLASS_MODERATELY_MET`: The criterion is moderately met. - * `CLASS_MAJORLY_MET`: The criterion is mostly met. - * `CLASS_EXACTLY_MET`: The criterion is fully met. 
+* **Task Definition**: The judge is presented with the model's response (``) and a single criterion (``) and is asked to classify the degree to which the criterion is present in the text according to an ordinal scale. Depending on the `approach`, the original `` and the full `` may also be included for context. +* **The Classification Scales**: Two scales are defined in the codebase. The platform currently forces the **experimental 10-class scale** through a global override (`FORCE_EXPERIMENTAL = true` at the top of `src/cli/evaluators/llm-coverage-evaluator.ts`). While the override is on, the per-blueprint flag `evaluationConfig['llm-coverage'].useExperimentalScale` has no effect; with the override turned off, a blueprint that leaves the flag unset or `false` falls back to the older 5-class scale. + + **Production today (10-class experimental, `EXPERIMENTAL_CLASSIFICATION_SCALE`):** + * `CLASS_UTTERLY_UNMET`: The criterion is so completely absent that the content in fact contradicts or contravenes it. + * `CLASS_UNMET`: The criterion is not met. + * `CLASS_TRACE`: Only a trace or hint of the criterion appears. + * `CLASS_SLIGHT`: A slight presence of the criterion is detectable. + * `CLASS_PARTIAL`: Partial fulfillment; important elements are missing. + * `CLASS_MODERATE`: Moderate fulfillment; balanced presence with notable gaps. + * `CLASS_SUBSTANTIAL`: Substantial fulfillment; most key aspects are present. + * `CLASS_MAJOR`: Major fulfillment; minor omissions remain. + * `CLASS_VERY_NEARLY`: Very nearly fully met; only negligible details missing. + * `CLASS_EXACT`: Exactly and fully meets the criterion. + + **Legacy 5-class scale (`CLASSIFICATION_SCALE`, used only when the global override is off and the blueprint does not enable the experimental scale):** + * `CLASS_UNMET` / `CLASS_PARTIALLY_MET` / `CLASS_MODERATELY_MET` / `CLASS_MAJORLY_MET` / `CLASS_EXACTLY_MET`. #### Mathematical Scoring of Rubric Points The judge's categorical classification is mapped to a quantitative score. 
-* **Numerical Mapping**: The classification is mapped to a linear, equidistant numerical scale: - * `CLASS_UNMET` -> **0.0** - * `CLASS_PARTIALLY_MET` -> **0.25** - * `CLASS_MODERATELY_MET` -> **0.50** - * `CLASS_MAJORLY_MET` -> **0.75** - * `CLASS_EXACTLY_MET` -> **1.0** +* **Numerical Mapping (10-class experimental scale, current default):** The mapping is **deliberately non-linear at the unmet end**, with a tiny gap separating contradictory content from merely-absent content: + * `CLASS_UTTERLY_UNMET` → **0.000** + * `CLASS_UNMET` → **0.001** + * `CLASS_TRACE` → **0.125** + * `CLASS_SLIGHT` → **0.250** + * `CLASS_PARTIAL` → **0.375** + * `CLASS_MODERATE` → **0.500** + * `CLASS_SUBSTANTIAL` → **0.625** + * `CLASS_MAJOR` → **0.750** + * `CLASS_VERY_NEARLY` → **0.875** + * `CLASS_EXACT` → **1.000** +* **Numerical Mapping (5-class legacy scale, opt-in):** Linear / equidistant: `CLASS_UNMET` → 0.0, `CLASS_PARTIALLY_MET` → 0.25, `CLASS_MODERATELY_MET` → 0.5, `CLASS_MAJORLY_MET` → 0.75, `CLASS_EXACTLY_MET` → 1.0. * **Score Inversion (`should_not`)**: For criteria that penalize undesirable content, the score is inverted. For an original score $S_{\text{orig}}$, the final score is $S_{\text{final}} = 1 - S_{\text{orig}}$. * **Weighted Aggregation**: A blueprint can assign a `multiplier` (weight) to each point. The final rubric score for a model on a prompt (`avgCoverageExtent`) is the weighted average of all point scores. For $N$ points with score $S_i$ and weight $w_i$: ```math @@ -112,7 +130,7 @@ When multiple judges evaluate the same response, the platform quantifies the deg ##### Krippendorff's Alpha Calculation * **Purpose**: Krippendorff's α measures the consistency of judgments across all judges evaluating a model's response. Values range from 0 (no agreement beyond chance) to 1 (perfect agreement). 
-* **Formula**: For ordinal data (our 0.0, 0.25, 0.50, 0.75, 1.0 scale), α is calculated as:
+* **Formula**: For ordinal data (whichever scale is active; today that is the 10-class experimental mapping, see "Mathematical Scoring of Rubric Points" above), α is calculated as:
   ```math
   \alpha = 1 - \frac{D_o}{D_e}
   ```
@@ -191,7 +209,7 @@ While the platform aims for complete judge coverage on all criteria, individual
 
 **The Backup Judge Mechanism:**
 
-To improve robustness, the system employs a backup judge (Claude 3.5 Haiku) that activates when primary judges fail:
+To improve robustness, the system employs a backup judge (Claude Haiku 4.5, `openrouter:anthropic/claude-haiku-4.5`, holistic approach; defined as `DEFAULT_BACKUP_JUDGE` in `src/cli/evaluators/llm-coverage-evaluator.ts`) that activates when primary judges fail:
 
 * **Trigger Condition**: Backup judge only runs when `successfulJudgements < totalPrimaryJudges` (i.e., at least one primary judge failed)
 * **Not Used with Custom Judges**: To preserve user-configured judge sets, backup judge is disabled when custom `judges` are specified in the blueprint
@@ -528,7 +546,7 @@ Weval's methodology is designed to be robust, but like any quantitative system,
 The validity of Weval's metrics rests on these core assumptions:
 
 * **Assumption of Appropriate Weighting in Hybrid Score**: The Hybrid Score currently uses 0% similarity and 100% coverage (coverage-only). This assumes that rubric adherence is the dominant signal of model quality and that semantic similarity to an ideal response (when available) adds no additional information. While this explicit choice is more transparent than an implicit weighting, it may not be optimal for all evaluation contexts, and future versions may offer configurable weights.
-* **Assumption of Linearity in Score Mapping**: The 5-point categorical scale from the LLM judge is mapped to a linear, equidistant numerical scale. 
This assumes the qualitative gap between "Absent" and "Slightly Present" is the same as between "Majorly Present" and "Fully Present," which may not be perceptually true.
+* **Assumption of (Mostly) Linear Score Mapping**: The categorical scale from the LLM judge is mapped to a numerical scale that is **partially non-linear** in the current 10-class experimental default. The platform deliberately separates `CLASS_UTTERLY_UNMET` (0.000) from `CLASS_UNMET` (0.001) by an explicitly tiny gap to distinguish contradiction from mere absence, and places the remaining classes on a 0.125 grid from `CLASS_TRACE` (0.125) to `CLASS_EXACT` (1.000). This mitigates but does not fully solve the perceptual-gap problem: the middle of the scale still has uniform 0.125 steps that may not match how judges actually perceive the distance between, say, "Moderate" and "Substantial". The legacy 5-class scale (still available via `useExperimentalScale: false`) is fully linear and even more susceptible to this concern. A complete fix would require either empirical calibration of the mapping against human ratings or moving to continuous 0–1 judge outputs.
 * **Assumption of Criterion Independence**: The rubric score (`avgCoverageExtent`) is a weighted average that treats each criterion as an independent variable. It does not account for potential correlations between criteria (e.g., "clarity" and "conciseness").
 * **Assumption of Effective Bias Reduction via Anonymization**: The model anonymization system assumes that removing real model names and providers significantly reduces analyst LLM bias, while preserving maker-level information provides meaningful comparative insights. This assumes that brand bias is primarily driven by explicit name recognition rather than subtle patterns in response style that might persist even when anonymized.
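
To make the linearity and independence assumptions above concrete, the following is a minimal TypeScript sketch of the 10-class mapping, `should_not` inversion, and weighted aggregation. The class labels and numeric values come from the mapping described in this document; every identifier (`SCORE_BY_CLASS`, `JudgedPoint`, `pointScore`, `weightedCoverage`, `adjacentGaps`) is hypothetical and is not taken from `llm-coverage-evaluator.ts`. The default weight of 1 and the normalization by the sum of weights are assumptions based on the usual definition of a weighted average.

```typescript
// Illustrative sketch only: labels and values mirror the documented 10-class
// mapping; all names here are hypothetical, not the evaluator's real API.
const SCORE_BY_CLASS = {
  CLASS_UTTERLY_UNMET: 0.0,
  CLASS_UNMET: 0.001,
  CLASS_TRACE: 0.125,
  CLASS_SLIGHT: 0.25,
  CLASS_PARTIAL: 0.375,
  CLASS_MODERATE: 0.5,
  CLASS_SUBSTANTIAL: 0.625,
  CLASS_MAJOR: 0.75,
  CLASS_VERY_NEARLY: 0.875,
  CLASS_EXACT: 1.0,
} as const;

type CoverageClass = keyof typeof SCORE_BY_CLASS;

interface JudgedPoint {
  classification: CoverageClass;
  multiplier?: number; // weight; defaulting to 1 is an assumption of this sketch
  shouldNot?: boolean; // true for criteria that penalize undesirable content
}

// Map a classification to its score, inverting it for should_not criteria.
function pointScore(point: JudgedPoint): number {
  const score: number = SCORE_BY_CLASS[point.classification];
  return point.shouldNot ? 1 - score : score;
}

// Weighted average of point scores: sum(w_i * S_i) / sum(w_i).
function weightedCoverage(points: JudgedPoint[]): number {
  let weightedSum = 0;
  let weightTotal = 0;
  for (const point of points) {
    const weight = point.multiplier === undefined ? 1 : point.multiplier;
    weightedSum += weight * pointScore(point);
    weightTotal += weight;
  }
  if (weightTotal === 0) {
    throw new Error("weightedCoverage needs at least one point with non-zero weight");
  }
  return weightedSum / weightTotal;
}

// Differences between adjacent classes, ordered from lowest to highest score.
function adjacentGaps(): number[] {
  const keys = Object.keys(SCORE_BY_CLASS) as CoverageClass[];
  const gaps: number[] = [];
  for (let i = 1; i < keys.length; i++) {
    gaps.push(SCORE_BY_CLASS[keys[i]] - SCORE_BY_CLASS[keys[i - 1]]);
  }
  return gaps;
}
```

Under this sketch, `adjacentGaps()` yields nine gaps: 0.001 between the two unmet classes, 0.124 between `CLASS_UNMET` and `CLASS_TRACE`, and 0.125 for each of the remaining seven (up to floating-point rounding). This shows that the non-linearity is confined to the bottom of the scale. Because `weightedCoverage` combines points purely by weight, any correlation between criteria passes through unadjusted.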