From ce6cc5e1977ef0b84dc898988b4ba6fcc29bfa59 Mon Sep 17 00:00:00 2001 From: Ken Ojibe Date: Tue, 5 May 2026 08:42:00 -0400 Subject: [PATCH 1/8] docs: Refine descriptions of automated and interactive workflows, and improve storage architecture section with new data organization insights. --- docs/ARCHITECTURE.md | 139 ++++++++++++++++++++++++++++--------------- 1 file changed, 90 insertions(+), 49 deletions(-) diff --git a/docs/ARCHITECTURE.md b/docs/ARCHITECTURE.md index 6d6be1a9..752e10bb 100644 --- a/docs/ARCHITECTURE.md +++ b/docs/ARCHITECTURE.md @@ -2,9 +2,11 @@ This document provides a comprehensive overview of the Weval architecture, detailing the distinct workflows that power the platform and the core components that drive evaluation. -The system is designed around two primary use cases: -1. **The Automated "Public Commons" Workflow**: A continuous integration pipeline that automatically evaluates community-contributed blueprints and updates the public `weval.org` website. -2. **The Interactive "Developer & Sandbox" Workflow**: A set of tools for developers and prompt engineers to create, test, and iterate on blueprints either locally or in a web-based environment. +Weval runs evaluations through two main paths: scheduled public evaluations and interactive developer/sandbox runs. Both paths use the same core evaluation pipeline, but differ in how runs are triggered, stored, and surfaced in the UI. + + +1. **The Automated "Public Commons" Workflow**: A continuous integration pipeline that automatically evaluates community-contributed blueprints and updates the public `weval.org` website. +2. **The Interactive "Developer & Sandbox" Workflow**: A set of tools for developers and prompt engineers to create, test, and iterate on blueprints either locally or in a web-based environment. ## 1. High-Level Workflows @@ -20,9 +22,10 @@ graph TD A[("fa:fa-user Contributor")] -- Proposes Blueprint --> B{{"fa:fa-github weval/configs
GitHub Repo"}} end - subgraph "Automated Evaluation (Netlify)" - C["fa:fa-clock fn: fetch-and-schedule-evals
(Weekly Cron Job)"] -- Fetches Blueprints --> B - C -- Triggers --> D["fa:fa-cogs fn: execute-evaluation-background"] + subgraph "Automated Evaluation (GitHub Actions cron + Railway)" + C0["fa:fa-clock GitHub Actions Cron
(weekly-eval-check.yml)"] -- POSTs --> C + C["fa:fa-bolt route: fetch-and-schedule-evals
(Railway-hosted Next.js API)"] -- Fetches Blueprints --> B + C -- Triggers (authenticated HTTP) --> D["fa:fa-cogs route: execute-evaluation-background
(same Railway service)"] D -- Runs Core Pipeline --> E[Core: comparison-pipeline-service] E -- Generates Raw Results --> F[("fa:fa-aws S3 Bucket
Raw Results (*_comparison.json)")] D -- Calculates Summaries --> G[("fa:fa-aws S3 Bucket
Aggregate Summaries")] @@ -38,7 +41,7 @@ graph TD class A user; class B,H platform; - class C,D,E,F,G process; + class C0,C,D,E,F,G process; ``` ### The Interactive "Developer & Sandbox" Workflow @@ -55,12 +58,12 @@ graph LR subgraph "Path B: Web Sandbox" C[("fa:fa-user Prompt Engineer")] --> D["fa:fa-flask Sandbox UI"] - D -- API Call --> G["Backend API
(/api/sandbox/run)"] - G -- Triggers --> H["fa:fa-cogs fn: execute-sandbox-pipeline-background"] + D -- Start Run --> G["Backend API
(/api/sandbox/run)"] + G -- Fire-and-forget HTTP --> H["fa:fa-cogs route: execute-sandbox-pipeline-background
(same Railway service)"] H -- Runs Core Pipeline --> E - H -- Writes to --> I[("fa:fa-aws S3 Bucket
/sandbox-runs/")] - D -- Polls Status --> G - G -- Reads Status From --> I + H -- Writes to --> I[("fa:fa-aws S3 Bucket
live/sandbox/runs/")] + D -- Polls Status --> G2["Status API
(/api/sandbox/status/[sandboxId])"] + G2 -- Reads Status From --> I end J[("fa:fa-chart-bar Local Dashboard
(pnpm dev)")] -- Reads From --> F & I @@ -72,7 +75,7 @@ graph LR class A,C user; class B,F local; - class D,G,H,I web; + class D,G,G2,H,I web; class E,J core; ``` @@ -81,13 +84,16 @@ graph LR Each component in the diagrams above has a specific role in the ecosystem. ### Core Services (Shared Logic) + These are the foundational services used across all workflows, ensuring evaluation consistency. -- **`comparison-pipeline-service.ts`**: The central orchestrator that manages a single evaluation run. It takes a configuration, generates model responses, and calls the necessary evaluators. -- **`llm-coverage-evaluator.ts`**: Implements the rubric-based scoring logic. It uses "judge" LLMs to assess responses against the `should` and `should_not` criteria defined in a blueprint. It supports complex rubrics including alternative paths (OR logic), where the best-performing path is selected. -- **`storageService.ts`**: A critical abstraction layer that handles all file I/O, allowing the system to seamlessly read and write from either the local filesystem or a cloud provider like AWS S3. -- **`summaryCalculationUtils.ts`**: Contains the post-processing logic for calculating aggregate metrics like the **Hybrid Score**, model performance drift, and leaderboard rankings. This service operates on completed raw result files. + +- **`comparison-pipeline-service.ts`**: The central orchestrator that manages a single evaluation run. It takes a configuration, generates model responses, and calls the necessary evaluators. +- **`llm-coverage-evaluator.ts`**: Implements the rubric-based scoring logic. It uses "judge" LLMs to assess responses against the `should` and `should_not` criteria defined in a blueprint. It supports complex rubrics including alternative paths (OR logic), where the best-performing path is selected. 
+- **`storageService.ts`**: A critical abstraction layer that handles all file I/O, allowing the system to seamlessly read and write from either the local filesystem or a cloud provider like AWS S3. +- **`summaryCalculationUtils.ts`**: Contains the post-processing logic for calculating aggregate metrics like the **Hybrid Score**, model performance drift, and leaderboard rankings. This service operates on completed raw result files. ### Storage Architecture (The `live/` Directory) + All active application data is stored within a single, top-level `live/` directory inside the configured storage provider (either local `.results/` or the S3 bucket). This centralized approach simplifies data management, backup, and restoration. The structure inside `live/` is organized by data type: @@ -102,6 +108,8 @@ graph TD; A --> C["blueprints/"]; A --> D["models/"]; A --> E["sandbox/"]; + A --> W["workshop/"]; + A --> P["pr-evals/"]; B --> B1["homepage_summary.json"]; B --> B2["latest_runs_summary.json"]; @@ -113,35 +121,47 @@ graph TD; D --> D1["summaries/"]; D --> D2["cards/"]; + D --> D3["ndeltas/"]; + D --> D4["vibes/"]; + D --> D5["compass/"]; D1 --> D1a["[model-id].json"]; D2 --> D2a["[model-id].json"]; + D3 --> D3a["manifest.json"]; + D4 --> D4a["index.json"]; + D5 --> D5a["index.json"]; + + W --> W1["runs/[workshopId]/[wevalId]/_comparison.json"]; + P --> P1["[prNumber]/[sanitized]/..."]; classDef dir fill:#24292e,stroke:#58a6ff,stroke-width:1px,color:#fff; classDef file fill:#333,stroke:#00c7b7,stroke-width:1px,color:#fff; - class A,B,C,D,E,C1,D1,D2 dir; - class B1,B2,B3,C2,C3,D1a,D2a file; + class A,B,C,D,E,W,P,C1,D1,D2,D3,D4,D5 dir; + class B1,B2,B3,C2,C3,D1a,D2a,D3a,D4a,D5a,W1,P1 file; ``` -- **`live/aggregates/`**: Contains all global, cross-cutting summary files. - - `homepage_summary.json`: The main manifest for the website's homepage. - - `latest_runs_summary.json`: A list of the 50 most recent evaluation runs. 
- - `search_index.json`: The pre-compiled index for the website's search functionality. -- **`live/blueprints/`**: Contains the core evaluation data, organized by each blueprint's unique ID. Each subdirectory contains the raw JSON outputs for every run of that blueprint, plus a `summary.json` of its historical performance. -- **`live/models/`**: Contains data aggregated on a per-model basis. - - `summaries/`: Detailed performance breakdowns for each model across all blueprints. - - `cards/`: The high-level, qualitative "Model Cards" generated for model families. - -- **`live/blueprints/[config-id]/[runLabel]_[timestamp]/`** – *Artefact-Based Run Layout* (Introduced 2025-08-07) - - `core.json` → Lightweight "above-the-fold" payload. Keeps: - - config metadata, promptIds, effectiveModels - - similarityMatrix - - executiveSummary - - thin `llmCoverageScores` (avgCoverageExtent, keyPointsCount, optional stdDev/sampleCount and **lightweight pointAssessments – no text**) - - **Place-holders** for bulky fields (`allFinalAssistantResponses`, `fullConversationHistories`) - - `responses/` → prompt-level final assistant responses split by prompt (`responses/[promptId].json`). - - `coverage/` → per-prompt × model rubric evaluations (`coverage/[promptId]/[modelId].json`). - - `histories/` → per-prompt × model full conversation histories (`histories/[promptId]/[modelId].json`). - - *(Legacy)* `[runLabel]_[timestamp]_comparison.json` – the original monolithic file is still generated for backward compatibility but will be phased out. +- **`live/aggregates/`**: Contains all global, cross-cutting summary files. + - `homepage_summary.json`: The main manifest for the website's homepage. + - `latest_runs_summary.json`: A list of the 50 most recent evaluation runs. + - `search_index.json`: The pre-compiled index for the website's search functionality. +- **`live/blueprints/`**: Contains the core evaluation data, organized by each blueprint's unique ID. 
Each subdirectory contains the raw JSON outputs for every run of that blueprint, plus a `summary.json` of its historical performance. +- **`live/models/`**: Contains data aggregated on a per-model basis. + - `summaries/`: Detailed performance breakdowns for each model across all blueprints. + - `cards/`: The high-level, qualitative "Model Cards" generated for model families. + - `ndeltas/`: Per-model normalized score deltas (one JSON per model, plus a `manifest.json` index). + - `vibes/`: Pre-computed "vibes" index (`index.json`) consumed by the model-vibes UI. + - `compass/`: Pre-computed capability-compass index (`index.json`) consumed by the compass UI. + +- **`live/blueprints/[config-id]/[runLabel]_[timestamp]/`** – *Artefact-Based Run Layout* (Introduced 2025-08-07) + - `core.json` → Lightweight "above-the-fold" payload. Keeps: + - config metadata, promptIds, effectiveModels + - similarityMatrix + - executiveSummary + - thin `llmCoverageScores` (avgCoverageExtent, keyPointsCount, optional stdDev/sampleCount and **lightweight pointAssessments – no text**) + - **Place-holders** for bulky fields (`allFinalAssistantResponses`, `fullConversationHistories`) + - `responses/` → prompt-level final assistant responses split by prompt (`responses/[promptId].json`). + - `coverage/` → per-prompt × model rubric evaluations (`coverage/[promptId]/[modelId].json`). + - `histories/` → per-prompt × model full conversation histories (`histories/[promptId]/[modelId].json`). + - *(Legacy)* `[runLabel]_[timestamp]_comparison.json` – the original monolithic file is still generated for backward compatibility but will be phased out. The application fetches `core.json` via `/api/comparison/.../core` to render the page instantly. Detailed data is lazy-loaded on demand from `responses/` and `coverage/` paths, with automatic fallback to the legacy monolithic file when artefacts are missing. 
@@ -151,17 +171,38 @@ graph TD; - For multi-turn prompts with `assistant: null` placeholders, fixtures can provide a `turns` array to fill those generated assistant turns in order. - `core.json` continues to contain placeholders for responses and histories by design; the concrete texts are persisted under `responses/` and `histories/` regardless of fixtures usage. -- **`live/sandbox/`**: Dedicated, isolated area for temporary data generated by the web-based Sandbox Studio. +- **`live/sandbox/`**: Dedicated, isolated area for temporary data generated by the web-based Sandbox Studio. Sandbox runs land at `live/sandbox/runs/[runId]/{blueprint.yml, status.json, ...}` and are garbage-collected after 7 days by the daily cleanup cron (see below). +- **`live/workshop/`**: Storage for collaborative workshop runs. Each weval is persisted at `live/workshop/runs/[workshopId]/[wevalId]/_comparison.json`. Workshop runs intentionally use the legacy monolithic file format rather than the artefact bundle. +- **`live/pr-evals/`**: Per-PR preview evaluations, written by `execute-pr-evaluation-background`. Layout: `live/pr-evals/[prNumber]/[sanitized-blueprint-id]/...`, otherwise mirroring the `live/blueprints/` artefact layout. ### Automated Workflow Components + These components power the public `weval.org` platform. -- **`fn: fetch-and-schedule-evals`**: A Netlify cron job that runs weekly. It scans the `weval/configs` repository for new or updated blueprints with the `_periodic` tag and triggers evaluation runs for them. -- **`fn: execute-evaluation-background`**: The Netlify background function that performs the actual evaluation for the public site. It calls the core services and is responsible for creating both the raw result file and updating the aggregate summary files in S3. + +- **GitHub Actions cron — Weekly evaluation** (`weekly-eval-check.yml`): The actual scheduler. 
Runs every Sunday at 00:00 UTC and POSTs an authenticated request to `${RAILWAY_APP_URL}/api/internal/fetch-and-schedule-evals` with a configurable batch size. +- **GitHub Actions cron — Daily sandbox cleanup** (`cleanup-sandbox-runs.yml`): Runs every day at 02:00 UTC and POSTs an authenticated request to `${RAILWAY_APP_URL}/api/internal/cleanup-sandbox-runs`, which deletes objects under `live/sandbox/runs/` older than 7 days (`CLEANUP_AGE_DAYS = 7`). +- **`/api/internal/fetch-and-schedule-evals`** (`fetch-and-schedule-evals`): A Next.js API route hosted on Railway. It scans the `weval/configs` repository for new or updated blueprints with the `_periodic` tag and triggers evaluation runs for them by calling `/api/internal/execute-evaluation-background` over authenticated HTTP (via `callBackgroundFunction`). +- **`/api/internal/execute-evaluation-background`** (`execute-evaluation-background`): A long-running Next.js API route handler — also on Railway — that performs the actual evaluation for the public site. It calls the core services and is responsible for creating both the raw result file and updating the aggregate summary files in S3. Throughout this document, "background function" refers to these fire-and-forget HTTP-triggered routes; they are **not** Netlify Functions, but the team's term for long-running internal route handlers. + +#### Other internal background routes + +Beyond the two canonical public-commons routes above, the codebase ships several additional internal background routes under `api/internal/`. They follow the same auth/fan-out pattern (`callBackgroundFunction` → authenticated `POST` → long-running Next.js handler on Railway) and exist to support adjacent surfaces: + +- **`execute-pr-evaluation-background`**: Runs evaluations for blueprints proposed in a PR; output lands under `live/pr-evals/[prNumber]/...`. +- **`execute-api-evaluation-background`**: Runs evaluations triggered by the public HTTP API. 
+- **`execute-sandbox-pipeline-background`**: Runs evaluations triggered by the Sandbox Studio (also re-used by the workshop "retry" path); output lands under `live/sandbox/runs/...`. +- **`execute-story-quick-run-background`**: Backs the story / quick-run UX with low-latency single-model evaluations. +- **`generate-pairs-background`**: Populates the pairwise comparison queue (companion to `populatePairwiseQueue` in `pairwise-task-queue-service`). +- **`cleanup-sandbox-runs`**: The daily-cron target listed above. +- **`factcheck`**: Supporting service used by editorial / annotation flows. +- **`demo-external-evaluator`**, **`debug-env`**: Diagnostic / demo endpoints; not part of the production data path. ### Interactive Workflow Components + These components support the developer and sandbox environments. -- **`cli: run-config`**: The main command-line tool for developers. By default, it runs the evaluation pipeline for a local or GitHub-based blueprint and saves the results to the local `/.results/` directory, updating only the per-config summary. When used with the `--update-summaries` flag, it additionally rebuilds platform-wide summaries (homepage leaderboards, model summaries, etc.) using the same logic as the backfill process. -- **Sandbox UI & Backend API**: A full-stack feature within the Next.js app that provides an interactive, browser-based IDE for blueprint creation. It has its own set of API endpoints (`/api/sandbox`, `/api/github`) and a dedicated background function (`fn: execute-sandbox-pipeline-background`) for running evaluations. + +- **`cli: run-config`**: The main command-line tool for developers. By default, it runs the evaluation pipeline for a local or GitHub-based blueprint and saves the results to the local `/.results/` directory, updating only the per-config summary. When used with the `--update-summaries` flag, it additionally rebuilds platform-wide summaries (homepage leaderboards, model summaries, etc.) 
using the same logic as the backfill process. +- **Sandbox UI & Backend API**: A full-stack feature within the Next.js app that provides an interactive, browser-based IDE for blueprint creation. It has its own set of API endpoints (`/api/sandbox`, `/api/github`) and a dedicated background route handler (`/api/internal/execute-sandbox-pipeline-background`) for running evaluations. ## 3. Deep Dive: The Core Evaluation Pipeline @@ -218,7 +259,7 @@ graph TD; subgraph "Path B: Conceptual Check" O{"Is point text-based?"} P["Prompt Judge LLM
(with response + point)"]:::llm; - Q["Judge classifies on 5-point scale"]:::eval; + Q["Judge classifies on the active ordinal scale
(10-class experimental, FORCE_EXPERIMENTAL=true;
see METHODOLOGY.md §classification scale)"]:::eval; R["Map classification to score"]:::eval; S[("Point Score: 0.0-1.0")]:::score; O -- Yes --> P --> Q --> R --> S; @@ -250,7 +291,7 @@ graph TD; ## 4. Key Architectural Concepts -- **Separation of Raw Data and Summaries**: The core pipeline still produces a monolithic `*_comparison.json` for complete fidelity, *but* the UI now relies on the artefact bundle (`core.json` + `responses/` + `coverage/`) for 95 % of use-cases. High-level metrics like the **Hybrid Score** are *not* in either raw form; they are computed afterward by `summaryCalculationUtils.ts` and saved into summary files (e.g. `homepage_summary.json`). -- **Consistency via Shared Services**: By using the same core services (`comparison-pipeline-service`, `storageService`, etc.) for both the automated Netlify workflow and the manual CLI/Sandbox workflow, the platform ensures that an evaluation produces the same results regardless of how it was triggered. -- **Idempotent, Content-Hashed Runs**: The automated workflow uses a hash of a blueprint's content (including its fully resolved model list) as its `runLabel`. This ensures that identical blueprints are not re-run unnecessarily, saving significant computational resources. -- **Graceful Fallback & Progressive Enhancement**: The Sandbox is a prime example of this design principle. It is fully functional for anonymous users, with all work saved to local storage. Authenticating with GitHub progressively enhances the experience by enabling cloud-based file management and the ability to contribute back to the public commons. \ No newline at end of file +- **Separation of Raw Data and Summaries**: The core pipeline still produces a monolithic `*_comparison.json` for complete fidelity, *but* the UI now relies on the artefact bundle (`core.json` + `responses/` + `coverage/`) for 95 % of use-cases. 
High-level metrics like the **Hybrid Score** are *not* in either raw form; they are computed afterward by `summaryCalculationUtils.ts` and saved into summary files (e.g. `homepage_summary.json`). +- **Consistency via Shared Services**: By using the same core services (`comparison-pipeline-service`, `storageService`, etc.) for both the automated cron-driven workflow (GitHub Actions → Railway) and the manual CLI/Sandbox workflow, the platform ensures that an evaluation produces the same results regardless of how it was triggered. +- **Idempotent, Content-Hashed Runs**: The automated workflow uses a hash of a blueprint's content (including its fully resolved model list) as its `runLabel`. This ensures that identical blueprints are not re-run unnecessarily, saving significant computational resources. +- **Graceful Fallback & Progressive Enhancement**: The Sandbox is a prime example of this design principle. It is fully functional for anonymous users, with all work saved to local storage. Authenticating with GitHub progressively enhances the experience by enabling cloud-based file management and the ability to contribute back to the public commons. From 67477eb58770e015ebf3d53d8a114df095e5cd2e Mon Sep 17 00:00:00 2001 From: Ken Ojibe Date: Tue, 5 May 2026 09:23:44 -0400 Subject: [PATCH 2/8] docs: Update methodology for judge evaluation approaches, clarify default judges, and enhance classification scales with new 10-class experimental mapping. --- docs/METHODOLOGY.md | 58 +++++++++++++++++++++++++++++---------------- 1 file changed, 38 insertions(+), 20 deletions(-) diff --git a/docs/METHODOLOGY.md b/docs/METHODOLOGY.md index 92f362f1..f0ef8bf6 100644 --- a/docs/METHODOLOGY.md +++ b/docs/METHODOLOGY.md @@ -54,35 +54,53 @@ evaluationConfig: - { model: 'openrouter:google/gemini-pro-1.5', approach: 'prompt-aware' } ``` -If no `judges` are specified, the system uses a default set designed to provide a balanced evaluation: -1. 
**`prompt-aware` approach (default):** A judge sees the response, the criterion, and the original user prompt. This allows the judge to consider the criterion in the context of the user's request. -2. **`holistic` approach (default):** A judge sees the response, the criterion, the user prompt, and *all other criteria* in the rubric. This provides the richest context, allowing the judge to assess the point as part of a whole, which can be useful for identifying redundancy or assessing trade-offs. +If no `judges` are specified, the system uses `DEFAULT_JUDGES` (defined in `src/cli/evaluators/llm-coverage-evaluator.ts`), a fixed set of **three holistic judges across three model families**, chosen for cross-family diversity to mitigate single-vendor bias: -The **`standard`** approach (criterion-only) remains supported and can be configured explicitly if desired, but it is not part of the current default set. A backup judge is also used to improve robustness when primary judges fail. +1. `holistic-gemini-2-5-flash` → `openrouter:google/gemini-2.5-flash` +2. `holistic-gpt-4-1-mini` → `openrouter:openai/gpt-4.1-mini` +3. `holistic-claude-haiku-4-5` → `openrouter:anthropic/claude-haiku-4.5` + +All three default judges use the **`holistic`** approach (each judge sees the response, the criterion, the user prompt, and *all other criteria* in the rubric). All three of the platform's approaches, **`standard`** (criterion-only), **`prompt-aware`** (response + criterion + prompt), and **`holistic`** (response + criterion + prompt + full rubric), remain supported and can be mixed in custom blueprint judge configs. The `standard` and `prompt-aware` approaches are simply not part of the current default set, which prioritizes cross-family consensus over approach diversity. A backup judge (Claude Haiku 4.5, holistic) is also used to improve robustness when primary judges fail. 
#### Judge Prompting and Classification A specific, structured prompt is used to elicit a judgment for each individual point in the rubric, tailored to the judge's `approach`. * **System Prompt Persona**: The judge is instructed to act as an "expert evaluator and examiner" and to adhere strictly to the task and output format. -* **Task Definition**: The judge is presented with the model's response (``) and a single criterion (``) and is asked to classify the degree to which the criterion is present in the text according to a 5-point scale. Depending on the `approach`, the original `` and the full `` may also be included for context. -* **The 5-Point Scale**: The judge must choose one of the following five classes: - * `CLASS_UNMET`: The criterion is not met. - * `CLASS_PARTIALLY_MET`: The criterion is partially met. - * `CLASS_MODERATELY_MET`: The criterion is moderately met. - * `CLASS_MAJORLY_MET`: The criterion is mostly met. - * `CLASS_EXACTLY_MET`: The criterion is fully met. +* **Task Definition**: The judge is presented with the model's response (``) and a single criterion (``) and is asked to classify the degree to which the criterion is present in the text according to an ordinal scale. Depending on the `approach`, the original `` and the full `` may also be included for context. +* **The Classification Scales**: Two scales are defined in the codebase. The platform currently forces use of the **experimental 10-class scale** via a global override (`FORCE_EXPERIMENTAL = true` at the top of `src/cli/evaluators/llm-coverage-evaluator.ts`), which can be flipped to opt blueprints into the older 5-class scale on a per-blueprint basis (`evaluationConfig['llm-coverage'].useExperimentalScale`). + + **Production today (10-class experimental, `EXPERIMENTAL_CLASSIFICATION_SCALE`):** + * `CLASS_UTTERLY_UNMET` — The criterion is so completely absent that the content in fact contradicts or contravenes it. + * `CLASS_UNMET` — The criterion is not met. 
+ * `CLASS_TRACE` — Only a trace or hint of the criterion appears. + * `CLASS_SLIGHT` — A slight presence of the criterion is detectable. + * `CLASS_PARTIAL` — Partial fulfillment; important elements are missing. + * `CLASS_MODERATE` — Moderate fulfillment; balanced presence with notable gaps. + * `CLASS_SUBSTANTIAL` — Substantial fulfillment; most key aspects are present. + * `CLASS_MAJOR` — Major fulfillment; minor omissions remain. + * `CLASS_VERY_NEARLY` — Very nearly fully met; only negligible details missing. + * `CLASS_EXACT` — Exactly and fully meets the criterion. + + **Legacy 5-class scale (`CLASSIFICATION_SCALE`, available via opt-out):** + * `CLASS_UNMET` / `CLASS_PARTIALLY_MET` / `CLASS_MODERATELY_MET` / `CLASS_MAJORLY_MET` / `CLASS_EXACTLY_MET`. #### Mathematical Scoring of Rubric Points The judge's categorical classification is mapped to a quantitative score. -* **Numerical Mapping**: The classification is mapped to a linear, equidistant numerical scale: - * `CLASS_UNMET` -> **0.0** - * `CLASS_PARTIALLY_MET` -> **0.25** - * `CLASS_MODERATELY_MET` -> **0.50** - * `CLASS_MAJORLY_MET` -> **0.75** - * `CLASS_EXACTLY_MET` -> **1.0** +* **Numerical Mapping (10-class experimental scale, current default):** The mapping is **deliberately non-linear at the unmet end**, with a tiny gap separating contradictory content from merely-absent content: + * `CLASS_UTTERLY_UNMET` → **0.000** + * `CLASS_UNMET` → **0.001** + * `CLASS_TRACE` → **0.125** + * `CLASS_SLIGHT` → **0.250** + * `CLASS_PARTIAL` → **0.375** + * `CLASS_MODERATE` → **0.500** + * `CLASS_SUBSTANTIAL` → **0.625** + * `CLASS_MAJOR` → **0.750** + * `CLASS_VERY_NEARLY` → **0.875** + * `CLASS_EXACT` → **1.000** +* **Numerical Mapping (5-class legacy scale, opt-in):** Linear / equidistant: `CLASS_UNMET` → 0.0, `CLASS_PARTIALLY_MET` → 0.25, `CLASS_MODERATELY_MET` → 0.5, `CLASS_MAJORLY_MET` → 0.75, `CLASS_EXACTLY_MET` → 1.0. 
* **Score Inversion (`should_not`)**: For criteria that penalize undesirable content, the score is inverted. For an original score $S_{\text{orig}}$, the final score is $S_{\text{final}} = 1 - S_{\text{orig}}$. * **Weighted Aggregation**: A blueprint can assign a `multiplier` (weight) to each point. The final rubric score for a model on a prompt (`avgCoverageExtent`) is the weighted average of all point scores. For $N$ points with score $S_i$ and weight $w_i$: ```math @@ -112,7 +130,7 @@ When multiple judges evaluate the same response, the platform quantifies the deg ##### Krippendorff's Alpha Calculation * **Purpose**: Krippendorff's α measures the consistency of judgments across all judges evaluating a model's response. Values range from 0 (no agreement beyond chance) to 1 (perfect agreement). -* **Formula**: For ordinal data (our 0.0, 0.25, 0.50, 0.75, 1.0 scale), α is calculated as: +* **Formula**: For ordinal data (whichever scale is active — the 10-class experimental mapping today, see "Mathematical Scoring of Rubric Points" above), α is calculated as: ```math \alpha = 1 - \frac{D_o}{D_e} ``` @@ -191,7 +209,7 @@ While the platform aims for complete judge coverage on all criteria, individual **The Backup Judge Mechanism:** -To improve robustness, the system employs a backup judge (Claude 3.5 Haiku) that activates when primary judges fail: +To improve robustness, the system employs a backup judge (Claude Haiku 4.5, `openrouter:anthropic/claude-haiku-4.5`, holistic approach — defined as `DEFAULT_BACKUP_JUDGE` in `src/cli/evaluators/llm-coverage-evaluator.ts`) that activates when primary judges fail: * **Trigger Condition**: Backup judge only runs when `successfulJudgements < totalPrimaryJudges` (i.e., at least one primary judge failed) * **Not Used with Custom Judges**: To preserve user-configured judge sets, backup judge is disabled when custom `judges` are specified in the blueprint @@ -528,7 +546,7 @@ Weval's methodology is designed to be robust, but like any 
quantitative system, The validity of Weval's metrics rests on these core assumptions: * **Assumption of Appropriate Weighting in Hybrid Score**: The Hybrid Score currently uses 0% similarity and 100% coverage (coverage-only). This assumes that rubric adherence is the dominant signal of model quality and that semantic similarity to an ideal response (when available) adds no additional information. While this explicit choice is more transparent than an implicit weighting, it may not be optimal for all evaluation contexts, and future versions may offer configurable weights. -* **Assumption of Linearity in Score Mapping**: The 5-point categorical scale from the LLM judge is mapped to a linear, equidistant numerical scale. This assumes the qualitative gap between "Absent" and "Slightly Present" is the same as between "Majorly Present" and "Fully Present," which may not be perceptually true. +* **Assumption of Near-Linearity in Score Mapping**: The categorical scale from the LLM judge is mapped to a numerical scale that is **partially non-linear** in the current 10-class experimental default. The platform deliberately separates `CLASS_UTTERLY_UNMET` (0.000) from `CLASS_UNMET` (0.001) by a tiny gap to distinguish contradiction from mere absence, and uses uniform 0.125 steps from `CLASS_TRACE` (0.125) upward. This mitigates but does not solve the perceptual-gap problem: the uniform 0.125 steps in the middle of the scale may not match how judges actually perceive the distance between, say, "Moderate" and "Substantial". The legacy 5-class scale (selectable via `useExperimentalScale: false` only once the global `FORCE_EXPERIMENTAL` override is turned off) is fully linear and even more susceptible to this concern. A complete fix would require either empirical calibration of the mapping against human ratings or moving to continuous 0–1 judge outputs. * **Assumption of Criterion Independence**: The rubric score (`avgCoverageExtent`) is a weighted average that treats each criterion as an independent variable. 
It does not account for potential correlations between criteria (e.g., "clarity" and "conciseness"). * **Assumption of Effective Bias Reduction via Anonymization**: The model anonymization system assumes that removing real model names and providers significantly reduces analyst LLM bias, while preserving maker-level information provides meaningful comparative insights. This assumes that brand bias is primarily driven by explicit name recognition rather than subtle patterns in response style that might persist even when anonymized. From 7da0017cfd92254e81d9e900637dc3c3e8cb00be Mon Sep 17 00:00:00 2001 From: Ken Ojibe Date: Tue, 5 May 2026 09:31:18 -0400 Subject: [PATCH 3/8] docs: Clarify the use of the experimental 10-class classification scale and update storage persistence details for judge agreement in evaluation configurations. --- docs/BLUEPRINT_FORMAT.md | 21 +++++++++++++-------- docs/INTER_AGREEMENT_PLAN.md | 2 +- 2 files changed, 14 insertions(+), 9 deletions(-) diff --git a/docs/BLUEPRINT_FORMAT.md b/docs/BLUEPRINT_FORMAT.md index 85d12927..2c579a9c 100644 --- a/docs/BLUEPRINT_FORMAT.md +++ b/docs/BLUEPRINT_FORMAT.md @@ -117,7 +117,7 @@ The `evaluationConfig` field allows you to customize how evaluations are perform evaluationConfig: llm-coverage: judges: [...] # Custom judge configuration - useExperimentalScale: true # Use 9-point scale instead of 5-point + useExperimentalScale: true # Opt into the 10-class non-linear scale instead of the legacy 5-class linear scale (note: currently forced on globally; see "Classification Scale" below) ``` #### LLM Coverage Evaluation Options @@ -125,7 +125,7 @@ evaluationConfig: | Field | Type | Description | |---|---|---| | `judges` | `Judge[]` | **(Optional)** Custom judge configuration. If omitted, uses the default judges. Each judge is an object with `id`, `model`, and `approach` fields. See below for details. 
| -| `useExperimentalScale` | `boolean` | **(Optional)** If `true`, uses the experimental 9-point classification scale (0.0, 0.001, 0.125, 0.25, 0.375, 0.5, 0.625, 0.75, 0.875, 1.0) instead of the default 5-point scale (0.0, 0.25, 0.5, 0.75, 1.0). This provides finer granularity in rubric scoring. | +| `useExperimentalScale` | `boolean` | **(Optional)** If `true`, uses the experimental **10-class** classification scale (0.0, 0.001, 0.125, 0.25, 0.375, 0.5, 0.625, 0.75, 0.875, 1.0) instead of the legacy 5-class scale (0.0, 0.25, 0.5, 0.75, 1.0). This provides finer granularity in rubric scoring, with a deliberately non-linear gap between "utterly unmet / contradictory" (0.0) and "merely unmet" (0.001). **Note:** the experimental scale is currently forced on globally via `FORCE_EXPERIMENTAL = true` in `src/cli/evaluators/llm-coverage-evaluator.ts`, so this flag has no effect at the moment — both `true` and (omitted/`false`) result in the 10-class scale being used. See `docs/METHODOLOGY.md` for details. | | `judgeModels` | `string[]` | **(Deprecated)** Legacy field for backwards compatibility. Use `judges` instead. | | `judgeMode` | `'failover' \| 'consensus'` | **(Deprecated)** Legacy field for backwards compatibility. The system now always uses consensus mode across all configured judges. 
| @@ -141,17 +141,22 @@ Each judge in the `judges` array is an object with the following fields: **Default Judges:** -If no custom judges are specified, the system uses these default judges: +If no custom judges are specified, the system uses these default judges (defined as `DEFAULT_JUDGES` in `src/cli/evaluators/llm-coverage-evaluator.ts`): ```yaml judges: - - id: 'holistic-qwen3-30b-a3b-instruct-2507' - model: 'openrouter:qwen/qwen3-30b-a3b-instruct-2507' + - id: 'holistic-gemini-2-5-flash' + model: 'openrouter:google/gemini-2.5-flash' approach: 'holistic' - - id: 'holistic-openai-gpt-oss-120b' - model: 'openrouter:openai/gpt-oss-120b' + - id: 'holistic-gpt-4-1-mini' + model: 'openrouter:openai/gpt-4.1-mini' + approach: 'holistic' + - id: 'holistic-claude-haiku-4-5' + model: 'openrouter:anthropic/claude-haiku-4.5' approach: 'holistic' ``` +The defaults are chosen for **cross-family consensus** (one judge from each of Google, OpenAI, and Anthropic) over **approach diversity** (all three default judges use the `holistic` approach). To get approach diversity, configure custom judges explicitly. + **Example with Custom Judges:** ```yaml @@ -178,7 +183,7 @@ evaluationConfig: **Backup Judge:** -If all configured judges fail to return a valid assessment, the system automatically attempts to use a backup judge (`anthropic:claude-3.5-haiku` with `holistic` approach) to ensure evaluation can complete. This backup is only used when custom judges are not configured. +If any of the primary default judges fail to return a valid assessment, the system automatically attempts to use a backup judge (`openrouter:anthropic/claude-haiku-4.5` with `holistic` approach — defined as `DEFAULT_BACKUP_JUDGE` in `src/cli/evaluators/llm-coverage-evaluator.ts`) to ensure evaluation can complete. This backup is only used when **custom `judges` are not configured** — to preserve user intent, the backup is suppressed when a blueprint specifies its own judges. 
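The backup-judge rule in the added paragraph above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in `src/cli/evaluators/llm-coverage-evaluator.ts`: the `Judge` shape follows the `id`/`model`/`approach` fields the docs describe, while the `Assessment` type, the `pickBackupJudge` helper, and the backup judge's `id` are hypothetical names introduced here.

```typescript
// Judge shape as described in the docs: id, model, approach.
interface Judge {
  id: string;
  model: string;
  approach: string;
}

// Illustrative copy of the documented backup judge.
// The model and approach come from the docs; the id is a hypothetical label.
const DEFAULT_BACKUP_JUDGE: Judge = {
  id: "backup-claude-haiku-4-5",
  model: "openrouter:anthropic/claude-haiku-4.5",
  approach: "holistic",
};

// Hypothetical result type: a score in [0, 1], or null when a judge
// failed to return a valid assessment.
type Assessment = number | null;

// Returns the backup judge only when the blueprint did NOT configure
// custom judges and at least one primary default judge failed.
function pickBackupJudge(
  customJudges: Judge[] | undefined,
  primaryResults: Assessment[],
): Judge | null {
  const usingDefaults = customJudges === undefined || customJudges.length === 0;
  const anyPrimaryFailed = primaryResults.some((result) => result === null);
  return usingDefaults && anyPrimaryFailed ? DEFAULT_BACKUP_JUDGE : null;
}
```

Under this reading, a blueprint that lists its own `judges` never receives the backup, even if every one of those judges fails, which matches the stated goal of preserving user intent.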
### Model Configuration diff --git a/docs/INTER_AGREEMENT_PLAN.md b/docs/INTER_AGREEMENT_PLAN.md index 97fea18f..4d0ee1f1 100644 --- a/docs/INTER_AGREEMENT_PLAN.md +++ b/docs/INTER_AGREEMENT_PLAN.md @@ -392,7 +392,7 @@ llmCoverageScores[promptData.promptId][modelId] = { No additional changes needed! The `judgeAgreement` field will be: - Stored in result JSON files (automatic via type extension) - Available in API responses (automatic via type extension) -- Persisted to Netlify Blobs (automatic via serialization) +- Persisted via the storage abstraction in `src/lib/storageService.ts` (S3 in production, local FS in development; automatic via serialization) --- From 1d2ff7b8897c3bf6c48236ee99ea330fef9041de Mon Sep 17 00:00:00 2001 From: Nnamdi Kenneth Ojibe Date: Tue, 5 May 2026 18:06:26 -0400 Subject: [PATCH 4/8] Refine evaluation paths in ARCHITECTURE.md Clarified the description of evaluation paths in the Weval architecture. --- docs/ARCHITECTURE.md | 9 ++++----- 1 file changed, 4 insertions(+), 5 deletions(-) diff --git a/docs/ARCHITECTURE.md b/docs/ARCHITECTURE.md index 752e10bb..c5cf6f58 100644 --- a/docs/ARCHITECTURE.md +++ b/docs/ARCHITECTURE.md @@ -2,11 +2,10 @@ This document provides a comprehensive overview of the Weval architecture, detailing the distinct workflows that power the platform and the core components that drive evaluation. -Weval runs evaluations through two main paths: scheduled public evaluations and interactive developer/sandbox runs. Both paths use the same core evaluation pipeline, but differ in how runs are triggered, stored, and surfaced in the UI. - - -1. **The Automated "Public Commons" Workflow**: A continuous integration pipeline that automatically evaluates community-contributed blueprints and updates the public `weval.org` website. -2. **The Interactive "Developer & Sandbox" Workflow**: A set of tools for developers and prompt engineers to create, test, and iterate on blueprints either locally or in a web-based environment. 
+Weval runs evaluations through two main paths: + +1. **The Automated "Public Commons" Workflow:** A continuous integration pipeline that automatically evaluates community-contributed blueprints and updates the public weval.org website. +2. **The Interactive "Developer & Sandbox"** Workflow: A set of tools for developers and prompt engineers to create, test, and iterate on blueprints either locally or in a web-based environment. ## 1. High-Level Workflows From 0dd742ec17b396892d8754ec31b90e5d6d3792c9 Mon Sep 17 00:00:00 2001 From: Ken Ojibe Date: Tue, 5 May 2026 18:17:44 -0400 Subject: [PATCH 5/8] docs: fix factual inaccuracies in ARCHITECTURE.md MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - search_index.json → search-index.json (matches SEARCH_INDEX_FILENAME constant in storageService.ts) - Clarify that calculateHybridScore is defined in calculationUtils.ts; summaryCalculationUtils.ts orchestrates it, not defines it Co-Authored-By: Claude Sonnet 4.6 --- docs/ARCHITECTURE.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/ARCHITECTURE.md b/docs/ARCHITECTURE.md index c5cf6f58..26831d81 100644 --- a/docs/ARCHITECTURE.md +++ b/docs/ARCHITECTURE.md @@ -89,7 +89,7 @@ These are the foundational services used across all workflows, ensuring evaluati - **`comparison-pipeline-service.ts`**: The central orchestrator that manages a single evaluation run. It takes a configuration, generates model responses, and calls the necessary evaluators. - **`llm-coverage-evaluator.ts`**: Implements the rubric-based scoring logic. It uses "judge" LLMs to assess responses against the `should` and `should_not` criteria defined in a blueprint. It supports complex rubrics including alternative paths (OR logic), where the best-performing path is selected. 
- **`storageService.ts`**: A critical abstraction layer that handles all file I/O, allowing the system to seamlessly read and write from either the local filesystem or a cloud provider like AWS S3. -- **`summaryCalculationUtils.ts`**: Contains the post-processing logic for calculating aggregate metrics like the **Hybrid Score**, model performance drift, and leaderboard rankings. This service operates on completed raw result files. +- **`summaryCalculationUtils.ts`**: Orchestrates post-processing after a run completes — computing model performance drift and leaderboard rankings, and calling `calculateHybridScore` (defined in `calculationUtils.ts`) to produce the **Hybrid Score**. Operates on completed raw result files. ### Storage Architecture (The `live/` Directory) @@ -112,7 +112,7 @@ graph TD; B --> B1["homepage_summary.json"]; B --> B2["latest_runs_summary.json"]; - B --> B3["search_index.json"]; + B --> B3["search-index.json"]; C --> C1["[config-id]/"]; C1 --> C2["[run-file].json"]; @@ -141,7 +141,7 @@ graph TD; - **`live/aggregates/`**: Contains all global, cross-cutting summary files. - `homepage_summary.json`: The main manifest for the website's homepage. - `latest_runs_summary.json`: A list of the 50 most recent evaluation runs. - - `search_index.json`: The pre-compiled index for the website's search functionality. + - `search-index.json`: The pre-compiled index for the website's search functionality. - **`live/blueprints/`**: Contains the core evaluation data, organized by each blueprint's unique ID. Each subdirectory contains the raw JSON outputs for every run of that blueprint, plus a `summary.json` of its historical performance. - **`live/models/`**: Contains data aggregated on a per-model basis. - `summaries/`: Detailed performance breakdowns for each model across all blueprints. @@ -290,7 +290,7 @@ graph TD; ## 4. 
Key Architectural Concepts -- **Separation of Raw Data and Summaries**: The core pipeline still produces a monolithic `*_comparison.json` for complete fidelity, *but* the UI now relies on the artefact bundle (`core.json` + `responses/` + `coverage/`) for 95 % of use-cases. High-level metrics like the **Hybrid Score** are *not* in either raw form; they are computed afterward by `summaryCalculationUtils.ts` and saved into summary files (e.g. `homepage_summary.json`). +- **Separation of Raw Data and Summaries**: The core pipeline still produces a monolithic `*_comparison.json` for complete fidelity, *but* the UI now relies on the artefact bundle (`core.json` + `responses/` + `coverage/`) for 95 % of use-cases. High-level metrics like the **Hybrid Score** are *not* in either raw form; they are computed afterward by `calculateHybridScore` in `calculationUtils.ts` (called via `summaryCalculationUtils.ts`) and saved into summary files (e.g. `homepage_summary.json`). - **Consistency via Shared Services**: By using the same core services (`comparison-pipeline-service`, `storageService`, etc.) for both the automated cron-driven workflow (GitHub Actions → Railway) and the manual CLI/Sandbox workflow, the platform ensures that an evaluation produces the same results regardless of how it was triggered. - **Idempotent, Content-Hashed Runs**: The automated workflow uses a hash of a blueprint's content (including its fully resolved model list) as its `runLabel`. This ensures that identical blueprints are not re-run unnecessarily, saving significant computational resources. - **Graceful Fallback & Progressive Enhancement**: The Sandbox is a prime example of this design principle. It is fully functional for anonymous users, with all work saved to local storage. Authenticating with GitHub progressively enhances the experience by enabling cloud-based file management and the ability to contribute back to the public commons. 
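The "Idempotent, Content-Hashed Runs" bullet above could be illustrated with a sketch like the one below. It is an assumption-laden example rather than Weval's implementation: the `BlueprintContent` fields, the `canonicalize`, `computeRunLabel`, and `shouldRun` helpers, the choice of SHA-256, and the 16-character truncation are all hypothetical. The docs only state that the `runLabel` is a hash of the blueprint's content, including its fully resolved model list.

```typescript
import { createHash } from "node:crypto";

// Hypothetical shape: only fields that should influence run identity.
interface BlueprintContent {
  id: string;
  prompts: string[];
  models: string[]; // fully resolved model list
}

// Serialize with sorted object keys so key order never changes the hash.
// Array order is preserved, so reordering models yields a different label.
function canonicalize(value: unknown): string {
  if (Array.isArray(value)) {
    return `[${value.map(canonicalize).join(",")}]`;
  }
  if (value !== null && typeof value === "object") {
    const entries = Object.entries(value as Record<string, unknown>)
      .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0))
      .map(([key, inner]) => `${JSON.stringify(key)}:${canonicalize(inner)}`);
    return `{${entries.join(",")}}`;
  }
  return JSON.stringify(value);
}

// Assumed hashing scheme: SHA-256 over the canonical form, truncated to 16 hex chars.
function computeRunLabel(blueprint: BlueprintContent): string {
  return createHash("sha256")
    .update(canonicalize(blueprint))
    .digest("hex")
    .slice(0, 16);
}

// A run is skipped when a result with the same label already exists.
function shouldRun(blueprint: BlueprintContent, existingRunLabels: Set<string>): boolean {
  return !existingRunLabels.has(computeRunLabel(blueprint));
}
```

Because the model list is part of the hashed content, changing which models a blueprint resolves to produces a new label and therefore a fresh run, while an unchanged blueprint is skipped.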
From 63e14e905f1e8036f3aaf257bccf3f5681afe028 Mon Sep 17 00:00:00 2001 From: Ken Ojibe Date: Tue, 5 May 2026 18:18:14 -0400 Subject: [PATCH 6/8] docs: remove redundant prose from ARCHITECTURE.md Cut sentences that repeated adjacent headings or duplicated content already stated in Section 4 (workflow descriptions under diagram subheadings, filler section intros, core.json lazy-loading paragraph). Co-Authored-By: Claude Sonnet 4.6 --- docs/ARCHITECTURE.md | 14 -------------- 1 file changed, 14 deletions(-) diff --git a/docs/ARCHITECTURE.md b/docs/ARCHITECTURE.md index 26831d81..73bd1b02 100644 --- a/docs/ARCHITECTURE.md +++ b/docs/ARCHITECTURE.md @@ -13,8 +13,6 @@ The following diagrams illustrate the two main operational flows of the platform ### The Automated "Public Commons" Workflow -This workflow describes how community contributions are automatically evaluated and published. - ```mermaid graph TD subgraph "Contribution" @@ -45,8 +43,6 @@ graph TD ### The Interactive "Developer & Sandbox" Workflow -This workflow shows the parallel paths for local CLI development and web-based Sandbox use. Both are powered by the same core evaluation engine. - ```mermaid graph LR subgraph "Path A: Local CLI Development" @@ -80,12 +76,8 @@ graph LR ## 2. Component Deep Dive -Each component in the diagrams above has a specific role in the ecosystem. - ### Core Services (Shared Logic) -These are the foundational services used across all workflows, ensuring evaluation consistency. - - **`comparison-pipeline-service.ts`**: The central orchestrator that manages a single evaluation run. It takes a configuration, generates model responses, and calls the necessary evaluators. - **`llm-coverage-evaluator.ts`**: Implements the rubric-based scoring logic. It uses "judge" LLMs to assess responses against the `should` and `should_not` criteria defined in a blueprint. It supports complex rubrics including alternative paths (OR logic), where the best-performing path is selected. 
- **`storageService.ts`**: A critical abstraction layer that handles all file I/O, allowing the system to seamlessly read and write from either the local filesystem or a cloud provider like AWS S3. @@ -162,8 +154,6 @@ graph TD; - `histories/` → per-prompt × model full conversation histories (`histories/[promptId]/[modelId].json`). - *(Legacy)* `[runLabel]_[timestamp]_comparison.json` – the original monolithic file is still generated for backward compatibility but will be phased out. - The application fetches `core.json` via `/api/comparison/.../core` to render the page instantly. Detailed data is lazy-loaded on demand from `responses/` and `coverage/` paths, with automatic fallback to the legacy monolithic file when artefacts are missing. - #### Fixtures (Optional deterministic responses) - When the CLI is invoked with `--fixtures`, the generation stage consults a fixtures file (YAML/JSON) to select deterministic candidate responses for specific prompt×model pairs. @@ -176,8 +166,6 @@ graph TD; ### Automated Workflow Components -These components power the public `weval.org` platform. - - **GitHub Actions cron — Weekly evaluation** (`weekly-eval-check.yml`): The actual scheduler. Runs every Sunday at 00:00 UTC and POSTs an authenticated request to `${RAILWAY_APP_URL}/api/internal/fetch-and-schedule-evals` with a configurable batch size. - **GitHub Actions cron — Daily sandbox cleanup** (`cleanup-sandbox-runs.yml`): Runs every day at 02:00 UTC and POSTs an authenticated request to `${RAILWAY_APP_URL}/api/internal/cleanup-sandbox-runs`, which deletes objects under `live/sandbox/runs/` older than 7 days (`CLEANUP_AGE_DAYS = 7`). - **`/api/internal/fetch-and-schedule-evals`** (`fetch-and-schedule-evals`): A Next.js API route hosted on Railway. 
It scans the `weval/configs` repository for new or updated blueprints with the `_periodic` tag and triggers evaluation runs for them by calling `/api/internal/execute-evaluation-background` over authenticated HTTP (via `callBackgroundFunction`). @@ -198,8 +186,6 @@ Beyond the two canonical public-commons routes above, the codebase ships several ### Interactive Workflow Components -These components support the developer and sandbox environments. - - **`cli: run-config`**: The main command-line tool for developers. By default, it runs the evaluation pipeline for a local or GitHub-based blueprint and saves the results to the local `/.results/` directory, updating only the per-config summary. When used with the `--update-summaries` flag, it additionally rebuilds platform-wide summaries (homepage leaderboards, model summaries, etc.) using the same logic as the backfill process. - **Sandbox UI & Backend API**: A full-stack feature within the Next.js app that provides an interactive, browser-based IDE for blueprint creation. It has its own set of API endpoints (`/api/sandbox`, `/api/github`) and a dedicated background route handler (`/api/internal/execute-sandbox-pipeline-background`) for running evaluations. From 9d63d664f3805882205c03754997fb6b8a7a3ae4 Mon Sep 17 00:00:00 2001 From: Ken Ojibe Date: Wed, 6 May 2026 06:29:53 -0400 Subject: [PATCH 7/8] docs: correct fire-and-forget description for background route handlers MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit callBackgroundFunction awaits a response with a 30s timeout — not truly fire-and-forget. For long evaluations the caller's connection drops but Railway continues running the handler to completion. Replace fire-and-forget language with accurate description throughout. 
Co-Authored-By: Claude Sonnet 4.6 --- docs/ARCHITECTURE.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/ARCHITECTURE.md b/docs/ARCHITECTURE.md index 73bd1b02..056f401e 100644 --- a/docs/ARCHITECTURE.md +++ b/docs/ARCHITECTURE.md @@ -54,7 +54,7 @@ graph LR subgraph "Path B: Web Sandbox" C[("fa:fa-user Prompt Engineer")] --> D["fa:fa-flask Sandbox UI"] D -- Start Run --> G["Backend API
(/api/sandbox/run)"]
-        G -- Fire-and-forget HTTP --> H["fa:fa-cogs route: execute-sandbox-pipeline-background<br/>(same Railway service)"]
+        G -- Authenticated HTTP --> H["fa:fa-cogs route: execute-sandbox-pipeline-background<br/>(same Railway service)"]
         H -- Runs Core Pipeline --> E
         H -- Writes to --> I[("fa:fa-aws S3 Bucket<br/>live/sandbox/runs/")]
         D -- Polls Status --> G2["Status API
(/api/sandbox/status/[sandboxId])"] @@ -169,11 +169,11 @@ graph TD; - **GitHub Actions cron — Weekly evaluation** (`weekly-eval-check.yml`): The actual scheduler. Runs every Sunday at 00:00 UTC and POSTs an authenticated request to `${RAILWAY_APP_URL}/api/internal/fetch-and-schedule-evals` with a configurable batch size. - **GitHub Actions cron — Daily sandbox cleanup** (`cleanup-sandbox-runs.yml`): Runs every day at 02:00 UTC and POSTs an authenticated request to `${RAILWAY_APP_URL}/api/internal/cleanup-sandbox-runs`, which deletes objects under `live/sandbox/runs/` older than 7 days (`CLEANUP_AGE_DAYS = 7`). - **`/api/internal/fetch-and-schedule-evals`** (`fetch-and-schedule-evals`): A Next.js API route hosted on Railway. It scans the `weval/configs` repository for new or updated blueprints with the `_periodic` tag and triggers evaluation runs for them by calling `/api/internal/execute-evaluation-background` over authenticated HTTP (via `callBackgroundFunction`). -- **`/api/internal/execute-evaluation-background`** (`execute-evaluation-background`): A long-running Next.js API route handler — also on Railway — that performs the actual evaluation for the public site. It calls the core services and is responsible for creating both the raw result file and updating the aggregate summary files in S3. Throughout this document, "background function" refers to these fire-and-forget HTTP-triggered routes; they are **not** Netlify Functions, but the team's term for long-running internal route handlers. +- **`/api/internal/execute-evaluation-background`** (`execute-evaluation-background`): A long-running Next.js API route handler — also on Railway — that performs the actual evaluation for the public site. It calls the core services and is responsible for creating both the raw result file and updating the aggregate summary files in S3. 
The caller (`callBackgroundFunction`) awaits a response up to a 30-second timeout; for evaluations that exceed that, the caller's connection is dropped but Railway continues running the handler to completion. Throughout this document, "background route" refers to this pattern: a long-running internal HTTP handler, **not** a Netlify Function. #### Other internal background routes -Beyond the two canonical public-commons routes above, the codebase ships several additional internal background routes under `api/internal/`. They follow the same auth/fan-out pattern (`callBackgroundFunction` → authenticated `POST` → long-running Next.js handler on Railway) and exist to support adjacent surfaces: +Beyond the two canonical public-commons routes above, the codebase ships several additional internal background routes under `api/internal/`. They follow the same pattern (`callBackgroundFunction` → authenticated `POST` → long-running Next.js handler on Railway) and exist to support adjacent surfaces: - **`execute-pr-evaluation-background`**: Runs evaluations for blueprints proposed in a PR; output lands under `live/pr-evals/[prNumber]/...`. - **`execute-api-evaluation-background`**: Runs evaluations triggered by the public HTTP API. From ae6cc81e7973fb85c746e91c7e835d65ab08d3b9 Mon Sep 17 00:00:00 2001 From: Nnamdi Kenneth Ojibe Date: Wed, 6 May 2026 11:21:55 -0400 Subject: [PATCH 8/8] Fix formatting in ARCHITECTURE.md --- docs/ARCHITECTURE.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/ARCHITECTURE.md b/docs/ARCHITECTURE.md index 056f401e..4dd66c05 100644 --- a/docs/ARCHITECTURE.md +++ b/docs/ARCHITECTURE.md @@ -5,7 +5,7 @@ This document provides a comprehensive overview of the Weval architecture, detai Weval runs evaluations through two main paths: 1. **The Automated "Public Commons" Workflow:** A continuous integration pipeline that automatically evaluates community-contributed blueprints and updates the public weval.org website. -2. 
**The Interactive "Developer & Sandbox"** Workflow: A set of tools for developers and prompt engineers to create, test, and iterate on blueprints either locally or in a web-based environment. +2. **The Interactive "Developer & Sandbox" Workflow:** A set of tools for developers and prompt engineers to create, test, and iterate on blueprints either locally or in a web-based environment. ## 1. High-Level Workflows