Skip to content

US-024: Evaluate Proposition Generation Prompts #56

Description

@Mateus-Mannes

Summary

Create a local-only evaluation application for the AI prompts used by proposition generation. The evaluator must run representative, human-reviewed cases through the real prompt implementations, apply deterministic checks and structured AI graders, and produce reports that make prompt or model regressions visible before production changes are deployed.

The evaluator is opt-in. It must not execute during dotnet test or in GitHub Actions because it makes paid, non-deterministic model calls.

Prompt Surfaces

Evaluate the current proposition-generation stages independently:

  1. Article-content eligibility validation.
  2. Advanced exercise paragraph generation.
  3. Proper-name extraction.
  4. Title generation.
  5. Title validation.
  6. Advanced-to-intermediate adaptation.
  7. Intermediate-to-beginner adaptation.
  8. Image eligibility validation.

Each prompt must expose a stable prompt name and version so reports identify exactly what produced a result.

Architecture

  • Add a console application such as:
    • tests/propositions-service/WriteFluency.PropositionGeneration.Evals
  • Include it in WriteFluency.sln as a normal executable project, not a test project.
  • Extract private prompt construction from OpenAIClient into focused, versioned prompt definitions or collaborators that can be used by both production and the evaluator.
  • Do not duplicate production prompt text inside the evaluator.
  • Keep production orchestration and evaluation orchestration separate.
  • Add a configurable generation model and a separately configurable grader model.
  • Use structured output for AI grader responses.
  • Use deterministic grading whenever a rule can be verified without AI.

Evaluation Corpus

Add a committed, self-contained, sanitized evaluation manifest. It must not depend on production databases, corrections.json, external URLs, or another local dataset.

Cases should cover:

  • Valid news articles.
  • Navigation, boilerplate, listicles, affiliate content, advertising, adult content, and violent content that should be rejected.
  • Articles with insufficient information.
  • Paragraphs with hallucinated, omitted, or altered facts.
  • Paragraphs with too many or incorrectly copied proper names.
  • Acronyms, symbols, quotations, dates, complex numbers, line breaks, and formatting prohibited by the transcription exercise.
  • Neutral third-person and listener-friendly writing requirements.
  • Titles that omit proper names, invent facts, exceed word limits, or merely list names.
  • Correct and incorrect proper-name extraction.
  • Intermediate and beginner rewrites that preserve facts and names.
  • Rewrites that fail to create meaningful difficulty separation.
  • Images that are valid editorial thumbnails, placeholders, publisher logos, advertisements, screenshots, or unrelated graphics.

Every case must include:

  • Anonymous case ID and category.
  • Prompt stage.
  • Complete input required by that stage.
  • Human-reviewed expectation and explanation.
  • Deterministic assertions where applicable.
  • AI-grader rubric expectations where semantic judgment is required.

Start with at least 40 cases distributed across all prompt stages. Include positive, negative, and near-boundary examples.

Grading

Deterministic Graders

Implement exact checks for requirements such as:

  • Null versus non-null decision.
  • Output schema and allowed classification values.
  • Character and word limits.
  • Single-paragraph formatting.
  • Prohibited symbols and line breaks.
  • Proper-name count and exact preservation.
  • Required names in titles.
  • Title word count.
  • No newly introduced proper names.
  • Expected classification and extraction precision, recall, and F1.

AI Graders

Use a separate, low-temperature grader with structured output for semantic dimensions that deterministic code cannot judge reliably:

  • Factual faithfulness to the source article.
  • Hallucination severity.
  • Important-fact coverage.
  • Listener/transcription suitability.
  • Engagement without sensationalism.
  • Neutral global English.
  • Title relevance and factuality.
  • Meaning preservation during difficulty adaptation.
  • Actual beginner/intermediate difficulty separation.
  • Image relevance and editorial usability.

The grader response must contain:

  • Per-dimension score.
  • Pass/fail decision.
  • Short evidence grounded in the supplied input and candidate.
  • Explicit failure codes.

The grader prompt must treat article and generated content as untrusted data. Content inside a fixture must never override the rubric or output schema.

Use temperature 0 when supported. The default grader should be an efficient model appropriate for classification and rubric scoring, with model IDs configurable from command-line arguments or configuration.

Repeatability And Comparison

Support:

--stage <stage-name>
--case <case-id>
--runs <positive integer>
--concurrency <positive integer>
--generation-model <model-id>
--grader-model <model-id>
--report-only
--validate-only
  • Allow multiple runs per case to measure model variance.
  • Use bounded concurrency.
  • Report pass rate per case across repeated runs.
  • Support comparing a candidate prompt/model against a named baseline report.
  • Do not pass evaluation expectations or grader answers into the generation prompt.
  • Keep examples used by production prompts separate from held-out evaluation cases to avoid contaminating the evaluation.

Reports

Write ignored artifacts under artifacts/proposition-generation-evals/<timestamp>/:

  • report.md: human-readable summary and failing cases.
  • report.json: complete structured metrics.
  • outputs.json: candidate outputs, grader evidence, deterministic failures, and expected behavior.

Include:

  • Prompt name and version.
  • Generation and grader model IDs.
  • Exact case and stage pass rates.
  • Per-rubric scores.
  • Classification confusion metrics where applicable.
  • Proper-name precision, recall, and F1.
  • Repeated-run stability.
  • Invalid structured outputs and provider failures.
  • Input/output token usage per request and in total.
  • Estimated dollar cost using configurable model prices.
  • Latency per case and total duration.

Do not log API keys, user identifiers, or unsanitized production content.

Quality Gates

  • Define configurable thresholds per stage rather than one global score.
  • Safety, factual faithfulness, and hallucination checks are hard gates.
  • Invalid output schemas fail the relevant case.
  • Reports must distinguish:
    • generation failure,
    • deterministic-rule failure,
    • grader failure,
    • grader disagreement,
    • provider/timeout failure.
  • Add a manifest validator that runs without an API key and checks fixture completeness and consistency.

Local Execution

Document:

dotnet run --project tests/propositions-service/WriteFluency.PropositionGeneration.Evals

Requirements:

  • Read the OpenAI key from the established local user-secrets/environment configuration.
  • Exit non-zero when configured quality gates fail, unless --report-only is used.
  • --validate-only must not require an API key or make network calls.
  • dotnet test WriteFluency.sln must never execute these evaluations.
  • GitHub workflows must not invoke this project.

Test Plan

  • Unit-test deterministic graders and metric calculations.
  • Unit-test grader structured-output parsing and validation with mocked responses.
  • Unit-test prompt version reporting.
  • Unit-test manifest validation.
  • Verify stage and case filters.
  • Verify repeated runs and bounded concurrency.
  • Verify token, cost, latency, and stability aggregation.
  • Verify report generation for success, quality failure, invalid output, timeout, and provider failure.
  • Verify evaluation cases are not embedded in production prompts.
  • Verify solution-wide dotnet test does not make external AI calls.

Acceptance Criteria

  • All current proposition-generation prompt stages have representative evaluation coverage.
  • Production and evaluator use the same versioned prompt definitions.
  • The corpus is self-contained, sanitized, and human-reviewed.
  • Deterministic and AI-based graders produce explainable results.
  • Repeated runs expose output variance.
  • Reports include quality, latency, token usage, and estimated cost.
  • A single case or stage can be debugged explicitly.
  • Evaluation runs only through an explicit local command.
  • Normal tests and GitHub pipelines remain free of paid AI evaluation calls.

Out Of Scope

  • Automatically changing prompts based on grader output.
  • Automatically deploying a prompt or model after an evaluation.
  • Running paid evaluations in CI.
  • Replacing production validation with the evaluation grader.
  • Building a hosted evaluation dashboard.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions