Skip to content

bbashifer/llm-qa-evaluator

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Local LLM QA Evaluator

This is a local developer tool for comparing how different LLMs review the same course or assessment content. It is meant for AI-assisted QA evaluation, not blind automated QA and not text-template generation.

For a portfolio-friendly case study, start here:

The public portfolio artifacts intentionally exclude raw course content, raw model requests/responses, API keys, screenshots, and client-specific details.

The workflow is:

  1. Inspect a course folder.
  2. Build a clean review context from the real files.
  3. Send the same QA prompts to multiple OpenRouter models.
  4. Save raw model outputs and request metadata.
  5. Normalize model findings.
  6. Manually label every finding.
  7. Generate a comparison report for a portfolio case study.

Setup

python -m venv .venv
.venv\Scripts\activate
pip install -r requirements.txt

Copy .env.example to .env and add your OpenRouter API key:

OPENROUTER_API_KEY=your_key_here
OPENROUTER_SITE_URL=http://localhost
OPENROUTER_APP_TITLE=Local LLM QA Evaluator

Do not commit .env.

Do not commit real course or assessment content. The project expects a local folder named inputassessment, but that folder is ignored by git except for an empty .gitkeep placeholder.

OpenRouter calls can cost money when you use paid models. Free models may be rate-limited, unavailable, or less reliable. Review config/models.yaml before running a real evaluation.

Inspect The Input Folder

The current input folder is expected at inputassessment. Put the course or assessment files there locally before running the tool. Real course files may contain client, platform, or assessment details, so they should stay out of the portfolio repository.

python -m src.main inspect --input inputassessment

This identifies file types, lesson content, assessment JSON, notebooks, images, CSV data, geodata artifacts, checkpoint files, and whether chunking is likely needed.

Build Context Only

python -m src.main build-context --input inputassessment

For a single course page, use the page slug printed by inspect:

python -m src.main build-context --input inputassessment --page-slug Creating-Basic-Geospatial-Visualizations-205e

You can also select by title fragment:

python -m src.main build-context --input inputassessment --page-title "Creating Basic"

This creates:

runs/<run_id>/review_context.md
runs/<run_id>/source_summary.json

The loader is designed around Codio-style guide content:

  • includes .guides/content/*.md in index.json order;
  • includes page metadata from .guides/content/*.json;
  • includes assessment questions, answer keys, guidance, scoring, and linked files from .guides/assessments/*.json;
  • includes notebook source cells and summarized outputs;
  • includes CSV sample rows;
  • summarizes images, Natural Earth shapefile data, settings, checkpoints, and large output blobs.

Dry Run

Dry-run mode works without an API key and makes no external API calls:

python -m src.main run --input inputassessment --dry-run --only-model openai/gpt-5.2 --only-prompt technical_qa_review

For the improved single-page course review prompt:

python -m src.main run --input inputassessment --dry-run --page-slug Creating-Basic-Geospatial-Visualizations-205e --only-model openai/gpt-5.2 --only-prompt course_page_text_review

For the grammar and term-formatting consistency prompt:

python -m src.main run --input inputassessment --dry-run --page-slug Creating-Basic-Geospatial-Visualizations-205e --only-model openai/gpt-5.2 --only-prompt grammar_consistency_markup_review

It prints each model/prompt request and saves raw request metadata under:

runs/<run_id>/raw/<model>/<prompt_id>.json

Real OpenRouter Evaluation

Edit config/models.yaml first so the model IDs match your OpenRouter account and budget.

python -m src.main run --input inputassessment --models config/models.yaml --prompts config/prompts.yaml

Recommended first real comparison for one page:

python -m src.main run --input inputassessment --page-slug Creating-Basic-Geospatial-Visualizations-205e --only-prompt course_page_text_review --only-model anthropic/claude-sonnet-4.6 --only-model openai/gpt-5.2 --only-model google/gemini-2.5-flash

The app uses OpenRouter chat completions at:

https://openrouter.ai/api/v1/chat/completions

It reads OPENROUTER_API_KEY from .env, sends non-streaming chat completion requests, retries transient errors and rate limits, and saves full raw request/response metadata.

Generated run outputs:

runs/<run_id>/raw/<model>/<prompt_id>.json
runs/<run_id>/normalized_findings.json
runs/<run_id>/normalized_findings.csv
runs/<run_id>/manual_review.csv

If a model returns imperfect JSON, the parser tries to extract JSON from code fences or surrounding text. If parsing fails, raw text remains saved in the raw output file and the error is written to:

runs/<run_id>/parser_errors.log

Manual Review

After a real run, open:

runs/<run_id>/manual_review.csv

Fill:

  • manual_label
  • manual_notes
  • final_status

Use manual_label for the main classification of the model finding. Use manual_notes for the evidence-based reviewer decision. Use final_status for the action you would take after validation.

Allowed manual_label values:

  • valid_issue
  • false_positive
  • useful_suggestion
  • duplicate
  • too_generic
  • needs_manual_verification
  • missed_known_issue
  • not_enough_evidence

Suggested manual_notes style:

  • Keep notes short and specific.
  • Explain why the label was chosen.
  • Mention what was checked: source page, nearby text, code block, notebook, assessment config, screenshot description, or live run.
  • Do not paste confidential source text if the run may later be sanitized for a public portfolio artifact.

Examples:

manual_label Example manual_notes
valid_issue Confirmed in source page. The explanation and code imply different behavior.
false_positive Model misread the term in context; nearby paragraph already clarifies it.
useful_suggestion Not a required fix, but the recommendation would improve clarity for beginners.
duplicate Same underlying issue as finding abc123; keep the clearer finding.
too_generic Comment is broadly true but does not point to a specific fixable issue.
needs_manual_verification Requires running the notebook or checking the rendered student-facing page.
not_enough_evidence Source snippet is insufficient to confirm or reject the issue.

Suggested final_status values:

final_status When to use
accepted_issue The finding is a real issue that should be fixed.
accepted_suggestion The finding is optional but useful.
rejected_false_positive The finding is incorrect after manual review.
merged_duplicate The finding duplicates another accepted item.
verification_required The finding needs a live run, rendered-page check, or cross-page check.
known_issue_added A manually known issue was added for comparison.
no_action No fix or follow-up is needed.

You can recreate or refresh the review file after normalizing findings:

python -m src.main create-review --run-id <run_id>

Existing manual labels are preserved when finding IDs match.

Generate Report

After manual labels are filled:

python -m src.main report --run-id <run_id> --manual-review runs/<run_id>/manual_review.csv

Outputs:

reports/<run_id>_model_comparison.md
reports/<run_id>_model_comparison.html

The report includes:

  • source folder summary;
  • models tested;
  • prompts used;
  • findings per model;
  • valid issues per model;
  • false positives per model;
  • duplicates;
  • too generic findings;
  • useful suggestions;
  • examples of one valid issue and one false positive when available;
  • notes on which model was most useful for this specific sample;
  • a warning not to overgeneralize from one page/sample;
  • the conclusion that LLMs were used as a secondary QA aid and final findings require manual validation.

Export Sanitized Portfolio Artifact

Full run folders can contain copied course text inside review_context.md, raw prompts, and raw model responses. Keep runs/ and reports/ local unless the source content is safe to publish.

To create a portfolio-safe summary without raw source text:

python -m src.main export-sanitized --run-id <run_id> --manual-review runs/<run_id>/manual_review.csv

Outputs:

portfolio_sample/model_comparison_sanitized.md
portfolio_sample/model_comparison_sanitized.html

This sanitized artifact includes aggregate counts, model comparison tables, label summaries, and high-level examples. It intentionally excludes the source page text, raw model requests/responses, API keys, screenshots, and client-specific course details.

Prompt Types

config/prompts.yaml includes:

  • course_page_text_review
  • grammar_consistency_markup_review
  • technical_qa_review
  • content_consistency_review
  • assessment_logic_review
  • screenshot_output_review
  • user_flow_review

Each prompt asks for structured JSON findings using the normalized schema in config/evaluation_schema.json.

course_page_text_review is optimized for the course-page proofreading QA workflow. It removes internet-verification claims, treats missing image binaries honestly, separates required fixes from cross-page verification points, and returns evaluator-friendly JSON. When --page-slug or --page-title is used, the tool automatically prepends page name, source path, previous/next page, opened files, assessment references, and headings to the user prompt.

grammar_consistency_markup_review is focused on grammar, punctuation, typos, capitalization, hyphenation, spacing, and inconsistent formatting of the same concept. It asks the model to return annotated_text with exact suspected spans marked using exclamation emoji markers, plus normalized findings[] for CSV/manual review. This prompt has max_tokens: 7000 because it asks the model to return a full marked copy of the reviewed page.

Basic Validation

python -m unittest discover tests

Notes

This tool intentionally uses plain files only: JSON, CSV, Markdown, and HTML. It does not require a database. It makes no external API calls except OpenRouter during a non-dry-run evaluation.

About

Local Python tool for evaluating LLM-generated QA findings with manual validation and false-positive analysis

Topics

Resources

Stars

Watchers

Forks

Contributors