Local LLM QA Evaluator

This is a local developer tool for comparing how different LLMs review the same course or assessment content. It is meant for AI-assisted QA evaluation, not blind automated QA and not text-template generation.

For a portfolio-friendly case study, start here:

portfolio_sample/README.md explains the QA workflow, sample results, and interpretation.
portfolio_sample/model_comparison_sanitized.md is a supporting appendix generated from one manually reviewed run.

The public portfolio artifacts intentionally exclude raw course content, raw model requests/responses, API keys, screenshots, and client-specific details.

The workflow is:

Inspect a course folder.
Build a clean review context from the real files.
Send the same QA prompts to multiple OpenRouter models.
Save raw model outputs and request metadata.
Normalize model findings.
Manually label every finding.
Generate a comparison report for a portfolio case study.

Setup

python -m venv .venv
.venv\Scripts\activate
pip install -r requirements.txt

Copy .env.example to .env and add your OpenRouter API key:

OPENROUTER_API_KEY=your_key_here
OPENROUTER_SITE_URL=http://localhost
OPENROUTER_APP_TITLE=Local LLM QA Evaluator

Do not commit .env.

Do not commit real course or assessment content. The project expects a local folder named inputassessment, but that folder is ignored by git except for an empty .gitkeep placeholder.

OpenRouter calls can cost money when you use paid models. Free models may be rate-limited, unavailable, or less reliable. Review config/models.yaml before running a real evaluation.

Inspect The Input Folder

The current input folder is expected at inputassessment. Put the course or assessment files there locally before running the tool. Real course files may contain client, platform, or assessment details, so they should stay out of the portfolio repository.

python -m src.main inspect --input inputassessment

This identifies file types, lesson content, assessment JSON, notebooks, images, CSV data, geodata artifacts, checkpoint files, and whether chunking is likely needed.

Build Context Only

python -m src.main build-context --input inputassessment

For a single course page, use the page slug printed by inspect:

python -m src.main build-context --input inputassessment --page-slug Creating-Basic-Geospatial-Visualizations-205e

You can also select by title fragment:

python -m src.main build-context --input inputassessment --page-title "Creating Basic"

This creates:

runs/<run_id>/review_context.md
runs/<run_id>/source_summary.json

The loader is designed around Codio-style guide content:

includes .guides/content/*.md in index.json order;
includes page metadata from .guides/content/*.json;
includes assessment questions, answer keys, guidance, scoring, and linked files from .guides/assessments/*.json;
includes notebook source cells and summarized outputs;
includes CSV sample rows;
summarizes images, Natural Earth shapefile data, settings, checkpoints, and large output blobs.

Dry Run

Dry-run mode works without an API key and makes no external API calls:

python -m src.main run --input inputassessment --dry-run --only-model openai/gpt-5.2 --only-prompt technical_qa_review

For the improved single-page course review prompt:

python -m src.main run --input inputassessment --dry-run --page-slug Creating-Basic-Geospatial-Visualizations-205e --only-model openai/gpt-5.2 --only-prompt course_page_text_review

For the grammar and term-formatting consistency prompt:

python -m src.main run --input inputassessment --dry-run --page-slug Creating-Basic-Geospatial-Visualizations-205e --only-model openai/gpt-5.2 --only-prompt grammar_consistency_markup_review

It prints each model/prompt request and saves raw request metadata under:

runs/<run_id>/raw/<model>/<prompt_id>.json

Real OpenRouter Evaluation

Edit config/models.yaml first so the model IDs match your OpenRouter account and budget.

python -m src.main run --input inputassessment --models config/models.yaml --prompts config/prompts.yaml

Recommended first real comparison for one page:

python -m src.main run --input inputassessment --page-slug Creating-Basic-Geospatial-Visualizations-205e --only-prompt course_page_text_review --only-model anthropic/claude-sonnet-4.6 --only-model openai/gpt-5.2 --only-model google/gemini-2.5-flash

The app uses OpenRouter chat completions at:

https://openrouter.ai/api/v1/chat/completions

It reads OPENROUTER_API_KEY from .env, sends non-streaming chat completion requests, retries transient errors and rate limits, and saves full raw request/response metadata.

Generated run outputs:

runs/<run_id>/raw/<model>/<prompt_id>.json
runs/<run_id>/normalized_findings.json
runs/<run_id>/normalized_findings.csv
runs/<run_id>/manual_review.csv

If a model returns imperfect JSON, the parser tries to extract JSON from code fences or surrounding text. If parsing fails, raw text remains saved in the raw output file and the error is written to:

runs/<run_id>/parser_errors.log

Manual Review

After a real run, open:

runs/<run_id>/manual_review.csv

Fill:

manual_label
manual_notes
final_status

Use manual_label for the main classification of the model finding. Use manual_notes for the evidence-based reviewer decision. Use final_status for the action you would take after validation.

Allowed manual_label values:

valid_issue
false_positive
useful_suggestion
duplicate
too_generic
needs_manual_verification
missed_known_issue
not_enough_evidence

Suggested manual_notes style:

Keep notes short and specific.
Explain why the label was chosen.
Mention what was checked: source page, nearby text, code block, notebook, assessment config, screenshot description, or live run.
Do not paste confidential source text if the run may later be sanitized for a public portfolio artifact.

Examples:

manual_label	Example manual_notes
`valid_issue`	Confirmed in source page. The explanation and code imply different behavior.
`false_positive`	Model misread the term in context; nearby paragraph already clarifies it.
`useful_suggestion`	Not a required fix, but the recommendation would improve clarity for beginners.
`duplicate`	Same underlying issue as finding `abc123`; keep the clearer finding.
`too_generic`	Comment is broadly true but does not point to a specific fixable issue.
`needs_manual_verification`	Requires running the notebook or checking the rendered student-facing page.
`not_enough_evidence`	Source snippet is insufficient to confirm or reject the issue.

Suggested final_status values:

final_status	When to use
`accepted_issue`	The finding is a real issue that should be fixed.
`accepted_suggestion`	The finding is optional but useful.
`rejected_false_positive`	The finding is incorrect after manual review.
`merged_duplicate`	The finding duplicates another accepted item.
`verification_required`	The finding needs a live run, rendered-page check, or cross-page check.
`known_issue_added`	A manually known issue was added for comparison.
`no_action`	No fix or follow-up is needed.

You can recreate or refresh the review file after normalizing findings:

python -m src.main create-review --run-id <run_id>

Existing manual labels are preserved when finding IDs match.

Generate Report

After manual labels are filled:

python -m src.main report --run-id <run_id> --manual-review runs/<run_id>/manual_review.csv

Outputs:

reports/<run_id>_model_comparison.md
reports/<run_id>_model_comparison.html

The report includes:

source folder summary;
models tested;
prompts used;
findings per model;
valid issues per model;
false positives per model;
duplicates;
too generic findings;
useful suggestions;
examples of one valid issue and one false positive when available;
notes on which model was most useful for this specific sample;
a warning not to overgeneralize from one page/sample;
the conclusion that LLMs were used as a secondary QA aid and final findings require manual validation.

Export Sanitized Portfolio Artifact

Full run folders can contain copied course text inside review_context.md, raw prompts, and raw model responses. Keep runs/ and reports/ local unless the source content is safe to publish.

To create a portfolio-safe summary without raw source text:

python -m src.main export-sanitized --run-id <run_id> --manual-review runs/<run_id>/manual_review.csv

Outputs:

portfolio_sample/model_comparison_sanitized.md
portfolio_sample/model_comparison_sanitized.html

This sanitized artifact includes aggregate counts, model comparison tables, label summaries, and high-level examples. It intentionally excludes the source page text, raw model requests/responses, API keys, screenshots, and client-specific course details.

Prompt Types

config/prompts.yaml includes:

course_page_text_review
grammar_consistency_markup_review
technical_qa_review
content_consistency_review
assessment_logic_review
screenshot_output_review
user_flow_review

Each prompt asks for structured JSON findings using the normalized schema in config/evaluation_schema.json.

course_page_text_review is optimized for the course-page proofreading QA workflow. It removes internet-verification claims, treats missing image binaries honestly, separates required fixes from cross-page verification points, and returns evaluator-friendly JSON. When --page-slug or --page-title is used, the tool automatically prepends page name, source path, previous/next page, opened files, assessment references, and headings to the user prompt.

grammar_consistency_markup_review is focused on grammar, punctuation, typos, capitalization, hyphenation, spacing, and inconsistent formatting of the same concept. It asks the model to return annotated_text with exact suspected spans marked using exclamation emoji markers, plus normalized findings[] for CSV/manual review. This prompt has max_tokens: 7000 because it asks the model to return a full marked copy of the reviewed page.

Basic Validation

python -m unittest discover tests

Notes

This tool intentionally uses plain files only: JSON, CSV, Markdown, and HTML. It does not require a database. It makes no external API calls except OpenRouter during a non-dry-run evaluation.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Local LLM QA Evaluator

Setup

Inspect The Input Folder

Build Context Only

Dry Run

Real OpenRouter Evaluation

Manual Review

Generate Report

Export Sanitized Portfolio Artifact

Prompt Types

Basic Validation

Notes

About

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
config		config
inputassessment		inputassessment
portfolio_sample		portfolio_sample
src		src
tests		tests
.env.example		.env.example
.gitignore		.gitignore
README.md		README.md
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

Local LLM QA Evaluator

Setup

Inspect The Input Folder

Build Context Only

Dry Run

Real OpenRouter Evaluation

Manual Review

Generate Report

Export Sanitized Portfolio Artifact

Prompt Types

Basic Validation

Notes

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Contributors

Uh oh!

Languages