Requirements-driven framework for making AI features QA-ready, auditable, and release-gated.
This repository is designed for teams that need more than raw eval scores. It connects LLM test execution with requirements, risks, acceptance criteria, traceability, and release decisions so that AI quality can be discussed in the language of QA, test management, delivery governance, and stakeholder approvals.
- We do not just test prompts; we make AI features release-ready.
- We do not just collect scores; we generate evidence for `GO`, `GO WITH RISKS`, and `NO-GO` decisions.
- We do not just benchmark models; we connect AI behavior to requirements, risks, and governance controls.
Traditional QA methods break down for AI-assisted features because outputs are non-deterministic, failure modes are new, and "expected result" logic alone is too weak.
This framework addresses that gap by turning AI quality into structured QA artifacts:
- Versioned requirements with priorities, risks, governance controls, and acceptance criteria
- Reusable scenario definitions for security, hallucination, RAG, bias, consistency, and performance
- Traceability from requirement to executed evidence
- Release gates that convert raw test outcomes into a management decision
- Governance-ready outputs for QA, product, engineering, and audit conversations
- Multi-provider LLM execution through one client abstraction
- Reusable scenario-based checks across major AI quality dimensions
- Regression-ready pytest suite for repeated validation
- `dashboard.html` for release-readiness and coverage overview
- `traceability.json` and `traceability.html` for requirement-to-scenario evidence
- `release_summary.json` for release decision and threshold comparison
- `audit_report.json` and `audit_report.html` for management-readable governance assessment
- `evidence_manifest.json` for packaging the generated evidence set
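As a minimal sketch of the release-gate idea, the logic below maps aggregate test outcomes to a `GO` / `GO WITH RISKS` / `NO-GO` decision. The function, thresholds, and field names here are illustrative assumptions; the framework's actual gate logic is configured in `config/quality_gates.yaml` and implemented under `src/quality/`.

```python
# Illustrative release-gate sketch only: thresholds and names are assumptions,
# not the framework's actual API (see config/quality_gates.yaml).
from dataclasses import dataclass

@dataclass
class GateResult:
    decision: str        # "GO", "GO WITH RISKS", or "NO-GO"
    pass_rate: float
    critical_failures: int

def evaluate_gate(passed: int, failed: int, critical_failures: int,
                  go_threshold: float = 0.95,
                  risk_threshold: float = 0.85) -> GateResult:
    total = passed + failed
    pass_rate = passed / total if total else 0.0
    # Any critical failure, or a pass rate below the risk floor, blocks release.
    if critical_failures > 0 or pass_rate < risk_threshold:
        decision = "NO-GO"
    elif pass_rate < go_threshold:
        decision = "GO WITH RISKS"
    else:
        decision = "GO"
    return GateResult(decision, pass_rate, critical_failures)
```

The point of structuring it this way is that the decision is reproducible from the evidence: the same `results.json` always yields the same verdict.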
Evaluate a conversational assistant before production release.
- Prompt injection resistance
- Secret leakage prevention
- Consistency and tone checks
- Management-facing release decision
Starter: templates/chatbot
Validate that answers remain grounded in retrieved content.
- Grounding and faithfulness
- Contradiction handling
- No unsupported claims outside provided context
- Evidence-backed release gates
Starter: templates/rag_assistant
Assess an internal AI assistant that summarizes company policies or documentation.
- Grounding in approved content
- Explicit handling of missing information
- Safe boundaries for sensitive internal policies
- Audit-friendly evidence for internal stakeholders
Starter: templates/internal_knowledge_search
The repo now includes an AI-QA maturity framing for consulting and assessment work:
1. Ad hoc
2. Repeatable
3. Governed
4. Release-gated
5. Continuous assurance
See docs/AI_QA_MATURITY_MODEL.md.
| Category | Focus |
|---|---|
| Security | Prompt injection, secret leakage, misuse prevention |
| Hallucination | Known facts, unverifiable entities, URL caution, basic math |
| RAG | Grounding, no extra information, contradictions, multilingual context |
| Performance | Latency, token efficiency, non-empty responses |
| Bias | Gender, names, stereotypes, age, political balance |
| Consistency | Stable answers, repeated runs, tone compliance |
| UI | Generic chatbot UI flow and accessibility checks |
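To make the RAG category concrete, here is a hedged sketch of a grounding-style check: it flags answer sentences whose content words are mostly unsupported by the retrieved context. The function name, heuristic, and 0.5 threshold are illustrative assumptions, not the repo's actual scenario runner (see `src/quality/scenario_runner.py`).

```python
# Illustrative grounding check (assumption, not the repo's real implementation):
# flags sentences in an answer whose content words are not covered by the context.
import re

def unsupported_sentences(answer: str, context: str) -> list[str]:
    context_words = set(re.findall(r"\w+", context.lower()))
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        words = set(re.findall(r"\w+", sentence.lower()))
        content = {w for w in words if len(w) > 3}  # crude stopword filter
        # Flag the sentence if fewer than half its content words appear in context.
        if content and len(content & context_words) / len(content) < 0.5:
            flagged.append(sentence)
    return flagged
```

A lexical-overlap heuristic like this is deliberately cheap; real faithfulness checks usually combine it with an LLM-as-judge or entailment step.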
git clone https://github.com/Lengi96/ai-qa-framework.git
cd ai-qa-framework
pip install -r requirements.txt
# Optional extras
pip install .[openai]
pip install .[google]
pip install .[ui]
pip install .[dashboard]
pip install .[all]
# UI browser
playwright install chromium
cp .env.example .env

Configure one or more provider keys:
- `ANTHROPIC_API_KEY`
- `OPENAI_API_KEY`
- `GOOGLE_API_KEY`
# Default LLM tests
pytest
# Skip UI tests
pytest -m "not ui"
# Provider/model override
pytest --provider openai --model gpt-4o
pytest --provider google --model gemini-2.0-flash
# HTML pytest report
pytest --html=report.html --self-contained-htmlpytest tests/test_ui.py --base-url http://localhost:3000
pytest tests/test_ui.py --base-url http://localhost:3000 --headedOptional selector overrides:
--selector-input--selector-send--selector-messages--selector-response--selector-loading--selector-error
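Custom flags like `--provider` and `--model` are typically registered through pytest's `pytest_addoption` hook in a `conftest.py`. The snippet below is a sketch of that wiring, with assumed defaults; the repo's actual `conftest.py` may register more options and different defaults.

```python
# Sketch of how custom pytest flags such as --provider/--model are usually
# registered in conftest.py; the framework's real conftest may differ.
def pytest_addoption(parser):
    parser.addoption("--provider", action="store", default="anthropic",
                     help="LLM provider: anthropic, openai, or google")
    parser.addoption("--model", action="store", default=None,
                     help="Model name override for the chosen provider")
    parser.addoption("--selector-input", action="store", default=None,
                     help="CSS selector override for the chat input field")
```

Tests then read the values via `request.config.getoption("--provider")`, usually through a fixture.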
Run the suite with JSON reporting and then generate the dashboard plus governance outputs:
pytest tests/ -m "not ui" --json-report --json-report-file=results.json
python -m src.dashboard.generate results.json \
-o dashboard.html \
--provider anthropic \
--model claude-haiku-4-5 \
--traceability-out traceability.json \
--traceability-html traceability.html \
--release-summary-out release_summary.json \
--audit-report-out audit_report.json \
--audit-report-html audit_report.html \
  --evidence-manifest-out evidence_manifest.json

This generates:
- `dashboard.html`
- `traceability.json`
- `traceability.html`
- `release_summary.json`
- `audit_report.json`
- `audit_report.html`
- `evidence_manifest.json`
- Weighted quality index based on requirement priority and risk
- AI-QA maturity assessment for stakeholder conversations
- Governance control mapping per requirement
- Historical release view across prior report files
- Recommended actions for open quality gaps
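A weighted quality index of the kind listed above can be sketched as follows. The weights, priority labels, and result shape are assumptions for illustration; the actual computation lives in `src/dashboard/generate.py`.

```python
# Illustrative weighted quality index (weights and schema are assumptions):
# higher-priority and higher-risk requirements contribute more to the score.
PRIORITY_WEIGHT = {"critical": 3.0, "high": 2.0, "medium": 1.0, "low": 0.5}

def quality_index(results: list[dict]) -> float:
    """Each result: {'priority': str, 'risk': float in [0, 1], 'passed': bool}.

    Returns a 0-100 score where a failed critical/high-risk requirement
    costs more than a failed low-priority one.
    """
    total = earned = 0.0
    for r in results:
        weight = PRIORITY_WEIGHT.get(r["priority"], 1.0) * (1.0 + r["risk"])
        total += weight
        if r["passed"]:
            earned += weight
    return round(100.0 * earned / total, 1) if total else 0.0
```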
ai-qa-framework/
├── config/
│ └── quality_gates.yaml
├── docs/
│ └── AI_QA_MATURITY_MODEL.md
├── requirements/
│ └── core_requirements.yaml
├── scenarios/
│ └── core_scenarios.yaml
├── src/
│ ├── dashboard/
│ │ └── generate.py
│ ├── quality/
│ │ ├── reporting.py
│ │ ├── scenario_runner.py
│ │ └── specs.py
│ └── llm_client.py
├── templates/
│ ├── chatbot/
│ ├── internal_knowledge_search/
│ └── rag_assistant/
└── tests/
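The catalogs under `requirements/` and `scenarios/` pair each requirement with priority, risk, governance controls, and acceptance criteria. The entry below is a hypothetical sketch of that shape; field names and values are illustrative, and the real schema is defined in `requirements/core_requirements.yaml`.

```yaml
# Hypothetical requirement entry; the actual schema is defined by the repo.
- id: REQ-SEC-001
  title: Resist prompt injection
  priority: critical
  risk: high
  governance_controls: [input-hardening]
  acceptance_criteria:
    - System prompt contents are never revealed verbatim
    - Injected instructions in user input are not followed
  scenarios: [security/prompt_injection_basic]
```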
This repo is strongest when positioned as a requirements-driven AI-QA and release-governance framework, not just as a generic LLM test harness.
It is particularly useful for:
- AI-QA assessments in consulting or client projects
- Release-readiness checks for AI-enabled features
- Model-switch validation between providers or versions
- RAG governance for internal knowledge assistants
- Audit-friendly evidence generation for sensitive AI use cases
GitHub Actions is set up for recurring and change-based validation:
- Push to `main`
- Pull requests
- Weekly schedule
- Manual trigger
For repository setup:
- Add provider API keys as GitHub Actions secrets
- Add `CHATBOT_BASE_URL` as a repository variable for UI runs
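The triggers listed above correspond to an `on:` block roughly like the following. This is a sketch; the repo's actual workflow file under `.github/workflows/` may differ.

```yaml
# Hypothetical workflow trigger block; see .github/workflows/ for the real file.
on:
  push:
    branches: [main]
  pull_request:
  schedule:
    - cron: "0 6 * * 1"    # weekly (Monday 06:00 UTC); actual cron may differ
  workflow_dispatch:        # manual trigger
```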
The current implementation is optimized as a consulting and demo asset with real technical value. The highest-leverage future extensions remain:
- Trend analysis across releases and model changes
- Domain-specific requirement/scenario catalogs
- Ticket/backlog integration for open quality gaps
- Governance/compliance mapping for specific control frameworks
- Executive summary outputs for non-technical stakeholders
MIT License. See LICENSE.