Skip to content

Lengi96/ai-qa-framework

Repository files navigation

AI-QA Governance Framework

Latest Release QA Management Workflow Python

Requirements-driven framework for making AI features QA-ready, auditable, and release-gated.

This repository is designed for teams that need more than raw eval scores. It connects LLM test execution with requirements, risks, acceptance criteria, traceability, and release decisions so that AI quality can be discussed in the language of QA, test management, delivery governance, and stakeholder approvals.

Core Narrative

  • We do not just test prompts, we make AI features release-ready.
  • We do not just collect scores, we generate evidence for GO, GO WITH RISKS, and NO-GO.
  • We do not just benchmark models, we connect AI behavior to requirements, risks, and governance controls.

Why This Repo Is Useful

Traditional QA methods break down for AI-assisted features because outputs are non-deterministic, failure modes are new, and "expected result" logic alone is too weak.

This framework addresses that gap by turning AI quality into structured QA artifacts:

  • Versioned requirements with priorities, risks, governance controls, and acceptance criteria
  • Reusable scenario definitions for security, hallucination, RAG, bias, consistency, and performance
  • Traceability from requirement to executed evidence
  • Release gates that convert raw test outcomes into a management decision
  • Governance-ready outputs for QA, product, engineering, and audit conversations

What You Get

Technical evidence

  • Multi-provider LLM execution through one client abstraction
  • Reusable scenario-based checks across major AI quality dimensions
  • Regression-ready pytest suite for repeated validation

QA and governance outputs

  • dashboard.html for release-readiness and coverage overview
  • traceability.json and traceability.html for requirement-to-scenario evidence
  • release_summary.json for release decision and threshold comparison
  • audit_report.json and audit_report.html for management-readable governance assessment
  • evidence_manifest.json for packaging the generated evidence set

Main Use Cases

1. Chatbot release readiness

Evaluate a conversational assistant before production release.

  • Prompt injection resistance
  • Secret leakage prevention
  • Consistency and tone checks
  • Management-facing release decision

Starter: templates/chatbot

2. RAG assistant quality assurance

Validate that answers remain grounded in retrieved content.

  • Grounding and faithfulness
  • Contradiction handling
  • No unsupported claims outside provided context
  • Evidence-backed release gates

Starter: templates/rag_assistant

3. Internal knowledge search governance

Assess an internal AI assistant that summarizes company policies or documentation.

  • Grounding in approved content
  • Explicit handling of missing information
  • Safe boundaries for sensitive internal policies
  • Audit-friendly evidence for internal stakeholders

Starter: templates/internal_knowledge_search

AI-QA Maturity Model

The repo now includes an AI-QA maturity framing for consulting and assessment work:

  • Ad hoc
  • Repeatable
  • Governed
  • Release-gated
  • Continuous assurance

See docs/AI_QA_MATURITY_MODEL.md.

Supported Quality Dimensions

Category Focus
Security Prompt injection, secret leakage, misuse prevention
Hallucination Known facts, unverifiable entities, URL caution, basic math
RAG Grounding, no extra information, contradictions, multilingual context
Performance Latency, token efficiency, non-empty responses
Bias Gender, names, stereotypes, age, political balance
Consistency Stable answers, repeated runs, tone compliance
UI Generic chatbot UI flow and accessibility checks

Installation

git clone https://github.com/Lengi96/ai-qa-framework.git
cd ai-qa-framework

pip install -r requirements.txt

# Optional extras
pip install .[openai]
pip install .[google]
pip install .[ui]
pip install .[dashboard]
pip install .[all]

# UI browser
playwright install chromium

cp .env.example .env

Configure one or more provider keys:

  • ANTHROPIC_API_KEY
  • OPENAI_API_KEY
  • GOOGLE_API_KEY

Running the Suite

# Default LLM tests
pytest

# Skip UI tests
pytest -m "not ui"

# Provider/model override
pytest --provider openai --model gpt-4o
pytest --provider google --model gemini-2.0-flash

# HTML pytest report
pytest --html=report.html --self-contained-html

UI tests

pytest tests/test_ui.py --base-url http://localhost:3000
pytest tests/test_ui.py --base-url http://localhost:3000 --headed

Optional selector overrides:

  • --selector-input
  • --selector-send
  • --selector-messages
  • --selector-response
  • --selector-loading
  • --selector-error

Generating Governance Artifacts

Run the suite with JSON reporting and then generate the dashboard plus governance outputs:

pytest tests/ -m "not ui" --json-report --json-report-file=results.json

python -m src.dashboard.generate results.json \
  -o dashboard.html \
  --provider anthropic \
  --model claude-haiku-4-5 \
  --traceability-out traceability.json \
  --traceability-html traceability.html \
  --release-summary-out release_summary.json \
  --audit-report-out audit_report.json \
  --audit-report-html audit_report.html \
  --evidence-manifest-out evidence_manifest.json

This generates:

  • dashboard.html
  • traceability.json
  • traceability.html
  • release_summary.json
  • audit_report.json
  • audit_report.html
  • evidence_manifest.json

Governance additions in the report layer

  • Weighted quality index based on requirement priority and risk
  • AI-QA maturity assessment for stakeholder conversations
  • Governance control mapping per requirement
  • Historical release view across prior report files
  • Recommended actions for open quality gaps

Repository Structure

ai-qa-framework/
├── config/
│   └── quality_gates.yaml
├── docs/
│   └── AI_QA_MATURITY_MODEL.md
├── requirements/
│   └── core_requirements.yaml
├── scenarios/
│   └── core_scenarios.yaml
├── src/
│   ├── dashboard/
│   │   └── generate.py
│   ├── quality/
│   │   ├── reporting.py
│   │   ├── scenario_runner.py
│   │   └── specs.py
│   └── llm_client.py
├── templates/
│   ├── chatbot/
│   ├── internal_knowledge_search/
│   └── rag_assistant/
└── tests/

Project Positioning

This repo is strongest when positioned as a requirements-driven AI-QA and release-governance framework, not just as a generic LLM test harness.

It is particularly useful for:

  • AI-QA assessments in consulting or client projects
  • Release-readiness checks for AI-enabled features
  • Model-switch validation between providers or versions
  • RAG governance for internal knowledge assistants
  • Audit-friendly evidence generation for sensitive AI use cases

CI/CD

GitHub Actions is set up for recurring and change-based validation:

  • Push to main
  • Pull requests
  • Weekly schedule
  • Manual trigger

For repository setup:

  • Add provider API keys as GitHub Actions secrets
  • Add CHATBOT_BASE_URL as a repository variable for UI runs

Roadmap Direction

The current implementation is optimized for a consulting/demo asset with real technical value. The highest-leverage future extensions remain:

  • Trend analysis across releases and model changes
  • Domain-specific requirement/scenario catalogs
  • Ticket/backlog integration for open quality gaps
  • Governance/compliance mapping for specific control frameworks
  • Executive summary outputs for non-technical stakeholders

License

MIT License. See LICENSE.

About

Requirements-driven AI-QA governance framework with traceability, release gates, audit reporting, and multi-provider evaluation

Topics

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors