Automated quality assurance for LLM pipelines — hallucination detection, retrieval validation, and consistency evaluation at scale.
Most RAG pipelines ship with no systematic way to know if their outputs are actually correct. This project builds that layer — an automated evaluation framework that runs alongside your LLM pipeline and flags hallucinations, retrieval failures, and consistency issues before they reach end users.
Built as part of independent research into LLM reliability (Jan–Apr 2025). Processes 12,000+ records/day on AWS with 94% retrieval accuracy across evaluation benchmarks.
User query → Retriever → LLM → Response
↑
No validation here.
You're trusting the model.
What actually happens in production:
- Retriever returns irrelevant chunks → LLM hallucinates confidently
- Model ignores retrieved context entirely → answer is made up
- Same query returns different answers on different runs → inconsistency
This framework adds a validation layer after every generation step:
User query → Retriever → Validator → LLM → Response Checker → Response
↑ ↑
retrieval accuracy hallucination score
chunk relevance consistency check
coverage score explainability flag
┌─────────────────────────────────────────────────────────┐
│ FastAPI Gateway │
│ (rate limiting, auth, routing) │
└──────────────────────┬──────────────────────────────────┘
│
┌─────────────┴─────────────┐
▼ ▼
┌─────────────────┐ ┌─────────────────┐
│ Retrieval │ │ Generation │
│ Validator │ │ Validator │
│ │ │ │
│ - chunk score │ │ - hallucination │
│ - coverage │ │ - consistency │
│ - relevance │ │ - faithfulness │
└────────┬────────┘ └────────┬────────┘
│ │
└─────────────┬─────────────┘
▼
┌──────────────────┐
│ Pinecone DB │
│ (vector store) │
└────────┬─────────┘
│
┌────────▼─────────┐
│ Metrics Store │
│ (PostgreSQL) │
│ + S3 logs │
└──────────────────┘
Stack: Python · LangChain · Pinecone · Anthropic API · FastAPI · AWS Lambda + S3 · PostgreSQL
| Metric | Result | Baseline |
|---|---|---|
| Retrieval accuracy | 94% | 71% (no validation) |
| Hallucination detection rate | 89% | — |
| Throughput | 12,000+ records/day | — |
| Infrastructure cost reduction | 38% | pre-optimisation |
| Average validation latency | ~140ms | — |
Evaluation methodology: held-out test set of 500 query-document pairs with human-annotated ground truth. Retrieval accuracy = top-1 retrieved chunk matches annotated relevant passage.
llm-validation-framework/
│
├── src/
│ ├── retrieval/
│ │ ├── validator.py # chunk relevance scoring
│ │ ├── coverage.py # query coverage analysis
│ │ └── embeddings.py # Pinecone interface
│ │
│ ├── generation/
│ │ ├── hallucination.py # faithfulness checker
│ │ ├── consistency.py # multi-run consistency test
│ │ └── explainability.py # attribution + evidence flags
│ │
│ ├── pipeline/
│ │ ├── orchestrator.py # LangChain chain assembly
│ │ ├── validators.py # validation step wrappers
│ │ └── guardrails.py # hard stop conditions
│ │
│ ├── api/
│ │ ├── main.py # FastAPI app
│ │ ├── routes.py # endpoints
│ │ └── middleware.py # rate limiting, auth
│ │
│ └── infra/
│ ├── aws_lambda/ # Lambda deployment config
│ └── metrics.py # Prometheus + CloudWatch
│
├── evaluation/
│ ├── benchmark.py # run full eval suite
│ ├── datasets/ # test query-document pairs
│ └── results/ # stored eval outputs (gitignored)
│
├── notebooks/
│ ├── 01_retrieval_analysis.ipynb # chunk scoring exploration
│ ├── 02_hallucination_study.ipynb # failure mode analysis
│ └── 03_ablation_study.ipynb # component-wise ablation
│
├── tests/
│ ├── unit/
│ └── integration/
│
├── docs/
│ ├── architecture.md
│ ├── evaluation_methodology.md
│ └── deployment.md
│
├── requirements.txt
├── docker-compose.yml
└── README.md # you are here
# clone
git clone https://github.com/CHHemant/llm-validation-framework
cd llm-validation-framework
# install
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
# configure (copy and fill in keys)
cp .env.example .env
# run locally
uvicorn src.api.main:app --reload
# run evaluation suite
python evaluation/benchmark.py --dataset evaluation/datasets/test_500.json.env.example:
ANTHROPIC_API_KEY=your-key
PINECONE_API_KEY=your-key
PINECONE_ENV=us-east-1-aws
DATABASE_URL=postgresql://...
The hallucination checker uses a three-stage approach:
1. Retrieval faithfulness — does the answer only reference information present in retrieved chunks?
faithfulness_score = semantic_similarity(answer, retrieved_chunks)
# threshold: 0.72 → flag if below2. Consistency — does the same query produce consistent answers across runs?
# run same query N=5 times, measure pairwise cosine similarity
consistency_score = mean_pairwise_similarity(responses)
# threshold: 0.85 → flag if below3. Attribution — can every factual claim be traced to a source chunk?
# sentence-level attribution using NLI model
for sentence in answer:
entailment_score = nli_model(sentence, retrieved_chunks)
if entailment_score < 0.6:
flag_for_review(sentence)Full methodology in docs/evaluation_methodology.md.
Tested which validation components contribute most to the accuracy gain:
| Configuration | Retrieval Accuracy | vs. baseline |
|---|---|---|
| Baseline (no validation) | 71% | — |
| + Retrieval validator only | 81% | +10% |
| + Generation validator only | 78% | +7% |
| + Both validators | 91% | +20% |
| + Consistency check | 94% | +23% |
Takeaway: retrieval validation contributes more than generation validation in isolation, but combining both with consistency checking gives the biggest gain. Detailed analysis in notebooks/03_ablation_study.ipynb.
Honest about what this doesn't do well:
- Domain-specific hallucinations — the faithfulness checker uses general semantic similarity. For highly technical domains (medical, legal) it misses subtle factual errors.
- Multi-hop reasoning — if the answer requires combining information from 3+ chunks, attribution accuracy drops to ~74%.
- Non-English documents — only tested on English. Retrieval accuracy degrades significantly on Hindi/Telugu documents in our test set.
- Latency at scale — at 50k+ records/day the validation overhead becomes the bottleneck. Needs async batching beyond that threshold.
- RAGAS — evaluation framework for RAG pipelines (similar goal, different implementation approach)
- TruLens — LLM evaluation with feedback functions
- Anthropic's Constitutional AI — principles that informed the guardrail design
Built by Hemant Chilkuri — B.Tech AI, G.H. Raisoni University (2024–2028).
Microsoft Learn Student Ambassador · Phoenix AI Club founder
hemant_189@outlook.com · linkedin.com/in/hemantchilkuri
If you use this in your research or find it useful, a star helps others find it.