Skip to content

CHHemant/llm-validation-framework

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

6 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

LLM Output Validation & RAG Workflow Framework

Automated quality assurance for LLM pipelines — hallucination detection, retrieval validation, and consistency evaluation at scale.

Python LangChain FastAPI AWS License


What this does

Most RAG pipelines ship with no systematic way to know if their outputs are actually correct. This project builds that layer — an automated evaluation framework that runs alongside your LLM pipeline and flags hallucinations, retrieval failures, and consistency issues before they reach end users.

Built as part of independent research into LLM reliability (Jan–Apr 2025). Processes 12,000+ records/day on AWS with 94% retrieval accuracy across evaluation benchmarks.


The problem it solves

User query → Retriever → LLM → Response
                ↑
         No validation here.
         You're trusting the model.

What actually happens in production:

  • Retriever returns irrelevant chunks → LLM hallucinates confidently
  • Model ignores retrieved context entirely → answer is made up
  • Same query returns different answers on different runs → inconsistency

This framework adds a validation layer after every generation step:

User query → Retriever → Validator → LLM → Response Checker → Response
                              ↑                    ↑
                    retrieval accuracy      hallucination score
                    chunk relevance         consistency check
                    coverage score          explainability flag

Architecture

┌─────────────────────────────────────────────────────────┐
│                     FastAPI Gateway                      │
│              (rate limiting, auth, routing)              │
└──────────────────────┬──────────────────────────────────┘
                       │
         ┌─────────────┴─────────────┐
         ▼                           ▼
┌─────────────────┐         ┌─────────────────┐
│  Retrieval      │         │  Generation     │
│  Validator      │         │  Validator      │
│                 │         │                 │
│ - chunk score   │         │ - hallucination │
│ - coverage      │         │ - consistency   │
│ - relevance     │         │ - faithfulness  │
└────────┬────────┘         └────────┬────────┘
         │                           │
         └─────────────┬─────────────┘
                       ▼
            ┌──────────────────┐
            │   Pinecone DB    │
            │  (vector store)  │
            └────────┬─────────┘
                     │
            ┌────────▼─────────┐
            │  Metrics Store   │
            │  (PostgreSQL)    │
            │  + S3 logs       │
            └──────────────────┘

Stack: Python · LangChain · Pinecone · Anthropic API · FastAPI · AWS Lambda + S3 · PostgreSQL


Key results

Metric Result Baseline
Retrieval accuracy 94% 71% (no validation)
Hallucination detection rate 89%
Throughput 12,000+ records/day
Infrastructure cost reduction 38% pre-optimisation
Average validation latency ~140ms

Evaluation methodology: held-out test set of 500 query-document pairs with human-annotated ground truth. Retrieval accuracy = top-1 retrieved chunk matches annotated relevant passage.


Project structure

llm-validation-framework/
│
├── src/
│   ├── retrieval/
│   │   ├── validator.py          # chunk relevance scoring
│   │   ├── coverage.py           # query coverage analysis
│   │   └── embeddings.py         # Pinecone interface
│   │
│   ├── generation/
│   │   ├── hallucination.py      # faithfulness checker
│   │   ├── consistency.py        # multi-run consistency test
│   │   └── explainability.py     # attribution + evidence flags
│   │
│   ├── pipeline/
│   │   ├── orchestrator.py       # LangChain chain assembly
│   │   ├── validators.py         # validation step wrappers
│   │   └── guardrails.py         # hard stop conditions
│   │
│   ├── api/
│   │   ├── main.py               # FastAPI app
│   │   ├── routes.py             # endpoints
│   │   └── middleware.py         # rate limiting, auth
│   │
│   └── infra/
│       ├── aws_lambda/           # Lambda deployment config
│       └── metrics.py            # Prometheus + CloudWatch
│
├── evaluation/
│   ├── benchmark.py              # run full eval suite
│   ├── datasets/                 # test query-document pairs
│   └── results/                  # stored eval outputs (gitignored)
│
├── notebooks/
│   ├── 01_retrieval_analysis.ipynb   # chunk scoring exploration
│   ├── 02_hallucination_study.ipynb  # failure mode analysis
│   └── 03_ablation_study.ipynb       # component-wise ablation
│
├── tests/
│   ├── unit/
│   └── integration/
│
├── docs/
│   ├── architecture.md
│   ├── evaluation_methodology.md
│   └── deployment.md
│
├── requirements.txt
├── docker-compose.yml
└── README.md                     # you are here

Quickstart

# clone
git clone https://github.com/CHHemant/llm-validation-framework
cd llm-validation-framework

# install
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

# configure (copy and fill in keys)
cp .env.example .env

# run locally
uvicorn src.api.main:app --reload

# run evaluation suite
python evaluation/benchmark.py --dataset evaluation/datasets/test_500.json

.env.example:

ANTHROPIC_API_KEY=your-key
PINECONE_API_KEY=your-key
PINECONE_ENV=us-east-1-aws
DATABASE_URL=postgresql://...

Evaluation methodology

The hallucination checker uses a three-stage approach:

1. Retrieval faithfulness — does the answer only reference information present in retrieved chunks?

faithfulness_score = semantic_similarity(answer, retrieved_chunks)
# threshold: 0.72 → flag if below

2. Consistency — does the same query produce consistent answers across runs?

# run same query N=5 times, measure pairwise cosine similarity
consistency_score = mean_pairwise_similarity(responses)
# threshold: 0.85 → flag if below

3. Attribution — can every factual claim be traced to a source chunk?

# sentence-level attribution using NLI model
for sentence in answer:
    entailment_score = nli_model(sentence, retrieved_chunks)
    if entailment_score < 0.6:
        flag_for_review(sentence)

Full methodology in docs/evaluation_methodology.md.


Ablation study results

Tested which validation components contribute most to the accuracy gain:

Configuration Retrieval Accuracy vs. baseline
Baseline (no validation) 71%
+ Retrieval validator only 81% +10%
+ Generation validator only 78% +7%
+ Both validators 91% +20%
+ Consistency check 94% +23%

Takeaway: retrieval validation contributes more than generation validation in isolation, but combining both with consistency checking gives the biggest gain. Detailed analysis in notebooks/03_ablation_study.ipynb.


Limitations

Honest about what this doesn't do well:

  • Domain-specific hallucinations — the faithfulness checker uses general semantic similarity. For highly technical domains (medical, legal) it misses subtle factual errors.
  • Multi-hop reasoning — if the answer requires combining information from 3+ chunks, attribution accuracy drops to ~74%.
  • Non-English documents — only tested on English. Retrieval accuracy degrades significantly on Hindi/Telugu documents in our test set.
  • Latency at scale — at 50k+ records/day the validation overhead becomes the bottleneck. Needs async batching beyond that threshold.

Related work

  • RAGAS — evaluation framework for RAG pipelines (similar goal, different implementation approach)
  • TruLens — LLM evaluation with feedback functions
  • Anthropic's Constitutional AI — principles that informed the guardrail design

About

Built by Hemant Chilkuri — B.Tech AI, G.H. Raisoni University (2024–2028).

Microsoft Learn Student Ambassador · Phoenix AI Club founder

hemant_189@outlook.com · linkedin.com/in/hemantchilkuri


If you use this in your research or find it useful, a star helps others find it.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors