LLM Output Validation & RAG Workflow Framework

Automated quality assurance for LLM pipelines — hallucination detection, retrieval validation, and consistency evaluation at scale.

What this does

Most RAG pipelines ship with no systematic way to know if their outputs are actually correct. This project builds that layer — an automated evaluation framework that runs alongside your LLM pipeline and flags hallucinations, retrieval failures, and consistency issues before they reach end users.

Built as part of independent research into LLM reliability (Jan–Apr 2025). Processes 12,000+ records/day on AWS with 94% retrieval accuracy across evaluation benchmarks.

The problem it solves

User query → Retriever → LLM → Response
                ↑
         No validation here.
         You're trusting the model.

What actually happens in production:

Retriever returns irrelevant chunks → LLM hallucinates confidently
Model ignores retrieved context entirely → answer is made up
Same query returns different answers on different runs → inconsistency

This framework adds a validation layer after every generation step:

User query → Retriever → Validator → LLM → Response Checker → Response
                              ↑                    ↑
                    retrieval accuracy      hallucination score
                    chunk relevance         consistency check
                    coverage score          explainability flag

Architecture

┌─────────────────────────────────────────────────────────┐
│                     FastAPI Gateway                      │
│              (rate limiting, auth, routing)              │
└──────────────────────┬──────────────────────────────────┘
                       │
         ┌─────────────┴─────────────┐
         ▼                           ▼
┌─────────────────┐         ┌─────────────────┐
│  Retrieval      │         │  Generation     │
│  Validator      │         │  Validator      │
│                 │         │                 │
│ - chunk score   │         │ - hallucination │
│ - coverage      │         │ - consistency   │
│ - relevance     │         │ - faithfulness  │
└────────┬────────┘         └────────┬────────┘
         │                           │
         └─────────────┬─────────────┘
                       ▼
            ┌──────────────────┐
            │   Pinecone DB    │
            │  (vector store)  │
            └────────┬─────────┘
                     │
            ┌────────▼─────────┐
            │  Metrics Store   │
            │  (PostgreSQL)    │
            │  + S3 logs       │
            └──────────────────┘

Stack: Python · LangChain · Pinecone · Anthropic API · FastAPI · AWS Lambda + S3 · PostgreSQL

Key results

Metric	Result	Baseline
Retrieval accuracy	94%	71% (no validation)
Hallucination detection rate	89%	—
Throughput	12,000+ records/day	—
Infrastructure cost reduction	38%	pre-optimisation
Average validation latency	~140ms	—

Evaluation methodology: held-out test set of 500 query-document pairs with human-annotated ground truth. Retrieval accuracy = top-1 retrieved chunk matches annotated relevant passage.

Project structure

llm-validation-framework/
│
├── src/
│   ├── retrieval/
│   │   ├── validator.py          # chunk relevance scoring
│   │   ├── coverage.py           # query coverage analysis
│   │   └── embeddings.py         # Pinecone interface
│   │
│   ├── generation/
│   │   ├── hallucination.py      # faithfulness checker
│   │   ├── consistency.py        # multi-run consistency test
│   │   └── explainability.py     # attribution + evidence flags
│   │
│   ├── pipeline/
│   │   ├── orchestrator.py       # LangChain chain assembly
│   │   ├── validators.py         # validation step wrappers
│   │   └── guardrails.py         # hard stop conditions
│   │
│   ├── api/
│   │   ├── main.py               # FastAPI app
│   │   ├── routes.py             # endpoints
│   │   └── middleware.py         # rate limiting, auth
│   │
│   └── infra/
│       ├── aws_lambda/           # Lambda deployment config
│       └── metrics.py            # Prometheus + CloudWatch
│
├── evaluation/
│   ├── benchmark.py              # run full eval suite
│   ├── datasets/                 # test query-document pairs
│   └── results/                  # stored eval outputs (gitignored)
│
├── notebooks/
│   ├── 01_retrieval_analysis.ipynb   # chunk scoring exploration
│   ├── 02_hallucination_study.ipynb  # failure mode analysis
│   └── 03_ablation_study.ipynb       # component-wise ablation
│
├── tests/
│   ├── unit/
│   └── integration/
│
├── docs/
│   ├── architecture.md
│   ├── evaluation_methodology.md
│   └── deployment.md
│
├── requirements.txt
├── docker-compose.yml
└── README.md                     # you are here

Quickstart

# clone
git clone https://github.com/CHHemant/llm-validation-framework
cd llm-validation-framework

# install
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

# configure (copy and fill in keys)
cp .env.example .env

# run locally
uvicorn src.api.main:app --reload

# run evaluation suite
python evaluation/benchmark.py --dataset evaluation/datasets/test_500.json

.env.example:

ANTHROPIC_API_KEY=your-key
PINECONE_API_KEY=your-key
PINECONE_ENV=us-east-1-aws
DATABASE_URL=postgresql://...

Evaluation methodology

The hallucination checker uses a three-stage approach:

1. Retrieval faithfulness — does the answer only reference information present in retrieved chunks?

faithfulness_score = semantic_similarity(answer, retrieved_chunks)
# threshold: 0.72 → flag if below

2. Consistency — does the same query produce consistent answers across runs?

# run same query N=5 times, measure pairwise cosine similarity
consistency_score = mean_pairwise_similarity(responses)
# threshold: 0.85 → flag if below

3. Attribution — can every factual claim be traced to a source chunk?

# sentence-level attribution using NLI model
for sentence in answer:
    entailment_score = nli_model(sentence, retrieved_chunks)
    if entailment_score < 0.6:
        flag_for_review(sentence)

Full methodology in docs/evaluation_methodology.md.

Ablation study results

Tested which validation components contribute most to the accuracy gain:

Configuration	Retrieval Accuracy	vs. baseline
Baseline (no validation)	71%	—
+ Retrieval validator only	81%	+10%
+ Generation validator only	78%	+7%
+ Both validators	91%	+20%
+ Consistency check	94%	+23%

Takeaway: retrieval validation contributes more than generation validation in isolation, but combining both with consistency checking gives the biggest gain. Detailed analysis in notebooks/03_ablation_study.ipynb.

Limitations

Honest about what this doesn't do well:

Domain-specific hallucinations — the faithfulness checker uses general semantic similarity. For highly technical domains (medical, legal) it misses subtle factual errors.
Multi-hop reasoning — if the answer requires combining information from 3+ chunks, attribution accuracy drops to ~74%.
Non-English documents — only tested on English. Retrieval accuracy degrades significantly on Hindi/Telugu documents in our test set.
Latency at scale — at 50k+ records/day the validation overhead becomes the bottleneck. Needs async batching beyond that threshold.

Related work

RAGAS — evaluation framework for RAG pipelines (similar goal, different implementation approach)
TruLens — LLM evaluation with feedback functions
Anthropic's Constitutional AI — principles that informed the guardrail design

About

Built by Hemant Chilkuri — B.Tech AI, G.H. Raisoni University (2024–2028).

Microsoft Learn Student Ambassador · Phoenix AI Club founder

hemant_189@outlook.com · linkedin.com/in/hemantchilkuri

If you use this in your research or find it useful, a star helps others find it.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

LLM Output Validation & RAG Workflow Framework

What this does

The problem it solves

Architecture

Key results

Project structure

Quickstart

Evaluation methodology

Ablation study results

Limitations

Related work

About

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
docs		docs
evaluation		evaluation
notebooks		notebooks
src		src
tests		tests
.gitignore		.gitignore
README.md		README.md
docker-compose.yml		docker-compose.yml
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

LLM Output Validation & RAG Workflow Framework

What this does

The problem it solves

Architecture

Key results

Project structure

Quickstart

Evaluation methodology

Ablation study results

Limitations

Related work

About

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages