Paper: Link 📝
Model: HuggingFace 🤗
Dataset: Kaggle 📊
Blog: Link 📝
This project investigates the effectiveness of Retrieval-Augmented Generation (RAG) and parameter-efficient fine-tuning (LoRA) for technical question answering across core Computer Science domains.
We evaluate four configurations of a large language model:
- Vanilla Mistral
- RAG + Vanilla
- LoRA Fine-Tuned
- RAG + Fine-Tuned
Key finding:
Retrieval significantly improves both lexical and semantic answer quality, while aggressive fine-tuning can degrade semantic coherence due to catastrophic forgetting.
The dataset contains technical interview question–answer pairs across core CS domains such as data structures, operating systems, databases, and computer networks.
Dataset creation involved seed curation, synthetic expansion, and filtering:
| Stage | Samples |
|---|---|
| Initial dataset | 2070 |
| Exact duplicates removed | 51 |
| After deduplication | 2019 |
| Semantic duplicates removed | 213 |
| Final dataset | 1806 |
Semantic filtering used MiniLM embeddings with a similarity threshold:
cosine similarity > 0.9
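A minimal sketch of this filtering step, assuming the sentence-transformers library (the example texts are illustrative, not drawn from the dataset):

```python
# Sketch of the semantic deduplication step: embed each sample with MiniLM and
# drop anything whose cosine similarity to an already-kept sample exceeds 0.9.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

texts = [
    "What is a hash table? A structure that maps keys to values via a hash function.",
    "Explain hash tables. They map keys to values using a hash function.",
    "What is a binary search tree? A tree that keeps keys in sorted order.",
]

embeddings = model.encode(texts, convert_to_tensor=True, normalize_embeddings=True)
similarity = util.cos_sim(embeddings, embeddings)

kept = []
for i in range(len(texts)):
    # Keep sample i only if it is not a near-duplicate of anything already kept.
    if all(similarity[i][j] <= 0.9 for j in kept):
        kept.append(i)

deduplicated = [texts[i] for i in kept]
print(f"Kept {len(deduplicated)} of {len(texts)} samples")
```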
| Split | Samples |
|---|---|
| Train | 1264 |
| Validation | 270 |
| Test | 272 |
| Total | 1806 |
Dataset available here:
Kaggle Dataset
Base model:
Mistral-7B-Instruct
Fine-tuning method:
LoRA (Low-Rank Adaptation)
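A minimal sketch of the LoRA setup with Hugging Face PEFT; the rank, alpha, dropout, target modules, and the exact base checkpoint version are assumptions for illustration, not the project's reported hyperparameters:

```python
# Sketch of a LoRA configuration with Hugging Face PEFT. Hyperparameters and
# the base checkpoint version are assumed for illustration.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

lora_config = LoraConfig(
    r=16,                 # low-rank dimension
    lora_alpha=32,        # scaling factor applied to the adapter update
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable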
Retrieval pipeline:
- Sentence Transformers (all-MiniLM-L6-v2)
- FAISS vector index
- Top-K semantic retrieval
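A minimal sketch of this pipeline, using a toy corpus and top-2 retrieval in place of the project's actual index:

```python
# Sketch of the retrieval pipeline: MiniLM embeddings, a FAISS index, and
# top-k semantic search. The corpus and k below are toy values for illustration.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

corpus = [
    "A balanced binary tree has left and right subtree heights differing by at most one.",
    "JWTs are stateless tokens often used to authenticate requests to distributed APIs.",
    "Round robin is a preemptive CPU scheduling algorithm that uses fixed time slices.",
]
corpus_emb = encoder.encode(corpus, normalize_embeddings=True)

# Inner product on L2-normalized vectors is equivalent to cosine similarity.
index = faiss.IndexFlatIP(corpus_emb.shape[1])
index.add(np.asarray(corpus_emb, dtype="float32"))

query = "How do I check whether a binary tree is balanced?"
query_emb = encoder.encode([query], normalize_embeddings=True)
scores, ids = index.search(np.asarray(query_emb, dtype="float32"), 2)  # top-2

retrieved = [corpus[i] for i in ids[0]]
```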
Model available here:
HuggingFace
The LoRA model converges quickly due to the domain-specific dataset.
| Model | BLEU-4 | ROUGE-L | BERTScore F1 |
|---|---|---|---|
| Vanilla | 0.0274 | 0.213 | 0.929 |
| RAG + Vanilla | 0.0515 | 0.298 | 0.890 |
| Fine-Tuned | 0.0561 | 0.287 | 0.889 |
| RAG + Fine-Tuned | 0.0380 | 0.252 | 0.871 |
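For reference, these metrics can be computed with standard libraries; a minimal sketch, assuming sacrebleu, rouge-score, and bert-score, with toy strings in place of the full test split:

```python
# Sketch of the metric computation using common libraries. The strings below
# are toy examples; the reported numbers come from the full 272-sample test split.
import sacrebleu
from rouge_score import rouge_scorer
from bert_score import score as bert_score

prediction = "Use recursion to compare the heights of the left and right subtrees."
reference = "Use a recursive function to check the height of each subtree."

bleu = sacrebleu.corpus_bleu([prediction], [[reference]]).score  # sacrebleu reports a 0-100 scale
rouge_l = rouge_scorer.RougeScorer(["rougeL"]).score(reference, prediction)["rougeL"].fmeasure
_, _, f1 = bert_score([prediction], [reference], lang="en")

print(f"BLEU-4: {bleu:.2f}  ROUGE-L: {rouge_l:.3f}  BERTScore F1: {f1.mean().item():.3f}")
```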
Vanilla Model
Base Mistral-7B-Instruct used without modification.
RAG + Vanilla
Retrieval augmented inference using FAISS with MiniLM embeddings.
Fine-Tuned Model
LoRA-based domain adaptation using technical QA dataset.
RAG + Fine-Tuned
Retrieval applied on top of LoRA fine-tuned model.
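A minimal sketch of how a retrieved passage is combined with the question at inference time; the checkpoint name, adapter path, and prompt template are assumptions for illustration:

```python
# Sketch of RAG inference on top of the base or LoRA-adapted model.
# Checkpoint name, adapter path, and prompt template are assumed.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "mistralai/Mistral-7B-Instruct-v0.2"  # assumed base checkpoint version
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id, device_map="auto")
# For the RAG + Fine-Tuned variant, load the LoRA adapter on top of the base model:
# model = PeftModel.from_pretrained(model, "path/to/lora-adapter")

question = "When would you choose JWT over Session Cookies?"
retrieved = ["JWTs are stateless tokens often used to authenticate requests to distributed APIs."]

prompt = (
    "Answer the question using the context below.\n"
    f"Context: {' '.join(retrieved)}\n"
    f"Question: {question}\nAnswer:"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
answer = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(answer)
```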
The following examples illustrate the qualitative difference between RAG + Fine-Tuned and RAG + Vanilla.
| Question | RAG + Fine-Tuned | RAG + Vanilla |
|---|---|---|
| Binary tree balanced check | Off-topic explanation about BST and hash tables | Recursive height-based solution |
| Pathway analysis | Fragmented output | Correct biological pathway explanation |
| JWT vs Session Cookies | Incomplete sentence | Correct explanation about stateless APIs |
| Seasonality in data | Empty output | Correct time-series explanation |
| Multiplayer game architecture | Truncated response | Structured client-server architecture |
Below are real examples from the evaluation set comparing RAG + Fine-Tuned and RAG + Vanilla outputs.
These examples illustrate a common failure pattern: the fine-tuned model often produces truncated or off-topic answers, while RAG + Vanilla generates grounded and complete responses.
Question
Implement a function to check if a binary tree is balanced.
Reference Answer
Use a recursive function to check the height of each subtree; return false if the difference is more than one.
| Model | Output |
|---|---|
| RAG + Fine-Tuned | "A Binary Search Tree is a binary tree where each node has a key and the keys are sorted in ascending order. A Hash Table is a data structure that uses a hash function..." |
| RAG + Vanilla | Provides a recursive solution that computes left and right subtree heights and checks if the difference exceeds one. |
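For comparison with the outputs above, a minimal Python sketch of the recursive height-based check described in the reference answer:

```python
# Recursive height-based balance check: return the subtree height, or -1 as a
# sentinel once any subtree is found to be unbalanced.
class Node:
    def __init__(self, left=None, right=None):
        self.left, self.right = left, right

def is_balanced(root) -> bool:
    def height(node) -> int:
        if node is None:
            return 0
        left, right = height(node.left), height(node.right)
        if left == -1 or right == -1 or abs(left - right) > 1:
            return -1  # propagate the imbalance upward
        return 1 + max(left, right)

    return height(root) != -1

print(is_balanced(Node(Node(), Node())))   # True: subtree heights differ by 0
print(is_balanced(Node(Node(Node()))))     # False: left chain is two levels deeper
```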
Question
When would you choose JWT over Session Cookies?
Reference Answer
JWT is stateless and scalable for microservices; Session Cookies are better for immediate revocation and server-side control.
| Model | Output |
|---|---|
| RAG + Fine-Tuned | "JWT is a compact," (truncated output) |
| RAG + Vanilla | Explains that JWTs are preferred for stateless APIs and distributed systems where server-side sessions are not required. |
These examples highlight how retrieval grounding helps the base model generate coherent technical explanations, while aggressive fine-tuning can introduce catastrophic forgetting and incomplete responses.
The experiments reveal a fine-tuning paradox:
- Fine-tuning improves lexical overlap metrics
- Semantic coherence degrades due to catastrophic forgetting
- Retrieval grounding improves both lexical and semantic quality
Therefore:
RAG + Vanilla LLM provides the most reliable configuration for technical QA tasks.
Released for research and educational use.

