This project builds an end-to-end pipeline that converts natural-language queries into SQL using a fine-tuned Gemma 3 1B instruction-tuned model. Applying LoRA (Low-Rank Adaptation) fine-tuning on the sql-create-context dataset raises execution accuracy on a realistic library database from 71.7% to 94.0%.
Input:

```
Find all books written by J.R.R. Tolkien that have reviews
```

Output SQL:

```sql
SELECT b.title
FROM books b
JOIN reviews r ON b.id = r.book_id
WHERE b.author = 'J.R.R. Tolkien';
```

- Natural Language → SQL query generation
- Fine-tuned Gemma 3 model (LoRA)
- Execution-based evaluation using SQLite
- Supports JOINs, filtering, aggregations
- Lightweight (1B parameter model)
- Source: `b-mc2/sql-create-context` (Hugging Face)
- 7,000 training samples
- 300 evaluation samples
- Token limit: 512
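
To make the data preparation concrete, here is a minimal sketch of how one dataset row could be turned into a prompt/target pair under the 512-token limit. It assumes the dataset's `question`, `context`, and `answer` fields and reuses the prompt wording from the inference example in this README. The helper `build_example`, the whitespace-based token count, and the choice to drop (rather than truncate) over-length samples are assumptions for illustration, not the project's confirmed preprocessing.

```python
# Hypothetical prompt template, mirroring the inference prompt in this README.
PROMPT_TEMPLATE = (
    "Generate one SQL query using the schema below.\n"
    "Schema:\n{schema}\n"
    "Question: {question}\n"
)


def build_example(sample, max_tokens=512, count_tokens=None):
    """Turn one dataset row into (prompt, target), or None if it is too long."""
    prompt = PROMPT_TEMPLATE.format(
        schema=sample["context"], question=sample["question"]
    )
    target = sample["answer"]
    # Whitespace splitting is only a stand-in for the real tokenizer.
    count = count_tokens or (lambda text: len(text.split()))
    if count(prompt) + count(target) > max_tokens:
        return None  # assumption: over-length samples are dropped
    return prompt, target


# Illustrative row in the shape of the dataset (question, context, answer).
sample = {
    "question": "How many heads of the departments are older than 56?",
    "context": "CREATE TABLE head (age INTEGER)",
    "answer": "SELECT COUNT(*) FROM head WHERE age > 56",
}

result = build_example(sample)
print(result[0])
print(result[1])
```

In practice, `count_tokens` would be replaced by a function that calls the model's tokenizer, so that the limit is measured in real tokens rather than words.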
Custom library database:
- `books` (70 records)
- `users` (60 records)
- `reviews` (62 records)
- `checkout` (60 records)
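
As a rough illustration of how such a database can be queried, the sketch below builds a tiny in-memory SQLite database and runs the example query from the top of this README. Only `books.id`, `books.title`, `books.author`, and `reviews.book_id` come from that query; every other column, and all of the rows, are placeholders invented for this example rather than the project's actual schema or data.

```python
import sqlite3

# In-memory database; the schema is a guess limited to what the example query needs.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT, author TEXT);
    CREATE TABLE reviews (
        id INTEGER PRIMARY KEY,
        book_id INTEGER REFERENCES books(id),
        rating INTEGER  -- hypothetical column
    );
    """
)

# Placeholder rows, not taken from the project's database.
conn.executemany(
    "INSERT INTO books VALUES (?, ?, ?)",
    [
        (1, "The Hobbit", "J.R.R. Tolkien"),
        (2, "The Silmarillion", "J.R.R. Tolkien"),
        (3, "Dune", "Frank Herbert"),
    ],
)
conn.executemany("INSERT INTO reviews VALUES (?, ?, ?)", [(1, 1, 5), (2, 3, 4)])

query = """
SELECT b.title
FROM books b
JOIN reviews r ON b.id = r.book_id
WHERE b.author = 'J.R.R. Tolkien';
"""
rows = conn.execute(query).fetchall()
print(rows)  # [('The Hobbit',)]
```

Only the reviewed Tolkien book is returned, because the inner join drops books that have no matching row in `reviews`.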
- Base Model: `unsloth/gemma-3-1b-it`
- Fine-tuning: LoRA (4-bit quantization via Unsloth)
- Hardware: NVIDIA T4 (16GB VRAM)
- Frameworks: Transformers, PEFT, TRL, Datasets
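
To show why LoRA keeps fine-tuning feasible on a single T4, the following sketch compares the number of trainable parameters for one weight matrix with and without a low-rank adapter. LoRA freezes the original matrix W (shape d_out × d_in) and trains two small matrices B (d_out × r) and A (r × d_in), so the adapter adds r × (d_in + d_out) parameters. The dimensions and rank below are hypothetical and are not the project's actual configuration or Gemma's real layer sizes.

```python
def lora_param_counts(d_out, d_in, rank):
    """Return (full_params, lora_params) for one adapted weight matrix."""
    full_params = d_out * d_in           # training W directly
    lora_params = rank * (d_in + d_out)  # training B (d_out x r) and A (r x d_in)
    return full_params, lora_params


# Hypothetical layer size and rank, chosen only for illustration.
full, lora = lora_param_counts(d_out=1024, d_in=1024, rank=16)
print(f"full fine-tuning: {full} parameters")
print(f"LoRA adapter:     {lora} parameters")
print(f"ratio:            {lora / full:.4f}")
```

With these illustrative numbers the adapter trains only a small fraction of the parameters in the matrix, which is what lets optimizer state and gradients fit alongside a 4-bit quantized base model.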
```bash
git clone https://github.com/ttran569/CS-468-NLP-to-SQL-Final-Project.git
cd CS-468-NLP-to-SQL-Final-Project
pip install -r requirements.txt
```

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub
model = AutoModelForCausalLM.from_pretrained("rahul042/gemma_3_better")
tokenizer = AutoTokenizer.from_pretrained("rahul042/gemma_3_better")

prompt = """
Generate one SQL query using the schema below.
Schema:
[...]
Question: Find all books published after 2000
"""

# Tokenize the prompt and generate up to 128 new tokens
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)

# The decoded text contains the prompt followed by the generated SQL
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

- Exact Match – String equality
- Execution Accuracy – Correct query results
- Syntax Error Rate – Invalid SQL
- Wrong Result Rate – Incorrect outputs
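
The four metrics above can be sketched with Python's built-in `sqlite3` module. This is a minimal illustration, not the project's evaluation script: the function names, the order-insensitive comparison of result rows, and the decision to count any SQLite error as a syntax error are all assumptions. The `books(title, year)` table used in the demo is likewise made up.

```python
import sqlite3


def run_query(conn, sql):
    """Return the result rows in a canonical order, or None if SQLite rejects the query."""
    try:
        # Sorting by repr makes the comparison order-insensitive (an assumption).
        return sorted(conn.execute(sql).fetchall(), key=repr)
    except sqlite3.Error:
        return None


def evaluate(conn, predicted_sql, gold_sql):
    """Score one prediction: exact string match plus one execution outcome."""
    exact = predicted_sql.strip() == gold_sql.strip()
    predicted_rows = run_query(conn, predicted_sql)
    if predicted_rows is None:
        outcome = "syntax_error"  # any SQLite error is counted here
    elif predicted_rows == run_query(conn, gold_sql):
        outcome = "correct"
    else:
        outcome = "wrong_result"
    return {"exact_match": exact, "outcome": outcome}


# Tiny demo database with a hypothetical schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (title TEXT, year INTEGER)")
conn.executemany("INSERT INTO books VALUES (?, ?)", [("A", 1999), ("B", 2005)])

gold = "SELECT title FROM books WHERE year > 2000"
same_rows = evaluate(conn, "SELECT b.title FROM books b WHERE b.year > 2000", gold)
invalid = evaluate(conn, "SELEC title FROM books", gold)
wrong = evaluate(conn, "SELECT title FROM books", gold)
print(same_rows, invalid, wrong, sep="\n")
```

The first prediction shows why execution accuracy can be much higher than exact match: the query text differs from the reference, but it returns the same rows.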
| Metric | Base Model | Fine-tuned Model |
|---|---|---|
| Exact Match | 4.0% | 69.7% |
| Execution Accuracy | 71.7% | 94.0% |
| Syntax Error Rate | 15.0% | 2.7% |
| Wrong Result Rate | 13.3% | 3.3% |
- +22.3 percentage-point improvement in execution accuracy (71.7% → 94.0%)
- 82% relative reduction in syntax errors (15.0% → 2.7%)
- Reliable SQL generation for real-world queries
- Efficient small-model fine-tuning
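
The two headline figures follow directly from the results table. The short calculation below reproduces them: the execution-accuracy gain is an absolute difference in percentage points, while the syntax-error reduction is relative to the base model's error rate. The variable names are illustrative only.

```python
# Values copied from the results table (percent).
base = {"execution_accuracy": 71.7, "syntax_error_rate": 15.0}
tuned = {"execution_accuracy": 94.0, "syntax_error_rate": 2.7}

# Absolute gain in percentage points.
abs_gain = tuned["execution_accuracy"] - base["execution_accuracy"]

# Relative reduction, as a fraction of the base model's error rate.
rel_drop = (base["syntax_error_rate"] - tuned["syntax_error_rate"]) / base[
    "syntax_error_rate"
]

print(f"Execution accuracy gain: {abs_gain:.1f} percentage points")  # 22.3
print(f"Syntax error reduction:  {rel_drop:.0%}")  # 82%
```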
```
├── data/
├── model/
├── evaluation/
├── database/
├── notebooks/
└── README.md
```
- Expand dataset (20k+ samples)
- Schema linking improvements
- Multi-turn query support
- Faster batched inference
- 🤗 Model: https://huggingface.co/rahul042/gemma_3_better
- 📂 GitHub: https://github.com/ttran569/CS-468-NLP-to-SQL-Final-Project
- Rahul S.
- Jonathan C.
- Kunhao L.
- Thomas T.
⭐ If you find this project useful, consider starring the repo!