A system developed as part of the Natural Language Processing (NLP) course at the University of Salerno, focused on translating natural language into SQL queries in multi-turn conversational settings.
The solution is based on the CoSQL dataset and employs fine-tuning of the open-source LLM deepseek-coder-1.3B-instruct using Low-Rank Adaptation (LoRA).
The model is trained with Parameter-Efficient Fine-Tuning (PEFT) and evaluated through standard metrics such as Question Match and Interaction Match, measuring the Exact Match between predicted and gold queries.
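To illustrate why LoRA makes this fine-tuning parameter-efficient, the sketch below shows the core idea on a single weight matrix: the pretrained weight `W` stays frozen, and only a low-rank update `B @ A` (scaled by `alpha / r`) is trained. The dimensions, rank `r = 8`, and `alpha = 16` are illustrative assumptions, not the project's actual hyperparameters.

```python
import numpy as np

# Hypothetical dimensions for one projection matrix; the real shapes
# depend on the deepseek-coder-1.3B-instruct architecture.
d_out, d_in, r = 2048, 2048, 8
alpha = 16  # LoRA scaling factor (assumed value)

rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in))      # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01   # trainable low-rank factor
B = np.zeros((d_out, r))                    # zero-initialized, so W' = W at the start

# Effective weight after adaptation: only A and B receive gradients.
W_adapted = W + (alpha / r) * (B @ A)

full_params = W.size
lora_params = A.size + B.size
print(f"trainable fraction: {lora_params / full_params:.4%}")  # → trainable fraction: 0.7813%
```

In practice the PEFT library applies this decomposition automatically to the selected target modules; the point of the sketch is that the trainable parameters shrink from `d_out * d_in` to `r * (d_out + d_in)`.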
- Python ≥ 3.13
- Conda (recommended for environment management)
- RAM ≥ 16 GB
- VRAM ≥ 6 GB
```shell
git clone https://github.com/cirovitale/text2sql
cd text2sql
```

- Download the CoSQL dataset from: https://yale-lily.github.io/cosql
- Extract the downloaded files
- Place the dataset folder in the `dataset/cosql_dataset/` directory
```shell
# Create the environment from the provided YAML file
conda env create -f environment.yml

# Activate the environment
conda activate unisa-nlp
```

To fine-tune the base model on the CoSQL dataset:

```shell
python training.py
```

To generate SQL queries from natural language prompts:

```shell
python inference.py
```

To compute evaluation metrics on the test set:

```shell
python testing.py
```

The evaluation includes:
- Question Match: Accuracy per individual question (exact SQL match)
- Interaction Match: Accuracy on the entire multi-turn interaction
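The two metrics can be sketched as follows. Note that `normalize` here is a naive lowercase/whitespace normalization introduced for illustration; the official CoSQL evaluation compares parsed SQL components rather than raw strings.

```python
def normalize(sql: str) -> str:
    # Naive normalization: lowercase and collapse whitespace.
    # The official evaluator performs component-level exact-set matching instead.
    return " ".join(sql.lower().split())

def question_and_interaction_match(interactions):
    """interactions: list of interactions, each a list of (predicted, gold) SQL pairs."""
    q_correct = q_total = i_correct = 0
    for turns in interactions:
        results = [normalize(pred) == normalize(gold) for pred, gold in turns]
        q_correct += sum(results)          # per-question exact matches
        q_total += len(results)
        i_correct += all(results)          # interaction counts only if every turn matches
    return q_correct / q_total, i_correct / len(interactions)

# Toy example: two interactions of two turns each.
interactions = [
    [("SELECT name FROM singer", "select name from singer"),
     ("SELECT count(*) FROM singer", "SELECT COUNT(*) FROM singer")],
    [("SELECT * FROM concert", "SELECT * FROM concert"),
     ("SELECT city FROM stadium", "SELECT name FROM stadium")],
]
qm, im = question_and_interaction_match(interactions)
print(qm, im)  # → 0.75 0.5
```

Interaction Match is strictly harder than Question Match: a single wrong turn makes the whole interaction count as incorrect, which is why the two scores diverge in the toy example above.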
```
text2sql/
├── training.py           # Training pipeline
├── inference.py          # Inference pipeline
├── testing.py            # Testing pipeline
├── environment.yml       # Conda environment specification
├── dataset/
│   └── cosql_dataset/    # Directory for CoSQL dataset
├── model-007/            # Selected fine-tuned model
│   └── checkpoint-1000/  # Selected fine-tuned checkpoint
└── Documentazione.pdf    # Documentation (Italian)
```
The complete project documentation, including the literature review, methodology, dataset description, training pipeline, and experimental results, is available in Italian in `Documentazione.pdf`.