cirovitale/text2sql

Fine-Tuned LLMs for Conversational Text-to-SQL

A system developed as part of the Natural Language Processing (NLP) course at the University of Salerno that translates natural language into SQL queries in multi-turn conversational settings. The solution is built on the CoSQL dataset and fine-tunes the open-source LLM deepseek-coder-1.3B-instruct with Low-Rank Adaptation (LoRA), a Parameter-Efficient Fine-Tuning (PEFT) technique. The model is evaluated with the standard Question Match and Interaction Match metrics, both based on Exact Match between predicted and gold queries.
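To give a rough sense of why LoRA is parameter-efficient, the sketch below counts the trainable parameters a low-rank update adds to a single weight matrix. The matrix size and rank are hypothetical values chosen for illustration only; they are not taken from this project's training configuration.

```python
def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters added by a rank-`rank` LoRA update to a d_out x d_in weight.

    LoRA keeps the original weight W frozen and learns W + B @ A,
    where B is d_out x rank and A is rank x d_in.
    """
    return rank * (d_out + d_in)


def full_params(d_out: int, d_in: int) -> int:
    """Parameters in the dense weight matrix itself."""
    return d_out * d_in


# Hypothetical sizes, assumed only for this example.
d_out, d_in, rank = 2048, 2048, 8
added = lora_trainable_params(d_out, d_in, rank)
dense = full_params(d_out, d_in)
print(f"LoRA adds {added} trainable parameters next to {dense} frozen ones "
      f"({added / dense:.2%} of the dense matrix)")
```

Only the two small factor matrices are updated during training, which is what keeps memory requirements low for a model of this size.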

Prerequisites

System Requirements

  • Python ≥ 3.13
  • Conda (recommended for environment management)
  • RAM ≥ 16 GB
  • VRAM ≥ 6 GB

Installation

1. Clone Repository

git clone https://github.com/cirovitale/text2sql
cd text2sql

2. Download Dataset

  1. Download the CoSQL dataset from: https://yale-lily.github.io/cosql
  2. Extract the downloaded files
  3. Place the dataset folder in the dataset/cosql_dataset/ directory of the cloned repository
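Before training, it can help to confirm that the dataset landed in the expected place. The following is a minimal sketch, assuming the command is run from the repository root; the helper name dataset_ready is hypothetical and not part of this repository, and it only checks that the folder exists and is non-empty, not that its contents are valid.

```python
from pathlib import Path


def dataset_ready(repo_root: str = ".") -> bool:
    """Return True if dataset/cosql_dataset/ exists and is not empty."""
    dataset_dir = Path(repo_root) / "dataset" / "cosql_dataset"
    return dataset_dir.is_dir() and any(dataset_dir.iterdir())


# Hypothetical check, run from the repository root.
if dataset_ready():
    print("CoSQL dataset folder found.")
else:
    print("Missing or empty: dataset/cosql_dataset/")
```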

3. Environment Setup with Conda

Create and Activate Conda Environment

# Create the environment from the provided YAML file
conda env create -f environment.yml

# Activate the environment
conda activate unisa-nlp

Usage

Training

# To fine-tune the base model on the CoSQL dataset:
python training.py

Inference

# To generate SQL queries from natural language prompts:
python inference.py

Evaluation

# To compute evaluation metrics on the test set:
python testing.py

The evaluation includes:

  • Question Match: Accuracy per individual question (exact SQL match)
  • Interaction Match: Accuracy over entire multi-turn interactions (an interaction counts as correct only if every question in it is matched)
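The relationship between the two metrics can be sketched as follows. This is a simplified illustration, not the logic of testing.py: it compares queries as normalized strings, whereas the official CoSQL evaluation compares parsed SQL components, so the numbers it produces are only an approximation. The function names and the sample queries are assumptions made for this example.

```python
def normalize(sql: str) -> str:
    """Lowercase, drop a trailing semicolon, and collapse whitespace."""
    return " ".join(sql.strip().rstrip(";").lower().split())


def question_and_interaction_match(interactions):
    """Compute (question_match, interaction_match).

    `interactions` is a list of interactions; each interaction is a list of
    (predicted_sql, gold_sql) pairs, one pair per conversational turn.
    """
    total_questions = correct_questions = correct_interactions = 0
    for turns in interactions:
        hits = [normalize(pred) == normalize(gold) for pred, gold in turns]
        total_questions += len(hits)
        correct_questions += sum(hits)
        # An interaction counts only if it is non-empty and every turn is correct.
        correct_interactions += bool(hits) and all(hits)
    qm = correct_questions / total_questions if total_questions else 0.0
    im = correct_interactions / len(interactions) if interactions else 0.0
    return qm, im


# Made-up predictions and gold queries, for illustration only.
sample = [
    [("SELECT name FROM singer", "select name from singer;"),
     ("SELECT count(*) FROM singer", "SELECT count(*) FROM singer")],
    [("SELECT * FROM dogs", "SELECT * FROM dogs"),
     ("SELECT age FROM dogs", "SELECT name FROM dogs")],
]
qm, im = question_and_interaction_match(sample)
print(f"Question Match: {qm:.2f}, Interaction Match: {im:.2f}")
```

In this sample three of four questions match, but only one of the two interactions is fully correct, which shows why Interaction Match is the stricter metric.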

Project Structure

text2sql/
├── training.py             # Training pipeline
├── inference.py            # Inference pipeline
├── testing.py              # Testing pipeline
├── environment.yml         # Conda environment specification
├── dataset/
│   └── cosql_dataset/      # Directory for CoSQL dataset
├── model-007/              # Selected fine-tuned model
│   └── checkpoint-1000/    # Selected fine-tuned checkpoint
└── Documentazione.pdf      # Documentation (Italian)

Documentation

The complete project documentation, covering the literature review, methodology, datasets, training pipeline, and experimental results, is available in Italian in Documentazione.pdf.
