Awesome-Text2SQL-Dataset

Awesome-Text2SQL-Dataset is a curated collection of datasets specifically designed for the Text-to-SQL task — the challenge of converting natural language questions into SQL queries. As a critical component of natural language interfaces to databases (NLIDB), Text2SQL plays a vital role in enabling users to interact with data using everyday language.

This repository aims to provide researchers, developers, and practitioners with a comprehensive list of datasets that support the development and evaluation of Text2SQL models. Whether you’re exploring schema linking, complex SQL generation, cross-domain generalization, or conversational query generation, you’ll find relevant datasets here to accelerate your work.
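
As an illustration of what evaluation on these datasets usually involves, the following is a minimal sketch of an execution-based comparison between a reference ("gold") query and a model-generated query. It assumes a toy in-memory SQLite database; the `singer` table, the sample question, and the `execution_match` helper are hypothetical and are not taken from any dataset listed here. Individual benchmarks define their own official metrics and evaluation scripts.

```python
import sqlite3


def execution_match(conn, gold_sql, predicted_sql):
    """Return True when both queries return the same rows, ignoring row order.

    Hypothetical helper for illustration only, not an official metric.
    """
    gold_rows = conn.execute(gold_sql).fetchall()
    try:
        predicted_rows = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        # A prediction that fails to execute counts as a mismatch.
        return False
    # Sort by repr so rows containing NULLs can still be ordered.
    return sorted(gold_rows, key=repr) == sorted(predicted_rows, key=repr)


# Toy database (assumed schema and rows).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE singer (id INTEGER PRIMARY KEY, name TEXT, country TEXT, age INTEGER);
    INSERT INTO singer VALUES
        (1, 'Ana', 'France', 34),
        (2, 'Bo', 'Japan', 27),
        (3, 'Cy', 'France', 41);
    """
)

question = "How many singers are from France?"
gold_sql = "SELECT COUNT(*) FROM singer WHERE country = 'France'"
# Written differently from the gold query, but equivalent on this data.
predicted_sql = "SELECT COUNT(id) FROM singer WHERE country = 'France'"

is_match = execution_match(conn, gold_sql, predicted_sql)
```

Comparing result sets rather than query strings lets differently written but equivalent queries count as correct, at the cost of occasional false positives when two different queries happen to agree on a small database.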

We welcome contributions to keep this list up-to-date and useful for the community.

🆕 Latest Datasets (2025)

Recent datasets that introduce new challenges such as data synthesis, error correction, and ambiguous query resolution. These datasets reflect the newest trends and tasks in Text2SQL research.

Dataset	Link	Desc
DLBench	[Paper] [Leaderboard]	2025/11, DLBENCH is a comprehensive benchmark used to evaluate the translation capabilities of large language models across different SQL dialects. The benchmark covers seven database management systems and 9,320 SQL dialect variants, with a total of 6,402 translation tasks.
BibSQL	[Paper] [Dataset]	2025/11, BibSQL is the first Chinese Text-to-SQL dataset specifically designed for document retrieval, containing 1,190 question-SQL pairs. Based on this dataset, a system was built that combines semantic retrieval (RAG), PoT (Thinking Process) prompting strategies, and a large language model (LLM) to improve accuracy and intelligence in document retrieval and related scenarios.
DySQL-Bench	[Paper] [Leaderboard]	2025/11, DySQL-Bench is a dynamic, multi-turn text-to-SQL benchmark suite. At its core, it constructs a three-party interactive environment containing a simulated user, an agent to be evaluated, and an executable database. This benchmark is used to evaluate the ability of a large language model to understand, execute, and dynamically correct SQL tasks, covering complete CRUD operations in multi-turn dialogues. It simulates the interaction process in real-world scenarios such as financial and business analytics, where users iteratively refine their query intent based on feedback.
DeKeyNLU	[Paper] [Dataset]	2025/11, DeKeyNLU provides joint fine-grained annotations for task decomposition and keyword extraction, produced through three layers of manual cross-validation. Building on this, the DeKeySQL framework integrates a dedicated understanding module into the RAG (retrieval-augmented generation) pipeline, prioritizing accurate semantic parsing to improve the accuracy and domain adaptability of SQL generation for complex queries.
GeoSQL-Bench	[Paper] [Leaderboard]	2025/11, GeoSQL-Eval is the first end-to-end automated evaluation framework for PostGIS environments, designed to measure the performance of large language models in geospatial database query generation (GeoSQL). The research also includes the release of the GeoSQL-Bench benchmark dataset, which contains 14,178 instances, 340 PostGIS functions, and 82 thematic databases.
DBASQL	[Paper] [Dataset]	2025/10, This paper addresses a limitation of current NL2SQL (Natural Language to SQL) systems, which mostly focus on data querying (DML, such as SELECT) while neglecting database administration (DBA) operations. The authors propose a method based on fine-tuning a large language model (LLM) to handle the daily tasks of database administrators (DBAs), including data definition (DDL), data control (DCL), and data manipulation (DML). To this end, the paper constructs a dedicated dataset called DBASQL and fine-tunes the T5-Large model on it, enabling the automatic generation of complex administrative SQL statements (such as table creation, field modification, and authorization) from natural language commands.
CORGI	[Paper] [Leaderboard]	2025/10, CORGI is a highly challenging Text-to-SQL benchmark suite specifically designed for the consulting field. It simulates real-world analytical scenarios from top consulting firms (such as McKinsey and Bain), covering 10 vertical industries including consumer platforms, retail, and digital services. This benchmark not only focuses on SQL syntax generation but also emphasizes the model's ability to handle deep business logic within extremely complex database architectures.
Payment-SQL	[Paper] [Dataset]	2025/09, Payment-SQL is an industry-grade dataset derived from real-world financial payments. Released as part of the SQLGovernor paper, it is specifically designed to evaluate the ability of LLMs to handle highly complex OLAP (Online Analytical Processing) queries.
Arabic WikiTableQA	[Paper] [Dataset]	2025/09, Arabic WikiTableQA is the first large-scale Arabic non-SQL table question answering benchmark, filling a gap in the field of table question answering (TableQA) for non-English languages.
LLMSQL	[Paper] [Dataset]	2025/09, LLMSQL is a systematic reconstruction and upgrade of the classic WikiSQL dataset that reorganizes it into a standard SQL format, aiming to make it suitable for generative tasks in the era of large language models (LLMs).
text2SQL4PM	[Paper][Dataset]	2025/08, text2SQL4PM is a bilingual (Portuguese–English) text-to-SQL benchmark dataset designed for the process mining domain. Tailored to address the unique challenges of process mining, it covers domain-specific terminology and single-table relational structures derived from event logs. The dataset includes 1,655 natural language statements (with human paraphrases), 205 SQL queries, and 10 qualifiers. Its construction combines expert curation, professional translation, and detailed annotation, enabling in-depth analysis of task complexity.
REEF	[Paper][Dataset]	2025/08, REEF consists of 18 interrelated tables (e.g., products, orders, users) with annotated data distributions that encode specific causal relationships among variables, enabling the construction of realistic causal graphs. This dataset is designed to evaluate large models’ capabilities in end-to-end causal analysis tasks.
CogniSQL	[Paper][Dataset]	2025/07, CogniSQL has released two categories of curated datasets that significantly advance research on scalable text-to-SQL generation aligned with execution. By open-sourcing these resources, the community gains direct access to high-precision SQL samples and clear reasoning paths, enabling lightweight reinforcement learning and reasoning-augmented text-to-SQL model training even under limited computational resources.
SQLStorm	[Paper][Dataset]	2025/07, SQLStorm uses large language models (LLMs) to generate SQL statements for database performance testing, aiming to address the limitations of traditional datasets such as TPC-H in SQL feature coverage. The dataset is compatible with major database systems including PostgreSQL, Umbra, and DuckDB, and part of its data is based on real databases provided by StackOverflow.
BIRD-Critic	[Paper][Leaderboard]	2025/06, BIRD-CRITIC (a.k.a. SWE-SQL), the first SQL diagnostic benchmark, is released to answer: Can large language models (LLMs) fix user issues in real-world database applications? The benchmark comprises 600 tasks for development and 200 held-out out-of-distribution (OOD) tests. BIRD-CRITIC 1.0 is built on realistic user issues across 4 prominent SQL dialects: MySQL, PostgreSQL, SQL Server, and Oracle. It expands beyond simple SELECT queries to cover a wider range of SQL operations, reflecting actual application scenarios. Finally, an optimized execution-based evaluation environment is included for rigorous and efficient validation.
BiomedSQL	[Paper][Dataset]	2025/05, BiomedSQL is a text-to-SQL benchmark designed to evaluate Large Language Models (LLMs) on scientific tabular reasoning tasks. It consists of curated question-SQL query-answer triples covering a variety of biomedical and SQL reasoning types. The benchmark challenges models to apply implicit scientific criteria rather than simply translating syntax.
LogicCat	[Paper][Dataset]	2025/05, LogicCat is a challenging Text-to-SQL dataset designed to test complex reasoning, including physical, arithmetic, commonsense, and hypothetical reasoning. It contains 4,038 English questions paired with SQL queries and 12,114 step-by-step reasoning annotations across 45 diverse databases. Experiments show state-of-the-art models struggle, achieving only 14.96% accuracy, but performance improves to 33.96% with chain-of-thought annotations, highlighting its potential for advancing reasoning-driven SQL generation.
TinySQL	[Paper][Dataset]	2025/03, TinySQL is a structured text-to-SQL dataset designed to support interpretability research by bridging toy examples and real-world tasks with controllable complexity.
NL2SQL-Bugs	[Paper][Dataset]	2025/03, NL2SQL-BUGs is a benchmark dedicated to detecting and categorizing semantic errors in Natural Language to SQL (NL2SQL) translation. While state-of-the-art NL2SQL models have made significant progress in translating natural language queries to SQL, they still frequently generate semantically incorrect queries that may execute successfully but produce incorrect results. This benchmark aims to support research in semantic error detection, which is a prerequisite for any subsequent error correction.
OmniSQL	[Paper][Dataset]	2025/03, SynSQL-2.5M, the dataset listed here under OmniSQL, is described by its authors as the largest and most diverse synthetic text-to-SQL dataset as of March 2025. It is intended for researchers and practitioners who want to explore the data and train text-to-SQL models on it.

🗂️ Datasets

These datasets focus on specific domains such as healthcare, finance, programming, and linguistics. They help evaluate how well Text2SQL systems generalize to specialized fields or languages.

Dataset	Link	Desc
WikiSQL	[Paper][Dataset]	2017/09, Salesforce proposes WikiSQL, a large Text-to-SQL dataset whose data comes from Wikipedia and belongs to a single domain. It contains 80,654 natural language questions and 77,840 SQL statements. The SQL statements are relatively simple and do not include sorting, grouping, subqueries, or other complex operations.
Spider 1.0	[Paper][Leaderboard]	2018/09, Yale University proposes Spider, a Text-to-SQL dataset with multiple databases, multiple tables, and single-turn queries. It is widely regarded as one of the most difficult large-scale cross-domain benchmarks. It contains 10,181 natural language questions and 5,693 SQL statements.
SParC	[Paper][Leaderboard]	2019/06, Yale University proposes SParC, a large dataset for complex, cross-domain, and context-dependent (multi-turn) semantic parsing and text-to-SQL tasks. It consists of 4,298 coherent question sequences (12k+ unique individual questions with SQL queries annotated by 14 Yale students), obtained from user interactions with 200 complex databases over 138 domains.
CSpider	[Paper][Leaderboard]	2019/09, Westlake University proposes CSpider, a large Chinese dataset for complex and cross-domain semantic parsing and text-to-SQL tasks, translated from Spider by 2 NLP researchers and 1 computer science student. It consists of 10,181 questions and 5,693 unique complex SQL queries on 200 databases with multiple tables covering 138 different domains.
CoSQL	[Paper][Leaderboard]	2019/09, Yale University and Salesforce Research propose CoSQL, a cross-domain conversational text-to-SQL corpus. It consists of 30k+ turns plus 10k+ annotated SQL queries, obtained from a Wizard-of-Oz (WOZ) collection of 3k dialogues querying 200 complex DBs spanning 138 domains.
KaggleDBQA	[Paper][Dataset]	2021/06, KaggleDBQA is a challenging cross-domain and complex evaluation dataset of real Web databases, with domain-specific data types, original formatting, and unrestricted questions.
Spider-Syn	[Paper][Dataset]	2021/06, Spider-Syn is a benchmark dataset designed to evaluate and enhance the robustness of Text-to-SQL models against synonym substitutions in natural language questions. Developed by researchers from Queen Mary University of London and collaborators, Spider-Syn is based on the original Spider dataset.
SEDE	[Paper][Dataset]	2021/06, SEDE (Stack Exchange Data Explorer) is a dataset for Text-to-SQL tasks with more than 12,000 SQL queries and their natural language descriptions. It is based on real usage of the Stack Exchange Data Explorer platform, which brings complexities and challenges not seen in other semantic parsing datasets, including complex nesting, date manipulation, numeric and text manipulation, parameters, and, most importantly, under-specification and hidden assumptions.
CHASE	[Paper][Dataset]	2021/08, CHASE is a large-scale and pragmatic Chinese dataset for the cross-database context-dependent text-to-SQL task (natural language interfaces for relational databases). It was released along with the authors' ACL 2021 paper, CHASE: A Large-Scale and Pragmatic Chinese Dataset for Cross-Database Context-Dependent Text-to-SQL.
Spider-DK	[Paper][Dataset]	2021/09, Spider-DK is a benchmark dataset designed to evaluate and enhance the robustness of Text-to-SQL models when handling domain knowledge. Developed by researchers from Queen Mary University of London, Spider-DK builds upon the original Spider dataset.
EHRSQL	[Paper][Dataset]	2023/01, EHRSQL is a large-scale, high-quality dataset designed for text-to-SQL question answering on Electronic Health Records from MIMIC-III and eICU. The dataset includes questions collected from 222 hospital staff, such as physicians, nurses, insurance reviewers, and health records teams.
BIRD-SQL	[Paper][Leaderboard]	2023/05, The University of Hong Kong and Alibaba propose BIRD, a large-scale cross-domain dataset that contains 12,751 unique question-SQL pairs and 95 big databases with a total size of 33.4 GB. It covers more than 37 professional domains, such as blockchain, hockey, healthcare, and education.
UNITE	[Paper][Dataset]	2023/05, UNITE is a unified benchmark composed of 18 publicly available text-to-SQL datasets, containing natural language questions from more than 12 domains, SQL queries from more than 3.9K patterns, and 29K databases. Compared to the widely used Spider benchmark, it introduces ∼120K additional examples and a threefold increase in SQL patterns, such as comparative and boolean questions.
Archer	[Paper] [Leaderboard]	2024/02, Archer is a challenging bilingual text-to-SQL dataset specific to complex reasoning, including arithmetic, commonsense and hypothetical reasoning. It contains 1,042 English questions and 1,042 Chinese questions, along with 521 unique SQL queries, covering 20 English databases across 20 domains.
BookSQL	[Paper][Dataset]	2024/06, BookSQL has 100k Query-SQL pairs, about 1.25 times the size of WikiSQL, the largest existing Text-to-SQL dataset. To design the queries, the authors consulted financial experts to understand various practical use cases. They also plan to create a leaderboard where researchers can benchmark Text-to-SQL models for the accounting domain.
Spider 2.0	[Paper] [Leaderboard]	2024/08, Spider 2.0, proposed by XLang AI, serves as an advanced evaluation framework for text-to-SQL tasks within real-world enterprise-level workflows. It contains 600 complex text-to-SQL workflow problems, derived from various enterprise database use cases. The dataset includes databases sourced from actual data applications, often containing over 1,000 columns, and stored in cloud or local systems like BigQuery, Snowflake, or PostgreSQL.
BEAVER	[Paper][Dataset]	2024/09, BEAVER is sourced from real enterprise data warehouses, together with natural language queries and their correct SQL statements collected from actual user history.
PRACTIQ	[Paper]	2024/10, PRACTIQ is a practical conversational text-to-SQL dataset with ambiguous and unanswerable queries.
TURSpider	[Paper][Dataset]	2024/11, TURSpider is a novel Turkish Text-to-SQL dataset that includes complex queries, akin to those in the original Spider dataset. TURSpider dataset comprises two main subsets: a dev set and a training set, aligned with the structure and scale of the popular Spider dataset. The dev set contains 1034 data rows with 1023 unique questions and 584 distinct SQL queries. In the training set, there are 8659 data rows, 8506 unique questions, and corresponding SQL queries.
synthetic_text_to_sql	[Dataset]	2024/11, gretelai/synthetic_text_to_sql is a rich dataset of high-quality synthetic Text-to-SQL samples, designed and generated using Gretel Navigator, and released under Apache 2.0.
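
Several entries in the table above (SParC, CoSQL, CHASE) are described as context-dependent or multi-turn. The sketch below shows what such an interaction can look like: later questions only make sense in light of earlier turns, while each turn is still paired with a complete SQL query. The `course` table, the questions, and the list-of-pairs layout are assumptions made for illustration and do not reproduce the file format of any of these datasets.

```python
import sqlite3

# Toy database (assumed schema and rows).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE course (id INTEGER PRIMARY KEY, title TEXT, dept TEXT, credits INTEGER);
    INSERT INTO course VALUES
        (1, 'Databases', 'CS', 4),
        (2, 'Compilers', 'CS', 3),
        (3, 'Algebra', 'Math', 4);
    """
)

# One interaction: (question, SQL) pairs in dialogue order.
# Turns 2 and 3 omit information that must be carried over from earlier turns.
interaction = [
    ("Show all courses in the CS department.",
     "SELECT title FROM course WHERE dept = 'CS' ORDER BY title"),
    ("Only those worth 4 credits.",
     "SELECT title FROM course WHERE dept = 'CS' AND credits = 4 ORDER BY title"),
    ("How many are there?",
     "SELECT COUNT(*) FROM course WHERE dept = 'CS' AND credits = 4"),
]

# Execute every turn's SQL against the shared database.
results = [conn.execute(sql).fetchall() for _, sql in interaction]
```

A model evaluated on this kind of data has to resolve "those" and "there" from the dialogue history before it can produce the second and third queries.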
