Text-to-SQL research relies heavily on high-quality datasets for training and evaluating models.
Recent years have seen the creation of various Text2SQL datasets such as Spider and BIRD-SQL, along with their associated evaluation leaderboards.
New datasets introduced in 2025 include NL2SQL-Bugs, which is dedicated to identifying semantic errors, and OmniSQL, described as the largest cross-domain synthetic dataset.
TINYSQL offers a structured text-to-SQL dataset for interpretability research, with query tasks ranging from basic to advanced so that model behavior can be analyzed.
The article lists various datasets such as WikiSQL, Spider, SParC, CSpider, CoSQL, and more, emphasizing complexity, cross-domain challenges, and application scenarios.
Datasets such as SEDE and CHASE introduce distinctive text-to-SQL challenges, including complex nesting, date manipulation, and pragmatic, context-specific tasks.
The article also acknowledges EHRSQL for healthcare data, BIRD-SQL as a cross-domain dataset, and Archer as a bilingual text-to-SQL dataset.
More recent efforts such as BEAVER, built from real enterprise data, and PRACTIQ, a conversational text-to-SQL dataset, continue to advance the field.
Diverse datasets like TURSpider in Turkish and synthetic_text_to_sql for high-quality synthetic samples further enrich the text-to-SQL research landscape.
Developers are encouraged to explore these datasets, contribute to advancements, and leverage the vast resource of publicly available datasets to improve text-to-SQL models.
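As a rough illustration of how such datasets are typically used for evaluation, the sketch below compares a predicted query against a gold query by executing both and comparing the returned rows. It is a minimal sketch under stated assumptions, not the official evaluator of any dataset named above: the `singer` table, the sample rows, and the helper names `run_query` and `execution_match` are hypothetical, and the order-insensitive row comparison is a simplification.

```python
import sqlite3
from collections import Counter


def run_query(conn, sql):
    """Execute a query and return its rows, or None if the SQL fails to run."""
    try:
        return conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None


def execution_match(conn, predicted_sql, gold_sql):
    """Return True when both queries run and yield the same multiset of rows.

    Row order is ignored here, which is a simplification (hypothetical helper,
    not taken from any benchmark's evaluation script).
    """
    gold_rows = run_query(conn, gold_sql)
    pred_rows = run_query(conn, predicted_sql)
    if gold_rows is None or pred_rows is None:
        return False
    return Counter(gold_rows) == Counter(pred_rows)


# Hypothetical toy database standing in for one benchmark schema.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE singer (id INTEGER PRIMARY KEY, name TEXT, country TEXT);
    INSERT INTO singer VALUES
        (1, 'Ana', 'France'),
        (2, 'Bo', 'Japan'),
        (3, 'Cy', 'France');
    """
)

gold = "SELECT name FROM singer WHERE country = 'France'"
good = "SELECT s.name FROM singer AS s WHERE s.country = 'France' ORDER BY s.name DESC"
bad = "SELECT name FROM singer WHERE country = 'Japan'"

print(execution_match(conn, good, gold))
print(execution_match(conn, bad, gold))
```

Comparing executed results rather than query strings lets differently written but equivalent queries count as correct, although it can also accept a wrong query that happens to return the same rows on a small database.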