BiomedSQL is introduced as a benchmark for evaluating scientific reasoning in text-to-SQL generation over biomedical knowledge bases.
It consists of 68,000 question/SQL query/answer triples grounded in a harmonized BigQuery knowledge base that integrates gene-disease associations, causal inference, and drug approval records.
Rather than relying solely on syntactic translation, models must infer domain-specific criteria such as genome-wide significance thresholds and trial phase filtering.
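As a rough illustration of what inferring a domain-specific criterion can look like, the sketch below contrasts a literal translation of "which genes are significantly associated with Disease X?" with one that applies the conventional genome-wide significance threshold (p < 5e-8). The table `gwas_associations`, its columns, the gene and disease names, and the use of an in-memory SQLite database in place of BigQuery are all assumptions made for this example; they are not the benchmark's actual schema or data.

```python
import sqlite3

# Hypothetical schema and rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE gwas_associations (gene TEXT, disease TEXT, p_value REAL)"
)
conn.executemany(
    "INSERT INTO gwas_associations VALUES (?, ?, ?)",
    [
        ("GENE_A", "Disease X", 3e-12),
        ("GENE_B", "Disease X", 4e-6),   # listed, but not genome-wide significant
        ("GENE_C", "Disease X", 1e-9),
        ("GENE_D", "Disease Y", 2e-10),
    ],
)

# Conventional genome-wide significance threshold.
GENOME_WIDE_SIGNIFICANCE = 5e-8


def all_listed_genes(conn, disease):
    """Literal translation: ignores the implied significance criterion."""
    rows = conn.execute(
        "SELECT gene FROM gwas_associations WHERE disease = ? ORDER BY gene",
        (disease,),
    ).fetchall()
    return [row[0] for row in rows]


def significant_genes(conn, disease):
    """Domain-aware translation: adds the p-value filter the question implies."""
    rows = conn.execute(
        "SELECT gene FROM gwas_associations "
        "WHERE disease = ? AND p_value < ? ORDER BY gene",
        (disease, GENOME_WIDE_SIGNIFICANCE),
    ).fetchall()
    return [row[0] for row in rows]


naive = all_listed_genes(conn, "Disease X")
filtered = significant_genes(conn, "Disease X")
print(naive)
print(filtered)
```

Both queries are syntactically valid, but only the second encodes the threshold that a domain expert would read into the word "significantly".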
Evaluation across different language models shows a substantial performance gap: even the best custom agent achieves 62.6% execution accuracy, well below the expert baseline of 90.0%.
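Execution accuracy is commonly computed by running each predicted query and its reference query against the same database and counting how often the results agree. The benchmark's exact matching rule is not described here, so the following is only a minimal sketch: it assumes an order-insensitive comparison of result rows, treats a failing predicted query as incorrect, and uses a hypothetical `drugs` table in SQLite rather than BigQuery.

```python
import sqlite3
from collections import Counter


def run_query(conn, sql):
    """Return the result rows as a multiset, or None if the query fails."""
    try:
        return Counter(conn.execute(sql).fetchall())
    except sqlite3.Error:
        return None


def execution_accuracy(conn, pairs):
    """Fraction of (predicted_sql, gold_sql) pairs whose result sets match."""
    if not pairs:
        return 0.0
    correct = 0
    for predicted_sql, gold_sql in pairs:
        gold = run_query(conn, gold_sql)
        predicted = run_query(conn, predicted_sql)
        # Assumption: row order is ignored; a failed query never counts.
        if gold is not None and predicted == gold:
            correct += 1
    return correct / len(pairs)


# Hypothetical data, for illustration only.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE drugs (name TEXT, phase INTEGER)")
demo_conn.executemany(
    "INSERT INTO drugs VALUES (?, ?)",
    [("drug_a", 3), ("drug_b", 2), ("drug_c", 4)],
)

gold_sql = "SELECT name FROM drugs WHERE phase >= 3"
demo_pairs = [
    (gold_sql, gold_sql),                                                # identical
    ("SELECT name FROM drugs WHERE phase >= 3 ORDER BY name DESC", gold_sql),  # same rows, other order
    ("SELECT name FROM drugs", gold_sql),                                # missing phase filter
    ("SELEC name FROM drugs", gold_sql),                                 # syntax error
]
demo_accuracy = execution_accuracy(demo_conn, demo_pairs)
print(demo_accuracy)
```

Comparing results rather than query text means two differently written queries count as equivalent when they return the same rows, which is why a missing domain filter shows up as an error even when the SQL is otherwise well formed.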