Metrics and Benchmarks for Evaluating Dutch Prompt-Based Text-to-SQL Systems

Raaij, Ruben van

Publication date

2025

Summary

This thesis introduces the Collaborative Role-Oriented Workflow for SQL generation (CROW-SQL), a modular multi-agent framework designed to improve the reliability, accuracy, and interpretability of Text-to-SQL generation with Large Language Models (LLMs). Rather than relying on a monolithic prompting strategy, CROW-SQL decomposes Structured Query Language (SQL) generation into four collaborative subtasks: query generation, schema suggestion, refinement, and orchestration. Each subtask is handled by an independent, specialized agent. All agents are instantiated from the same LLM backend, primarily Gemini 2.0 Flash, so that agent behavior can be evaluated in a fair and controlled way. To evaluate the system’s effectiveness, we benchmark CROW-SQL on two datasets: the academic Spider benchmark and the real-world BIg Bench for LaRge-scale Database Grounded Text-to-SQL Evaluation (BIRD) dataset. Experiments vary key parameters such as the Query Generation Budget (QGB), few-shot prompt size, and agent composition. Evaluation metrics include execution accuracy, SQL correctness, structure correctness, skeleton similarity, Levenshtein distance, and runtime. A comparative study between Gemini 2.0 Flash and the lightweight variant of OpenAI’s Generative Pre-trained Transformer 4o (GPT4o-mini) highlights Gemini’s better performance in structural alignment and execution robustness within the multi-agent context. The results show that multi-agent configurations significantly outperform single-agent baselines, especially on complex queries. The Refiner Agent plays a critical role in recovering from execution failures. Optimal performance is achieved with a Query Generation Budget of 3, beyond which diminishing returns are observed. The modular architecture also enhances transparency, debugging, and deployability, making CROW-SQL particularly suitable for enterprise and compliance-focused applications.
This work contributes a reproducible, tool-augmented framework for agent-based Text-to-SQL reasoning, and sets the stage for future research in schema-aware prompting and adaptive agent routing for SQL generation.
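
Two of the metrics named in the summary, execution accuracy and Levenshtein distance, can be illustrated with a short sketch. This is not the thesis's evaluation code: the SQLite backend, the toy `singer` table, and the helper names `levenshtein`, `execution_match`, `execution_accuracy`, and `build_demo_db` are all assumptions made for illustration, and comparing results as unordered row collections is one common convention rather than the one the thesis necessarily uses.

```python
import sqlite3


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character from a
                            curr[j - 1] + 1,      # insert a character into a
                            prev[j - 1] + cost))  # substitute (or keep) a character
        prev = curr
    return prev[-1]


def execution_match(conn: sqlite3.Connection, predicted_sql: str, gold_sql: str) -> bool:
    """True when the predicted query runs and returns the same rows as the gold query.

    Row order is ignored (an assumption of this sketch).
    """
    try:
        pred_rows = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        # A query that fails to execute counts as a miss.
        return False
    gold_rows = conn.execute(gold_sql).fetchall()
    return sorted(pred_rows, key=repr) == sorted(gold_rows, key=repr)


def execution_accuracy(conn: sqlite3.Connection, pairs: list[tuple[str, str]]) -> float:
    """Fraction of (predicted, gold) query pairs whose results match."""
    if not pairs:
        return 0.0
    hits = sum(execution_match(conn, pred, gold) for pred, gold in pairs)
    return hits / len(pairs)


def build_demo_db() -> sqlite3.Connection:
    """Hypothetical in-memory database used only for this example."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE singer (id INTEGER, name TEXT, age INTEGER)")
    conn.executemany(
        "INSERT INTO singer VALUES (?, ?, ?)",
        [(1, "Ann", 30), (2, "Bob", 45), (3, "Cy", 52)],
    )
    return conn


if __name__ == "__main__":
    conn = build_demo_db()
    pairs = [
        # Different text, same result set: counts as correct.
        ("SELECT name FROM singer WHERE age > 40",
         "SELECT name FROM singer WHERE age >= 45"),
        # Misspelled column: execution fails, counts as incorrect.
        ("SELECT nme FROM singer", "SELECT name FROM singer"),
    ]
    print(execution_accuracy(conn, pairs))
    print(levenshtein(pairs[0][0], pairs[0][1]))
```

The first pair shows why the two metrics complement each other: the queries differ as strings, so their edit distance is non-zero, yet they return identical rows on this database and therefore count as an execution match.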

URI

https://studenttheses.uu.nl/handle/20.500.12932/49754

Collections

Theses