Metrics and Benchmarks for Evaluating Dutch Prompt-Based Text-to-SQL Systems

Raaij, Ruben van

dc.rights.license	CC-BY-NC-ND
dc.contributor.advisor	Behrisch, Michael
dc.contributor.author	Raaij, Ruben van
dc.date.accessioned	2025-08-15T00:03:57Z
dc.date.available	2025-08-15T00:03:57Z
dc.date.issued	2025
dc.identifier.uri	https://studenttheses.uu.nl/handle/20.500.12932/49754
dc.description.abstract	This thesis introduces a Collaborative Role-Oriented Workflow for SQL generation (CROW-SQL), a modular multi-agent framework designed to improve the reliability, accuracy, and interpretability of Text-to-SQL generation using Large Language Models (LLMs). Rather than relying on a monolithic prompting strategy, CROW-SQL decomposes the Structured Query Language (SQL) generation process into collaborative subtasks (query generation, schema suggestion, refinement, and orchestration), each handled by an independent, specialized agent. All agents are instantiated from the same LLM backend, primarily Gemini 2.0 Flash, ensuring a fair and controlled evaluation of agent behavior. To evaluate the system’s effectiveness, we benchmark CROW-SQL on two datasets: the academic Spider benchmark and the real-world BIg Bench for LaRge-scale Database Grounded Text-to-SQL Evaluation (BIRD) dataset. Experiments vary key parameters such as Query Generation Budget (QGB), few-shot prompt size, and agent composition. Evaluation metrics include execution accuracy, SQL correctness, structure correctness, skeleton similarity, Levenshtein distance, and runtime. A comparative study between Gemini 2.0 Flash and the lightweight variant of OpenAI’s Generative Pre-trained Transformer 4o (GPT4o-mini) highlights Gemini’s better performance in structural alignment and execution robustness within the multi-agent context. The results show that multi-agent configurations significantly outperform single-agent baselines, especially on complex queries. The Refiner Agent plays a critical role in recovering from execution failures. Optimal performance is achieved with a Query Generation Budget of 3, beyond which diminishing returns are observed. The modular architecture also enhances transparency, debugging, and deployability, making CROW-SQL particularly suitable for enterprise and compliance-focused applications.
This work contributes a reproducible, tool-augmented framework for agent-based Text-to-SQL reasoning, and sets the stage for future research in schema-aware prompting and adaptive agent routing for SQL generation.
dc.description.sponsorship	Utrecht University
dc.language.iso	EN
dc.subject	Metrics and Benchmarks for Evaluating Dutch Prompt-Based Text-to-SQL Systems
dc.title	Metrics and Benchmarks for Evaluating Dutch Prompt-Based Text-to-SQL Systems
dc.type.content	Master Thesis
dc.rights.accessrights	Open Access
dc.subject.keywords	LLM, Agents, Text-to-sql
dc.subject.courseuu	Data Science
dc.thesis.id	51687

Files in this item

Name: MscThesis_LLMSQL_RubenvRaaij_f ...
Size: 1.423Mb
Format: PDF

This item appears in the following Collection(s)

Theses
