        Metrics and Benchmarks for Evaluating Dutch Prompt-Based Text-to-SQL Systems

        View/Open
        MscThesis_LLMSQL_RubenvRaaij_final_draftv3.pdf (1.423Mb)
        Publication date
        2025
        Author
        Raaij, Ruben van
        Summary
        This thesis introduces the Collaborative Role-Oriented Workflow for SQL generation (CROW-SQL), a modular multi-agent framework designed to improve the reliability, accuracy, and interpretability of Text-to-SQL generation with Large Language Models (LLMs). Rather than relying on a monolithic prompting strategy, CROW-SQL decomposes Structured Query Language (SQL) generation into four collaborative subtasks (query generation, schema suggestion, refinement, and orchestration), each handled by an independent, specialized agent. All agents are instantiated from the same LLM backend, primarily Gemini 2.0 Flash, ensuring a fair and controlled evaluation of agent behavior. To evaluate the system’s effectiveness, we benchmark CROW-SQL on two datasets: the academic Spider benchmark and the real-world BIg Bench for LaRge-scale Database Grounded Text-to-SQL Evaluation (BIRD) dataset. Experiments vary key parameters such as the Query Generation Budget (QGB), few-shot prompt size, and agent composition. Evaluation metrics include execution accuracy, SQL correctness, structure correctness, skeleton similarity, Levenshtein distance, and runtime. A comparative study between Gemini 2.0 Flash and GPT4o-mini, the lightweight variant of OpenAI’s Generative Pre-trained Transformer 4o, highlights Gemini’s better performance in structural alignment and execution robustness within the multi-agent context. The results show that multi-agent configurations significantly outperform single-agent baselines, especially on complex queries. The Refiner Agent plays a critical role in recovering from execution failures. Optimal performance is achieved with a Query Generation Budget of 3, beyond which diminishing returns are observed. The modular architecture also enhances transparency, debugging, and deployability, making CROW-SQL particularly suitable for enterprise and compliance-focused applications. This work contributes a reproducible, tool-augmented framework for agent-based Text-to-SQL reasoning and sets the stage for future research in schema-aware prompting and adaptive agent routing for SQL generation.
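
        Two of the evaluation metrics named in the summary, execution accuracy and Levenshtein distance, can be illustrated with a minimal sketch. The summary does not give the thesis's exact definitions, so everything here is an assumption made for illustration: the function names, the toy `singer` table, and the choice to compare result sets as unordered multisets of rows are hypothetical and may differ from the thesis's own implementation.

```python
import sqlite3
from collections import Counter


def execution_match(conn, predicted_sql, gold_sql):
    """Return True when both queries run and yield the same multiset of rows.

    Assumption: row order is ignored; the thesis may compare results differently.
    """
    try:
        predicted_rows = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False  # a predicted query that fails to execute counts as a miss
    gold_rows = conn.execute(gold_sql).fetchall()
    return Counter(predicted_rows) == Counter(gold_rows)


def levenshtein(a, b):
    """Character-level edit distance computed with dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                        # delete char_a
                current[j - 1] + 1,                     # insert char_b
                previous[j - 1] + (char_a != char_b),   # substitute if different
            ))
        previous = current
    return previous[-1]


def build_demo_db():
    """Create a tiny in-memory database (hypothetical schema, illustration only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE singer (id INTEGER, name TEXT, age INTEGER)")
    conn.executemany(
        "INSERT INTO singer VALUES (?, ?, ?)",
        [(1, "Ann", 30), (2, "Bob", 45), (3, "Cy", 52)],
    )
    return conn


if __name__ == "__main__":
    conn = build_demo_db()
    gold = "SELECT name FROM singer WHERE age >= 45"
    predicted = "SELECT name FROM singer WHERE age > 40"
    print("execution match:", execution_match(conn, predicted, gold))
    print("edit distance:", levenshtein(predicted, gold))
```

        In this sketch the two queries differ as strings yet return the same rows on the toy data, which shows why a result-based metric and a string-based metric can disagree and are reported side by side.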
        URI
        https://studenttheses.uu.nl/handle/20.500.12932/49754
        Collections
        • Theses