        Utrecht University Student Theses Repository

        A comparison of Chain-of-Thought Faithfulness: GRPO vs. DPO

        File
        ADS_Master_s_thesis_Tamas_Kozak.pdf (798.2 KB)
        Publication date
        2025
        Author
        Kozák, Tamás
        Summary
        Chain-of-thought (CoT) reasoning has emerged as a powerful technique for enhancing the problem-solving capabilities of large language models (LLMs), particularly on complex tasks that require multi-step reasoning. Recent research has revealed, however, that CoT explanations may not accurately represent a model's actual reasoning process: models can conceal flawed thought processes or alter their answers without acknowledging external influences. This limitation compromises the reliability of CoT-based methods for safety supervision and alignment monitoring, since models may offer thorough but deceptive explanations for inaccurate responses. Current training and fine-tuning techniques therefore need to be evaluated for how effectively they improve both the accuracy and the faithfulness of chain-of-thought reasoning. This study shows that Group Relative Policy Optimization (GRPO) training achieves superior performance compared to Direct Preference Optimization (DPO) in larger models, with the Qwen2.5-14B-Instruct model attaining the highest scores across all evaluation metrics. Both approaches exhibit a positive correlation between model size and performance, but GRPO shows more potential for improving faithfulness metrics, albeit with less consistent behavior at smaller model scales. These results suggest that GRPO offers a promising path toward more transparent AI systems. Our findings also highlight the trade-off between GRPO's superior peak performance and DPO's steadier scaling behavior, while raising questions about computational accessibility and the need for further development of faithfulness evaluation techniques as demand for explainable AI grows across many domains.
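
        For context, the sketch below shows how a GRPO-versus-DPO fine-tuning comparison of the kind described above is commonly set up with the Hugging Face TRL library. It is an illustration only, not the thesis's actual code: GRPOTrainer, DPOTrainer, and their config classes are real TRL APIs, but the model choice, toy prompts, preference pairs, reward function, and hyperparameters here are hypothetical.

        # Illustrative sketch (not the thesis's code): GRPO vs. DPO with Hugging Face TRL.
        from datasets import Dataset
        from trl import DPOConfig, DPOTrainer, GRPOConfig, GRPOTrainer

        MODEL = "Qwen/Qwen2.5-0.5B-Instruct"  # small stand-in for the larger Qwen2.5 models studied

        # GRPO is online RL: the trainer samples a group of completions per prompt
        # and scores each with a reward function. Extra dataset columns (here
        # "answer") are forwarded to the reward function as keyword arguments.
        grpo_data = Dataset.from_dict({
            "prompt": ["What is 17 + 25? Think step by step.",
                       "What is 9 * 6? Think step by step."],
            "answer": ["42", "54"],
        })

        def exact_answer_reward(completions, answer, **kwargs):
            # Toy reward: 1.0 if the reference answer appears in the completion.
            return [1.0 if a in c else 0.0 for c, a in zip(completions, answer)]

        grpo_trainer = GRPOTrainer(
            model=MODEL,
            reward_funcs=exact_answer_reward,
            args=GRPOConfig(output_dir="qwen-grpo", num_generations=4,
                            per_device_train_batch_size=4),
            train_dataset=grpo_data,
        )
        grpo_trainer.train()

        # DPO is offline preference learning: it trains on (prompt, chosen,
        # rejected) pairs instead of using a reward function.
        dpo_data = Dataset.from_dict({
            "prompt": ["What is 17 + 25? Think step by step.",
                       "What is 9 * 6? Think step by step."],
            "chosen": ["17 + 20 = 37, and 37 + 5 = 42, so the answer is 42.",
                       "9 * 6 = 54, so the answer is 54."],
            "rejected": ["The answer is 43.", "The answer is 56."],
        })

        dpo_trainer = DPOTrainer(
            model=MODEL,
            args=DPOConfig(output_dir="qwen-dpo", per_device_train_batch_size=2),
            train_dataset=dpo_data,
        )
        dpo_trainer.train()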
        URI
        https://studenttheses.uu.nl/handle/20.500.12932/50086
        Collections
        • Theses