
dc.rights.license  CC-BY-NC-ND
dc.contributor.advisor  Giachanou, Anastasia
dc.contributor.author  Kozák, Tamás
dc.date.accessioned  2025-08-28T01:00:47Z
dc.date.available  2025-08-28T01:00:47Z
dc.date.issued  2025
dc.identifier.uri  https://studenttheses.uu.nl/handle/20.500.12932/50086
dc.description.abstract  Chain-of-thought (CoT) reasoning has emerged as a powerful technique for enhancing the problem-solving capabilities of large language models (LLMs), particularly in complex tasks that require multi-step reasoning. Recent research has revealed that CoT explanations may not accurately represent models’ actual reasoning mechanisms: models can conceal flawed thought processes or alter answers without acknowledging external influences. This limitation compromises the reliability of CoT-based methods for safety supervision and alignment monitoring, since models may offer thorough but deceptive explanations for inaccurate responses. Current training and fine-tuning techniques therefore need to be evaluated for their effectiveness in improving both the accuracy and the faithfulness of chain-of-thought reasoning. This study shows that Group Relative Policy Optimization (GRPO) training outperforms Direct Preference Optimization (DPO) in larger models, with the Qwen2.5-14B-Instruct model attaining the highest scores across all evaluation metrics. Both approaches exhibit a positive correlation between model size and performance, but GRPO shows greater potential for improving faithfulness metrics, albeit with less consistent behavior at smaller model scales. These results suggest that GRPO offers a promising path toward more transparent AI systems. Our findings also highlight the trade-off between GRPO’s superior peak performance and DPO’s steadier scaling behavior, while raising questions about computational accessibility and the need for further development of faithfulness evaluation techniques as the demand for explainable AI grows across domains.
dc.description.sponsorship  Utrecht University
dc.language.iso  EN
dc.subject  This study examines chain-of-thought reasoning in AI models, where explanations can be misleading despite appearing convincing. Two training methods (GRPO vs. DPO) are compared to improve reasoning accuracy and faithfulness.
dc.title  A comparison of Chain-of-Thought Faithfulness: GRPO vs. DPO
dc.type.content  Master Thesis
dc.rights.accessrights  Open Access
dc.subject.courseuu  Applied Data Science
dc.thesis.id  52765
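
For context, the two training objectives compared in the abstract have standard published formulations (DPO: Rafailov et al., 2023; GRPO: Shao et al., 2024). A minimal sketch of those published definitions follows; the thesis’s exact configuration may differ.

DPO minimizes a pairwise preference loss against a frozen reference policy \pi_{\mathrm{ref}}, given a prompt x with preferred completion y_w and rejected completion y_l:

    \mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]

GRPO samples a group of G responses per prompt, scores them with a reward signal r_1, \dots, r_G, and normalizes each reward within its group to obtain a critic-free advantage that is then used in a PPO-style clipped objective with a KL penalty toward the reference policy:

    \hat{A}_i = \frac{r_i - \mathrm{mean}(r_1, \dots, r_G)}{\mathrm{std}(r_1, \dots, r_G)}

The group-normalized advantage removes PPO’s learned value function, but GRPO still requires online sampling and a reward signal, whereas DPO trains offline on static preference pairs; this difference underlies the computational-accessibility trade-off the abstract alludes to.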

