
dc.rights.license  CC-BY-NC-ND
dc.contributor.advisor  Giachanou, Anastasia
dc.contributor.author  Kozák, Tamás
dc.date.accessioned  2025-08-28T01:00:47Z
dc.date.available  2025-08-28T01:00:47Z
dc.date.issued  2025
dc.identifier.uri  https://studenttheses.uu.nl/handle/20.500.12932/50086
dc.description.abstract  Chain-of-thought (CoT) reasoning has emerged as a powerful technique for enhancing the problem-solving capabilities of large language models (LLMs), particularly in complex tasks that require multi-step reasoning. Recent research has revealed that CoT explanations may not accurately represent models’ actual reasoning mechanisms: models can conceal flawed thought processes or alter answers without acknowledging external influences. This limitation compromises the reliability of CoT-based methods for safety supervision and alignment monitoring, since models may offer thorough but deceptive explanations for inaccurate responses. Current training and fine-tuning techniques therefore need to be evaluated for their effectiveness in improving both the accuracy and the faithfulness of chain-of-thought reasoning. This study shows that Group Relative Policy Optimization (GRPO) training outperforms Direct Preference Optimization (DPO) in larger models, with the Qwen2.5-14B-Instruct model attaining the highest scores across all evaluation metrics. Both approaches exhibit a positive correlation between model size and performance, but GRPO shows greater potential for improving faithfulness metrics, albeit with less consistent behavior at smaller model scales. These results suggest that GRPO offers a promising path toward more transparent AI systems. Our findings also highlight the trade-off between GRPO’s superior peak performance and DPO’s steadier scaling behavior, while raising questions about computational accessibility and the need for further development of faithfulness evaluation techniques as the demand for explainable AI grows across domains.
dc.description.sponsorship  Utrecht University
dc.language.iso  EN
dc.subject  This study examines chain-of-thought reasoning in AI models, where explanations can be misleading despite appearing convincing. Two training methods (GRPO vs. DPO) are compared to improve reasoning accuracy and faithfulness.
dc.title  A comparison of Chain-of-Thought Faithfulness: GRPO vs. DPO
dc.type.content  Master Thesis
dc.rights.accessrights  Open Access
dc.subject.courseuu  Applied Data Science
dc.thesis.id  52765
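
For context, the two training objectives compared in the abstract have standard published formulations (DPO: Rafailov et al., 2023; GRPO: Shao et al., 2024). A minimal sketch of those published definitions follows; the thesis’s exact configuration may differ.

DPO minimizes a pairwise preference loss against a frozen reference policy \pi_{\mathrm{ref}}, given a prompt x with preferred completion y_w and rejected completion y_l:

    \mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]

GRPO samples a group of G responses per prompt, scores them with a reward signal r_1, \dots, r_G, and normalizes each reward within its group to obtain a critic-free advantage that is then used in a PPO-style clipped objective with a KL penalty toward the reference policy:

    \hat{A}_i = \frac{r_i - \mathrm{mean}(r_1, \dots, r_G)}{\mathrm{std}(r_1, \dots, r_G)}

The group-normalized advantage removes PPO’s learned value function, but GRPO still requires online sampling and a reward signal, whereas DPO trains offline on static preference pairs; this difference underlies the computational-accessibility trade-off the abstract alludes to.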

