A Comparison of Chain-of-Thought Faithfulness: GRPO vs. DPO
Summary
Chain-of-thought (CoT) reasoning has emerged as a powerful technique
for enhancing the problem-solving capabilities of large language models
(LLMs), particularly in complex tasks that require multi-step reasoning. Recent
research has revealed that CoT explanations may not accurately represent
models’ actual reasoning mechanisms, as models can conceal flawed
thought processes or alter answers without acknowledging external influences.
This limitation compromises the reliability of CoT-based methods for
safety supervision and alignment monitoring, since models may offer thorough
but deceptive explanations for inaccurate responses. Current training
and fine-tuning techniques therefore need to be evaluated for how effectively
they improve both the accuracy and the faithfulness of CoT reasoning.
This study shows that, for larger models, Group Relative Policy Optimization
(GRPO) training outperforms Direct Preference Optimization (DPO), with the
Qwen2.5-14B-Instruct model attaining the highest scores across all evaluation
metrics. Both approaches exhibit a positive correlation between model size and
performance, but GRPO shows greater potential for improving faithfulness
metrics, albeit with less consistent behavior at smaller model scales. These
results suggest that GRPO offers a promising path toward more transparent
AI systems.
Our findings also highlight the trade-off between GRPO’s superior peak
performance and DPO’s steadier scaling behavior, while raising questions
about computational accessibility and the need for further development
of faithfulness evaluation techniques as the demand for explainable
AI grows across domains.
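
For reference, the two training objectives compared in this study have standard
published forms. The LaTeX sketch below restates them using commonly used notation
that is not defined in this summary: a trainable policy \pi_\theta, a frozen
reference policy \pi_{\mathrm{ref}}, preferred and dispreferred responses y_w and
y_l, a temperature \beta, and per-group rewards r_1, \dots, r_G for G sampled
completions.

% DPO: a preference-classification loss that widens the log-likelihood margin
% between the preferred response y_w and the dispreferred response y_l,
% measured relative to the reference policy.
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right) \right]

% GRPO: each sampled completion's advantage is its reward normalized against
% the other completions in the same group, removing the need for a learned critic.
\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}
                 {\operatorname{std}(r_1, \dots, r_G)},
\qquad i = 1, \dots, G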