Explaining what makes M-CoT matter in complex multimodal reasoning tasks

Cen, Kaiwei

dc.rights.license	CC-BY-NC-ND
dc.contributor.advisor	Paperno, Denis
dc.contributor.author	Cen, Kaiwei
dc.date.accessioned	2024-09-12T23:02:36Z
dc.date.available	2024-09-12T23:02:36Z
dc.date.issued	2024
dc.identifier.uri	https://studenttheses.uu.nl/handle/20.500.12932/47743
dc.description.abstract	Multimodal chain-of-thought (M-CoT) reasoning is increasingly applied to Vision and Language Models (VLMs) to improve their performance on multimodal reasoning tasks. However, while much research demonstrates the effectiveness of M-CoT, why this strategy improves the performance of VLMs remains underexplored. In this work, we test whether M-CoT improves the performance of VLMs on multimodal reasoning tasks in a zero-shot setting. We analyze the patterns of M-CoT most likely to contribute to the improvement, to find out why they benefit the model’s performance, and we design experiments to explore what is important to M-CoT’s success. Our study shows that M-CoT can improve the accuracy of InstructBLIP 7B by 10.71%, and InstructBLIP 13B by 14.88%, on the ScienceQA benchmark. The M-CoT rationales that improve the performance of InstructBLIP contain information about the image and commonsense knowledge, which might help the model reason better and answer the question more accurately. The relevance of the textual part of M-CoT to the question is important to the improvement of results with VLMs. The validity of the reasoning chain in the textual part of M-CoT can significantly affect the performance of VLMs on multimodal reasoning tasks. VLMs might rely on the conclusion of the textual part of the M-CoT rationale to make their decisions.
dc.description.sponsorship	Utrecht University
dc.language.iso	EN
dc.subject	We employ the zero-shot CoT prompting strategy to evaluate the effectiveness of M-CoT in enhancing the reasoning capabilities of VLMs in multimodal reasoning tasks. Through a detailed analysis of M-CoT rationales, we seek to identify the key factors that contribute to its success in these tasks. To this end, we design and conduct ablation experiments to determine which components of M-CoT are most influential in driving performance improvements.
dc.title	Explaining what makes M-CoT matter in complex multimodal reasoning tasks
dc.type.content	Master Thesis
dc.rights.accessrights	Open Access
dc.subject.keywords	Chain-of-Thought Prompting, Vision and Language Models, Explainability
dc.subject.courseuu	Artificial Intelligence
dc.thesis.id	39220

Files in this item

Name: Master_Thesis_Kaiwei_Cen_15377 ...
Size: 6.042Mb
Format: PDF

This item appears in the following Collection(s)

Theses
