Explaining what makes M-CoT matter in complex multimodal reasoning tasks
Summary
Multimodal chain-of-thought (M-CoT) reasoning is increasingly applied to
vision-and-language models (VLMs) to improve their reasoning abilities on
multimodal reasoning tasks. However, while much research demonstrates the
effectiveness of M-CoT, the question of why this strategy improves the performance
of VLMs remains underexplored. In this work, we test whether M-CoT can improve
the performance of VLMs on multimodal reasoning tasks in a zero-shot setting.
We analyze the patterns in M-CoT rationales that most likely contribute to
improved VLM performance, in order to understand why they benefit the model, and
we design targeted experiments to identify the factors that matter for M-CoT’s
success. Our study shows that M-CoT can improve the accuracy of InstructBLIP 7B
by 10.71% and that of InstructBLIP 13B by 14.88% on the ScienceQA benchmark. The
M-CoT rationales that improve the performance of InstructBLIP contain information
about the image and commonsense knowledge, which might help the model reason
better and answer the question more accurately. The relevance of the textual part
of M-CoT to the question is important to the improvement in VLM results. The
validity of the reasoning chain in the textual part of M-CoT can significantly
affect the performance of VLMs on multimodal reasoning tasks. Finally, VLMs might
rely on the conclusion of the textual part of the M-CoT rationale to make their
decisions.