A Multi-Modal Approach to Open Domain Question Answering: Dense Retrieval of Regions-of-Interest
Summary
This master's thesis proposes a multi-modal retrieval system for open-domain question answering that builds upon dense passage retrievers and incorporates multi-modal information to retrieve candidate regions of interest (ROIs) from document images given a user query. Our main research goal was to investigate the efficacy of dense representations of questions and multi-modal contexts in retrieving relevant content, and to evaluate the impact of multi-modal information compared to uni-modal baselines. To this end, the study leverages the VisualMRC dataset, which offers annotations for visual components, in particular ROIs such as titles or graphs, to facilitate efficient content retrieval.

The proposed methodology involves pre-processing the multi-modal ROIs, employing a bi-encoder setup to encode the question and ROIs separately, and using these encodings to compute similarity in a shared multi-dimensional embedding space. The model is trained with a contrastive objective: given a question together with one positive and k negative contexts, we minimize the negative log-likelihood of the positive ROI.

We evaluate our trained models under three modality scenarios (text-only, vision-only, and multi-modal) and measure their retrieval performance with standard metrics: Normalized Discounted Cumulative Gain @ k, Mean Reciprocal Rank @ k, and Recall @ k. The results reveal the benefits of both the vision-only and multi-modal approaches over the text-only one, while also highlighting challenges related to the number of negative ROIs. Our results support the first hypothesis (the efficacy of dense multi-modal representations for retrieval) but raise questions about the second, suggesting that the inclusion of layout information may not always improve retrieval performance.

The strengths of our approach include efficient ROI retrieval and dataset adaptability, while its limitations involve dataset variability and encoding techniques. In light of this, we suggest several avenues for future work, such as exploring new datasets, incorporating hard negatives in contrastive learning, and refining ROI dissimilarity. Additionally, we speculate that integrating keyword matching and retrieval-augmented generation approaches could enhance the retrieval pipeline. Overall, this thesis aims to advance research on multi-modal retrieval models, emphasizing the importance of visual and textual context for open-domain question answering.
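To make the training objective concrete, the contrastive loss described above can be written in the style of the dense passage retriever formulation the thesis builds upon. The notation is ours: E_Q and E_R denote the question and ROI encoders of the bi-encoder, r+ the positive ROI, and r-_1, ..., r-_k the k negatives; similarity is the dot product of the two encodings in the shared embedding space.

```latex
\[
  \mathrm{sim}(q, r) = E_Q(q)^{\top} E_R(r), \qquad
  \mathcal{L}\bigl(q, r^{+}, r^{-}_{1}, \ldots, r^{-}_{k}\bigr)
  = -\log
    \frac{e^{\mathrm{sim}(q,\, r^{+})}}
         {e^{\mathrm{sim}(q,\, r^{+})} + \sum_{j=1}^{k} e^{\mathrm{sim}(q,\, r^{-}_{j})}}
\]
```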
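The following is a minimal sketch of the bi-encoder scoring step and the negative log-likelihood loss, assuming dot-product similarity and embeddings already produced by the two encoders (the random tensors below merely stand in for encoder outputs; the actual encoder architectures are described in the thesis body).

```python
import torch
import torch.nn.functional as F

def contrastive_nll(q_emb: torch.Tensor, pos_emb: torch.Tensor,
                    neg_embs: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of the positive ROI.

    q_emb:    (d,)   question embedding
    pos_emb:  (d,)   embedding of the positive ROI
    neg_embs: (k, d) embeddings of the k negative ROIs
    """
    # Stack the positive and negative ROI embeddings; the positive
    # sits at index 0.
    candidates = torch.cat([pos_emb.unsqueeze(0), neg_embs], dim=0)  # (k+1, d)
    # Dot-product similarities in the shared embedding space.
    scores = candidates @ q_emb                                      # (k+1,)
    # Softmax over all candidates, then take -log prob of the positive.
    return -F.log_softmax(scores, dim=0)[0]

# Toy usage with random embeddings standing in for encoder outputs.
d, k = 128, 7
loss = contrastive_nll(torch.randn(d), torch.randn(d), torch.randn(k, d))
```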
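For reference, the three evaluation metrics reduce to simple per-question quantities under the simplifying assumption, made here only for illustration, that each question has exactly one relevant ROI; the thesis's exact relevance setup may differ.

```python
import math

def mrr_at_k(rank: int, k: int) -> float:
    """Reciprocal rank of the positive ROI, zero if it falls
    outside the top k. Ranks are 1-indexed."""
    return 1.0 / rank if rank <= k else 0.0

def recall_at_k(rank: int, k: int) -> float:
    """With one relevant ROI per question, Recall@k is 1 if the
    positive appears in the top k, else 0."""
    return 1.0 if rank <= k else 0.0

def ndcg_at_k(rank: int, k: int) -> float:
    """NDCG@k with a single relevant ROI: the ideal DCG is 1
    (positive at rank 1), so NDCG reduces to 1 / log2(rank + 1)."""
    return 1.0 / math.log2(rank + 1) if rank <= k else 0.0
```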