dc.rights.license: CC-BY-NC-ND
dc.contributor.advisor: Gatt, A.
dc.contributor.author: Garzoni di Adorgnano, Massimiliano
dc.date.accessioned: 2023-10-20T00:00:46Z
dc.date.available: 2023-10-20T00:00:46Z
dc.date.issued: 2023
dc.identifier.uri: https://studenttheses.uu.nl/handle/20.500.12932/45412
dc.description.abstract: This master's thesis proposes a multi-modal retrieval system for open-domain question answering, building upon dense passage retrievers and incorporating multi-modal information to retrieve candidate regions of interest (ROIs) from document images given a user query. Our main research goal was to investigate the efficacy of dense representations of questions and multi-modal contexts in retrieving relevant content, and to evaluate the impact of multi-modal information compared to uni-modal baselines. To this end, the study leverages the VisualMRC dataset, which offers annotations for visual components, particularly ROIs such as titles or graphs, to facilitate efficient content retrieval. The proposed methodology involves pre-processing the multi-modal ROIs, employing a bi-encoder setup to encode the question and ROIs separately, and using these encodings to calculate similarity in their shared multi-dimensional embedding space. The training objective is achieved through contrastive learning: the model is given a question along with one positive and k negative contexts, and is trained to minimize the negative log-likelihood associated with the positive ROI. We evaluate our trained models in three modality scenarios (text-only, vision-only, and multi-modal), measuring retrieval performance with standard metrics: Normalized Discounted Cumulative Gain @ k, Mean Reciprocal Rank @ k, and Recall @ k (illustrative sketches of the training objective and these metrics follow this record). The results reveal the benefits of both vision-only and multi-modal approaches over text-only, while also highlighting challenges related to the number of negative ROIs. Our results support the first hypothesis but raise questions about the second, suggesting that the inclusion of layout information may not always improve retrieval performance. The strengths of our approach include efficient ROI retrieval and dataset adaptability, while its limitations involve dataset variability and encoding techniques. In light of this, we suggest several avenues for future work, such as exploring new datasets, incorporating hard negatives in contrastive learning, and refining ROI dissimilarity. Additionally, we speculate that integrating keyword matching and retrieval-augmented generation approaches could enhance the retrieval pipeline. Overall, the present thesis hopes to advance research in multi-modal retrieval models, emphasizing the importance of visual and textual context for open-domain question answering.
dc.description.sponsorship: Utrecht University
dc.language.iso: EN
dc.subject: This master's thesis presents a multi-modal retrieval system for open-domain question answering, using dense passage retrievers and multi-modal information to find relevant content in document images. The study supports the benefits of vision-only and multi-modal approaches over text-only, with potential future work focusing on improving dataset variability and encoding techniques.
dc.title: A Multi-Modal Approach to Open Domain Question Answering: Dense Retrieval of Regions-of-Interest
dc.type.content: Master Thesis
dc.rights.accessrights: Open Access
dc.subject.keywords: multi-modal retrieval; transformers; multi-modality; regions-of-interest; open-domain question answering
dc.subject.courseuu: Artificial Intelligence
dc.thesis.id: 25442
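The contrastive objective described in the abstract (one positive and k negative ROIs per question, minimizing the negative log-likelihood of the positive) can be sketched as follows. This is a minimal illustration assuming PyTorch and dot-product similarity; the function and variable names are hypothetical and do not reflect the thesis's actual implementation.

import torch
import torch.nn.functional as F

def contrastive_nll_loss(q_emb, pos_emb, neg_embs):
    # q_emb:    (d,)   dense question embedding from the question encoder
    # pos_emb:  (d,)   embedding of the one positive ROI
    # neg_embs: (k, d) embeddings of the k negative ROIs
    candidates = torch.cat([pos_emb.unsqueeze(0), neg_embs], dim=0)  # (k+1, d)
    scores = candidates @ q_emb  # dot-product similarity in the shared embedding space
    # The positive ROI sits at index 0, so cross-entropy against that index
    # is exactly the negative log-likelihood of the positive ROI.
    return F.cross_entropy(scores.unsqueeze(0), torch.tensor([0]))

The retrieval metrics named in the abstract can likewise be sketched, assuming binary relevance with a single positive ROI per question; the thesis's exact evaluation protocol may differ.

import math

def recall_at_k(ranked_ids, positive_id, k):
    # Recall@k: 1 if the positive ROI appears among the top-k retrieved ROIs.
    return 1.0 if positive_id in ranked_ids[:k] else 0.0

def mrr_at_k(ranked_ids, positive_id, k):
    # MRR@k: reciprocal rank of the positive ROI, 0 if it falls outside the top k.
    for rank, roi_id in enumerate(ranked_ids[:k], start=1):
        if roi_id == positive_id:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids, positive_id, k):
    # NDCG@k with one relevant ROI and binary relevance: the ideal DCG is 1,
    # so NDCG reduces to 1 / log2(rank + 1) at the rank of the positive ROI.
    for rank, roi_id in enumerate(ranked_ids[:k], start=1):
        if roi_id == positive_id:
            return 1.0 / math.log2(rank + 1)
    return 0.0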

