dc.rights.license: CC-BY-NC-ND
dc.contributor.advisor: Gatt, A.
dc.contributor.author: Garzoni di Adorgnano, Massimiliano
dc.date.accessioned: 2023-10-20T00:00:46Z
dc.date.available: 2023-10-20T00:00:46Z
dc.date.issued: 2023
dc.identifier.uri: https://studenttheses.uu.nl/handle/20.500.12932/45412
dc.description.abstract: This master's thesis proposes a multi-modal retrieval system for open-domain question answering, building upon dense passage retrievers and incorporating multi-modal information to retrieve candidate regions of interest (ROIs) from document images given a user query. Our main research goal was to investigate the efficacy of dense representations of questions and multi-modal contexts in retrieving relevant content, and to evaluate the impact of multi-modal information compared to uni-modal baselines. To this end, the study leverages the VisualMRC dataset, which offers annotations for visual components, particularly ROIs such as titles or graphs, to facilitate efficient content retrieval. The proposed methodology involves pre-processing the multi-modal ROIs, employing a bi-encoder setup to encode the question and ROIs separately, and using these encodings to calculate similarity in their shared multi-dimensional embedding space. The training objective is achieved through contrastive learning: the model is given a question along with one positive and k negative contexts, and is trained to minimize the negative log-likelihood associated with the positive ROI. We evaluate our trained models in three modality scenarios (text-only, vision-only, and multi-modal), measuring retrieval performance with standard metrics: Normalized Discounted Cumulative Gain @ k, Mean Reciprocal Rank @ k, and Recall @ k (illustrative sketches of the training objective and these metrics follow this record). The results reveal the benefits of both vision-only and multi-modal approaches over text-only, while also highlighting challenges related to the number of negative ROIs. Our results support the first hypothesis but raise questions about the second, suggesting that the inclusion of layout information may not always improve retrieval performance. The strengths of our approach include efficient ROI retrieval and dataset adaptability, while its limitations involve dataset variability and encoding techniques. In light of this, we suggest several avenues for future work, such as exploring new datasets, incorporating hard negatives in contrastive learning, and refining ROI dissimilarity. Additionally, we speculate that integrating keyword matching and retrieval-augmented generation approaches could enhance the retrieval pipeline. Overall, the present thesis hopes to advance research in multi-modal retrieval models, emphasizing the importance of visual and textual context for open-domain question answering.
dc.description.sponsorship: Utrecht University
dc.language.iso: EN
dc.subject: This master's thesis presents a multi-modal retrieval system for open-domain question answering, using dense passage retrievers and multi-modal information to find relevant content in document images. The study supports the benefits of vision-only and multi-modal approaches over text-only, with potential future work focusing on improving dataset variability and encoding techniques.
dc.title: A Multi-Modal Approach to Open Domain Question Answering: Dense Retrieval of Regions-of-Interest
dc.type.content: Master Thesis
dc.rights.accessrights: Open Access
dc.subject.keywords: multi-modal retrieval; transformers; multi-modality; regions-of-interest; open-domain question answering
dc.subject.courseuu: Artificial Intelligence
dc.thesis.id: 25442
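The contrastive objective described in the abstract (one positive and k negative ROIs per question, minimizing the negative log-likelihood of the positive) can be sketched as follows. This is a minimal illustration assuming PyTorch and dot-product similarity; the function and variable names are hypothetical and do not reflect the thesis's actual implementation.

import torch
import torch.nn.functional as F

def contrastive_nll_loss(q_emb, pos_emb, neg_embs):
    # q_emb:    (d,)   dense question embedding from the question encoder
    # pos_emb:  (d,)   embedding of the one positive ROI
    # neg_embs: (k, d) embeddings of the k negative ROIs
    candidates = torch.cat([pos_emb.unsqueeze(0), neg_embs], dim=0)  # (k+1, d)
    scores = candidates @ q_emb  # dot-product similarity in the shared embedding space
    # The positive ROI sits at index 0, so cross-entropy against that index
    # is exactly the negative log-likelihood of the positive ROI.
    return F.cross_entropy(scores.unsqueeze(0), torch.tensor([0]))

The retrieval metrics named in the abstract can likewise be sketched, assuming binary relevance with a single positive ROI per question; the thesis's exact evaluation protocol may differ.

import math

def recall_at_k(ranked_ids, positive_id, k):
    # Recall@k: 1 if the positive ROI appears among the top-k retrieved ROIs.
    return 1.0 if positive_id in ranked_ids[:k] else 0.0

def mrr_at_k(ranked_ids, positive_id, k):
    # MRR@k: reciprocal rank of the positive ROI, 0 if it falls outside the top k.
    for rank, roi_id in enumerate(ranked_ids[:k], start=1):
        if roi_id == positive_id:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids, positive_id, k):
    # NDCG@k with one relevant ROI and binary relevance: the ideal DCG is 1,
    # so NDCG reduces to 1 / log2(rank + 1) at the rank of the positive ROI.
    for rank, roi_id in enumerate(ranked_ids[:k], start=1):
        if roi_id == positive_id:
            return 1.0 / math.log2(rank + 1)
    return 0.0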

