dc.rights.license | CC-BY-NC-ND | |
dc.contributor.advisor | Schraagen, Marijn | |
dc.contributor.author | Boer, Kathleen de | |
dc.date.accessioned | 2024-07-24T23:07:07Z | |
dc.date.available | 2024-07-24T23:07:07Z | |
dc.date.issued | 2024 | |
dc.identifier.uri | https://studenttheses.uu.nl/handle/20.500.12932/46902 | |
dc.description.abstract | The contents of this thesis focus on the development and evaluation of an interpretable multimodal model for emotion recognition, in collaboration with the Dutch Institute of Sound & Vision. The state-of-the-art multimodal model Self Supervised Embedding Feature Transformer (SSE-FT) was fine-tuned and assessed on the Multimodal EmotionLines Dataset (MELD), revealing performance issues. The interpretability framework MM-SHAP was modified for emotion recognition and extended to include the text, audio, and video modalities. The proposed interpretability framework and ablation studies showed that SSE-FT predominantly relied on the textual modality, leading to uni-modal collapse. The Dutch language model RobBERT was integrated into SSE-FT to increase performance, yet training RobBERT independently revealed its limitations in capturing nuanced emotional cues from the MELD dataset. This thesis introduces visualization techniques specifically developed to increase interpretability within individual modalities and to assist comparative analysis between the audio and text modalities. The proposed interpretability method and visualization technique for text are applied to analyze the textual modality and provide valuable insights into the model's learned emotional cues. The results show that SSE-FT trained on MELD relies heavily on paralinguistic cues in text and is unable to capture the more nuanced emotional cues in the video and audio modalities. The findings of this thesis call attention to the need for a balanced, high-quality Dutch dataset for emotion recognition, as well as the importance of general dataset quality for advancing the field. The proposed interpretability method proves effective for making multimodal models for emotion recognition more interpretable. | |
dc.description.sponsorship | Utrecht University | |
dc.language.iso | EN | |
dc.title | Towards Interpretable Multimodal Models for Emotion Recognition | |
dc.type.content | Master Thesis | |
dc.rights.accessrights | Open Access | |
dc.subject.keywords | Multimodal, Emotion Recognition, Interpretability, SSE-FT, MM-SHAP, Uni-modal Collapse, Visualization | |
dc.subject.courseuu | Artificial Intelligence | |
dc.thesis.id | 34826 | |