
dc.rights.license: CC-BY-NC-ND
dc.contributor.advisor: Schraagen, Marijn
dc.contributor.author: Boer, Kathleen de
dc.date.accessioned: 2024-07-24T23:07:07Z
dc.date.available: 2024-07-24T23:07:07Z
dc.date.issued: 2024
dc.identifier.uri: https://studenttheses.uu.nl/handle/20.500.12932/46902
dc.description.abstract: This thesis focuses on the development and evaluation of an interpretable multimodal model for emotion recognition, in collaboration with the Dutch Institute of Sound & Vision. The state-of-the-art multimodal model Self-Supervised Embedding Feature Transformer (SSE-FT) was fine-tuned and assessed on the Multimodal EmotionLines Dataset (MELD), revealing performance issues. The interpretability framework MM-SHAP was adapted for emotion recognition and extended to cover the text, audio, and video modalities. The proposed interpretability framework and ablation studies showed that SSE-FT relied predominantly on the textual modality, leading to uni-modal collapse (see the sketch after this record). The Dutch language model RobBERT was integrated into SSE-FT to improve performance, yet training RobBERT independently revealed its limitations in capturing nuanced emotional cues from the MELD dataset. The thesis also introduces visualization techniques developed to increase interpretability within individual modalities and to support comparative analysis between the audio and text modalities. The proposed interpretability method and text visualization technique are applied to the textual modality and yield valuable insights into the emotional cues the model has learned. The results show that SSE-FT trained on MELD relies heavily on paralinguistic cues in text and fails to capture the more nuanced emotional cues in the video and audio modalities. These findings call attention to the need for a balanced, high-quality Dutch dataset for emotion recognition, as well as the importance of overall dataset quality for advancing the field. The proposed interpretability method proves effective for making multimodal models for emotion recognition interpretable.
dc.description.sponsorship: Utrecht University
dc.language.iso: EN
dc.subject: The contents of this thesis focus on the development and evaluation of an interpretable multimodal model for emotion recognition in collaboration with the Dutch Institute of Sound & Vision.
dc.title: Towards Interpretable Multimodal Models for Emotion Recognition
dc.type.content: Master Thesis
dc.rights.accessrights: Open Access
dc.subject.keywords: Multimodal, Emotion Recognition, Interpretability, SSE-FT, MM-SHAP, Uni-modal Collapse, Visualization
dc.subject.courseuu: Artificial Intelligence
dc.thesis.id: 34826
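
The "reliance on the textual modality" named in the abstract is, in MM-SHAP-style analysis, the share of absolute Shapley value mass attributed to each modality's tokens. The snippet below is a minimal illustrative sketch of that per-modality scoring, not the thesis implementation; the function name, the grouping of per-token Shapley values by modality, and the example numbers are assumptions made for illustration.

```python
from typing import Dict, List

def modality_shap_shares(shap_by_modality: Dict[str, List[float]]) -> Dict[str, float]:
    """Return each modality's share of the total absolute Shapley mass."""
    # Sum |SHAP| per modality, then normalize by the grand total.
    totals = {m: sum(abs(v) for v in vals) for m, vals in shap_by_modality.items()}
    grand_total = sum(totals.values())
    if grand_total == 0.0:
        # No attribution mass at all; return zeros rather than dividing by zero.
        return {m: 0.0 for m in totals}
    return {m: t / grand_total for m, t in totals.items()}

# Example: per-token |SHAP| values grouped by modality (illustrative numbers only).
shares = modality_shap_shares({
    "text":  [0.42, 0.31, 0.18],
    "audio": [0.05, 0.04],
    "video": [0.03, 0.02],
})
print(shares)  # roughly {'text': 0.87, 'audio': 0.09, 'video': 0.05}
```

A text share close to 1 on most examples is the quantitative signature of the uni-modal collapse described in the abstract, whereas a balanced model would spread the attribution mass across text, audio, and video.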

