Towards Interpretable Multimodal Models for Emotion Recognition
Summary
This thesis focuses on the development and evaluation of an interpretable
multimodal model for emotion recognition, carried out in collaboration with the Dutch
Institute of Sound & Vision. The state-of-the-art multimodal model Self-Supervised
Embedding Feature Transformer (SSE-FT) was fine-tuned and assessed on the Multimodal
EmotionLines Dataset (MELD), revealing performance issues. The interpretability framework
MM-SHAP was modified for emotion recognition and extended to include the text,
audio, and video modalities. The proposed interpretability framework and ablation
studies showed that SSE-FT relied predominantly on the textual modality, leading to
unimodal collapse. The Dutch language model RobBERT was integrated into SSE-FT
to improve performance, but training RobBERT independently revealed its limitations
in capturing nuanced emotional cues from the MELD dataset. This thesis introduces
visualization techniques developed specifically to increase interpretability within
individual modalities and to support comparative analysis between the audio and text
modalities. The proposed interpretability method and the text visualization technique
are applied to the textual modality, providing valuable insights into the emotional
cues the model has learned from text. The results show that SSE-FT trained on MELD
relies heavily on paralinguistic cues in the text and is unable to capture the more
nuanced emotional cues in the video and audio modalities. The findings of this thesis
call attention to the need for a balanced, high-quality Dutch dataset for emotion
recognition, as well as to the importance of general dataset quality for advancing the
field. The proposed interpretability method is found to be effective for providing
interpretability in multimodal models for emotion recognition.