Multimodal emotion recognition for video content
Summary
There is a large body of research on automatic emotion recognition from facial expressions and from text. Although previous work has shown that combining the information learned from these channels improves prediction quality, such combinations remain underrepresented in the bimodal emotion recognition domain. Our research aims to close this gap by creating an emotion recognition model that joins facial and textual emotion classifiers. The two building blocks of this bimodal model are 1) the high-performing convolutional neural network mini-Xception, which is responsible for facial emotion recognition, and 2) the BERT language model fine-tuned on the textual emotion classification task. First, we investigated whether combining the textual and video modalities improves the quality of emotion classification. Second, we evaluated the high-performing models employed for the facial (mini-Xception) and textual (BERT) emotion predictions on a new dynamic data source, the Dutch soap opera "Goede Tijden, Slechte Tijden". Our results showed that the performance of these two models on the new data is low: mini-Xception reached a macro F1-score of 0.17, and fine-tuned BERT reached 0.26. As for the bimodal model, fusing BERT with mini-Xception did not improve the classification: the bimodal model (a Random Forest) performed worse (macro F1-score of 0.22) than the textual emotion classifier, though slightly better than the facial emotion classifier. All in all, this research demonstrates the necessity of cross-dataset evaluation for high-performing deep learning models. Although the bimodal model did not outperform the unimodal models as expected, joining modalities can still be an effective approach to automatic emotion classification. In our set-up, several obstacles may have contributed to the bimodal model's poor performance. First, annotating soap opera shots with emotions proved difficult even for human annotators. In addition, the weak predictions of the unimodal models propagated to the bimodal model, lowering its performance.
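To make the fusion step more concrete, the sketch below shows one way such a late-fusion model could be set up: the per-shot emotion probability vectors produced by the facial and textual classifiers are concatenated and fed to a Random Forest meta-classifier. This is a minimal illustration, not the actual pipeline used in the research; the variable names (face_probs, text_probs, y) and the random placeholder data are assumptions standing in for the real mini-Xception and fine-tuned BERT outputs and the annotated soap opera shots.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

# Hypothetical per-shot outputs of the two unimodal models:
#   face_probs - softmax scores from the facial classifier, shape (n_shots, n_emotions)
#   text_probs - softmax scores from the textual classifier, shape (n_shots, n_emotions)
#   y          - gold emotion label per shot, shape (n_shots,)
# Random placeholders are used here purely for illustration.
rng = np.random.default_rng(0)
n_shots, n_emotions = 200, 7
face_probs = rng.dirichlet(np.ones(n_emotions), size=n_shots)
text_probs = rng.dirichlet(np.ones(n_emotions), size=n_shots)
y = rng.integers(0, n_emotions, size=n_shots)

# Late fusion: concatenate the two probability vectors for each shot and
# train a Random Forest meta-classifier on the combined features.
X = np.hstack([face_probs, text_probs])
split = int(0.8 * n_shots)
clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X[:split], y[:split])

# Evaluate with macro F1, the metric reported in the summary above.
pred = clf.predict(X[split:])
print("macro F1:", f1_score(y[split:], pred, average="macro"))
```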