
dc.rights.license: CC-BY-NC-ND
dc.contributor.advisor: Harvey, B.M.
dc.contributor.author: Polyanskaya, L.
dc.date.accessioned: 2019-07-19T17:00:43Z
dc.date.available: 2019-07-19T17:00:43Z
dc.date.issued: 2019
dc.identifier.uri: https://studenttheses.uu.nl/handle/20.500.12932/32896
dc.description.abstract: There is a large body of research on automatic emotion recognition from facial expressions and from text. Although previous work has shown that combining the information learned from these channels improves the quality of predictions, such combinations remain underrepresented in the bimodal emotion recognition domain. Our research aims to close this gap by creating an emotion recognition model that joins facial and textual emotion classifiers. The two building blocks of this bimodal model are 1) the high-performing convolutional neural network mini-Xception, which is responsible for facial emotion recognition, and 2) the BERT embedding model fine-tuned on the textual emotion classification task. First, we investigated whether engaging the textual and video modalities together would improve the quality of emotion classification. Second, we evaluated the high-performing models employed for the facial (mini-Xception) and textual (BERT) emotion predictions on a new dynamic data source, the Dutch soap opera "Goede Tijden, Slechte Tijden". Our results showed that the performance of these two models on the new data is low: mini-Xception achieved a macro F1-score of 0.17, and fine-tuned BERT a macro F1-score of 0.26. As for the bimodal model, fusing BERT with mini-Xception did not improve the classification: the bimodal model (a Random Forest) performed worse (macro F1-score 0.22) than the textual emotion classifier, though slightly better than the facial one. All in all, this research demonstrates the necessity of cross-dataset evaluation for high-performing deep learning models. Although the bimodal model did not outperform the unimodal models as expected, joining modalities can still be an efficient approach for automatic emotion classification. In our set-up there were several obstacles that may have caused the bimodal model's poor performance. First, annotating soap opera shots with emotions proved formidable even for human annotators. In addition, the weak predictions of the unimodal models carried over into the low bimodal performance.
dc.description.sponsorship: Utrecht University
dc.format.extent: 12549380
dc.format.mimetype: application/pdf
dc.language.iso: en
dc.title: Multimodal emotion recognition for video content
dc.type.content: Master Thesis
dc.rights.accessrights: Open Access
dc.subject.keywords: Multimodal emotion recognition, Deep learning, Cross-dataset evaluation
dc.subject.courseuu: Artificial Intelligence
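
The abstract does not spell out how the two classifiers were fused. Below is a minimal late-fusion sketch in Python, assuming each unimodal model emits a per-clip probability distribution over seven emotion classes and that a scikit-learn Random Forest stands in for the fusion classifier described above. The class count, the synthetic inputs, and all variable names are illustrative assumptions, not details taken from the thesis.

    # Minimal late-fusion sketch: a Random Forest over the concatenated
    # class-probability vectors of the two unimodal emotion classifiers.
    # The probabilities below are random stand-ins for the per-clip outputs
    # that mini-Xception (faces) and fine-tuned BERT (subtitles) would produce.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import f1_score
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    n_clips, n_classes = 500, 7  # assumed seven-way emotion scheme

    face_probs = rng.dirichlet(np.ones(n_classes), size=n_clips)
    text_probs = rng.dirichlet(np.ones(n_classes), size=n_clips)
    labels = rng.integers(0, n_classes, size=n_clips)

    # Late fusion: concatenate both probability vectors into one feature vector.
    X = np.hstack([face_probs, text_probs])
    X_train, X_test, y_train, y_test = train_test_split(X, labels, random_state=0)

    fusion = RandomForestClassifier(n_estimators=200, random_state=0)
    fusion.fit(X_train, y_train)
    print("macro F1:", f1_score(y_test, fusion.predict(X_test), average="macro"))

Fusing at the probability level rather than on raw features keeps the fusion model small and lets each unimodal classifier be trained and evaluated independently, which matches the cross-dataset evaluation set-up the abstract describes.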

