
dc.rights.license: CC-BY-NC-ND
dc.contributor.advisor: Harvey, B.M.
dc.contributor.author: Polyanskaya, L.
dc.date.accessioned: 2019-07-19T17:00:43Z
dc.date.available: 2019-07-19T17:00:43Z
dc.date.issued: 2019
dc.identifier.uri: https://studenttheses.uu.nl/handle/20.500.12932/32896
dc.description.abstract: There is a large body of research on automatic emotion recognition from facial expressions and from text. Although previous work has shown that combining the information learned from these channels improves the quality of predictions, such combinations remain underrepresented in the bimodal emotion recognition domain. Our research aims to close this gap by creating an emotion recognition model that joins facial and textual emotion classifiers. The two building blocks of this bimodal model are 1) the high-performing convolutional neural network mini-Xception, which is responsible for facial emotion recognition, and 2) the BERT embedding model fine-tuned on the textual emotion classification task. First, we investigated whether engaging the textual and video modalities together would improve the quality of emotion classification. Second, we evaluated the high-performing models employed for the facial (mini-Xception) and textual (BERT) emotion predictions on a new dynamic data source, the Dutch soap opera "Goede Tijden, Slechte Tijden". Our results showed that the performance of these two models on the new data is low: mini-Xception achieved a macro F1-score of 0.17, and fine-tuned BERT a macro F1-score of 0.26. As for the bimodal model, fusing BERT with mini-Xception did not improve the classification: the bimodal model (a Random Forest) performed worse (macro F1-score 0.22) than the textual emotion classifier, though slightly better than the facial one. All in all, this research demonstrates the necessity of cross-dataset evaluation for high-performing deep learning models. Although the bimodal model did not outperform the unimodal models as expected, joining modalities can still be an efficient approach for automatic emotion classification. In our set-up there were several obstacles that may have caused the bimodal model's poor performance. First, annotating soap opera shots with emotions proved formidable even for human annotators. In addition, the weak predictions of the unimodal models carried over into the low bimodal performance.
dc.description.sponsorship: Utrecht University
dc.format.extent: 12549380
dc.format.mimetype: application/pdf
dc.language.iso: en
dc.title: Multimodal emotion recognition for video content
dc.type.content: Master Thesis
dc.rights.accessrights: Open Access
dc.subject.keywords: Multimodal emotion recognition, Deep learning, Cross-dataset evaluation
dc.subject.courseuu: Artificial Intelligence
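
The abstract does not spell out how the two classifiers were fused. Below is a minimal late-fusion sketch in Python, assuming each unimodal model emits a per-clip probability distribution over seven emotion classes and that a scikit-learn Random Forest stands in for the fusion classifier described above. The class count, the synthetic inputs, and all variable names are illustrative assumptions, not details taken from the thesis.

    # Minimal late-fusion sketch: a Random Forest over the concatenated
    # class-probability vectors of the two unimodal emotion classifiers.
    # The probabilities below are random stand-ins for the per-clip outputs
    # that mini-Xception (faces) and fine-tuned BERT (subtitles) would produce.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import f1_score
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    n_clips, n_classes = 500, 7  # assumed seven-way emotion scheme

    face_probs = rng.dirichlet(np.ones(n_classes), size=n_clips)
    text_probs = rng.dirichlet(np.ones(n_classes), size=n_clips)
    labels = rng.integers(0, n_classes, size=n_clips)

    # Late fusion: concatenate both probability vectors into one feature vector.
    X = np.hstack([face_probs, text_probs])
    X_train, X_test, y_train, y_test = train_test_split(X, labels, random_state=0)

    fusion = RandomForestClassifier(n_estimators=200, random_state=0)
    fusion.fit(X_train, y_train)
    print("macro F1:", f1_score(y_test, fusion.predict(X_test), average="macro"))

Fusing at the probability level rather than on raw features keeps the fusion model small and lets each unimodal classifier be trained and evaluated independently, which matches the cross-dataset evaluation set-up the abstract describes.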

