Show simple item record

dc.rights.license: CC-BY-NC-ND
dc.contributor.advisor: Önal Ertugrul, I.
dc.contributor.author: Rau, Maximilian
dc.date.accessioned: 2024-06-19T23:01:54Z
dc.date.available: 2024-06-19T23:01:54Z
dc.date.issued: 2024
dc.identifier.uri: https://studenttheses.uu.nl/handle/20.500.12932/46527
dc.description.abstract: Recent advancements in computer vision, particularly transformer-based models, offer promising potential for establishing new benchmarks in automated pain assessment from facial expressions. This thesis explores the efficacy of the Video Swin Transformer (VST), a recent architecture that leverages temporal dynamics and multi-scale features for nuanced pain detection. We apply the VST and compare its performance against other transformer-based state-of-the-art models such as the Swin Transformer and the Vision Transformer (ViT). Through ablation studies, we demonstrate the positive impact of incorporating greater temporal depth into the model. Additionally, we evaluate the use of Focal loss to mitigate the imbalanced class distribution of the UNBC-McMaster dataset, which turned out to be insufficient. Our research also examines the generalizability of our models across datasets, highlighting the need for more diverse data in the training phase. By extracting attention maps, we gain insight into the models' explainability, in particular their focus points, confirming that they rely on pain-related facial regions for decision-making. The results are promising: our best models, VST-0 and VST-1-TD, set new benchmarks with F1-scores of 0.56±0.06 and 0.59±0.04, respectively, and achieve AUC scores of 0.85±0.04 and 0.87±0.03, comparable to the state of the art. This thesis underscores the potential of the VST architecture not only for automated pain assessment but also for the broader analysis of facial expressions.
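The abstract mentions using Focal loss to counter the imbalanced class distribution of the UNBC-McMaster dataset. As context, a minimal PyTorch sketch of binary focal loss (Lin et al., 2017) is shown below; the `gamma` and `alpha` values are the commonly used defaults, not necessarily the hyperparameters chosen in the thesis.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """Binary focal loss: down-weights well-classified (easy) examples so
    that rare-class frames (e.g. pain) contribute more to the gradient.
    gamma=2.0 and alpha=0.25 are the common defaults, assumed here."""
    p = torch.sigmoid(logits)
    # Plain per-example cross-entropy as the base term.
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    # Probability assigned to the true class of each example.
    p_t = p * targets + (1 - p) * (1 - targets)
    # Class-balancing weight: alpha for positives, 1-alpha for negatives.
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    # (1 - p_t)^gamma shrinks the loss of confident, correct predictions.
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()
```

The modulating factor `(1 - p_t)^gamma` is what distinguishes this from weighted cross-entropy: confident correct predictions are suppressed multiplicatively, so training focuses on the hard minority-class examples.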
dc.description.sponsorship: Utrecht University
dc.language.iso: EN
dc.subject: Investigations of the Video Swin Transformer for automated pain detection. In addition to performance evaluation and comparison with other state-of-the-art models, temporal dynamics, the application of Focal loss, generalizability, and explainability were explored.
dc.title: Evaluating the Effectiveness, Generalizability, and Explainability of Video Swin Transformers on Automated Pain Detection
dc.type.content: Master Thesis
dc.rights.accessrights: Open Access
dc.subject.keywords: Automated Pain Detection, Transformer, Video Swin Transformer, Generalizability, Explainable AI, Focal Loss
dc.subject.courseuu: Artificial Intelligence
dc.thesis.id: 31601

