Interpretable and explainable vision and video vision transformers for pain detection
Summary
Automatic detection of facial indicators of pain has many useful applications in the healthcare domain. Vision transformers are among the top-performing architectures in computer vision, yet little research has examined their use for pain assessment. In this thesis, we propose the first fully-attentive automated pain assessment pipeline, which achieves state-of-the-art performance on direct and indirect pain detection from facial expressions. The models are trained on the UNBC-McMaster dataset after the faces are 3D-registered and rotated to the canonical frontal view. In our direct pain detection experiments, we identify important regions of the hyperparameter space and how they interact with vision and video vision transformers, obtaining three noteworthy models. We also test these models on indirect pain detection and on direct and indirect pain intensity estimation. Our indirect pain detection models underperform their direct counterparts but still outperform previous works while providing explanations for their predictions. We analyze the attention maps of one of our direct pain detection models and find reasonable interpretations for its predictions. The models perform much worse on pain intensity estimation, showing the limits of the simple approach chosen. We also evaluate Mixup, a data augmentation technique, and Sharpness-Aware Minimization, an optimization procedure, neither of which improves performance. Our presented models for direct pain detection, ViT-1-D (F1 score 0.55 ± 0.15), ViViT-1-D (F1 score 0.55 ± 0.13), and ViViT-2-D (F1 score 0.49 ± 0.04), all outperform earlier works, showing the potential of vision transformers for pain detection.
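
To illustrate the Mixup augmentation evaluated in this work, the following minimal sketch shows how batches of face frames and pain labels could be mixed during training. It is not the thesis implementation; the function name, tensor shapes, and the Beta parameter alpha are assumptions for illustration only.

    # Minimal Mixup sketch (illustrative; not the thesis code).
    # Assumes a batch of face frames `x` of shape (B, C, H, W) and binary
    # pain labels `y` of shape (B,); `alpha` is a hypothetical hyperparameter.
    import torch

    def mixup_batch(x: torch.Tensor, y: torch.Tensor, alpha: float = 0.2):
        """Return mixed inputs, the paired labels, and the mixing weight."""
        lam = torch.distributions.Beta(alpha, alpha).sample().item()  # mixing coefficient
        perm = torch.randperm(x.size(0))                              # random pairing within the batch
        x_mixed = lam * x + (1.0 - lam) * x[perm]                     # convex combination of frames
        return x_mixed, y, y[perm], lam

    # Usage inside a training step (criterion e.g. binary cross-entropy):
    # x_mixed, y_a, y_b, lam = mixup_batch(x, y)
    # logits = model(x_mixed)
    # loss = lam * criterion(logits, y_a) + (1.0 - lam) * criterion(logits, y_b)

The loss is computed as the same convex combination of the two labels' losses, which is the standard Mixup formulation; how it was integrated with the vision and video vision transformer training in this thesis is described in the corresponding chapter.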