Vision Transformers for Pain Recognition on Thermal Image Frames
Summary
Pain is still not fully understood scientifically, even though poorly
managed pain severely affects those who experience it. Therefore, a valid and reliable pain
assessment is necessary to manage pain properly. This study investigates the effectiveness
of vision transformers in detecting pain from thermal video frames of the face. In doing so, it
examines the effect of incorporating temporal sequences and extracting regions of interest (ROI).
Vision transformer (ViT) and video vision transformer (ViViT) models are employed
for this analysis. We found that both models can discern differences in pain, but the models
overfit quite easily. Nevertheless, the ViViT model trained on sequences of
entire thermal images (ViViT whole) shows promise, outperforming the other configurations
with 60.5% accuracy. ViT ROI proved more effective than both ViT whole and ViViT ROI,
highlighting the benefit of ROI extraction in the case of single-image pain prediction on
thermal images.