dc.description.abstract | The influence of Vision Transformers (ViTs) is increasing in the field of medical image segmentation. In recent years, several papers have presented ViT-based architectures that outperform the previous state-of-the-art CNNs (such as nnU-Net). One such example is the Swin UNet Transformer (Swin UNETR), which, in combination with a self-supervised pre-training scheme, has outperformed other CNN- and ViT-based architectures in multiple segmentation tasks. However, certain design and configuration choices may aid the ViT in achieving this performance. In this paper, we perform an objective comparison between Swin UNETR and U-Net, evaluating both networks in an equal resource setting. We explore two downscaling approaches that balance the parameter count of Swin UNETR, bringing it closer to the U-Net in this respect. We measure the ViT's performance loss due to downscaling, as well as the gain obtained from using pre-trained weights for the encoder. Additionally, we assess whether residual blocks help the Swin UNETR or U-Net achieve superior performance. Our results show that, in the framework used in this study, U-Net and Swin UNETR achieve comparable results, with the CNN-based network obtaining a slightly higher (1%) Dice similarity coefficient (DSC). The downscaled ViT models show a decrease of 1.4% in DSC, while pre-training improves the outcome of the original Swin UNETR by 1.6%. Residual blocks benefit the pre-trained Swin UNETR, increasing its DSC by 3.6%, while improving the U-Net's DSC by only 0.8%. In the constrained resource setting of this study, the U-Net achieves performance similar to Swin UNETR while using fewer GPU resources and offering faster inference. | |
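As a minimal sketch of the kind of comparison described above, the snippet below instantiates a residual 3D U-Net, a full-size Swin UNETR, and a hypothetically downscaled Swin UNETR variant and reports their parameter counts. It assumes the MONAI (1.x) implementations of `UNet` and `SwinUNETR` and uses a reduced `feature_size` purely as an illustrative downscaling knob; the abstract does not state the framework, the exact configurations, or the paper's two downscaling approaches.

```python
# Illustrative sketch only: assumes MONAI 1.x and PyTorch; the paper's actual
# configurations and downscaling strategies are not specified in the abstract.
import torch
from monai.networks.nets import SwinUNETR, UNet


def count_params(model: torch.nn.Module) -> int:
    """Total number of trainable parameters."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


# A 3D U-Net; num_res_units > 0 uses residual units, 0 gives a plain U-Net.
unet = UNet(
    spatial_dims=3,
    in_channels=1,
    out_channels=2,
    channels=(32, 64, 128, 256, 512),
    strides=(2, 2, 2, 2),
    num_res_units=2,
)

# Full-size Swin UNETR (feature_size=48, a commonly used configuration) and a
# hypothetically downscaled variant with a smaller embedding dimension.
swin_full = SwinUNETR(
    img_size=(96, 96, 96), in_channels=1, out_channels=2, feature_size=48
)
swin_small = SwinUNETR(
    img_size=(96, 96, 96), in_channels=1, out_channels=2, feature_size=24
)

for name, net in [
    ("U-Net", unet),
    ("Swin UNETR", swin_full),
    ("Swin UNETR (downscaled)", swin_small),
]:
    print(f"{name}: {count_params(net) / 1e6:.1f}M parameters")
```

Printing the counts side by side makes the resource-balancing argument concrete: halving `feature_size` roughly quarters the transformer's embedding-related parameters, which is one plausible way to bring the ViT closer to the U-Net's budget.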