dc.description.abstract | The influence of Vision Transformers (ViTs) is increasing in the field of medical image segmentation. In recent years, several papers have presented ViT-based architectures that outperform the previous state-of-the-art CNNs (such as nnU-Net). One such example is the Swin UNet Transformer (Swin UNETR), which, in combination with a self-supervised pre-training scheme, has outperformed other CNN- and ViT-based architectures in multiple segmentation tasks. However, certain design and configuration choices may aid the ViT in achieving this performance. In this paper, we perform an objective comparison between Swin UNETR and U-Net, evaluating both networks in an equal resource setting. We explore two downscaling approaches that balance the parameter count of Swin UNETR, bringing it closer to the U-Net in this respect. We measure the ViT's performance loss due to downscaling, as well as the gain obtained from using pre-trained weights for the encoder. Additionally, we assess whether residual blocks help the Swin UNETR or U-Net achieve superior performance. Our results show that, in the framework used in this study, U-Net and Swin UNETR achieve comparable results, with the CNN-based network obtaining a slightly higher (1%) Dice similarity coefficient (DSC). The downscaled ViT models show a decrease of 1.4% in DSC, while pre-training improves the outcome of the original Swin UNETR by 1.6%. Residual blocks benefit the pre-trained Swin UNETR, increasing its DSC by 3.6%, while improving the U-Net's DSC by only 0.8%. In the constrained resource setting of this study, the U-Net achieves performance similar to Swin UNETR while using fewer GPU resources and offering faster inference. | |
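As a minimal sketch of the kind of comparison described above, the snippet below instantiates a residual 3D U-Net, a full-size Swin UNETR, and a hypothetically downscaled Swin UNETR variant and reports their parameter counts. It assumes the MONAI (1.x) implementations of `UNet` and `SwinUNETR` and uses a reduced `feature_size` purely as an illustrative downscaling knob; the abstract does not state the framework, the exact configurations, or the paper's two downscaling approaches.

```python
# Illustrative sketch only: assumes MONAI 1.x and PyTorch; the paper's actual
# configurations and downscaling strategies are not specified in the abstract.
import torch
from monai.networks.nets import SwinUNETR, UNet


def count_params(model: torch.nn.Module) -> int:
    """Total number of trainable parameters."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


# A 3D U-Net; num_res_units > 0 uses residual units, 0 gives a plain U-Net.
unet = UNet(
    spatial_dims=3,
    in_channels=1,
    out_channels=2,
    channels=(32, 64, 128, 256, 512),
    strides=(2, 2, 2, 2),
    num_res_units=2,
)

# Full-size Swin UNETR (feature_size=48, a commonly used configuration) and a
# hypothetically downscaled variant with a smaller embedding dimension.
swin_full = SwinUNETR(
    img_size=(96, 96, 96), in_channels=1, out_channels=2, feature_size=48
)
swin_small = SwinUNETR(
    img_size=(96, 96, 96), in_channels=1, out_channels=2, feature_size=24
)

for name, net in [
    ("U-Net", unet),
    ("Swin UNETR", swin_full),
    ("Swin UNETR (downscaled)", swin_small),
]:
    print(f"{name}: {count_params(net) / 1e6:.1f}M parameters")
```

Printing the counts side by side makes the resource-balancing argument concrete: halving `feature_size` roughly quarters the transformer's embedding-related parameters, which is one plausible way to bring the ViT closer to the U-Net's budget.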