dc.rights.license: CC-BY-NC-ND
dc.contributor.advisor: Maspero, Matteo
dc.contributor.author: Arregui García, Xabier
dc.date.accessioned: 2023-07-25T00:01:28Z
dc.date.available: 2023-07-25T00:01:28Z
dc.date.issued: 2023
dc.identifier.uri: https://studenttheses.uu.nl/handle/20.500.12932/44289
dc.description.abstract: Vision Transformers (ViTs) are increasingly influential in the field of medical image segmentation. In recent years, several papers have presented ViT-based architectures that outperform previous state-of-the-art CNNs such as nnU-Net. One example is the Swin UNEt TRansformer (Swin UNETR), which, in combination with a self-supervised pre-training scheme, has outperformed other CNN- and ViT-based architectures on multiple segmentation tasks. However, certain design and configuration choices may be what enables the ViT to achieve this performance. In this work, we perform an objective comparison between Swin UNETR and U-Net under an equal resource setting. We explore two downscaling approaches that bring Swin UNETR's parameter count closer to that of the U-Net. We measure the ViT's performance loss due to downscaling, as well as the gain obtained from using pre-trained weights for the encoder. Additionally, we assess whether residual blocks help either Swin UNETR or U-Net achieve superior performance. Our results show that, within the framework of this study, U-Net and Swin UNETR achieve comparable results, with the CNN-based network attaining a slightly higher (1%) Dice similarity coefficient (DSC). The downscaled ViT models show a decrease of 1.4% in DSC, while pre-training improves the outcome of the original Swin UNETR by 1.6%. Residual blocks benefit the pre-trained Swin UNETR, increasing its DSC by 3.6%, while improving U-Net's DSC by only 0.8%. In the constrained resource setting of this study, the U-Net matches Swin UNETR's performance while using fewer GPU resources and offering faster inference. (A minimal code sketch of the compared architectures follows this record.)
dc.description.sponsorship: Utrecht University
dc.language.iso: EN
dc.subject: In this study, the well-established CNNs and the novel ViTs are compared in the context of 3D medical image segmentation, demonstrating that CNNs are still not to be dismissed.
dc.title: ViTs vs. CNNs for 3D Medical Image Segmentation: Are Transformers All You Need?
dc.type.content: Master Thesis
dc.rights.accessrights: Open Access
dc.subject.courseuu: Medical Imaging
dc.thesis.id: 19908
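
To make the comparison in the abstract concrete, here is a minimal sketch of how the two compared architectures can be instantiated and their parameter budgets measured. It assumes the MONAI implementations of Swin UNETR and U-Net; the thesis's actual configurations, data, and training pipeline are not part of this record, and the patch size, feature_size, and channels values below are illustrative assumptions, not the settings used in the study.

    import torch
    from monai.networks.nets import SwinUNETR, UNet

    def count_params(model: torch.nn.Module) -> int:
        # Total trainable parameters: the "resource" balanced in the study.
        return sum(p.numel() for p in model.parameters() if p.requires_grad)

    # ViT-based network. feature_size is MONAI's main width knob; reducing it
    # is one plausible way to downscale Swin UNETR toward U-Net's budget.
    # Note: img_size is deprecated in recent MONAI releases and may need to
    # be dropped there.
    swin = SwinUNETR(
        img_size=(96, 96, 96),  # 3D input patch size (illustrative)
        in_channels=1,
        out_channels=2,
        feature_size=24,        # MONAI default; illustrative here
    )

    # CNN baseline. num_res_units > 0 enables residual units, the design
    # choice whose effect on DSC the abstract quantifies.
    unet = UNet(
        spatial_dims=3,
        in_channels=1,
        out_channels=2,
        channels=(32, 64, 128, 256, 512),  # illustrative encoder widths
        strides=(2, 2, 2, 2),
        num_res_units=2,
    )

    print(f"Swin UNETR parameters: {count_params(swin):,}")
    print(f"U-Net parameters:      {count_params(unet):,}")

With MONAI's default feature_size, the Swin UNETR typically carries several times more parameters than a U-Net of this size, which is the kind of imbalance the thesis's downscaling experiments are meant to remove before comparing the architectures.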

