Show simple item record

dc.rights.license: CC-BY-NC-ND
dc.contributor.advisor: Poppe, Ronald
dc.contributor.author: Terpstra, Joren
dc.date.accessioned: 2025-08-21T00:05:58Z
dc.date.available: 2025-08-21T00:05:58Z
dc.date.issued: 2025
dc.identifier.uri: https://studenttheses.uu.nl/handle/20.500.12932/49892
dc.description.abstract: Mamba-based vision models show promise as successors to Transformer-based models due to their competitive performance and significant efficiency improvements, but are constrained by fixed, predetermined scan patterns. These scan patterns are specific orderings of patch tokens, which influence the receptive field of Mamba vision models. In this research, we investigated two methods of constructing data-driven scan patterns for Vision Mamba on an image classification task. One is a global fixed pattern derived from aggregating the bounding boxes of a subset of ImageNet-1K, and the other is a per-image pattern based on BING saliency maps. Both methods were intended to improve the classification accuracy of Vision Mamba. Contrary to our hypothesis, both underperformed the baseline of 53.11% top-1 classification accuracy, achieving 47.57% for the global pattern and 45.20% for the saliency-based patterns. This suggests that disrupting the spatial contiguity of image patches is detrimental to performance in the standard Vision Mamba architecture. Interestingly, an ablation study revealed that combining saliency-based sequence truncation with a final-token readout recovered nearly all performance while using half the tokens, highlighting a critical interaction between scan order and model architecture. Likely as a result of the sequential nature of Mamba, placing a classification token at the natural information aggregation point seems to work best in combination with patch token reordering. Ultimately, this work shows that improving Vision Mamba requires more than a simple token reordering, suggesting that more sophisticated combinations of adaptive reordering and readout mechanisms tailored to reordered sequences are a promising direction for future research.
dc.description.sponsorship: Utrecht University
dc.language.iso: EN
dc.subject: Rearranging token sequences for vision state space models by leveraging bounding boxes and saliency heatmaps
dc.title: The effect of fixed, data driven image scan patterns on Vision Mamba performance
dc.type.content: Master Thesis
dc.rights.accessrights: Open Access
dc.subject.keywords: Vision Mamba; Scan Patterns; Patch Importance; Image Patch Ordering; Saliency Maps; Image Classification; Computer Vision; Token Reordering
dc.subject.courseuu: Artificial Intelligence
dc.thesis.id: 51985
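
The abstract describes reordering patch tokens by saliency, truncating the sequence to half the tokens, and reading out from a final token. The sketch below illustrates that idea in plain Python; it is not the thesis implementation. The function names, the `keep_ratio` and `most_salient_last` parameters, and the choice to place the most salient patches closest to the readout token are all assumptions made for illustration, and the per-patch saliency scores are taken as given rather than computed with BING.

```python
from typing import List, Sequence


def saliency_scan_order(
    patch_saliency: Sequence[float],
    keep_ratio: float = 1.0,
    most_salient_last: bool = True,
) -> List[int]:
    """Return patch indices ordered by saliency, optionally truncated.

    Hypothetical helper: the ordering direction and truncation rule are
    assumptions, not details confirmed by the abstract.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    n_keep = max(1, int(len(patch_saliency) * keep_ratio))
    # Rank patches from most to least salient; sorted() is stable, so
    # equally salient patches keep their raster order at this step.
    ranked = sorted(range(len(patch_saliency)), key=lambda i: -patch_saliency[i])
    kept = ranked[:n_keep]
    if most_salient_last:
        # Put the most salient patch right before the readout token.
        kept.reverse()
    return kept


def build_sequence(patch_tokens: Sequence, order: Sequence[int], cls_token) -> list:
    """Reorder patch tokens and append the class token as the final token."""
    return [patch_tokens[i] for i in order] + [cls_token]


# Toy example: four patches, keep the two most salient.
saliency = [0.1, 0.9, 0.4, 0.7]
order = saliency_scan_order(saliency, keep_ratio=0.5)
sequence = build_sequence(["p0", "p1", "p2", "p3"], order, "cls")
print(order, sequence)
```

Appending the class token after the reordered patches mirrors the final-token readout mentioned in the abstract: in a sequential model, the last position has seen every preceding token.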

