The effect of fixed, data driven image scan patterns on Vision Mamba performance

Terpstra, Joren

dc.rights.license	CC-BY-NC-ND
dc.contributor.advisor	Poppe, Ronald
dc.contributor.author	Terpstra, Joren
dc.date.accessioned	2025-08-21T00:05:58Z
dc.date.available	2025-08-21T00:05:58Z
dc.date.issued	2025
dc.identifier.uri	https://studenttheses.uu.nl/handle/20.500.12932/49892
dc.description.abstract	Mamba-based vision models show promise as successors to Transformer-based models because they combine competitive performance with significant efficiency improvements, but they are constrained by fixed, predetermined scan patterns. A scan pattern is a specific ordering of patch tokens, and it influences the receptive field of a Mamba vision model. In this research, we investigated two methods of constructing data-driven scan patterns for Vision Mamba on an image classification task. The first is a global fixed pattern derived by aggregating the bounding boxes of a subset of ImageNet-1K; the second is a per-image pattern based on BING saliency maps. Both methods were intended to improve the classification accuracy of Vision Mamba. Contrary to our hypothesis, both underperformed the baseline of 53.11% top-1 classification accuracy, achieving 47.57% for the global pattern and 45.20% for the saliency-based patterns. This suggests that disrupting the spatial contiguity of image patches is detrimental to performance in the standard Vision Mamba architecture. Interestingly, an ablation study revealed that combining saliency-based sequence truncation with a final-token readout recovered nearly all of the performance while using half the tokens, highlighting a critical interaction between scan order and model architecture. Likely because Mamba processes tokens sequentially, placing the classification token at the natural information aggregation point seems to work best in combination with patch token reordering. Ultimately, this work shows that improving Vision Mamba requires more than simple token reordering, and it suggests that more sophisticated combinations of adaptive reordering and readout mechanisms tailored to reordered sequences are a promising direction for future research.
dc.description.sponsorship	Utrecht University
dc.language.iso	EN
dc.subject	Rearranging token sequences for vision state space models by leveraging bounding boxes and saliency heatmaps
dc.title	The effect of fixed, data driven image scan patterns on Vision Mamba performance
dc.type.content	Master Thesis
dc.rights.accessrights	Open Access
dc.subject.keywords	Vision Mamba; Scan Patterns; Patch Importance; Image Patch Ordering; Saliency Maps; Image Classification; Computer Vision; Token Reordering
dc.subject.courseuu	Artificial Intelligence
dc.thesis.id	51985

Files in this item

Name: Thesis Joren Terpstra - defini ...
Size: 9.088Mb
Format: PDF


This item appears in the following Collection(s)

Theses
