The effect of fixed, data-driven image scan patterns on Vision Mamba performance
Summary
Mamba-based vision models show promise as successors to Transformer-based models, offering competitive performance with significant efficiency gains, but they are constrained by fixed, predetermined scan patterns. These scan patterns are specific orderings of the patch tokens and influence the receptive field of Mamba vision models. In this research, we investigated two methods of constructing data-driven scan patterns for Vision Mamba on an image classification task: a global fixed pattern derived by aggregating bounding boxes over a subset of ImageNet-1K, and a per-image pattern based on BING saliency maps. Both methods were intended to improve Vision Mamba's classification accuracy by grounding the scan order in the data. Contrary to our hypothesis, both underperformed the baseline of 53.11% top-1 classification accuracy, reaching 47.57% with the global pattern and 45.20% with the saliency-based patterns. This suggests that disrupting the spatial contiguity of image patches is detrimental to performance in the standard Vision Mamba architecture. Interestingly, an ablation study revealed that combining saliency-based sequence truncation with a final-token readout recovered nearly all of the performance while using only half the tokens, highlighting a critical interaction between scan order and model architecture. Likely because of Mamba's sequential nature, placing a classification token at the natural information-aggregation point appears to work best in combination with patch-token reordering. Ultimately, this work shows that improving Vision Mamba requires more than simple token reordering, and it suggests that adaptive reordering combined with readout mechanisms tailored to the reordered sequence is a promising direction for future research.
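For intuition, the sketch below shows the kind of score-driven reordering and truncation of patch tokens described above, assuming PyTorch-style (B, N, D) token tensors. The function name, the keep_ratio parameter, and the random inputs are illustrative assumptions, not the implementation used in this work.

```python
import torch

def reorder_patches(patch_tokens, scores, keep_ratio=1.0):
    """Reorder (and optionally truncate) patch tokens by descending score.

    patch_tokens: (B, N, D) tensor of embedded patches
    scores:       (B, N) tensor with one relevance/saliency score per patch
    keep_ratio:   fraction of patches to keep (1.0 keeps all of them)
    """
    B, N, D = patch_tokens.shape
    # Per-image permutation that sorts patches from highest to lowest score.
    order = scores.argsort(dim=1, descending=True)                  # (B, N)
    reordered = patch_tokens.gather(1, order.unsqueeze(-1).expand(B, N, D))
    n_keep = max(1, int(N * keep_ratio))
    return reordered[:, :n_keep]                                    # (B, n_keep, D)


# Illustrative usage: 14x14 patches, embedding dim 192, batch of 2.
tokens = torch.randn(2, 196, 192)
saliency = torch.rand(2, 196)            # stand-in for per-patch saliency scores
sequence = reorder_patches(tokens, saliency, keep_ratio=0.5)   # half the tokens
# `sequence` would then be fed to the Mamba backbone; with a final-token
# readout, the hidden state of the last token is used for classification.
```

Under such a reordering, the final retained token is the only one whose recurrent state has seen the entire kept sequence, which is consistent with the observation that a final-token readout pairs naturally with saliency-based truncation.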