dc.description.abstract | Tiny object detection in video footage using computer vision models has a wide range of applications. Following an overview of relevant deep learning methods and related work in this field, this paper explores the performance of Video Transformer models on this challenging task. More specifically, this work focuses on the effect of early fusion of spatio-temporal features on the performance of Video Transformers. We propose switching to a Video Swin backbone and using spatio-temporal fusion mechanisms to resolve the shape mismatch between the backbone output and the detector input, while preserving as much temporal information as possible. These fusion mechanisms are placed between the backbone and the detector, making the approach compatible with any object detector. Using this approach, different fusion mechanisms are investigated, of which a 3D convolution yields the best performance. The resulting multi-frame model does not show performance improvements over state-of-the-art models. Compared to a single-frame version of the same model, its overall performance is comparable, although distinct differences can be noted in their respective abilities to detect specific object classes. The multi-frame model outperforms the single-frame model in scenarios with large objects, rare classes, and medium-speed objects. These results indicate that, while the multi-frame model struggles to interpret the added temporal information in the context of small objects, it offers unique advantages in certain areas. This work provides new insights into how early temporal context affects a model’s performance, thereby deepening the understanding of the remaining challenges of Video Transformers for tiny object detection. Based on these findings, future work can be directed towards improving the method proposed in this paper, aiming to achieve performance gains on a broad range of real-world applications. | |