
        Tiny Object Detection in Video with Transformers using Temporal Context

        View/Open
        Thesis_PimVeraar6409458.pdf (4.208Mb)
        Publication date
        2024
        Author
        Veraar, Pim
        Summary
        Tiny object detection in video footage using computer vision models has a wide range of applications. Following an overview of relevant deep learning methods and related work in this field, this paper explores the performance of Video Transformer models for this challenging task. More specifically, this work focuses on the effect of early fusion of spatio-temporal features on the performance of Video Transformers. We propose a switch to a Video Swin backbone, using spatio-temporal fusion mechanisms to resolve the shape mismatch between the backbone output and the model input while conserving as much temporal information as possible. These fusion mechanisms are placed between the backbone and the model, making the approach compatible with any object detector. Several fusion mechanisms are investigated, of which a 3D convolution yields the best performance. The resulting multi-frame model does not show performance improvements over state-of-the-art models. Compared to a single-frame version of the same model, overall performance is comparable, although distinct differences can be noted in their respective abilities to detect specific object classes. The multi-frame model outperforms the single-frame model in scenarios with large objects, rare classes, and medium-speed objects. These results indicate that, while the multi-frame model struggles to interpret the added temporal information in the context of small objects, it offers unique advantages in certain areas. This work provides new insights into how early temporal context affects a model's performance, thereby improving our understanding of the remaining challenges of Video Transformers for tiny object detection. Based on these findings, future work can be directed towards improving the method proposed in this paper, aiming to achieve performance gains across a broad range of real-world applications.
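
        The fusion step described in the summary can be pictured as a small adapter module sitting between the video backbone and a single-frame detector. The sketch below is an illustrative reconstruction, not the thesis code: the module name, tensor shapes, and the use of a single Conv3d layer are assumptions made only to show how a 3D convolution can collapse the temporal axis of Video Swin-style features so they match the input shape a per-frame detector expects.

```python
import torch
import torch.nn as nn


class TemporalFusion3D(nn.Module):
    """Hypothetical 3D-convolution fusion module (illustrative, not from the thesis).

    Collapses the temporal axis of backbone features shaped (B, C, T, H, W),
    e.g. from a Video Swin backbone, down to (B, C, H, W) so that any
    single-frame object detector can consume them.
    """

    def __init__(self, channels: int, num_frames: int):
        super().__init__()
        # A 3D convolution whose temporal kernel spans all T frames, so the
        # output has temporal extent 1 and the time dimension can be squeezed.
        self.fuse = nn.Conv3d(
            in_channels=channels,
            out_channels=channels,
            kernel_size=(num_frames, 3, 3),
            padding=(0, 1, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T, H, W) -> (B, C, 1, H, W) -> (B, C, H, W)
        return self.fuse(x).squeeze(2)


if __name__ == "__main__":
    # Toy example: 2 clips, 96 channels, 4 frames, 56x56 feature maps.
    feats = torch.randn(2, 96, 4, 56, 56)
    fused = TemporalFusion3D(channels=96, num_frames=4)(feats)
    print(fused.shape)  # torch.Size([2, 96, 56, 56])
```

        Because the fused output has a per-frame feature shape, an adapter of this kind could in principle be swapped for other fusion operations without touching the detector, which matches the summary's point that placing the fusion between backbone and model keeps the approach detector-agnostic.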
        URI
        https://studenttheses.uu.nl/handle/20.500.12932/48222
        Collections
        • Theses