        Bridging the gap: Threefold knowledge distillation for language-enabled action recognition models operating in real-time scenarios

        View/Open
        MasterThesis_Frans-15-fixed-formatting.pdf (4.209Mb)
        Publication date
        2024
        Author
        Huntink, Frans
        Summary
In this work, we propose a set of knowledge distillation techniques that aim to transfer the benefits of large and computationally slow language-enabled action recognition (AR) models to smaller, faster student models that can operate in real-time scenarios. This means that important benefits of these models, such as unprecedented predictive performance, flexible natural language interaction and "zero-shot" predictions, can be used in real-time scenarios. We study existing language-based AR models and find that transformer models using the Contrastive Language-Image Pretraining (CLIP) model as an encoder backbone perform best among the set of existing language-enabled AR models. We determine that the CLIP-based AR model called ActionCLIP is most suitable for our distillation experiments by comparing it to other CLIP-based models in terms of predictive performance and inference time using a dense frame sampling strategy. We then propose three distillation techniques that each distill a specific portion of the knowledge contained in the ActionCLIP model into a smaller, faster student model. First, we propose a way to replace the CLIP encoder backbone of the ActionCLIP model with a model from the distilled TinyCLIP family. In doing so, we find a steep decrease in inference time but also a significant drop in predictive performance. Next, we propose a method to distill the spatial knowledge contained in the CLIP model itself. We do this by framing training as a multi-task learning problem in which our ActionCLIP model has to predict both the ground-truth human action label and a set of additional spatial objectives for a given AR dataset. We generate these additional objectives with a CLIP-based spatial prediction framework that we design. We find that spatial distillation improves the predictive performance of our ActionCLIP model compared to its original single-objective implementation. Finally, we adapt the Data-efficient image Transformer (DeiT) distillation approach so that it can distill video transformers instead of image transformers. We then apply the DeiT strategy by using a large pretrained ActionCLIP model as a teacher for the smaller, faster ActionCLIP student. We find a significant improvement in predictive performance over the original ActionCLIP training strategy, but also find that this is caused by the two-headed training strategy we introduced to accommodate DeiT distillation, not by the teacher supervision. This two-headed training strategy means that not one but two perspectives on a video sample are learned: a token-based perspective and a frame-based perspective. We find that its success is likely due to the model being supervised to learn these two perspectives of a video simultaneously. We then explore ways to combine the distillation techniques and obtain an ActionCLIP student model that reaches a Top-1 validation score of 66.5 on the HMDB51 dataset, only slightly lower than the 68.8 Top-1 validation score of the original ActionCLIP model, at half the backbone parameters and more than 1.73× the inference speed. We conclude by showing that this allows our model to be applied to the domain of real-time AR.
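The spatial-distillation step can be pictured as a standard multi-task loss. Below is a minimal PyTorch-style sketch, assuming the CLIP-based spatial prediction framework yields soft targets over a vocabulary of spatial concepts; the function name, target format, KL-divergence term and weighting are illustrative assumptions, not the thesis's exact objective.

import torch.nn.functional as F

def spatial_multitask_loss(action_logits, spatial_logits,
                           action_labels, spatial_targets, lam=0.5):
    # Main task: predict the ground-truth human action label.
    loss_action = F.cross_entropy(action_logits, action_labels)
    # Auxiliary task: match the CLIP-generated spatial objectives, assumed
    # here to be soft distributions over a set of spatial concepts.
    loss_spatial = F.kl_div(F.log_softmax(spatial_logits, dim=-1),
                            spatial_targets, reduction="batchmean")
    # lam balances the two objectives; 0.5 is an illustrative choice.
    return loss_action + lam * loss_spatial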
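Likewise, the two-headed DeiT-style objective can be made concrete with a short sketch. Here the hard-label distillation variant from the original DeiT recipe and the equal head weighting are assumptions; the thesis defines its own formulation for the token-based and frame-based heads.

import torch.nn.functional as F

def two_headed_deit_loss(token_logits, frame_logits,
                         teacher_logits, labels, alpha=0.5):
    # Token-based perspective: the student's class-token head is trained
    # against the ground-truth action labels with plain cross-entropy.
    loss_token = F.cross_entropy(token_logits, labels)
    # Frame-based perspective: a second head, supervised by the teacher's
    # hard predictions, as in the original DeiT recipe.
    teacher_targets = teacher_logits.argmax(dim=-1)
    loss_frames = F.cross_entropy(frame_logits, teacher_targets)
    # Equal weighting (alpha = 0.5) is an assumption, not the thesis setting.
    return alpha * loss_token + (1.0 - alpha) * loss_frames

Hard-label distillation is only one option; the common alternative is soft distillation, i.e. a KL-divergence term against the teacher's full output distribution.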
        URI
        https://studenttheses.uu.nl/handle/20.500.12932/46534
        Collections
        • Theses