dc.rights.license | CC-BY-NC-ND | |
dc.contributor.advisor | Poppe, Ronald | |
dc.contributor.author | Huntink, Frans | |
dc.date.accessioned | 2024-06-21T23:02:13Z | |
dc.date.available | 2024-06-21T23:02:13Z | |
dc.date.issued | 2024 | |
dc.identifier.uri | https://studenttheses.uu.nl/handle/20.500.12932/46534 | |
dc.description.abstract | In this work, we propose a set of knowledge distillation techniques that aim to transfer the benefits
of large and computationally slow language-enabled action recognition (AR) models to smaller, faster
student models that can operate in real-time scenarios. In this way, key benefits of these models,
such as unprecedented predictive performance, flexible natural language interaction and “zero-shot”
predictions, become available in real time. We study existing language-based AR models and
find that transformer models using the Contrastive Language-Image Pretraining (CLIP) model as an
encoder backbone perform best among the set of existing language-enabled AR models. We determine
that the CLIP-based AR model called ActionCLIP is most suitable for our distillation experiments
by comparing it to other CLIP-based models in terms of predictive performance and inference time
using a dense frame sampling strategy. We then propose three distillation techniques that each distill
a specific portion of the knowledge contained in the ActionCLIP model into a smaller, faster student
model. First, we propose a way to replace the CLIP encoder backbone of the ActionCLIP model with
a model from the distilled TinyCLIP family. In doing so, we find a steep decrease in inference time
but also find a significant drop in predictive performance. Next, we propose a method to distill the
spatial knowledge contained in the CLIP model itself. We do this by creating a multi-task learning
problem for our ActionCLIP model in which the model has to predict both the ground-truth human
action label and a set of additional spatial objectives for a given AR dataset. We generate these
additional objectives by designing a CLIP-based spatial prediction framework. We find that spatial
distillation improves the predictive performance of our ActionCLIP model as compared to its original
single-objective implementation. Finally, we adapt the data-efficient image transformer (DeiT)
distillation approach so that it can distill video transformers instead of image transformers. We
then apply the DeiT strategy by using a large pretrained ActionCLIP model
as a teacher to the smaller, faster ActionCLIP student. We find a significant improvement in predictive
performance over the original ActionCLIP training strategy, but also find that this gain is caused by
the two-headed training strategy we introduced to accommodate DeiT distillation, not by the teacher
supervision. In this two-headed strategy, the model learns not one but two perspectives on a video
sample: a token-based perspective and a frame-based perspective. Its success likely stems from the
model being supervised to learn these two perspectives simultaneously. We then explore ways to
combine the distillation techniques and obtain an
ActionCLIP student model that reaches a Top-1 validation score of 66.5 on the HMDB51 dataset,
only slightly below the 68.8 Top-1 validation score obtained by the original ActionCLIP model, while
using half the backbone parameters and running at more than 1.73× the inference speed. We conclude
by showing that this makes our model applicable to the domain of real-time AR. | |
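Illustrative sketch (not part of the thesis record): the spatial-distillation step described in the abstract casts training as a multi-task problem, predicting the ground-truth action label alongside spatial targets generated by a CLIP-based framework. The minimal PyTorch sketch below shows one way such a combined objective could look; all names (MultiTaskActionHead, spatial_head, lambda_spatial) and the choice of a binary cross-entropy term for the spatial targets are assumptions, not the thesis implementation.

# Hypothetical multi-task setup: action classification plus auxiliary spatial targets.
import torch
import torch.nn as nn

class MultiTaskActionHead(nn.Module):
    """Wraps a video encoder with an action head and an auxiliary spatial head."""
    def __init__(self, encoder: nn.Module, feat_dim: int, num_actions: int, num_spatial: int):
        super().__init__()
        self.encoder = encoder                      # e.g. an ActionCLIP-style video encoder
        self.action_head = nn.Linear(feat_dim, num_actions)
        self.spatial_head = nn.Linear(feat_dim, num_spatial)

    def forward(self, video: torch.Tensor):
        feats = self.encoder(video)                 # assumed shape: (batch, feat_dim)
        return self.action_head(feats), self.spatial_head(feats)

def multitask_loss(action_logits, spatial_logits, action_labels, spatial_targets,
                   lambda_spatial: float = 0.5):
    """Single-label action loss plus weighted multi-label spatial loss (weighting assumed)."""
    ce = nn.functional.cross_entropy(action_logits, action_labels)
    bce = nn.functional.binary_cross_entropy_with_logits(spatial_logits, spatial_targets)
    return ce + lambda_spatial * bce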
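Illustrative sketch (not part of the thesis record): the abstract's DeiT-style, two-headed training supervises a token-based and a frame-based perspective of each video, with a large pretrained ActionCLIP teacher providing an additional signal. The sketch below shows one plausible loss of this kind; the temperature, the alpha weighting, and the use of a softened KL term for the teacher supervision are assumptions rather than the thesis's exact formulation.

# Hypothetical two-headed, DeiT-style distillation loss for a video transformer student.
import torch
import torch.nn.functional as F

def two_headed_deit_loss(token_logits: torch.Tensor,
                         frame_logits: torch.Tensor,
                         teacher_logits: torch.Tensor,
                         labels: torch.Tensor,
                         temperature: float = 2.0,
                         alpha: float = 0.5) -> torch.Tensor:
    """Hard-label supervision on the token head, soft teacher supervision on the frame head."""
    # Token-based head: standard cross-entropy against the ground-truth action label.
    ce = F.cross_entropy(token_logits, labels)
    # Frame-based head: KL divergence to the (frozen) teacher's softened distribution.
    kl = F.kl_div(
        F.log_softmax(frame_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    return (1 - alpha) * ce + alpha * kl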
dc.description.sponsorship | Utrecht University | |
dc.language.iso | EN | |
dc.title | Bridging the gap: Threefold knowledge distillation for language-enabled
action recognition models operating in real-time scenarios | |
dc.type.content | Master Thesis | |
dc.rights.accessrights | Open Access | |
dc.subject.keywords | Real-time, action recognition, language-enabled AR, CLIP, ActionCLIP, image recognition, language-video transformer | |
dc.subject.courseuu | Artificial Intelligence | |
dc.thesis.id | 31709 | |