Bridging the gap: Threefold knowledge distillation for language-enabled action recognition models operating in real-time scenarios
Summary
In this work, we propose a set of knowledge distillation techniques that aim to transfer the benefits
of large and computationally slow language-enabled action recognition (AR) models to smaller, faster
student models that can operate in real-time scenarios. In this way, important benefits of these models, such as their unprecedented predictive performance, flexible natural language interaction and “zero-shot” predictions, become available in real-time settings. We study existing language-based AR models and
find that transformer models using the Contrastive Language-Image Pretraining (CLIP) model as an
encoder backbone perform best among existing language-enabled AR models. By comparing the CLIP-based AR model ActionCLIP to other CLIP-based models in terms of predictive performance and inference time under a dense frame sampling strategy, we determine that it is the most suitable model for our distillation experiments. We then propose three distillation techniques that each distill
a specific portion of the knowledge contained in the ActionCLIP model into a smaller, faster student
model. First, we propose a way to replace the CLIP encoder backbone of the ActionCLIP model with
a model from the distilled TinyCLIP family. In doing so, we observe a steep decrease in inference time, but also a significant drop in predictive performance.
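As a minimal PyTorch sketch of this backbone swap (the class and loader names below are hypothetical illustrations, not the actual ActionCLIP or TinyCLIP code), the CLIP image and text towers of the student are simply replaced by their smaller TinyCLIP counterparts while the rest of the architecture is kept:

    import torch.nn as nn

    class ActionCLIPStudent(nn.Module):
        def __init__(self, image_encoder, text_encoder, temporal_head):
            super().__init__()
            self.image_encoder = image_encoder   # per-frame visual encoder (e.g. a TinyCLIP ViT)
            self.text_encoder = text_encoder     # encodes action-label prompts (TinyCLIP text tower)
            self.temporal_head = temporal_head   # aggregates frame features over time

        def forward(self, frames, prompts):
            # frames: (batch, time, channels, height, width)
            b, t = frames.shape[:2]
            frame_feats = self.image_encoder(frames.flatten(0, 1)).unflatten(0, (b, t))
            video_emb = self.temporal_head(frame_feats)   # (batch, embed_dim)
            text_emb = self.text_encoder(prompts)         # (num_classes, embed_dim)
            return video_emb @ text_emb.t()               # video-text similarity logits

    # Hypothetical loader; the original CLIP towers are swapped for TinyCLIP ones:
    # image_enc, text_enc = load_tinyclip("TinyCLIP-ViT-39M")  # assumed helper, not a real API
    # student = ActionCLIPStudent(image_enc, text_enc, temporal_head)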
Next, we propose a method to distill the spatial knowledge contained in the CLIP model itself. We do this by formulating a multi-task learning problem for our ActionCLIP model, in which the model has to predict both the ground-truth human action label and a set of additional spatial objectives for a given AR dataset. We generate these additional objectives by designing a CLIP-based spatial prediction framework. We find that spatial distillation improves the predictive performance of our ActionCLIP model compared to its original single-objective implementation.
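A minimal sketch of the resulting multi-task objective is given below; the particular loss forms and the weighting factor lambda_spatial are illustrative assumptions rather than the exact formulation:

    import torch.nn.functional as F

    def spatial_distillation_loss(action_logits, action_labels,
                                  spatial_logits, spatial_targets,
                                  lambda_spatial=0.5):
        # Main objective: predict the ground-truth human action label.
        action_loss = F.cross_entropy(action_logits, action_labels)
        # Auxiliary objective: predict the CLIP-generated spatial targets,
        # treated here as soft multi-label scores in [0, 1] (an assumption).
        spatial_loss = F.binary_cross_entropy_with_logits(spatial_logits, spatial_targets)
        # lambda_spatial balances the two tasks (value is an assumption).
        return action_loss + lambda_spatial * spatial_loss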
Finally, we adapt the Data-efficient image Transformers (DeiT) distillation approach so that it can distill video transformers instead of image transformers. We then apply the DeiT strategy by using a large pretrained ActionCLIP model as a teacher for the smaller, faster ActionCLIP student. We find a significant improvement in predictive performance over the original ActionCLIP training strategy, but also find that this improvement is caused by the two-headed training strategy we introduced to accommodate DeiT distillation, not by the teacher supervision itself. This two-headed training strategy means that not one but two perspectives on a video sample are learned: a token-based perspective and a frame-based perspective. We find that its success likely stems from the model being supervised to learn these two perspectives of a video simultaneously.
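A minimal sketch of this two-headed objective, in the hard-label style of DeiT, is shown below; which head receives ground-truth versus teacher supervision and the equal weighting are illustrative assumptions:

    import torch.nn.functional as F

    def two_headed_deit_loss(token_logits, frame_logits, labels, teacher_logits):
        # Head 1 (token-based perspective): supervised by the ground-truth action label.
        cls_loss = F.cross_entropy(token_logits, labels)
        # Head 2 (frame-based perspective): supervised by the teacher's hard predictions,
        # as in DeiT's hard-label distillation.
        teacher_labels = teacher_logits.argmax(dim=-1)
        dist_loss = F.cross_entropy(frame_logits, teacher_labels)
        return 0.5 * cls_loss + 0.5 * dist_loss

    # At test time the two heads' predictions can be averaged, as in DeiT:
    # logits = 0.5 * (token_logits + frame_logits)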
We then explore ways to combine the distillation techniques and obtain an ActionCLIP student model that reaches a Top-1 validation score of 66.5 on the HMDB51 dataset, only slightly below the 68.8 Top-1 validation score obtained by the original ActionCLIP model, while using half the backbone parameters and running at more than 1.73× the inference speed. We conclude by showing that this allows our model to be applied to the domain of real-time AR.