FROSTER: Frozen CLIP is a Strong Teacher for Open-vocabulary Action Recognition

Abstract

In this paper, we introduce FROSTER, an effective framework for open-vocabulary action recognition. The CLIP model has achieved remarkable success in a range of image-based tasks, benefiting from its strong generalization capability, which stems from pretraining on massive image-text pairs.

However, applying CLIP directly to the open-vocabulary action recognition task is challenging due to the absence of temporal information in CLIP's pretraining. Further, fine-tuning CLIP on action recognition datasets may lead to overfitting and hinder its generalizability, resulting in unsatisfactory results when dealing with unseen actions.

To address these issues, FROSTER employs a residual feature distillation approach to ensure that CLIP retains its generalization capability while effectively adapting to the action recognition task. Specifically, the residual feature distillation treats the frozen CLIP model as a teacher to maintain the generalizability exhibited by the original CLIP and supervises the feature learning for the extraction of video-specific features to bridge the gap between images and videos. Meanwhile, it uses a residual sub-network for feature distillation to reach a balance between the two distinct objectives of learning generalizable and video-specific features.

General Pipeline

The overall pipeline of FROSTER consists of two key components, namely, model finetuning to bridge the gap between image and video tasks, and knowledge distillation to maintain the generalizability of the pretrained CLIP.

As shown above, the "video-specific" objective is achieved through common classification-based finetuning, while the "generalizable" objective is achieved by using frozen CLIP as a teacher to impart pretrained knowledge to the tuned model, inspired by knowledge distillation techniques.
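The two objectives can be read as a single training loss with a classification term and a distillation term. The sketch below is only an illustration of that reading, not the released implementation: the function names, the mean-squared feature distance, and the `distill_weight` hyperparameter are all assumptions introduced here.

```python
import math


def cross_entropy(logits, label):
    """Softmax cross-entropy for one sample (the classification-based finetuning term)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def feature_distance(student_feat, teacher_feat):
    """Mean squared distance between tuned and frozen features (assumed distance)."""
    squared = [(s - t) ** 2 for s, t in zip(student_feat, teacher_feat)]
    return sum(squared) / len(squared)


def total_loss(logits, label, student_feat, teacher_feat, distill_weight=1.0):
    """Video-specific term plus generalizable term.

    distill_weight is a hypothetical balancing hyperparameter.
    """
    classification = cross_entropy(logits, label)
    distillation = feature_distance(student_feat, teacher_feat)
    return classification + distill_weight * distillation
```

When the tuned features equal the frozen ones, the distillation term vanishes and only the classification term remains; the further the tuned features drift, the larger the penalty.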

The distillation process is akin to a regularization term that ensures the tuned features do not diverge too far from the frozen ones. To balance the feature learning between the two distinct goals, we propose a modified residual network for conducting distillation. The intuition behind the design is to allow the tuned features to effectively receive supervision from generalized ones while also being video-specific.
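One way to picture the residual design is a small two-layer projection with an identity shortcut, applied to the tuned feature before it is compared with the frozen one. The following dependency-free sketch makes several assumptions that are not stated on this page: the GELU activation, the scaling factor `alpha`, the weight shapes, and the mean-squared distance are illustrative choices, and all names are hypothetical.

```python
import math


def gelu(x):
    """Exact GELU activation (assumed choice of non-linearity)."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def matvec(weight, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def residual_project(feat, w1, w2, alpha=0.1):
    """Return feat + alpha * W2(gelu(W1 feat)).

    The identity shortcut passes the tuned feature through unchanged, and the
    small learned branch absorbs part of the mismatch with the teacher.
    alpha is a hypothetical balancing coefficient.
    """
    hidden = [gelu(h) for h in matvec(w1, feat)]
    delta = matvec(w2, hidden)
    return [x + alpha * d for x, d in zip(feat, delta)]


def distill_loss(projected, teacher_feat):
    """Mean squared distance between the projected tuned feature and the frozen one."""
    squared = [(p - t) ** 2 for p, t in zip(projected, teacher_feat)]
    return sum(squared) / len(squared)
```

If `w2` is all zeros, `residual_project` returns the tuned feature unchanged, so the loss compares tuned and frozen features directly. As the residual branch learns a non-zero correction, the tuned feature gains some room to stay video-specific while the projected feature is still supervised by the frozen teacher.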

Performance

Though simple, FROSTER achieves better performance than state-of-the-art video models in both the base-to-novel and cross-dataset evaluation settings. The results on the base-to-novel setting are shown below.

The results on the cross-dataset setting are shown below.

Meanwhile, our framework consistently achieves higher performance when equipped with different video models as shown below.

Visualization

In the figure below, we show the attention maps of FROSTER and other baselines. Overall, our model attends to informative regions related to the action, which leads to more reliable recognition.

BibTeX

@inproceedings{huang2024froster,
      title={FROSTER: Frozen CLIP is a Strong Teacher for Open-Vocabulary Action Recognition},
      author={Xiaohu Huang and Hao Zhou and Kun Yao and Kai Han},
      booktitle={International Conference on Learning Representations},
      year={2024},
}

Acknowledgement

This web page is adapted from the Nerfies template. Thanks to its authors for their great work.