TY - GEN
T1 - TransRank
T2 - 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022
AU - Duan, Haodong
AU - Zhao, Nanxuan
AU - Chen, Kai
AU - Lin, Dahua
N1 - Funding Information:
To conclude, we demonstrate the great potential of RecogTrans-based video self-supervised learning by introducing a unified framework named TransRank. We have shown its effectiveness through extensive ablation studies and comparisons with state-of-the-art methods. Given the initial success in marrying RecogTrans with InstDisc [9, 27, 57], how to use TransRank to further boost this research line is also worth exploring. We will release our code and pre-trained models to facilitate future research. Broader Impact. Self-supervised learning is a data-hungry task that consumes expensive computational resources, though we have mitigated the effort and expense of collecting annotations. Since we have verified our model in multiple aspects and downstream tasks, we hope our released code and models can serve as a solid baseline for RecogTrans methods and deliver good initializations to benefit downstream tasks. Besides, data-driven methods often bring the risk of learning biases and preserving them in downstream tasks. We encourage users to carefully consider the consequences of these biases when adopting our model. Acknowledgement. This study is supported by the General Research Funds (GRF) of Hong Kong (No. 14203518) and Shanghai Committee of Science and Technology, China (No. 20DZ1100800).
PY - 2022/6/24
Y1 - 2022/6/24
N2 - Recognizing transformation types applied to a video clip (RecogTrans) is a long-established paradigm for self-supervised video representation learning, which achieves much inferior performance compared to instance discrimination approaches (InstDisc) in recent works. However, based on a thorough comparison of representative RecogTrans and InstDisc methods, we observe the great potential of RecogTrans on both semantic-related and temporal-related downstream tasks. Based on hard-label classification, existing RecogTrans approaches suffer from noisy supervision signals in pre-training. To mitigate this problem, we developed TransRank, a unified framework for recognizing Transformations in a Ranking formulation. TransRank provides accurate supervision signals by recognizing transformations relatively, consistently outperforming the classification-based formulation. Meanwhile, the unified framework can be instantiated with an arbitrary set of temporal or spatial transformations, demonstrating good generality. With a ranking-based formulation and several empirical practices, we achieve competitive performance on video retrieval and action recognition. Under the same setting, TransRank surpasses the previous state-of-the-art method [28] by 6.4% on UCF101 and 8.3% on HMDB51 for action recognition (Top1 Acc), and improves video retrieval on UCF101 by 20.4% (R@1). The promising results validate that RecogTrans is still a paradigm worth exploring for video self-supervised learning. Code will be released at https://github.com/kennymckormick/TransRank.
AB - Recognizing transformation types applied to a video clip (RecogTrans) is a long-established paradigm for self-supervised video representation learning, which achieves much inferior performance compared to instance discrimination approaches (InstDisc) in recent works. However, based on a thorough comparison of representative RecogTrans and InstDisc methods, we observe the great potential of RecogTrans on both semantic-related and temporal-related downstream tasks. Based on hard-label classification, existing RecogTrans approaches suffer from noisy supervision signals in pre-training. To mitigate this problem, we developed TransRank, a unified framework for recognizing Transformations in a Ranking formulation. TransRank provides accurate supervision signals by recognizing transformations relatively, consistently outperforming the classification-based formulation. Meanwhile, the unified framework can be instantiated with an arbitrary set of temporal or spatial transformations, demonstrating good generality. With a ranking-based formulation and several empirical practices, we achieve competitive performance on video retrieval and action recognition. Under the same setting, TransRank surpasses the previous state-of-the-art method [28] by 6.4% on UCF101 and 8.3% on HMDB51 for action recognition (Top1 Acc), and improves video retrieval on UCF101 by 20.4% (R@1). The promising results validate that RecogTrans is still a paradigm worth exploring for video self-supervised learning. Code will be released at https://github.com/kennymckormick/TransRank.
KW - Representation learning
KW - Self- & semi- & meta- & unsupervised learning
KW - Video analysis and understanding
UR - http://www.scopus.com/inward/record.url?scp=85141795604&partnerID=8YFLogxK
U2 - 10.1109/CVPR52688.2022.00301
DO - 10.1109/CVPR52688.2022.00301
M3 - Chapter in a published conference proceeding
AN - SCOPUS:85141795604
T3 - Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
SP - 2990
EP - 3000
BT - Proceedings - 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022
PB - IEEE
Y2 - 19 June 2022 through 24 June 2022
ER -