TY - GEN
T1 - TransRank
T2 - 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022
AU - Duan, Haodong
AU - Zhao, Nanxuan
AU - Chen, Kai
AU - Lin, Dahua
PY - 2022/6/24
Y1 - 2022/6/24
N2 - Recognizing transformation types applied to a video clip (RecogTrans) is a long-established paradigm for selfsupervised video representation learning, which achieves much inferior performance compared to instance discrimination approaches (InstDisc) in recent works. However, based on a thorough comparison of representative Recog-Trans and InstDisc methods, we observe the great potential of RecogTrans on both semantic-related and temporalrelated downstream tasks. Based on hard-label classification, existing RecogTrans approaches suffer from noisy supervision signals in pre-training. To mitigate this problem, we developed TransRank, a unified framework for recognizing Transformations in a Ranking formulation. TransRank provides accurate supervision signals by recognizing transformations relatively, consistently outperforming the classification-based formulation. Meanwhile, the unified framework can be instantiated with an arbitrary set of temporal or spatial transformations, demonstrating good generality. With a ranking-based formulation and several empirical practices, we achieve competitive performance on video retrieval and action recognition. Under the same setting, TransRank surpasses the previous state-of-the-art method [28] by 6.4% on UCF101 and 8.3% on HMDB51 for action recognition (Topl Acc); improves video retrieval on UCF101 by 20.4% (R@1). The promising results validate that RecogTrans is still a worth exploring paradigm for video self-supervised learning. Codes will be released at https://github.com/kennymckormick/TransRank.
AB - Recognizing transformation types applied to a video clip (RecogTrans) is a long-established paradigm for selfsupervised video representation learning, which achieves much inferior performance compared to instance discrimination approaches (InstDisc) in recent works. However, based on a thorough comparison of representative Recog-Trans and InstDisc methods, we observe the great potential of RecogTrans on both semantic-related and temporalrelated downstream tasks. Based on hard-label classification, existing RecogTrans approaches suffer from noisy supervision signals in pre-training. To mitigate this problem, we developed TransRank, a unified framework for recognizing Transformations in a Ranking formulation. TransRank provides accurate supervision signals by recognizing transformations relatively, consistently outperforming the classification-based formulation. Meanwhile, the unified framework can be instantiated with an arbitrary set of temporal or spatial transformations, demonstrating good generality. With a ranking-based formulation and several empirical practices, we achieve competitive performance on video retrieval and action recognition. Under the same setting, TransRank surpasses the previous state-of-the-art method [28] by 6.4% on UCF101 and 8.3% on HMDB51 for action recognition (Topl Acc); improves video retrieval on UCF101 by 20.4% (R@1). The promising results validate that RecogTrans is still a worth exploring paradigm for video self-supervised learning. Codes will be released at https://github.com/kennymckormick/TransRank.
KW - Representation learning
KW - Self-& semi-& meta- & unsupervised learning
KW - Video analysis and understanding
UR - http://www.scopus.com/inward/record.url?scp=85141795604&partnerID=8YFLogxK
U2 - 10.1109/CVPR52688.2022.00301
DO - 10.1109/CVPR52688.2022.00301
M3 - Chapter in a published conference proceeding
AN - SCOPUS:85141795604
T3 - Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
SP - 2990
EP - 3000
BT - Proceedings - 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022
PB - IEEE
Y2 - 19 June 2022 through 24 June 2022
ER -