TY - JOUR
T1 - The EPIC-KITCHENS Dataset: Collection, Challenges and Baselines
AU - Damen, Dima
AU - Doughty, Hazel
AU - Farinella, Giovanni Maria
AU - Fidler, Sanja
AU - Furnari, Antonino
AU - Kazakos, Evangelos
AU - Moltisanti, Davide
AU - Munro, Jonathan
AU - Perrett, Toby
AU - Price, Will
AU - Wray, Michael
PY - 2021/11/1
Y1 - 2021/11/1
N2 - Since its introduction in 2018, EPIC-KITCHENS has attracted attention as the largest egocentric video benchmark, offering a unique viewpoint on people’s interaction with objects, their attention, and even intention. In this paper, we detail how this large-scale dataset was captured by 32 participants in their native kitchen environments, and densely annotated with actions and object interactions. Our videos depict nonscripted daily activities, as recording started every time a participant entered their kitchen. Recording took place in four countries by participants belonging to ten different nationalities, resulting in highly diverse kitchen habits and cooking styles. Our dataset features 55 hours of video consisting of 11.5M frames, which we densely labelled for a total of 39.6K action segments and 454.2K object bounding boxes. Our annotation is unique in that we had the participants narrate their own videos (after recording), thus reflecting true intention, and we crowd-sourced ground-truths based on these. We describe our object, action and anticipation challenges, and evaluate several baselines over two test splits, seen and unseen kitchens. We introduce new baselines that highlight the multimodal nature of the dataset and the importance of explicit temporal modelling to discriminate fine-grained actions (e.g., ‘closing a tap’ from ‘opening’ it up).
AB - Since its introduction in 2018, EPIC-KITCHENS has attracted attention as the largest egocentric video benchmark, offering a unique viewpoint on people’s interaction with objects, their attention, and even intention. In this paper, we detail how this large-scale dataset was captured by 32 participants in their native kitchen environments, and densely annotated with actions and object interactions. Our videos depict nonscripted daily activities, as recording started every time a participant entered their kitchen. Recording took place in four countries by participants belonging to ten different nationalities, resulting in highly diverse kitchen habits and cooking styles. Our dataset features 55 hours of video consisting of 11.5M frames, which we densely labelled for a total of 39.6K action segments and 454.2K object bounding boxes. Our annotation is unique in that we had the participants narrate their own videos (after recording), thus reflecting true intention, and we crowd-sourced ground-truths based on these. We describe our object, action and anticipation challenges, and evaluate several baselines over two test splits, seen and unseen kitchens. We introduce new baselines that highlight the multimodal nature of the dataset and the importance of explicit temporal modelling to discriminate fine-grained actions (e.g., ‘closing a tap’ from ‘opening’ it up).
U2 - 10.1109/TPAMI.2020.2991965
DO - 10.1109/TPAMI.2020.2991965
M3 - Article
SN - 0162-8828
VL - 43
SP - 4125
EP - 4141
JO - IEEE Transactions on Pattern Analysis and Machine Intelligence
JF - IEEE Transactions on Pattern Analysis and Machine Intelligence
IS - 11
ER -