Action Search: Targets in Untrimmed Videos and Its Application to Temporal Action Localization [2017]

State-of-the-art approaches for video-based tasks inefficiently search the entire video for specific targets. Despite the encouraging progress these methods achieve, it is crucial to design automated approaches that explore only the parts of the video that are most relevant to the given task. To address this need, we propose the new problem of target spotting in video, which we define as finding a specific target in a video sequence while observing only a small portion of that video. Inspired by the observation that humans are extremely efficient and accurate at finding individual targets in video, we propose Action Search, a novel Recurrent Neural Network approach that mimics the way humans spot targets in untrimmed video sequences. Moreover, to address the absence of data recording the behavior of human annotators, we put forward the Human Searches dataset, a new dataset composed of the search sequences of human annotators for the AVA and THUMOS14 datasets. We consider temporal action localization as an application of the target spotting problem. Experiments on the THUMOS14 dataset reveal that our model not only explores the video efficiently (observing on average 17.3% of the video) but also accurately finds human activities, achieving 30.8% mAP and outperforming state-of-the-art methods.