In the second stage, we utilize both the segment-level and video-level rep-
resentations for long-term action anticipation. We design a transformer-based
model that contains two encoders: (1) a segment encoder to derive representa-
tions corresponding to segments in the observed video, and (2) a video encoder
to derive the video-level representations of the observed video. These encoded
representations are then fed into an anticipation decoder that predicts actions
that would occur in the future. Our model is designed to directly predict a set
of future action instances, wherein each element of the set (i.e., an action in-
stance) contains the start and end timestamps of the instance along with the
action label. Using direct set prediction, our approach predicts the actions at all
the timestamps over a given anticipation duration in a single forward pass.
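To make the set-prediction formulation concrete, the following is a minimal, hypothetical PyTorch sketch of a DETR-style two-encoder model with an anticipation decoder. It is not the authors' implementation: the feature dimensions, layer counts, number of queries, and the way the anticipation duration conditions the decoder queries are all illustrative assumptions.

```python
# Minimal sketch (not the authors' implementation) of a two-encoder,
# set-prediction anticipation model. Feature dimensions, layer counts,
# query count, and the duration conditioning are illustrative assumptions.
import torch
import torch.nn as nn

class AnticipationSketch(nn.Module):
    def __init__(self, feat_dim=2048, d_model=256, num_queries=20, num_classes=48):
        super().__init__()
        self.segment_proj = nn.Linear(feat_dim, d_model)
        self.video_proj = nn.Linear(feat_dim, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.segment_encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.video_encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=2)
        # One learned query per candidate future action instance.
        self.queries = nn.Embedding(num_queries, d_model)
        # Condition the queries on the requested anticipation duration.
        self.duration_proj = nn.Linear(1, d_model)
        self.class_head = nn.Linear(d_model, num_classes + 1)  # +1: "no action"
        self.time_head = nn.Linear(d_model, 2)  # normalized (start, end)

    def forward(self, segment_feats, video_feats, anticipation_duration):
        # segment_feats: (B, S, feat_dim); video_feats: (B, V, feat_dim)
        # anticipation_duration: (B,) duration in seconds
        seg = self.segment_encoder(self.segment_proj(segment_feats))
        vid = self.video_encoder(self.video_proj(video_feats))
        memory = torch.cat([seg, vid], dim=1)  # encoded observations
        dur = self.duration_proj(anticipation_duration.unsqueeze(-1))
        queries = self.queries.weight.unsqueeze(0) + dur.unsqueeze(1)
        h = self.decoder(queries, memory)
        # Every query yields one (label, start, end) prediction in one pass.
        return self.class_head(h), self.time_head(h).sigmoid()

# Example: batch of 2 videos with 8 segment tokens and 4 video-level tokens.
model = AnticipationSketch()
logits, times = model(torch.randn(2, 8, 2048), torch.randn(2, 4, 2048),
                      torch.tensor([30.0, 60.0]))
print(logits.shape, times.shape)  # torch.Size([2, 20, 49]) torch.Size([2, 20, 2])
```

Because every query is decoded in parallel, all future action instances over the anticipation duration are produced in a single forward pass rather than autoregressively.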
To summarize, this paper makes the following contributions: (1) a novel learn-
ing approach for long-term action anticipation that leverages segment-level rep-
resentations and video-level representations of the observed video, (2) a novel
transformer-based model that receives a video and an anticipation duration as in-
puts to predict future actions over the specified anticipation duration, (3) a direct
set prediction formulation that enables single-pass prediction of actions, and (4)
state-of-the-art performance on a diverse set of anticipation benchmarks: Break-
fast [32], 50Salads [60], Epic-Kitchens-55 [12], and EGTEA Gaze+ [34]. Code is
available at https://github.com/Nmegha2601/anticipatr.
Overall, our work highlights the benefits of learning representations that
capture different aspects of a video, and particularly demonstrates the value of
such representations for action anticipation.
2 Related Work
Action Anticipation. Action anticipation is generally described as the pre-
diction of actions before they occur. Prior research efforts have used various
formulations of this problem depending on three variables: (1) anticipation for-
mat, i.e., representation format of predicted actions, (2) anticipation duration,
i.e., duration over which actions are anticipated, and (3) model architectures.
Current approaches span a wide variety of anticipation formats involving
different representations of prediction outcomes. They range from pixel-level
representations such as frames or segmentations [8,36,38,43] and human tra-
jectories [6,13,23,27,30,42] to label-level representations such as action la-
bels [15,16,17,19,28,33,50,52,53,55,57,63,68,69] or temporal occurrences of
actions [5,18,35,41,44,61] through to semantic representations such as affor-
dances [31] and language descriptions of sub-activities [56]. We focus on the label-
level anticipation format and use ‘action anticipation’ to refer to this task.
Existing anticipation tasks can be grouped into two categories based on the
anticipation duration: (1) near-term action anticipation, and (2) long-term action
anticipation. In this paper, we focus on long-term action anticipation.
Near-term anticipation involves predicting the label of the immediate next
action, which would occur within a few seconds of observing a short video
segment that is itself only a few seconds long. Prior work proposes a variety of