Rethinking Learning Approaches for Long-Term
Action Anticipation
Megha Nawhal1, Akash Abdu Jyothi1, and Greg Mori1,2
1Simon Fraser University, Burnaby, Canada
2Borealis AI, Vancouver, Canada
Abstract. Action anticipation involves predicting future actions having observed the initial portion of a video. Typically, the observed video is processed as a whole to obtain a video-level representation of the ongoing activity in the video, which is then used for future prediction. We introduce Anticipatr, which performs long-term action anticipation by leveraging segment-level representations learned using individual segments from different activities, in addition to a video-level representation. We propose a two-stage learning approach to train a novel transformer-based model that uses these two types of representations to directly predict a set of future action instances over any given anticipation duration. Results on the Breakfast, 50Salads, Epic-Kitchens-55, and EGTEA Gaze+ datasets demonstrate the effectiveness of our approach.
Keywords: Action Anticipation; Transformer; Long-form videos
1 Introduction
The ability to envision future events is a crucial component of human intelligence
which helps in decision making during our interactions with the environment.
We are naturally capable of anticipating future events when interacting with the
environment in a wide variety of scenarios. Similarly, anticipation capabilities
are essential to practical AI systems that operate in complex environments and
interact with other agents or humans (e.g., wearable devices [59], human-robot
interaction systems [31], autonomous vehicles [40,66]).
Existing anticipation methods have made considerable progress on the task of near-term action anticipation [12,13,16,17,19,21,41,63], which involves predicting the immediate next action that would occur over the course of a few seconds. While near-term anticipation is a valuable step towards the goal of future prediction in AI systems, going beyond short time-horizon prediction has applicability in a broader range of tasks that involve long-term interactions with the environment. The ability to anticipate actions over long time horizons is imperative for applications such as efficient planning in robotic systems [11,18] and intelligent augmented reality systems.
In this paper, we focus on long-term action anticipation. Figure 1 illustrates the problem: having observed an initial portion of an untrimmed activity video, we predict what actions would occur when in the future.
Fig. 1. Long-Term Action Anticipation. Given the initial portion of an activity video $(0, \ldots, T_o)$ and an anticipation duration $T_a$, the task is to predict the actions that would occur from time $T_o + 1$ to $T_o + T_a$. Our proposed anticipation model receives the observed video and the anticipation duration as inputs and directly predicts a set of future action instances. Here, the action anticipation is long-term: both the observed duration $T_o$ and the anticipation duration $T_a$ are on the order of minutes.
Long-term anticipation methods [5,15,18,28,55] predict future actions based on the information in the observed video (i.e., an initial portion of an untrimmed activity video) that partially depicts the activity in the video. Current approaches rely on encoding the observed video (input) as a whole to obtain video-level representations to perform action anticipation.
We propose a novel approach that leverages segment-level and video-level representations for the task of long-term action anticipation. Consider the example in Figure 1. The video depicts the activity person making pasta spanning several minutes. This activity has segments with actions such as slice onion, put pesto, put courgette, and add cheese. One of these segments, such as put pesto, tends to co-occur with actions involving objects such as courgette, onion, or cheese in a specific order. However, other videos with a different activity, say, person making pizza, could potentially have a similar set and/or sequence of actions in a different kitchen scenario. As such, while a specific sequence of actions (i.e., segments of a video) helps denote an activity, an individual video segment (containing a single action) alone contains valuable information for predicting the future. Based on this intuition, we introduce an approach that leverages segment-level representations in conjunction with video-level representations for the task of long-term action anticipation. In so doing, our approach enables reasoning beyond the limited context of the input video sequence.
In this work, we propose Anticipatr, which consists of a two-stage learning approach employed to train a transformer-based model for long-term anticipation (see Fig. 2 for an overview). In the first stage, we train a segment encoder to learn segment-level representations. As we focus on action anticipation, we design this training task based on co-occurrences of actions. Specifically, we train the segment encoder to learn which future actions are likely to occur after a given segment. Intuitively, consider a video segment showing a pizza pan being moved towards a microwave. Irrespective of the ongoing activity in the video that contains this segment, it is easy to anticipate that certain actions such as open microwave, put pizza, and close microwave are more likely to follow than the actions wash spoon or close tap.
In the second stage, we utilize both the segment-level and video-level representations for long-term action anticipation. We design a transformer-based model that contains two encoders: (1) the segment encoder, to derive representations corresponding to segments in the observed video, and (2) a video encoder, to derive the video-level representations of the observed video. These encoded representations are then fed into an anticipation decoder that predicts actions that would occur in the future. Our model is designed to directly predict a set of future action instances, wherein each element of the set (i.e., an action instance) contains the start and end timestamps of the instance along with the action label. Using direct set prediction, our approach predicts the actions at all timestamps over a given anticipation duration in a single forward pass.
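The text above does not spell out how such a direct set predictor is supervised; as a point of reference, DETR-style set predictors (cited in Sec. 2 as inspiration for this work) are typically trained by bipartite matching between predicted and ground-truth instances. The sketch below is a hypothetical illustration of that matching step, using a classification cost plus a temporal L1 cost; the cost terms, weights, and function names are assumptions, not the loss used in this paper.

```python
# Hypothetical sketch of bipartite matching for set prediction (DETR-style).
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_predictions_to_targets(pred_probs, pred_times, gt_labels, gt_times,
                                 w_cls=1.0, w_time=1.0):
    """Match N predicted action instances to M ground-truth instances.

    pred_probs: (N, num_classes) class probabilities per predicted instance
    pred_times: (N, 2) predicted (start, end), normalized to [0, 1]
    gt_labels:  (M,) ground-truth class indices
    gt_times:   (M, 2) ground-truth (start, end), normalized to [0, 1]
    Returns matched (pred_idx, gt_idx) index arrays.
    """
    # Classification cost: negative probability assigned to the true class.
    cost_cls = -pred_probs[:, gt_labels]                                        # (N, M)
    # Temporal cost: L1 distance between predicted and true (start, end) pairs.
    cost_time = np.abs(pred_times[:, None, :] - gt_times[None, :, :]).sum(-1)   # (N, M)
    cost = w_cls * cost_cls + w_time * cost_time
    pred_idx, gt_idx = linear_sum_assignment(cost)   # Hungarian matching
    return pred_idx, gt_idx
```

Matched pairs would then receive classification and temporal regression losses, while unmatched predictions are pushed toward a "no action" class; again, this mirrors DETR-style training rather than necessarily the exact objective of this paper.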
To summarize, this paper makes the following contributions: (1) a novel learning approach for long-term action anticipation that leverages segment-level representations and video-level representations of the observed video, (2) a novel transformer-based model that receives a video and an anticipation duration as inputs to predict future actions over the specified anticipation duration, (3) a direct set prediction formulation that enables single-pass prediction of actions, and (4) state-of-the-art performance on a diverse set of anticipation benchmarks: Breakfast [32], 50Salads [60], Epic-Kitchens-55 [12], and EGTEA Gaze+ [34]. Code is available at https://github.com/Nmegha2601/anticipatr
Overall, our work highlights the benefits of learning representations that
capture different aspects of a video, and particularly demonstrates the value of
such representations for action anticipation.
2 Related Work
Action Anticipation. Action anticipation is generally described as the prediction of actions before they occur. Prior research efforts have used various formulations of this problem depending on three variables: (1) anticipation format, i.e., representation format of predicted actions, (2) anticipation duration, i.e., duration over which actions are anticipated, and (3) model architectures.
Current approaches span a wide variety of anticipation formats involving different representations of prediction outcomes. They range from pixel-level representations such as frames or segmentations [8,36,38,43] and human trajectories [6,13,23,27,30,42] to label-level representations such as action labels [15,16,17,19,28,33,50,52,53,55,57,63,68,69] or temporal occurrences of actions [5,18,35,41,44,61], through to semantic representations such as affordances [31] and language descriptions of sub-activities [56]. We focus on the label-level anticipation format and use ‘action anticipation’ to refer to this task.
Existing anticipation tasks can be grouped into two categories based on the
anticipation duration: (1) near-term action anticipation, and (2) long-term action
anticipation. In this paper, we focus on long-term action anticipation.
Near-term anticipation involves predicting the label of the immediate next action that would occur within a few seconds, having observed a short video segment of a few seconds' duration. Prior work proposes a variety of temporal modeling techniques to encode the observed segment, such as regression networks [63], reinforced encoder-decoder networks [19], TCNs [67], temporal segment networks [12], LSTMs [16,17,49], VAEs [44,65], and transformers [21].
Long-term anticipation involves predicting action labels over long time horizons in the range of several minutes, having observed an initial portion of a video (an observed duration of a few minutes). A popular formulation of this task involves predicting a sequence of action labels having observed an initial portion of the video. Prior approaches encode the observed video as a whole to obtain a video-level representation. Using these representations, these approaches either predict actions recursively over individual future time instants or use time as a conditional parameter to predict the action label for a given single time instant. The recursive methods [5,15,18,50,55] accumulate prediction error over time, resulting in inaccurate anticipation outcomes for scenarios with long anticipation durations. The time-conditioned method [28] employs skip-connection-based temporal models and aims to avoid error accumulation by directly predicting an action label for a specified future time instant in a single forward pass. However, this approach still requires multiple forward passes during inference, as the task involves predicting actions at all future time instants over a given anticipation duration. Additionally, the sparse skip connections used in [28] do not fully utilize the relations among the actions at intermediate future time instants while predicting the action at a given future time instant. In contrast to these approaches based on video-level representations, our approach leverages segment-level representations (learned using individual segments across different activities) in conjunction with video-level representations. Both these representations are utilized to directly predict action instances corresponding to actions at all time instants over a given anticipation duration in a single forward pass.
An alternate formulation of long-term anticipation proposed in [46] focuses on predicting a set of future action labels without inferring when they would occur. [46] extracts a graph representation of the video based on frame-level visual affordances and uses a graph convolutional network to encode this graph representation and predict a set of action labels. In contrast, our approach leverages both the segment-level and video-level representations of the input video and a transformer-based model to predict action instances, i.e., both action labels and their corresponding timestamps.
Other methods model uncertainty in predicting actions over long time horizons [4,48,50] or employ self-supervised learning [51].
Early action detection. The task of early action detection [24,39,54,58] involves recognizing an ongoing action in a video as early as possible given an initial portion of the video. Though the early action detection task is different from action anticipation (anticipation involves the prediction of actions before they begin), the two tasks share the inspiration of future prediction.
Transformers in computer vision. The transformer architecture [62], originally proposed for the machine translation task, has achieved state-of-the-art performance on many NLP tasks. In recent years, there has been a flurry of work on transformer architectures designed for high-level reasoning tasks on images and videos. Examples include object detection [9], image classification [14], spatio-temporal localization in videos [20], video instance segmentation [64], action recognition [7,70], action detection [47], multi-object tracking [45], next action anticipation [21], and human-object interaction detection [29,71]. DETR [9] is a transformer model for object detection, wherein the task is formulated as a set prediction problem. This work has since inspired transformer designs for similar vision tasks: video instance segmentation [64] and human-object interaction detection [71]. Inspired by these works, we propose a novel transformer architecture that uses two encoders to encode different representations derived from the input video and a decoder to predict the set of future action instances in a single pass. Our proposed decoder also receives the anticipation duration as an input parameter to control the duration over which actions are predicted.
3 Action Anticipation with Anticipatr
In this section, we first describe our formulation of long-term action anticipation
and then describe our approach.
Problem Formulation. Let $v_o$ be an observed video containing $T_o$ frames. Our goal is to predict the actions that occur from time $T_o + 1$ to $T_o + T_a$, where $T_a$ is the anticipation duration, i.e., the duration over which actions are predicted. Specifically, we predict a set $A = \{a_i = (c_i, t^i_s, t^i_e)\}$ containing future action instances. The $i$-th element denotes an action instance $a_i$ depicting action category $c_i$ occurring from time $t^i_s$ to $t^i_e$, where $T_o < t^i_s < t^i_e \leq T_o + T_a$. Here, $c_i \in \mathcal{C}$, where $\mathcal{C}$ is the set of action classes in the dataset.
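As a concrete reading of this formulation, the snippet below expresses the prediction target as a small data structure; the names (`ActionInstance`, `valid_prediction`) are illustrative and not part of the paper.

```python
# Minimal sketch of the prediction target defined above; names are illustrative.
from dataclasses import dataclass
from typing import List

@dataclass
class ActionInstance:
    c: int     # action category index, c in the class set C
    t_s: int   # start time, with T_o < t_s
    t_e: int   # end time, with t_s < t_e <= T_o + T_a

def valid_prediction(instances: List[ActionInstance], T_o: int, T_a: int) -> bool:
    """Check that every predicted instance lies inside the anticipation window."""
    return all(T_o < a.t_s < a.t_e <= T_o + T_a for a in instances)
```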
Intuitively, for action anticipation, the observed video as a whole helps provide a broad, video-level representation of the ongoing activity depicted in the video. However, the observed video is composed of several segments that individually also contain valuable information about future actions and provide an opportunity to capture the video with segment-level representations. Using this intuition, in this paper, we propose Anticipatr, which leverages these two types of representations of the observed video for the task of long-term anticipation.
Anticipatr employs a two-stage learning approach to train a transformer-based model that takes an observed video as input and produces a set of future action instances as output. See Fig. 2 for an overview. In the first stage, we train a segment encoder that receives a segment (a sequence of frames from a video) as input and predicts the set of action labels that would occur at any time in the future after the occurrence of the segment in the video. We refer to this stage as segment-level training (described in Sec. 3.1). As the segment encoder only operates over individual segments, it is unaware of the broader context of the activity induced by a specific sequence of segments in the observed video.
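To make this objective concrete, the following is a minimal sketch that treats segment-level training as multi-label prediction of every action class that appears after a segment, trained with binary cross-entropy over precomputed frame features. The layer sizes, the mean pooling, and the BCE loss are assumptions for illustration; the paper's exact design is the one described in Sec. 3.1, not this sketch.

```python
# Sketch of segment-level training as multi-label future-action prediction
# (PyTorch). Frame features are assumed precomputed; sizes are illustrative.
import torch
import torch.nn as nn

class SegmentEncoder(nn.Module):
    def __init__(self, feat_dim=2048, d_model=256, num_classes=48, num_layers=2):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=num_layers)
        self.cls_head = nn.Linear(d_model, num_classes)

    def forward(self, segment_feats):                 # (B, num_frames, feat_dim)
        x = self.encoder(self.proj(segment_feats))    # (B, num_frames, d_model)
        # Pool over frames, then score which actions occur at any future time.
        return self.cls_head(x.mean(dim=1))           # (B, num_classes) logits

# Training step: the target is a binary vector marking every action class that
# appears after this segment in its source video.
model = SegmentEncoder()
criterion = nn.BCEWithLogitsLoss()
segment_feats = torch.randn(4, 16, 2048)              # 4 segments, 16 frames each
future_action_targets = torch.randint(0, 2, (4, 48)).float()
loss = criterion(model(segment_feats), future_action_targets)
loss.backward()
```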
In the second stage, we train a video encoder and an anticipation decoder to be used along with the segment encoder for long-term action anticipation. The video encoder encodes the observed video into a video-level representation. The segment encoder (trained in the first stage) is fed a sequence of segments from the observed video as input to obtain a segment-level representation of the observed video.
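Based on the description so far, the sketch below shows one way the second stage could wire these components together: the video encoder and the pretrained segment encoder produce two representations, and an anticipation decoder driven by learned queries and conditioned on the anticipation duration turns them into per-query action labels and (start, end) timestamps. How the two representations are fused, the number of queries, and the conditioning mechanism are all assumptions made for illustration, not the paper's exact architecture.

```python
# Rough sketch of the second-stage model (PyTorch); details are assumptions.
import torch
import torch.nn as nn

class Anticipator(nn.Module):
    def __init__(self, d_model=256, num_queries=100, num_classes=48):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.video_encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=2)
        self.queries = nn.Embedding(num_queries, d_model)    # one per candidate instance
        self.duration_proj = nn.Linear(1, d_model)           # condition on T_a
        self.cls_head = nn.Linear(d_model, num_classes + 1)  # +1 for a "no action" label
        self.time_head = nn.Linear(d_model, 2)               # (start, end) in [0, 1]

    def forward(self, video_feats, segment_feats, anticipation_duration):
        # video_feats: (B, T_o, d_model); segment_feats: (B, S, d_model) from the
        # pretrained segment encoder; anticipation_duration: (B, 1).
        memory = torch.cat([self.video_encoder(video_feats), segment_feats], dim=1)
        q = self.queries.weight.unsqueeze(0).expand(video_feats.size(0), -1, -1)
        q = q + self.duration_proj(anticipation_duration).unsqueeze(1)
        h = self.decoder(q, memory)                           # (B, num_queries, d_model)
        return self.cls_head(h), self.time_head(h).sigmoid()  # labels and timestamps
```

In such a DETR-style design, all candidate instances are produced in one forward pass, and queries whose class head selects the "no action" label would simply be discarded at inference.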