In the second stage, we utilize both the segment-level and video-level rep-
resentations for long-term action anticipation. We design a transformer-based
model that contains two encoders: (1) a segment encoder to derive representa-
tions corresponding to segments in the observed video, and (2) a video encoder
to derive the video-level representations of the observed video. These encoded
representations are then fed into an anticipation decoder that predicts actions
that would occur in the future. Our model is designed to directly predict a set
of future action instances, wherein each element of the set (i.e., an action in-
stance) contains the start and end timestamps of the instance along with the
action label. Using direct set prediction, our approach predicts the actions at all
the timestamps over a given anticipation duration in a single forward pass.
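To make the set-prediction formulation concrete, the following is a minimal, hypothetical PyTorch sketch of a DETR-style two-encoder model with an anticipation decoder. It is not the authors' implementation: the feature dimensions, layer counts, number of queries, and the way the anticipation duration conditions the decoder queries are all illustrative assumptions.

```python
# Minimal sketch (not the authors' implementation) of a two-encoder,
# set-prediction anticipation model. Feature dimensions, layer counts,
# query count, and the duration conditioning are illustrative assumptions.
import torch
import torch.nn as nn

class AnticipationSketch(nn.Module):
    def __init__(self, feat_dim=2048, d_model=256, num_queries=20, num_classes=48):
        super().__init__()
        self.segment_proj = nn.Linear(feat_dim, d_model)
        self.video_proj = nn.Linear(feat_dim, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.segment_encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.video_encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=2)
        # One learned query per candidate future action instance.
        self.queries = nn.Embedding(num_queries, d_model)
        # Condition the queries on the requested anticipation duration.
        self.duration_proj = nn.Linear(1, d_model)
        self.class_head = nn.Linear(d_model, num_classes + 1)  # +1: "no action"
        self.time_head = nn.Linear(d_model, 2)  # normalized (start, end)

    def forward(self, segment_feats, video_feats, anticipation_duration):
        # segment_feats: (B, S, feat_dim); video_feats: (B, V, feat_dim)
        # anticipation_duration: (B,) duration in seconds
        seg = self.segment_encoder(self.segment_proj(segment_feats))
        vid = self.video_encoder(self.video_proj(video_feats))
        memory = torch.cat([seg, vid], dim=1)  # encoded observations
        dur = self.duration_proj(anticipation_duration.unsqueeze(-1))
        queries = self.queries.weight.unsqueeze(0) + dur.unsqueeze(1)
        h = self.decoder(queries, memory)
        # Every query yields one (label, start, end) prediction in one pass.
        return self.class_head(h), self.time_head(h).sigmoid()

# Example: batch of 2 videos with 8 segment tokens and 4 video-level tokens.
model = AnticipationSketch()
logits, times = model(torch.randn(2, 8, 2048), torch.randn(2, 4, 2048),
                      torch.tensor([30.0, 60.0]))
print(logits.shape, times.shape)  # torch.Size([2, 20, 49]) torch.Size([2, 20, 2])
```

Because every query is decoded in parallel, all future action instances over the anticipation duration are produced in a single forward pass rather than autoregressively.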
To summarize, this paper makes the following contributions: (1) a novel learn-
ing approach for long-term action anticipation that leverages segment-level rep-
resentations and video-level representations of the observed video, (2) a novel
transformer-based model that receives a video and an anticipation duration as in-
puts to predict future actions over the specified anticipation duration, (3) a direct
set prediction formulation that enables single-pass prediction of actions, and (4)
state-of-the-art performance on a diverse set of anticipation benchmarks: Break-
fast [32], 50Salads [60], Epic-Kitchens-55 [12], and EGTEA Gaze+ [34]. Code is
available at https://github.com/Nmegha2601/anticipatr.
Overall, our work highlights the benefits of learning representations that
capture different aspects of a video, and particularly demonstrates the value of
such representations for action anticipation.
2 Related Work
Action Anticipation. Action anticipation is generally described as the pre-
diction of actions before they occur. Prior research efforts have used various
formulations of this problem depending on three variables: (1) anticipation for-
mat, i.e., representation format of predicted actions, (2) anticipation duration,
i.e., duration over which actions are anticipated, and (3) model architectures.
Current approaches span a wide variety of anticipation formats involving
different representations of prediction outcomes. They range from pixel-level
representations such as frames or segmentations [8,36,38,43] and human tra-
jectories [6,13,23,27,30,42] to label-level representations such as action la-
bels [15,16,17,19,28,33,50,52,53,55,57,63,68,69] or temporal occurrences of
actions [5,18,35,41,44,61] through to semantic representations such as affor-
dances [31] and language descriptions of sub-activities [56]. We focus on the label-
level anticipation format and use ‘action anticipation’ to refer to this task.
Existing anticipation tasks can be grouped into two categories based on the
anticipation duration: (1) near-term action anticipation, and (2) long-term action
anticipation. In this paper, we focus on long-term action anticipation.
Near-term anticipation involves predicting the label of the immediate next
action, which would occur within a few seconds of observing a short video
segment that is itself only a few seconds long. Prior work proposes a variety of