video-to-text encoder-decoder, by substituting the
text encoder with the video encoder. However, this
relies on projecting the image and text features into a shared space, which requires a large amount of properly aligned text and corresponding images. Camporese et al. (2021) model label semantics with a hand-engineered deterministic label prior based on the global co-occurrence statistics of the action labels in the training data, which can be ineffective when the underlying joint action distribution is complex. In contrast, our work proposes a different approach to leveraging the text in the training data: we use language models to learn the complex underlying distribution of action sequences in the video and then distill this knowledge into a vision model to improve its performance.
Cross-modal Knowledge Distillation:
Thoker and Gall (2019) propose learning from RGB videos to recognize actions in another modality. Others have used cross-modal distillation for video retrieval tasks (Hu et al., 2020; Chen et al., 2020) and for text-to-speech (Wang et al., 2020). Most relevant to ours is a recent system that improves the language understanding of text models by transferring the knowledge of a multi-modal teacher, trained on a video-text dataset, into a student language model trained on a text dataset (Tang et al., 2021). In contrast, our proposed method for action anticipation transfers knowledge gained by a text-based teacher model into a vision-based student model.
Multimodal Models:
Due to the recent prevalence of multimodal data and applications (Lin et al., 2014; Sharma et al., 2018; Antol et al., 2015; Krishna et al., 2017; Ordonez et al., 2011; Abu Farha et al., 2018; Talmor et al., 2021; Afouras et al., 2018), there has been a plethora of recent work on multimodal transformers. One common approach to training these models is to learn a cross-modal representation in a shared space. Examples include learning to align text-image pairs for cross-modal retrieval (Radford et al., 2021; Wehrmann et al., 2020), grounded image representations (Liu et al., 2019a), and grounded text representations (Tan and Bansal, 2020; Li et al., 2019). Hu and Singh (2021) extend the idea to multi-task settings with multiple language-vision tasks. Tsimpoukelli et al. (2021) adapt a vision model to a frozen large LM to transfer its few-shot capability to a multimodal setting (vision and language). However, these methods rely on large-scale image-text aligned datasets for training the model, which may not always be available; e.g., the EGTEA-GAZE+ video dataset has only 10.3K labelled action sequences. In contrast, our distillation approach does not require any image-text alignment for the anticipation task.
3 Language-to-vision knowledge
distillation for action anticipation
The action anticipation task asks to predict the class label of a future action based on information from an observed video sequence. In this setting, the model has access to both the video and the annotated action segments (action text) at training time, but must make its inference using only the video sequence. The input to the prediction model is a sequence of video frames up until time step $t$: $X = (X_1, X_2, \dots, X_t)$, and the desired output of the model is the class label $Y$ of the action at time $t+\tau$, where $\tau$ is the anticipation time.
To learn an anticipation model, we assume there is training data of the following form: $D = \{(X^i, L^i, Y^i)\}_{i=1}^{n}$, where $X^i = (X^i_1, \dots, X^i_{t_i})$ is the $i$-th training video sequence, $Y^i$ is the class label of the future action at time $t_i + \tau$, and $L^i = (L^i_1, \dots, L^i_{k_i})$ is the sequence of action labels of the action segments in the video sequence $X^i$. Each human action can span multiple time steps, so the number of actions $k_i$ might be different from the number of video frames $t_i$.
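For concreteness, the structure of one training instance $(X^i, L^i, Y^i)$ can be sketched as follows; the field names and tensor shapes are illustrative assumptions, not the actual implementation.

```python
from dataclasses import dataclass
from typing import List

import torch


@dataclass
class AnticipationInstance:
    """One training instance (X^i, L^i, Y^i); names and shapes are illustrative."""
    frames: torch.Tensor       # X^i: observed video frames, shape (t_i, C, H, W)
    action_labels: List[int]   # L^i: k_i action-label ids of the observed segments
    future_action: int         # Y^i: class id of the action at time t_i + tau
```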
Our task is to learn a model $g$ that can predict the future action label based on the vision modality of the video sequence $X^i$ only. A common approach is to optimize the cross-entropy loss $\mathcal{L}$ between the model's predicted label $g(X^i)$ and the ground-truth label $Y^i$ of each training instance, i.e., to minimize $\sum_i \mathcal{L}(g(X^i), Y^i)$. Although the sequence of action labels $L^i$ is available in the training data, the semantics associated with these labels are not properly used by existing methods for training the anticipation model.
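As a point of reference, a minimal sketch of this standard supervised objective is given below, assuming the model $g$ maps a batch of frame sequences to class logits; the function and variable names are illustrative.

```python
import torch
import torch.nn.functional as F


def supervised_loss(g, frames, future_actions):
    """Standard objective: sum_i L(g(X^i), Y^i) with L the cross-entropy loss.

    g: vision-based anticipation model mapping frames to class logits.
    frames: batch of observed frame sequences, shape (B, T, C, H, W).
    future_actions: ground-truth future action class ids, shape (B,).
    """
    logits = g(frames)  # (B, num_classes)
    return F.cross_entropy(logits, future_actions)
```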
Here we propose to learn a text-based anticipation model $g_{\text{text}}$ and use it to supervise the training of the vision-based anticipation model $g$. This training approach utilizes the knowledge from the text domain, which is easier to learn than the vision-based knowledge, given the abundance of event sequences described in text corpora. Hereafter, we will refer to the language-based model as the teacher, and the vision-based model as the student.
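A common way to realize such teacher-to-student supervision is to add a distillation term that pulls the student's predictive distribution toward the teacher's. The sketch below illustrates this generic formulation under that assumption; it is not necessarily the exact objective used in this work.

```python
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Generic knowledge-distillation term (illustrative, not the paper's exact loss).

    Matches the student's softened class distribution to the text-based
    teacher's using KL divergence, scaled by temperature^2 as is common.
    """
    t = temperature
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (t * t)


def student_loss(student_logits, teacher_logits, future_actions, alpha=0.5):
    """Illustrative combined objective: supervised cross-entropy plus distillation."""
    ce = F.cross_entropy(student_logits, future_actions)
    kd = distillation_loss(student_logits, teacher_logits)
    return (1.0 - alpha) * ce + alpha * kd
```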