Text-Derived Knowledge Helps Vision: A Simple Cross-modal Distillation for Video-based Action Anticipation

Sayontan Ghosh¹, Tanvi Aggarwal¹, Minh Hoai¹, Niranjan Balasubramanian¹
¹Stony Brook University
{sagghosh, taggarwal, minhhoai, niranjan}@cs.stonybrook.edu
Abstract

Anticipating future actions in a video is useful for many autonomous and assistive technologies. Most prior action anticipation work treats this as a vision-modality problem, where the models learn the task information primarily from the video features in the action anticipation datasets. However, knowledge about action sequences can also be obtained from external textual data. In this work, we show how knowledge in pretrained language models can be adapted and distilled into vision-based action anticipation models. We show that a simple distillation technique can achieve effective knowledge transfer and provide consistent gains on a strong vision model (Anticipative Vision Transformer) for two action anticipation datasets (3.5% relative gain on EGTEA-GAZE+ and 7.2% relative gain on EPIC-KITCHENS-55), giving a new state-of-the-art result.¹
1 Introduction
Anticipating future actions in the video of an unfolding scenario is an important capability for many applications in augmented reality (Salamin et al., 2006; Azuma, 2004), robotics (Duarte et al., 2018; Schydlo et al., 2018), and autonomous driving (Chaabane et al., 2020; Suzuki et al., 2018). Anticipating what actions will likely happen in a scenario requires one to both recognize what has happened so far and use anticipative general knowledge about how action sequences tend to play out. Most models for this task use a pre-trained video encoder to extract information about what has happened so far in the scenario, and use a text-based decoder to predict what action is likely to happen in the future (Carion et al., 2020; Dessalene et al., 2021; Liu et al., 2020; Sener et al., 2020).
¹The models and code used are available at: https://github.com/StonyBrookNLP/action-anticipation-lmtovideo
Figure 1: A model learning action anticipation from only the vision modality (video frames) is essentially exposed to a very limited set of action sequences. Language models, which are pre-trained on large-scale text, can learn this distribution from the task data and from a much larger collection of domain-relevant text. We propose distilling this knowledge from text-modality models into a vision-modality model for the video action anticipation task.
However, when trained on the target video datasets, the generalization of the models depends on how well these video datasets cover the space of action sequence distributions. In other words, the knowledge that is learnt for predicting future actions is, in effect, limited to the information in the target video datasets, where obtaining large-scale coverage of action sequences is difficult.
Knowledge about action sequences can also be obtained from text resources at scale. Language models (e.g., BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019b)) are typically pre-trained on large collections of unlabeled texts with billions of tokens, where they acquire a wide variety of knowledge, including large-scale knowledge about common action sequences. For example, Table 1 illustrates how the pre-trained BERT is able to predict the next action in a sequence of actions extracted from a recipe video, in terms of its verb and its object.
Masked action sequence: Clean the board, takeout pan, wash the onion, clean the fish, cut the onion, heat the pan, pour oil in pan, [MASK] the fish.
BERT@top5: fry, cook, boil, wash, clean

Masked action sequence: Clean the board, takeout pan, wash the onion, clean the fish, cut the onion, heat the pan, pour oil in pan, fry [MASK].
BERT@top5: pan, fish, chicken, it, onion

Table 1: Given a sequence of actions extracted from a video, BERT@top5 shows the top-5 predictions made by a standard pre-trained BERT for the masked verb and object of the next action.
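As a concrete illustration of the capability shown in Table 1, the sketch below queries an off-the-shelf masked language model for its top-5 completions of the masked next-action slot; the checkpoint and prompt formatting are illustrative choices, not the exact setup used in our experiments.

```python
# Minimal sketch (assumed setup): query a pre-trained BERT for the masked
# verb of the next action, mirroring the first row of Table 1.
# Requires the HuggingFace `transformers` library.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

prompt = (
    "Clean the board. Takeout pan. Wash the onion. Clean the fish. "
    "Cut the onion. Heat the pan. Pour oil in pan. [MASK] the fish."
)

# top_k=5 returns the five most probable fillers for the masked slot.
for pred in fill_mask(prompt, top_k=5):
    print(f"{pred['token_str']:>10s}  score={pred['score']:.3f}")
```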
Also, it is easier to collect a much larger set of action sequences from text sources than from videos annotated with action segments. As illustrated in Figure 1, EPIC55, a video dataset of about 800GB, only has about 38K action sequences, whereas there are around 1M sequences in the text recipe dataset Recipe1M. Text-modality models can thus be exposed to a much larger variety of action sequences compared to video-modality anticipation models. However, because the task is defined only over the video inputs, there is a question of how one can transfer this knowledge.
In this work, we show that we can augment video-based anticipation models with this external text-derived knowledge. To this end, we propose a simple cross-modal distillation approach, where we distill the knowledge gained by a language model from the text modality of the data into a vision-modality model. We build a teacher using a pre-trained language model, which already carries general knowledge about action sequences. We adapt this teacher to the action sequences in the video domain by fine-tuning it for the action anticipation task. Then, we train a vision-modality student², which is now tasked with both predicting the target action label and matching the output probability distribution of the teacher.
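A minimal sketch of this kind of distillation objective is given below: a cross-entropy term on the ground-truth future action plus a KL term that matches the teacher's temperature-softened output distribution. The temperature T and mixing weight alpha are illustrative hyperparameters, not the settings used in our experiments.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, target, T=2.0, alpha=0.5):
    """Sketch of a standard distillation objective: hard-label cross-entropy
    plus KL divergence to the teacher's softened distribution.
    T and alpha are illustrative placeholders."""
    ce = F.cross_entropy(student_logits, target)
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # conventional temperature scaling of the soft-target term
    return alpha * ce + (1.0 - alpha) * kl
```

Here the student logits would come from the vision model applied to the observed video frames, and the teacher logits from the language model applied to the observed action-label sequence.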
There are two aspects of language models that can be adjusted further for improved distillation. First, while they may contain knowledge about a broad range of action sequences, we can focus them towards the specific action sequences in the target dataset. Second, the text-modality teacher can be further improved by pretraining on domain-relevant texts (e.g., cooking recipes), to further adapt it to the action sequences in the task domain.

²The task requires the anticipation model to make inference based on the vision modality (video frames) of the video.
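The second adjustment, domain-adaptive pretraining of the teacher, could be set up roughly as sketched below; the recipe corpus, checkpoint, and hyperparameters are placeholder assumptions rather than our exact configuration.

```python
# Sketch (assumed setup): continue masked-LM pretraining of the teacher on
# domain-relevant text (e.g., recipe steps) before task fine-tuning.
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Placeholder corpus; in practice this would be a large recipe-text collection.
recipe_texts = [
    "Heat the pan. Pour oil in the pan. Fry the fish.",
    "Wash the onion. Cut the onion. Add the onion to the pan.",
]
encodings = tokenizer(recipe_texts, truncation=True, max_length=128)
train_dataset = [{"input_ids": ids} for ids in encodings["input_ids"]]

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="teacher-domain-pretrained",
                           num_train_epochs=1,
                           per_device_train_batch_size=16),
    train_dataset=train_dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer,
                                                  mlm_probability=0.15),
)
trainer.train()
```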
Our empirical evaluation shows that this cross-modal training yields consistent improvements over a state-of-the-art Anticipative Vision Transformer model (Girdhar and Grauman, 2021) on two egocentric action anticipation datasets in the cooking domain. Adapting the teacher to the task domain by pretraining on domain-relevant texts yields further gains, and the gains are stable across different language models. Interestingly, our analysis shows that the language-model-based teacher can provide gains even when it is not necessarily better than the vision student, suggesting that distillation benefits can also come from the complementarity of knowledge, as in the case of the text modality.
In summary, we make the following contributions: (i) We show that a simple distillation scheme can effectively transfer text-derived knowledge about action sequences (i.e., knowledge external to the video datasets) to a vision-based action anticipation model. (ii) We show that text-derived knowledge about action sequences contains complementary information that is useful for the anticipation task, especially when the action label space is large. (iii) Using a strong action anticipation model as a student, we achieve new state-of-the-art results on two benchmark datasets.
2 Related Work
There has been a wide range of solutions for action anticipation, ranging from hierarchical representations (Lan et al., 2014) and unsupervised representation learning (Vondrick et al., 2016), to encoder-decoder frameworks that decode future actions at different time scales (Furnari and Farinella, 2019) and transformers trained on multiple auxiliary tasks (Girdhar and Grauman, 2021). However, these only use the vision-modality features of the observed video to train the model for the anticipation task. Our work aims to distill text-derived knowledge to improve action anticipation. Here we relate our work to others that have made use of (i) textual knowledge for related tasks, (ii) general knowledge distillation, and (iii) multimodal models, which also allow for integration of information from different modalities.
Textual Knowledge for Action Anticipation: Other works have also shown the utility of modeling the text modality. Sener and Yao (2019) transfer knowledge in a text-to-text encoder-decoder to a video-to-text encoder-decoder by substituting the text encoder with the video encoder. However, this relies on projecting the image and text features into a shared space, which requires a large amount of properly aligned text and corresponding images. Camporese et al. (2021) model label semantics with a hand-engineered, deterministic label prior based on the global co-occurrence statistics of the action labels in the training data, which can be ineffective when the underlying joint action distribution is complex. In contrast, our work proposes a different approach to leveraging the text in the training data: we use language models to learn the complex underlying distribution of action sequences in the video, and then distill this knowledge into a vision model to improve its performance.
Cross-modal Knowledge Distillation: Thoker and Gall (2019) propose learning from RGB videos to recognize actions in another modality. Others have used cross-modal distillation for video retrieval tasks (Hu et al., 2020; Chen et al., 2020) and for text-to-speech (Wang et al., 2020). Most relevant to ours is a recent system that improves the language understanding of text models by transferring the knowledge of a multimodal teacher, trained on a video-text dataset, into a student language model trained with a text dataset (Tang et al., 2021). In contrast, our proposed method for action anticipation transfers knowledge gained by a text-based teacher model into a vision-based student model.
Multimodal Models: Due to the recent prevalence of multimodal data and applications (Lin et al., 2014; Sharma et al., 2018; Antol et al., 2015; Krishna et al., 2017; Ordonez et al., 2011; Abu Farha et al., 2018; Talmor et al., 2021; Afouras et al., 2018), there has been a plethora of recent work on multimodal transformers. One commonly used approach to training these models is to learn a cross-modal representation in a shared space. Examples include learning to align text-image pairs for cross-modal retrieval (Radford et al., 2021; Wehrmann et al., 2020), grounded image representations (Liu et al., 2019a), and grounded text representations (Tan and Bansal, 2020; Li et al., 2019). Hu and Singh (2021) extend the idea to multi-task settings with multiple language-vision tasks. Tsimpoukelli et al. (2021) adapt a vision model to a frozen large LM to transfer its few-shot capability to a multimodal setting (vision and language). However, these methods rely on large-scale image-text aligned datasets for training the model, which may not always be available; e.g., the EGTEA-GAZE+ video dataset has only 10.3K labelled action sequences. In contrast, our distillation approach does not require any image-text alignment for the anticipation task.
3 Language-to-Vision Knowledge Distillation for Action Anticipation
The action anticipation task asks to predict the class label of a future action based on information from an observed video sequence. In this task setting, the model has access to both the video and the annotated action segments (action text) at training time, but needs to make the inference using only the video sequence. The input to the prediction model is a sequence of video frames up until time step $t$: $X = (X_1, X_2, \ldots, X_t)$, and the desired output of the model is the class label $Y$ of the action at time $t + \tau$, where $\tau$ is the anticipation time.
To learn an anticipation model, we assume there is training data of the following form: $D = \{(X^i, L^i, Y^i)\}_{i=1}^{n}$, where $X^i = (X^i_1, \ldots, X^i_{t_i})$ is the $i^{th}$ training video sequence, $Y^i$ is the class label of the future action at time $t_i + \tau$, and $L^i = (L^i_1, \ldots, L^i_{k_i})$ is the sequence of action labels of the action segments in the video sequence $X^i$. Each human action can span multiple time steps, so the number of actions $k_i$ might be different from the number of video frames $t_i$.
Our task is to learn a model $g$ that can predict the future action label based on the vision modality of the video sequence $X^i$ only. A common approach is to optimize a cross-entropy loss $\mathcal{L}$ between the model's predicted label $g(X^i)$ and the ground-truth label $Y^i$ of each training instance, i.e., to minimize $\sum_i \mathcal{L}(g(X^i), Y^i)$. Although the sequence of action labels $L^i$ is available in the training data, the semantics associated with these labels is not properly used by the existing methods for training the anticipation model.
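A minimal sketch of this common vision-only objective is shown below; `vision_model` is a placeholder for any video backbone (e.g., AVT) producing logits over the future-action classes, not our implementation.

```python
import torch.nn.functional as F

def vision_only_step(vision_model, optimizer, frames, target):
    """One training step minimizing L(g(X^i), Y^i) with cross-entropy.
    `frames` is a batch of observed clips and `target` holds the
    future-action class indices; the model is a placeholder backbone."""
    logits = vision_model(frames)           # g(X): (batch, num_classes)
    loss = F.cross_entropy(logits, target)  # L(g(X), Y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```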
Here we propose to learn a text-based anticipation model $g_{\text{text}}$ and use it to supervise the training of the vision-based anticipation model $g$. This training approach utilizes the knowledge from the text domain, which is easier to learn than the vision-based knowledge, given the abundance of event sequences described in text corpora. Hereafter, we will refer to the language-based model as the teacher, and the vision-based model as the student.
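One plausible way to instantiate the text-based teacher $g_{\text{text}}$ is sketched below: a pretrained encoder with a classification head over the action vocabulary, fed the concatenated observed action labels $L^i$ as plain text. The checkpoint, separator, and label-space size are illustrative assumptions, not the exact recipe used in our experiments.

```python
# Sketch (assumed setup): the teacher g_text as a pretrained LM with a
# classification head over the action vocabulary, fed the observed
# action-label sequence L^i as plain text.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

NUM_ACTION_CLASSES = 1000  # placeholder size of the action label space

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
teacher = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=NUM_ACTION_CLASSES)

def teacher_logits(observed_actions):
    """observed_actions: list of action strings, e.g.
    ['wash onion', 'cut onion', 'heat pan']. Returns logits over the
    future-action classes, usable as soft targets for the student."""
    text = " . ".join(observed_actions)
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return teacher(**inputs).logits  # shape: (1, NUM_ACTION_CLASSES)
```

In training, the teacher itself would first be fine-tuned on the $(L^i, Y^i)$ pairs before its output distribution is used to supervise the student.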