video-to-text encoder-decoder, by substituting the
text encoder with the video encoder. However, this
relies on projecting the image and text features into a shared space, which requires a large amount of properly aligned text and corresponding images. Camporese et al. (2021) model label semantics with a hand-engineered deterministic label prior based on the global co-occurrence statistics of the action labels in the training data, which can be ineffective when the underlying joint action distribution is complex. In contrast, our work proposes a different approach to leveraging the text in the training data: we use language models to learn the complex underlying distribution of action sequences in the video and then distill this knowledge into a vision model to improve its performance.
Cross-modal Knowledge Distillation:
Thoker and Gall (2019) propose learning from RGB videos to recognize actions in another modality. Others have used cross-modal distillation for video retrieval tasks (Hu et al., 2020; Chen et al., 2020) and for text-to-speech (Wang et al., 2020). Most relevant to ours is a recent system that improves the language understanding of text models by transferring the knowledge of a multi-modal teacher, trained on a video-text dataset, into a student language model trained on a text dataset (Tang et al., 2021). In contrast, our proposed method for action anticipation transfers knowledge gained by a text-based teacher model into a vision-based student model.
Multimodal Models:
Due to the recent prevalence of multimodal data and applications (Lin et al., 2014; Sharma et al., 2018; Antol et al., 2015; Krishna et al., 2017; Ordonez et al., 2011; Abu Farha et al., 2018; Talmor et al., 2021; Afouras et al., 2018), there has been a plethora of recent work on multimodal transformers. One common approach to training these models is to learn a cross-modal representation in a shared space. Examples include learning to align text-image pairs for cross-modal retrieval (Radford et al., 2021; Wehrmann et al., 2020), grounded image representations (Liu et al., 2019a), and grounded text representations (Tan and Bansal, 2020; Li et al., 2019). Hu and Singh (2021) extend the idea to multi-task settings with multiple language-vision tasks. Tsimpoukelli et al. (2021) adapt a vision model to a frozen large LM to transfer its few-shot capability to a multimodal setting (vision and language). However, these methods rely on large-scale image-text aligned datasets for training the model, which may not always be available; e.g., the EGTEA-GAZE+ video dataset has only 10.3K labelled action sequences. In contrast, our distillation approach does not require any image-text alignment for the anticipation task.
3 Language-to-vision knowledge
distillation for action anticipation
The action anticipation task asks to predict the class label of a future action based on information from an observed video sequence. In this setting, the model has access to both the video and the annotated action segments (action text) at training time, but must make its inference using only the video sequence. The input to the prediction model is a sequence of video frames up until time step $t$: $X = (X_1, X_2, \dots, X_t)$, and the desired output of the model is the class label $Y$ of the action at time $t+\tau$, where $\tau$ is the anticipation time.
To learn an anticipation model, we assume there is training data of the following form: $D = \{(X^i, L^i, Y^i)\}_{i=1}^{n}$, where $X^i = (X^i_1, \dots, X^i_{t_i})$ is the $i$-th training video sequence, $Y^i$ is the class label of the future action at time $t_i + \tau$, and $L^i = (L^i_1, \dots, L^i_{k_i})$ is the sequence of action labels of the action segments in the video sequence $X^i$. Each human action can span multiple time steps, so the number of actions $k_i$ might be different from the number of video frames $t_i$.
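For concreteness, the structure of one training instance $(X^i, L^i, Y^i)$ can be sketched as follows; the field names and tensor shapes are illustrative assumptions, not the actual implementation.

```python
from dataclasses import dataclass
from typing import List

import torch


@dataclass
class AnticipationInstance:
    """One training instance (X^i, L^i, Y^i); names and shapes are illustrative."""
    frames: torch.Tensor       # X^i: observed video frames, shape (t_i, C, H, W)
    action_labels: List[int]   # L^i: k_i action-label ids of the observed segments
    future_action: int         # Y^i: class id of the action at time t_i + tau
```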
Our task is to learn a model $g$ that can predict the future action label based on the vision modality of the video sequence $X^i$ only. A common approach is to optimize the cross-entropy loss $\mathcal{L}$ between the model's predicted label $g(X^i)$ and the ground-truth label $Y^i$ of each training instance, i.e., to minimize $\sum_i \mathcal{L}(g(X^i), Y^i)$. Although the sequence of action labels $L^i$ is available in the training data, the semantics associated with these labels are not properly used by existing methods for training the anticipation model.
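As a point of reference, a minimal sketch of this standard supervised objective is given below, assuming the model $g$ maps a batch of frame sequences to class logits; the function and variable names are illustrative.

```python
import torch
import torch.nn.functional as F


def supervised_loss(g, frames, future_actions):
    """Standard objective: sum_i L(g(X^i), Y^i) with L the cross-entropy loss.

    g: vision-based anticipation model mapping frames to class logits.
    frames: batch of observed frame sequences, shape (B, T, C, H, W).
    future_actions: ground-truth future action class ids, shape (B,).
    """
    logits = g(frames)  # (B, num_classes)
    return F.cross_entropy(logits, future_actions)
```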
Here we propose to learn a text-based anticipation model $g_{\text{text}}$ and use it to supervise the training of the vision-based anticipation model $g$. This training approach utilizes the knowledge from the text domain, which is easier to learn than the vision-based knowledge, given the abundance of event sequences described in text corpora. Hereafter, we will refer to the language-based model as the teacher, and the vision-based model as the student.
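A common way to realize such teacher-to-student supervision is to add a distillation term that pulls the student's predictive distribution toward the teacher's. The sketch below illustrates this generic formulation under that assumption; it is not necessarily the exact objective used in this work.

```python
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Generic knowledge-distillation term (illustrative, not the paper's exact loss).

    Matches the student's softened class distribution to the text-based
    teacher's using KL divergence, scaled by temperature^2 as is common.
    """
    t = temperature
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (t * t)


def student_loss(student_logits, teacher_logits, future_actions, alpha=0.5):
    """Illustrative combined objective: supervised cross-entropy plus distillation."""
    ce = F.cross_entropy(student_logits, future_actions)
    kd = distillation_loss(student_logits, teacher_logits)
    return (1.0 - alpha) * ce + alpha * kd
```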