C2KD: Cross-Lingual Cross-Modal Knowledge Distillation for Multilingual Text-Video Retrieval

Andrew Rouditchenko1, Yung-Sung Chuang1, Nina Shvetsova2, Samuel Thomas3,4, Rogerio Feris3,4, Brian Kingsbury3,4, Leonid Karlinsky3,4, David Harwath5, Hilde Kuehne2,4, James Glass1
1MIT, 2Goethe University Frankfurt, 3IBM Research AI, 4MIT-IBM Watson AI Lab, 5UT Austin
roudi@mit.edu
Abstract
Multilingual text-video retrieval methods have improved significantly in recent years, but the performance for other languages lags behind English. We propose a Cross-Lingual Cross-Modal Knowledge Distillation method to improve multilingual text-video retrieval. Inspired by the fact that English text-video retrieval outperforms other languages, we train a student model using input text in different languages to match the cross-modal predictions from teacher models using input text in English. We propose a cross entropy based objective which forces the distribution over the student's text-video similarity scores to be similar to those of the teacher models. We introduce a new multilingual video dataset, Multi-YouCook2, by translating the English captions in the YouCook2 video dataset into 8 other languages. Our method improves multilingual text-video retrieval performance on Multi-YouCook2 and several other datasets such as Multi-MSRVTT and VATEX. We also analyze the effectiveness of different multilingual text models as teachers. The code, models, and dataset are available at https://github.com/roudimit/c2kd.
1 Introduction
Text-video retrieval, or the task of searching for videos with text queries, is becoming increasingly important as more videos are uploaded to the internet. Currently, most methods developed for this task are trained and evaluated with English text. The focus of this work is to improve the performance of text-video retrieval on more languages.
Learning a multilingual multimodal embedding space (Huang et al., 2021; Akula et al., 2021) has been useful for multilingual text-video retrieval. Text in different languages and video are processed by separate encoders and projected into the shared embedding space, where text and video that are semantically related should be close together regardless of the language. During inference, text queries and candidate videos are projected into the embedding space, and videos are ranked according to the similarity scores between the text and video embeddings. These methods are trained with a cross-modal contrastive objective on video datasets with parallel text translations in multiple languages, which are often derived from the original captions in English using machine translation. They leverage recently available multilingual models pre-trained on many languages (Devlin et al., 2019; Conneau et al., 2020) to process text in different languages with only a single encoder.
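To make this inference step concrete, here is a minimal sketch of ranking candidate videos by cosine similarity between text and video embeddings (illustrative only, not the paper's implementation; the encoders are stand-ins and the 512-dimensional embeddings are an assumption):

```python
# Minimal sketch of embedding-based text-video retrieval at inference time.
import torch
import torch.nn.functional as F

def rank_videos(text_emb: torch.Tensor, video_emb: torch.Tensor) -> torch.Tensor:
    """Rank candidate videos for each text query by cosine similarity.

    text_emb:  (num_queries, dim) embeddings from the multilingual text encoder
    video_emb: (num_videos, dim) embeddings from the video encoder
    Returns indices of videos sorted from most to least similar per query.
    """
    # L2-normalize so the dot product equals cosine similarity.
    text_emb = F.normalize(text_emb, dim=-1)
    video_emb = F.normalize(video_emb, dim=-1)
    scores = text_emb @ video_emb.T              # (num_queries, num_videos)
    return scores.argsort(dim=-1, descending=True)

# Random embeddings stand in for encoder outputs.
queries = torch.randn(4, 512)    # e.g., captions in different languages
videos = torch.randn(100, 512)
ranking = rank_videos(queries, videos)  # ranking[i, 0] is the top video for query i
```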
While these methods have improved multilingual text-video retrieval, the performance for English is usually higher than for other languages. Two possible reasons are: (1) multilingual text translated from English often has errors; (2) the multilingual text models are pre-trained on large-scale text data, but there is more data for English than other languages.
To address the gap in performance between English and multilingual text-video retrieval, we propose C2KD: Cross-Lingual Cross-Modal Knowledge Distillation. Our method trains a student model to learn better multilingual text-video similarity scores by learning from the English text-video scores of multiple trained and frozen teachers. The student learns to pull together video and multilingual text embeddings by optimizing their text-video scores through the contrastive loss. We introduce a framework where several trained and frozen teachers simultaneously process the English translations of the student's inputs and predict English text-video scores. Further, we propose a cross entropy based objective between the student's multilingual text-video scores and the teachers' English text-video scores. This teaches the student to learn multilingual text-video scores which are more aligned with the English scores, thus improving the multilingual text-video retrieval performance.
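As a rough sketch of this distillation idea (illustrative only, not the paper's implementation; the exact objective, its weighting, and the handling of multiple teachers are defined in Section 3.3, and the temperature here is an assumption), the term can be written as a cross entropy between the teacher's and the student's row-wise score distributions:

```python
# Sketch of a cross-lingual cross-modal distillation term between score matrices.
import torch
import torch.nn.functional as F

def c2kd_style_loss(student_scores: torch.Tensor,
                    teacher_scores: torch.Tensor,
                    tau: float = 1.0) -> torch.Tensor:
    """Cross entropy between teacher and student text-video score distributions.

    student_scores: (B, B) similarities between multilingual text and video
    teacher_scores: (B, B) similarities between English text and the same video,
                    produced by a frozen teacher (no gradients flow through it)
    """
    teacher_probs = F.softmax(teacher_scores.detach() / tau, dim=-1)   # soft targets
    student_log_probs = F.log_softmax(student_scores / tau, dim=-1)
    # Row-wise cross entropy: -sum_j P_teacher(j) * log Q_student(j), averaged over the batch.
    return -(teacher_probs * student_log_probs).sum(dim=-1).mean()
```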
We applied our method to three existing multilingual text-video datasets: Multi-MSRVTT (Huang et al., 2021), VATEX (Wang et al., 2019), and RUDDER (Akula et al., 2021). Since these datasets are mainly focused on open-domain videos, we collected the Multi-YouCook2 dataset as an extension of the YouCook2 (Zhou et al., 2018) cooking video dataset to test the model in a domain which requires more fine-grained reasoning, such as understanding specific ingredients in recipes. Our results show that C2KD can improve the multilingual text-video retrieval performance on all datasets, despite the variety in languages, domains, and dataset sizes.
In summary, our contributions are: (1) We propose the C2KD method, which guides a student model to learn better multilingual text-video similarity scores by learning from the text-video scores of teachers using English text translations as input. (2) We propose a cross entropy based objective between the student and teacher text-video similarity scores to distill the cross-modal knowledge from the teachers. (3) We collected the Multi-YouCook2 dataset with parallel text translations in 9 languages for over 10k video clips. (4) Our method improves the multilingual text-video retrieval performance on four datasets. We conduct an analysis on the impact of different teachers to gain further insights. The code, models, and dataset are available at https://github.com/roudimit/c2kd.
2 Related Work
Multilingual Text-Video Retrieval. Recent work introduced methods and datasets to improve multilingual text-video retrieval. Multilingual multimodal pretraining (Huang et al., 2021) demonstrated text-video retrieval in 9 languages with a single model. They released the Multi-MSRVTT dataset by machine-translating the English text captions from the MSR-VTT video dataset (Xu et al., 2016) into 8 other languages. Their model is trained with a cross-modal contrastive objective to pull together the embeddings of parallel text translations and video inputs. In separate work, the RUDDER (Akula et al., 2021) dataset was introduced with captions in languages spoken in India. They propose to augment the text-video triplet loss with hard negatives, which improved performance in a low-resource setting. We observed that English text-video retrieval typically outperformed other languages, which motivated our approach.
Multilingual Learning. Multilingual text-video retrieval methods rely on pre-trained multilingual text encoders to handle many languages with a single model. mBERT (Devlin et al., 2019) and XLM-R (Conneau et al., 2020) learn multilingual representations through masked language modeling. LaBSE (Feng et al., 2022) is instead trained to maximize the similarity of translation pairs in a shared embedding space. In our experiments, we evaluated these different models and found LaBSE to be the best encoder for multilingual text-video retrieval.
Cross-Lingual & Cross-Modal Knowledge Distillation. Another approach for training a multilingual text model with good sentence embeddings is to distill the knowledge (Hinton et al., 2015) from a monolingual model. Distill Sentence BERT (Reimers and Gurevych, 2020) is initialized from XLM-R and trained to output multilingual embeddings similar to those of Sentence BERT (Reimers and Gurevych, 2019), using English translations as input. Our C2KD approach has a similar idea, but it incorporates visual context. We use English text as input to several cross-modal teachers, and train a student to output similar text-video similarity scores using text in other languages.
Of most relevance to our work, TeachText (Croitoru et al., 2021) introduced cross-modal knowledge distillation for English text-video retrieval. They use teacher retrieval models with various English text embeddings and train a student to output similar text-video similarity scores with a regression loss. Our approach has several major differences. First, our text and models are multilingual. Second, we enforce the teachers to use English input instead of using the same multilingual input as the students. Third, we use a cross entropy objective between the student and teacher text-video scores instead of using a regression loss, which is more effective since it considers the context of all of the text-video pairs in the batch. We compare our objective to theirs in Section 4.4.
Finally, some multilingual knowledge distillation methods were proposed for visual question answering based on images (Raj Khan et al., 2021; Gupta et al., 2022a).
Other Multilingual Video Datasets. Several multilingual video datasets are designed for other tasks, such as captioning (Wang et al., 2019; Su et al., 2021), sentiment analysis (Bagher Zadeh et al., 2020; Gupta et al., 2022b), moment detection (Lei et al., 2021), audio-visual speech recognition (Ephrat et al., 2018), and audio-video retrieval (Rouditchenko et al., 2021b). Instructional videos with captions from automatic speech recognition have been used for learning word embeddings (Sigurdsson et al., 2020) and visually-guided machine translation (Sanabria et al., 2018). However, the transcriptions often have errors and can be unrelated to the visuals. Our Multi-YouCook2 dataset contains captions which were originally written by human annotators in English (Zhou et al., 2018), which makes them visually relevant.

Figure 1: Overview of C2KD. A multilingual student model computes text-video similarity scores for a batch of video and text inputs, while teacher models process the same video and English translations. The student is trained with two objectives. $\mathcal{L}_{NCE}$ (described in Section 3.1) trains the model to have high text-video scores for text and video pairs using the cross entropy loss. $\mathcal{L}_{C2KD}$ (described in Section 3.3) distills the knowledge from the teacher English text-video scores using a cross entropy loss.
Concurrent Work. Madasu et al. (2023) propose a similar framework to improve multilingual text-video retrieval. However, their method uses knowledge transfer from multilingual text, while our method uses knowledge transfer from English text. They use separate encoders for English and multilingual text, while our final model uses a single encoder for all languages.
3 Method
3.1 Text-Video Contrastive Loss
We address the problem of learning multilingual text-video representations. For simplicity, we first describe the approach for learning with English text and then explain how to extend it to more languages. We consider a dataset $D_{en} = \{(t_i, v_i)\}_{i=1}^{N}$ of paired videos and English captions. The goal of text-video retrieval is to learn text and vision models, $f(\cdot)$ and $g(\cdot)$ respectively, which output embeddings that are similar to each other when the input text caption $t_i$ and video $v_i$ are semantically related (i.e., describing similar concepts), and have low similarity when they are unrelated. In this work, we use cosine similarity by L2-normalizing the outputs of $f(\cdot)$ and $g(\cdot)$ and taking the dot product.
The Noise-Contrastive Estimation (NCE) loss (Gutmann and Hyvärinen, 2010; Jozefowicz et al., 2016; Oord et al., 2018) has been commonly used to learn text-video representations (Sun et al., 2019; Rouditchenko et al., 2021a). Given a batch of $B$ text-video pairs, let $S$ be the text-video similarity matrix, with $S_{ij} = f(t_i)^\top g(v_j)$. With temperature $\tau$, the NCE loss is given as:

$$\mathcal{L}_{NCE} = -\sum_{i=1}^{B} \log \frac{\exp(S_{ii}/\tau)}{\sum_{k=1}^{B} \exp(S_{ik}/\tau)}. \qquad (1)$$
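For concreteness, Eq. (1) can be written as a short PyTorch-style sketch (illustrative only; the temperature value and averaging over the batch instead of summing are assumptions):

```python
# Sketch of the NCE loss in Eq. (1): the matching pair sits on the diagonal of the
# text-video similarity matrix, and each row is normalized with a softmax.
import torch
import torch.nn.functional as F

def nce_loss(text_emb: torch.Tensor, video_emb: torch.Tensor, tau: float = 0.05) -> torch.Tensor:
    """text_emb, video_emb: (B, dim) embeddings f(t_i) and g(v_i) for a batch of pairs."""
    text_emb = F.normalize(text_emb, dim=-1)       # L2-normalize so S_ij is a cosine similarity
    video_emb = F.normalize(video_emb, dim=-1)
    sim = text_emb @ video_emb.T                   # S_ij = f(t_i)^T g(v_j)
    log_probs = F.log_softmax(sim / tau, dim=-1)   # log of Eq. (2)
    return -log_probs.diagonal().mean()            # -log Q_{t_i}(v_i); Eq. (1) sums, averaging only rescales
```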
This can be interpreted as the cross entropy loss between the distribution over normalized text-video similarity scores in $S$ and the one-hot distribution. Specifically, let $Q_{t_i}(v_j)$ be the probability that video $v_j$ matches with text $t_i$:

$$Q_{t_i}(v_j) = \frac{\exp(S_{ij}/\tau)}{\sum_{k=1}^{B} \exp(S_{ik}/\tau)}. \qquad (2)$$

The target distribution, $P_{t_i}(v_j)$, is one-hot (since the correct match for text $t_i$ is video $v_i$):

$$P_{t_i}(v_j) = \begin{cases} 1, & \text{if } i = j \\ 0, & \text{otherwise.} \end{cases} \qquad (3)$$

Given the equation for cross entropy,

$$\mathcal{L}_{CE} = -\sum_{i=1}^{B} \sum_{j} P_{t_i}(v_j) \log Q_{t_i}(v_j), \qquad (4)$$

the NCE loss in Eq. (1) is exactly the cross entropy $\mathcal{L}_{CE}$ computed with the one-hot targets $P_{t_i}$.
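This equivalence is easy to verify numerically: with one-hot targets on the diagonal, standard cross entropy over the rows of $S$ reduces to the NCE loss. A small sanity-check sketch (the batch size and random scores are stand-ins):

```python
# Numerical check that Eq. (4) with one-hot targets P equals the NCE loss of Eq. (1).
import torch
import torch.nn.functional as F

B = 8
sim = torch.randn(B, B)        # stand-in for the (temperature-scaled) similarity matrix S
targets = torch.arange(B)      # the correct match for text t_i is video v_i, i.e., P is one-hot

ce = F.cross_entropy(sim, targets, reduction='sum')      # Eq. (4) with one-hot P
nce = -F.log_softmax(sim, dim=-1).diagonal().sum()       # Eq. (1)
assert torch.allclose(ce, nce)
```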