C2KD: Cross-Lingual Cross-Modal Knowledge Distillation for Multilingual Text-Video Retrieval

Andrew Rouditchenko1, Yung-Sung Chuang1, Nina Shvetsova2, Samuel Thomas3,4, Rogerio Feris3,4, Brian Kingsbury3,4, Leonid Karlinsky3,4, David Harwath5, Hilde Kuehne2,4, James Glass1
1MIT, 2Goethe University Frankfurt, 3IBM Research AI, 4MIT-IBM Watson AI Lab, 5UT Austin
roudi@mit.edu
Abstract
Multilingual text-video retrieval methods have improved significantly in recent years, but the performance for other languages lags behind English. We propose a Cross-Lingual Cross-Modal Knowledge Distillation method to improve multilingual text-video retrieval. Inspired by the fact that English text-video retrieval outperforms other languages, we train a student model using input text in different languages to match the cross-modal predictions from teacher models using input text in English. We propose a cross entropy based objective which forces the distribution over the student's text-video similarity scores to be similar to those of the teacher models. We introduce a new multilingual video dataset, Multi-YouCook2, by translating the English captions in the YouCook2 video dataset into 8 other languages. Our method improves multilingual text-video retrieval performance on Multi-YouCook2 and several other datasets such as Multi-MSRVTT and VATEX. We also analyze the effectiveness of different multilingual text models as teachers. The code, models, and dataset are available at https://github.com/roudimit/c2kd.
1 Introduction
Text-video retrieval, or the task of searching for videos with text queries, is becoming increasingly important as more videos are uploaded to the internet. Currently, most methods developed for this task are trained and evaluated with English text. The focus of this work is to improve the performance of text-video retrieval on more languages.
Learning a multilingual multimodal embedding space (Huang et al., 2021; Akula et al., 2021) has been useful for multilingual text-video retrieval. Text in different languages and video are processed by separate encoders and projected into the shared embedding space, where text and video that are semantically related should be close together regardless of the language. During inference, text queries and candidate videos are projected into the embedding space, and videos are ranked according to the similarity scores between the text and video embeddings. These methods are trained with a cross-modal contrastive objective on video datasets with parallel text translations in multiple languages, which are often derived from the original captions in English using machine translation. They leverage recently available multilingual models pre-trained on many languages (Devlin et al., 2019; Conneau et al., 2020) to process text in different languages with only a single encoder.
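To make this inference step concrete, here is a minimal sketch of ranking candidate videos by cosine similarity between text and video embeddings (illustrative only, not the paper's implementation; the encoders are stand-ins and the 512-dimensional embeddings are an assumption):

```python
# Minimal sketch of embedding-based text-video retrieval at inference time.
import torch
import torch.nn.functional as F

def rank_videos(text_emb: torch.Tensor, video_emb: torch.Tensor) -> torch.Tensor:
    """Rank candidate videos for each text query by cosine similarity.

    text_emb:  (num_queries, dim) embeddings from the multilingual text encoder
    video_emb: (num_videos, dim) embeddings from the video encoder
    Returns indices of videos sorted from most to least similar per query.
    """
    # L2-normalize so the dot product equals cosine similarity.
    text_emb = F.normalize(text_emb, dim=-1)
    video_emb = F.normalize(video_emb, dim=-1)
    scores = text_emb @ video_emb.T              # (num_queries, num_videos)
    return scores.argsort(dim=-1, descending=True)

# Random embeddings stand in for encoder outputs.
queries = torch.randn(4, 512)    # e.g., captions in different languages
videos = torch.randn(100, 512)
ranking = rank_videos(queries, videos)  # ranking[i, 0] is the top video for query i
```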
While these methods have improved multilingual text-video retrieval, the performance for English is usually higher than for other languages. Two possible reasons are: (1) multilingual text translated from English often has errors; (2) the multilingual text models are pre-trained on large-scale text data, but there is more data for English than other languages.
To address the gap in performance between English and multilingual text-video retrieval, we propose C2KD: Cross-Lingual Cross-Modal Knowledge Distillation. Our method trains a student model to learn better multilingual text-video similarity scores by learning from the English text-video scores of multiple trained and frozen teachers. The student learns to pull together video and multilingual text embeddings by optimizing their text-video scores through the contrastive loss. We introduce a framework where several trained and frozen teachers simultaneously process the English translations of the student's inputs and predict English text-video scores. Further, we propose a cross entropy based objective between the student's multilingual text-video scores and the teachers' English text-video scores. This teaches the student to learn multilingual text-video scores which are more aligned with the English scores, thus improving the multilingual text-video retrieval performance.
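As a rough sketch of this distillation idea (illustrative only, not the paper's implementation; the exact objective, its weighting, and the handling of multiple teachers are defined in Section 3.3, and the temperature here is an assumption), the term can be written as a cross entropy between the teacher's and the student's row-wise score distributions:

```python
# Sketch of a cross-lingual cross-modal distillation term between score matrices.
import torch
import torch.nn.functional as F

def c2kd_style_loss(student_scores: torch.Tensor,
                    teacher_scores: torch.Tensor,
                    tau: float = 1.0) -> torch.Tensor:
    """Cross entropy between teacher and student text-video score distributions.

    student_scores: (B, B) similarities between multilingual text and video
    teacher_scores: (B, B) similarities between English text and the same video,
                    produced by a frozen teacher (no gradients flow through it)
    """
    teacher_probs = F.softmax(teacher_scores.detach() / tau, dim=-1)   # soft targets
    student_log_probs = F.log_softmax(student_scores / tau, dim=-1)
    # Row-wise cross entropy: -sum_j P_teacher(j) * log Q_student(j), averaged over the batch.
    return -(teacher_probs * student_log_probs).sum(dim=-1).mean()
```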
We applied our method to three existing multilingual text-video datasets: Multi-MSRVTT (Huang et al., 2021), VATEX (Wang et al., 2019), and RUDDER (Akula et al., 2021). Since these datasets are mainly focused on open-domain videos, we collected the Multi-YouCook2 dataset as an extension of the YouCook2 (Zhou et al., 2018) cooking video dataset to test the model in a domain which requires more fine-grained reasoning, such as understanding specific ingredients in recipes. Our results show that C2KD can improve the multilingual text-video retrieval performance on all datasets, despite the variety in languages, domains, and dataset sizes.
In summary, our contributions are: (1) We propose the C2KD method, which guides a student model to learn better multilingual text-video similarity scores by learning from the text-video scores of teachers using English text translations as input. (2) We propose a cross entropy based objective between the student and teacher text-video similarity scores to distill the cross-modal knowledge from the teachers. (3) We collected the Multi-YouCook2 dataset with parallel text translations in 9 languages for over 10k video clips. (4) Our method improves the multilingual text-video retrieval performance on four datasets. We conduct an analysis on the impact of different teachers to gain further insights. The code, models, and dataset are available at https://github.com/roudimit/c2kd.
2 Related Work
Multilingual Text-Video Retrieval. Recent work introduced methods and datasets to improve multilingual text-video retrieval. Multilingual multimodal pretraining (Huang et al., 2021) demonstrated text-video retrieval in 9 languages with a single model. They released the Multi-MSRVTT dataset by machine-translating the English text captions from the MSR-VTT video dataset (Xu et al., 2016) into 8 other languages. Their model is trained with a cross-modal contrastive objective to pull together the embeddings of parallel text translations and video inputs. In separate work, the RUDDER (Akula et al., 2021) dataset was introduced with captions in languages spoken in India. They propose to augment the text-video triplet loss with hard negatives, which improved performance in a low-resource setting. We observed that English text-video retrieval typically outperformed other languages, which motivated our approach.
Multilingual Learning. Multilingual text-video retrieval methods rely on pre-trained multilingual text encoders to handle many languages with a single model. mBERT (Devlin et al., 2019) and XLM-R (Conneau et al., 2020) learn multilingual representations through masked language modeling. LaBSE (Feng et al., 2022) is instead trained to maximize the similarity of translation pairs in a shared embedding space. In our experiments, we evaluated these different models and found LaBSE to be the best encoder for multilingual text-video retrieval.
Cross-Lingual & Cross-Modal Knowledge Distillation. Another approach for training a multilingual text model with good sentence embeddings is to distill the knowledge (Hinton et al., 2015) from a monolingual model. Distill Sentence BERT (Reimers and Gurevych, 2020) is initialized from XLM-R and trained to output multilingual embeddings similar to those of Sentence BERT (Reimers and Gurevych, 2019), using English translations as input. Our C2KD approach has a similar idea, but it incorporates visual context. We use English text as input to several cross-modal teachers, and train a student to output similar text-video similarity scores using text in other languages.
Of most relevance to our work, TeachText (Croitoru et al., 2021) introduced cross-modal knowledge distillation for English text-video retrieval. They use teacher retrieval models with various English text embeddings and train a student to output similar text-video similarity scores with a regression loss. Our approach has several major differences. First, our text and models are multilingual. Second, we enforce the teachers to use English input instead of using the same multilingual input as the students. Third, we use a cross entropy objective between the student and teacher text-video scores instead of using a regression loss, which is more effective since it considers the context of all of the text-video pairs in the batch. We compare our objective to theirs in Section 4.4.
Finally, some multilingual knowledge distillation methods were proposed for visual question answering based on images (Raj Khan et al., 2021; Gupta et al., 2022a).
Other Multilingual Video Datasets. Several multilingual video datasets are designed for other tasks, such as captioning (Wang et al., 2019; Su et al., 2021), sentiment analysis (Bagher Zadeh et al., 2020; Gupta et al., 2022b), moment detection (Lei et al., 2021), audio-visual speech recognition (Ephrat et al., 2018), and audio-video retrieval (Rouditchenko et al., 2021b). Instructional videos with captions from automatic speech recognition have been used for learning word embeddings (Sigurdsson et al., 2020) and visually-guided machine translation (Sanabria et al., 2018). However, the transcriptions often have errors and can be unrelated to the visuals. Our Multi-YouCook2 dataset contains captions which were originally written by human annotators in English (Zhou et al., 2018), which makes them visually relevant.

Figure 1: Overview of C2KD. A multilingual student model computes text-video similarity scores for a batch of video and text inputs, while teacher models process the same video and English translations. The student is trained with two objectives. $\mathcal{L}_{NCE}$ (described in Section 3.1) trains the model to have high text-video scores for text and video pairs using the cross entropy loss. $\mathcal{L}_{C2KD}$ (described in Section 3.3) distills the knowledge from the teacher English text-video scores using a cross entropy loss.
Concurrent Work. Madasu et al. (2023) propose a similar framework to improve multilingual text-video retrieval. However, their method uses knowledge transfer from multilingual text, while our method uses knowledge transfer from English text. They use separate encoders for English and multilingual text, while our final model uses a single encoder for all languages.
3 Method
3.1 Text-Video Contrastive Loss
We address the problem of learning multilingual text-video representations. For simplicity, we first describe the approach for learning with English text and then explain how to extend it to more languages. We consider a dataset $D_{en} = \{(t_i, v_i)\}_{i=1}^{N}$ of paired videos and English captions. The goal of text-video retrieval is to learn text and vision models, $f(\cdot)$ and $g(\cdot)$ respectively, which output embeddings that are similar to each other when the input text caption $t_i$ and video $v_i$ are semantically related (i.e., describing similar concepts), and have low similarity when they are unrelated. In this work, we use cosine similarity by L2-normalizing the outputs of $f(\cdot)$ and $g(\cdot)$ and taking the dot product.
The Noise-Contrastive Estimation (NCE) loss (Gutmann and Hyvärinen, 2010; Jozefowicz et al., 2016; Oord et al., 2018) has been commonly used to learn text-video representations (Sun et al., 2019; Rouditchenko et al., 2021a). Given a batch of $B$ text-video pairs, let $S$ be the text-video similarity matrix, with $S_{ij} = f(t_i)^\top g(v_j)$. With temperature $\tau$, the NCE loss is given as:

$$\mathcal{L}_{NCE} = -\sum_{i=1}^{B} \log \frac{\exp(S_{ii}/\tau)}{\sum_{k=1}^{B} \exp(S_{ik}/\tau)}. \qquad (1)$$
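For concreteness, Eq. (1) can be written as a short PyTorch-style sketch (illustrative only; the temperature value and averaging over the batch instead of summing are assumptions):

```python
# Sketch of the NCE loss in Eq. (1): the matching pair sits on the diagonal of the
# text-video similarity matrix, and each row is normalized with a softmax.
import torch
import torch.nn.functional as F

def nce_loss(text_emb: torch.Tensor, video_emb: torch.Tensor, tau: float = 0.05) -> torch.Tensor:
    """text_emb, video_emb: (B, dim) embeddings f(t_i) and g(v_i) for a batch of pairs."""
    text_emb = F.normalize(text_emb, dim=-1)       # L2-normalize so S_ij is a cosine similarity
    video_emb = F.normalize(video_emb, dim=-1)
    sim = text_emb @ video_emb.T                   # S_ij = f(t_i)^T g(v_j)
    log_probs = F.log_softmax(sim / tau, dim=-1)   # log of Eq. (2)
    return -log_probs.diagonal().mean()            # -log Q_{t_i}(v_i); Eq. (1) sums, averaging only rescales
```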
This can be interpreted as the cross entropy loss between the distribution over normalized text-video similarity scores in $S$ and the one-hot distribution. Specifically, let $Q_{t_i}(v_j)$ be the probability that video $v_j$ matches with text $t_i$:

$$Q_{t_i}(v_j) = \frac{\exp(S_{ij}/\tau)}{\sum_{k=1}^{B} \exp(S_{ik}/\tau)}. \qquad (2)$$

The target distribution, $P_{t_i}(v_j)$, is one-hot (since the correct match for text $t_i$ is video $v_i$):

$$P_{t_i}(v_j) = \begin{cases} 1, & \text{if } i = j \\ 0, & \text{otherwise.} \end{cases} \qquad (3)$$

Given the equation for cross entropy,

$$\mathcal{L}_{CE} = -\sum_{i=1}^{B} \sum_{j} P_{t_i}(v_j) \log Q_{t_i}(v_j), \qquad (4)$$

the NCE loss in Eq. (1) is exactly the cross entropy $\mathcal{L}_{CE}$ computed with the one-hot targets $P_{t_i}$.
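This equivalence is easy to verify numerically: with one-hot targets on the diagonal, standard cross entropy over the rows of $S$ reduces to the NCE loss. A small sanity-check sketch (the batch size and random scores are stand-ins):

```python
# Numerical check that Eq. (4) with one-hot targets P equals the NCE loss of Eq. (1).
import torch
import torch.nn.functional as F

B = 8
sim = torch.randn(B, B)        # stand-in for the (temperature-scaled) similarity matrix S
targets = torch.arange(B)      # the correct match for text t_i is video v_i, i.e., P is one-hot

ce = F.cross_entropy(sim, targets, reduction='sum')      # Eq. (4) with one-hot P
nce = -F.log_softmax(sim, dim=-1).diagonal().sum()       # Eq. (1)
assert torch.allclose(ce, nce)
```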