gual text-video datasets: Multi-MSRVTT (Huang et al., 2021), VATEX (Wang et al., 2019), and RUDDER (Akula et al., 2021). Since these datasets mainly focus on open-domain videos, we collected the Multi-YouCook2 dataset as an extension of the YouCook2 (Zhou et al., 2018) cooking video dataset to test the model in a domain that requires more fine-grained reasoning, such as understanding specific ingredients in recipes. Our results show that C2KD improves multilingual text-video retrieval performance on all four datasets, despite the variety in languages, domains, and dataset sizes.
In summary, our contributions are: (1) We propose the C2KD method, which guides a student model to learn better multilingual text-video similarity scores by learning from the text-video scores of teachers that take English text translations as input. (2) We propose a cross-entropy-based objective between the student and teacher text-video similarity scores to distill the cross-modal knowledge from the teachers. (3) We collected the Multi-YouCook2 dataset with parallel text translations in 9 languages for over 10k video clips. (4) Our method improves multilingual text-video retrieval performance on four datasets, and we analyze the impact of different teachers to gain further insights. The code, models, and dataset are available at https://github.com/roudimit/c2kd.
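To make contribution (1) concrete, the following is a minimal sketch of how the student and teacher text-video similarity matrices could be produced for a batch. The embedding dimension, temperature value, and variable names are illustrative assumptions rather than our released implementation; the objective that aligns the two score matrices is sketched later, in the discussion of TeachText.

```python
import torch
import torch.nn.functional as F

def similarity_matrix(text_emb, video_emb, temperature=0.05):
    """Cosine similarity between every text and every video in the batch."""
    text_emb = F.normalize(text_emb, dim=-1)
    video_emb = F.normalize(video_emb, dim=-1)
    return text_emb @ video_emb.t() / temperature  # shape: [batch, batch]

# Illustrative batch: the student encodes the multilingual captions,
# the teacher encodes their English translations, and both share the videos.
batch, dim = 8, 512
student_text = torch.randn(batch, dim)  # multilingual caption embeddings (student)
teacher_text = torch.randn(batch, dim)  # English translation embeddings (teacher)
video_emb = torch.randn(batch, dim)     # video embeddings from a visual encoder

student_scores = similarity_matrix(student_text, video_emb)
teacher_scores = similarity_matrix(teacher_text, video_emb)
```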
2 Related Work
Multilingual Text-Video Retrieval. Recent work introduced methods and datasets to improve multilingual text-video retrieval. Multilingual multimodal pretraining (Huang et al., 2021) demonstrated text-video retrieval in 9 languages with a single model. They released the Multi-MSRVTT dataset by machine-translating the English text captions from the MSR-VTT video dataset (Xu et al., 2016) into 8 other languages. Their model is trained with a cross-modal contrastive objective that pulls together the embeddings of parallel text translations and video inputs. In separate work, the RUDDER (Akula et al., 2021) dataset was introduced with captions in languages spoken in India. The authors augment the text-video triplet loss with hard negatives, which improves performance in a low-resource setting. We observed that English text-video retrieval typically outperformed retrieval in other languages, which motivated our approach.
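For illustration, such a cross-modal contrastive objective can be written as a symmetric InfoNCE loss in which matched text-video pairs on the diagonal are positives. This is a simplified sketch covering only the text-video term (the full objective of Huang et al. (2021) also pulls together parallel translations), and the temperature value is an assumption:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(text_emb, video_emb, temperature=0.05):
    """Symmetric InfoNCE: matched (text, video) pairs on the diagonal
    are positives; all other pairs in the batch serve as negatives."""
    text_emb = F.normalize(text_emb, dim=-1)
    video_emb = F.normalize(video_emb, dim=-1)
    logits = text_emb @ video_emb.t() / temperature
    targets = torch.arange(len(logits), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```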
Multilingual Learning. Multilingual text-video retrieval methods rely on pre-trained multilingual text encoders to handle many languages with a single model. mBERT (Devlin et al., 2019) and XLM-R (Conneau et al., 2020) learn multilingual representations through masked language modeling. LaBSE (Feng et al., 2022) is instead trained to maximize the similarity of translation pairs in a shared embedding space. In our experiments, we evaluated these encoders and found LaBSE to be the best for multilingual text-video retrieval.
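As an illustration of why LaBSE suits this setting, the snippet below scores a translation pair with a publicly released LaBSE checkpoint via the sentence-transformers library; the example sentences are hypothetical, and the checkpoint identifier reflects the commonly distributed release:

```python
from sentence_transformers import SentenceTransformer, util

# Public LaBSE checkpoint on the Hugging Face hub.
model = SentenceTransformer("sentence-transformers/LaBSE")

# Parallel translations map close together in the shared embedding space.
emb = model.encode(["A chef dices an onion.",
                    "Un chef coupe un oignon en dés."])
print(util.cos_sim(emb[0], emb[1]))
```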
Cross-Lingual & Cross-Modal Knowledge Distillation. Another approach for training a multilingual text model with good sentence embeddings is to distill the knowledge (Hinton et al., 2015) from a monolingual model. Distill Sentence BERT (Reimers and Gurevych, 2020) is initialized from XLM-R and trained to output multilingual embeddings similar to those of Sentence BERT (Reimers and Gurevych, 2019), using English translations as input. Our C2KD approach follows a similar idea but incorporates visual context: we use English text as input to several cross-modal teachers and train a student to output similar text-video similarity scores given text in other languages.
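A minimal sketch of this embedding-level distillation, assuming a mean-squared-error objective that pushes the student's embeddings of both the English sentence and its translation toward the teacher's English embedding (a simplification of the objective in Reimers and Gurevych (2020); student and teacher stand for encoders mapping a sentence batch to embeddings):

```python
import torch
import torch.nn.functional as F

def embedding_distill_loss(student, teacher, english, translation):
    """The multilingual student mimics the monolingual teacher's English
    embedding for both the English sentence and its translation."""
    with torch.no_grad():
        target = teacher(english)  # frozen monolingual teacher, English input
    return (F.mse_loss(student(english), target) +
            F.mse_loss(student(translation), target))
```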
Most relevant to our work, TeachText (Croitoru et al., 2021) introduced cross-modal knowledge distillation for English text-video retrieval. They use teacher retrieval models with various English text embeddings and train a student to output similar text-video similarity scores with a regression loss. Our approach differs in several major ways. First, our text and models are multilingual. Second, we require the teachers to use English input rather than the same multilingual input as the student. Third, we use a cross-entropy objective between the student and teacher text-video scores instead of a regression loss, which is more effective since it considers the context of all text-video pairs in the batch. We compare our objective to theirs in Section 4.4.
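To make the contrast concrete, the sketch below compares a TeachText-style regression objective with our cross-entropy objective, both applied to the batch text-video score matrices (e.g., the student_scores and teacher_scores computed in the sketch in the introduction). The softmax over each row is what lets every pair in the batch act as context; the exact normalization details here are simplifying assumptions:

```python
import torch.nn.functional as F

def regression_distill(student_scores, teacher_scores):
    """TeachText-style: regress each student score toward the teacher's,
    treating every text-video pair independently."""
    return F.mse_loss(student_scores, teacher_scores)

def cross_entropy_distill(student_scores, teacher_scores):
    """Our objective (sketch): soften the teacher's scores over the batch and
    match the student's distribution, so the loss sees all pairs jointly."""
    teacher_probs = F.softmax(teacher_scores, dim=-1)
    log_student = F.log_softmax(student_scores, dim=-1)
    return -(teacher_probs * log_student).sum(dim=-1).mean()
```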
Finally, some multilingual knowledge distillation methods have been proposed for visual question answering on images (Raj Khan et al., 2021; Gupta et al., 2022a).
Other Multilingual Video Datasets. Several multilingual video datasets are designed for other tasks, such as captioning (Wang et al., 2019; Su et al., 2021), sentiment analysis (Bagher Zadeh et al., 2020; Gupta et al., 2022b), moment detection (Lei et al., 2021), audio-visual speech recog-