
caption pairs from the web. Having access to thou-
sands of data points in a target language might
indeed be necessary to improve cross-lingual per-
formance in downstream tasks (Bugliarello et al.,
2022). As such, translating fine-tuning data into
multiple languages may be a compelling approach
towards downstream task success. Moreover, if this
can be achieved through machine translated text,
it raises the question of whether we can also pre-
train on many millions of multilingual translated
examples. Motivated by the initial experiments of
Zhou et al. (2021), we test this hypothesis further,
on more languages and more tasks, reporting more
nuanced results from large-scale translated text.
Overall, we show that machine translation can
provide inexpensive and impressive improvements
when fine-tuning models for multilingual multi-
modal tasks. Moreover, translation-based pretrain-
ing leads to significant gains in zero-shot cross-
lingual transfer over existing approaches. How-
ever, we find mixed results when combining this
with multilingual fine-tuning. There are still op-
portunities to realise further benefits from machine
translated text, which may be found through more
compute-intensive pretraining.
Contributions. 1) We present the TD-MML
framework to narrow the gap between English and
non-English languages in multimodal research.
2) In the process of translation-based pretraining,
we present a reliable strategy to filter out bad
translations. 3) We conduct systematic evaluations
in zero-shot and machine translated scenarios, and
show the benefits that can be gained from simply
having more data in the target languages.
2 Related Work
Inspired by the success of self-supervised language
model pretraining (Devlin et al., 2019, inter alia),
researchers have also explored this paradigm with
multimodal models (Gella et al., 2017; Ákos Kádár
et al., 2018). The first wave of such models (Li et al.,
2019; Tan and Bansal, 2019; Li et al., 2020; Chen et al.,
2020) were initialised from BERT and pretrained
on English image–text datasets like Conceptual
Captions (Sharma et al., 2018) and COCO (Lin
et al., 2014), where the visual modality was repre-
sented using feature vectors extracted from 10–100
automatically detected object proposals (Anderson
et al., 2018). More recent models (Kim et al., 2021;
Li et al., 2021; Singh et al., 2022) represent the
visual modality using a Vision Transformer (Doso-
vitskiy et al., 2021), which can be end-to-end fine-
tuned during pretraining, as opposed to working
with pre-extracted object proposals.
More related to our work are the multilingual
variants of these models (Liu et al., 2021; Zhou
et al., 2021; Ni et al., 2021; Jain et al., 2021).
The lack of large-scale multilingual multimodal
datasets has resulted in different strategies to train
such models. Liu et al. (2021) simply augment
English caption data with text-only multilingual
Wikipedia data. In addition to this, Ni et al. (2021)
further create code-switched multimodal data² by
randomly swapping English words in Conceptual
Captions with the corresponding translation in one
of 50 other languages, obtained through the PanLex
dictionary. On the other hand, Zhou et al. (2021)
machine translate the Conceptual Captions dataset
into German, French, Czech, Japanese, and Chi-
nese, for a total of 19.8M pretraining data points.
Finally, Jain et al. (2021) pretrain on 3.6B multi-
lingual captions by extending the Conceptual Cap-
tions collection pipeline to multiple languages.³
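To make the code-switching strategy of Ni et al. (2021) concrete, the sketch below shows one way such data could be generated. The toy dictionary, swap probability, and function name are illustrative assumptions rather than the authors' implementation, which draws translations for 50 languages from the PanLex dictionary.

```python
import random

# Toy English-to-French lexicon; Ni et al. (2021) instead draw translations
# from the PanLex dictionary across 50 languages.
EN_FR = {"dog": "chien", "is": "est", "ball": "balle"}

def code_switch(caption, lexicon, swap_prob=0.5, seed=None):
    """Randomly replace English tokens that have a dictionary entry.

    Tokens without an entry (or not selected for swapping) stay in
    English, producing mixed-language captions such as
    "The chien est chasing a ball".
    """
    rng = random.Random(seed)
    out = []
    for token in caption.split():
        translation = lexicon.get(token.lower())
        if translation is not None and rng.random() < swap_prob:
            out.append(translation)
        else:
            out.append(token)
    return " ".join(out)

print(code_switch("The dog is chasing a ball", EN_FR, seed=0))
```

Applied over the English captions of Conceptual Captions, such a procedure yields mixed-language text paired with the original images.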
In this paper, we further explore the potential
of machine translation for pretraining and fine-
tuning. Zhou et al. (2021) first pretrained a model
on machine translations of the Conceptual Captions
pretraining data in five high-resource languages
(Mandarin Chinese, Czech, French, German, and
Japanese), which then resulted in overall better
multilingual representations across a number of di-
verse languages (Bugliarello et al., 2022). Here, we
explore the potential of training multimodal mod-
els on a much larger and more diverse set of languages,
including low-resource ones. Effectively doing so
requires tackling issues and limitations with ma-
chine translation systems, which do not produce
high quality translations across all languages. This
is especially relevant when translating a large cor-
pus, which might include a large number of data
points with low-quality text.
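As a rough illustration of the kind of filtering this involves (a minimal sketch of generic corpus-cleaning heuristics, not the specific strategy proposed in this paper), the function below flags a translation when its length diverges sharply from the source or when it largely copies the English source verbatim, a common failure mode for low-resource target languages. The thresholds and function name are assumptions.

```python
def looks_like_bad_translation(source, translation,
                               max_len_ratio=2.0, max_copy_overlap=0.6):
    """Heuristically flag a low-quality machine translation.

    Returns True if the sentence pair should be discarded. Both checks
    are illustrative; real pipelines often add language identification
    or model-based quality-estimation scores on top of them.
    """
    src_tokens = source.split()
    tgt_tokens = translation.split()
    if not src_tokens or not tgt_tokens:
        return True

    # 1) Extreme length mismatch between source and translation.
    ratio = len(tgt_tokens) / len(src_tokens)
    if ratio > max_len_ratio or ratio < 1.0 / max_len_ratio:
        return True

    # 2) The "translation" mostly repeats the English source verbatim,
    #    which usually means the MT system failed to translate it.
    overlap = len(set(src_tokens) & set(tgt_tokens)) / len(set(tgt_tokens))
    return overlap > max_copy_overlap
```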
3 The IGLUE Benchmark
The impetus for our work is the recent creation
of the Image-Grounded Language Understanding
Evaluation (IGLUE; Bugliarello et al. 2022) bench-
mark for evaluating multimodal models across
twenty languages and four tasks, using five differ-
ent datasets. Specifically, the benchmark focuses
²French code-switching might transform “The dog is chasing
a ball” into “The chien est chasing a ball”, for example.
³This large-scale dataset is not publicly available.