Multilingual Multimodal Learning with Machine Translated Text
Chen QiuA   Dan OneațăB   Emanuele BugliarelloC
Stella FrankC,D   Desmond ElliottC,D
ASchool of Computer Science and Technology,
Wuhan University of Science and Technology, China
BUniversity Politehnica of Bucharest, Romania
CDepartment of Computer Science, University of Copenhagen, Denmark
DPioneer Centre for AI, Denmark
chen@wust.edu.cn dan.oneata@speed.pub.ro {emanuele, stfr, de}@di.ku.dk
Abstract
Most vision-and-language pretraining research
focuses on English tasks. However, the cre-
ation of multilingual multimodal evaluation
datasets (e.g. Multi30K, xGQA, XVNLI, and
MaRVL) poses a new challenge in finding
high-quality training data that is both multi-
lingual and multimodal. In this paper, we in-
vestigate whether machine translating English
multimodal data can be an effective proxy for
the lack of readily available multilingual data.
We call this framework TD-MML: Translated
Data for Multilingual Multimodal Learning,
and it can be applied to any multimodal dataset
and model. We apply it to both pretraining and
fine-tuning data with a state-of-the-art model.
In order to prevent models from learning from
low-quality translated text, we propose two
metrics for automatically removing such trans-
lations from the resulting datasets. In experi-
ments on five tasks across 20 languages in the
IGLUE benchmark, we show that translated
data can provide a useful signal for multilin-
gual multimodal learning, both at pretraining
and fine-tuning.
1 Introduction
Vision-and-language (V&L) pretraining is the pro-
cess of learning deep contextualised cross-modal
representations from large collections of image–
sentence pairs (Li et al.,2019;Tan and Bansal,
2019;Chen et al.,2020,inter-alia). These pre-
trained models are an excellent backbone for trans-
fer learning to a wide range of downstream tasks,
such as visual question answering (Antol et al.,
2015;Gurari et al.,2018;Agrawal et al.,2022),
referring expression alignment (Kazemzadeh et al.,
2014;Mao et al.,2016), and image–sentence re-
trieval (Young et al.,2014;Lin et al.,2014). Thus
far, downstream evaluations have mostly focused on
English tasks due to the availability of datasets, but
the recent IGLUE benchmark (Bugliarello et al.,
2022) now makes it possible to evaluate models on
several downstream tasks across 20 languages.
[Figure 1 illustration: an English question ("Q: What type of food is in the top left of the lunchbox?") is passed through a translation system to obtain multilingual text, for example German "Welche Art von Essen befindet sich oben links in der Brotdose?", Chinese "盒左上角是什么食物?", and Spanish "¿Qué tipo de comida se encuentra en la parte superior izquierda de la lonchera?"; the multilingual text is then used with a V&L model for tasks such as xGQA, MaRVL, and XVNLI.]
Figure 1: Multilingual multimodal data is a scarce re-
source compared to English multimodal data. Given
an English multimodal dataset, we generate a multilin-
gual dataset using a black box translation system. We
explore the utility of this approach to creating multi-
lingual text for both downstream task fine-tuning and
pretraining.
Success in multilingual multimodal tasks, such
as those in IGLUE, is expected to depend on mod-
els with grounded representations that transfer
across languages (Bugliarello et al.,2022). For
example, in the MaRVL dataset (Liu et al.,2021),
models need to deal with a linguistic and cultural
domain shift compared to English data. Therefore,
an open problem is to define pretraining strategies
that induce high-quality multilingual multimodal
representations. Existing work has tackled this
problem by either jointly training on English mul-
timodal data and multilingual text-only data (Liu
et al.,2021;Ni et al.,2021), pretraining with a pri-
vate dataset of multilingual captioned images (Jain
et al.,2021), or machine translating multimodal
pretraining data (Zhou et al.,2021).
In this paper, we further investigate the poten-
tial of machine translated text for both fine-tuning
and pretraining across four diverse V&L tasks.1
The overarching motivation is that machine trans-
lation is an inexpensive approach to producing large
amounts of multilingual text compared to collecting
data from humans, or scraping high-quality image–
1The models and the machine translated text are available at https://github.com/danoneata/td-mml
caption pairs from the web. Having access to thou-
sands of data points in a target language might
indeed be necessary to improve cross-lingual per-
formance in downstream tasks (Bugliarello et al.,
2022). As such, translating fine-tuning data into
multiple languages may be a compelling approach
towards downstream task success. Moreover, if this
can be achieved through machine translated text,
it raises the question of whether we can also pre-
train on many millions of multilingual translated
examples. Motivated by the initial experiments of
Zhou et al. (2021), we test this hypothesis further,
on more languages and more tasks, reporting more
nuanced results from large-scale translated text.
Overall, we show that machine translation can
provide inexpensive and impressive improvements
when fine-tuning models for multilingual multi-
modal tasks. Moreover, translation-based pretrain-
ing leads to significant gains in zero-shot cross-
lingual transfer over existing approaches. How-
ever, we find mixed results when combining this
with multilingual fine-tuning. There are still op-
portunities to realise further benefits from machine
translated text, which may be found through more
compute-intensive pretraining.
Contributions. 1) We present the TD-MML framework to narrow the gap between English and non-English languages in multimodal research. 2) In the process of translation-based pretraining, we present a reliable strategy to filter out bad translations. 3) We conduct systematic evaluations in zero-shot and machine translated scenarios, and show the benefits that can be gained from simply having more data in the target languages.
2 Related Work
Inspired by the success of self-supervised language
model pretraining (Devlin et al.,2019,inter-alia),
researchers have also explored this paradigm with
multimodal models (Gella et al.,2017;Ákos Kádár
et al.,2018). The first wave (Li et al.,2019;Tan
and Bansal,2019;Li et al.,2020;Chen et al.,
2020) were initialised from BERT and pretrained
on English image–text datasets like Conceptual
Captions (Sharma et al.,2018) and COCO (Lin
et al.,2014), where the visual modality was repre-
sented using feature vectors extracted from 10–100
automatically detected object proposals (Anderson
et al.,2018). More recent models (Kim et al.,2021;
Li et al.,2021;Singh et al.,2022) represent the
visual modality using a Vision Transformer (Doso-
vitskiy et al.,2021), which can be end-to-end fine-
tuned during pretraining, as opposed to working
with pre-extracted object proposals.
More related to our work are the multilingual
variants of these models (Liu et al.,2021;Zhou
et al.,2021;Ni et al.,2021;Jain et al.,2021).
The lack of large-scale multilingual multimodal
datasets has resulted in different strategies to train
such models. Liu et al. (2021) simply augment
English caption data with text-only multilingual
Wikipedia data. In addition to this, Ni et al. (2021)
further create code-switched multimodal data2 by
randomly swapping English words in Conceptual
Captions with the corresponding translation in one
of 50 other languages, obtained through the Panlex
dictionary. On the other hand, Zhou et al. (2021)
machine translate the Conceptual Captions dataset
into German, French, Czech, Japanese, and Chi-
nese, for a total of 19.8M pretraining data points.
Finally, Jain et al. (2021) pretrain on 3.6B multi-
lingual captions by extending the Conceptual Cap-
tions collection pipeline to multiple languages.3
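To make the code-switching idea concrete, the following is a minimal sketch under our own assumptions (a toy dictionary standing in for a Panlex lookup, and a uniform swap probability); it illustrates the general technique rather than the pipeline used by Ni et al. (2021).

```python
import random

# Toy English->French dictionary standing in for a Panlex lookup (illustrative only).
EN_FR = {"dog": "chien", "is": "est", "ball": "balle"}

def code_switch(sentence: str, dictionary: dict, p: float = 0.5) -> str:
    """Randomly replace words that have a dictionary entry with their translation."""
    out = []
    for word in sentence.split():
        if word.lower() in dictionary and random.random() < p:
            out.append(dictionary[word.lower()])
        else:
            out.append(word)
    return " ".join(out)

print(code_switch("The dog is chasing a ball", EN_FR))
# Possible output: "The chien est chasing a balle"
```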
In this paper, we further explore the potential
of machine translation for pretraining and fine-
tuning. Zhou et al. (2021) first pretrained a model
on machine translations of the Conceptual Captions
pretraining data in five high-resource languages
(Mandarin Chinese, Czech, French, German, and
Japanese), which then resulted in overall better
multilingual representations across a number of di-
verse languages (Bugliarello et al.,2022). Here, we
explore the potential of training multimodal mod-
els on a much larger and diverse set of languages,
including low-resource ones. Effectively doing so
requires tackling issues and limitations with ma-
chine translation systems, which do not produce
high quality translations across all languages. This
is especially relevant when translating a large cor-
pus, which might include a large number of data
points with low-quality text.
3 The IGLUE Benchmark
The impetus of our work is the recent creation
of the Image-Grounded Language Understanding
Evaluation (IGLUE; Bugliarello et al. 2022) bench-
mark for evaluating multimodal models across
twenty languages and four tasks, using five differ-
ent datasets. Specifically, the benchmark focuses
2French code-switching might transform "The dog is chasing a ball" into "The chien est chasing a ball", for example.
3This large-scale dataset is not publicly available.
on zero- and few-shot transfer, where models are
fine-tuned on English data and then tested to cross-
lingually generalise with no or few samples in the
target language for the target downstream task. The
following datasets are included in IGLUE:
XVNLI
is a cross-lingual Visual Natural Language
Inference task (Bugliarello et al.,2022), which
requires models to predict the relation (en-
tailment, contradiction, or neutral) between a
premise in the form of an image, and a hy-
pothesis in the form of a sentence.
xGQA
is a cross-lingual Grounded Question An-
swering task (Pfeiffer et al.,2022), using im-
ages from Visual Genome (Krishna et al.,
2017) and translations of the English ques-
tions from GQA (Hudson and Manning,
2019). The questions in GQA are automati-
cally generated from the image scene graphs.
MaRVL
focuses on multicultural reasoning over
images (Liu et al.,2021). The task is in the
same format as the English NLVR2 (Suhr
et al.,2019) data, namely to judge whether
a sentence is true or false for a pair of images.
However, the images and the descriptions are
sourced directly in the target languages.
xFlickr&CO
is an evaluation dataset for image–
text retrieval in eight high-resource lan-
guages (Bugliarello et al.,2022). The images
are collected from Flickr30K (Young et al.,
2014) and COCO (Lin et al.,2014), while the
captions are new descriptions sourced from
native speakers in the target languages.
WIT
is a second image–text retrieval dataset
based on the Wikipedia-based Image Text
dataset (Srinivasan et al.,2021). WIT is
scraped directly from Wikipedia and contains
a much more diverse set of image types than
the other datasets, as well as more complex
and entity-centric descriptions.
Each of the tasks has a natural English train-
ing counterpart: SNLI-VE (Xie et al.,2019) for
XVNLI; GQA for xGQA; NLVR2 for MaRVL; and the English training splits of Flickr30K and WIT for the retrieval tasks.
Bugliarello et al. (2022) found that current mul-
tilingual V&L models show a large gap in perfor-
mance, in each of these tasks, when evaluating on
non-English data. Moreover, further training these
models on a few examples in a target language only
slightly improved their cross-lingual capabilities.
Approach   ENG    IND    SWA    TAM    TUR    CMN    avg
English    71.6   55.1   55.5   53.1   56.2   53.1   54.6
MT         67.9   59.6   61.4   60.4   64.3   59.4   61.0
Table 1: MaRVL accuracy results for zero-shot cross-
lingual evaluation, i.e. English-only NLVR2 fine-
tuning, and multilingual fine-tuning using machine
translated NLVR2 data (MT). The average results ex-
clude ENG accuracy.
Approach   ENG    BEN    DEU    IND    KOR    POR    RUS    CMN    avg
English    54.8   10.8   34.8   33.7   12.1   22.1   18.8   19.6   21.7
MT         48.1   41.8   46.5   45.7   44.8   46.8   46.2   45.7   45.3
Table 2: xGQA accuracy results for zero-shot cross-
lingual evaluation, i.e. English-only GQA fine-tuning,
and multilingual finetuning using machine translated
GQA data (MT). Average results exclude ENG accu-
racy.
4 Fine-Tuning with Translated Data
As an initial experiment, we investigate the extent
to which performance can be improved by fine-
tuning on multilingual machine-translated data in-
stead of only English data. We conduct this ex-
periment on the MaRVL and xGQA datasets. The
results can be seen in Tables 1 and 2, respectively.
We use the M2M-100-large model (Fan et al.,
2021) to translate the NLVR2 training data into the
5 MaRVL languages, and the GQA training data
into the 7 xGQA languages. For the model, we use
the xUNITER (Liu et al.,2021) implementation
from VOLTA (Bugliarello et al.,2021). xUNITER
extends the UNITER architecture (Chen et al.,
2020) multilingually, by initializing the model from
XLM-RoBERTa (Conneau et al.,2020) and pre-
training on English image captions and text-only
multilingual Wikipedia paragraphs. Starting from
the publicly-released xUNITER checkpoint, we
fine-tune on the machine translated training sets for
each task. For a fair comparison to English-only
fine-tuning, we ensure that the multilingual fine-
tuning is based on the same number of parameter
updates. In effect, this reduces the number of train-
ing epochs from 20 to 3 for MaRVL, and from 5 to 1 for xGQA. We round the number of epochs so that it is close to the English-only setup.4 This means that, in our setup, all the images are seen the same number of times, but each unique caption is seen fewer times in each of the target languages.
4Note: this is an approximation. For MaRVL, 3 multilingual epochs are equivalent to 18 (rather than 20) epochs of English-only data (6 languages). For xGQA, 1 multilingual epoch is equivalent to 8 (rather than 5) English-only epochs (8 languages).
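To make this bookkeeping concrete, here is a small sketch of the epoch calculation as we reconstruct it from the description above (the rounding rule is our assumption, not the authors' code; the language counts include English plus the translated languages):

```python
# Keep the number of parameter updates roughly constant: with L languages,
# each multilingual epoch contains L times as many examples as an
# English-only epoch, so the epoch count is divided by L and rounded.
def multilingual_epochs(english_epochs: int, num_languages: int) -> int:
    return max(1, round(english_epochs / num_languages))

print(multilingual_epochs(20, 6))  # MaRVL: 20 English epochs -> 3 (about 18 English-equivalent)
print(multilingual_epochs(5, 8))   # xGQA: 5 English epochs -> 1 (8 English-equivalent)
```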
Using machine translated data for fine-tuning brings large improvements in performance for both MaRVL and xGQA. Table 1 shows the results for MaRVL, where accuracy for each non-English language increases by 4.5–8.1 points. Table 2 shows the results for xGQA, where performance for the non-English languages increases by 11.7–32.7 points. We also observe small decreases in performance on English for each task, but this is expected: the models were fine-tuned for the same number of steps, so the model fine-tuned on translations was exposed to less English text in exchange for multilingual text. We conclude that using machine translated
fine-tuning data is an inexpensive and viable path
to better task-specific performance.
5 On Pretraining with Translated Data
The previous section showed the benefits of us-
ing machine translated data for multilingual fine-
tuning. We now turn our attention to whether fur-
ther improvements can be realised by adding multi-
linguality via machine translated data is useful for
pretraining. This requires two components: (i) a
large-scale translation pipeline and the means to
deal with potential data quality issues, and (ii) a
model that can exploit the machine translated train-
ing data, which we dub TD-MML, for Translated Data for Multilingual Multimodal Learning.
5.1 Translation and Data Preparation
A commonly used dataset for multimodal pretrain-
ing is Conceptual Captions (Sharma et al.,2018),
gathered from alt-text on the Internet and post-
processed to remove proper names. We translate
2.77M English sentences from the Conceptual Cap-
tions training split into the twenty target languages
in IGLUE. Once again, we use the M2M-100-large
model (Fan et al.,2021), with 1.2B parameters.
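For reference, translations of this kind can be produced with the publicly released M2M-100 checkpoint through the Hugging Face transformers library; the sketch below is an illustration under our assumptions (checkpoint name, default greedy generation, one sentence at a time), not a description of the exact pipeline used here.

```python
# Sketch: translating an English caption with M2M-100 (1.2B parameters).
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_1.2B")
model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_1.2B")

def translate(caption: str, tgt_lang: str) -> str:
    tokenizer.src_lang = "en"                     # Conceptual Captions is English
    inputs = tokenizer(caption, return_tensors="pt")
    generated = model.generate(
        **inputs,
        # Force the decoder to start in the target language, e.g. "de", "zh", "id".
        forced_bos_token_id=tokenizer.get_lang_id(tgt_lang),
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]

print(translate("a dog chasing a ball on the beach", "de"))
```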
We notice that the quality of the translations
varies across languages, presumably due to the
amount of data used to train M2M-100. Moreover,
captions in this dataset often consist of sentence
fragments, which may be harder to translate well.
In order to prevent bad data from corrupting the
model, we apply a filtering step to the translated
data. The two most frequent types of errors are
single words being repeated multiple times and
English words being copied into the translation. We
discard sentences that exhibit these characteristics
based on the following two “badness” scores:
Figure 2: Cumulative distributions of the two badness scores (1 − TTR, the complement of the token-to-type ratio, and BLEU src–tgt, the BLEU score between the source and target sentence) for the nineteen non-English languages in IGLUE. The languages are grouped in three categories: (a) languages with a non-Latin script, (b) non-Indo-European languages, and (c) Indo-European languages with a Latin script. The vertical lines denote the filtering thresholds for each of the categories and the two scores.
Complement of the token-to-type ratio. The
token-to-type ratio (TTR) measures the frac-
tion of unique tokens in a given text. We use
its complement (1 − TTR), such that a large
score (close to one) indicates repetition.
BLEU score between the source sentence and
its translation. We measure the similarity
between the English source and the (non-
English) target by computing the BLEU score
using the NLTK toolkit (Bird,2006). A large
score (close to one) indicates that English text
has been copied into the translation.
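A minimal sketch of the two scores and the resulting filter is given below. It relies on our own assumptions: simple whitespace tokenisation, and placeholder threshold values (only the TTR threshold of 0.5, mentioned in the next paragraph, is stated in the text; the BLEU thresholds vary by language group).

```python
# Sketch of the two "badness" scores used to filter machine translations.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def repetition_score(text: str) -> float:
    """1 - TTR: a value close to 1 means the same tokens are repeated many times."""
    tokens = text.split()
    if not tokens:
        return 1.0
    return 1.0 - len(set(tokens)) / len(tokens)

def copy_score(source_en: str, target: str) -> float:
    """BLEU between the English source and the translation:
    a value close to 1 means the English text was copied into the translation."""
    smooth = SmoothingFunction().method1
    return sentence_bleu([source_en.split()], target.split(),
                         smoothing_function=smooth)

def keep_translation(source_en: str, target: str,
                     ttr_threshold: float = 0.5,    # stated in the text
                     bleu_threshold: float = 0.5):  # placeholder; varies by language group
    """Discard translations that are repetitive or that copy the English source."""
    return (repetition_score(target) < ttr_threshold and
            copy_score(source_en, target) < bleu_threshold)

# A repeated-word translation is rejected, a normal one is kept.
print(keep_translation("a view of the city", "ciudad ciudad ciudad ciudad"))  # False
print(keep_translation("a view of the city", "una vista de la ciudad"))       # True
```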
We estimate thresholds for the two scores by
manually inspecting a subset of 2,000 sentences
from each of the twenty target languages. We use
the same TTR threshold (0.5) for all languages
(since repetition is language-independent). We
observe different patterns of English copying, so we set different thresholds for different language groups (Figure 2): Indo-European languages with a Latin script, all languages with a non-Latin script, and non-Indo-European languages using a Latin script.