
caption pairs from the web. Having access to thou-
sands of data points in a target language might
indeed be necessary to improve cross-lingual per-
formance in downstream tasks (Bugliarello et al.,
2022). As such, translating fine-tuning data into
multiple languages may be a compelling approach
towards downstream task success. Moreover, if this
can be achieved through machine translated text,
it raises the question of whether we can also pre-
train on many millions of multilingual translated
examples. Motivated by the initial experiments of
Zhou et al. (2021), we test this hypothesis further,
on more languages and more tasks, reporting more
nuanced results from large-scale translated text.
Overall, we show that machine translation can
provide inexpensive and impressive improvements
when fine-tuning models for multilingual multi-
modal tasks. Moreover, translation-based pretrain-
ing leads to significant gains in zero-shot cross-
lingual transfer over existing approaches. How-
ever, we find mixed results when combining this
with multilingual fine-tuning. There are still op-
portunities to realise further benefits from machine
translated text, which may be found through more
compute-intensive pretraining.
Contributions. 1) We present the TD-MML
framework to narrow the gap between English and
non-English languages in multimodal research.
2) In the process of translation-based pretraining,
we present a reliable strategy to filter out bad
translations. 3) We conduct systematic evaluations
in zero-shot and machine translated scenarios, and
show the benefits that can be gained from simply
having more data in the target languages.
2 Related Work
Inspired by the success of self-supervised language
model pretraining (Devlin et al., 2019, inter alia),
researchers have also explored this paradigm with
multimodal models (Gella et al., 2017; Ákos Kádár
et al., 2018). The first wave of such models (Li et al.,
2019; Tan and Bansal, 2019; Li et al., 2020; Chen et al.,
2020) were initialised from BERT and pretrained
on English image–text datasets like Conceptual
Captions (Sharma et al., 2018) and COCO (Lin
et al., 2014), where the visual modality was repre-
sented using feature vectors extracted from 10–100
automatically detected object proposals (Anderson
et al., 2018). More recent models (Kim et al., 2021;
Li et al., 2021; Singh et al., 2022) represent the
visual modality using a Vision Transformer (Doso-
vitskiy et al., 2021), which can be end-to-end fine-
tuned during pretraining, as opposed to working
with pre-extracted object proposals.
More related to our work are the multilingual
variants of these models (Liu et al., 2021; Zhou
et al., 2021; Ni et al., 2021; Jain et al., 2021).
The lack of large-scale multilingual multimodal
datasets has resulted in different strategies to train
such models. Liu et al. (2021) simply augment
English caption data with text-only multilingual
Wikipedia data. In addition to this, Ni et al. (2021)
further create code-switched multimodal data² by
randomly swapping English words in Conceptual
Captions with the corresponding translation in one
of 50 other languages, obtained through the PanLex
dictionary. On the other hand, Zhou et al. (2021)
machine translate the Conceptual Captions dataset
into German, French, Czech, Japanese, and Chi-
nese, for a total of 19.8M pretraining data points.
Finally, Jain et al. (2021) pretrain on 3.6B multi-
lingual captions by extending the Conceptual Cap-
tions collection pipeline to multiple languages.³
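To make the code-switching strategy of Ni et al. (2021) concrete, the sketch below shows one way such data could be generated. The toy dictionary, swap probability, and function name are illustrative assumptions rather than the authors' implementation, which draws translations for 50 languages from the PanLex dictionary.

```python
import random

# Toy English-to-French lexicon; Ni et al. (2021) instead draw translations
# from the PanLex dictionary across 50 languages.
EN_FR = {"dog": "chien", "is": "est", "ball": "balle"}

def code_switch(caption, lexicon, swap_prob=0.5, seed=None):
    """Randomly replace English tokens that have a dictionary entry.

    Tokens without an entry (or not selected for swapping) stay in
    English, producing mixed-language captions such as
    "The chien est chasing a ball".
    """
    rng = random.Random(seed)
    out = []
    for token in caption.split():
        translation = lexicon.get(token.lower())
        if translation is not None and rng.random() < swap_prob:
            out.append(translation)
        else:
            out.append(token)
    return " ".join(out)

print(code_switch("The dog is chasing a ball", EN_FR, seed=0))
```

Applied over the English captions of Conceptual Captions, such a procedure yields mixed-language text paired with the original images.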
In this paper, we further explore the potential
of machine translation for pretraining and fine-
tuning. Zhou et al. (2021) first pretrained a model
on machine translations of the Conceptual Captions
pretraining data in five high-resource languages
(Mandarin Chinese, Czech, French, German, and
Japanese), which then resulted in overall better
multilingual representations across a number of di-
verse languages (Bugliarello et al., 2022). Here, we
explore the potential of training multimodal mod-
els on a much larger and more diverse set of languages,
including low-resource ones. Effectively doing so
requires tackling issues and limitations with ma-
chine translation systems, which do not produce
high quality translations across all languages. This
is especially relevant when translating a large cor-
pus, which might include a large number of data
points with low-quality text.
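As a rough illustration of the kind of filtering this involves (a minimal sketch of generic corpus-cleaning heuristics, not the specific strategy proposed in this paper), the function below flags a translation when its length diverges sharply from the source or when it largely copies the English source verbatim, a common failure mode for low-resource target languages. The thresholds and function name are assumptions.

```python
def looks_like_bad_translation(source, translation,
                               max_len_ratio=2.0, max_copy_overlap=0.6):
    """Heuristically flag a low-quality machine translation.

    Returns True if the sentence pair should be discarded. Both checks
    are illustrative; real pipelines often add language identification
    or model-based quality-estimation scores on top of them.
    """
    src_tokens = source.split()
    tgt_tokens = translation.split()
    if not src_tokens or not tgt_tokens:
        return True

    # 1) Extreme length mismatch between source and translation.
    ratio = len(tgt_tokens) / len(src_tokens)
    if ratio > max_len_ratio or ratio < 1.0 / max_len_ratio:
        return True

    # 2) The "translation" mostly repeats the English source verbatim,
    #    which usually means the MT system failed to translate it.
    overlap = len(set(src_tokens) & set(tgt_tokens)) / len(set(tgt_tokens))
    return overlap > max_copy_overlap
```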
3 The IGLUE Benchmark
The impetus for our work is the recent creation
of the Image-Grounded Language Understanding
Evaluation (IGLUE; Bugliarello et al. 2022) bench-
mark for evaluating multimodal models across
twenty languages and four tasks, using five differ-
ent datasets. Specifically, the benchmark focuses
²French code-switching might transform “The dog is chasing
a ball” into “The chien est chasing a ball”, for example.
³This large-scale dataset is not publicly available.