LVP-M3: Language-aware Visual Prompt for Multilingual
Multimodal Machine Translation

Hongcheng Guo*1, Jiaheng Liu*1, Haoyang Huang2, Jian Yang1,
Zhoujun LiB1, Dongdong Zhang2, Zheng Cui2, Furu Wei2
1Beihang University
2Microsoft Research Asia
{hongchengguo,liujiaheng,jiaya,lizj}@buaa.edu.cn
{haohua,dozhang,zhcui,fuwei}@microsoft.com

* First two authors contributed equally.
B Corresponding author.
Abstract

Multimodal Machine Translation (MMT) focuses on enhancing text-only translation with visual features and has attracted considerable attention from both the natural language processing and computer vision communities. However, recent approaches still train a separate model for each language pair, which is costly and unaffordable as the number of languages increases in the real world. In other words, the multilingual multimodal machine translation (Multilingual MMT) task, which addresses this issue by providing a shared semantic space for multiple languages, has not been investigated. Besides, the image modality has no language boundaries, which makes it well suited to bridging the semantic gap between languages. To this end, we first propose the Multilingual MMT task by establishing two new Multilingual MMT benchmark datasets covering seven languages. We then propose LVP-M3, an effective baseline that uses visual prompts to support translation between different languages and consists of three stages: token encoding, language-aware visual prompt generation, and language translation. Extensive experimental results on our constructed benchmark datasets demonstrate the effectiveness of the LVP-M3 method for Multilingual MMT.
1 Introduction
Multimodal Machine Translation (MMT) extends conventional text-based machine translation by taking corresponding images as additional inputs (Lin et al., 2020; Li et al., 2022) to mitigate the data sparsity and ambiguity problems (Ive et al., 2019; Yang et al., 2022) of purely text-based machine translation. Similar to other multimodal tasks (e.g., visual question answering (Antol et al., 2015; Shih et al., 2016), image captioning (Vinyals et al., 2015; Jia et al., 2015), and video-text retrieval (Liu et al., 2022d)), MMT aims to exploit visual information for the machine translation task. Moreover, MMT has broad applications (Zhou et al., 2018), such as multimedia news and movie subtitles in different languages.

Figure 1: Comparison of MMT and Multilingual MMT. (a) For MMT, we need to train different MMT models to support translation between different language pairs (e.g., "En-De" denotes translating English to German). (b) For Multilingual MMT, we only need one single model to translate the source language into different target languages.
However, as shown in Fig. 1(a), previous MMT models (e.g., DCCN (Lin et al., 2020)) handle a single translation pair (e.g., English→German, English→French) well, but training a separate model for each language pair is unaffordable considering that there are thousands of languages in the world. A straightforward solution to reduce the computational cost is to use one model to handle translation into multiple languages, as shown in Fig. 1(b). Meanwhile, multilingual machine translation has been investigated for many years (Conneau et al., 2020), but these existing methods only take text as input and ignore the visual context. Therefore, in this work, we first propose the Multilingual Multimodal Machine Translation (Multilingual MMT) task, which performs translation into multiple languages with one single model.
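To make the single-model setup concrete, a common way to tell one model which target language to produce is to prepend a target-language token to the source sentence, as in multilingual NMT (Johnson et al., 2017); the snippet below is a minimal sketch with made-up token names and a commented-out model call, not the exact input format of LVP-M3.

```python
# Minimal illustration of steering one multilingual model with a target-language
# token prepended to the source sentence (in the spirit of Johnson et al., 2017).
# The "<2xx>" token format and the commented translate() call are assumptions.
def build_input(source_sentence: str, target_lang: str) -> str:
    """Prepend a target-language token, e.g. '<2de>' for German."""
    return f"<2{target_lang.lower()}> {source_sentence}"


source = "A boy dives into the water."
for lang in ["de", "fr", "cs", "hi", "tr", "lv"]:
    model_input = build_input(source, lang)
    # A single shared model would consume every variant, e.g.:
    # translation = multilingual_mmt_model.translate(model_input, image)
    print(model_input)
```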
To eliminate the above limitations, we propose a simple and effective LVP-M3 method, which includes three stages: Token Encoding, Language-aware Visual Prompt Generation (LVPG), and Language Translation. Specifically, in the token encoding stage, we use a pre-trained vision encoder to extract visual tokens and, following Johnson et al. (2017), a Transformer to encode the textual tokens. In LVPG, inspired by Yang et al. (2019) and Tian et al. (2020), a controller network (Fig. 3) dynamically generates the parameters of a mapping network conditioned on the target language, and the mapping network then outputs the language-aware visual prompts. Finally, during language translation, following works such as ViLBERT (Lu et al., 2019), we utilize a co-Transformer to generate vision-guided language tokens, and a Transformer decoder then predicts the translation results.
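To make the LVPG stage more concrete, the snippet below is a minimal PyTorch sketch of the idea: a controller network generates the parameters of a mapping network from the target-language id, and the mapping network projects the visual tokens into language-aware visual prompts. The module names, dimensions, and the choice of a single generated linear layer are our own assumptions for illustration, not the paper's exact architecture.

```python
# A minimal PyTorch sketch of Language-aware Visual Prompt Generation (LVPG).
# Module names, dimensions, and the single-linear-layer mapping network are
# illustrative assumptions, not the paper's exact implementation.
import torch
import torch.nn as nn


class LVPG(nn.Module):
    def __init__(self, num_languages: int, d_model: int = 512):
        super().__init__()
        self.d_model = d_model
        # Controller network: conditioned on the target language, it dynamically
        # generates the parameters (weight and bias) of the mapping network.
        self.controller = nn.Embedding(num_languages, d_model * d_model + d_model)

    def forward(self, visual_tokens: torch.Tensor, tgt_lang_id: torch.Tensor):
        # visual_tokens: (batch, num_tokens, d_model) from a pre-trained vision encoder
        # tgt_lang_id:   (batch,) integer id of the target language
        params = self.controller(tgt_lang_id)                       # (batch, d*d + d)
        weight = params[:, : self.d_model ** 2].view(-1, self.d_model, self.d_model)
        bias = params[:, self.d_model ** 2 :].unsqueeze(1)          # (batch, 1, d_model)
        # Mapping network: a per-language linear projection of the visual tokens,
        # producing the language-aware visual prompts.
        return torch.bmm(visual_tokens, weight) + bias


# Example: 7 target languages, 49 visual tokens of width 512.
lvpg = LVPG(num_languages=7, d_model=512)
prompts = lvpg(torch.randn(2, 49, 512), torch.tensor([0, 3]))       # (2, 49, 512)
```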
Extensive experiments are conducted on our proposed benchmark datasets for LVP-M3. The results show that our model achieves state-of-the-art performance in all translation directions, outperforming the text-only multilingual model by 4.3 BLEU scores on average.
The contributions of this work are summarized as follows:

• We first propose the Multilingual Multimodal Machine Translation (Multilingual MMT) task to handle translation for multiple language pairs, which investigates the effect of the vision modality on multilingual translation and reduces the computation cost of existing MMT methods for multiple languages.

• For Multilingual MMT, we propose an effective language-aware visual prompt generation strategy that produces different visual prompts for different target languages based on the vision modality and the type of the target language.

• We establish two Multilingual MMT benchmark datasets to nourish further research on Multilingual MMT, and extensive experiments on these datasets demonstrate the effectiveness of our proposed LVP-M3 method.
2 Related Works
Multimodal Machine Translation. The multimodal context plays a key role in Multimodal Machine Translation (MMT). Recent MMT methods can be divided into three categories: (1) using global visual features directly (Calixto and Liu, 2017); for instance, Huang et al. (2016) proposes to concatenate global and regional visual features with the source sequences. (2) Exploiting visual features via an attention scheme (Libovický and Helcl, 2017; Helcl et al., 2018); Calixto et al. (2017) introduces visual features into the MMT model through an independent attention module. (3) Combining other vision tasks with the translation task via multi-task learning (Calixto et al., 2019; Yin et al., 2020); Elliott and Kádár (2017) decomposes multimodal translation into two sub-tasks (i.e., translation and visual grounding). Recently, Huang et al. (2020) focuses on the unsupervised setting for MMT, utilizing pseudo visual pivoting and visual content to improve cross-lingual alignments in the latent space. In contrast, LVP-M3 considers the fully-supervised multilingual setting: it maps vision embeddings into different feature spaces so that one MT model can handle translation into multiple languages. Besides, reducing computation cost is vital for many tasks (Liu et al., 2021, 2022c,a), and we focus on the Multilingual MMT task by using one single model for efficiency.
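As a concrete illustration of the attention-based category (2) above, the sketch below injects visual features into the text encoder states through an independent cross-attention module; the module name, dimensions, and residual design are illustrative assumptions rather than the implementation of any cited work.

```python
# Sketch of attention-based visual fusion for MMT (category (2) above).
# Names, dimensions, and the residual design are illustrative assumptions.
import torch
import torch.nn as nn


class VisualCrossAttention(nn.Module):
    def __init__(self, d_model: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, text_states: torch.Tensor, image_feats: torch.Tensor):
        # text_states: (batch, src_len, d_model) from the text encoder
        # image_feats: (batch, num_regions, d_model) global/regional visual features
        attended, _ = self.attn(query=text_states, key=image_feats, value=image_feats)
        # The residual connection keeps the original textual signal dominant.
        return self.norm(text_states + attended)


fuse = VisualCrossAttention()
fused = fuse(torch.randn(2, 20, 512), torch.randn(2, 36, 512))   # (2, 20, 512)
```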
Multilingual Language Models. Pre-trained multilingual Transformer-based language models (e.g., mBERT (Kenton and Toutanova, 2019) and XLM-R (Conneau et al., 2020)) utilize the same pre-training strategies as their monolingual counterparts (e.g., BERT (Kenton and Toutanova, 2019) and RoBERTa (Liu et al., 2019)); they are pre-trained with the masked language modeling (MLM) objective. Artetxe et al. (2020) proposes a method to transfer monolingual representations to new languages in an unsupervised fashion and provides new insights into the generalization abilities of multilingual models. Hu et al. (2020) introduces the Cross-lingual Transfer Evaluation of Multilingual Encoders (XTREME) benchmark to evaluate cross-lingual generalization capabilities, and Karthikeyan et al. (2020) provides a comprehensive study of the contribution of different components in M-BERT to its cross-lingual ability. Rust et al. (2021) shows that monolingually adapted tokenizers can robustly improve the monolingual performance of multilingual models. Overall, compared with these methods, we focus on the multilingual setting for MMT, which has not been investigated before.
Vision-Language Models. The success of vision-language models can be credited to three important factors: Transformers (Liu et al., 2022b; Vaswani et al., 2017), contrastive representation learning (Radford et al., 2021; Li et al., 2020), and large-scale training datasets (Sharma et al., 2018; Miech et al., 2019). Previous Transformer-based multimodal models (Tan and Bansal, 2019; Chen et al., 2020; Gan et al., 2020; Bugliarello et al., 2021) jointly encode text tokens and image region features by preprocessing images with object detection models. The image region features are projected into the joint embedding space of the multimodal Transformer, and multi-head attention then attends to all text and image inputs to learn a joint representation of both modalities. Besides, Kamath et al. (2021) avoids using object detectors as a black box for pre-extracting region features and instead trains the object detector end-to-end with the multimodal Transformer to achieve flexibility and better representation capacity. Recently, the representative approach CLIP (Radford et al., 2021) trains two neural network-based encoders with a contrastive loss to match pairs of images and texts. After consuming 400 million data pairs, the CLIP model demonstrates remarkable zero-shot image recognition capability and has been applied to many downstream tasks. For example, Shen et al. (2022) proposes to leverage the CLIP model for different vision-language models across various tasks (e.g., image captioning, visual question answering). In our work, we aim to investigate the effectiveness of multimodal information for Multilingual MMT.
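For reference, the following is a minimal sketch of a CLIP-style symmetric contrastive objective as described above: matching image-text pairs lie on the diagonal of the similarity matrix, and cross-entropy is applied in both directions. The function name, temperature value, and random embeddings in the example are placeholder assumptions.

```python
# Minimal sketch of a CLIP-style symmetric contrastive loss.
# The temperature value and random example embeddings are arbitrary assumptions.
import torch
import torch.nn.functional as F


def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    # image_emb, text_emb: (batch, dim) embeddings of paired images and texts
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # (batch, batch) cosine similarities
    targets = torch.arange(logits.size(0))            # matching pairs on the diagonal
    # Symmetric cross-entropy over image-to-text and text-to-image directions.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2


loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```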
3 Datasets
We introduce two Multilingual MMT benchmark datasets (i.e., M3-Multi30K and M3-AmbigCaps) built from Multi30K (Elliott et al., 2016) and AmbigCaps (Li et al., 2021). Here, we describe the details of M3-Multi30K and M3-AmbigCaps.
Data Construction. The widely-used Multi30K dataset for the MMT task is built on the Flickr30K Entities dataset (Plummer et al., 2017); for each image in Multi30K, one of its English (En) descriptions from Flickr30K Entities is selected.
En: A child is splashing in the water
De: Ein Kind plantscht im Wasser
Cs: Dítě se šplouchá ve vodě
Tr: Bir çocuk suya sıçratıyor
Hi:
Lv: Bērns plunčājas ūdenī
Fr: Un enfant éclabousse dans l'eau
Figure 2: Example of an image with its descriptions in seven different languages.
Language ISO Family Speakers
English En Germanic 400M
German De Germanic 95M
French Fr Romance 250M
Czech Cs Slavic 11M
Hindi Hi Indo-Aryan 800M
Turkish Tr Turkic 65M
Latvian Lv Baltic 2M
Table 1: Languages covered by our proposed M3-Multi30K and M3-AmbigCaps datasets.
Currently, the English description of each image is translated into German (De), French (Fr), and Czech (Cs) (Elliott et al., 2017; Barrault et al., 2018). To support more languages from different language families and with various language distributions for Multilingual MMT, we extend the existing Multi30K dataset with three additional languages, as shown in Table 1; one sample of the M3-Multi30K dataset is shown in Fig. 2.
Specifically, in the annotation process, based on the recent state-of-the-art multilingual machine translation model XLM-R (Conneau et al., 2020), we first translate the English description of each image in Multi30K into Hindi (Hi), Turkish (Tr), and Latvian (Lv). Then, we hire independent native speakers to verify and improve the quality of the translation results in the different languages. In addi-