LVP-M3: Language-aware Visual Prompt for Multilingual
Multimodal Machine Translation

Hongcheng Guo*1, Jiaheng Liu*1, Haoyang Huang2, Jian Yang1,
Zhoujun LiB1, Dongdong Zhang2, Zheng Cui2, Furu Wei2
1Beihang University
2Microsoft Research Asia
{hongchengguo,liujiaheng,jiaya,lizj}@buaa.edu.cn
{haohua,dozhang,zhcui,fuwei}@microsoft.com

* First two authors contributed equally.
B Corresponding author.
Abstract

Multimodal Machine Translation (MMT) focuses on enhancing text-only translation with visual features and has attracted considerable attention from both the natural language processing and computer vision communities. However, recent approaches still train a separate model for each language pair, which is costly and unaffordable as the number of languages increases in the real world. In other words, the multilingual multimodal machine translation (Multilingual MMT) task, which addresses this issue by providing a shared semantic space for multiple languages, has not been investigated. Besides, the image modality has no language boundaries, which makes it well suited to bridging the semantic gap between languages. To this end, we first propose the Multilingual MMT task by establishing two new Multilingual MMT benchmark datasets covering seven languages. We then propose LVP-M3, an effective baseline that uses visual prompts to support translation between different languages and consists of three stages: token encoding, language-aware visual prompt generation, and language translation. Extensive experimental results on our constructed benchmark datasets demonstrate the effectiveness of the LVP-M3 method for Multilingual MMT.
1 Introduction
Multimodal Machine Translation (MMT) extends conventional text-based machine translation by taking corresponding images as additional inputs (Lin et al., 2020; Li et al., 2022) to mitigate the data sparsity and ambiguity problems (Ive et al., 2019; Yang et al., 2022) of purely text-based machine translation. Similar to other multimodal tasks (e.g., visual question answering (Antol et al., 2015; Shih et al., 2016), image captioning (Vinyals et al., 2015; Jia et al., 2015), and video-text retrieval (Liu et al., 2022d)), MMT aims to exploit visual information for the machine translation task. Moreover, MMT has broad applications (Zhou et al., 2018), such as multimedia news and movie subtitles in different languages.

Figure 1: Comparison of MMT and Multilingual MMT. (a) For MMT, we need to train different MMT models to support translation between different language pairs (e.g., "En-De" denotes translating English to German). (b) For Multilingual MMT, we only need one single model to translate the source language into different target languages.
However, as shown in Fig. 1(a), previous MMT models (e.g., DCCN (Lin et al., 2020)) handle a single translation pair (e.g., English→German, English→French) well, but training a separate model for each language pair is unaffordable considering that there are thousands of languages in the world. A straightforward solution to reduce the computational cost is to use one model to handle translation into multiple languages, as shown in Fig. 1(b). Meanwhile, multilingual machine translation has been investigated for many years (Conneau et al., 2020), but these existing methods only take text as input and ignore the visual context. Therefore, in this work, we first propose the Multilingual Multimodal Machine Translation (Multilingual MMT) task, which performs translation into multiple languages with one single model.
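To make the single-model setup concrete, a common way to tell one model which target language to produce is to prepend a target-language token to the source sentence, as in multilingual NMT (Johnson et al., 2017); the snippet below is a minimal sketch with made-up token names and a commented-out model call, not the exact input format of LVP-M3.

```python
# Minimal illustration of steering one multilingual model with a target-language
# token prepended to the source sentence (in the spirit of Johnson et al., 2017).
# The "<2xx>" token format and the commented translate() call are assumptions.
def build_input(source_sentence: str, target_lang: str) -> str:
    """Prepend a target-language token, e.g. '<2de>' for German."""
    return f"<2{target_lang.lower()}> {source_sentence}"


source = "A boy dives into the water."
for lang in ["de", "fr", "cs", "hi", "tr", "lv"]:
    model_input = build_input(source, lang)
    # A single shared model would consume every variant, e.g.:
    # translation = multilingual_mmt_model.translate(model_input, image)
    print(model_input)
```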
To eliminate the above limitations, we propose a simple and effective LVP-M3 method, which includes three stages: Token Encoding, Language-aware Visual Prompt Generation (LVPG), and Language Translation. Specifically, in the token encoding stage, we use a pre-trained vision encoder to extract visual tokens and, following Johnson et al. (2017), a Transformer to encode the textual tokens. In LVPG, inspired by Yang et al. (2019) and Tian et al. (2020), a controller network (Fig. 3) dynamically generates the parameters of a mapping network conditioned on the target language, and the mapping network then outputs the language-aware visual prompts. Finally, during language translation, following works such as ViLBERT (Lu et al., 2019), we utilize a co-Transformer to generate vision-guided language tokens, and a Transformer decoder then predicts the translation results.
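To make the LVPG stage more concrete, the snippet below is a minimal PyTorch sketch of the idea: a controller network generates the parameters of a mapping network from the target-language id, and the mapping network projects the visual tokens into language-aware visual prompts. The module names, dimensions, and the choice of a single generated linear layer are our own assumptions for illustration, not the paper's exact architecture.

```python
# A minimal PyTorch sketch of Language-aware Visual Prompt Generation (LVPG).
# Module names, dimensions, and the single-linear-layer mapping network are
# illustrative assumptions, not the paper's exact implementation.
import torch
import torch.nn as nn


class LVPG(nn.Module):
    def __init__(self, num_languages: int, d_model: int = 512):
        super().__init__()
        self.d_model = d_model
        # Controller network: conditioned on the target language, it dynamically
        # generates the parameters (weight and bias) of the mapping network.
        self.controller = nn.Embedding(num_languages, d_model * d_model + d_model)

    def forward(self, visual_tokens: torch.Tensor, tgt_lang_id: torch.Tensor):
        # visual_tokens: (batch, num_tokens, d_model) from a pre-trained vision encoder
        # tgt_lang_id:   (batch,) integer id of the target language
        params = self.controller(tgt_lang_id)                       # (batch, d*d + d)
        weight = params[:, : self.d_model ** 2].view(-1, self.d_model, self.d_model)
        bias = params[:, self.d_model ** 2 :].unsqueeze(1)          # (batch, 1, d_model)
        # Mapping network: a per-language linear projection of the visual tokens,
        # producing the language-aware visual prompts.
        return torch.bmm(visual_tokens, weight) + bias


# Example: 7 target languages, 49 visual tokens of width 512.
lvpg = LVPG(num_languages=7, d_model=512)
prompts = lvpg(torch.randn(2, 49, 512), torch.tensor([0, 3]))       # (2, 49, 512)
```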
Extensive experiments are conducted on our proposed benchmark datasets for LVP-M3. The results show that our model achieves state-of-the-art performance in all translation directions, outperforming the text-only multilingual model by 4.3 BLEU scores on average.
The contributions of this work are summarized as follows:

• We first propose the Multilingual Multimodal Machine Translation (Multilingual MMT) task to handle translation for multiple language pairs, which investigates the effect of the vision modality on multilingual translation and reduces the computation cost of existing MMT methods for multiple languages.

• For Multilingual MMT, we propose an effective language-aware visual prompt generation strategy that produces different visual prompts for different target languages based on the vision modality and the type of the target language.

• We establish two Multilingual MMT benchmark datasets to nourish further research on Multilingual MMT, and extensive experiments on these datasets demonstrate the effectiveness of our proposed LVP-M3 method.
2 Related Works
Multimodal Machine Translation. The multimodal context plays a key role in Multimodal Machine Translation (MMT). Recent MMT methods can be divided into three categories: (1) using global visual features directly (Calixto and Liu, 2017); for instance, Huang et al. (2016) proposes to concatenate global and regional visual features with the source sequences. (2) Exploiting visual features via an attention scheme (Libovický and Helcl, 2017; Helcl et al., 2018); Calixto et al. (2017) introduces visual features into the MMT model through an independent attention module. (3) Combining other vision tasks with the translation task via multi-task learning (Calixto et al., 2019; Yin et al., 2020); Elliott and Kádár (2017) decomposes multimodal translation into two sub-tasks (i.e., translation and visual grounding). Recently, Huang et al. (2020) focuses on the unsupervised setting for MMT, utilizing pseudo visual pivoting and visual content to improve cross-lingual alignments in the latent space. In contrast, LVP-M3 considers the fully-supervised multilingual setting: it maps vision embeddings into different feature spaces so that one MT model can handle translation into multiple languages. Besides, reducing computation cost is vital for many tasks (Liu et al., 2021, 2022c,a), and we focus on the Multilingual MMT task by using one single model for efficiency.
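As a concrete illustration of the attention-based category (2) above, the sketch below injects visual features into the text encoder states through an independent cross-attention module; the module name, dimensions, and residual design are illustrative assumptions rather than the implementation of any cited work.

```python
# Sketch of attention-based visual fusion for MMT (category (2) above).
# Names, dimensions, and the residual design are illustrative assumptions.
import torch
import torch.nn as nn


class VisualCrossAttention(nn.Module):
    def __init__(self, d_model: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, text_states: torch.Tensor, image_feats: torch.Tensor):
        # text_states: (batch, src_len, d_model) from the text encoder
        # image_feats: (batch, num_regions, d_model) global/regional visual features
        attended, _ = self.attn(query=text_states, key=image_feats, value=image_feats)
        # The residual connection keeps the original textual signal dominant.
        return self.norm(text_states + attended)


fuse = VisualCrossAttention()
fused = fuse(torch.randn(2, 20, 512), torch.randn(2, 36, 512))   # (2, 20, 512)
```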
Multilingual Language Models. Pre-trained multilingual Transformer-based language models (e.g., mBERT (Kenton and Toutanova, 2019) and XLM-R (Conneau et al., 2020)) utilize the same pre-training strategies as their monolingual counterparts (e.g., BERT (Kenton and Toutanova, 2019) and RoBERTa (Liu et al., 2019)); they are pre-trained with the masked language modeling (MLM) objective. Artetxe et al. (2020) proposes a method to transfer monolingual representations to new languages in an unsupervised fashion and provides new insights into the generalization abilities of multilingual models. Hu et al. (2020) introduces the Cross-lingual Transfer Evaluation of Multilingual Encoders (XTREME) benchmark to evaluate cross-lingual generalization capabilities, and Karthikeyan et al. (2020) provides a comprehensive study of the contribution of different components in M-BERT to its cross-lingual ability. Rust et al. (2021) shows that monolingually adapted tokenizers can robustly improve the monolingual performance of multilingual models. Overall, compared with these methods, we focus on the multilingual setting for MMT, which has not been investigated before.
Vision-Language Models. The success of vision-language models can be credited to three important factors: Transformers (Liu et al., 2022b; Vaswani et al., 2017), contrastive representation learning (Radford et al., 2021; Li et al., 2020), and large-scale training datasets (Sharma et al., 2018; Miech et al., 2019). Previous Transformer-based multimodal models (Tan and Bansal, 2019; Chen et al., 2020; Gan et al., 2020; Bugliarello et al., 2021) jointly encode text tokens and image region features by preprocessing images with object detection models. The image region features are projected into the joint embedding space of the multimodal Transformer, and multi-head attention then attends to all text and image inputs to learn a joint representation of both modalities. Besides, Kamath et al. (2021) avoids using object detectors as a black box for pre-extracting region features and instead trains the object detector end-to-end with the multimodal Transformer to achieve flexibility and better representation capacity. Recently, the representative approach CLIP (Radford et al., 2021) trains two neural network-based encoders with a contrastive loss to match pairs of images and texts. After consuming 400 million data pairs, the CLIP model demonstrates remarkable zero-shot image recognition capability and has been applied to many downstream tasks. For example, Shen et al. (2022) proposes to leverage the CLIP model for different vision-language models across various tasks (e.g., image captioning, visual question answering). In our work, we aim to investigate the effectiveness of multimodal information for Multilingual MMT.
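For reference, the following is a minimal sketch of a CLIP-style symmetric contrastive objective as described above: matching image-text pairs lie on the diagonal of the similarity matrix, and cross-entropy is applied in both directions. The function name, temperature value, and random embeddings in the example are placeholder assumptions.

```python
# Minimal sketch of a CLIP-style symmetric contrastive loss.
# The temperature value and random example embeddings are arbitrary assumptions.
import torch
import torch.nn.functional as F


def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    # image_emb, text_emb: (batch, dim) embeddings of paired images and texts
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # (batch, batch) cosine similarities
    targets = torch.arange(logits.size(0))            # matching pairs on the diagonal
    # Symmetric cross-entropy over image-to-text and text-to-image directions.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2


loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```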
3 Datasets
We introduce two Multilingual MMT benchmark datasets (i.e., M3-Multi30K and M3-AmbigCaps) built from Multi30K (Elliott et al., 2016) and AmbigCaps (Li et al., 2021). Here, we describe the details of M3-Multi30K and M3-AmbigCaps.
Data Construction. The widely-used Multi30K dataset for the MMT task is built on the Flickr30K Entities dataset (Plummer et al., 2017); for each image in Multi30K, one of its English (En) descriptions from Flickr30K Entities is selected.
En: A child is splashing in the water
De: Ein Kind plantscht im Wasser
Cs: Dítě se šplouchá ve vodě
Tr: Bir çocuk suya sıçratıyor
Hi:
Lv: Bērns plunčājas ūdenī
Fr: Un enfant éclabousse dans l'eau
Figure 2: Example of an image with its descriptions in seven different languages.
Language ISO Family Speakers
English En Germanic 400M
German De Germanic 95M
French Fr Romance 250M
Czech Cs Slavic 11M
Hindi Hi Indo-Aryan 800M
Turkish Tr Turkic 65M
Latvian Lv Baltic 2M
Table 1: Languages covered by our proposed M3-Multi30K and M3-AmbigCaps datasets.
Currently, the English description of each image is translated into German (De), French (Fr), and Czech (Cs) (Elliott et al., 2017; Barrault et al., 2018). To support more languages from different language families and with various language distributions for Multilingual MMT, we extend the existing Multi30K dataset with three additional languages, as shown in Table 1; one sample of the M3-Multi30K dataset is shown in Fig. 2.
Specifically, in the annotation process, based on the recent state-of-the-art multilingual machine translation model XLM-R (Conneau et al., 2020), we first translate the English description of each image in Multi30K into Hindi (Hi), Turkish (Tr), and Latvian (Lv). Then, we hire independent native speakers to verify and improve the quality of the translation results in the different languages. In addi-