Machine Translation (Multilingual MMT) task to achieve translations for multiple languages using one single model.
To eliminate the above limitations, we propose a simple and effective LVP-M3 method, which consists of Token Encoding, Language-aware Visual Prompt Generation (LVPG), and Language Translation. Specifically, in the token encoding stage, we use a pre-trained vision encoder to extract the visual tokens and, following Johnson et al. (2017), a Transformer encoder to encode the textual tokens. In LVPG, inspired by Yang et al. (2019) and Tian et al. (2020), a controller network (see Fig. 3) is leveraged to dynamically generate the parameters of a mapping network conditioned on the target language, and the mapping network then outputs the language-aware visual prompts.
After that, during the language translation stage, following works such as ViLBERT (Lu et al., 2019), we utilize a co-Transformer to generate the vision-guided language tokens. Then a Transformer decoder is adopted to predict the translation results.
Extensive experiments are conducted on our proposed benchmark datasets for LVP-M3. Results show that our model achieves state-of-the-art performance in all translation directions, notably outperforming the text-only multilingual model by 4.3 BLEU scores on average.
The contributions of this work are summarized as follows:
• We first propose the Multilingual Multimodal Machine Translation (Multilingual MMT) task to handle translations for multiple language pairs, which investigates the effect of the vision modality on multilingual translation and reduces the computation costs of existing MMT methods for multiple languages.
• For Multilingual MMT, we propose an effective language-aware visual prompt generation strategy that produces different visual prompts for different target languages based on the vision modality and the type of the target language.
• We establish two Multilingual MMT benchmark datasets to nourish further research on Multilingual MMT, and extensive experiments on these datasets demonstrate the effectiveness of our proposed LVP-M3 method.
2 Related Works
Multimodal Machine Translation.
The multimodal context plays a key role in Multimodal Machine Translation (MMT). Recent MMT methods can be divided into three categories: (1) Using global visual features directly (Calixto and Liu, 2017). For instance, Huang et al. (2016) propose to concatenate global and regional visual features with source sequences. (2) Exploiting visual features via an attention scheme (Libovický and Helcl, 2017; Helcl et al., 2018). Calixto et al. (2017) introduce the visual features into the MMT model by using an independent attention module. (3) Combining other vision tasks with the translation task by multitask learning (Calixto et al., 2019; Yin et al., 2020). Elliott and Kádár (2017) decompose multimodal translation into two sub-tasks (i.e., translation and visual grounding). Recently, Huang et al. (2020) focus on the unsupervised setting for MMT, utilizing pseudo visual pivoting and visual content to improve the cross-lingual alignments in latent space. In contrast, LVP-M3 considers the fully supervised multilingual setting, mapping vision embeddings into different feature spaces so that one MT model can handle translations for multiple languages. Besides, reducing computation cost is vital for many tasks (Liu et al., 2021, 2022c,a), and we target the Multilingual MMT task with one single model for efficiency.
Multilingual Language Models.
Pre-trained multilingual Transformer-based language models (e.g., mBERT (Kenton and Toutanova, 2019) and XLM-R (Conneau et al., 2020)) utilize the same pre-training strategies as their respective monolingual counterparts (e.g., BERT (Kenton and Toutanova, 2019) and RoBERTa (Liu et al., 2019)), and are pre-trained via the masked language modeling (MLM) objective. Artetxe et al. (2020) propose a method to transfer monolingual representations to new languages in an unsupervised fashion, providing new insights into the generalization abilities of multilingual models. Hu et al. (2020) introduce the Cross-lingual Transfer Evaluation of Multilingual Encoders (XTREME) benchmark to evaluate cross-lingual generalization capabilities, and Karthikeyan et al. (2020) provide a comprehensive study of the contribution of different components in M-BERT to its cross-lingual
ability. Rust et al. (2021) show that monolingually adapted tokenizers can robustly improve the monolingual performance of multilingual models.