
Data                             en→fr       en→es
Europarl                         2,007,723   1,965,734
IWSLT                              275,085     265,625
Combined (after preprocessing)   2,155,543   2,119,686
Regular                          2,152,716   2,116,889
Idiom-train                          1,327       1,312
Idiom-test                           1,383       1,373
WMT-test                             3,003       3,000
IWSLT-test                           2,632       2,502

Table 1: Dataset statistics
use the extracted idiom-test data per language pair.
To generate the word alignments for APT-Eval, we trained a fast-align (Dyer et al., 2013) model on each language pair's training data. For decoding, we use beam search with a beam size of 5, and evaluate all models using BLEU (Papineni et al., 2002) computed with SacreBLEU (Post, 2018).
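For reference, the BLEU computation can be reproduced with the sacrebleu Python package as in the minimal sketch below; the file names are hypothetical placeholders (fast-align itself is a separate command-line tool and is not shown).

```python
# Minimal sketch: corpus-level BLEU with sacrebleu (file names are hypothetical).
import sacrebleu

with open("hyp.detok.fr") as f:          # detokenized system outputs
    hypotheses = [line.strip() for line in f]
with open("ref.detok.fr") as f:          # detokenized references
    references = [line.strip() for line in f]

# corpus_bleu takes a list of hypotheses and a list of reference streams.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {bleu.score:.2f}")
```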
Preprocessing
We first filter out sentence pairs with more than 80 words or with a length ratio over 1.5. Then, we tokenize the remaining sentences using SentencePiece (SPM; Kudo and Richardson, 2018), specifically the unigram model with coverage of 0.9999. For the randomly initialized models, we train SPM models with a joint vocabulary of 60K symbols on the concatenation of the source and target sides of the regular training data. For the mBART fine-tuning experiments, we use the SPM model of mBART (250K symbols).
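As an illustration, the sketch below implements the filtering and joint SPM training described above using the sentencepiece Python bindings. The thresholds and vocabulary size follow the text; file names are placeholders, and treating the footnoted coverage value as character coverage is our assumption.

```python
# Sketch of the preprocessing steps described above (file names are placeholders).
import sentencepiece as spm

def keep_pair(src: str, tgt: str, max_len: int = 80, max_ratio: float = 1.5) -> bool:
    """Drop pairs with more than 80 words or a length ratio over 1.5."""
    s, t = len(src.split()), len(tgt.split())
    if s == 0 or t == 0 or max(s, t) > max_len:
        return False
    return max(s, t) / min(s, t) <= max_ratio

# Filter the parallel data and pool both sides into one file for SPM training.
with open("train.en") as f_src, open("train.fr") as f_tgt, \
     open("train.filtered.both", "w") as f_out:
    for src, tgt in zip(f_src, f_tgt):
        if keep_pair(src, tgt):
            f_out.write(src)
            f_out.write(tgt)

# Joint unigram SPM model with a 60K vocabulary over source and target text.
spm.SentencePieceTrainer.train(
    input="train.filtered.both",
    model_prefix="spm.joint",
    vocab_size=60000,
    model_type="unigram",
    character_coverage=0.9999,  # assumption: the footnoted coverage is character coverage
)
```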
3.2 Models
Besides training models from scratch, we also investigate how pretraining on monolingual data affects idiom translation. Such pretraining yields substantial improvements in generic translation quality (Lample and Conneau, 2019; Song et al., 2019; Liu et al., 2020). However, it is not obvious whether monolingual data can help idiom translation, as it does not contain any examples of how to translate an idiom from one language into another.
We fine-tune mBART (Liu et al., 2020), which is pretrained on monolingual data from many languages. We hypothesize that one way multilingual pretraining can help is by bootstrapping over the source- and target-language contexts in which idioms occur. We also consider injecting different types of noise during fine-tuning, to corrupt the (encoder or decoder) input context and measure the effects on the targeted evaluation metrics. Specifically, we use source-side word masking and replacement (Baziotis et al., 2021) and
target-side word-replacement noise (Voita et al., 2021). In our experiments, “random” denotes a randomly initialized model, while “mBART” stands for using mBART as initialization. For noisy fine-tuning, we train the following variants: “mBART+mask”, where we mask 10% of the source tokens; “mBART+replace (enc)”, where we replace 10% of the source tokens with random ones; and “mBART+replace (dec)”, where we replace 10% of the target tokens with random ones.
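A simplified token-level sketch of the three noising schemes is given below. The 10% rate follows the text; the mask symbol and the toy vocabulary are illustrative assumptions, not the authors' exact implementation.

```python
# Illustrative token-level noising for the fine-tuning variants.
import random

MASK = "<mask>"  # assumed mask symbol

def mask_tokens(tokens, rate=0.10):
    """'mBART+mask': mask a fraction of the source tokens."""
    return [MASK if random.random() < rate else t for t in tokens]

def replace_tokens(tokens, vocab, rate=0.10):
    """'mBART+replace (enc/dec)': replace a fraction of tokens with random vocabulary items."""
    return [random.choice(vocab) if random.random() < rate else t for t in tokens]

src = "he kicked the bucket last night".split()
toy_vocab = ["cat", "run", "blue", "house", "quickly"]  # toy stand-in for the SPM vocabulary

print(mask_tokens(src))                 # source-side masking      ("mBART+mask")
print(replace_tokens(src, toy_vocab))   # source- or target-side replacement ("mBART+replace")
```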
Model Configuration
For a fair comparison, the randomly initialized models use the same architecture as mBART. Specifically, the models are based on the Transformer architecture, with 12 encoder and 12 decoder layers, an embedding size of 1024, and 16 self-attention heads. Our code is based on the official mBART implementation in Fairseq.
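For concreteness, the architecture hyper-parameters above can be summarized as follows; this is only an illustrative summary, not the actual configuration object, and the feed-forward size is our assumption based on mBART-large.

```python
# Illustrative summary of the architecture hyper-parameters (mirrors mBART-large).
from dataclasses import dataclass

@dataclass
class ModelConfig:
    encoder_layers: int = 12
    decoder_layers: int = 12
    embed_dim: int = 1024
    attention_heads: int = 16
    ffn_dim: int = 4096  # assumption: standard mBART-large feed-forward size

print(ModelConfig())
```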
Optimization
We optimized our models using Adam (Kingma and Ba, 2015) with β1 = 0.9, β2 = 0.999, and ε = 1e-6. For the random-initialization experiments, the models were trained for 140K updates with batches of 24K tokens, using a learning rate of 1e-4 with a linear warm-up of 4K steps, followed by inverse square root decay. For the mBART-initialization experiments, the models were trained for 140K updates with batches of 12K tokens, using a fixed learning rate of 3e-5 with a linear warm-up of 4K steps. In all experiments, we applied dropout of 0.3, attention dropout of 0.1, and label smoothing of 0.1. For model selection, we evaluated each model every 5K updates on the dev set and selected the one with the best BLEU.
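The two learning-rate schedules can be sketched as simple functions of the update step. The constants follow the text; reading the decay as inverse square root is our interpretation of the description.

```python
# Sketch of the two learning-rate schedules described above.
def lr_random_init(step, base_lr=1e-4, warmup=4000):
    """Linear warm-up to base_lr, then inverse square root decay."""
    if step < warmup:
        return base_lr * step / warmup
    return base_lr * (warmup / step) ** 0.5

def lr_mbart_init(step, base_lr=3e-5, warmup=4000):
    """Linear warm-up to base_lr, then a fixed learning rate."""
    return base_lr * min(1.0, step / warmup)

for s in (1000, 4000, 16000, 140000):
    print(s, lr_random_init(s), lr_mbart_init(s))
```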
3.3 Results
In this section, for brevity, we discuss a subset of our results, in particular our experiments in en→fr. Results for en→es are consistent with en→fr and are included in Appendix B. Table 2 summarizes all of our main results. Besides global evaluation using BLEU (§3.3.2) on diverse test sets, we also consider two targeted evaluation methods (§3.3.1) that focus on how the idioms are translated, using our idiom-test set. For the upsampling split, we upsample the idiom-train data 20x. We also experimented with 100x upsampling, but models started to exhibit overfitting effects (see §B, §D).
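The upsampled split can be built by simply repeating the idiom-train pairs before shuffling, as in the rough sketch below; the file names are hypothetical placeholders.

```python
# Sketch: build the 20x-upsampled training split (file names are placeholders).
import random

with open("regular.train.tsv") as f:
    regular = f.readlines()
with open("idiom.train.tsv") as f:
    idioms = f.readlines()

upsampled = regular + idioms * 20   # repeat the idiom-train pairs 20 times
random.shuffle(upsampled)

with open("upsampled.train.tsv", "w") as f:
    f.writelines(upsampled)
```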
3.3.1 Targeted Evaluation
In targeted evaluation, we focus only on how models translate the source-side idioms. We present results on our proposed LitTER metric and on APT-