Multilingual Machine Translation (a single model for translating between different pairs of languages, e.g., Aharoni et al. (2019); Zhang et al. (2020); Fan et al. (2021)) has shown tremendous progress, with Fan et al. (2021) showing that a crucial aspect of this improvement has been moving beyond EN-X parallel corpora and leveraging web-mined X-Y bitexts spanning thousands of translation directions (Schwenk et al., 2021a; El-Kishky et al., 2020; Schwenk et al., 2021b). This makes a compelling case for exploring whether leveraging X-Y bitexts can also improve multilingual representation learning.
In this work, we introduce XY-LENT (pronounced as "Excellent"): X-Y bitext enhanced Language ENcodings using Transformers. We first
identify problems with the commonly used sampling strategy proposed in Fan et al. (2021), showing that it induces sparse sampling distributions that lead to under-utilization of data, and thus propose a novel strategy to mitigate this issue (§3.2). We then propose leveraging X-Y bitexts in conjunction with the improved sampling strategy, as well as a VoCAP-style (Zheng et al., 2021) SentencePiece vocabulary reconstruction, to improve multilingual representation learning (§3.1).
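To make the notion of a sampling distribution over bitext pairs concrete, the snippet below gives a minimal, illustrative sketch of vanilla temperature-based sampling over language-pair counts, in the spirit of Conneau et al. (2020); the pair counts shown are hypothetical, and both the strategy of Fan et al. (2021) and the modification proposed in §3.2 differ in their details.

```python
import numpy as np

def pair_sampling_probs(pair_counts, alpha=0.3):
    """Illustrative temperature-based sampling over bitext pairs.

    pair_counts maps (src, tgt) language pairs to the number of available
    bitext sentence pairs. An exponent alpha < 1 flattens the empirical
    distribution so that low-resource pairs are sampled more often than
    their raw counts would suggest.
    """
    pairs = list(pair_counts)
    counts = np.array([pair_counts[p] for p in pairs], dtype=np.float64)
    probs = counts / counts.sum()   # empirical pair distribution
    probs = probs ** alpha          # temperature re-scaling
    probs /= probs.sum()            # re-normalise
    return dict(zip(pairs, probs))

# Hypothetical counts for three translation directions
counts = {("en", "fr"): 40_000_000, ("en", "sw"): 200_000, ("fr", "sw"): 50_000}
print(pair_sampling_probs(counts, alpha=0.3))
```

In this toy example the fr-sw direction, roughly 0.1% of the raw data, receives about 10% of the sampling mass with alpha=0.3, illustrating how such re-weighting shifts probability toward low-resource pairs.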
We show that our proposed method improves per-
formance across all model size bands (§6). Addi-
tionally, we show that the performance gains hold
for both Masked Language Models (MLM) and ELECTRA-style models, affording an almost 12x
speedup in training for the former (§6.2). We sys-
tematically analyse the impact of model scaling
with respect to the curse of multilinguality (Conneau et al., 2020), and observe that the gap between current English-only SoTA models and multilingual
models can be considerably reduced (§6.3). Our
analysis reveals that XY-LENT improves perfor-
mance across language families (§6.4) and helps
reduce the cross-lingual transfer gap in multilingual
tasks (§6.5). We then demonstrate that the training
dynamics of such models can be used to better un-
derstand the underlying datasets, and we use this to find
interesting defects in them (§6.6). Finally, we show
some limitations of such multilingual representational models vis-à-vis extremely low-resource languages, identifying potential shortcomings that are not addressed by scaling such models, as well
as issues around catastrophic forgetting in the way
current models are used for domain adaptation.
In doing so, we establish state of the art on 5 mul-
tilingual downstream tasks (XNLI, PAWS-X, TyDiQA, XQuAD and MLQA) within a model size
band, and achieve competitive performance across
size bands, thereby showing for the first time (to
the best of our knowledge) an interesting notion of
parameter efficiency: XY-LENT XL outperforms XLM-R XXL (Goyal et al., 2021) and performs competitively with mT5 XXL (Xue et al., 2021), whilst being 5x and 6x smaller, respectively (Figure 1).
Furthermore, our proposed model reduces the gap
for English-specific tasks: XY-LENT XL achieves 99.3% GLUE performance and 98.5% SQuAD 2.0 performance compared to a SoTA English-only model in the same size band.
2 Related Work
Large-scale self-supervised learning has emerged
as a prominent way of building cross-lingual lan-
guage models that can be adapted for numer-
ous multilingual downstream applications. Es-
pecially for building multilingual encoder trans-
former (Vaswani et al., 2017) models, two popular
paradigms have been Masked language modeling
(MLM; Devlin et al. (2019); Conneau et al. (2020))
and pre-training encoders as discriminators (ELEC-
TRA; Clark et al. (2020b); Chi et al. (2022)), with
the latter showing considerable compute efficiency.
These approaches can further be improved by lever-
aging parallel corpora in different ways: Conneau
and Lample (2019) propose a Translation Language
Modeling task (TLM) wherein the model predicts
masked tokens in concatenated translation pairs, while Chi et al. (2022) propose a Translation Replaced Token Detection (TRTD) task, an analogous objective for ELECTRA-style models. Other approaches include
using bitexts to construct code-switched sequences
as inputs during pre-training (ALM; Yang et al.
(2020)) and for contrastive learning (InfoXLM; Chi
et al. (2021a)), or using token-level alignments in
parallel data to improve cross-lingual modeling
(Hu et al., 2021; Chi et al., 2021b, inter alia). How-
ever, all the aforementioned works rely on English-
centric bitexts.
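As an illustration of how English-centric bitexts enter such objectives, the following is a minimal sketch of TLM-style input construction: a translation pair is concatenated into a single sequence and tokens on both sides are masked, so the model can attend across languages to recover them. The token names ("<mask>", "</s>"), the masking rate, and the whitespace tokenization are illustrative assumptions rather than the exact setup of Conneau and Lample (2019).

```python
import random

def make_tlm_example(src_tokens, tgt_tokens,
                     mask_token="<mask>", sep_token="</s>", mask_prob=0.15):
    """Illustrative TLM-style example: concatenate a translation pair and
    randomly mask tokens in both languages; masked positions become the
    prediction targets, with the other language available as context."""
    tokens = src_tokens + [sep_token] + tgt_tokens
    inputs, targets = [], []
    for tok in tokens:
        if tok != sep_token and random.random() < mask_prob:
            inputs.append(mask_token)
            targets.append(tok)     # the model must recover the original token
        else:
            inputs.append(tok)
            targets.append(None)    # position is not scored
    return inputs, targets

# Hypothetical whitespace-tokenized translation pair
inputs, targets = make_tlm_example("the cat sat".split(),
                                   "le chat était assis".split())
```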
Fan et al. (2021) show that moving beyond EN-X
bitexts for Massively Multilingual Machine Trans-
lation affords substantial improvements over ap-
proaches that rely solely on English-centric data
(Aharoni et al., 2019; Zhang et al., 2020). The pri-
mary factor responsible for this improvement has
been the curation of X-Y aligned bitext data, con-
structed by mining bitexts from publicly available
web data (Schwenk et al., 2021a; El-Kishky et al.,