Beyond English-Centric Bitexts for Better Multilingual Language Representation Learning
arXiv:2210.14867v1 [cs.CL] 26 Oct 2022

Barun Patra, Saksham Singhal, Shaohan Huang,
Zewen Chi, Li Dong, Furu Wei, Vishrav Chaudhary, Xia Song
Microsoft
{bapatra, saksingh, shaohanh, v-zewenchi, lidong1, fuwei, vchaudhary, xiaso}@microsoft.com

Equal Contribution.
Abstract
In this paper, we elaborate upon recipes for building multilingual representation models that are not only competitive with existing state-of-the-art models but are also more parameter efficient, thereby promoting better adoption in resource-constrained scenarios and practical applications. We show that going beyond English-centric bitexts, coupled with a novel sampling strategy aimed at reducing under-utilization of training data, substantially boosts performance across model sizes for both Electra and MLM pre-training objectives. We introduce XY-LENT: X-Y bitext enhanced Language ENcodings using Transformers, which not only achieves state-of-the-art performance on 5 cross-lingual tasks within all model size bands, but is also competitive across bands. Our XY-LENT XL variant outperforms XLM-R XXL and exhibits competitive performance with mT5 XXL while being 5x and 6x smaller respectively. We then show that our proposed method helps ameliorate the curse of multilinguality, with XY-LENT XL achieving 99.3% GLUE performance and 98.5% SQuAD 2.0 performance compared to a SoTA English-only model in the same size band. We then analyze our models' performance on extremely low resource languages and posit that scaling alone may not be sufficient for improving performance in this scenario.
1 Introduction
Recent advancements in Natural Language Processing (NLP) have been a direct consequence of leveraging foundational models (Bommasani et al., 2021), pretrained on large text corpora in a self-supervised fashion. This has also been the case for multilingual NLP, where pre-trained models like multilingual BERT (mBERT) (Devlin, 2018; Devlin et al., 2019), XLM (Conneau and Lample, 2019), XLM-RoBERTa (Conneau et al., 2020), XLM-Electra (Chi et al., 2022) and mT5 (Xue et al., 2021) have all shown non-trivial performance gains, especially in the zero-shot transfer setup, and have been the workhorse for a diverse range of multilingual tasks. Given their ubiquitous applicability in zero-shot downstream scenarios, improving their quality and enabling their usage in resource-constrained applications is also an important vein of research, which we explore in this paper.

Figure 1: The proposed XY-LENT model (green line) achieves SoTA performance within all model size bands and is competitive across larger size bands (the plot shows XNLI accuracy against the number of parameters for XY-LENT, XLM-E, XLM-R and mT5 at Base, Large, XL and XXL sizes). The parameter efficiency of XY-LENT XL particularly stands out: it outperforms XLM-R XXL and is competitive with mT5 XXL while being 5x and 6x smaller than them respectively. We also present the performance of XLM-E, which is used as a baseline in this paper.
A source of improvement for these models has
been leveraging bitext data for better representation
learning (Conneau and Lample,2019;Chi et al.,
2022). Most prior work, however, has focused
on leveraging English-centric (EN-X) bitext data.
Contemporaneously, the related area of Massively
Multilingual Machine Translation (a single model
for translating between different pairs of languages,
e.g., Aharoni et al. (2019); Zhang et al. (2020); Fan
et al. (2021)) has shown tremendous progress, with
Fan et al. (2021) showing that a crucial aspect of
this improvement has been moving beyond EN-X
parallel corpora and leveraging web-based mined
X-Y bitexts spanning 1000s of translation directions
(Schwenk et al.,2021a;El-Kishky et al.,2020;
Schwenk et al.,2021b). This makes a compelling
case to explore if leveraging X-Y bitexts can also
improve multilingual representation learning.
In this work, we introduce XY-LENT (pronounced as "Excellent"): X-Y bitext enhanced Language ENcodings using Transformers. We first identify problems with the commonly used sampling strategy proposed in Fan et al. (2021), showing that it induces sparse sampling distributions leading to under-utilization of data, and thus propose a novel strategy to mitigate this issue (§3.2). We then propose leveraging X-Y bitexts in conjunction with the improved sampling strategy, as well as a VoCAP (Zheng et al., 2021) style sentencepiece vocabulary re-construction, for improving multilingual representation learning (§3.1).
We show that our proposed method improves performance across all model size bands (§6). Additionally, we show that the performance gains hold for both Masked Language Models (MLM) and ELECTRA style models, affording an almost 12x speedup in training for the former (§6.2). We systematically analyse the impact of model scaling with respect to the curse of multilinguality (Conneau et al., 2020) and observe that the gap between current English-only SoTA models and multilingual models can be considerably reduced (§6.3). Our analysis reveals that XY-LENT improves performance across language families (§6.4) and helps reduce the cross-lingual transfer gap in multilingual tasks (§6.5). We then demonstrate that the training dynamics of such models can be used to better understand the underlying datasets, and we use this to find interesting defects in them (§6.6). Finally, we show some limitations of such multilingual representational models vis-à-vis extremely low resource languages, identifying potential shortcomings that are not addressed by scaling such models, as well as issues around catastrophic forgetting in the way current models are used for domain adaptation.
In doing so, we establish state of the art on 5 multilingual downstream tasks (XNLI, PAWS-X, TyDiQA, XQuAD and MLQA) within a model size band, and achieve competitive performance across size bands, thereby showing for the first time (to the best of our knowledge) an interesting notion of parameter efficiency: XY-LENT XL outperforms XLM-R XXL (Goyal et al., 2021) and performs competitively with mT5 XXL (Xue et al., 2021), whilst being 5x and 6x smaller respectively (Figure 1). Furthermore, our proposed model reduces the gap for English-specific tasks: XY-LENT XL achieves 99.3% GLUE performance and 98.5% SQuAD 2.0 performance compared to a SoTA English-only model in the same size band.
2 Related Work
Large scale self-supervised learning has emerged as a prominent way of building cross-lingual language models that can be adapted for numerous multilingual downstream applications. Especially for building multilingual encoder transformer (Vaswani et al., 2017) models, two popular paradigms have been masked language modeling (MLM; Devlin et al. (2019); Conneau et al. (2020)) and pre-training encoders as discriminators (ELECTRA; Clark et al. (2020b); Chi et al. (2022)), with the latter showing considerable compute efficiency. These approaches can further be improved by leveraging parallel corpora in different ways: Conneau and Lample (2019) propose a Translation Language Modeling task (TLM), wherein the model predicts masked tokens in concatenated translation pairs, while Chi et al. (2022) propose a Translation Replaced Token Detection (TRTD) task, an analogous task for ELECTRA-style models. Other approaches include using bitexts to construct code-switched sequences as inputs during pre-training (ALM; Yang et al. (2020)) and for contrastive learning (InfoXLM; Chi et al. (2021a)), or using token-level alignments in parallel data to improve cross-lingual modeling (Hu et al., 2021; Chi et al., 2021b, inter alia). However, all the aforementioned works rely on English-centric bitexts.
Fan et al. (2021) show that moving beyond EN-X bitexts for Massively Multilingual Machine Translation affords substantial improvements over approaches that rely solely on English-centric data (Aharoni et al., 2019; Zhang et al., 2020). The primary factor responsible for this improvement has been the curation of X-Y aligned bitext data, constructed by mining bitexts from publicly available web data (Schwenk et al., 2021a; El-Kishky et al., 2020; Schwenk et al., 2021b). The dataset construction either follows a local mining approach (first aligning documents using heuristics, and then mining parallel bitexts from the aligned documents; used in CCAligned (El-Kishky et al., 2020)), or a global mining approach (all bitexts are embedded in a common vector space, and aligned candidates are then found by looking at the normalized nearest neighbors; used in CCMatrix (Schwenk et al., 2021b)). Due to the added supervision of document alignment, the local mining approaches tend to be less noisy, albeit at the cost of diversity. Fan et al. (2021) also propose a sampling strategy for leveraging the X-Y bitexts, wherein the marginals are constrained to be similar to those used for EN-X bitexts, and show that their proposed method improves over uniform sampling. However, as we show in (§3.2), their proposed strategy has the undesirable artefact of inducing extremely sparse solutions, thereby resulting in data wastage.
3 Leveraging Many-to-Many Bitexts
3.1 Dataset
Prior representation learning works usually consider English-centric (EN-X) bitexts to improve model quality. Given the emergence of mining-based approaches that extract parallel bitexts from large monolingual datasets, yielding approximate translations that are multi-way aligned (the source and target languages are not restricted to be English only), in this work we explore leveraging these many-to-many (X-Y) bitext datasets for better representation learning. We consider two such publicly available datasets: CCMatrix and multiCCAligned.
3.2 Sampling Distribution
A common method for balancing training data in the EN-X framework is a temperature-based exponential sampling approach (Aharoni et al., 2019), wherein the probability of sampling a language is chosen from a temperature-smoothed distribution that downsamples high resource languages whilst upsampling low resource languages. This work was extended by Fan et al. (2021), wherein the authors propose Sinkhorn Temperature sampling: given a joint probability matrix Q across the L x L language pairs (L being the number of unique languages) and the marginal distribution p of the L languages, the authors estimate a sampling distribution P as:

$$\max_{P}\ \operatorname{Tr}(PQ)\qquad \text{s.t.}\quad P\mathbf{1}_{L}=p_{1/T}=P^{\top}\mathbf{1}_{L} \tag{1}$$

where Tr is the trace operator and $p_{1/T}$ denotes the temperature-smoothed version of the marginal p. The primary advantage of this formulation is that P can be efficiently estimated with the Sinkhorn-Knopp algorithm, and it also allows us to set the marginal to be the temperature-sampled distribution, which we know works well in practice. The authors found this to work better than uniform sampling.
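To make the temperature-smoothing step concrete, below is a small NumPy sketch of the standard temperature-based exponential sampling computation described above; the function name, the temperature value, and the toy counts are illustrative assumptions rather than values from this paper.

```python
import numpy as np

def temperature_smoothed_marginal(counts, T=5.0):
    """Exponentiate the empirical language distribution by 1/T and renormalize.

    T > 1 downsamples high-resource languages and upsamples low-resource ones;
    T = 1 recovers proportional sampling, and large T approaches uniform sampling.
    """
    counts = np.asarray(counts, dtype=np.float64)
    p = counts / counts.sum()          # empirical marginal over languages
    p_T = p ** (1.0 / T)               # temperature smoothing
    return p_T / p_T.sum()             # renormalize to a distribution

# Illustrative sentence counts for a high-, mid- and low-resource language.
print(temperature_smoothed_marginal([1_000_000, 50_000, 1_000], T=5.0))
```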
However, in practice, we observed this to generate extremely sparse sampling distributions: Figure 2a shows the sparsity induced by the naive application of Eq. 1.
We note that one potential way of overcoming the above issue is to modify the optimization problem to also maximize the entropy of P. Consequently, we propose the following modified optimization objective:

$$P^{*}=\operatorname*{argmin}_{P}\ \operatorname{Tr}\!\big(P(-\log Q)\big)-H(P)=\operatorname*{argmin}_{P}\ \operatorname{KL}(P\,\|\,Q)\qquad \text{s.t.}\quad P\mathbf{1}_{L}=p_{1/T}=P^{\top}\mathbf{1}_{L} \tag{2}$$

where H(P) denotes the entropy of P and KL(P||Q) denotes the Kullback-Leibler divergence between P and Q.

This can be solved with the Sinkhorn-Knopp algorithm for the entropic-regularized optimal transport problem (Cuturi, 2013), by setting the cost matrix to be $-\log(Q+\epsilon)$ (in practice, since Q can have zero entries, $\epsilon$ is used for smoothing). Since the cost of assigning a non-zero probability value to a zero entry is extremely high ($-\log(\epsilon)$), we never observe an entry of P to be non-zero if its corresponding entry in Q was zero. In addition, since Eq. 2 also maximizes the entropy of P, it encourages its entries to be non-sparse, thereby avoiding the problem present in the solution of Eq. 1. In practice, we did not see this lose out on any data: if Q was non-zero, then P was also non-zero (Figure 2b).

Figure 2: Density plots of the sampling distributions for (a) the M2M-100 sampling strategy and (b) our proposed sampling strategy, for the 21 languages considered in downstream tasks. For a similar plot over all languages, see Figure 6b in the Appendix.
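For illustration, the following is a minimal NumPy sketch, under stated assumptions, of the entropic-regularized Sinkhorn-Knopp procedure described above: the cost matrix is set to -log(Q + eps) and both marginals to the temperature-smoothed distribution. The regularization weight `reg`, the tolerance, and the toy Q and p_T are illustrative choices, not values from the paper.

```python
import numpy as np

def sinkhorn_sampling_distribution(Q, p_T, reg=1.0, eps=1e-12, n_iters=1000, tol=1e-9):
    """Solve min_P <P, -log(Q+eps)> - reg * H(P)  s.t.  P @ 1 = p_T and P.T @ 1 = p_T.

    Q:   (L, L) joint distribution over language pairs (may contain zeros).
    p_T: (L,) temperature-smoothed marginal over the L languages.
    Returns an (L, L) sampling distribution P that is non-zero wherever Q is.
    """
    C = -np.log(Q + eps)                 # cost matrix; zero entries of Q get a huge cost
    K = np.exp(-C / reg)                 # Gibbs kernel, equal to (Q + eps)**(1/reg)
    u = np.ones_like(p_T)
    v = np.ones_like(p_T)
    for _ in range(n_iters):             # alternating Sinkhorn marginal projections
        u_new = p_T / (K @ v)
        v_new = p_T / (K.T @ u_new)
        if np.max(np.abs(u_new - u)) < tol and np.max(np.abs(v_new - v)) < tol:
            u, v = u_new, v_new
            break
        u, v = u_new, v_new
    return u[:, None] * K * v[None, :]   # P = diag(u) K diag(v)

# Toy example with 3 languages: the zero entry of Q stays (near) zero in P,
# while every non-zero entry of Q receives non-zero mass.
Q = np.array([[0.50, 0.20, 0.00],
              [0.20, 0.05, 0.02],
              [0.00, 0.02, 0.01]])
p_T = np.array([0.4, 0.35, 0.25])
P = sinkhorn_sampling_distribution(Q, p_T)
print(P.round(4), P.sum(axis=1), P.sum(axis=0))
```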
3.3 Vocabulary Construction
We construct our vocabulary using SentencePiece models (SPM) (Kudo and Richardson, 2018), which cater to language-specific complexities (tokenization, accent removal, etc.). We increase the vocabulary size to 500k tokens to better serve the varied scripts encountered in the multilingual setting. For this construction, we follow the VoCAP algorithm (Zheng et al., 2021) to quantify the vocabulary capacity for each language separately and account for the varied corpora sizes across languages. Better capacity allocation leads to shorter tokenized sequences (especially for mid and low resource languages), which in turn improves the computational efficiency of the model. Increasing the size of the vocabulary, however, comes at the cost of inflating the model parameters; this is particularly pronounced for XY-LENT Base and XY-LENT Large, where the embedding layer constitutes 80.5% and 62.9% of the total parameters respectively.
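As a rough back-of-the-envelope illustration of this overhead, the sketch below derives the embedding parameter count for a 500k-token vocabulary and the total model size implied by the reported embedding shares; the hidden sizes (768 for Base, 1024 for Large) are assumed typical values for these size bands, not configurations stated in the paper.

```python
# Back-of-the-envelope estimate of how a 500k vocabulary inflates parameter counts.
# Hidden sizes below are assumed (typical Base/Large values), not taken from the paper.
VOCAB_SIZE = 500_000

for name, hidden, emb_share in [("Base", 768, 0.805), ("Large", 1024, 0.629)]:
    emb_params = VOCAB_SIZE * hidden         # token-embedding matrix only
    total_params = emb_params / emb_share    # total implied by the reported embedding share
    print(f"{name}: {emb_params / 1e6:.0f}M embedding params, "
          f"~{total_params / 1e6:.0f}M total ({emb_share:.1%} in embeddings)")
```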
4 Pretraining Details
We follow the XLM-E (Chi et al., 2022) pretraining approach and only introduce a few architectural changes to improve the overall performance of the model. We use the Transformer model (Vaswani et al., 2017) trained with ELECTRA-style (Clark et al., 2020b) replaced token detection (RTD) on both monolingual (MRTD) and bitext (TRTD) data. In this setup, we train two Transformer encoders in conjunction: a generator G and a discriminator D, where the generator G is trained with the masked language modeling objective (MLM; Devlin et al. (2019)) and the discriminator D is trained with the replaced token detection objective (RTD; Clark et al. (2020b)) on all the tokens passing through the generator.

In addition to using the Gated Relative Position Bias introduced in Chi et al. (2022), we do not mask the [CLS] token, and we flip the bitext language order with probability p = 0.5 for the TRTD task.
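To make the TRTD input construction concrete, here is a minimal Python sketch of how a translation pair might be turned into an example for the generator/discriminator pair: the bitext order is flipped with probability 0.5 and the [CLS] token is never masked, as described above. The special-token handling, the 15% masking rate, and the string-token representation are illustrative assumptions rather than the paper's exact configuration.

```python
import random

MASK, CLS, SEP = "[MASK]", "[CLS]", "[SEP]"

def build_trtd_example(src_tokens, tgt_tokens, mask_prob=0.15):
    """Build a (masked input, original input, mask positions) triple for TRTD.

    The translation pair is concatenated into one sequence; with probability 0.5
    the language order is flipped, and the [CLS] token is never masked.
    """
    if random.random() < 0.5:                     # flip bitext language order
        src_tokens, tgt_tokens = tgt_tokens, src_tokens
    original = [CLS] + src_tokens + [SEP] + tgt_tokens + [SEP]

    masked, positions = [], []
    for i, tok in enumerate(original):
        if tok not in (CLS, SEP) and random.random() < mask_prob:
            masked.append(MASK)                   # generator will fill these positions
            positions.append(i)
        else:
            masked.append(tok)
    return masked, original, positions

# The generator predicts tokens at `positions`; its (possibly wrong) samples are
# spliced back in, and the discriminator labels every token as original vs. replaced.
masked, original, positions = build_trtd_example(
    ["a", "sample", "sentence"], ["une", "phrase", "d'exemple"])
print(masked, positions)
```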
5 Experiments
Baselines: We compare the cross-lingual performance of our proposed model against 3 popular cross-lingual models: XLM-R, mT5 and XLM-E (across all model size variations). Note that Chi et al. (2022) use a 250k vocabulary size for XLM-E Base and a 500k vocabulary for their Large and XL variants. As a follow-up, we re-train XLM-E Base with the same vocabulary as used by XY-LENT for a fair comparison. Thus all references to XLM-E Base refer to the re-trained model variant with a 500k vocabulary size.[1] For our downstream English evaluation (§6.3), we compare against the SoTA English model METRO-LM (Bajaj et al., 2022). Note that Bajaj et al. (2022) also train their models in an ELECTRA-style framework, thereby allowing for a fair comparison.
Pretraining Data: For our monolingual data, we follow Chi et al. (2022) and use the CC-100 dataset[2] (Conneau et al., 2020; Wenzek et al., 2020), which contains texts in 100 languages collected from Common Crawl. As mentioned in (§3.1), we explore the utility of the CCMatrix and the multiCCAligned X-Y aligned bitext data. CCMatrix consists of 1015 language pairs (97 unique languages

[1] We also ablate the impact of the vocabulary change, with Table 2 showing that this yields a 1.5 pt gain on XNLI.
[2] http://data.statmt.org/cc-100/