Multilingual Machine Translation (a single model for translating between different pairs of languages, e.g., Aharoni et al. (2019); Zhang et al. (2020); Fan et al. (2021)) has shown tremendous progress, with Fan et al. (2021) showing that a crucial aspect of this improvement has been moving beyond EN-X parallel corpora and leveraging web-mined X-Y bitexts spanning thousands of translation directions (Schwenk et al., 2021a; El-Kishky et al., 2020; Schwenk et al., 2021b). This makes a compelling case for exploring whether leveraging X-Y bitexts can also improve multilingual representation learning.
In this work, we introduce XY-LENT (pronounced as "Excellent"): X-Y bitext enhanced Language ENcodings using Transformers. We first
identify problems with the commonly used sampling strategy proposed in Fan et al. (2021), showing that it induces sparse sampling distributions that lead to under-utilization of data, and thus propose a novel strategy to mitigate this issue (§3.2). We then propose leveraging X-Y bitexts in conjunction with the improved sampling strategy, as well as a VoCAP-style (Zheng et al., 2021) SentencePiece vocabulary reconstruction, to improve multilingual representation learning (§3.1).
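To make the notion of a sampling distribution over bitext pairs concrete, the snippet below gives a minimal, illustrative sketch of vanilla temperature-based sampling over language-pair counts, in the spirit of Conneau et al. (2020); the pair counts shown are hypothetical, and both the strategy of Fan et al. (2021) and the modification proposed in §3.2 differ in their details.

```python
import numpy as np

def pair_sampling_probs(pair_counts, alpha=0.3):
    """Illustrative temperature-based sampling over bitext pairs.

    pair_counts maps (src, tgt) language pairs to the number of available
    bitext sentence pairs. An exponent alpha < 1 flattens the empirical
    distribution so that low-resource pairs are sampled more often than
    their raw counts would suggest.
    """
    pairs = list(pair_counts)
    counts = np.array([pair_counts[p] for p in pairs], dtype=np.float64)
    probs = counts / counts.sum()   # empirical pair distribution
    probs = probs ** alpha          # temperature re-scaling
    probs /= probs.sum()            # re-normalise
    return dict(zip(pairs, probs))

# Hypothetical counts for three translation directions
counts = {("en", "fr"): 40_000_000, ("en", "sw"): 200_000, ("fr", "sw"): 50_000}
print(pair_sampling_probs(counts, alpha=0.3))
```

In this toy example the fr-sw direction, roughly 0.1% of the raw data, receives about 10% of the sampling mass with alpha=0.3, illustrating how such re-weighting shifts probability toward low-resource pairs.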
We show that our proposed method improves per-
formance across all model size bands (§6). Addi-
tionally, we show that the performance gains hold
for both Masked Language Models (MLM) and ELECTRA-style models, affording an almost 12x
speedup in training for the former (§6.2). We sys-
tematically analyse the impact of model scaling
with respect to the curse of multilinguality (Conneau et al., 2020), and observe that the gap between current English-only SoTA models and multilingual
models can be considerably reduced (§6.3). Our
analysis reveals that XY-LENT improves perfor-
mance across language families (§6.4) and helps
reduce the cross-lingual transfer gap in multilingual
tasks (§6.5). We then demonstrate that the training
dynamics of such models can be used to better un-
derstand the underlying datasets, and we use this to find
interesting defects in them (§6.6). Finally, we show
some limitations of such multilingual representational models vis-à-vis extremely low-resource languages, identifying potential shortcomings that are not addressed by scaling such models, as well
as issues around catastrophic forgetting in the way
current models are used for domain adaptation.
In doing so, we establish state of the art on 5 mul-
tilingual downstream tasks (XNLI, PAWS-X, TyDiQA, XQuAD and MLQA) within a model size
band, and achieve competitive performance across
size bands, thereby showing for the first time (to
the best of our knowledge) an interesting notion of
parameter efficiency: XY-LENT XL outperforms XLM-R XXL (Goyal et al., 2021) and performs competitively with mT5 XXL (Xue et al., 2021), whilst being 5x and 6x smaller, respectively (Figure 1).
Furthermore, our proposed model reduces the gap
for English-specific tasks: XY-LENT XL achieves 99.3% GLUE performance and 98.5% SQuAD 2.0 performance compared to a SoTA English-only model in the same size band.
2 Related Work
Large-scale self-supervised learning has emerged
as a prominent way of building cross-lingual lan-
guage models that can be adapted for numer-
ous multilingual downstream applications. Es-
pecially for building multilingual encoder trans-
former (Vaswani et al., 2017) models, two popular
paradigms have been Masked language modeling
(MLM; Devlin et al. (2019); Conneau et al. (2020))
and pre-training encoders as discriminators (ELEC-
TRA; Clark et al. (2020b); Chi et al. (2022)), with
the latter showing considerable compute efficiency.
These approaches can further be improved by lever-
aging parallel corpora in different ways: Conneau
and Lample (2019) propose a Translation Language
Modeling task (TLM) wherein the model predicts
masked tokens in concatenated translation pairs, while Chi et al. (2022) propose a Translation Replaced Token Detection (TRTD) task, an analogous objective for ELECTRA-style models. Other approaches include
using bitexts to construct code-switched sequences
as inputs during pre-training (ALM; Yang et al.
(2020)) and for contrastive learning (InfoXLM; Chi
et al. (2021a)), or using token-level alignments in
parallel data to improve cross-lingual modeling
(Hu et al., 2021; Chi et al., 2021b, inter alia). How-
ever, all the aforementioned works rely on English-
centric bitexts.
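As an illustration of how English-centric bitexts enter such objectives, the following is a minimal sketch of TLM-style input construction: a translation pair is concatenated into a single sequence and tokens on both sides are masked, so the model can attend across languages to recover them. The token names ("<mask>", "</s>"), the masking rate, and the whitespace tokenization are illustrative assumptions rather than the exact setup of Conneau and Lample (2019).

```python
import random

def make_tlm_example(src_tokens, tgt_tokens,
                     mask_token="<mask>", sep_token="</s>", mask_prob=0.15):
    """Illustrative TLM-style example: concatenate a translation pair and
    randomly mask tokens in both languages; masked positions become the
    prediction targets, with the other language available as context."""
    tokens = src_tokens + [sep_token] + tgt_tokens
    inputs, targets = [], []
    for tok in tokens:
        if tok != sep_token and random.random() < mask_prob:
            inputs.append(mask_token)
            targets.append(tok)     # the model must recover the original token
        else:
            inputs.append(tok)
            targets.append(None)    # position is not scored
    return inputs, targets

# Hypothetical whitespace-tokenized translation pair
inputs, targets = make_tlm_example("the cat sat".split(),
                                   "le chat était assis".split())
```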
Fan et al. (2021) show that moving beyond EN-X
bitexts for Massively Multilingual Machine Trans-
lation affords substantial improvements over ap-
proaches that rely solely on English-centric data
(Aharoni et al., 2019; Zhang et al., 2020). The pri-
mary factor responsible for this improvement has
been the curation of X-Y aligned bitext data, con-
structed by mining bitexts from publicly available
web data (Schwenk et al., 2021a; El-Kishky et al.,