
masked or all unmasked. Instead, each individual entity subword is masked 80% of the time. For the remaining 20% of the masking candidates, we experiment with three different replacements. The first, PEP_MRS, corresponds to the conventional 80-10-10 masking strategy, where 10% of the remaining subwords are replaced with Random subwords and the other 10% are kept unchanged. In the second setting, PEP_MS, we remove the 10% Random subword substitution, i.e. we predict the 80% masked subwords and the 10% Same subwords from the masking candidates. In the third setting, PEP_M, we further remove the 10% Same subword prediction, essentially predicting only the masked subwords. An example of PEP is illustrated in Figure 2b.
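For concreteness, the sketch below shows one way the per-subword PEP decision could be implemented. The 80/10/10 thresholds follow the description above, but the function name, the use of -100 as an ignore index for non-predicted positions, and the treatment of the residual 10% of candidates in PEP_MS and PEP_M are illustrative assumptions rather than the exact implementation used in this work.

```python
import random

def pep_mask_entity_subwords(subword_ids, mask_id, vocab_size, variant="MRS"):
    """Apply Partial Entity Prediction (PEP) masking to one entity's subwords.

    Every subword of the entity is a masking candidate: 80% are replaced with
    [MASK]; the remaining 20% are handled according to the variant:
      - "MRS": 10% Random replacement + 10% Same (conventional 80-10-10)
      - "MS" : Random replacement removed; 10% Same subwords are still predicted
      - "M"  : only the masked subwords are predicted

    Labels use -100 for positions that are not prediction targets (the usual
    ignore-index convention in MLM implementations).
    """
    input_ids, labels = [], []
    for sid in subword_ids:
        r = random.random()
        if r < 0.8:
            # 80%: replace with [MASK] and predict the original subword
            input_ids.append(mask_id)
            labels.append(sid)
        elif r < 0.9:
            if variant == "MRS":
                # 10%: replace with a Random subword, still predicted
                input_ids.append(random.randrange(vocab_size))
                labels.append(sid)
            else:
                # "MS": keep the Same subword and predict it; "M": keep it, no prediction
                input_ids.append(sid)
                labels.append(sid if variant == "MS" else -100)
        else:
            # last 10%: kept unchanged; predicted only in the MRS setting
            input_ids.append(sid)
            labels.append(sid if variant == "MRS" else -100)
    return input_ids, labels
```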
Prior work has shown that combining Entity Prediction with MLM is effective for cross-lingual transfer (Jiang et al., 2020); we therefore investigate combining the Entity Prediction objectives with MLM on non-entity subwords. Specifically, when combined with MLM, we lower the entity masking probability (p) to 50% to keep the overall masking percentage roughly the same. Figure 2c illustrates an example of PEP combined with MLM on non-entity subwords. A summary of the masking strategies, along with the corresponding masking percentages, is shown in Table 2.
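Reusing the pep_mask_entity_subwords sketch above, the joint objective could be wired together as follows. The 50% entity masking probability follows the text; the 15% MLM rate on non-entity subwords and the reading of p as a per-entity selection probability are assumptions made for illustration.

```python
import random

def pep_plus_mlm(token_ids, entity_spans, mask_id, vocab_size,
                 entity_p=0.5, mlm_p=0.15, variant="MS"):
    """Combine PEP on entity subwords with MLM on non-entity subwords.

    entity_spans: list of (start, end) index pairs marking entity subwords.
    Each entity is considered for PEP with probability `entity_p` (lowered to
    50% so the overall masking percentage stays roughly the same), while
    non-entity subwords go through standard MLM with probability `mlm_p`.
    """
    input_ids = list(token_ids)
    labels = [-100] * len(token_ids)
    is_entity = [False] * len(token_ids)

    for start, end in entity_spans:
        for i in range(start, end):
            is_entity[i] = True
        if random.random() < entity_p:
            # Entity selected for PEP: mask its subwords individually
            masked, span_labels = pep_mask_entity_subwords(
                token_ids[start:end], mask_id, vocab_size, variant)
            input_ids[start:end] = masked
            labels[start:end] = span_labels

    for i, tid in enumerate(token_ids):
        if not is_entity[i] and random.random() < mlm_p:
            # MLM on non-entity subwords (80-10-10 replacement omitted for brevity)
            input_ids[i] = mask_id
            labels[i] = tid
    return input_ids, labels
```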
3 Experimental Setup
After preparing the ENTITYCS corpus, we further train an XLM-R model with WEP, PEP, MLM, and the joint objectives. We use the sampling strategy proposed by Conneau and Lample (2019), where high-resource languages are down-sampled and low-resource languages are sampled more frequently.
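This strategy draws languages from a multinomial distribution with exponentially smoothed probabilities q_i ∝ p_i^α (α < 1), which boosts low-resource languages. A minimal sketch is given below; the smoothing exponent α = 0.5 shown here is the value reported by Conneau and Lample (2019) and is an assumption for illustration, not restated in this section.

```python
import random

def language_sampling_probs(sentence_counts, alpha=0.5):
    """Exponentially smoothed multinomial over languages (Conneau & Lample, 2019).

    q_i = p_i**alpha / sum_j p_j**alpha, where p_i is the empirical fraction of
    sentences in language i. With alpha < 1, high-resource languages are
    down-sampled and low-resource languages are sampled more frequently.
    """
    total = sum(sentence_counts.values())
    p = {lang: n / total for lang, n in sentence_counts.items()}
    z = sum(v ** alpha for v in p.values())
    return {lang: v ** alpha / z for lang, v in p.items()}

# Example (illustrative counts): English dominates the raw data but is down-sampled.
counts = {"en": 2_000_000, "ur": 100_000, "sw": 50_000}
q = language_sampling_probs(counts, alpha=0.5)
sampled_language = random.choices(list(q), weights=list(q.values()), k=1)[0]
```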
Since recent studies on pre-trained language encoders have shown that semantic features are highlighted in higher layers (Tenney et al., 2019; Rogers et al., 2020), we only train the embedding layer and the last two layers of the model, similarly to Calixto et al. (2021); preliminary experiments in which we updated the entire network showed that the model suffered from catastrophic forgetting. We randomly choose 100 sentences from each language to serve as a validation set, on which we measure the perplexity every 10K training steps. Details of the parameters used for intermediate training can be found in Appendix C.
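One way to restrict training to the embedding layer and the last two Transformer layers with the HuggingFace transformers XLM-R classes is sketched below; freezing parameters via requires_grad and additionally unfreezing the (weight-tied) MLM head are assumptions about how this can be realised, not a description of the exact training code.

```python
from transformers import XLMRobertaForMaskedLM

model = XLMRobertaForMaskedLM.from_pretrained("xlm-roberta-base")

# Freeze the whole network first.
for param in model.parameters():
    param.requires_grad = False

# Re-enable the embedding layer and the last two Transformer layers.
for param in model.roberta.embeddings.parameters():
    param.requires_grad = True
for layer in model.roberta.encoder.layer[-2:]:
    for param in layer.parameters():
        param.requires_grad = True

# The MLM head (decoder weights are tied to the embeddings) is needed for the
# masked prediction objectives, so it is left trainable here as well.
for param in model.lm_head.parameters():
    param.requires_grad = True
```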
3.1 Downstream Tasks
As the ENTITYCS corpus is constructed with code-switching at the entity level, we expect our models to mostly improve entity-centric tasks. We therefore choose the following datasets: WikiAnn (Pan et al., 2017) for NER, X-FACTR (Jiang et al., 2020) for Fact Retrieval, MultiATIS++ (Xu et al., 2020) and MTOP (Li et al., 2021) for Slot Filling, and XL-WiC (Raganato et al., 2020) for WSD; the XL-WiC result reported for prior work is our re-implementation based on https://github.com/pasinit/xlwic-runs. More details on the datasets can be found in Appendix B.
After intermediate training on the ENTITYCS corpus, we evaluate the zero-shot cross-lingual transfer of the models on each task by fine-tuning on task-specific English training data. For NER, we use the checkpoint with the lowest validation set perplexity during intermediate training. Similarly, for the probing dataset X-FACTR (which consists only of a test set), we probe the models with the lowest perplexity and report the maximum accuracy score over all, single-, and multi-token entities between the two decoding methods (independent and confidence-based) proposed in the original paper (Jiang et al., 2020). For the MultiATIS++, MTOP, and XL-WiC datasets, we choose the checkpoints with the best performance on the English validation set, as we observed a performance drop for these tasks at later checkpoints. For all experiments except X-FACTR, we fine-tune models with five random seeds and report the average and standard deviation.
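The evaluation protocol can be summarised as the loop below; fine_tune and evaluate are hypothetical placeholders for task-specific training and scoring, and the seed values are arbitrary.

```python
import statistics

SEEDS = [11, 22, 33, 44, 55]  # illustrative seed values

def zero_shot_transfer(checkpoint, english_train, english_dev, test_sets_by_lang):
    """Fine-tune on English data only, then evaluate every target language."""
    scores = {lang: [] for lang in test_sets_by_lang}
    for seed in SEEDS:
        # fine_tune / evaluate are placeholders for task-specific code
        model = fine_tune(checkpoint, english_train, english_dev, seed=seed)
        for lang, test_set in test_sets_by_lang.items():
            scores[lang].append(evaluate(model, test_set))
    # Report mean and standard deviation over the five runs per language
    return {lang: (statistics.mean(s), statistics.stdev(s))
            for lang, s in scores.items()}
```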
3.2 Pre-Training Languages
Given the size of the ENTITYCS corpus, we primarily select a subset of the total 93 languages that covers most of the languages used in the downstream tasks. This subset contains the 39 languages from WikiAnn, excluding Yoruba, which is not included in the ENTITYCS corpus since we only consider languages XLM-R is pre-trained on. We train
XLM-R-base (Conneau et al., 2020) on this subset, then fine-tune the new checkpoints on the English training set of each dataset and evaluate on all of the available languages.
4 Main Results
Results are reported in Table 3, where we compare models trained on the ENTITYCS corpus with the MLM, WEP, PEP_MS and PEP_MS+MLM masking strategies. For MultiATIS++ and MTOP, we report results of training on Slot Filling only (SF), as well as joint training of Slot Filling and Intent Classification (SF/Intent).