ENTITYCS: Improving Zero-Shot Cross-lingual Transfer
with Entity-Centric Code Switching
Chenxi Whitehouse1,2,*, Fenia Christopoulou2, Ignacio Iacobacci2
1City, University of London
2Huawei Noah’s Ark Lab, London, UK
chenxi.whitehouse@city.ac.uk
{efstathia.christopoulou, ignacio.iacoboacci}@huawei.com
Abstract
Accurate alignment between languages is
fundamental for improving cross-lingual pre-
trained language models (XLMs). Inspired by the natural phenomenon of code-switching (CS) in multilingual speakers, prior work has used CS as an effective data augmentation method that offers language alignment at the word or phrase level, in contrast to the sentence-level alignment offered by parallel instances. Exist-
ing approaches either use dictionaries or par-
allel sentences with word alignment to gen-
erate CS data by randomly switching words
in a sentence. However, such methods can
be suboptimal as dictionaries disregard seman-
tics, and syntax might become invalid after ran-
dom word switching. In this work, we pro-
pose ENTITYCS, a method that focuses on
ENTITY-level Code-Switching to capture fine-
grained cross-lingual semantics without cor-
rupting syntax. We use Wikidata and English
Wikipedia to construct an entity-centric CS
corpus by switching entities to their counter-
parts in other languages. We further propose
entity-oriented masking strategies during in-
termediate model training on the ENTITYCS
corpus for improving entity prediction. Eval-
uation of the trained models on four entity-
centric downstream tasks shows consistent im-
provements over the baseline with a notable in-
crease of 10% in Fact Retrieval. We release the
corpus and models to assist research on code-
switching and enriching XLMs with external
knowledge1.
1 Introduction
Cross-lingual pre-trained Language Models
(XLMs) such as mBERT (Devlin et al.,2019)
and XLM-R (Conneau et al.,2020a), have
* Work conducted as Research Intern at Huawei Noah's Ark Lab, London, UK.
1 Code and models are available at https://github.com/huawei-noah/noah-research/tree/master/NLP/EntityCS.
[Figure: the English sentence "She was studying [[computer science]] and [[electrical engineering]]." is linked to the Wikidata items Q21198 and Q43035; their labels in other languages (e.g. Hindi, Chinese, French) replace the English mentions, each surrounded by <e> ... </e> markers.]
Figure 1: Illustration of generating ENTITYCS sentences from an English sentence extracted from Wikipedia. Entities in double square brackets indicate wikilinks.
achieved state-of-the-art zero-shot cross-lingual
transferability across diverse Natural Language
Understanding (NLU) tasks. Such models
have been particularly enhanced with the use
of bilingual parallel sentences together with
alignment methods (Yang et al.,2020;Chi et al.,
2021a;Hu et al.,2021;Gritta and Iacobacci,
2021;Feng et al.,2022). However, obtaining
high-quality parallel data is expensive, especially
for low-resource languages. Therefore, alternative
data augmentation approaches have been proposed,
one of which is Code Switching.
Code Switching (CS) is a phenomenon in which multilingual speakers alternate words between languages as they speak, and it has been studied
for many years (Gumperz,1977;Khanuja et al.,
2020; Doğruöz et al., 2021). Code-switched sentences consist of words or phrases in different languages; they therefore capture the semantics of
finer-grained cross-lingual expressions compared
to parallel sentences, and have been used for mul-
tilingual intermediate training (Yang et al.,2020)
and fine-tuning (Qin et al.,2020;Krishnan et al.,
2021). Nevertheless, since manually creating large-
scale CS datasets is costly and only a few natural
CS texts exist (Barik et al.,2019;Xiang et al.,2020;
Chakravarthi et al.,2020;Lovenia et al.,2021), re-
search has turned to automatic CS data generation.
Some of those approaches generate CS data via
dictionaries, usually ignoring ambiguity (Qin et al.,
2020;Conneau et al.,2020b). Others require paral-
lel data and an alignment method to match words
or phrases between languages (Yang et al.,2020;
Rizvi et al.,2021). In both cases, what is switched
is chosen randomly, potentially resulting in syntac-
tically odd sentences or switching to words with
little semantic content (e.g. conjunctions).
In contrast, entities carry external knowledge, and replacing an entity with another entity does not alter sentence syntax, which removes the need for any parallel data or word alignment tools. Motivated
by this, we propose ENTITYCS, a Code-Switching
method that focuses on ENTITIES, as illustrated in
Figure 1. Resources such as Wikipedia and Wiki-
data offer rich cross-lingual entity-level informa-
tion, which has been proven beneficial in XLMs
pre-training (Jiang et al.,2020;Calixto et al.,2021;
Jiang et al.,2022). We use such resources to gen-
erate an entity-based CS corpus for the intermedi-
ate training of XLMs. Entities in wikilinks
2
are
switched to their counterparts in other languages
retrieved from the Wikidata Knowledge Base (KB),
thus alleviating ambiguity.
Using the ENTITYCS corpus, we propose a se-
ries of masking strategies that focus on enhancing
Entity Prediction (EP) for better cross-lingual en-
tity representations. We evaluate the models on
entity-centric downstream tasks including Named
Entity Recognition (NER), Fact Retrieval, Slot Fill-
ing (SF) and Word Sense Disambiguation (WSD).
Extensive experiments demonstrate that our models
outperform the baseline on zero-shot cross-lingual
transfer, with +2.8% improvement on NER, sur-
passing the prior best result that uses large amounts
of parallel data, +10.0% on Fact Retrieval, +2.4%
on Slot Filling, and +1.3% on WSD.
The main contributions of this work include:
a) construction of an entity-level CS corpus, EN-
TITYCS, based solely on the English Wikipedia
and Wikidata, mitigating the need for parallel data,
word-alignment methods or dictionaries; b) a se-
ries of intermediate training objectives, focusing
2 https://en.wikipedia.org/wiki/Help:Link#Wikilinks_(internal_links)
STATISTIC COUNT
Languages 93
English Sentences 54,469,214
English Entities 104,593,076
Average Sentence Length 23.37
Average Entities per Sentence 2
CS Sentences per EN Sentence 5
CS Sentences 231,124,422
CS Entities 420,907,878
Table 1: Statistics of the ENTITYCS Corpus.
on Entity Prediction; c) improvement of zero-shot
performance on NER, Fact Retrieval, Slot Filling
and WSD; d) further analysis of model errors, the
behaviour of different masking strategies through-
out training as well as impact across languages,
demonstrating how our models particularly benefit
non-Latin script languages.
2 Methodology
We introduce the details of the ENTITYCS corpus
construction, as well as different entity-oriented
masking strategies used in our experiments.
2.1 ENTITYCS Corpus Construction
Wikipedia is a multilingual online encyclopedia
available in more than 300 languages3. Structured
data of Wikipedia articles are stored in Wikidata,
a multilingual document-oriented database. With
more than six million articles, English Wikipedia
has the potential to serve as a rich resource for
generating CS data. We use English Wikipedia
and leverage entity information from Wikidata to
construct an entity-based CS corpus.
To achieve this, we make use of wikilinks in
Wikipedia, i.e. links from one page to another. We
use the English Wikipedia dump4 and extract raw text with WikiExtractor5 while keeping track of
wikilinks. Wikilinks are typically surrounded by
square brackets in the Wikipedia dump, in the format of [[entity|display text]], where entity is the title
of the target Wikipedia page it links to, and display
text corresponds to what is displayed in the cur-
rent article. We then employ SpaCy6 for sentence
segmentation. Since we are interested in creating
3 https://en.wikipedia.org/wiki/Wikipedia
4 https://dumps.wikimedia.org/enwiki/latest/ (Nov 2021 version).
5 https://github.com/attardi/wikiextractor
6 https://spacy.io/
entity-level CS instances, we only keep sentences
containing at least one wikilink. Sentences longer
than 128 words are also removed. This results in
54.5M English sentences and 104M entities in the
final ENTITYCS corpus.
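To make the wikilink handling concrete, the following Python sketch illustrates this extraction step. It is not the released pipeline: it assumes sentences have already been segmented and still contain raw [[entity|display text]] markup, and the function names are ours.

import re

# Matches [[target]] and [[target|display text]] wikilinks in raw wikitext.
WIKILINK_RE = re.compile(r"\[\[([^|\]]+)(?:\|([^\]]+))?\]\]")

def extract_wikilinks(sentence):
    """Return (plain_text, links) where links = [(entity_title, display, start, end)]."""
    links, pieces, cursor = [], [], 0
    for m in WIKILINK_RE.finditer(sentence):
        pieces.append(sentence[cursor:m.start()])
        title = m.group(1).strip()
        display = (m.group(2) or m.group(1)).strip()
        start = sum(len(p) for p in pieces)
        pieces.append(display)
        links.append((title, display, start, start + len(display)))
        cursor = m.end()
    pieces.append(sentence[cursor:])
    return "".join(pieces), links

def keep_sentence(sentence, max_words=128):
    """Keep sentences with at least one wikilink and at most 128 words."""
    return bool(WIKILINK_RE.search(sentence)) and len(sentence.split()) <= max_words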
As illustrated in Figure 1, given an English sen-
tence with wikilinks, we first map the entity in
each wikilink to its corresponding Wikidata ID and
retrieve its available translations from Wikidata.
For each sentence, we check which languages have
translations for all entities in that sentence, and con-
sider those as candidates for code-switching. We se-
lect 92 target languages in total, which are the over-
lap between the available languages in Wikidata
and XLM-R (Conneau et al.,2020a) (the model
we use for intermediate training). We ensure all
entities are code-switched to the same target lan-
guage in a single sentence, avoiding noise from
including too many languages. To control the size
of the corpus, we generate up to five ENTITYCS
sentences for each English sentence. In particu-
lar, if fewer than five languages have translations
available for all the entities in a sentence, we create
ENTITYCS instances with all of them. Otherwise,
we randomly select five target languages from the
candidates. If no candidate languages can be found,
we do not code-switch the sentence. Instead, we
keep it as part of the English corpus. Finally, we
surround each entity with entity indicators (<e>, </e>). The statistics of the ENTITYCS corpus are
summarised in Table 1. A histogram of the number
of sentences and entities per language is shown
in Appendix A.
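The code-switching step itself can be sketched as follows, again only as an illustration of the procedure described above. Here wikidata_labels is a hypothetical lookup from an entity title to its Wikidata labels per language code, and the released code may differ in details such as whitespace handling around the markers.

import random

def make_entitycs(plain_text, links, wikidata_labels, max_versions=5):
    """Generate the English version plus up to `max_versions` code-switched copies."""
    # Candidate languages: those with a label for *every* entity in the sentence.
    label_sets = [set(wikidata_labels.get(title, {})) for title, *_ in links]
    candidates = set.intersection(*label_sets) - {"en"} if label_sets else set()
    # Keep the English version as well (Table 1 counts EN and CS sentences separately).
    versions = [tag_entities(plain_text, links, labels=None)]
    if candidates:
        targets = random.sample(sorted(candidates), min(max_versions, len(candidates)))
        for lang in targets:
            labels = {title: wikidata_labels[title][lang] for title, *_ in links}
            versions.append(tag_entities(plain_text, links, labels))
    return versions

def tag_entities(text, links, labels):
    """Replace entity surfaces (right to left, keeping offsets valid) and wrap in <e> </e>."""
    out = text
    for title, display, start, end in sorted(links, key=lambda x: -x[2]):
        surface = labels[title] if labels else display
        out = out[:start] + "<e> " + surface + " </e>" + out[end:]
    return out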
2.2 Masking Strategies
To test the effectiveness of intermediate training on
the generated ENTITYCS corpus, we experiment
with several training objectives using an existing
pre-trained language model. Firstly, we employ
the conventional 80-10-10 MLM objective, where
15% of sentence subwords are considered mask-
ing candidates. From those, we replace subwords
with [MASK] 80% of the time, with Random sub-
words (from the entire vocabulary) 10% of the time,
and leave the remaining 10% unchanged (Same).
To integrate entity-level cross-lingual knowledge into the model, we propose Entity Prediction objectives, where we only mask subwords belonging to an entity. By predicting the masked entities in ENTITYCS sentences, we expect the model to capture the semantics of the same entity in different languages. Two different masking strategies are proposed for predicting entities: Whole Entity Prediction (WEP) and Partial Entity Prediction (PEP).

[Figure: subword sequences for "She was studying Informatique and Électrotechnique." under (a) Whole Entity Prediction (WEP), (b) Partial Entity Prediction (PEP), and (c) Partial Entity Prediction with MLM (PEP+MLM).]
Figure 2: Illustration of the proposed masking strategies. Random subwords are chosen from the entire vocabulary, and thus can be from different languages. (c) shows a case where "study" is replaced with a Greek subword.

MASKING            ENTITY (%)            NON-ENTITY (%)
STRATEGY       p  MASK  RND  SAME     p  MASK  RND  SAME
MLM           15    80   10    10    15    80   10    10
WEP          100    80    0    20     0     -    -     -
PEP_MRS      100    80   10    10     0     -    -     -
PEP_MS       100    80    0    10     0     -    -     -
PEP_M        100    80    0     0     0     -    -     -
+MLM
WEP           50    80    0    20    15    80   10    10
PEP_MRS       50    80   10    10    15    80   10    10
PEP_MS        50    80    0    10    15    80   10    10
PEP_M         50    80    0     0    15    80   10    10
Table 2: Summary of the proposed masking strategies. p corresponds to the probability of choosing candidate items (entity/non-entity subwords) for masking. MASK, RND, SAME represent the percentage of replacing a candidate with Mask, Random or the Same item. When combining WEP/PEP with MLM (+MLM), we lower p to 50%.
In WEP, motivated by Sun et al. (2019) where
whole word masking is also adopted, we consider
all the words (and consequently subwords) inside
an entity as masking candidates. Then, 80% of the
time we mask every subword inside an entity, and
20% of the time we keep the subwords intact. Note
that, as our goal is to predict the entire masked
entity, we do not allow replacing with Random
subwords, since it can introduce noise and result
in the model predicting incorrect entities. After
entities are masked, we remove the entity indicators
<e>, </e> from the sentences before feeding them
to the model. Figure 2a shows an example of WEP.
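A minimal sketch of WEP is shown below, assuming the entity subword spans were recorded before the indicators are stripped; the label convention (-100 for positions ignored by the loss) follows common MLM implementations and is illustrative rather than a detail of the released code.

import random

def wep_mask(token_ids, entity_spans, mask_id, p_entity=1.0, p_mask=0.8):
    """Whole Entity Prediction: all subwords of a selected entity are masked together.
    token_ids    : subword ids of one sentence (entity indicators already removed)
    entity_spans : list of (start, end) subword index ranges, one per entity
    """
    inputs = list(token_ids)
    labels = [-100] * len(token_ids)          # -100 = position ignored by the loss
    for start, end in entity_spans:
        if random.random() > p_entity:        # entity not chosen as a masking candidate
            continue
        for i in range(start, end):
            labels[i] = token_ids[i]          # always predict the original subwords
        if random.random() < p_mask:          # 80%: mask every subword of the entity
            for i in range(start, end):
                inputs[i] = mask_id
        # remaining 20%: keep the subwords intact (no Random replacement in WEP)
    return inputs, labels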
For PEP, we also consider all entities as masking
candidates. In contrast to WEP, we do not force
subwords belonging to one entity to be either all
masked or all unmasked. Instead, each individual
entity subword is masked 80% of the time. For
the remaining 20% of the masking candidates, we
experiment with three different replacements. First,
PEP_MRS corresponds to the conventional 80-10-
10 masking strategy, where 10% of the remaining
subwords are replaced with Random subwords and
the other 10% are kept unchanged. In the second
setting, PEP_MS, we remove the 10% Random sub-
words substitution, i.e. we predict the 80% masked
subwords and 10% Same subwords from the mask-
ing candidates. In the third setting, PEP_M, we
further remove the 10% Same subwords prediction,
essentially predicting only the masked subwords.
An example of PEP is illustrated in Figure 2b.
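The three PEP variants differ only in how the remaining 20% of candidate subwords are treated; the sketch below illustrates this under the same assumptions as the WEP example, with the variant selected by an illustrative string argument.

import random

def pep_mask(token_ids, entity_spans, mask_id, vocab_size, variant="MS", p_entity=1.0):
    """Partial Entity Prediction: entity subwords are masked independently.
    variant: "MRS" = 80/10/10 mask/random/same, "MS" = 80/0/10, "M" = mask only.
    """
    inputs = list(token_ids)
    labels = [-100] * len(token_ids)
    for start, end in entity_spans:
        if random.random() > p_entity:
            continue
        for i in range(start, end):
            r = random.random()
            if r < 0.8:                           # 80%: mask and predict
                labels[i] = token_ids[i]
                inputs[i] = mask_id
            elif r < 0.9:
                if variant == "MRS":              # 10%: random subword, still predicted
                    labels[i] = token_ids[i]
                    inputs[i] = random.randrange(vocab_size)
            elif variant in ("MRS", "MS"):        # final 10%: same subword, predicted
                labels[i] = token_ids[i]
            # variant "M" predicts masked subwords only
    return inputs, labels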
Prior work has proven it is effective to combine
Entity Prediction with MLM for cross-lingual trans-
fer (Jiang et al.,2020), therefore we investigate the
combination of the Entity Prediction objectives to-
gether with MLM on non-entity subwords. Specif-
ically, when combined with MLM, we lower the
entity masking probability (
p
) to 50% to roughly
keep the same overall masking percentage. Fig-
ure 2c illustrates an example of PEP combined
with MLM on non-entity subwords. A summary of
the masking strategies is shown in Table 2, along
with the corresponding masking percentages.
3 Experimental Setup
After preparing the ENTITYCS corpus, we further
train an XLM with WEP, PEP, MLM and the
joint objectives. We use the sampling strategy pro-
posed by Conneau and Lample (2019), where high-
resource languages are down-sampled and low-
resource languages get sampled more frequently.
Since recent studies on pre-trained language en-
coders have shown that semantic features are high-
lighted in higher layers (Tenney et al.,2019;Rogers
et al.,2020), we only train the embedding layer
and the last two layers of the model7 (similarly to
Calixto et al. (2021)). We randomly choose 100
sentences from each language to serve as a valida-
tion set, on which we measure the perplexity every
10K training steps. Details of parameters used for
intermediate training can be found in Appendix C.
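The following sketch illustrates both choices, assuming a HuggingFace-style XLM-R checkpoint; the smoothing exponent and the parameter-name matching are illustrative rather than the exact released configuration.

from transformers import XLMRobertaForMaskedLM  # assumes a HuggingFace-style checkpoint

def language_sampling_probs(sentence_counts, alpha=0.3):
    """Exponentially smoothed sampling (Conneau and Lample, 2019):
    p_l is proportional to (n_l / N) ** alpha, which up-samples low-resource
    languages. The value of alpha here is illustrative."""
    total = sum(sentence_counts.values())
    weights = {lang: (n / total) ** alpha for lang, n in sentence_counts.items()}
    z = sum(weights.values())
    return {lang: w / z for lang, w in weights.items()}

def freeze_all_but_embeddings_and_last_layers(model, n_last=2, n_layers=12):
    """Train only the embedding layer and the last `n_last` transformer layers."""
    trainable = ["embeddings"] + [f"layer.{i}." for i in range(n_layers - n_last, n_layers)]
    for name, param in model.named_parameters():
        param.requires_grad = any(tag in name for tag in trainable)

model = XLMRobertaForMaskedLM.from_pretrained("xlm-roberta-base")
freeze_all_but_embeddings_and_last_layers(model)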
3.1 Downstream Tasks
As the ENTITYCS corpus is constructed with code-
switching at the entity level, we expect our mod-
7 Preliminary experiments where we updated the entire network revealed the model suffered from catastrophic forgetting.
els to mostly improve entity-centric tasks. Thus,
we choose the following datasets: WikiAnn (Pan
et al.,2017) for NER, X-FACTR (Jiang et al.,2020)
for Fact Retrieval, MultiATIS++ (Xu et al.,2020)
and MTOP (Li et al.,2021) for Slot Filling, and
XL-WiC (Raganato et al., 2020) for WSD8. More
details on the datasets can be found in Appendix B.
After intermediate training on the ENTITYCS
corpus, we evaluate the zero-shot cross-lingual
transfer of the models on each task by fine-tuning on task-specific English training data. For NER, we
use the checkpoint with the lowest validation set
perplexity during intermediate training. Similarly,
for the probing dataset X-FACTR (only consisting
of a test set), we probe models with the lowest per-
plexity and report the maximum accuracy score for
all, single- and multi-token entities between the
two proposed decoding methods (independent and
confidence-based) from the original paper (Jiang
et al.,2020). For MultiATIS++, MTOP, and XL-
WiC datasets, we choose the checkpoints with the
best performance on the English validation set9.
For all experiments, except X-FACTR, we fine-
tune models with five random seeds and report
average and standard deviation.
3.2 Pre-Training Languages
Given the size of the ENTITYCS corpus, we pri-
marily select a subset of the total 93 languages that covers most of the languages used in the down-
stream tasks. This subset contains 39 languages,
from WikiAnn, excluding Yoruba10. We train
XLM-R-base (Conneau et al., 2020a) on this
subset, then fine-tune the new checkpoints on the
English training set of each dataset and evaluate all
of the available languages.
4 Main Results
Results are reported in Table 3 where we com-
pare models trained on the ENTITYCS corpus with
MLM, WEP, PEP_MS and PEP_MS+MLM masking
strategies. For MultiATIS++ and MTOP, we report
results of training only Slot Filling (SF), as well as
joint training of Slot Filling and Intent Classifica-
tion (SF/Intent).
8 The result reported on XL-WiC for prior work is our re-implementation based on https://github.com/pasinit/xlwic-runs.
9 We observed a performance drop for those tasks at later checkpoints.
10 Yoruba is not included in the ENTITYCS corpus, as we only consider languages XLM-R is pre-trained on.