
masked or all unmasked. Instead, each individual entity subword is masked 80% of the time. For the remaining 20% of the masking candidates, we experiment with three different replacements. The first, PEP_MRS, corresponds to the conventional 80-10-10 masking strategy, where 10% of the remaining subwords are replaced with Random subwords and the other 10% are kept unchanged. In the second setting, PEP_MS, we remove the 10% Random subword substitution, i.e. we predict the 80% masked subwords and the 10% Same subwords from the masking candidates. In the third setting, PEP_M, we further remove the 10% Same subword prediction, essentially predicting only the masked subwords. An example of PEP is illustrated in Figure 2b.
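For concreteness, the sketch below shows one way the per-subword PEP decision could be implemented. The 80/10/10 thresholds follow the description above, but the function name, the use of -100 as an ignore index for non-predicted positions, and the treatment of the residual 10% of candidates in PEP_MS and PEP_M are illustrative assumptions rather than the exact implementation used in this work.

```python
import random

def pep_mask_entity_subwords(subword_ids, mask_id, vocab_size, variant="MRS"):
    """Apply Partial Entity Prediction (PEP) masking to one entity's subwords.

    Every subword of the entity is a masking candidate: 80% are replaced with
    [MASK]; the remaining 20% are handled according to the variant:
      - "MRS": 10% Random replacement + 10% Same (conventional 80-10-10)
      - "MS" : Random replacement removed; 10% Same subwords are still predicted
      - "M"  : only the masked subwords are predicted

    Labels use -100 for positions that are not prediction targets (the usual
    ignore-index convention in MLM implementations).
    """
    input_ids, labels = [], []
    for sid in subword_ids:
        r = random.random()
        if r < 0.8:
            # 80%: replace with [MASK] and predict the original subword
            input_ids.append(mask_id)
            labels.append(sid)
        elif r < 0.9:
            if variant == "MRS":
                # 10%: replace with a Random subword, still predicted
                input_ids.append(random.randrange(vocab_size))
                labels.append(sid)
            else:
                # "MS": keep the Same subword and predict it; "M": keep it, no prediction
                input_ids.append(sid)
                labels.append(sid if variant == "MS" else -100)
        else:
            # last 10%: kept unchanged; predicted only in the MRS setting
            input_ids.append(sid)
            labels.append(sid if variant == "MRS" else -100)
    return input_ids, labels
```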
Prior work has shown that combining Entity Prediction with MLM is effective for cross-lingual transfer (Jiang et al., 2020); we therefore investigate combining the Entity Prediction objectives with MLM on non-entity subwords. Specifically, when combined with MLM, we lower the entity masking probability (p) to 50% to keep the overall masking percentage roughly the same. Figure 2c illustrates an example of PEP combined with MLM on non-entity subwords. A summary of the masking strategies, along with the corresponding masking percentages, is shown in Table 2.
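Reusing the pep_mask_entity_subwords sketch above, the joint objective could be wired together as follows. The 50% entity masking probability follows the text; the 15% MLM rate on non-entity subwords and the reading of p as a per-entity selection probability are assumptions made for illustration.

```python
import random

def pep_plus_mlm(token_ids, entity_spans, mask_id, vocab_size,
                 entity_p=0.5, mlm_p=0.15, variant="MS"):
    """Combine PEP on entity subwords with MLM on non-entity subwords.

    entity_spans: list of (start, end) index pairs marking entity subwords.
    Each entity is considered for PEP with probability `entity_p` (lowered to
    50% so the overall masking percentage stays roughly the same), while
    non-entity subwords go through standard MLM with probability `mlm_p`.
    """
    input_ids = list(token_ids)
    labels = [-100] * len(token_ids)
    is_entity = [False] * len(token_ids)

    for start, end in entity_spans:
        for i in range(start, end):
            is_entity[i] = True
        if random.random() < entity_p:
            # Entity selected for PEP: mask its subwords individually
            masked, span_labels = pep_mask_entity_subwords(
                token_ids[start:end], mask_id, vocab_size, variant)
            input_ids[start:end] = masked
            labels[start:end] = span_labels

    for i, tid in enumerate(token_ids):
        if not is_entity[i] and random.random() < mlm_p:
            # MLM on non-entity subwords (80-10-10 replacement omitted for brevity)
            input_ids[i] = mask_id
            labels[i] = tid
    return input_ids, labels
```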
3 Experimental Setup
After preparing the ENTITYCS corpus, we further train an XLM-R model with WEP, PEP, MLM, and the joint objectives. We use the sampling strategy proposed by Conneau and Lample (2019), where high-resource languages are down-sampled and low-resource languages are sampled more frequently.
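This strategy draws languages from a multinomial distribution with exponentially smoothed probabilities q_i ∝ p_i^α (α < 1), which boosts low-resource languages. A minimal sketch is given below; the smoothing exponent α = 0.5 shown here is the value reported by Conneau and Lample (2019) and is an assumption for illustration, not restated in this section.

```python
import random

def language_sampling_probs(sentence_counts, alpha=0.5):
    """Exponentially smoothed multinomial over languages (Conneau & Lample, 2019).

    q_i = p_i**alpha / sum_j p_j**alpha, where p_i is the empirical fraction of
    sentences in language i. With alpha < 1, high-resource languages are
    down-sampled and low-resource languages are sampled more frequently.
    """
    total = sum(sentence_counts.values())
    p = {lang: n / total for lang, n in sentence_counts.items()}
    z = sum(v ** alpha for v in p.values())
    return {lang: v ** alpha / z for lang, v in p.items()}

# Example (illustrative counts): English dominates the raw data but is down-sampled.
counts = {"en": 2_000_000, "ur": 100_000, "sw": 50_000}
q = language_sampling_probs(counts, alpha=0.5)
sampled_language = random.choices(list(q), weights=list(q.values()), k=1)[0]
```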
Since recent studies on pre-trained language encoders have shown that semantic features are highlighted in higher layers (Tenney et al., 2019; Rogers et al., 2020), we only train the embedding layer and the last two layers of the model, similarly to Calixto et al. (2021); preliminary experiments in which we updated the entire network showed that the model suffered from catastrophic forgetting. We randomly choose 100 sentences from each language to serve as a validation set, on which we measure the perplexity every 10K training steps. Details of the parameters used for intermediate training can be found in Appendix C.
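One way to restrict training to the embedding layer and the last two Transformer layers with the HuggingFace transformers XLM-R classes is sketched below; freezing parameters via requires_grad and additionally unfreezing the (weight-tied) MLM head are assumptions about how this can be realised, not a description of the exact training code.

```python
from transformers import XLMRobertaForMaskedLM

model = XLMRobertaForMaskedLM.from_pretrained("xlm-roberta-base")

# Freeze the whole network first.
for param in model.parameters():
    param.requires_grad = False

# Re-enable the embedding layer and the last two Transformer layers.
for param in model.roberta.embeddings.parameters():
    param.requires_grad = True
for layer in model.roberta.encoder.layer[-2:]:
    for param in layer.parameters():
        param.requires_grad = True

# The MLM head (decoder weights are tied to the embeddings) is needed for the
# masked prediction objectives, so it is left trainable here as well.
for param in model.lm_head.parameters():
    param.requires_grad = True
```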
3.1 Downstream Tasks
As the ENTITYCS corpus is constructed with code-switching at the entity level, we expect our models to mostly improve entity-centric tasks. We therefore choose the following datasets: WikiAnn (Pan et al., 2017) for NER, X-FACTR (Jiang et al., 2020) for Fact Retrieval, MultiATIS++ (Xu et al., 2020) and MTOP (Li et al., 2021) for Slot Filling, and XL-WiC (Raganato et al., 2020) for WSD; the XL-WiC result reported for prior work is our re-implementation based on https://github.com/pasinit/xlwic-runs. More details on the datasets can be found in Appendix B.
After intermediate training on the ENTITYCS corpus, we evaluate the zero-shot cross-lingual transfer of the models on each task by fine-tuning on task-specific English training data. For NER, we use the checkpoint with the lowest validation set perplexity during intermediate training. Similarly, for the probing dataset X-FACTR (which consists only of a test set), we probe the models with the lowest perplexity and report the maximum accuracy score over all, single-, and multi-token entities between the two decoding methods (independent and confidence-based) proposed in the original paper (Jiang et al., 2020). For the MultiATIS++, MTOP, and XL-WiC datasets, we choose the checkpoints with the best performance on the English validation set, as we observed a performance drop for these tasks at later checkpoints. For all experiments except X-FACTR, we fine-tune models with five random seeds and report the average and standard deviation.
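The evaluation protocol can be summarised as the loop below; fine_tune and evaluate are hypothetical placeholders for task-specific training and scoring, and the seed values are arbitrary.

```python
import statistics

SEEDS = [11, 22, 33, 44, 55]  # illustrative seed values

def zero_shot_transfer(checkpoint, english_train, english_dev, test_sets_by_lang):
    """Fine-tune on English data only, then evaluate every target language."""
    scores = {lang: [] for lang in test_sets_by_lang}
    for seed in SEEDS:
        # fine_tune / evaluate are placeholders for task-specific code
        model = fine_tune(checkpoint, english_train, english_dev, seed=seed)
        for lang, test_set in test_sets_by_lang.items():
            scores[lang].append(evaluate(model, test_set))
    # Report mean and standard deviation over the five runs per language
    return {lang: (statistics.mean(s), statistics.stdev(s))
            for lang, s in scores.items()}
```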
3.2 Pre-Training Languages
Given the size of the ENTITYCS corpus, we primarily select a subset of the total 93 languages that covers most of the languages used in the downstream tasks. This subset contains the 39 languages from WikiAnn, excluding Yoruba, which is not included in the ENTITYCS corpus since we only consider languages XLM-R is pre-trained on. We train
XLM-R-base (Conneau et al., 2020) on this subset, then fine-tune the new checkpoints on the English training set of each dataset and evaluate on all of the available languages.
4 Main Results
Results are reported in Table 3, where we compare models trained on the ENTITYCS corpus with the MLM, WEP, PEP_MS and PEP_MS+MLM masking strategies. For MultiATIS++ and MTOP, we report results of training on Slot Filling only (SF), as well as joint training of Slot Filling and Intent Classification (SF/Intent).