
A Simple and Effective Method to Improve Zero-Shot Cross-Lingual Transfer Learning
Kunbo Ding1∗, Weijie Liu1,2†, Yuejian Fang1, Weiquan Mao2, Zhe Zhao2,
Tao Zhu2, Haoyan Liu2, Rong Tian2, Yiren Chen2
1Peking University, Beijing, China  2Tencent Research, Beijing, China
kunbo_ding@stu.pku.edu.cn, dataliu@pku.edu.cn, fangyj@ss.pku.edu.cn
{weiquanmao, nlpzhezhao, mardozhu, haoyanliu, rometian, yirenchen}@tencent.com
∗ Contribution during internship at Tencent Inc.
† Corresponding author: Weijie Liu.
Abstract
Existing zero-shot cross-lingual transfer methods rely on parallel corpora or bilingual dictionaries, which are expensive and impractical for low-resource languages. To disengage from these dependencies, researchers have explored training multilingual models on English-only resources and transferring them to low-resource languages. However, the effectiveness of this approach is limited by the gap between the embedding clusters of different languages. To address this issue, we propose Embedding-Push, Attention-Pull, and Robust targets to transfer English embeddings to virtual multilingual embeddings without semantic loss, thereby improving cross-lingual transferability. Experimental results on mBERT and XLM-R demonstrate that our method significantly outperforms previous works on zero-shot cross-lingual text classification and obtains better multilingual alignment.
1 Introduction
In recent years, advances in multilingual models such as mBERT (Devlin et al., 2019), XLM (Conneau and Lample, 2019), and XLM-R (Conneau et al., 2020), after being fine-tuned with annotated data, have enabled significant improvements on many cross-lingual tasks. However, due to the lack of annotated data, some tasks in low-resource languages have not enjoyed this technological advancement. To solve this issue, the academic and industrial communities have begun to focus on zero-shot cross-lingual transfer learning (Huang et al., 2019; Artetxe et al., 2020), which aims to fine-tune multilingual models with annotated data in high-resource languages and obtain good performance on low-resource language tasks.
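As a concrete picture of this setting, the sketch below (illustrative only, and not this paper's training recipe) fine-tunes a multilingual encoder on English XNLI data and then evaluates it directly on the Chinese test set without any Chinese labels; the dataset and hyperparameter choices are assumptions made for the example.

```python
# Hedged sketch of zero-shot cross-lingual transfer: train on English only,
# evaluate directly on another language. XNLI and the hyperparameters here
# are illustrative choices, not the configuration used in this paper.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "xlm-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

xnli_en = load_dataset("xnli", "en")   # English NLI data (with labels)
xnli_zh = load_dataset("xnli", "zh")   # target language, used only for testing

def encode(batch):
    return tokenizer(batch["premise"], batch["hypothesis"],
                     truncation=True, padding="max_length", max_length=128)

train_en = xnli_en["train"].map(encode, batched=True)
test_zh = xnli_zh["test"].map(encode, batched=True)

def accuracy(eval_pred):
    logits, labels = eval_pred
    return {"accuracy": float((logits.argmax(-1) == labels).mean())}

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1,
                           per_device_train_batch_size=32),
    train_dataset=train_en,
    compute_metrics=accuracy,
)
trainer.train()                          # fine-tune on English only
print(trainer.predict(test_zh).metrics)  # zero-shot accuracy on Chinese
```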
Figure 1: (a) Embeddings of different languages form separate clusters in mBERT (Libovický et al., 2020). (b) The relative positions of "natural", "language", and "processing" are similar in English, Chinese, and Irish (Cao et al., 2020). (c) Zero-shot with smoothing (Huang et al., 2019): using synonym augmentation to train a robust region covering words in other languages. (d) Ours: we align different languages and construct a suitable robust region by pushing the embeddings away and pulling the relative distance among words.

Some works aligned word embeddings between high- and low-resource languages through additional parallel sentence pairs (Artetxe and Schwenk, 2019; Wei et al., 2021; Chi et al., 2021; Pan et al., 2021) or bilingual dictionaries (Cao et al., 2020; Qin et al., 2020; Liu et al., 2020), so that models fine-tuned on high-resource languages can be transferred to low-resource languages. Although this approach has achieved excellent results in many languages, parallel corpora and bilingual dictionaries are still prohibitively expensive, rendering it impracticable for some minority languages.
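To make the dictionary-based route concrete, the following sketch shows one common instantiation, an orthogonal Procrustes mapping between monolingual embedding spaces fitted on translation pairs; it illustrates the general idea rather than the specific alignment objectives of the works cited above, and the random vectors stand in for real word embeddings.

```python
# Illustrative sketch of bilingual-dictionary alignment via orthogonal
# Procrustes: given embeddings X (source) and Y (target) for dictionary
# translation pairs, find the rotation W minimizing ||X @ W - Y||_F.
import numpy as np

def procrustes_align(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
    """Orthogonal W minimizing ||X @ W - Y||_F, for X, Y of shape (n_pairs, dim)."""
    U, _, Vt = np.linalg.svd(X.T @ Y)   # SVD of the cross-covariance
    return U @ Vt

rng = np.random.default_rng(0)
dim, n_pairs = 8, 100
Y = rng.normal(size=(n_pairs, dim))              # "target-language" vectors
R, _ = np.linalg.qr(rng.normal(size=(dim, dim))) # hidden ground-truth rotation
X = Y @ R.T                                      # "source" = rotated copy of target
W = procrustes_align(X, Y)
print(np.allclose(X @ W, Y, atol=1e-6))          # True: the rotation is recovered
```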
To disengage from the dependence on parallel corpora or bilingual dictionaries (Wu and Dredze, 2019; Hu et al., 2020), some studies have found that syntactic features in high-resource languages can improve zero-shot cross-lingual transfer learning (Meng et al., 2019; Subburathinam et al., 2019; Ahmad et al., 2021a,b). Libovický et al. (2020) found that the embeddings of different languages are clustered according to their language families, as shown in Figures 1a and 1b, which demonstrates that different languages are not aligned perfectly.
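This clustering can be observed directly by representing each language with the centroid of its mBERT representations and comparing the centroids, as in the hedged sketch below; the example sentences and the use of mean pooling are illustrative assumptions rather than the exact procedure of Libovický et al. (2020).

```python
# Hedged sketch: represent each language by the centroid of mean-pooled mBERT
# sentence embeddings and compare centroids across languages. The sentences
# below are illustrative examples, not data used in this paper.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")
model.eval()

def centroid(sentences):
    """Mean-pooled mBERT embeddings averaged over a list of sentences."""
    enc = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state        # (batch, seq, dim)
    mask = enc["attention_mask"].unsqueeze(-1)         # ignore padding tokens
    sent_emb = (hidden * mask).sum(1) / mask.sum(1)    # (batch, dim)
    return sent_emb.mean(0)                            # language centroid

en = centroid(["natural language processing", "a cat sits on the mat"])
de = centroid(["natürliche Sprachverarbeitung", "eine Katze sitzt auf der Matte"])
zh = centroid(["自然语言处理", "一只猫坐在垫子上"])

cos = torch.nn.functional.cosine_similarity
# Related languages (English/German) typically end up closer than distant ones,
# reflecting the language-family clusters sketched in Figure 1a.
print(cos(en, de, dim=0).item(), cos(en, zh, dim=0).item())
```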