A Simple and Effective Method to Improve Zero-Shot Cross-Lingual Transfer Learning

Kunbo Ding1*, Weijie Liu1,2†, Yuejian Fang1, Weiquan Mao2, Zhe Zhao2,
Tao Zhu2, Haoyan Liu2, Rong Tian2, Yiren Chen2
1Peking University, Beijing, China  2Tencent Research, Beijing, China
kunbo_ding@stu.pku.edu.cn, dataliu@pku.edu.cn, fangyj@ss.pku.edu.cn
{weiquanmao, nlpzhezhao, mardozhu, haoyanliu, rometian, yirenchen}@tencent.com
*Contribution during internship at Tencent Inc.
†Corresponding author: Weijie Liu.
Abstract
Existing zero-shot cross-lingual transfer methods rely on parallel corpora or bilingual dictionaries, which are expensive and impractical for low-resource languages. To disengage from these dependencies, researchers have explored training multilingual models on English-only resources and transferring them to low-resource languages. However, the effectiveness of this approach is limited by the gap between the embedding clusters of different languages. To address this issue, we propose Embedding-Push, Attention-Pull, and Robust targets to transfer English embeddings to virtual multilingual embeddings without semantic loss, thereby improving cross-lingual transferability. Experimental results on mBERT and XLM-R demonstrate that our method significantly outperforms previous works on the zero-shot cross-lingual text classification task and obtains better multilingual alignment.
1 Introduction
In recent years, advances in multilingual models such as mBERT (Devlin et al., 2019), XLM (Conneau and Lample, 2019), and XLM-R (Conneau et al., 2020), fine-tuned with annotated data, have enabled significant improvements on many cross-lingual tasks. However, due to the lack of annotated data, some tasks in low-resource languages have not benefited from this technological advancement. To address this issue, academia and industry have begun to focus on zero-shot cross-lingual transfer learning (Huang et al., 2019; Artetxe et al., 2020), which aims to fine-tune multilingual models with annotated data in high-resource languages and achieve good performance on low-resource language tasks.
Some works aligned word embeddings between high- and low-resource languages through additional parallel sentence pairs (Artetxe and Schwenk, 2019; Wei et al., 2021; Chi et al., 2021; Pan et al., 2021) or bilingual dictionaries (Cao et al., 2020; Qin et al., 2020; Liu et al., 2020), so that models fine-tuned on high-resource languages can be transferred to low-resource languages. Although this approach has achieved excellent results in many languages, parallel corpora and bilingual dictionaries remain prohibitively expensive, rendering it impracticable for some minority languages.
Figure 1: (a) Embeddings of different languages form separate clusters in mBERT. (b) The relative positions of "natural", "language", and "processing" are similar in English, Chinese, and Irish (Cao et al., 2020). (c) Using synonym augmentation to train a robust region covering words in other languages. (d) We align different languages and construct a suitable robust region by pushing the embeddings away and pulling the relative distance among words.
To disengage from the dependence on parallel corpora or bilingual dictionaries (Wu and Dredze, 2019; Hu et al., 2020), some studies have found that syntactic features in high-resource languages can improve zero-shot cross-lingual transfer learning (Meng et al., 2019; Subburathinam et al., 2019; Ahmad et al., 2021a,b). Libovický et al. (2020) found that the embeddings of different languages cluster according to their language families, as shown in Figures 1a and 1b, which demonstrates that different languages are not aligned perfectly
in mBERT (Deshpande et al., 2021). Huang et al. (2021) tried adversarial training and randomized smoothing with English synonym augmentation to build robust regions for embeddings in multilingual models, as illustrated in Figure 1c. In this way, models can output similar predictions for embeddings of different languages that fall in the same robust region, even if they are not well aligned. However, the transferability of English synonym augmentation is limited because its robust region remains close to the English cluster, as shown in Figure 1c.
In this work, we select English as the high-resource language and follow the line of studies that require no additional parallel corpora or bilingual dictionaries, improving cross-lingual transfer performance at minimal cost. For this purpose, we propose three strategies to enlarge the robust region of English embeddings. The first strategy, Embedding-Push, pushes English embeddings toward other language clusters. The second, Attention-Pull, constrains the relative positions of the word embeddings to prevent their meaning from straying. The last strategy, the Robust target, introduces a Virtual Multilingual Embedding (VME) to help the model build a suitable robust region, as shown in Figure 1d.
Experimental results on mBERT and XLM-R demonstrate that our method effectively improves zero-shot cross-lingual transfer on classification tasks and outperforms a series of previous works. In addition, case studies show that our method improves the model through multilingual word alignment. Compared with existing works, our method has the following advantages. First, it needs only English resources, making it suitable for low-resource languages. Second, it can induce alignments across many languages without specifying the target language. Finally, it is simple to implement and achieves strong experimental results. Our code is publicly available at https://github.com/KB-Ding/EAR.
2 Method
Given an English training batch $B$, for a specific $x \in B$ consisting of words $(x_1, x_2, x_3)$, we first follow Huang et al. (2021) to generate an augmented example $x^a = (x^a_1, x^a_2, x^a_3)$ by randomly replacing $x_i$ with $x^a_i$ drawn from a pre-defined English synonym set (Alzantot et al., 2018). We then introduce three objective functions to obtain the Virtual Multilingual Embedding (VME), which provides a suitable robust region for the zero-shot cross-lingual classification task, as shown in Figure 2. We describe the details in the following subsections.
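To make the augmentation step concrete, below is a minimal sketch of generating $x^a$ by random synonym replacement. The small SYNONYMS dictionary and the replacement probability p_replace are illustrative placeholders, not the actual resource of Alzantot et al. (2018).

```python
import random

# Toy stand-in for the pre-defined English synonym set; the paper uses the
# resource of Alzantot et al. (2018).
SYNONYMS = {
    "natural": ["organic", "innate"],
    "language": ["speech", "tongue"],
    "processing": ["handling", "treatment"],
}

def augment(tokens, p_replace=0.5):
    """Build an augmented example x^a by randomly swapping words for synonyms."""
    out = []
    for tok in tokens:
        candidates = SYNONYMS.get(tok)
        if candidates and random.random() < p_replace:
            out.append(random.choice(candidates))
        else:
            out.append(tok)
    return out

x = ["natural", "language", "processing"]
x_a = augment(x)  # e.g. ["organic", "language", "treatment"]
```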
Figure 2: The two networks have tied weights. VMEs expand the robust regions (orange circles) by aligning semantically similar words in other languages. Note that VMEs do not specify the target language, yet they improve multilingual performance, as shown in Section 3.3.
2.1 Embedding-push target
The Embedding-Push target aims to make English embeddings leave their original cluster and robust region by pushing $(x, x^a)$ apart in the embedding space. The pushed embedding can be viewed as the VME. The loss function is given in Equation (1):

$$\ell_{EPT} = \frac{1}{|B|} \sum_{x \in B} \left( M(E_x) - M(E_{x^a}) \right)^2 \qquad (1)$$

where $E_x$ and $E_{x^a}$ denote the embedding-layer outputs of $x$ and $x^a$, and $M$ is the mean-pooling operation.
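As an illustration, a PyTorch-style sketch of this loss follows, assuming emb_x and emb_xa are the embedding-layer outputs for $x$ and $x^a$ with shape (batch, seq_len, hidden); reducing the pooled difference with a squared L2 norm and omitting padding masks are simplifying assumptions of this sketch.

```python
import torch

def embedding_push_loss(emb_x, emb_xa):
    """l_EPT: distance between mean-pooled embeddings of x and its augmented x^a.

    emb_x, emb_xa: (batch, seq_len, hidden) embedding-layer outputs.
    Padding positions are ignored here for brevity.
    """
    pooled_x = emb_x.mean(dim=1)    # M(E_x)
    pooled_xa = emb_xa.mean(dim=1)  # M(E_{x^a})
    # Squared L2 distance per example, averaged over the batch (1/|B| * sum).
    return ((pooled_x - pooled_xa) ** 2).sum(dim=-1).mean()
```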
2.2 Attention-pull target
The self-attention matrices contain rich linguistic information (Clark et al., 2019) and can be regarded as a 1-hop graph attention between the hidden states of words (Vaswani et al., 2017). The attention matrix represents the information-transfer score between each pair of words, which we regard as the pulling force; the attention matrix therefore determines the relative linguistic positions of the words in a sentence. We introduce the Attention-Pull target to encourage the relative linguistic positions among $(x^a_1, x^a_2, x^a_3)$ to be similar to those among $(x_1, x_2, x_3)$ by fitting the middle-layer multi-head attention matrices, as in Equation (2):

$$\ell_{APT} = \frac{1}{|B| H} \sum_{x \in B} \sum_{i}^{H} \left\| A^i_x - A^i_{x^a} \right\|^2 \qquad (2)$$

where $A^i_x$ denotes the attention matrix of the $i$-th head in the chosen middle layer for $x$, and $H$ is the number of attention heads.
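For illustration, a matching PyTorch-style sketch of $\ell_{APT}$ is given below. It assumes the attention matrices of one chosen middle layer are available as tensors of shape (batch, heads, seq_len, seq_len), and that $x$ and $x^a$ have the same length because synonym replacement is word-for-word; treating the per-head difference as a squared Frobenius norm is also an assumption of this sketch.

```python
import torch

def attention_pull_loss(attn_x, attn_xa):
    """l_APT: align the mid-layer multi-head attention of x^a with that of x.

    attn_x, attn_xa: (batch, heads, seq_len, seq_len) attention matrices from
    one chosen middle layer; x and x^a are assumed to be equal-length.
    """
    batch_size, num_heads = attn_x.shape[0], attn_x.shape[1]
    # Squared Frobenius distance for each head of each example.
    per_head = ((attn_x - attn_xa) ** 2).sum(dim=(-1, -2))
    # 1 / (|B| * H) * sum over examples and heads.
    return per_head.sum() / (batch_size * num_heads)
```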