A Simple and Effective Method to Improve Zero-Shot Cross-Lingual Transfer Learning

Kunbo Ding1*, Weijie Liu1,2†, Yuejian Fang1, Weiquan Mao2, Zhe Zhao2,
Tao Zhu2, Haoyan Liu2, Rong Tian2, Yiren Chen2
1Peking University, Beijing, China  2Tencent Research, Beijing, China
kunbo_ding@stu.pku.edu.cn, dataliu@pku.edu.cn, fangyj@ss.pku.edu.cn
{weiquanmao, nlpzhezhao, mardozhu, haoyanliu, rometian, yirenchen}@tencent.com
*Contribution during internship at Tencent Inc.
†Corresponding author: Weijie Liu.
Abstract
Existing zero-shot cross-lingual transfer methods rely on parallel corpora or bilingual dictionaries, which are expensive and impractical for low-resource languages. To disengage from these dependencies, researchers have explored training multilingual models on English-only resources and transferring them to low-resource languages. However, the effectiveness of this approach is limited by the gap between the embedding clusters of different languages. To address this issue, we propose Embedding-Push, Attention-Pull, and Robust targets to transfer English embeddings to virtual multilingual embeddings without semantic loss, thereby improving cross-lingual transferability. Experimental results on mBERT and XLM-R demonstrate that our method significantly outperforms previous works on the zero-shot cross-lingual text classification task and obtains better multilingual alignment.
1 Introduction
In recent years, advances in multilingual models such as mBERT (Devlin et al., 2019), XLM (Conneau and Lample, 2019), and XLM-R (Conneau et al., 2020), fine-tuned with annotated data, have enabled significant improvements on many cross-lingual tasks. However, due to the lack of annotated data, some tasks in low-resource languages have not benefited from this technological advancement. To address this issue, academia and industry have begun to focus on zero-shot cross-lingual transfer learning (Huang et al., 2019; Artetxe et al., 2020), which aims to fine-tune multilingual models with annotated data in high-resource languages and achieve good performance on low-resource language tasks.
Some works aligned word embeddings between high- and low-resource languages through additional parallel sentence pairs (Artetxe and Schwenk, 2019; Wei et al., 2021; Chi et al., 2021; Pan et al., 2021) or bilingual dictionaries (Cao et al., 2020; Qin et al., 2020; Liu et al., 2020), so that models fine-tuned on high-resource languages can be transferred to low-resource languages. Although this approach has achieved excellent results in many languages, parallel corpora and bilingual dictionaries remain prohibitively expensive, rendering it impracticable for some minority languages.
Figure 1: (a) Embeddings of different languages form separate clusters in mBERT. (b) The relative positions of "natural", "language", and "processing" are similar in English, Chinese, and Irish (Cao et al., 2020). (c) Using synonym augmentation to train a robust region covering words in other languages. (d) We align different languages and construct a suitable robust region by pushing the embeddings away and pulling the relative distance among words.
To disengage from the dependence on parallel corpora or bilingual dictionaries (Wu and Dredze, 2019; Hu et al., 2020), some studies have found that syntactic features in high-resource languages can improve zero-shot cross-lingual transfer learning (Meng et al., 2019; Subburathinam et al., 2019; Ahmad et al., 2021a,b). Libovický et al. (2020) found that the embeddings of different languages cluster according to their language families, as shown in Figures 1a and 1b, which demonstrates that different languages are not aligned perfectly
in mBERT (Deshpande et al., 2021). Huang et al. (2021) tried adversarial training and randomized smoothing with English synonym augmentation to build robust regions for embeddings in multilingual models, as illustrated in Figure 1c. In this way, models can output similar predictions for embeddings of different languages that fall in the same robust region, even if they are not well aligned. However, the transferability of English synonym augmentation is limited because its robust region remains close to the English cluster, as shown in Figure 1c.
In this work, we select English as the high-resource language and follow the line of studies that require no additional parallel corpora or bilingual dictionaries, improving cross-lingual transfer performance at minimal cost. For this purpose, we propose three strategies to enlarge the robust region of English embeddings. The first strategy, Embedding-Push, pushes English embeddings toward other language clusters. The second, Attention-Pull, constrains the relative positions of the word embeddings to prevent their meaning from straying. The last strategy, the Robust target, introduces a Virtual Multilingual Embedding (VME) to help the model build a suitable robust region, as shown in Figure 1d.
Experimental results on mBERT and XLM-R demonstrate that our method effectively improves zero-shot cross-lingual transfer on classification tasks and outperforms a series of previous works. In addition, case studies show that our method improves the model through multilingual word alignment. Compared with existing works, our method has the following advantages. First, it needs only English resources, making it suitable for low-resource languages. Second, it can induce alignments across many languages without specifying the target language. Finally, it is simple to implement and achieves strong experimental results. Our code is publicly available at https://github.com/KB-Ding/EAR.
2 Method
Given an English training batch $B$, for a specific $x \in B$ consisting of words $(x_1, x_2, x_3)$, we first follow Huang et al. (2021) to generate an augmented example $x^a = (x^a_1, x^a_2, x^a_3)$ by randomly replacing $x_i$ with $x^a_i$ drawn from a pre-defined English synonym set (Alzantot et al., 2018). We then introduce three objective functions to obtain the Virtual Multilingual Embedding (VME), which provides a suitable robust region for the zero-shot cross-lingual classification task, as shown in Figure 2. We describe the details in the following subsections.
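To make the augmentation step concrete, below is a minimal sketch of generating $x^a$ by random synonym replacement. The small SYNONYMS dictionary and the replacement probability p_replace are illustrative placeholders, not the actual resource of Alzantot et al. (2018).

```python
import random

# Toy stand-in for the pre-defined English synonym set; the paper uses the
# resource of Alzantot et al. (2018).
SYNONYMS = {
    "natural": ["organic", "innate"],
    "language": ["speech", "tongue"],
    "processing": ["handling", "treatment"],
}

def augment(tokens, p_replace=0.5):
    """Build an augmented example x^a by randomly swapping words for synonyms."""
    out = []
    for tok in tokens:
        candidates = SYNONYMS.get(tok)
        if candidates and random.random() < p_replace:
            out.append(random.choice(candidates))
        else:
            out.append(tok)
    return out

x = ["natural", "language", "processing"]
x_a = augment(x)  # e.g. ["organic", "language", "treatment"]
```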
Figure 2: The two networks have tied weights. VMEs expand the robust regions (orange circles) by aligning semantically similar words in other languages. Note that VMEs do not specify the target language, yet they improve multilingual performance, as shown in Section 3.3.
2.1 Embedding-push target
The Embedding-Push target aims to make English embeddings leave their original cluster and robust region by pushing $(x, x^a)$ apart in the embedding space. The pushed embedding can be viewed as the VME. The loss function is given in Equation (1):

$$\ell_{EPT} = \frac{1}{|B|} \sum_{x \in B} \left( M(E_x) - M(E_{x^a}) \right)^2 \qquad (1)$$

where $E_x$ and $E_{x^a}$ denote the embedding-layer outputs of $x$ and $x^a$, and $M$ is the mean-pooling operation.
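As an illustration, a PyTorch-style sketch of this loss follows, assuming emb_x and emb_xa are the embedding-layer outputs for $x$ and $x^a$ with shape (batch, seq_len, hidden); reducing the pooled difference with a squared L2 norm and omitting padding masks are simplifying assumptions of this sketch.

```python
import torch

def embedding_push_loss(emb_x, emb_xa):
    """l_EPT: distance between mean-pooled embeddings of x and its augmented x^a.

    emb_x, emb_xa: (batch, seq_len, hidden) embedding-layer outputs.
    Padding positions are ignored here for brevity.
    """
    pooled_x = emb_x.mean(dim=1)    # M(E_x)
    pooled_xa = emb_xa.mean(dim=1)  # M(E_{x^a})
    # Squared L2 distance per example, averaged over the batch (1/|B| * sum).
    return ((pooled_x - pooled_xa) ** 2).sum(dim=-1).mean()
```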
2.2 Attention-pull target
The self-attention matrices contain rich linguistic information (Clark et al., 2019) and can be regarded as a 1-hop graph attention between the hidden states of words (Vaswani et al., 2017). The attention matrix represents the information-transfer score between each pair of words, which we regard as the pulling force; the attention matrix therefore determines the relative linguistic positions of the words in a sentence. We introduce the Attention-Pull target to encourage the relative linguistic positions among $(x^a_1, x^a_2, x^a_3)$ to be similar to those among $(x_1, x_2, x_3)$ by fitting the middle-layer multi-head attention matrices, as in Equation (2):

$$\ell_{APT} = \frac{1}{|B| H} \sum_{x \in B} \sum_{i}^{H} \left\| A^i_x - A^i_{x^a} \right\|^2 \qquad (2)$$

where $A^i_x$ denotes the attention matrix of the $i$-th head in the chosen middle layer for $x$, and $H$ is the number of attention heads.
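For illustration, a matching PyTorch-style sketch of $\ell_{APT}$ is given below. It assumes the attention matrices of one chosen middle layer are available as tensors of shape (batch, heads, seq_len, seq_len), and that $x$ and $x^a$ have the same length because synonym replacement is word-for-word; treating the per-head difference as a squared Frobenius norm is also an assumption of this sketch.

```python
import torch

def attention_pull_loss(attn_x, attn_xa):
    """l_APT: align the mid-layer multi-head attention of x^a with that of x.

    attn_x, attn_xa: (batch, heads, seq_len, seq_len) attention matrices from
    one chosen middle layer; x and x^a are assumed to be equal-length.
    """
    batch_size, num_heads = attn_x.shape[0], attn_x.shape[1]
    # Squared Frobenius distance for each head of each example.
    per_head = ((attn_x - attn_xa) ** 2).sum(dim=(-1, -2))
    # 1 / (|B| * H) * sum over examples and heads.
    return per_head.sum() / (batch_size * num_heads)
```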