Robust Unsupervised Cross-Lingual Word Embedding using Domain Flow Interpolation

Liping Tang1, Zhen Li2, Zhiquan Luo2, Helen Meng1,3
1 Centre for Perceptual and Interactive Intelligence
2 The Chinese University of Hong Kong, Shenzhen
3 The Chinese University of Hong Kong
lptang@cpii.hk, {zhenli, zqluo}@cuhk.edu.cn, hmmeng@cuhk.edu.hk

arXiv:2210.03319v1 [cs.CL] 7 Oct 2022
Abstract
This paper investigates an unsupervised approach to deriving a universal cross-lingual word embedding space, in which words with similar semantics from different languages are close to one another. Previous adversarial approaches have shown promising results in inducing cross-lingual word embeddings without parallel data. However, training is unstable for distant language pairs. Instead of mapping the source language space directly to the target language space, we propose to make use of a sequence of intermediate spaces for smooth bridging. Each intermediate space may be conceived as a pseudo-language space and is introduced via simple linear interpolation. This approach is modeled after domain flow in computer vision, but with a modified objective function. Experiments on the intrinsic Bilingual Lexicon Induction (BLI) task show that the proposed approach can improve the robustness of adversarial models while achieving comparable or even better precision. Further experiments on the downstream task of Cross-Lingual Natural Language Inference show that the proposed model achieves significant performance improvements for distant language pairs compared to state-of-the-art adversarial and non-adversarial models.
1 Introduction
Learning cross-lingual word embeddings (CLWE) is a fundamental step towards deriving a universal embedding space in which words with similar semantics from different languages are close to one another. CLWE has also shown effectiveness in knowledge transfer between languages for many natural language processing tasks, including Named Entity Recognition (Guo et al., 2015), Machine Translation (Gu et al., 2018), and Information Retrieval (Vulic and Moens, 2015).
Inspired by Mikolov et al. (2013), recent CLWE models have been dominated by mapping-based methods (Ruder et al., 2019; Glavas et al., 2019; Vulic et al., 2019). These methods map monolingual word embeddings into a shared space via linear mappings, assuming that different word embedding spaces are nearly isomorphic. Leveraging a seed dictionary of 5,000 word pairs, Mikolov et al. (2013) induce CLWEs by solving a least-squares problem. Subsequent works (Xing et al., 2015; Artetxe et al., 2016; Smith et al., 2017; Joulin et al., 2018) improve the model by normalizing the embedding vectors, imposing an orthogonality constraint on the linear mapping, and modifying the objective function. Later work has shown that reliable projections can be learned from weak supervision by exploiting shared numerals (Artetxe et al., 2017), cognates (Smith et al., 2017), or identical strings (Søgaard et al., 2018).
Moreover, several fully unsupervised approaches have recently been proposed to induce CLWEs by adversarial training (Zhang et al., 2017a; Zhang et al., 2017b; Lample et al., 2018). State-of-the-art unsupervised adversarial approaches (Lample et al., 2018) have achieved very promising results and even outperform supervised approaches in some cases. However, the main drawback of adversarial approaches lies in their instability on distant language pairs (Søgaard et al., 2018), which has motivated non-adversarial approaches (Hoshen and Wolf, 2018; Artetxe et al., 2018b). In particular, Artetxe et al. (2018b) (VecMap) have shown strong robustness on several language pairs. However, it still fails on 87 out of 210 distant language pairs (Vulic et al., 2019).
Subsequently, Li et al. (2020) proposed Iterative Dimension Reduction to improve the robustness of VecMap. Mohiuddin and Joty (2019), on the other hand, revisited adversarial models and added two regularization terms that yield improved results. However, the instability problem remains: our experiments show that this improved version (Mohiuddin and Joty, 2019) still fails to induce reliable English-to-Japanese and English-to-Chinese CLWE spaces.
In this paper, we focus on the challenging task of unsupervised CLWE induction for distant language pairs. Given the high precision achieved by adversarial models, we revisit them and propose to improve their robustness. We adopt the network architecture of Mohiuddin and Joty (2019) but treat the unsupervised CLWE task as a domain adaptation problem. Our approach is inspired by the idea of domain flow in computer vision, which has been shown to be effective for domain adaptation tasks. Gong et al. (2019) introduced intermediate domains to generate images of intermediate styles, adding an intermediate-domain variable to the input of the generator via conditional instance normalization. The intermediate domains can smoothly bridge the gap between the source and target domains to ease the domain adaptation task. Inspired by this idea, we adapt domain flow to our task by introducing intermediate domains via simple linear interpolation. Specifically, rather than mapping the source language space directly to the target language space, we map it to intermediate spaces. Each intermediate space may be conceived as a pseudo-language space and is introduced as a linear interpolation of the source and target language spaces. We then let the intermediate space gradually approach the target language space. Consequently, the gap between the source language space and the target space can be smoothly bridged by the sequence of intermediate spaces. We also modify the objective functions of the original domain flow formulation for our task.
We evaluate the proposed model on both intrinsic and downstream tasks. Experiments on the intrinsic Bilingual Lexicon Induction (BLI) task show that our method can significantly improve the robustness of adversarial models while achieving comparable or even better precision than state-of-the-art adversarial and non-adversarial models. Although BLI is a standard evaluation task for CLWEs, performance on BLI might not correlate with performance on downstream tasks (Glavas et al., 2019). Following previous works (Glavas et al., 2019; Doval et al., 2020; Ormazabal et al., 2021), we choose Cross-Lingual Natural Language Inference (XNLI), a language understanding task, as the downstream task to further evaluate the proposed model. Experiments on XNLI show that the proposed model achieves higher accuracy on distant language pairs than the baselines, which validates the importance of CLWE robustness for downstream tasks and demonstrates the effectiveness of the proposed model.
2 Proposed Model
Our model is built on the network structure of Mohiuddin and Joty (2019), which implements a CycleGAN on the latent word representations produced by autoencoders. In our model, the source language space corresponds to the source domain $\mathcal{S}$ and the target language space corresponds to the target domain $\mathcal{T}$.
2.1 Introducing Intermediate Domains
Let $z \in [0, 1]$ and denote the intermediate domain by $M(z)$, similar to Gong et al. (2019). $M(0)$ corresponds to the source domain $\mathcal{S}$, and $M(1)$ corresponds to the target domain $\mathcal{T}$. By varying $z$ from 0 to 1, we obtain a sequence of intermediate domains from $\mathcal{S}$ to $\mathcal{T}$, referred to as the domain flow. There are many possible paths from $\mathcal{S}$ to $\mathcal{T}$, and we expect $M(z)$ to follow the shortest one.
Moreover, for any given $z$, we expect the distance between $\mathcal{S}$ and $M(z)$ to be the fraction $z$ of the distance between $\mathcal{S}$ and $\mathcal{T}$, or equivalently,
$$\frac{\mathrm{dist}(P_S, P_{M(z)})}{\mathrm{dist}(P_T, P_{M(z)})} = \frac{z}{1 - z}. \tag{1}$$
Thus, finding the shortest path from $\mathcal{S}$ to $\mathcal{T}$, i.e., the sequence of $M(z)$, amounts to minimizing the following loss:
$$\mathcal{L} = z \cdot \mathrm{dist}(P_T, P_{M(z)}) + (1 - z) \cdot \mathrm{dist}(P_S, P_{M(z)}). \tag{2}$$
We use the adversarial loss of GANs (Goodfellow et al., 2014) to model the distance between distributions, similar to Gong et al. (2019).
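As a quick illustration (ours, not from the paper), the ratio in Eq. (1) is exactly what linear interpolation between two points yields under Euclidean distance; Eq. (1) is stated for distances between distributions, but the point-wise check below conveys the geometry, with random vectors as stand-ins:

```python
import numpy as np

# Sanity check (illustrative, not from the paper): for the interpolation
# m = (1 - z) * s + z * t, Euclidean distances satisfy
# dist(s, m) / dist(t, m) = z / (1 - z), matching Eq. (1).
rng = np.random.default_rng(0)
s, t = rng.normal(size=300), rng.normal(size=300)  # stand-ins for points in S and T
for z in (0.2, 0.5, 0.8):
    m = (1 - z) * s + z * t
    print(z, np.linalg.norm(s - m) / np.linalg.norm(t - m), z / (1 - z))  # ratios agree
```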
2.2 Implementation of Generators
Suppose $x$ is sampled from the source domain $\mathcal{S}$ and $y$ is sampled from the target domain $\mathcal{T}$. The generator $G_{ST}$ in our model transfers data from the source domain to an intermediate domain instead of the target domain. Denoting $Z = [0, 1]$, $G_{ST}$ is a mapping from $\mathcal{S} \times Z$ to $M(z)$.

To ensure that the generator is a linear transformation, we define it as
$$G_{ST}(x, z) = W_{ST}(z) \cdot x + (1 - z) \cdot x. \tag{3}$$
In this setup, $G_{ST}(x, 0) = x$ and $G_{ST}(x, 1) = W_{ST}(1) \cdot x$. We take $W_{ST}(z)$ to be a simple scalar multiple of a fixed matrix, i.e.,
$$W_{ST}(z) = z \cdot W_{ST}, \tag{4}$$
where $W_{ST}$ is the final transformation matrix that we are interested in. Our intermediate mappings then become
$$G_{ST}(x, z) = z \cdot W_{ST} \cdot x + (1 - z) \cdot x. \tag{5}$$
These intermediate mappings are simple linear interpolations between the source-domain data $x$ and its image $W_{ST} \cdot x$ in the pseudo-target domain. The generator $G_{TS}(y, z)$ is defined analogously.
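The mapping in Eq. (5) is straightforward to implement. Below is a minimal PyTorch sketch (our illustration, not the paper's exact code; the 300-dimensional embedding size and class name are assumptions):

```python
import torch
import torch.nn as nn

class FlowGenerator(nn.Module):
    """Interpolated linear generator G_ST(x, z) = z * W_ST x + (1 - z) * x, as in Eq. (5)."""

    def __init__(self, dim: int = 300):  # 300 is an assumed embedding dimensionality
        super().__init__()
        self.W = nn.Linear(dim, dim, bias=False)  # the mapping W_ST we ultimately want

    def forward(self, x: torch.Tensor, z: float) -> torch.Tensor:
        # z = 0 returns x unchanged (source domain); z = 1 returns W_ST x (pseudo-target).
        return z * self.W(x) + (1 - z) * x

G_ST = FlowGenerator()
x_mid = G_ST(torch.randn(8, 300), z=0.3)  # a batch mapped into the intermediate domain M(0.3)
```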
2.3 The Domain Flow Model
The discriminator $D_S$ is used to distinguish $\mathcal{S}$ from $M(z)$, and $D_T$ is used to distinguish $\mathcal{T}$ from $M(z)$. Using the adversarial loss as the distribution distance measure, the adversarial loss between $M(z)$ and $\mathcal{S}$ is
$$\mathcal{L}_{adv}(G_{ST}, D_S) = \mathbb{E}_{x \sim P_S}[\log D_S(x)] + \mathbb{E}_{x \sim P_S}[\log(1 - D_S(G_{ST}(x, z)))]. \tag{6}$$
Similarly, the adversarial loss between $M(z)$ and $\mathcal{T}$ can be written as
$$\mathcal{L}_{adv}(G_{ST}, D_T) = \mathbb{E}_{y \sim P_T}[\log D_T(y)] + \mathbb{E}_{x \sim P_S}[\log(1 - D_T(G_{ST}(x, z)))]. \tag{7}$$
Deploying the above losses as $\mathrm{dist}(P_S, P_{M(z)})$ and $\mathrm{dist}(P_T, P_{M(z)})$ in Eq. (2), we derive the following loss:
$$\mathcal{L}_{adv}(G_{ST}, D_S, D_T) = z \cdot \mathcal{L}_{adv}(G_{ST}, D_T) + (1 - z) \cdot \mathcal{L}_{adv}(G_{ST}, D_S). \tag{8}$$
Considering the other direction, from $\mathcal{T}$ to $M(1 - z)$, we can define an analogous loss $\mathcal{L}_{adv}(G_{TS}, D_S, D_T)$. The total adversarial loss is then
$$\mathcal{L}_{adv} = \mathcal{L}_{adv}(G_{ST}, D_S, D_T) + \mathcal{L}_{adv}(G_{TS}, D_S, D_T). \tag{9}$$
Modification of Adversarial Loss
In the loss discussed above, $D_S$ is trained to assign a high value (i.e., 1) to $x$ and a low value (i.e., 0) to $G_{ST}(x, z)$, and similarly for $D_T$. But when $z$ is small, $G_{ST}(x, z)$ is close to the source-domain data, and training the discriminator to assign 0 to it would be too aggressive. In our model, we instead train the discriminator $D_S$ to assign $1 - z$ to $G_{ST}(x, z)$. When $z = 0$, $G_{ST}(x, z) = x$ and $D_S$ is trained to assign 1 to it. $G_{ST}$ and $G_{TS}$ are trained to fool the discriminator $D_S$, trying to make $D_S(G_{TS}(y, z))$ close to 1 and $D_S(G_{ST}(x, z))$ close to $z$.
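A hedged sketch of these soft discriminator targets using binary cross-entropy (our reading of the scheme above; the sigmoid-output discriminator and all function names are assumptions, not the paper's implementation):

```python
import torch
import torch.nn.functional as F

def discriminator_loss_DS(D_S, x_real, x_fake, z):
    """Train D_S (assumed to output sigmoid probabilities) to assign 1 to real
    source data and the soft target 1 - z, rather than 0, to G_ST(x, z)."""
    p_real, p_fake = D_S(x_real), D_S(x_fake.detach())
    return (F.binary_cross_entropy(p_real, torch.ones_like(p_real))
            + F.binary_cross_entropy(p_fake, torch.full_like(p_fake, 1.0 - z)))

def generator_loss_DS(D_S, fake_from_T, fake_from_S, z):
    """Generators try to fool D_S: push D_S(G_TS(y, z)) toward 1 and
    D_S(G_ST(x, z)) toward z, as described in the text above."""
    p_ts, p_st = D_S(fake_from_T), D_S(fake_from_S)
    return (F.binary_cross_entropy(p_ts, torch.ones_like(p_ts))
            + F.binary_cross_entropy(p_st, torch.full_like(p_st, float(z))))
```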
Besides the adversarial loss, the cycle-consistency loss of the CycleGAN is defined as:
$$\mathcal{L}_{cyc}(G_{ST}, G_{TS}) = \mathbb{E}_{x \sim P_S}\|G_{TS}(G_{ST}(x, z), z) - x\|^2 + \mathbb{E}_{y \sim P_T}\|G_{ST}(G_{TS}(y, z), z) - y\|^2. \tag{10}$$
Following the model structure of Mohiuddin and Joty (2019), we deploy the domain flow on the latent space obtained from two autoencoders, i.e., we replace $x$ and $y$ with $\mathrm{Enc}_S(x)$ and $\mathrm{Enc}_T(y)$ in the losses above. An additional reconstruction loss for the autoencoders is defined as:
$$\mathcal{L}_{rec} = \mathbb{E}_{x \sim P_S}\|\mathrm{Dec}_S(\mathrm{Enc}_S(x)) - x\|^2 + \mathbb{E}_{y \sim P_T}\|\mathrm{Dec}_T(\mathrm{Enc}_T(y)) - y\|^2. \tag{11}$$
The total loss is then
$$\mathcal{L} = \mathcal{L}_{adv} + \lambda_1 \cdot \mathcal{L}_{cyc} + \lambda_2 \cdot \mathcal{L}_{rec}, \tag{12}$$
where $\lambda_1$ and $\lambda_2$ are hyperparameters.
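Putting Eqs. (10)–(12) together, a hedged sketch of how the cycle, reconstruction, and total losses might be assembled on the latent codes; the encoder/decoder modules and the default $\lambda$ values are placeholders, not the paper's reported settings:

```python
def total_loss(adv_loss, x, y, G_ST, G_TS, Enc_S, Dec_S, Enc_T, Dec_T, z,
               lam1=1.0, lam2=1.0):  # lam1/lam2 defaults are placeholders
    hx, hy = Enc_S(x), Enc_T(y)  # domain flow operates on the latent codes
    # Cycle-consistency loss on the latent space, Eq. (10).
    cyc = ((G_TS(G_ST(hx, z), z) - hx).pow(2).sum(-1).mean()
           + (G_ST(G_TS(hy, z), z) - hy).pow(2).sum(-1).mean())
    # Autoencoder reconstruction loss, Eq. (11).
    rec = ((Dec_S(hx) - x).pow(2).sum(-1).mean()
           + (Dec_T(hy) - y).pow(2).sum(-1).mean())
    return adv_loss + lam1 * cyc + lam2 * rec  # total loss, Eq. (12)
```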
Choice of z
In our model, $z$ is sampled from a beta distribution $f(z; \alpha, \beta) = \frac{1}{B(\alpha, \beta)} z^{\alpha - 1} (1 - z)^{\beta - 1}$, where $B(\cdot, \cdot)$ is the Beta function, $\beta$ is fixed to 1, and $\alpha$ is a function of the training iteration. Specifically, $\alpha = e^{\frac{t - 0.5T}{0.25T}}$, where $t$ is the current iteration and $T$ is the total number of iterations. Under this schedule, $z$ tends to take small values at the beginning of training and gradually shifts toward larger values. In practice, we fix $z = 1$ in the last several epochs to fine-tune the model: when running 10 epochs, $z$ is fixed to 1 in the last 3 epochs; when running 20 or 30 epochs, $z$ is fixed to 1 in the last 5 epochs.
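The schedule is easy to implement; here is a small sketch under the stated formula (the fine-tuning flag is our paraphrase of the epoch cutoffs above, and the function name is illustrative):

```python
import math
import numpy as np

def sample_z(t, T, finetune=False, rng=np.random.default_rng()):
    """Sample z ~ Beta(alpha, 1) with alpha = exp((t - 0.5*T) / (0.25*T)).
    Small alpha early in training skews z toward 0; large alpha later skews
    it toward 1. In the final fine-tuning epochs, z is fixed to 1."""
    if finetune:
        return 1.0
    alpha = math.exp((t - 0.5 * T) / (0.25 * T))
    return float(rng.beta(alpha, 1.0))
```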
3 Bilingual Lexicon Induction
3.1 Experimental Setup
Bilingual Lexicon Induction (BLI) has become the de facto standard evaluation for mapping-based CLWEs (Ruder et al., 2019; Glavas et al., 2019; Vulic et al., 2019). Given a shared CLWE space