Robust Unsupervised Cross-Lingual Word Embedding using Domain Flow Interpolation

Liping Tang1, Zhen Li2, Zhiquan Luo2, Helen Meng1,3
1 Centre for Perceptual and Interactive Intelligence
2 The Chinese University of Hong Kong, Shenzhen
3 The Chinese University of Hong Kong
lptang@cpii.hk, {zhenli, zqluo}@cuhk.edu.cn, hmmeng@cuhk.edu.hk

arXiv:2210.03319v1 [cs.CL] 7 Oct 2022
Abstract
This paper investigates an unsupervised approach to deriving a universal cross-lingual word embedding space, in which words with similar semantics from different languages are close to one another. Previous adversarial approaches have shown promising results in inducing cross-lingual word embeddings without parallel data. However, training is unstable for distant language pairs. Instead of mapping the source language space directly to the target language space, we propose to make use of a sequence of intermediate spaces for smooth bridging. Each intermediate space may be conceived as a pseudo-language space and is introduced via simple linear interpolation. This approach is modeled after domain flow in computer vision, but with a modified objective function. Experiments on the intrinsic Bilingual Lexicon Induction (BLI) task show that the proposed approach can improve the robustness of adversarial models while achieving comparable or even better precision. Further experiments on the downstream task of Cross-Lingual Natural Language Inference show that the proposed model achieves significant performance improvements for distant language pairs compared to state-of-the-art adversarial and non-adversarial models.
1 Introduction
Learning cross-lingual word embeddings (CLWE) is a fundamental step towards deriving a universal embedding space in which words with similar semantics from different languages are close to one another. CLWE has also shown effectiveness in knowledge transfer between languages for many natural language processing tasks, including Named Entity Recognition (Guo et al., 2015), Machine Translation (Gu et al., 2018), and Information Retrieval (Vulic and Moens, 2015).
Inspired by Mikolov et al. (2013), recent CLWE models have been dominated by mapping-based methods (Ruder et al., 2019; Glavas et al., 2019; Vulic et al., 2019). These methods map monolingual word embeddings into a shared space via linear mappings, assuming that different word embedding spaces are nearly isomorphic. Leveraging a seed dictionary of 5,000 word pairs, Mikolov et al. (2013) induce CLWEs by solving a least-squares problem. Subsequent works (Xing et al., 2015; Artetxe et al., 2016; Smith et al., 2017; Joulin et al., 2018) improve the model by normalizing the embedding vectors, imposing an orthogonality constraint on the linear mapping, and modifying the objective function. Later work has shown that reliable projections can be learned from weak supervision by exploiting shared numerals (Artetxe et al., 2017), cognates (Smith et al., 2017), or identical strings (Søgaard et al., 2018).
Moreover, several fully unsupervised approaches have recently been proposed to induce CLWEs by adversarial training (Zhang et al., 2017a; Zhang et al., 2017b; Lample et al., 2018). State-of-the-art unsupervised adversarial approaches (Lample et al., 2018) have achieved very promising results and even outperform supervised approaches in some cases. However, the main drawback of adversarial approaches lies in their instability on distant language pairs (Søgaard et al., 2018), which has motivated non-adversarial approaches (Hoshen and Wolf, 2018; Artetxe et al., 2018b). In particular, Artetxe et al. (2018b) (VecMap) have shown strong robustness on several language pairs. However, it still fails on 87 out of 210 distant language pairs (Vulic et al., 2019).
Subsequently, Li et al. (2020) proposed Iterative Dimension Reduction to improve the robustness of VecMap. Mohiuddin and Joty (2019), on the other hand, revisited adversarial models and added two regularization terms that yield improved results. However, the instability problem remains: our experiments show that this improved version (Mohiuddin and Joty, 2019) still fails to induce reliable English-to-Japanese and English-to-Chinese CLWE spaces.
In this paper, we focus on the challenging task of unsupervised CLWE induction for distant language pairs. Given the high precision achieved by adversarial models, we revisit them and propose to improve their robustness. We adopt the network architecture of Mohiuddin and Joty (2019) but treat the unsupervised CLWE task as a domain adaptation problem. Our approach is inspired by the idea of domain flow in computer vision, which has been shown to be effective for domain adaptation tasks. Gong et al. (2019) introduced intermediate domains to generate images of intermediate styles, adding an intermediate-domain variable to the input of the generator via conditional instance normalization. The intermediate domains can smoothly bridge the gap between the source and target domains to ease the domain adaptation task. Inspired by this idea, we adapt domain flow to our task by introducing intermediate domains via simple linear interpolation. Specifically, rather than mapping the source language space directly to the target language space, we map it to intermediate spaces. Each intermediate space may be conceived as a pseudo-language space and is introduced as a linear interpolation of the source and target language spaces. We then let the intermediate space gradually approach the target language space. Consequently, the gap between the source language space and the target space can be smoothly bridged by the sequence of intermediate spaces. We also modify the objective functions of the original domain flow formulation for our task.
We evaluate the proposed model on both intrinsic and downstream tasks. Experiments on the intrinsic Bilingual Lexicon Induction (BLI) task show that our method can significantly improve the robustness of adversarial models while achieving comparable or even better precision than state-of-the-art adversarial and non-adversarial models. Although BLI is a standard evaluation task for CLWEs, performance on BLI might not correlate with performance on downstream tasks (Glavas et al., 2019). Following previous works (Glavas et al., 2019; Doval et al., 2020; Ormazabal et al., 2021), we choose Cross-Lingual Natural Language Inference (XNLI), a language understanding task, as the downstream task to further evaluate the proposed model. Experiments on XNLI show that the proposed model achieves higher accuracy on distant language pairs than the baselines, which validates the importance of CLWE robustness for downstream tasks and demonstrates the effectiveness of the proposed model.
2 Proposed Model
Our model is built on the network structure of Mohiuddin and Joty (2019), which implements a CycleGAN on the latent word representations produced by autoencoders. In our model, the source language space corresponds to the source domain $\mathcal{S}$ and the target language space corresponds to the target domain $\mathcal{T}$.
2.1 Introducing Intermediate Domains
Let $z \in [0, 1]$ and denote the intermediate domain by $M(z)$, similar to Gong et al. (2019). $M(0)$ corresponds to the source domain $\mathcal{S}$, and $M(1)$ corresponds to the target domain $\mathcal{T}$. By varying $z$ from 0 to 1, we obtain a sequence of intermediate domains from $\mathcal{S}$ to $\mathcal{T}$, referred to as the domain flow. There are many possible paths from $\mathcal{S}$ to $\mathcal{T}$, and we expect $M(z)$ to follow the shortest one.
Moreover, for any given $z$, we expect the distance between $\mathcal{S}$ and $M(z)$ to be the fraction $z$ of the distance between $\mathcal{S}$ and $\mathcal{T}$, or equivalently,
$$\frac{\mathrm{dist}(P_S, P_{M(z)})}{\mathrm{dist}(P_T, P_{M(z)})} = \frac{z}{1 - z}. \tag{1}$$
Thus, finding the shortest path from $\mathcal{S}$ to $\mathcal{T}$, i.e., the sequence of $M(z)$, amounts to minimizing the following loss:
$$\mathcal{L} = z \cdot \mathrm{dist}(P_T, P_{M(z)}) + (1 - z) \cdot \mathrm{dist}(P_S, P_{M(z)}). \tag{2}$$
We use the adversarial loss of GANs (Goodfellow et al., 2014) to model the distance between distributions, similar to Gong et al. (2019).
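As a quick illustration (ours, not from the paper), the ratio in Eq. (1) is exactly what linear interpolation between two points yields under Euclidean distance; Eq. (1) is stated for distances between distributions, but the point-wise check below conveys the geometry, with random vectors as stand-ins:

```python
import numpy as np

# Sanity check (illustrative, not from the paper): for the interpolation
# m = (1 - z) * s + z * t, Euclidean distances satisfy
# dist(s, m) / dist(t, m) = z / (1 - z), matching Eq. (1).
rng = np.random.default_rng(0)
s, t = rng.normal(size=300), rng.normal(size=300)  # stand-ins for points in S and T
for z in (0.2, 0.5, 0.8):
    m = (1 - z) * s + z * t
    print(z, np.linalg.norm(s - m) / np.linalg.norm(t - m), z / (1 - z))  # ratios agree
```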
2.2 Implementation of Generators
Suppose $x$ is sampled from the source domain $\mathcal{S}$ and $y$ is sampled from the target domain $\mathcal{T}$. The generator $G_{ST}$ in our model transfers data from the source domain to an intermediate domain instead of the target domain. Denoting $Z = [0, 1]$, $G_{ST}$ is a mapping from $\mathcal{S} \times Z$ to $M(z)$.

To ensure that the generator is a linear transformation, we define it as
$$G_{ST}(x, z) = W_{ST}(z) \cdot x + (1 - z) \cdot x. \tag{3}$$
In this setup, $G_{ST}(x, 0) = x$ and $G_{ST}(x, 1) = W_{ST}(1) \cdot x$. We take $W_{ST}(z)$ to be a simple scalar multiple of a fixed matrix, i.e.,
$$W_{ST}(z) = z \cdot W_{ST}, \tag{4}$$
where $W_{ST}$ is the final transformation matrix that we are interested in. Our intermediate mappings then become
$$G_{ST}(x, z) = z \cdot W_{ST} \cdot x + (1 - z) \cdot x. \tag{5}$$
These intermediate mappings are simple linear interpolations between the source-domain data $x$ and its image $W_{ST} \cdot x$ in the pseudo-target domain. The generator $G_{TS}(y, z)$ is defined analogously.
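The mapping in Eq. (5) is straightforward to implement. Below is a minimal PyTorch sketch (our illustration, not the paper's exact code; the 300-dimensional embedding size and class name are assumptions):

```python
import torch
import torch.nn as nn

class FlowGenerator(nn.Module):
    """Interpolated linear generator G_ST(x, z) = z * W_ST x + (1 - z) * x, as in Eq. (5)."""

    def __init__(self, dim: int = 300):  # 300 is an assumed embedding dimensionality
        super().__init__()
        self.W = nn.Linear(dim, dim, bias=False)  # the mapping W_ST we ultimately want

    def forward(self, x: torch.Tensor, z: float) -> torch.Tensor:
        # z = 0 returns x unchanged (source domain); z = 1 returns W_ST x (pseudo-target).
        return z * self.W(x) + (1 - z) * x

G_ST = FlowGenerator()
x_mid = G_ST(torch.randn(8, 300), z=0.3)  # a batch mapped into the intermediate domain M(0.3)
```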
2.3 The Domain Flow Model
The discriminator $D_S$ is used to distinguish $\mathcal{S}$ from $M(z)$, and $D_T$ is used to distinguish $\mathcal{T}$ from $M(z)$. Using the adversarial loss as the distribution distance measure, the adversarial loss between $M(z)$ and $\mathcal{S}$ is
$$\mathcal{L}_{adv}(G_{ST}, D_S) = \mathbb{E}_{x \sim P_S}[\log D_S(x)] + \mathbb{E}_{x \sim P_S}[\log(1 - D_S(G_{ST}(x, z)))]. \tag{6}$$
Similarly, the adversarial loss between $M(z)$ and $\mathcal{T}$ can be written as
$$\mathcal{L}_{adv}(G_{ST}, D_T) = \mathbb{E}_{y \sim P_T}[\log D_T(y)] + \mathbb{E}_{x \sim P_S}[\log(1 - D_T(G_{ST}(x, z)))]. \tag{7}$$
Deploying the above losses as $\mathrm{dist}(P_S, P_{M(z)})$ and $\mathrm{dist}(P_T, P_{M(z)})$ in Eq. (2), we derive the following loss:
$$\mathcal{L}_{adv}(G_{ST}, D_S, D_T) = z \cdot \mathcal{L}_{adv}(G_{ST}, D_T) + (1 - z) \cdot \mathcal{L}_{adv}(G_{ST}, D_S). \tag{8}$$
Considering the other direction, from $\mathcal{T}$ to $M(1 - z)$, we can define an analogous loss $\mathcal{L}_{adv}(G_{TS}, D_S, D_T)$. The total adversarial loss is then
$$\mathcal{L}_{adv} = \mathcal{L}_{adv}(G_{ST}, D_S, D_T) + \mathcal{L}_{adv}(G_{TS}, D_S, D_T). \tag{9}$$
Modification of Adversarial Loss
In the loss discussed above, $D_S$ is trained to assign a high value (i.e., 1) to $x$ and a low value (i.e., 0) to $G_{ST}(x, z)$, and similarly for $D_T$. But when $z$ is small, $G_{ST}(x, z)$ is close to the source-domain data, and training the discriminator to assign 0 to it would be too aggressive. In our model, we instead train the discriminator $D_S$ to assign $1 - z$ to $G_{ST}(x, z)$. When $z = 0$, $G_{ST}(x, z) = x$ and $D_S$ is trained to assign 1 to it. $G_{ST}$ and $G_{TS}$ are trained to fool the discriminator $D_S$, trying to make $D_S(G_{TS}(y, z))$ close to 1 and $D_S(G_{ST}(x, z))$ close to $z$.
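A hedged sketch of these soft discriminator targets using binary cross-entropy (our reading of the scheme above; the sigmoid-output discriminator and all function names are assumptions, not the paper's implementation):

```python
import torch
import torch.nn.functional as F

def discriminator_loss_DS(D_S, x_real, x_fake, z):
    """Train D_S (assumed to output sigmoid probabilities) to assign 1 to real
    source data and the soft target 1 - z, rather than 0, to G_ST(x, z)."""
    p_real, p_fake = D_S(x_real), D_S(x_fake.detach())
    return (F.binary_cross_entropy(p_real, torch.ones_like(p_real))
            + F.binary_cross_entropy(p_fake, torch.full_like(p_fake, 1.0 - z)))

def generator_loss_DS(D_S, fake_from_T, fake_from_S, z):
    """Generators try to fool D_S: push D_S(G_TS(y, z)) toward 1 and
    D_S(G_ST(x, z)) toward z, as described in the text above."""
    p_ts, p_st = D_S(fake_from_T), D_S(fake_from_S)
    return (F.binary_cross_entropy(p_ts, torch.ones_like(p_ts))
            + F.binary_cross_entropy(p_st, torch.full_like(p_st, float(z))))
```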
Besides the adversarial loss, the cycle-consistency loss of the CycleGAN is defined as:
$$\mathcal{L}_{cyc}(G_{ST}, G_{TS}) = \mathbb{E}_{x \sim P_S}\|G_{TS}(G_{ST}(x, z), z) - x\|^2 + \mathbb{E}_{y \sim P_T}\|G_{ST}(G_{TS}(y, z), z) - y\|^2. \tag{10}$$
Following the model structure of Mohiuddin and Joty (2019), we deploy the domain flow on the latent space obtained from two autoencoders, i.e., we replace $x$ and $y$ with $\mathrm{Enc}_S(x)$ and $\mathrm{Enc}_T(y)$ in the losses above. An additional reconstruction loss for the autoencoders is defined as:
$$\mathcal{L}_{rec} = \mathbb{E}_{x \sim P_S}\|\mathrm{Dec}_S(\mathrm{Enc}_S(x)) - x\|^2 + \mathbb{E}_{y \sim P_T}\|\mathrm{Dec}_T(\mathrm{Enc}_T(y)) - y\|^2. \tag{11}$$
The total loss is then
$$\mathcal{L} = \mathcal{L}_{adv} + \lambda_1 \cdot \mathcal{L}_{cyc} + \lambda_2 \cdot \mathcal{L}_{rec}, \tag{12}$$
where $\lambda_1$ and $\lambda_2$ are hyperparameters.
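Putting Eqs. (10)–(12) together, a hedged sketch of how the cycle, reconstruction, and total losses might be assembled on the latent codes; the encoder/decoder modules and the default $\lambda$ values are placeholders, not the paper's reported settings:

```python
def total_loss(adv_loss, x, y, G_ST, G_TS, Enc_S, Dec_S, Enc_T, Dec_T, z,
               lam1=1.0, lam2=1.0):  # lam1/lam2 defaults are placeholders
    hx, hy = Enc_S(x), Enc_T(y)  # domain flow operates on the latent codes
    # Cycle-consistency loss on the latent space, Eq. (10).
    cyc = ((G_TS(G_ST(hx, z), z) - hx).pow(2).sum(-1).mean()
           + (G_ST(G_TS(hy, z), z) - hy).pow(2).sum(-1).mean())
    # Autoencoder reconstruction loss, Eq. (11).
    rec = ((Dec_S(hx) - x).pow(2).sum(-1).mean()
           + (Dec_T(hy) - y).pow(2).sum(-1).mean())
    return adv_loss + lam1 * cyc + lam2 * rec  # total loss, Eq. (12)
```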
Choice of z
In our model, $z$ is sampled from a beta distribution $f(z; \alpha, \beta) = \frac{1}{B(\alpha, \beta)} z^{\alpha - 1} (1 - z)^{\beta - 1}$, where $B(\cdot, \cdot)$ is the Beta function, $\beta$ is fixed to 1, and $\alpha$ is a function of the training iteration. Specifically, $\alpha = e^{\frac{t - 0.5T}{0.25T}}$, where $t$ is the current iteration and $T$ is the total number of iterations. Under this schedule, $z$ tends to take small values at the beginning of training and gradually shifts toward larger values. In practice, we fix $z = 1$ in the last several epochs to fine-tune the model: when running 10 epochs, $z$ is fixed to 1 in the last 3 epochs; when running 20 or 30 epochs, $z$ is fixed to 1 in the last 5 epochs.
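The schedule is easy to implement; here is a small sketch under the stated formula (the fine-tuning flag is our paraphrase of the epoch cutoffs above, and the function name is illustrative):

```python
import math
import numpy as np

def sample_z(t, T, finetune=False, rng=np.random.default_rng()):
    """Sample z ~ Beta(alpha, 1) with alpha = exp((t - 0.5*T) / (0.25*T)).
    Small alpha early in training skews z toward 0; large alpha later skews
    it toward 1. In the final fine-tuning epochs, z is fixed to 1."""
    if finetune:
        return 1.0
    alpha = math.exp((t - 0.5 * T) / (0.25 * T))
    return float(rng.beta(alpha, 1.0))
```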
3 Bilingual Lexicon Induction
3.1 Experimental Setup
Bilingual Lexicon Induction (BLI) has become the de facto standard evaluation for mapping-based CLWEs (Ruder et al., 2019; Glavas et al., 2019; Vulic et al., 2019). Given a shared CLWE space