IsoVec: Controlling the Relative Isomorphism of Word Embedding Spaces
Kelly Marchisio, Neha Verma, Kevin Duh, and Philipp Koehn
Johns Hopkins University
{kmarc, nverma7}@jhu.edu, kevinduh@cs.jhu.edu, phi@jhu.edu
Abstract
The ability to extract high-quality translation dictionaries from monolingual word embedding spaces depends critically on the geometric similarity of the spaces, that is, their degree of "isomorphism." We address the root cause of faulty cross-lingual mapping: that word embedding training resulted in the underlying spaces being non-isomorphic. We incorporate global measures of isomorphism directly into the Skip-gram loss function, successfully increasing the relative isomorphism of trained word embedding spaces and improving their ability to be mapped to a shared cross-lingual space. The result is improved bilingual lexicon induction in general data conditions, under domain mismatch, and with training algorithm dissimilarities. We release IsoVec at https://github.com/kellymarchisio/isovec.
1 Introduction
The task of extracting a translation dictionary from word embedding spaces, called "bilingual lexicon induction" (BLI), is a common task in the natural language processing literature. Bilingual dictionaries are useful in their own right as linguistic resources, and automatically generated dictionaries may be particularly helpful for low-resource languages for which human-curated dictionaries are unavailable. BLI is also used as an extrinsic evaluation task to assess the quality of cross-lingual spaces. If a high-quality translation dictionary can be automatically extracted from a shared embedding space, intuition says that the space is high-quality and useful for downstream tasks.
"Mapping-based" methods are one way to create cross-lingual embedding spaces. Separately-trained monolingual embeddings are mapped to a shared space by applying a linear transformation to one or both spaces, after which a bilingual lexicon can be extracted via nearest-neighbor search (e.g., Mikolov et al., 2013b; Lample et al., 2018; Artetxe et al., 2018b; Joulin et al., 2018; Patra et al., 2019).
Mapping methods are effective for closely-related languages with embedding spaces trained on high-quality, domain-matched data even without supervision, but critically rely on the "approximate isomorphism assumption": that monolingual embedding spaces are geometrically similar.[1] Problematically, researchers have observed that the isomorphism assumption weakens substantially as languages and domains become dissimilar, leading to failure precisely where unsupervised methods might be helpful (e.g., Søgaard et al., 2018; Ormazabal et al., 2019; Glavaš et al., 2019; Vulić et al., 2019; Patra et al., 2019; Marchisio et al., 2020).
Existing work attributes non-isomorphism to linguistic, algorithmic, data size, or domain differences in training data for source and target languages. From Søgaard et al. (2018), "the performance of unsupervised BDI [BLI] depends heavily on... language pair, the comparability of the monolingual corpora, and the parameters of the word embedding algorithms." Several authors found that unsupervised machine translation methods suffer under similar data shifts (Marchisio et al., 2020; Kim et al., 2020; Marie and Fujita, 2020).
While such factors do result in low isomorphism of spaces trained with traditional methods, we needn't resign ourselves to the mercy of the geometry a training methodology naturally produces. While multiple works post-process embeddings or map non-linearly, we control similarity explicitly during embedding training by incorporating five global metrics of isomorphism into the Skip-gram loss function. Our three supervised and two unsupervised losses gain some control of the relative isomorphism of word embedding spaces, compensating for data mismatch and creating spaces that are linearly mappable where previous methods failed.
[1] In formal mathematics, "isomorphic" requires two objects to have an invertible correspondence between them. Researchers in NLP loosen the definition to "geometrically similar", and consider degrees of similarity. We might say that space X is more isomorphic to space Y than is space Z.
[Figure 1 schematic: trained Source Embeddings and fixed Reference Embeddings feed a joint embedding space; Loss = Loss_Skip-gram + Loss_Isomorphism, with example panels Proc-L2 and RSIM-U = Correlation(edges_Source, edges_Ref).]

Figure 1: Proposed Method. Loss is a weighted combination of Skip-gram with negative sampling loss (seen left, with a reproduction of the familiar image from Mikolov et al. (2013a) for reader recognizability) and an isomorphism loss (seen right, ours) calculated in relation to a fixed reference space. Gray boxes are two possibilities explored in this work: Proc-L2 (supervised), where $L_{ISO}$ is calculated over given seed translations, and RSIM-U (unsupervised).
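To make the weighted combination concrete, below is a minimal, hypothetical PyTorch-style sketch of a supervised Proc-L2 penalty added to a Skip-gram objective; the names sgns_loss, alpha, src_seed_vecs, and ref_seed_vecs are illustrative assumptions rather than the released IsoVec interface, and the closed-form $W$ used here is derived in Section 3.1.

```python
# Hypothetical sketch (not the released IsoVec code): a supervised Proc-L2
# isomorphism penalty that could be added to a Skip-gram training loss.
import torch

def proc_l2_penalty(src_seed_vecs: torch.Tensor,
                    ref_seed_vecs: torch.Tensor) -> torch.Tensor:
    """||XW - Y||_F after orthogonal Procrustes alignment of seed-pair vectors."""
    # W = V U^T, where U Sigma V^T is the SVD of Y^T X (Schoenemann, 1966).
    u, _, vt = torch.linalg.svd(ref_seed_vecs.T @ src_seed_vecs)
    w = vt.T @ u.T
    return torch.linalg.norm(src_seed_vecs @ w - ref_seed_vecs)

# sgns_loss: Skip-gram with negative sampling loss from the trainer (assumed).
# alpha: a hypothetical mixing weight for the isomorphism term.
# total_loss = sgns_loss + alpha * proc_l2_penalty(src_seed_vecs, ref_seed_vecs)
```

The unsupervised RSIM-U variant in the figure would instead use a correlation-based term over intra-space similarities, as described in Section 3.3.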
2 Related Work
Cross-Lingual Word Embeddings There is a broad literature on creating cross-lingual word embedding spaces. Two major paradigms are "mapping-based" methods, which find a linear transformation to map monolingual embedding spaces to a shared space (e.g., Artetxe et al., 2016, 2017; Alvarez-Melis and Jaakkola, 2018; Doval et al., 2018; Jawanpuria et al., 2019), and "joint-training" which, as stated in the enlightening survey by Ruder et al. (2019), "minimize the source and target language monolingual losses jointly with the cross-lingual regularization term" (e.g., Luong et al., 2015; see Ruder et al. (2019) for a review). Gouws et al. (2015) train Skip-gram for source and target languages simultaneously, enforcing an L2 loss for known translations. Wang et al. (2020) compare and combine joint and mapping approaches.
More recently, researchers have explored massively multilingual language models (Devlin et al., 2019; Conneau et al., 2020). While these have been shown to possess some inherent cross-lingual transfer ability (Wu and Dredze, 2019), another line of work focuses on improving their cross-lingual representations with explicit cross-lingual signal (Wang et al., 2019; Liu et al., 2019; Cao et al., 2020; Kulshreshtha et al., 2020; Wu and Dredze, 2020). Recently, Li et al. (2022) combined static and pretrained multilingual embeddings for BLI.
Handling Non-Isomorphism Miceli Barone (2016) explores whether comparable corpora induce embedding spaces which are approximately isomorphic. Ormazabal et al. (2019) compare cross-lingual word embeddings induced via mapping methods and jointly-trained embeddings from Luong et al. (2015), finding that the latter are better in measures of isomorphism and BLI precision. Nakashole and Flauger (2018) argue that word embedding spaces are not globally linearly-mappable. Others use non-linear mappings (e.g., Mohiuddin et al., 2020; Glavaš and Vulić, 2020) or post-process embeddings after training to improve quality (e.g., Peng et al., 2021; Faruqui et al., 2015; Mu and Viswanath, 2018). Eder et al. (2021) initialize a target embedding space with vectors from a higher-resource source space, then train the low-resource target. Zhang et al. (2017) minimize earth mover's distance over 50-dimensional pretrained word2vec embeddings. Ormazabal et al. (2021) learn source embeddings in reference to fixed target embeddings, given known or hypothesized translation pairs induced via self-learning.
Examining & Exploiting Embedding Geometry Emerging literature examines geometric properties of embedding spaces. In addition to isomorphism, some examine isotropy (e.g., Mimno and Thompson, 2017; Mu and Viswanath, 2018; Ethayarajh, 2019; Rajaee and Pilehvar, 2022; Rudman et al., 2022). Li et al. (2020) transform the semantic space from masked language models into an isotropic Gaussian distribution from a non-smooth anisotropic space. Su et al. (2021) apply whitening and dimensionality reduction to improve isotropy. Zhang et al. (2022) inject isotropy into a variational autoencoder, and Ethayarajh and Jurafsky (2021) recommend "adding an anisotropy penalty to the language modelling objective" as future work.
3 Background
We discuss the mathematical background used in our methods. Throughout, $X \in \mathbb{R}^{n \times d}$ and $Y \in \mathbb{R}^{m \times d}$ are the source and target word embedding spaces of $d$-dimensional word vectors, respectively. We may assume seed pairs $\{(x_0, y_0), (x_1, y_1), \ldots, (x_s, y_s)\}$ are given.
3.1 The Orthogonal Procrustes Problem
Schönemann (1966) derived the solution to the orthogonal Procrustes problem, whose goal is to find the linear transformation $W$ that solves:

$$\arg\min_{W \in \mathbb{R}^{d \times d},\; W^T W = I} \|XW - Y\|_F^2$$

The solution is $W = VU^T$, where $U \Sigma V^T$ is the singular value decomposition of $Y^T X$. If $X$ is a matrix of vectors corresponding to seed words $x_i$ in $\{(x_0, y_0), (x_1, y_1), \ldots, (x_s, y_s)\}$ and $Y$ is a matrix of the corresponding $y_i$, then $W$ is the linear transformation that minimizes the difference between the vector representations of known pairs.
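The following is a minimal NumPy sketch of this closed-form solution, under the assumption that X and Y are s x d matrices whose i-th rows are the vectors of the i-th seed pair.

```python
import numpy as np

def procrustes(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
    """Return the orthogonal W minimizing ||XW - Y||_F (Schoenemann, 1966)."""
    u, _, vt = np.linalg.svd(Y.T @ X)  # U Sigma V^T = SVD(Y^T X)
    return vt.T @ u.T                  # W = V U^T

# Quick check: a planted orthogonal map is recovered exactly from its images.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))
W_true, _ = np.linalg.qr(rng.normal(size=(50, 50)))  # random orthogonal matrix
Y = X @ W_true
print(np.allclose(X @ procrustes(X, Y), Y))  # True
```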
3.2 Embedding Space Mapping with VecMap
We use the popular VecMap toolkit (https://github.com/artetxem/vecmap) for embedding space mapping, which can be run in supervised, semi-supervised, and unsupervised modes. At the time of its writing, Glavaš et al. (2019) deemed VecMap the most robust unsupervised method.

First, source and target word embeddings are unit-normed, mean-centered, and unit-normed again (Zhang et al., 2019). The bilingual lexicon is induced by whitening each space and then solving a variant of the orthogonal Procrustes problem (see Appendix A.1, A.2 for details). Spaces are reweighted and dewhitened, and translation pairs are extracted via nearest-neighbor search from the mapped embedding spaces. See the original works and implementation for details (Artetxe et al., 2018a,b).
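As a rough illustration, the normalization sequence described above can be sketched as follows; the helper name is ours, and the remaining VecMap steps (whitening, reweighting, dewhitening) are not shown.

```python
import numpy as np

def vecmap_normalize(emb: np.ndarray) -> np.ndarray:
    """Unit-norm, mean-center, then unit-norm again (row-wise word vectors)."""
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)   # unit length
    emb = emb - emb.mean(axis=0, keepdims=True)              # mean-center
    return emb / np.linalg.norm(emb, axis=1, keepdims=True)  # unit length again
```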
Unsupervised and semi-supervised modes utilize the same framework as supervised mode, but with an iterative self-learning procedure that repeatedly solves the orthogonal Procrustes problem over hypothesized translations. On each iteration, new hypotheses are extracted. The modes differ only in how they induce the initial hypothesis seed pairs. In semi-supervised mode, this is a given input seed dictionary. In unsupervised mode, similarity matrices $M_x = XX^T$ and $M_z = ZZ^T$ are created over the first $n$ vocabulary words (default: $n = 4000$). Word $z_j$ is the assumed translation of $x_i$ if vector $M_{z_j}$ is most similar to $M_{x_i}$ compared to all others in $M_z$. See Artetxe et al. (2018b) for details.
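A rough sketch of this unsupervised initialization, following the simplified description above; the actual implementation in Artetxe et al. (2018b) includes further refinements (e.g., operating on sorted similarity rows) that are omitted here, and the function name is ours.

```python
import numpy as np

def initial_hypotheses(X: np.ndarray, Z: np.ndarray, n: int = 4000) -> np.ndarray:
    """For each of the first n source words, guess the target word whose
    similarity profile M_z[j] is closest to M_x[i]."""
    Mx = X[:n] @ X[:n].T   # source intra-lingual similarity matrix
    Mz = Z[:n] @ Z[:n].T   # target intra-lingual similarity matrix
    Mx = Mx / np.linalg.norm(Mx, axis=1, keepdims=True)
    Mz = Mz / np.linalg.norm(Mz, axis=1, keepdims=True)
    # Index of the hypothesized translation for each source word.
    return (Mx @ Mz.T).argmax(axis=1)
```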
3.3 Isomorphism Metrics
In NLP, relative isomorphism is often measured by Relational Similarity, Eigenvector Similarity, and Gromov-Hausdorff Distance. We describe these metrics in detail in this section.
Relational Similarity Given seed translation pairs, calculate the pairwise cosine similarities within each space:

  cos(x_0, x_1)   cos(y_0, y_1)
  cos(x_0, x_2)   cos(y_0, y_2)
  cos(x_0, x_3)   cos(y_0, y_3)
  ...             ...
  cos(x_1, x_0)   cos(y_1, y_0)
  cos(x_1, x_2)   cos(y_1, y_2)
  ...             ...
  cos(x_s, x_s)   cos(y_s, y_s)

The Pearson's correlation between the two lists of cosine similarities is known as Relational Similarity (Vulić et al., 2020; Zhang et al., 2019).
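A minimal NumPy sketch of Relational Similarity over seed pairs; counting each unordered pair once and excluding self-pairs are our simplifying assumptions.

```python
import numpy as np

def relational_similarity(X_seed: np.ndarray, Y_seed: np.ndarray) -> float:
    """Pearson correlation of within-space pairwise cosine similarities.
    X_seed[i] and Y_seed[i] are the vectors of the i-th seed translation pair."""
    def cosines(E):
        E = E / np.linalg.norm(E, axis=1, keepdims=True)
        sims = E @ E.T
        iu = np.triu_indices_from(sims, k=1)  # each unordered pair once
        return sims[iu]
    return float(np.corrcoef(cosines(X_seed), cosines(Y_seed))[0, 1])
```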
Eigenvector Similarity (Søgaard et al., 2018) measures isomorphism between two spaces based on the Laplacian spectra of their $k$-nearest neighbor ($k$-NN) graphs. For seeds $\{x_0, x_1, \ldots, x_s\}$ and $\{y_0, y_1, \ldots, y_s\}$, we compute unweighted $k$-NN graphs $G_X$ and $G_Y$, then compute the graph Laplacians ($L_G$) for both graphs (the degree matrix minus the adjacency matrix: $L_G = D_G - A_G$). We then compute the eigenvalues of $L_{G_X}$ and $L_{G_Y}$, namely $\{\lambda_{L_{G_X}}(i)\}$ and $\{\lambda_{L_{G_Y}}(i)\}$. We select $l = \min(l_X, l_Y)$, where $l_X$ is the maximum $l$ such that the first $l$ eigenvalues of $L_{G_X}$ sum to less than 90% of the total sum of the eigenvalues. EVS is the sum of squared differences between the partial spectra:

$$\mathrm{EVS} = \sum_{i=1}^{l} \left(\lambda_{L_{G_X}}(i) - \lambda_{L_{G_Y}}(i)\right)^2$$
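A sketch of this metric under our own implementation assumptions (cosine-based, symmetrized k-NN graphs with k left as a free parameter); consult Søgaard et al. (2018) for the exact construction.

```python
import numpy as np

def knn_adjacency(E: np.ndarray, k: int = 10) -> np.ndarray:
    """Unweighted, symmetrized k-NN adjacency matrix from cosine similarity."""
    E = E / np.linalg.norm(E, axis=1, keepdims=True)
    sims = E @ E.T
    np.fill_diagonal(sims, -np.inf)               # exclude self-edges
    nn = np.argsort(-sims, axis=1)[:, :k]         # k nearest neighbors per node
    A = np.zeros_like(sims)
    A[np.repeat(np.arange(len(E)), k), nn.ravel()] = 1.0
    return np.maximum(A, A.T)

def laplacian_spectrum(A: np.ndarray) -> np.ndarray:
    L = np.diag(A.sum(axis=1)) - A                # L_G = D_G - A_G
    return np.sort(np.linalg.eigvalsh(L))[::-1]   # eigenvalues, largest first

def eigenvector_similarity(X_seed: np.ndarray, Y_seed: np.ndarray, k: int = 10) -> float:
    ex = laplacian_spectrum(knn_adjacency(X_seed, k))
    ey = laplacian_spectrum(knn_adjacency(Y_seed, k))
    def cutoff(vals):
        # max l such that the first l eigenvalues sum to < 90% of the total
        return int(np.searchsorted(np.cumsum(vals) / vals.sum(), 0.9))
    l = max(min(cutoff(ex), cutoff(ey)), 1)
    return float(np.sum((ex[:l] - ey[:l]) ** 2))
```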