IsoVec: Controlling the Relative Isomorphism of Word Embedding Spaces
Kelly Marchisio, Neha Verma, Kevin Duh, and Philipp Koehn
Johns Hopkins University
{kmarc, nverma7}@jhu.edu, kevinduh@cs.jhu.edu, phi@jhu.edu
Abstract
The ability to extract high-quality translation dictionaries from monolingual word embedding spaces depends critically on the geometric similarity of the spaces, that is, their degree of "isomorphism." We address the root cause of faulty cross-lingual mapping: that word embedding training resulted in the underlying spaces being non-isomorphic. We incorporate global measures of isomorphism directly into the Skip-gram loss function, successfully increasing the relative isomorphism of trained word embedding spaces and improving their ability to be mapped to a shared cross-lingual space. The result is improved bilingual lexicon induction in general data conditions, under domain mismatch, and with training algorithm dissimilarities. We release IsoVec at https://github.com/kellymarchisio/isovec.
1 Introduction
The task of extracting a translation dictionary from word embedding spaces, called "bilingual lexicon induction" (BLI), is a common task in the natural language processing literature. Bilingual dictionaries are useful in their own right as linguistic resources, and automatically generated dictionaries may be particularly helpful for low-resource languages for which human-curated dictionaries are unavailable. BLI is also used as an extrinsic evaluation task to assess the quality of cross-lingual spaces. If a high-quality translation dictionary can be automatically extracted from a shared embedding space, intuition says that the space is high-quality and useful for downstream tasks.
"Mapping-based" methods are one way to create cross-lingual embedding spaces. Separately-trained monolingual embeddings are mapped to a shared space by applying a linear transformation to one or both spaces, after which a bilingual lexicon can be extracted via nearest-neighbor search (e.g., Mikolov et al., 2013b; Lample et al., 2018; Artetxe et al., 2018b; Joulin et al., 2018; Patra et al., 2019).
Mapping methods are effective for closely-related languages with embedding spaces trained on high-quality, domain-matched data even without supervision, but critically rely on the "approximate isomorphism assumption": that monolingual embedding spaces are geometrically similar.[1] Problematically, researchers have observed that the isomorphism assumption weakens substantially as languages and domains become dissimilar, leading to failure precisely where unsupervised methods might be helpful (e.g., Søgaard et al., 2018; Ormazabal et al., 2019; Glavaš et al., 2019; Vulić et al., 2019; Patra et al., 2019; Marchisio et al., 2020).
Existing work attributes non-isomorphism to linguistic, algorithmic, data size, or domain differences in training data for source and target languages. From Søgaard et al. (2018), "the performance of unsupervised BDI [BLI] depends heavily on... language pair, the comparability of the monolingual corpora, and the parameters of the word embedding algorithms." Several authors found that unsupervised machine translation methods suffer under similar data shifts (Marchisio et al., 2020; Kim et al., 2020; Marie and Fujita, 2020).
While such factors do result in low isomorphism of spaces trained with traditional methods, we needn't resign ourselves to the mercy of the geometry a training methodology naturally produces. While multiple works post-process embeddings or map non-linearly, we control similarity explicitly during embedding training by incorporating five global metrics of isomorphism into the Skip-gram loss function. Our three supervised and two unsupervised losses gain some control of the relative isomorphism of word embedding spaces, compensating for data mismatch and creating spaces that are linearly mappable where previous methods failed.
[1] In formal mathematics, "isomorphic" requires two objects to have an invertible correspondence between them. Researchers in NLP loosen the definition to "geometrically similar", and consider degrees of similarity. We might say that space X is more isomorphic to space Y than is space Z.
[Figure 1 schematic: trained Source Embeddings and fixed Reference Embeddings feed a joint embedding space; Loss = Loss_Skip-gram + Loss_Isomorphism, with example panels Proc-L2 and RSIM-U = Correlation(edges_Source, edges_Ref).]

Figure 1: Proposed Method. Loss is a weighted combination of Skip-gram with negative sampling loss (seen left, with a reproduction of the familiar image from Mikolov et al. (2013a) for reader recognizability) and an isomorphism loss (seen right, ours) calculated in relation to a fixed reference space. Gray boxes are two possibilities explored in this work: Proc-L2 (supervised), where $L_{ISO}$ is calculated over given seed translations, and RSIM-U (unsupervised).
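To make the weighted combination concrete, below is a minimal, hypothetical PyTorch-style sketch of a supervised Proc-L2 penalty added to a Skip-gram objective; the names sgns_loss, alpha, src_seed_vecs, and ref_seed_vecs are illustrative assumptions rather than the released IsoVec interface, and the closed-form $W$ used here is derived in Section 3.1.

```python
# Hypothetical sketch (not the released IsoVec code): a supervised Proc-L2
# isomorphism penalty that could be added to a Skip-gram training loss.
import torch

def proc_l2_penalty(src_seed_vecs: torch.Tensor,
                    ref_seed_vecs: torch.Tensor) -> torch.Tensor:
    """||XW - Y||_F after orthogonal Procrustes alignment of seed-pair vectors."""
    # W = V U^T, where U Sigma V^T is the SVD of Y^T X (Schoenemann, 1966).
    u, _, vt = torch.linalg.svd(ref_seed_vecs.T @ src_seed_vecs)
    w = vt.T @ u.T
    return torch.linalg.norm(src_seed_vecs @ w - ref_seed_vecs)

# sgns_loss: Skip-gram with negative sampling loss from the trainer (assumed).
# alpha: a hypothetical mixing weight for the isomorphism term.
# total_loss = sgns_loss + alpha * proc_l2_penalty(src_seed_vecs, ref_seed_vecs)
```

The unsupervised RSIM-U variant in the figure would instead use a correlation-based term over intra-space similarities, as described in Section 3.3.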
2 Related Work
Cross-Lingual Word Embeddings There is a broad literature on creating cross-lingual word embedding spaces. Two major paradigms are "mapping-based" methods, which find a linear transformation to map monolingual embedding spaces to a shared space (e.g., Artetxe et al., 2016, 2017; Alvarez-Melis and Jaakkola, 2018; Doval et al., 2018; Jawanpuria et al., 2019), and "joint-training" which, as stated in the enlightening survey by Ruder et al. (2019), "minimize the source and target language monolingual losses jointly with the cross-lingual regularization term" (e.g., Luong et al., 2015; see Ruder et al. (2019) for a review). Gouws et al. (2015) train Skip-gram for source and target languages simultaneously, enforcing an L2 loss for known translations. Wang et al. (2020) compare and combine joint and mapping approaches.
More recently, researchers have explored massively multilingual language models (Devlin et al., 2019; Conneau et al., 2020). While these have been shown to possess some inherent cross-lingual transfer ability (Wu and Dredze, 2019), another line of work focuses on improving their cross-lingual representations with explicit cross-lingual signal (Wang et al., 2019; Liu et al., 2019; Cao et al., 2020; Kulshreshtha et al., 2020; Wu and Dredze, 2020). Recently, Li et al. (2022) combined static and pretrained multilingual embeddings for BLI.
Handling Non-Isomorphism Miceli Barone (2016) explores whether comparable corpora induce embedding spaces which are approximately isomorphic. Ormazabal et al. (2019) compare cross-lingual word embeddings induced via mapping methods and jointly-trained embeddings from Luong et al. (2015), finding that the latter are better in measures of isomorphism and BLI precision. Nakashole and Flauger (2018) argue that word embedding spaces are not globally linearly-mappable. Others use non-linear mappings (e.g., Mohiuddin et al., 2020; Glavaš and Vulić, 2020) or post-process embeddings after training to improve quality (e.g., Peng et al., 2021; Faruqui et al., 2015; Mu and Viswanath, 2018). Eder et al. (2021) initialize a target embedding space with vectors from a higher-resource source space, then train the low-resource target. Zhang et al. (2017) minimize earth mover's distance over 50-dimensional pretrained word2vec embeddings. Ormazabal et al. (2021) learn source embeddings in reference to fixed target embeddings, given known or hypothesized translation pairs induced via self-learning.
Examining & Exploiting Embedding Geometry Emerging literature examines geometric properties of embedding spaces. In addition to isomorphism, some examine isotropy (e.g., Mimno and Thompson, 2017; Mu and Viswanath, 2018; Ethayarajh, 2019; Rajaee and Pilehvar, 2022; Rudman et al., 2022). Li et al. (2020) transform the semantic space from masked language models into an isotropic Gaussian distribution from a non-smooth anisotropic space. Su et al. (2021) apply whitening and dimensionality reduction to improve isotropy. Zhang et al. (2022) inject isotropy into a variational autoencoder, and Ethayarajh and Jurafsky (2021) recommend "adding an anisotropy penalty to the language modelling objective" as future work.
3 Background
We discuss the mathematical background used in our methods. Throughout, $X \in \mathbb{R}^{n \times d}$ and $Y \in \mathbb{R}^{m \times d}$ are the source and target word embedding spaces of $d$-dimensional word vectors, respectively. We may assume seed pairs $\{(x_0, y_0), (x_1, y_1), \ldots, (x_s, y_s)\}$ are given.
3.1 The Orthogonal Procrustes Problem
Schönemann (1966) derived the solution to the orthogonal Procrustes problem, whose goal is to find the linear transformation $W$ that solves:

$$\arg\min_{W \in \mathbb{R}^{d \times d},\; W^T W = I} \|XW - Y\|_F^2$$

The solution is $W = VU^T$, where $U \Sigma V^T$ is the singular value decomposition of $Y^T X$. If $X$ is a matrix of vectors corresponding to seed words $x_i$ in $\{(x_0, y_0), (x_1, y_1), \ldots, (x_s, y_s)\}$ and $Y$ is a matrix of the corresponding $y_i$, then $W$ is the linear transformation that minimizes the difference between the vector representations of known pairs.
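The following is a minimal NumPy sketch of this closed-form solution, under the assumption that X and Y are s x d matrices whose i-th rows are the vectors of the i-th seed pair.

```python
import numpy as np

def procrustes(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
    """Return the orthogonal W minimizing ||XW - Y||_F (Schoenemann, 1966)."""
    u, _, vt = np.linalg.svd(Y.T @ X)  # U Sigma V^T = SVD(Y^T X)
    return vt.T @ u.T                  # W = V U^T

# Quick check: a planted orthogonal map is recovered exactly from its images.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))
W_true, _ = np.linalg.qr(rng.normal(size=(50, 50)))  # random orthogonal matrix
Y = X @ W_true
print(np.allclose(X @ procrustes(X, Y), Y))  # True
```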
3.2 Embedding Space Mapping with VecMap
We use the popular VecMap toolkit (https://github.com/artetxem/vecmap) for embedding space mapping, which can be run in supervised, semi-supervised, and unsupervised modes. At the time of its writing, Glavaš et al. (2019) deemed VecMap the most robust unsupervised method.

First, source and target word embeddings are unit-normed, mean-centered, and unit-normed again (Zhang et al., 2019). The bilingual lexicon is induced by whitening each space and then solving a variant of the orthogonal Procrustes problem (see Appendix A.1, A.2 for details). Spaces are reweighted and dewhitened, and translation pairs are extracted via nearest-neighbor search from the mapped embedding spaces. See the original works and implementation for details (Artetxe et al., 2018a,b).
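As a rough illustration, the normalization sequence described above can be sketched as follows; the helper name is ours, and the remaining VecMap steps (whitening, reweighting, dewhitening) are not shown.

```python
import numpy as np

def vecmap_normalize(emb: np.ndarray) -> np.ndarray:
    """Unit-norm, mean-center, then unit-norm again (row-wise word vectors)."""
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)   # unit length
    emb = emb - emb.mean(axis=0, keepdims=True)              # mean-center
    return emb / np.linalg.norm(emb, axis=1, keepdims=True)  # unit length again
```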
Unsupervised and semi-supervised modes utilize the same framework as supervised mode, but with an iterative self-learning procedure that repeatedly solves the orthogonal Procrustes problem over hypothesized translations. On each iteration, new hypotheses are extracted. The modes differ only in how they induce the initial hypothesis seed pairs. In semi-supervised mode, this is a given input seed dictionary. In unsupervised mode, similarity matrices $M_x = XX^T$ and $M_z = ZZ^T$ are created over the first $n$ vocabulary words (default: $n = 4000$). Word $z_j$ is the assumed translation of $x_i$ if vector $M_{z_j}$ is most similar to $M_{x_i}$ compared to all others in $M_z$. See Artetxe et al. (2018b) for details.
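A rough sketch of this unsupervised initialization, following the simplified description above; the actual implementation in Artetxe et al. (2018b) includes further refinements (e.g., operating on sorted similarity rows) that are omitted here, and the function name is ours.

```python
import numpy as np

def initial_hypotheses(X: np.ndarray, Z: np.ndarray, n: int = 4000) -> np.ndarray:
    """For each of the first n source words, guess the target word whose
    similarity profile M_z[j] is closest to M_x[i]."""
    Mx = X[:n] @ X[:n].T   # source intra-lingual similarity matrix
    Mz = Z[:n] @ Z[:n].T   # target intra-lingual similarity matrix
    Mx = Mx / np.linalg.norm(Mx, axis=1, keepdims=True)
    Mz = Mz / np.linalg.norm(Mz, axis=1, keepdims=True)
    # Index of the hypothesized translation for each source word.
    return (Mx @ Mz.T).argmax(axis=1)
```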
3.3 Isomorphism Metrics
In NLP, relative isomorphism is often measured by Relational Similarity, Eigenvector Similarity, and Gromov-Hausdorff Distance. We describe these metrics in detail in this section.
Relational Similarity Given seed translation pairs, calculate the pairwise cosine similarities within each space:

  cos(x_0, x_1)   cos(y_0, y_1)
  cos(x_0, x_2)   cos(y_0, y_2)
  cos(x_0, x_3)   cos(y_0, y_3)
  ...             ...
  cos(x_1, x_0)   cos(y_1, y_0)
  cos(x_1, x_2)   cos(y_1, y_2)
  ...             ...
  cos(x_s, x_s)   cos(y_s, y_s)

The Pearson's correlation between the two lists of cosine similarities is known as Relational Similarity (Vulić et al., 2020; Zhang et al., 2019).
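A minimal NumPy sketch of Relational Similarity over seed pairs; counting each unordered pair once and excluding self-pairs are our simplifying assumptions.

```python
import numpy as np

def relational_similarity(X_seed: np.ndarray, Y_seed: np.ndarray) -> float:
    """Pearson correlation of within-space pairwise cosine similarities.
    X_seed[i] and Y_seed[i] are the vectors of the i-th seed translation pair."""
    def cosines(E):
        E = E / np.linalg.norm(E, axis=1, keepdims=True)
        sims = E @ E.T
        iu = np.triu_indices_from(sims, k=1)  # each unordered pair once
        return sims[iu]
    return float(np.corrcoef(cosines(X_seed), cosines(Y_seed))[0, 1])
```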
Eigenvector Similarity (Søgaard et al., 2018) measures isomorphism between two spaces based on the Laplacian spectra of their $k$-nearest neighbor ($k$-NN) graphs. For seeds $\{x_0, x_1, \ldots, x_s\}$ and $\{y_0, y_1, \ldots, y_s\}$, we compute unweighted $k$-NN graphs $G_X$ and $G_Y$, then compute the graph Laplacians ($L_G$) for both graphs (the degree matrix minus the adjacency matrix: $L_G = D_G - A_G$). We then compute the eigenvalues of $L_{G_X}$ and $L_{G_Y}$, namely $\{\lambda_{L_{G_X}}(i)\}$ and $\{\lambda_{L_{G_Y}}(i)\}$. We select $l = \min(l_X, l_Y)$, where $l_X$ is the maximum $l$ such that the first $l$ eigenvalues of $L_{G_X}$ sum to less than 90% of the total sum of the eigenvalues. EVS is the sum of squared differences between the partial spectra:

$$\mathrm{EVS} = \sum_{i=1}^{l} \left(\lambda_{L_{G_X}}(i) - \lambda_{L_{G_Y}}(i)\right)^2$$
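A sketch of this metric under our own implementation assumptions (cosine-based, symmetrized k-NN graphs with k left as a free parameter); consult Søgaard et al. (2018) for the exact construction.

```python
import numpy as np

def knn_adjacency(E: np.ndarray, k: int = 10) -> np.ndarray:
    """Unweighted, symmetrized k-NN adjacency matrix from cosine similarity."""
    E = E / np.linalg.norm(E, axis=1, keepdims=True)
    sims = E @ E.T
    np.fill_diagonal(sims, -np.inf)               # exclude self-edges
    nn = np.argsort(-sims, axis=1)[:, :k]         # k nearest neighbors per node
    A = np.zeros_like(sims)
    A[np.repeat(np.arange(len(E)), k), nn.ravel()] = 1.0
    return np.maximum(A, A.T)

def laplacian_spectrum(A: np.ndarray) -> np.ndarray:
    L = np.diag(A.sum(axis=1)) - A                # L_G = D_G - A_G
    return np.sort(np.linalg.eigvalsh(L))[::-1]   # eigenvalues, largest first

def eigenvector_similarity(X_seed: np.ndarray, Y_seed: np.ndarray, k: int = 10) -> float:
    ex = laplacian_spectrum(knn_adjacency(X_seed, k))
    ey = laplacian_spectrum(knn_adjacency(Y_seed, k))
    def cutoff(vals):
        # max l such that the first l eigenvalues sum to < 90% of the total
        return int(np.searchsorted(np.cumsum(vals) / vals.sum(), 0.9))
    l = max(min(cutoff(ex), cutoff(ey)), 1)
    return float(np.sum((ex[:l] - ey[:l]) ** 2))
```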