Bilingual Lexicon Induction for Low-Resource Languages
using Graph Matching via Optimal Transport
Kelly Marchisio1, Ali Saad-Eldin3, Kevin Duh1,4,
Carey Priebe2,4, Philipp Koehn1
1Department of Computer Science, 2Department of Applied Mathematics and Statistics,
3Department of Biomedical Engineering, 4Human Language Technology Center of Excellence
Johns Hopkins University
{kmarc,asaadel1}@jhu.edu
kevinduh@cs.jhu.edu, {cep, phi}@jhu.edu
Abstract
Bilingual lexicons form a critical component of various natural language processing applications, including unsupervised and semi-supervised machine translation and cross-lingual information retrieval. We improve bilingual lexicon induction performance across 40 language pairs with a graph-matching method based on optimal transport. The method is especially strong with low amounts of supervision.
1 Introduction
Bilingual lexicon induction (BLI) from word embedding spaces is a popular task with a large body of existing literature (e.g. Mikolov et al., 2013; Artetxe et al., 2018; Conneau et al., 2018; Patra et al., 2019; Shi et al., 2021). The goal is to extract a dictionary of translation pairs given separate language-specific embedding spaces, which can then be used to bootstrap downstream tasks such as cross-lingual information retrieval and unsupervised/semi-supervised machine translation.
A great challenge across NLP is maintaining performance in low-resource scenarios. A common criticism of the BLI and low-resource MT literature is that while claims are made about diverse and under-resourced languages, research is often performed on down-sampled corpora of high-resource, highly-related languages on similar domains (Artetxe et al., 2020). Such corpora are not good proxies for true low-resource languages owing to data challenges such as dissimilar scripts, domain shift, noise, and lack of sufficient bitext (Marchisio et al., 2020). These differences can lead to dissimilarity between the embedding spaces (decreasing isometry), causing BLI to fail (Søgaard et al., 2018; Nakashole and Flauger, 2018; Ormazabal et al., 2019; Glavaš et al., 2019; Vulić et al., 2019; Patra et al., 2019; Marchisio et al., 2020).
There are two axes by which a language dataset is considered "low-resource". First, the language itself may be a low-resource language: one for which little bitext and/or monolingual text exists. Even for high-resource languages, the long tail of words may have poorly trained word embeddings due to rarity in the dataset (Gong et al., 2018; Czarnowska et al., 2019). In the data-poor setting of true low-resource languages, a great majority of words have little representation in the corpus, resulting in poorly-trained embeddings for a large proportion of them. The second axis is low supervision. Here, there are few ground-truth examples from which to learn. For BLI from word embedding spaces, low supervision means there are few seeds from which to induce a relationship between spaces, regardless of the quality of the spaces themselves.
We bring a new algorithm for graph-matching based on optimal transport (OT) to the NLP and BLI literature. We evaluate using 40 language pairs under varying amounts of supervision. The method works strikingly well across language pairs, especially in low-supervision contexts. As low supervision on low-resource languages reflects the real-world use case for BLI, this is an encouraging development on realistic scenarios.
2 Background
The typical baseline approach for BLI from word embedding spaces assumes that spaces can be mapped via linear transformation. Such methods typically involve solutions to the Procrustes problem (see Gower et al. (2004) for a review). Alternatively, a graph-based view considers words as nodes in undirected weighted graphs, where edges are the distance between words. Methods taking this view do not assume a linear mapping of the spaces exists, allowing for more flexible matching.
BLI from word embedding spaces
Assume separately-trained monolingual word embedding spaces $X \in \mathbb{R}^{n \times d}$ and $Y \in \mathbb{R}^{m \times d}$, where $n$/$m$ are the source/target language vocabulary sizes and $d$ is the embedding dimension. We build the matrices $\hat{X}$ and $\hat{Y}$ of seeds from $X$ and $Y$, respectively, such that given $s$ seed pairs $(x_1, y_1), (x_2, y_2), \ldots, (x_s, y_s)$, the first row of $\hat{X}$ is $x_1$, the second row is $x_2$, etc. We build $\hat{Y}$ analogously for the $y$-component of each seed pair. The goal is to recover matches for the $X \setminus \hat{X}$ and/or $Y \setminus \hat{Y}$ non-seed words.

arXiv:2210.14378v1 [cs.CL] 25 Oct 2022
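As an illustrative sketch of the setup above (the variable and function names are ours, not the paper's), the seed matrices can be assembled by row-indexing the embedding matrices with the seed pairs:

```python
import numpy as np

def build_seed_matrices(X, Y, seed_pairs):
    """Stack embeddings of seed translation pairs row by row, so that
    row i of X_hat and row i of Y_hat are translations of each other."""
    src_idx = [s for s, t in seed_pairs]
    tgt_idx = [t for s, t in seed_pairs]
    return X[src_idx], Y[tgt_idx]

# toy spaces: n=4 source words, m=5 target words, d=3 dimensions
X = np.arange(12, dtype=float).reshape(4, 3)
Y = np.arange(15, dtype=float).reshape(5, 3)
# two seed pairs: source word 0 <-> target word 2, source 3 <-> target 1
X_hat, Y_hat = build_seed_matrices(X, Y, [(0, 2), (3, 1)])
```

The remaining rows of $X$ and $Y$ are the non-seed words whose matches we want to recover.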
Procrustes
Many BLI methods use solutions to the Procrustes problem (e.g. Artetxe et al., 2019b; Conneau et al., 2018; Patra et al., 2019). These compute the optimal transform $W$ to map seeds:

$$\min_{W \in \mathbb{R}^{d \times d}} \|\hat{X}W - \hat{Y}\|_F^2 \qquad (1)$$

Once solved for $W$, then $XW$ and $Y$ live in a shared space and translation pairs can be extracted via nearest-neighbor search. Constrained to the space of orthogonal matrices, Equation 1 has a simple closed-form solution (Schönemann, 1966): $W = VU^T$, where $U\Sigma V^T = \mathrm{SVD}(\hat{Y}^T\hat{X})$.
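The closed-form orthogonal solution can be sketched in a few lines of NumPy (a minimal illustration under our own naming, not the paper's implementation):

```python
import numpy as np

def procrustes(X_hat, Y_hat):
    """Orthogonal W minimizing ||X_hat @ W - Y_hat||_F (Schönemann, 1966):
    W = V U^T, where U @ diag(S) @ Vt = SVD(Y_hat^T X_hat)."""
    U, _, Vt = np.linalg.svd(Y_hat.T @ X_hat)
    return Vt.T @ U.T

# sanity check: if Y_hat is an exact rotation of X_hat, W recovers it
rng = np.random.default_rng(0)
X_hat = rng.normal(size=(50, 4))
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))  # random orthogonal matrix
Y_hat = X_hat @ Q
W = procrustes(X_hat, Y_hat)
```

When the two spaces are exactly related by a rotation, as in this toy check, $\hat{X}W$ lands exactly on $\hat{Y}$; in practice the fit is approximate and nearest-neighbor search does the final extraction.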
Graph View
Here, words are nodes in monolingual graphs $G_x, G_y \in \mathbb{R}^{n \times n}$, and cells in $G_x, G_y$ are edge weights representing distance between words. As is common in NLP, we use cosine similarity. The objective function is Equation 2, where $\Pi$ is the set of permutation matrices.¹ Intuitively, $PG_yP^T$ finds the optimal relabeling of $G_y$ to align with $G_x$. This "minimizes edge-disagreements" between $G_x$ and $G_y$. This graph-matching objective is NP-hard. Equation 3 is equivalent.

$$\min_{P \in \Pi} \|G_x - PG_yP^T\|_F^2 \qquad (2)$$

$$\max_{P \in \Pi} \mathrm{trace}(G_x^T P G_y P^T) \qquad (3)$$
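The equivalence of Equations 2 and 3 follows from expanding the Frobenius norm: $\|G_x - PG_yP^T\|_F^2 = \|G_x\|_F^2 + \|G_y\|_F^2 - 2\,\mathrm{trace}(G_x^T P G_y P^T)$, since conjugating by a permutation preserves the norm of $G_y$. A brute-force numerical check on tiny graphs (illustrative only; real vocabularies are far too large to enumerate):

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
Gx = rng.normal(size=(4, 4)); Gx = (Gx + Gx.T) / 2  # symmetric toy "graphs"
Gy = rng.normal(size=(4, 4)); Gy = (Gy + Gy.T) / 2

def disagreement(P):          # Equation 2 objective
    return np.linalg.norm(Gx - P @ Gy @ P.T) ** 2

def trace_obj(P):             # Equation 3 objective
    return np.trace(Gx.T @ P @ Gy @ P.T)

# enumerate all 4! permutation matrices
perms = [np.eye(4)[list(p)] for p in itertools.permutations(range(4))]
best_min = min(perms, key=disagreement)   # argmin of Eq. 2
best_max = max(perms, key=trace_obj)      # argmax of Eq. 3
```

For generic weights the two objectives pick out the same permutation.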
Ex. Take source words $x_1$, $x_2$. We wish to recover valid translations $y_{x_1}$, $y_{x_2}$. If $distance(x_1, x_2) = distance(y_{x_1}, y_{x_2})$, a solution $P$ can have an edge-disagreement of 0 here. We then extract $y_{x_1}, y_{x_2}$ as translations of $x_1$, $x_2$. In reality, though, it is unlikely that $distance(x_1, x_2) = distance(y_{x_1}, y_{x_2})$. Because Equation 2 finds the ideal $P$ to minimize edge disagreements over the entire graphs, we hope that nodes paired by $P$ are valid translations. If $G_x$ and $G_y$ are isomorphic and there is a unique solution, then $P$ correctly recovers all translations.

Graph-matching is an active research field and is computationally prohibitive on large graphs, but approximation algorithms exist. BLI involves matching large, non-isomorphic graphs, among the greatest challenges for graph-matching.

¹A permutation matrix represents a one-to-one mapping: there is a single 1 in each row and column, and 0 elsewhere.
2.1 FAQ Algorithm for Graph Matching
Vogelstein et al. (2015)'s Fast Approximate Quadratic Assignment Problem algorithm (FAQ) uses gradient ascent to approximate a solution to Equation 2. Motivated by "connectomics" in neuroscience (the study of brain graphs with biological [groups of] neurons as nodes and neuronal connections as edges), FAQ was designed to perform accurately and efficiently on large graphs. FAQ relaxes the search space of Equation 3 to allow any doubly-stochastic matrix (the set $\mathcal{D}$). Each cell in a doubly-stochastic matrix is a non-negative real number and each row/column sums to 1. The set $\mathcal{D}$ thus contains $\Pi$ but is much larger. Relaxing the search space makes it easier to optimize Equation 3 via gradient ascent/descent.² FAQ solves the objective with the Frank-Wolfe method (Frank et al., 1956), then projects back to a permutation matrix.
Algorithm 1 is FAQ; $f(P) = \mathrm{trace}(G_x^T P G_y P^T)$. The graphs may be built as $G_x = XX^T$ and $G_y = YY^T$. $G_x$ and $G_y$ need not have the same dimensionality. Step 2 finds a permutation matrix approximation $Q^{\{i\}}$ to $P^{\{i\}}$ in the direction of the gradient. Finding such a $P$ requires approximation when $P$ is high-dimensional. Here, it is solved via the Hungarian Algorithm (Kuhn, 1955; Jonker and Volgenant, 1987), whose solution is a permutation matrix. Finally, $P^{\{n\}}$ is projected back onto the space of permutation matrices. Seeded Graph Matching (SGM; Fishkind et al., 2019) is a variant of FAQ allowing for supervision, and was recently shown to be effective for BLI by Marchisio et al. (2021). The interested reader may find Vogelstein et al. (2015) and Fishkind et al. (2019) enlightening for descriptive intuitions of the FAQ and SGM algorithms.³
Strengths/Weaknesses
FAQ/SGM perform well solving the exact graph-matching problem: where graphs are isomorphic and a full matching exists. In reality, large graphs are rarely isomorphic. For BLI, languages have differing vocabulary sizes, synonyms/antonyms, and idiosyncratic concepts; it is more natural to assume that an exact matching between word spaces does not exist, and that multiple matchings may be equally valid. This is an inexact graph-matching problem. FAQ generally performs poorly finding non-seeded inexact matchings (Saad-Eldin et al., 2021).

Algorithm 1 FAQ Algorithm for Graph Matching
Let: $G_x, G_y \in \mathbb{R}^{n \times n}$, $P^{\{0\}} \in \mathcal{D}$ (doubly stochastic)
while stopping criterion not met do
  1. Calculate $\nabla f(P^{\{i\}})$: $\nabla f(P^{\{i\}}) = G_x P^{\{i\}} G_y^T + G_x^T P^{\{i\}} G_y$
  2. $Q^{\{i\}}$ = permutation matrix approx. to $\nabla f(P^{\{i\}})$ via Hungarian Algorithm
  3. Calculate step size: $\arg\max_{\alpha \in [0,1]} f(\alpha P^{\{i\}} + (1-\alpha) Q^{\{i\}})$
  4. Update $P^{\{i+1\}} := \alpha P^{\{i\}} + (1-\alpha) Q^{\{i\}}$
end while
return permutation matrix approx. to $P^{\{n\}}$ via Hungarian Algorithm

²"Descent" for the Quadratic Assignment Problem, "ascent" for the Graph Matching Problem. The optimization objectives are equivalent: see Vogelstein et al. (2015) for a proof.
³'Section 3: Fast Approximate QAP Algorithm' (Vogelstein et al., 2015); 'Section 2.2. From FAQ to SGM' (Fishkind et al., 2019).
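Algorithm 1 can be sketched compactly with SciPy's Hungarian-algorithm solver. This is a minimal unseeded illustration under our own naming, not the authors' implementation: the paper line-searches the step size $\alpha$, while this sketch substitutes the standard fixed Frank-Wolfe schedule $\alpha = 2/(i+2)$ for brevity.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def faq(Gx, Gy, n_iter=30):
    """Frank-Wolfe on trace(Gx^T P Gy P^T) with Hungarian linearization."""
    n = Gx.shape[0]
    P = np.full((n, n), 1.0 / n)                 # barycenter start (doubly stochastic)
    for i in range(n_iter):
        grad = Gx @ P @ Gy.T + Gx.T @ P @ Gy     # gradient of the trace objective
        rows, cols = linear_sum_assignment(-grad)  # negate: maximize trace(Q^T grad)
        Q = np.zeros((n, n)); Q[rows, cols] = 1.0
        alpha = 2.0 / (i + 2)                    # fixed schedule (simplification)
        P = alpha * Q + (1 - alpha) * P
    rows, cols = linear_sum_assignment(-P)       # project back to a permutation
    P_final = np.zeros((n, n)); P_final[rows, cols] = 1.0
    return P_final

# toy instance: Gy is a relabeling of Gx
rng = np.random.default_rng(0)
A = rng.normal(size=(6, 6)); Gx = A + A.T
true_P = np.eye(6)[[3, 1, 4, 0, 5, 2]]
Gy = true_P.T @ Gx @ true_P
P_hat = faq(Gx, Gy)                              # a permutation matrix estimate
```

The final Hungarian projection guarantees the output is a hard matching even though the iterates live in the relaxed doubly-stochastic set.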
2.2 GOAT
Graph Matching via OptimAl Transport (GOAT) (Saad-Eldin et al., 2021) is a new graph-matching method which uses advances in OT. Similar to SGM, GOAT amends FAQ and can use seeds. GOAT has been successful for the inexact graph-matching problem on non-isomorphic graphs: whereas FAQ rapidly fails on non-isomorphic graphs, GOAT maintains strong performance.
Optimal Transport
OT is an optimization problem concerned with the most efficient way to transfer probability mass from distribution $\mu$ to distribution $\nu$. Discrete⁴ OT minimizes the inner product of a transportation "plan" matrix $P$ with a cost matrix $C$, as in Equation 4. $\langle \cdot, \cdot \rangle$ is the Frobenius inner product.

$$P^* = \arg\min_{P \in U(r,c)} \langle P, C \rangle \qquad (4)$$

$P$ is an element of the "transportation polytope" $U(r, c)$: the set of matrices whose rows sum to $r$ and columns sum to $c$. The Hungarian Algorithm approximately solves OT, but the search space is restricted to permutation matrices.
Sinkhorn: Lightspeed OT
Cuturi (2013) introduces Sinkhorn distance, an approximation of OT distance that can be solved quickly and accurately by adding an entropy penalty $h$ to Equation 4. Adding $h$ makes the objective easier and more efficient to compute, and encourages "intermediary" solutions similar to that seen in the Intuition subsection.

$$P^*_\lambda = \arg\min_{P \in U(r,c)} \langle P, C \rangle - \frac{1}{\lambda} h(P) \qquad (5)$$

As $\lambda \to \infty$, $P^*_\lambda$ approaches the ideal transportation matrix $P^*$. Cuturi (2013) shows that Equation 5 can be computed using Sinkhorn's algorithm (Sinkhorn, 1967). The interested reader can see details of the algorithm in Cuturi (2013); Peyré and Cuturi (2019). Unlike the Hungarian Algorithm, Sinkhorn has no restriction to a permutation matrix solution and can be solved over any $U(r, c)$.

⁴As ours is, as we compute over matrices.
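Sinkhorn's algorithm is just alternating row and column rescaling of the kernel $K = e^{-\lambda C}$. A minimal sketch with unit marginals, i.e. over $U(1,1)$ (the $\lambda$ value and iteration count are illustrative choices of ours):

```python
import numpy as np

def sinkhorn(C, lam=50.0, n_iter=500):
    """Entropy-regularized OT plan (Cuturi, 2013) with unit marginals:
    rescale rows and columns of K = exp(-lam * C) until doubly stochastic."""
    K = np.exp(-lam * C)
    u = np.ones(C.shape[0])
    v = np.ones(C.shape[1])
    for _ in range(n_iter):
        u = 1.0 / (K @ v)       # push row sums toward 1
        v = 1.0 / (K.T @ u)     # push column sums toward 1
    return u[:, None] * K * v[None, :]

# cheap cells on the diagonal attract the transported mass
C = np.array([[0.0, 1.0, 1.0],
              [1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0]])
P = sinkhorn(C)
```

With a large $\lambda$, the plan concentrates on the low-cost diagonal, approaching the unregularized optimum; smaller $\lambda$ spreads mass more smoothly.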
Sinkhorn in GOAT
GOAT uses Cuturi (2013)'s algorithm to solve Equation 5 over $U(1,1)$, the set of doubly-stochastic matrices $\mathcal{D}$. They call this the "doubly stochastic OT problem", and the algorithm that solves it "Lightspeed Optimal Transport" (LOT). Although Sinkhorn distance was created for efficiency, Saad-Eldin et al. (2021) find that using the matrix $P^*_\lambda$ that minimizes Sinkhorn distance also improves matching performance on large and non-isometric graphs. Algorithm 2 is GOAT.
Algorithm 2 GOAT
Let: $G_x, G_y \in \mathbb{R}^{n \times n}$, $P^{\{0\}} \in \mathcal{D}$ (doubly stochastic)
while stopping criterion not met do
  1. Calculate $\nabla f(P^{\{i\}})$: $\nabla f(P^{\{i\}}) = G_x P^{\{i\}} G_y^T + G_x^T P^{\{i\}} G_y$
  2. $Q^{\{i\}}$ = doubly-stochastic approx. to $\nabla f(P^{\{i\}})$ via LOT
  3. Calculate step size: $\arg\max_{\alpha \in [0,1]} f(\alpha P^{\{i\}} + (1-\alpha) Q^{\{i\}})$
  4. Update $P^{\{i+1\}} := \alpha P^{\{i\}} + (1-\alpha) Q^{\{i\}}$
end while
return permutation matrix approx. to $P^{\{n\}}$ via Hungarian Algorithm
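Step 2 is the only place GOAT departs from FAQ: instead of a Hungarian assignment, a Sinkhorn-style scaling of the (exponentiated) gradient yields a doubly-stochastic step direction. A hedged sketch of this substitution, with our own illustrative $\lambda$ and iteration count:

```python
import numpy as np

def lot_step_direction(grad, lam=5.0, n_iter=2000):
    """Sketch of GOAT's Step 2: Sinkhorn scaling on exp(lam * grad)
    returns a doubly-stochastic step direction rather than the single
    permutation a Hungarian solve would pick."""
    K = np.exp(lam * (grad - grad.max()))  # maximize trace(Q^T grad); shift for stability
    u = np.ones(grad.shape[0])
    v = np.ones(grad.shape[1])
    for _ in range(n_iter):
        u = 1.0 / (K @ v)
        v = 1.0 / (K.T @ u)
    return u[:, None] * K * v[None, :]

# the tied-gradient example from the Intuition subsection below
grad = np.array([[0., 3., 0.],
                 [2., 1., 2.],
                 [0., 0., 0.]])
Q = lot_step_direction(grad)
```

On this gradient the returned direction splits its second row evenly between the two tied columns rather than committing to either permutation, while still attaining (nearly) the maximal trace.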
Intuition
The critical difference between SGM/FAQ and GOAT is how each calculates step direction based on the gradient. Under the hood, each algorithm maximizes $\mathrm{trace}(Q^T \nabla f(P^{\{i\}}))$ to compute $Q^{\{i\}}$ (the step direction) in Step 2 of their respective algorithms. See Saad-Eldin et al. (2021) or Fishkind et al. (2019) for a derivation. FAQ uses the Hungarian Algorithm and GOAT uses LOT.

For $\nabla f(P^{\{i\}})$ below, there are two valid permutation matrices $Q_1$ and $Q_2$ that maximize the trace. When multiple solutions exist, the Hungarian Algorithm chooses one arbitrarily. Thus, updates of $P$ in FAQ are constrained to be permutation matrices.

$$\nabla f(P^{\{i\}}) = \begin{pmatrix} 0 & 3 & 0 \\ 2 & 1 & 2 \\ 0 & 0 & 0 \end{pmatrix}$$

$$Q_1 = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{pmatrix}, \quad Q_2 = \begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix}$$

$$\mathrm{trace}(Q_1^T \nabla f(P^{\{i\}})) = \mathrm{trace}(Q_2^T \nabla f(P^{\{i\}})) = 5$$
Concerningly, Saad-Eldin et al. (2021) find that seed order influences the solution in a popular implementation of the Hungarian Algorithm. Since BLI is a high-dimensional many-to-many task, arbitrary choices could meaningfully affect the result. GOAT, on the other hand, can step in the direction of a doubly-stochastic matrix. Saad-Eldin et al. (2021) prove that given multiple permutation matrices that equally approximate the gradient at $P^{\{i\}}$, any convex linear combination is a doubly stochastic matrix that equally approximates the gradient:

$$P_\lambda = \sum_i^n \lambda_i P_i \quad \text{s.t.} \quad \lambda_1 + \ldots + \lambda_n = 1;\; \lambda_i \in [0, 1]$$

$P_\lambda$ is a weighted combination of many valid solutions, obviating the need to arbitrarily select one for the gradient-based update. LOT's output of a doubly-stochastic matrix in Step 2 is similar to finding a $P_\lambda$ in that it needn't discretize to a single permutation matrix. In this way, GOAT can be thought of as taking a step that incorporates many possible permutation solutions. For instance, GOAT may select $Q_{ds} = \frac{1}{2}Q_1 + \frac{1}{2}Q_2$, which also maximizes $\mathrm{trace}(Q^T \nabla f(P^{\{i\}}))$.

$$Q_{ds} = \begin{pmatrix} 0 & 1 & 0 \\ \frac{1}{2} & 0 & \frac{1}{2} \\ \frac{1}{2} & 0 & \frac{1}{2} \end{pmatrix}, \quad \mathrm{trace}(Q_{ds}^T \nabla f(P^{\{i\}})) = 5$$

Thus whereas FAQ takes non-deterministic, "choppy" update steps, GOAT optimizes smoothly and deterministically. Figure 1 is an illustration.
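The worked example above can be checked numerically in a few lines (reproducing the paper's numbers, nothing more):

```python
import numpy as np

grad = np.array([[0., 3., 0.],
                 [2., 1., 2.],
                 [0., 0., 0.]])
Q1 = np.array([[0., 1., 0.], [0., 0., 1.], [1., 0., 0.]])
Q2 = np.array([[0., 1., 0.], [1., 0., 0.], [0., 0., 1.]])
Qds = 0.5 * Q1 + 0.5 * Q2        # convex combination of the tied solutions

t1 = np.trace(Q1.T @ grad)       # 3 + 2 + 0 = 5
t2 = np.trace(Q2.T @ grad)       # 3 + 2 + 0 = 5
tds = np.trace(Qds.T @ grad)     # 3 + 1 + 1 = 5
```

All three directions score 5, so averaging the tied permutations costs nothing in objective value while removing the arbitrary choice.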
3 Experimental Setup
We run Procrustes, SGM, and GOAT on 40 language pairs. We also run system combination experiments similar to Marchisio et al. (2021). We evaluate with the standard precision@1 (P@1).
Figure 1: Optimization step of FAQ vs. GOAT. FAQ arbitrarily chooses the direction of a permutation matrix. GOAT averages permutation matrices to take a smoother path.
We induce lexicons using (1) the closed-form solution to the orthogonal Procrustes problem of Equation 1, extracting nearest neighbors using CSLS (Conneau et al., 2018), (2) SGM, solving the seeded version of Equation 2, and (3) GOAT. Word graphs are $G_x = XX^T$, $G_y = YY^T$.
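CSLS replaces raw cosine nearest-neighbor scores with hubness-corrected ones: each pair's similarity is penalized by both words' average similarity to their $k$ nearest cross-lingual neighbors. A sketch of the scoring (our own naming; the original uses $k{=}10$, which we follow as the default):

```python
import numpy as np

def csls_scores(XW, Y, k=10):
    """CSLS (Conneau et al., 2018): 2*cos(x, y) - r(x) - r(y), where r()
    is the mean cosine similarity to the k nearest cross-lingual neighbors."""
    Xn = XW / np.linalg.norm(XW, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    cos = Xn @ Yn.T                                       # (n, m) cosine matrix
    r_src = np.sort(cos, axis=1)[:, -k:].mean(axis=1)     # hubness of each source word
    r_tgt = np.sort(cos, axis=0)[-k:, :].mean(axis=0)     # hubness of each target word
    return 2 * cos - r_src[:, None] - r_tgt[None, :]

# degenerate check: identical orthonormal spaces should match word i to word i
scores = csls_scores(np.eye(4), np.eye(4), k=1)
```

Translation pairs are then read off as the row-wise argmax of the score matrix.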
System Combination
We perform system combination experiments analogous to those of Marchisio et al. (2021), incorporating GOAT. Figure 2 shows the system, which is made of two components: GOAT run in forward and reverse directions, and "Iterative Procrustes with Stochastic-Add" from Marchisio et al. (2021). This iterative version of Procrustes runs Procrustes in source→target and target→source directions and feeds $H$ random hypotheses from the intersection of both directions into another run of Procrustes with the gold seeds. The process repeats for $I$ iterations, adding $H$ more random hypotheses each time until all are chosen. We set $H = 100$ and $I = 5$, as in the original work.
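The loop just described can be sketched as follows. This is a hedged reconstruction under our own naming, not Marchisio et al. (2021)'s code: plain dot-product nearest neighbors stand in for CSLS, and embeddings are assumed row-normalized.

```python
import numpy as np

def procrustes_fit(A, B):
    """Orthogonal W minimizing ||A @ W - B||_F."""
    U, _, Vt = np.linalg.svd(B.T @ A)
    return Vt.T @ U.T

def iterative_stochastic_add(X, Y, gold_seeds, H=100, I=5, seed=0):
    """Sketch of Iterative Procrustes with Stochastic-Add: map in both
    directions, intersect the hypothesized pairs, and add H random agreed
    hypotheses to the gold seeds on each of I iterations."""
    rng = np.random.default_rng(seed)
    hyps = set()
    for _ in range(I):
        pairs = list(gold_seeds) + sorted(hyps)
        src = [s for s, t in pairs]; tgt = [t for s, t in pairs]
        W_fwd = procrustes_fit(X[src], Y[tgt])   # source -> target
        W_bwd = procrustes_fit(Y[tgt], X[src])   # target -> source
        fwd = {(i, int(np.argmax(X[i] @ W_fwd @ Y.T))) for i in range(len(X))}
        bwd = {(int(np.argmax(Y[j] @ W_bwd @ X.T)), j) for j in range(len(Y))}
        agreed = sorted(fwd & bwd)               # hypotheses both directions accept
        rng.shuffle(agreed)
        hyps |= set(agreed[:H])                  # stochastically add H of them
    return sorted(set(gold_seeds) | hyps)

# toy check: Y is an exact rotation of X, seeds are the first 5 identity pairs
rng = np.random.default_rng(1)
X = rng.normal(size=(10, 3))
X /= np.linalg.norm(X, axis=1, keepdims=True)
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
Y = X @ Q
lexicon = iterative_stochastic_add(X, Y, [(i, i) for i in range(5)], H=20, I=2)
```

On this exactly-isometric toy instance both directions agree on every word, so the full identity lexicon is recovered.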
3.1 Data & Software
We use publicly-available fastText word embeddings (Bojanowski et al., 2017)⁵ which we normalize, mean-center, and renormalize (Artetxe et al., 2018; Zhang et al., 2019), and bilingual dictionaries from MUSE⁶ filtered to be one-to-one.⁷ For languages with 200,000+ embeddings, we use the first

⁵https://fasttext.cc/docs/en/pretrained-vectors.html
⁶https://github.com/facebookresearch/MUSE
⁷For each source word, keep the first unused target word. Targets are in arbitrary order, so this is random sampling.
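The normalize/mean-center/renormalize preprocessing can be sketched in NumPy (a minimal illustration of the cited recipe, not the authors' pipeline):

```python
import numpy as np

def normalize_embeddings(E):
    """Length-normalize, mean-center, then length-normalize again,
    following Artetxe et al. (2018); Zhang et al. (2019)."""
    E = E / np.linalg.norm(E, axis=1, keepdims=True)  # unit-length rows
    E = E - E.mean(axis=0, keepdims=True)             # remove the common mean
    return E / np.linalg.norm(E, axis=1, keepdims=True)

E = np.random.default_rng(2).normal(size=(100, 8))
E_norm = normalize_embeddings(E)
```

After the final step every row is unit-length again, so dot products between words are exactly cosine similarities.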