Bilingual Lexicon Induction for Low-Resource Languages
using Graph Matching via Optimal Transport
Kelly Marchisio1, Ali Saad-Eldin3, Kevin Duh1,4,
Carey Priebe2,4, Philipp Koehn1
1Department of Computer Science, 2Department of Applied Mathematics and Statistics,
3Department of Biomedical Engineering, 4Human Language Technology Center of Excellence
Johns Hopkins University
{kmarc,asaadel1}@jhu.edu
kevinduh@cs.jhu.edu, {cep, phi}@jhu.edu
Abstract
Bilingual lexicons form a critical component of various natural language processing applications, including unsupervised and semi-supervised machine translation and cross-lingual information retrieval. We improve bilingual lexicon induction performance across 40 language pairs with a graph-matching method based on optimal transport. The method is especially strong with low amounts of supervision.
1 Introduction
Bilingual lexicon induction (BLI) from word embedding spaces is a popular task with a large body of existing literature (e.g. Mikolov et al., 2013; Artetxe et al., 2018; Conneau et al., 2018; Patra et al., 2019; Shi et al., 2021). The goal is to extract a dictionary of translation pairs given separate language-specific embedding spaces, which can then be used to bootstrap downstream tasks such as cross-lingual information retrieval and unsupervised/semi-supervised machine translation.
A great challenge across NLP is maintaining performance in low-resource scenarios. A common criticism of the BLI and low-resource MT literature is that while claims are made about diverse and under-resourced languages, research is often performed on down-sampled corpora of high-resource, highly-related languages on similar domains (Artetxe et al., 2020). Such corpora are not good proxies for true low-resource languages owing to data challenges such as dissimilar scripts, domain shift, noise, and lack of sufficient bitext (Marchisio et al., 2020). These differences can lead to dissimilarity between the embedding spaces (decreasing isometry), causing BLI to fail (Søgaard et al., 2018; Nakashole and Flauger, 2018; Ormazabal et al., 2019; Glavaš et al., 2019; Vulić et al., 2019; Patra et al., 2019; Marchisio et al., 2020).
There are two axes by which a language dataset is considered "low-resource". First, the language itself may be a low-resource language: one for which little bitext and/or monolingual text exists. Even for high-resource languages, the long tail of words may have poorly trained word embeddings due to rarity in the dataset (Gong et al., 2018; Czarnowska et al., 2019). In the data-poor setting of true low-resource languages, a great majority of words have little representation in the corpus, resulting in poorly-trained embeddings for a large proportion of them. The second axis is low supervision. Here, there are few ground-truth examples from which to learn. For BLI from word embedding spaces, low supervision means there are few seeds from which to induce a relationship between spaces, regardless of the quality of the spaces themselves.
We bring a new algorithm for graph-matching based on optimal transport (OT) to the NLP and BLI literature. We evaluate using 40 language pairs under varying amounts of supervision. The method works strikingly well across language pairs, especially in low-supervision contexts. As low supervision on low-resource languages reflects the real-world use case for BLI, this is an encouraging development on realistic scenarios.
2 Background
The typical baseline approach for BLI from word embedding spaces assumes that spaces can be mapped via linear transformation. Such methods typically involve solutions to the Procrustes problem (see Gower et al. (2004) for a review). Alternatively, a graph-based view considers words as nodes in undirected weighted graphs, where edges are the distance between words. Methods taking this view do not assume a linear mapping of the spaces exists, allowing for more flexible matching.
BLI from word embedding spaces
Assume separately-trained monolingual word embedding spaces $X \in \mathbb{R}^{n \times d}$ and $Y \in \mathbb{R}^{m \times d}$, where $n$/$m$ are the source/target language vocabulary sizes and $d$ is the embedding dimension. We build the matrices $\hat{X}$ and $\hat{Y}$ of seeds from $X$ and $Y$, respectively, such that given $s$ seed pairs $(x_1, y_1), (x_2, y_2), \ldots, (x_s, y_s)$, the first row of $\hat{X}$ is $x_1$, the second row is $x_2$, etc. We build $\hat{Y}$ analogously for the $y$-component of each seed pair. The goal is to recover matches for the $X \setminus \hat{X}$ and/or $Y \setminus \hat{Y}$ non-seed words.

arXiv:2210.14378v1 [cs.CL] 25 Oct 2022
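As an illustrative sketch of the setup above (the variable and function names are ours, not the paper's), the seed matrices can be assembled by row-indexing the embedding matrices with the seed pairs:

```python
import numpy as np

def build_seed_matrices(X, Y, seed_pairs):
    """Stack embeddings of seed translation pairs row by row, so that
    row i of X_hat and row i of Y_hat are translations of each other."""
    src_idx = [s for s, t in seed_pairs]
    tgt_idx = [t for s, t in seed_pairs]
    return X[src_idx], Y[tgt_idx]

# toy spaces: n=4 source words, m=5 target words, d=3 dimensions
X = np.arange(12, dtype=float).reshape(4, 3)
Y = np.arange(15, dtype=float).reshape(5, 3)
# two seed pairs: source word 0 <-> target word 2, source 3 <-> target 1
X_hat, Y_hat = build_seed_matrices(X, Y, [(0, 2), (3, 1)])
```

The remaining rows of $X$ and $Y$ are the non-seed words whose matches we want to recover.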
Procrustes
Many BLI methods use solutions to the Procrustes problem (e.g. Artetxe et al., 2019b; Conneau et al., 2018; Patra et al., 2019). These compute the optimal transform $W$ to map seeds:

$$\min_{W \in \mathbb{R}^{d \times d}} \|\hat{X}W - \hat{Y}\|_F^2 \qquad (1)$$

Once solved for $W$, then $XW$ and $Y$ live in a shared space and translation pairs can be extracted via nearest-neighbor search. Constrained to the space of orthogonal matrices, Equation 1 has a simple closed-form solution (Schönemann, 1966): $W = VU^T$, where $U\Sigma V^T = \mathrm{SVD}(\hat{Y}^T\hat{X})$.
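The closed-form orthogonal solution can be sketched in a few lines of NumPy (a minimal illustration under our own naming, not the paper's implementation):

```python
import numpy as np

def procrustes(X_hat, Y_hat):
    """Orthogonal W minimizing ||X_hat @ W - Y_hat||_F (Schönemann, 1966):
    W = V U^T, where U @ diag(S) @ Vt = SVD(Y_hat^T X_hat)."""
    U, _, Vt = np.linalg.svd(Y_hat.T @ X_hat)
    return Vt.T @ U.T

# sanity check: if Y_hat is an exact rotation of X_hat, W recovers it
rng = np.random.default_rng(0)
X_hat = rng.normal(size=(50, 4))
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))  # random orthogonal matrix
Y_hat = X_hat @ Q
W = procrustes(X_hat, Y_hat)
```

When the two spaces are exactly related by a rotation, as in this toy check, $\hat{X}W$ lands exactly on $\hat{Y}$; in practice the fit is approximate and nearest-neighbor search does the final extraction.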
Graph View
Here, words are nodes in monolingual graphs $G_x, G_y \in \mathbb{R}^{n \times n}$, and cells in $G_x, G_y$ are edge weights representing distance between words. As is common in NLP, we use cosine similarity. The objective function is Equation 2, where $\Pi$ is the set of permutation matrices.¹ Intuitively, $PG_yP^T$ finds the optimal relabeling of $G_y$ to align with $G_x$. This "minimizes edge-disagreements" between $G_x$ and $G_y$. This graph-matching objective is NP-hard. Equation 3 is equivalent.

$$\min_{P \in \Pi} \|G_x - PG_yP^T\|_F^2 \qquad (2)$$

$$\max_{P \in \Pi} \mathrm{trace}(G_x^T P G_y P^T) \qquad (3)$$
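The equivalence of Equations 2 and 3 follows from expanding the Frobenius norm: $\|G_x - PG_yP^T\|_F^2 = \|G_x\|_F^2 + \|G_y\|_F^2 - 2\,\mathrm{trace}(G_x^T P G_y P^T)$, since conjugating by a permutation preserves the norm of $G_y$. A brute-force numerical check on tiny graphs (illustrative only; real vocabularies are far too large to enumerate):

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
Gx = rng.normal(size=(4, 4)); Gx = (Gx + Gx.T) / 2  # symmetric toy "graphs"
Gy = rng.normal(size=(4, 4)); Gy = (Gy + Gy.T) / 2

def disagreement(P):          # Equation 2 objective
    return np.linalg.norm(Gx - P @ Gy @ P.T) ** 2

def trace_obj(P):             # Equation 3 objective
    return np.trace(Gx.T @ P @ Gy @ P.T)

# enumerate all 4! permutation matrices
perms = [np.eye(4)[list(p)] for p in itertools.permutations(range(4))]
best_min = min(perms, key=disagreement)   # argmin of Eq. 2
best_max = max(perms, key=trace_obj)      # argmax of Eq. 3
```

For generic weights the two objectives pick out the same permutation.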
Ex. Take source words $x_1$, $x_2$. We wish to recover valid translations $y_{x_1}$, $y_{x_2}$. If $distance(x_1, x_2) = distance(y_{x_1}, y_{x_2})$, a solution $P$ can have an edge-disagreement of 0 here. We then extract $y_{x_1}, y_{x_2}$ as translations of $x_1$, $x_2$. In reality, though, it is unlikely that $distance(x_1, x_2) = distance(y_{x_1}, y_{x_2})$. Because Equation 2 finds the ideal $P$ to minimize edge disagreements over the entire graphs, we hope that nodes paired by $P$ are valid translations. If $G_x$ and $G_y$ are isomorphic and there is a unique solution, then $P$ correctly recovers all translations.

Graph-matching is an active research field and is computationally prohibitive on large graphs, but approximation algorithms exist. BLI involves matching large, non-isomorphic graphs, among the greatest challenges for graph-matching.

¹A permutation matrix represents a one-to-one mapping: there is a single 1 in each row and column, and 0 elsewhere.
2.1 FAQ Algorithm for Graph Matching
Vogelstein et al. (2015)'s Fast Approximate Quadratic Assignment Problem algorithm (FAQ) uses gradient ascent to approximate a solution to Equation 2. Motivated by "connectomics" in neuroscience (the study of brain graphs with biological [groups of] neurons as nodes and neuronal connections as edges), FAQ was designed to perform accurately and efficiently on large graphs. FAQ relaxes the search space of Equation 3 to allow any doubly-stochastic matrix (the set $\mathcal{D}$). Each cell in a doubly-stochastic matrix is a non-negative real number and each row/column sums to 1. The set $\mathcal{D}$ thus contains $\Pi$ but is much larger. Relaxing the search space makes it easier to optimize Equation 3 via gradient ascent/descent.² FAQ solves the objective with the Frank-Wolfe method (Frank et al., 1956), then projects back to a permutation matrix.
Algorithm 1 is FAQ; $f(P) = \mathrm{trace}(G_x^T P G_y P^T)$. The graphs may be built as $G_x = XX^T$ and $G_y = YY^T$. $G_x$ and $G_y$ need not have the same dimensionality. Step 2 finds a permutation matrix approximation $Q^{\{i\}}$ to $P^{\{i\}}$ in the direction of the gradient. Finding such a $P$ requires approximation when $P$ is high-dimensional. Here, it is solved via the Hungarian Algorithm (Kuhn, 1955; Jonker and Volgenant, 1987), whose solution is a permutation matrix. Finally, $P^{\{n\}}$ is projected back onto the space of permutation matrices. Seeded Graph Matching (SGM; Fishkind et al., 2019) is a variant of FAQ allowing for supervision, and was recently shown to be effective for BLI by Marchisio et al. (2021). The interested reader may find Vogelstein et al. (2015) and Fishkind et al. (2019) enlightening for descriptive intuitions of the FAQ and SGM algorithms.³
Strengths/Weaknesses
FAQ/SGM perform well solving the exact graph-matching problem: where graphs are isomorphic and a full matching exists. In reality, large graphs are rarely isomorphic. For BLI, languages have differing vocabulary sizes, synonyms/antonyms, and idiosyncratic concepts; it is more natural to assume that an exact matching between word spaces does not exist, and that multiple matchings may be equally valid. This is an inexact graph-matching problem. FAQ generally performs poorly finding non-seeded inexact matchings (Saad-Eldin et al., 2021).

Algorithm 1 FAQ Algorithm for Graph Matching
Let: $G_x, G_y \in \mathbb{R}^{n \times n}$, $P^{\{0\}} \in \mathcal{D}$ (doubly stochastic)
while stopping criterion not met do
  1. Calculate $\nabla f(P^{\{i\}})$: $\nabla f(P^{\{i\}}) = G_x P^{\{i\}} G_y^T + G_x^T P^{\{i\}} G_y$
  2. $Q^{\{i\}}$ = permutation matrix approx. to $\nabla f(P^{\{i\}})$ via Hungarian Algorithm
  3. Calculate step size: $\arg\max_{\alpha \in [0,1]} f(\alpha P^{\{i\}} + (1-\alpha) Q^{\{i\}})$
  4. Update $P^{\{i+1\}} := \alpha P^{\{i\}} + (1-\alpha) Q^{\{i\}}$
end while
return permutation matrix approx. to $P^{\{n\}}$ via Hungarian Algorithm

²"Descent" for the Quadratic Assignment Problem, "ascent" for the Graph Matching Problem. The optimization objectives are equivalent: see Vogelstein et al. (2015) for a proof.
³'Section 3: Fast Approximate QAP Algorithm' (Vogelstein et al., 2015); 'Section 2.2. From FAQ to SGM' (Fishkind et al., 2019).
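Algorithm 1 can be sketched compactly with SciPy's Hungarian-algorithm solver. This is a minimal unseeded illustration under our own naming, not the authors' implementation: the paper line-searches the step size $\alpha$, while this sketch substitutes the standard fixed Frank-Wolfe schedule $\alpha = 2/(i+2)$ for brevity.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def faq(Gx, Gy, n_iter=30):
    """Frank-Wolfe on trace(Gx^T P Gy P^T) with Hungarian linearization."""
    n = Gx.shape[0]
    P = np.full((n, n), 1.0 / n)                 # barycenter start (doubly stochastic)
    for i in range(n_iter):
        grad = Gx @ P @ Gy.T + Gx.T @ P @ Gy     # gradient of the trace objective
        rows, cols = linear_sum_assignment(-grad)  # negate: maximize trace(Q^T grad)
        Q = np.zeros((n, n)); Q[rows, cols] = 1.0
        alpha = 2.0 / (i + 2)                    # fixed schedule (simplification)
        P = alpha * Q + (1 - alpha) * P
    rows, cols = linear_sum_assignment(-P)       # project back to a permutation
    P_final = np.zeros((n, n)); P_final[rows, cols] = 1.0
    return P_final

# toy instance: Gy is a relabeling of Gx
rng = np.random.default_rng(0)
A = rng.normal(size=(6, 6)); Gx = A + A.T
true_P = np.eye(6)[[3, 1, 4, 0, 5, 2]]
Gy = true_P.T @ Gx @ true_P
P_hat = faq(Gx, Gy)                              # a permutation matrix estimate
```

The final Hungarian projection guarantees the output is a hard matching even though the iterates live in the relaxed doubly-stochastic set.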
2.2 GOAT
Graph Matching via OptimAl Transport (GOAT) (Saad-Eldin et al., 2021) is a new graph-matching method which uses advances in OT. Similar to SGM, GOAT amends FAQ and can use seeds. GOAT has been successful for the inexact graph-matching problem on non-isomorphic graphs: whereas FAQ rapidly fails on non-isomorphic graphs, GOAT maintains strong performance.
Optimal Transport
OT is an optimization problem concerned with the most efficient way to transfer probability mass from distribution $\mu$ to distribution $\nu$. Discrete⁴ OT minimizes the inner product of a transportation "plan" matrix $P$ with a cost matrix $C$, as in Equation 4. $\langle \cdot, \cdot \rangle$ is the Frobenius inner product.

$$P^* = \arg\min_{P \in U(r,c)} \langle P, C \rangle \qquad (4)$$

$P$ is an element of the "transportation polytope" $U(r, c)$: the set of matrices whose rows sum to $r$ and columns sum to $c$. The Hungarian Algorithm approximately solves OT, but the search space is restricted to permutation matrices.
Sinkhorn: Lightspeed OT
Cuturi (2013) introduces Sinkhorn distance, an approximation of OT distance that can be solved quickly and accurately by adding an entropy penalty $h$ to Equation 4. Adding $h$ makes the objective easier and more efficient to compute, and encourages "intermediary" solutions similar to that seen in the Intuition subsection.

$$P^*_\lambda = \arg\min_{P \in U(r,c)} \langle P, C \rangle - \frac{1}{\lambda} h(P) \qquad (5)$$

As $\lambda \to \infty$, $P^*_\lambda$ approaches the ideal transportation matrix $P^*$. Cuturi (2013) shows that Equation 5 can be computed using Sinkhorn's algorithm (Sinkhorn, 1967). The interested reader can see details of the algorithm in Cuturi (2013); Peyré and Cuturi (2019). Unlike the Hungarian Algorithm, Sinkhorn has no restriction to a permutation matrix solution and can be solved over any $U(r, c)$.

⁴As ours is, as we compute over matrices.
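Sinkhorn's algorithm is just alternating row and column rescaling of the kernel $K = e^{-\lambda C}$. A minimal sketch with unit marginals, i.e. over $U(1,1)$ (the $\lambda$ value and iteration count are illustrative choices of ours):

```python
import numpy as np

def sinkhorn(C, lam=50.0, n_iter=500):
    """Entropy-regularized OT plan (Cuturi, 2013) with unit marginals:
    rescale rows and columns of K = exp(-lam * C) until doubly stochastic."""
    K = np.exp(-lam * C)
    u = np.ones(C.shape[0])
    v = np.ones(C.shape[1])
    for _ in range(n_iter):
        u = 1.0 / (K @ v)       # push row sums toward 1
        v = 1.0 / (K.T @ u)     # push column sums toward 1
    return u[:, None] * K * v[None, :]

# cheap cells on the diagonal attract the transported mass
C = np.array([[0.0, 1.0, 1.0],
              [1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0]])
P = sinkhorn(C)
```

With a large $\lambda$, the plan concentrates on the low-cost diagonal, approaching the unregularized optimum; smaller $\lambda$ spreads mass more smoothly.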
Sinkhorn in GOAT
GOAT uses Cuturi (2013)'s algorithm to solve Equation 5 over $U(1,1)$, the set of doubly-stochastic matrices $\mathcal{D}$. They call this the "doubly stochastic OT problem", and the algorithm that solves it "Lightspeed Optimal Transport" (LOT). Although Sinkhorn distance was created for efficiency, Saad-Eldin et al. (2021) find that using the matrix $P^*_\lambda$ that minimizes Sinkhorn distance also improves matching performance on large and non-isometric graphs. Algorithm 2 is GOAT.
Algorithm 2 GOAT
Let: $G_x, G_y \in \mathbb{R}^{n \times n}$, $P^{\{0\}} \in \mathcal{D}$ (doubly stochastic)
while stopping criterion not met do
  1. Calculate $\nabla f(P^{\{i\}})$: $\nabla f(P^{\{i\}}) = G_x P^{\{i\}} G_y^T + G_x^T P^{\{i\}} G_y$
  2. $Q^{\{i\}}$ = doubly-stochastic approx. to $\nabla f(P^{\{i\}})$ via LOT
  3. Calculate step size: $\arg\max_{\alpha \in [0,1]} f(\alpha P^{\{i\}} + (1-\alpha) Q^{\{i\}})$
  4. Update $P^{\{i+1\}} := \alpha P^{\{i\}} + (1-\alpha) Q^{\{i\}}$
end while
return permutation matrix approx. to $P^{\{n\}}$ via Hungarian Algorithm
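Step 2 is the only place GOAT departs from FAQ: instead of a Hungarian assignment, a Sinkhorn-style scaling of the (exponentiated) gradient yields a doubly-stochastic step direction. A hedged sketch of this substitution, with our own illustrative $\lambda$ and iteration count:

```python
import numpy as np

def lot_step_direction(grad, lam=5.0, n_iter=2000):
    """Sketch of GOAT's Step 2: Sinkhorn scaling on exp(lam * grad)
    returns a doubly-stochastic step direction rather than the single
    permutation a Hungarian solve would pick."""
    K = np.exp(lam * (grad - grad.max()))  # maximize trace(Q^T grad); shift for stability
    u = np.ones(grad.shape[0])
    v = np.ones(grad.shape[1])
    for _ in range(n_iter):
        u = 1.0 / (K @ v)
        v = 1.0 / (K.T @ u)
    return u[:, None] * K * v[None, :]

# the tied-gradient example from the Intuition subsection below
grad = np.array([[0., 3., 0.],
                 [2., 1., 2.],
                 [0., 0., 0.]])
Q = lot_step_direction(grad)
```

On this gradient the returned direction splits its second row evenly between the two tied columns rather than committing to either permutation, while still attaining (nearly) the maximal trace.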
Intuition
The critical difference between SGM/FAQ and GOAT is how each calculates step direction based on the gradient. Under the hood, each algorithm maximizes $\mathrm{trace}(Q^T \nabla f(P^{\{i\}}))$ to compute $Q^{\{i\}}$ (the step direction) in Step 2 of their respective algorithms. See Saad-Eldin et al. (2021) or Fishkind et al. (2019) for a derivation. FAQ uses the Hungarian Algorithm and GOAT uses LOT.

For $\nabla f(P^{\{i\}})$ below, there are two valid permutation matrices $Q_1$ and $Q_2$ that maximize the trace. When multiple solutions exist, the Hungarian Algorithm chooses one arbitrarily. Thus, updates of $P$ in FAQ are constrained to be permutation matrices.

$$\nabla f(P^{\{i\}}) = \begin{pmatrix} 0 & 3 & 0 \\ 2 & 1 & 2 \\ 0 & 0 & 0 \end{pmatrix}$$

$$Q_1 = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{pmatrix}, \quad Q_2 = \begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix}$$

$$\mathrm{trace}(Q_1^T \nabla f(P^{\{i\}})) = \mathrm{trace}(Q_2^T \nabla f(P^{\{i\}})) = 5$$
Concerningly, Saad-Eldin et al. (2021) find that seed order influences the solution in a popular implementation of the Hungarian Algorithm. Since BLI is a high-dimensional many-to-many task, arbitrary choices could meaningfully affect the result. GOAT, on the other hand, can step in the direction of a doubly-stochastic matrix. Saad-Eldin et al. (2021) prove that given multiple permutation matrices that equally approximate the gradient at $P^{\{i\}}$, any convex linear combination is a doubly stochastic matrix that equally approximates the gradient:

$$P_\lambda = \sum_i^n \lambda_i P_i \quad \text{s.t.} \quad \lambda_1 + \ldots + \lambda_n = 1;\; \lambda_i \in [0, 1]$$

$P_\lambda$ is a weighted combination of many valid solutions, obviating the need to arbitrarily select one for the gradient-based update. LOT's output of a doubly-stochastic matrix in Step 2 is similar to finding a $P_\lambda$ in that it needn't discretize to a single permutation matrix. In this way, GOAT can be thought of as taking a step that incorporates many possible permutation solutions. For instance, GOAT may select $Q_{ds} = \frac{1}{2}Q_1 + \frac{1}{2}Q_2$, which also maximizes $\mathrm{trace}(Q^T \nabla f(P^{\{i\}}))$.

$$Q_{ds} = \begin{pmatrix} 0 & 1 & 0 \\ \frac{1}{2} & 0 & \frac{1}{2} \\ \frac{1}{2} & 0 & \frac{1}{2} \end{pmatrix}, \quad \mathrm{trace}(Q_{ds}^T \nabla f(P^{\{i\}})) = 5$$

Thus whereas FAQ takes non-deterministic, "choppy" update steps, GOAT optimizes smoothly and deterministically. Figure 1 is an illustration.
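The worked example above can be checked numerically in a few lines (reproducing the paper's numbers, nothing more):

```python
import numpy as np

grad = np.array([[0., 3., 0.],
                 [2., 1., 2.],
                 [0., 0., 0.]])
Q1 = np.array([[0., 1., 0.], [0., 0., 1.], [1., 0., 0.]])
Q2 = np.array([[0., 1., 0.], [1., 0., 0.], [0., 0., 1.]])
Qds = 0.5 * Q1 + 0.5 * Q2        # convex combination of the tied solutions

t1 = np.trace(Q1.T @ grad)       # 3 + 2 + 0 = 5
t2 = np.trace(Q2.T @ grad)       # 3 + 2 + 0 = 5
tds = np.trace(Qds.T @ grad)     # 3 + 1 + 1 = 5
```

All three directions score 5, so averaging the tied permutations costs nothing in objective value while removing the arbitrary choice.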
3 Experimental Setup
We run Procrustes, SGM, and GOAT on 40 language pairs. We also run system combination experiments similar to Marchisio et al. (2021). We evaluate with the standard precision@1 (P@1).
Figure 1: Optimization step of FAQ vs. GOAT. FAQ arbitrarily chooses the direction of a permutation matrix. GOAT averages permutation matrices to take a smoother path.
We induce lexicons using (1) the closed-form solution to the orthogonal Procrustes problem of Equation 1, extracting nearest neighbors using CSLS (Conneau et al., 2018), (2) SGM, solving the seeded version of Equation 2, and (3) GOAT. Word graphs are $G_x = XX^T$, $G_y = YY^T$.
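CSLS replaces raw cosine nearest-neighbor scores with hubness-corrected ones: each pair's similarity is penalized by both words' average similarity to their $k$ nearest cross-lingual neighbors. A sketch of the scoring (our own naming; the original uses $k{=}10$, which we follow as the default):

```python
import numpy as np

def csls_scores(XW, Y, k=10):
    """CSLS (Conneau et al., 2018): 2*cos(x, y) - r(x) - r(y), where r()
    is the mean cosine similarity to the k nearest cross-lingual neighbors."""
    Xn = XW / np.linalg.norm(XW, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    cos = Xn @ Yn.T                                       # (n, m) cosine matrix
    r_src = np.sort(cos, axis=1)[:, -k:].mean(axis=1)     # hubness of each source word
    r_tgt = np.sort(cos, axis=0)[-k:, :].mean(axis=0)     # hubness of each target word
    return 2 * cos - r_src[:, None] - r_tgt[None, :]

# degenerate check: identical orthonormal spaces should match word i to word i
scores = csls_scores(np.eye(4), np.eye(4), k=1)
```

Translation pairs are then read off as the row-wise argmax of the score matrix.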
System Combination
We perform system combination experiments analogous to those of Marchisio et al. (2021), incorporating GOAT. Figure 2 shows the system, which is made of two components: GOAT run in forward and reverse directions, and "Iterative Procrustes with Stochastic-Add" from Marchisio et al. (2021). This iterative version of Procrustes runs Procrustes in source→target and target→source directions and feeds $H$ random hypotheses from the intersection of both directions into another run of Procrustes with the gold seeds. The process repeats for $I$ iterations, adding $H$ more random hypotheses each time until all are chosen. We set $H = 100$ and $I = 5$, as in the original work.
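The loop just described can be sketched as follows. This is a hedged reconstruction under our own naming, not Marchisio et al. (2021)'s code: plain dot-product nearest neighbors stand in for CSLS, and embeddings are assumed row-normalized.

```python
import numpy as np

def procrustes_fit(A, B):
    """Orthogonal W minimizing ||A @ W - B||_F."""
    U, _, Vt = np.linalg.svd(B.T @ A)
    return Vt.T @ U.T

def iterative_stochastic_add(X, Y, gold_seeds, H=100, I=5, seed=0):
    """Sketch of Iterative Procrustes with Stochastic-Add: map in both
    directions, intersect the hypothesized pairs, and add H random agreed
    hypotheses to the gold seeds on each of I iterations."""
    rng = np.random.default_rng(seed)
    hyps = set()
    for _ in range(I):
        pairs = list(gold_seeds) + sorted(hyps)
        src = [s for s, t in pairs]; tgt = [t for s, t in pairs]
        W_fwd = procrustes_fit(X[src], Y[tgt])   # source -> target
        W_bwd = procrustes_fit(Y[tgt], X[src])   # target -> source
        fwd = {(i, int(np.argmax(X[i] @ W_fwd @ Y.T))) for i in range(len(X))}
        bwd = {(int(np.argmax(Y[j] @ W_bwd @ X.T)), j) for j in range(len(Y))}
        agreed = sorted(fwd & bwd)               # hypotheses both directions accept
        rng.shuffle(agreed)
        hyps |= set(agreed[:H])                  # stochastically add H of them
    return sorted(set(gold_seeds) | hyps)

# toy check: Y is an exact rotation of X, seeds are the first 5 identity pairs
rng = np.random.default_rng(1)
X = rng.normal(size=(10, 3))
X /= np.linalg.norm(X, axis=1, keepdims=True)
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
Y = X @ Q
lexicon = iterative_stochastic_add(X, Y, [(i, i) for i in range(5)], H=20, I=2)
```

On this exactly-isometric toy instance both directions agree on every word, so the full identity lexicon is recovered.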
3.1 Data & Software
We use publicly-available fastText word embeddings (Bojanowski et al., 2017)⁵ which we normalize, mean-center, and renormalize (Artetxe et al., 2018; Zhang et al., 2019), and bilingual dictionaries from MUSE⁶ filtered to be one-to-one.⁷ For languages with 200,000+ embeddings, we use the first

⁵https://fasttext.cc/docs/en/pretrained-vectors.html
⁶https://github.com/facebookresearch/MUSE
⁷For each source word, keep the first unused target word. Targets are in arbitrary order, so this is random sampling.
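The normalize/mean-center/renormalize preprocessing can be sketched in NumPy (a minimal illustration of the cited recipe, not the authors' pipeline):

```python
import numpy as np

def normalize_embeddings(E):
    """Length-normalize, mean-center, then length-normalize again,
    following Artetxe et al. (2018); Zhang et al. (2019)."""
    E = E / np.linalg.norm(E, axis=1, keepdims=True)  # unit-length rows
    E = E - E.mean(axis=0, keepdims=True)             # remove the common mean
    return E / np.linalg.norm(E, axis=1, keepdims=True)

E = np.random.default_rng(2).normal(size=(100, 8))
E_norm = normalize_embeddings(E)
```

After the final step every row is unit-length again, so dot products between words are exactly cosine similarities.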