
Bilingual Lexicon Induction for Low-Resource Languages
using Graph Matching via Optimal Transport
Kelly Marchisio1, Ali Saad-Eldin3, Kevin Duh1,4,
Carey Priebe2,4, Philipp Koehn1
1Department of Computer Science, 2Department of Applied Mathematics and Statistics,
3Department of Biomedical Engineering, 4Human Language Technology Center of Excellence
Johns Hopkins University
{kmarc,asaadel1}@jhu.edu
kevinduh@cs.jhu.edu, {cep, phi}@jhu.edu
Abstract
Bilingual lexicons form a critical component of various natural language processing applications, including unsupervised and semisupervised machine translation and crosslingual information retrieval. We improve bilingual lexicon induction performance across 40 language pairs with a graph-matching method based on optimal transport. The method is especially strong with low amounts of supervision.
1 Introduction
Bilingual lexicon induction (BLI) from word embedding spaces is a popular task with a large body of existing literature (e.g. Mikolov et al., 2013; Artetxe et al., 2018; Conneau et al., 2018; Patra et al., 2019; Shi et al., 2021). The goal is to extract a dictionary of translation pairs given separate language-specific embedding spaces, which can then be used to bootstrap downstream tasks such as cross-lingual information retrieval and unsupervised/semi-supervised machine translation.
A great challenge across NLP is maintaining performance in low-resource scenarios. A common criticism of the BLI and low-resource MT literature is that while claims are made about diverse and under-resourced languages, research is often performed on down-sampled corpora of high-resource, highly-related languages in similar domains (Artetxe et al., 2020). Such corpora are not good proxies for true low-resource languages owing to data challenges such as dissimilar scripts, domain shift, noise, and lack of sufficient bitext (Marchisio et al., 2020). These differences can lead to dissimilarity between the embedding spaces (decreasing isometry), causing BLI to fail (Søgaard et al., 2018; Nakashole and Flauger, 2018; Ormazabal et al., 2019; Glavaš et al., 2019; Vulić et al., 2019; Patra et al., 2019; Marchisio et al., 2020).
There are two axes by which a language dataset is considered “low-resource”. First, the language itself may be a low-resource language: one for which little bitext and/or monolingual text exists. Even for high-resource languages, the long tail of words may have poorly trained word embeddings due to rarity in the dataset (Gong et al., 2018; Czarnowska et al., 2019). In the data-poor setting of true low-resource languages, a great majority of words have little representation in the corpus, resulting in poorly-trained embeddings for a large proportion of them. The second axis is low supervision: there are few ground-truth examples from which to learn. For BLI from word embedding spaces, low supervision means there are few seeds from which to induce a relationship between the spaces, regardless of the quality of the spaces themselves.
We bring a new algorithm for graph-matching
based on optimal transport (OT) to the NLP and
BLI literature. We evaluate using 40 language
pairs under varying amounts of supervision. The
method works strikingly well across language pairs,
especially in low-supervision contexts. As low supervision on low-resource languages reflects the real-world use case for BLI, this is an encouraging development for realistic scenarios.
2 Background
The typical baseline approach for BLI from word embedding spaces assumes that the spaces can be mapped via a linear transformation. Such methods typically involve solutions to the Procrustes problem (see Gower et al. (2004) for a review). Alternatively, a graph-based view considers words as nodes in undirected weighted graphs, where edge weights are distances between words. Methods taking this view do not assume that a linear mapping between the spaces exists, allowing for more flexible matching.
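As a concrete illustration of the linear-mapping baseline described above (not the graph-matching method this paper proposes), the short NumPy sketch below solves the orthogonal Procrustes problem on a set of seed translation pairs and then retrieves nearest-neighbour translations. The matrix names X_seed/Y_seed, the cosine retrieval, and the parameter k are illustrative assumptions, not details from this paper.

import numpy as np

def procrustes_map(X_seed, Y_seed):
    # Closed-form orthogonal W minimizing ||X_seed @ W - Y_seed||_F,
    # obtained from the SVD of the cross-covariance matrix (illustrative sketch).
    U, _, Vt = np.linalg.svd(X_seed.T @ Y_seed)
    return U @ Vt

def induce_lexicon(X, Y, W, k=1):
    # Map source vectors into the target space and take cosine nearest neighbours.
    Xm = X @ W
    Xm = Xm / np.linalg.norm(Xm, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    sims = Xm @ Yn.T
    return np.argsort(-sims, axis=1)[:, :k]  # top-k target indices per source word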
BLI from word embedding spaces
Assume separately-trained monolingual word embedding spaces $X \in \mathbb{R}^{n \times d}$ and $Y \in \mathbb{R}^{m \times d}$, where $n$/$m$ are the source/target language vocabulary sizes and $d$ is the embedding dimension. We build the matrices $X$
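To make the graph-based view concrete with this notation, the sketch below builds an undirected weighted graph for one language from its embedding matrix. The cosine-similarity k-nearest-neighbour construction and the choice of k are assumptions made only for illustration; they are not necessarily the construction used in this paper.

import numpy as np

def knn_graph(E, k=10):
    # Build a symmetric k-nearest-neighbour adjacency matrix from embeddings E (n x d).
    En = E / np.linalg.norm(E, axis=1, keepdims=True)
    sims = En @ En.T                       # cosine similarity between all word pairs
    np.fill_diagonal(sims, -np.inf)        # exclude self-loops
    nn = np.argsort(-sims, axis=1)[:, :k]  # each word's k nearest neighbours
    A = np.zeros_like(sims)
    rows = np.repeat(np.arange(E.shape[0]), k)
    A[rows, nn.ravel()] = sims[rows, nn.ravel()]  # edge weight = similarity (a distance could be used instead)
    return np.maximum(A, A.T)              # symmetrize: undirected weighted graph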