san ‘mountain’), but the issue with toponyms goes
well beyond this, since there are nanori pronunciations
of kanji that are only found in names (Ogihara, 2021):
山 also has the nanori pronunciation taka, for example.
The kun-on-nanori variants relate
to an important property of how kanji are used
in Japanese: among all modern writing systems,
the Japanese use of kanji comes closest to being
semasiographic—i.e. representing meaning rather
than specific morphemes. The common toponym
component kawa ‘river’ is usually written 川, but
can also be written as 河, which also means ‘river’.
That kanji in turn has other pronunciations, such as
kō, a Sino-Japanese word for ‘river’. This freedom
to spell words with a range of kanji that have the
same meaning, or to read kanji with any of a num-
ber of morphemes having the same meaning, is a
particular characteristic of Japanese. Thus, while
reading place names can be tricky in many parts
of the world, the problem is particularly acute in
Japan.
Since the variation is largely unpredictable, one
simply needs to know the pronunciation of a given
toponym. But once one
knows, for instance, that a name written 上野 is
read as Uwano, as with the Houston case, one
ought to be able to deduce that in the name of the
local 上野第1公園 ‘Uwano First Public Park’, this
is read as Uwano and not Ueno. If a digital
assistant is reading this name to you, or needs
to understand your pronunciation of the name, it
needs to know the correct pronunciation. While
one might expect a complete and correct maps
database to have all of this information correctly
entered, in practice maps data contain many errors,
especially for less frequently accessed features.
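
The deduction described above, where a nearby feature with a known reading settles the reading of the same spelling in a neighboring feature name, can be illustrated with a minimal sketch. This is a nearest-neighbor lookup for illustration only, not the learned model proposed in this paper; the gazetteer entries, the coordinates, and the function name `nearest_reading` are all hypothetical.

```python
from math import hypot

# Hypothetical toy gazetteer: (spelling, reading, x_km, y_km).
# Coordinates are invented for illustration only.
KNOWN_FEATURES = [
    ("上野", "uwano", 0.0, 0.0),
    ("上野", "ueno", 250.0, 40.0),
]


def nearest_reading(spelling, x_km, y_km, known=KNOWN_FEATURES):
    """Return the reading of the closest known feature written `spelling`.

    Returns None when no known feature has that spelling.
    """
    candidates = [
        (hypot(fx - x_km, fy - y_km), reading)
        for name, reading, fx, fy in known
        if name == spelling
    ]
    if not candidates:
        return None
    # min() on (distance, reading) pairs picks the smallest distance.
    return min(candidates)[1]


# A park named 上野第1公園 located near the first entry inherits its reading.
print(nearest_reading("上野", 0.4, 0.3))
```

A hard rule of this kind fails whenever the nearest neighbor is itself mislabeled, which is one reason to prefer a model that weighs evidence from several neighbors.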
In this paper we propose a model that learns to
use information from the geographical context to
guide the pronunciation of features. We demonstrate
its application to detecting and correcting errors
in Google Maps. In addition, in Section 8 we
show that the model can be applied to a different
but structurally similar problem, namely the prob-
lem of cognate reflex prediction in comparative
historical linguistics. In this case the ‘neighbors’
are related word forms in a set of languages from a
given language family, and the pronunciation to be
predicted is the corresponding form in a language
from the same family.
2 Background
Pronouncing written geographical feature names
involves a combination of text normalization (if
the names contain expressions such as numbers
or abbreviations), and word pronunciation, often
termed “grapheme-to-phoneme conversion”. Both
of these are typically cast as sequence-to-sequence
problems, and neural approaches to both are now
common. For neural approaches to grapheme-to-phoneme
conversion see (Yao and Zweig, 2015;
Rao et al., 2015; Toshniwal and Livescu, 2016;
Peters et al., 2017; Yolchuyeva et al., 2019), and
for text normalization see (Sproat and Jaitly, 2017;
Zhang et al., 2019; Yolchuyeva et al., 2018;
Pramanik and Hussain, 2019; Mansfield et al., 2019;
Kawamura et al., 2020; Tran and Bui, 2021). For
languages that use the Chinese script, grapheme-
to-phoneme conversion may benefit from the fact
that Chinese characters can mostly be decomposed
into a component that relates to the meaning of the
character and another that relates to the pronunci-
ation. The latter information is potentially useful,
in particular in Chinese and in the Sino-Japanese
readings of characters in Japanese. Recent neural
models that have taken advantage of this include
(Dai and Cai, 2017; Nguyen et al., 2020). On the
other hand, it should be pointed out that other more
‘brute force’ decompositions of characters seem to
be useful. Thus Yu et al. (2020) propose a byte de-
composition for (UTF-8) character encodings for
a model that covers a wide variety of languages,
including Chinese and Japanese.
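
To make the idea of a byte decomposition concrete, the following minimal sketch splits a string into its UTF-8 bytes, so that each kanji becomes a short sequence of integers in the range 0 to 255. It shows only the general idea; the function names are hypothetical, and the exact input representation used by Yu et al. (2020) may differ.

```python
def to_byte_tokens(text):
    """Decompose a string into a list of its UTF-8 byte values (0-255)."""
    return list(text.encode("utf-8"))


def from_byte_tokens(tokens):
    """Inverse of to_byte_tokens: rebuild the string from byte values."""
    return bytes(tokens).decode("utf-8")


# Each of these kanji occupies three bytes in UTF-8.
for kanji in ["山", "川", "河"]:
    print(kanji, [hex(b) for b in to_byte_tokens(kanji)])
```

With this representation the model vocabulary is fixed at 256 symbols regardless of how many distinct characters the script contains, at the cost of longer input sequences.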
The above approaches generally treat the problem
in isolation, in the sense that the task is to
predict a pronunciation
independent of context. Different pronunciations
for the same string in different linguistic contexts
come under the rubric of homograph disambiguation,
and there is a long tradition of work in this
area; for an early example see (Yarowsky, 1996)
and for a recent incarnation see (Gorman et al.,
2018). Not surprisingly, there has been recent
interest in neural models for predicting homograph
pronunciations: see (Park and Lee, 2020; Shi et al.,
2021) for recent examples focused on Mandarin.
The present task is different, since what disam-
biguates the possible pronunciations of Japanese
features is not generally linguistic, but geograph-
ical context, which can be thought of as a way of
biasing the decision as to which pronunciation to
use, given evidence from the local context. Our