Helpful Neighbors:
Leveraging Neighbors in Geographic Feature Pronunciation
Llion Jones† Richard Sproat† Haruko Ishikawa† Alexander Gutkin‡
†Google Japan ‡Google UK
{llion,rws,ishikawa,agutkin}@google.com
Abstract
If one sees the place name Houston Mer-
cer Dog Run in New York, how does one
know how to pronounce it? Assuming one
knows that Houston in New York is pro-
nounced /ˈhaʊstən/ and not like the Texas
city (/ˈhjuːstən/), then one can probably guess
that /ˈhaʊstən/ is also used in the name of
the dog park. We present a novel architec-
ture that learns to use the pronunciations of
neighboring names in order to guess the pro-
nunciation of a given target feature. Ap-
plied to Japanese place names, we demonstrate the utility of the model for finding and
proposing corrections for errors in Google
Maps.
To demonstrate the utility of this approach
to structurally similar problems, we also re-
port on an application to a totally different
task: cognate reflex prediction in compara-
tive historical linguistics. A version of the
code has been open-sourced.1
1 Introduction
In many parts of the world, pronunciation of
toponyms and establishments can require local
knowledge. Many visitors to New York, for exam-
ple, get tripped up by Houston Street, which they
assume is pronounced the same as the city in Texas.
If they do not know how to pronounce Houston
Street, they would likely also not know how to pro-
nounce the nearby Houston Mercer Dog Run. But
if one knows one, that can (usually) be used as a
clue to how to pronounce the other.
Before we proceed further, a bit of terminology.
Technically, the term toponym refers to the name
of a geographical or administrative feature, such
as a river, lake, town or state. In most of what fol-
lows, we will use the term feature to refer to these
1 https://github.com/google-research/google-research/tree/master/cognate_inpaint_neighbors
and other entities such as roads, buildings, schools
etc. In practice we will not make a major distinc-
tion between the two, but since there is a sense in
which toponyms are more basic, and the names of
the more general features are often derived from a
toponym (as in the Houston Mercer Dog Run ex-
ample above), we will retain the distinction where
it is needed.
While feature names cause problems not infrequently in the US, they become a truly serious issue in Japan.
Japan is notorious for having toponyms whose pro-
nunciation is so unexpected that even native speak-
ers may not know how to pronounce a given case.
Most toponyms in Japanese are written in kanji
(Chinese characters) with a possible intermixing
of one of the two syllabaries, hiragana or katakana.
Thus 上野 Ueno is entirely in kanji; 虎ノ門 Tora
no mon has two kanji and one katakana symbol
(the second); and 吹割の滝 Fukiwari Waterfalls
has three kanji and one hiragana symbol (the third).
Features more generally tend to have more charac-
ters in one of the syllabaries—especially katakana
if, for example, the feature is a building that in-
cludes the name of a company as part of its name.
The syllabaries are basically phonemic scripts
so there is generally no ambiguity in how to pro-
nounce those portions of names, but kanji present a
serious problem in that the pronunciation of a kanji
string in a toponym is frequently something one
just has to know. To take the example 上野 Ueno
above, that pronunciation (for the well-known area
in Tokyo) is indeed the most common one, but
there are places in Japan with the same spelling
but with pronunciations such as Uwano, Kamino, and Wano, among others.2

2 Different pronunciations of kanji are often referred to as readings, but in this paper we will use the more general term pronunciation.

It is well known that many kanji have both a native (kun) Japanese pronunciation (e.g., 山 yama ‘mountain’) as well as one or more Chinese-derived on pronunciations (e.g., san ‘mountain’), but the issue with toponyms goes well beyond this, since there are nanori pronunciations of kanji that are only found in names (Ogihara, 2021): 山 also has the nanori pronunciation taka, for example. The kun-on-nanori variants relate to an important property of how kanji are used
in Japanese: among all modern writing systems,
the Japanese use of kanji comes closest to being
semasiographic—i.e. representing meaning rather
than specific morphemes. The common toponym
component kawa ‘river’ is usually written 川, but can also be written as 河, which also means ‘river’.
That kanji in turn has other pronunciations, such as ka, a Sino-Japanese word for ‘river’. This freedom
to spell words with a range of kanji that have the
same meaning, or to read kanji with any of a num-
ber of morphemes having the same meaning, is a
particular characteristic of Japanese. Thus, while
reading place names can be tricky in many parts
of the world, the problem is particularly acute in
Japan.
Since the variation is largely unpredictable, one
therefore simply needs to know for a given to-
ponym what the pronunciation is. But once one
knows, for instance, that a name written 上野 is
read as Uwano, as with the Houston case, one
ought to be able to deduce that in the name of the
local 上野第1公園 ‘Uwano First Public Park’, this
is read as Uwano and not Ueno. If a digital assistant is reading this name to a user, or needs to understand the user’s pronunciation of the name, it needs to know the correct pronunciation. While
one might expect a complete and correct maps
database to have all of this information correctly
entered, in practice maps data contain many errors,
especially for less frequently accessed features.
In this paper we propose a model that learns to
use information from the geographical context to
guide the pronunciation of features. We demon-
strate its application to detecting and correcting er-
rors in Google Maps. In addition, in Section 8 we
show that the model can be applied to a different
but structurally similar problem, namely the prob-
lem of cognate reflex prediction in comparative
historical linguistics. In this case the ‘neighbors’
are related word forms in a set of languages from a
given language family, and the pronunciation to be
predicted is the corresponding form in a language
from the same family.
2 Background
Pronouncing written geographical feature names
involves a combination of text normalization (if
the names contain expressions such as numbers
or abbreviations), and word pronunciation, often
termed “grapheme-to-phoneme conversion”. Both
of these are typically cast as sequence-to-sequence
problems, and neural approaches to both are now
common. For neural approaches to grapheme-to-phoneme conversion see (Yao and Zweig, 2015; Rao et al., 2015; Toshniwal and Livescu, 2016; Peters et al., 2017; Yolchuyeva et al., 2019), and for text normalization see (Sproat and Jaitly, 2017; Zhang et al., 2019; Yolchuyeva et al., 2018; Pramanik and Hussain, 2019; Mansfield et al., 2019; Kawamura et al., 2020; Tran and Bui, 2021). For
languages that use the Chinese script, grapheme-
to-phoneme conversion may benefit from the fact
that Chinese characters can mostly be decomposed
into a component that relates to the meaning of the
character and another that relates to the pronunci-
ation. The latter information is potentially useful,
in particular in Chinese and in the Sino-Japanese
readings of characters in Japanese. Recent neural
models that have taken advantage of this include
(Dai and Cai, 2017; Nguyen et al., 2020). On the
other hand, it should be pointed out that other more
‘brute force’ decompositions of characters seem to
be useful. Thus Yu et al. (2020) propose a byte de-
composition for (UTF-8) character encodings for
a model that covers a wide variety of languages,
including Chinese and Japanese.
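To illustrate the kind of byte decomposition that such models build on (the actual model of Yu et al. (2020) is not reproduced here), a kanji string can be split into its UTF-8 bytes, trading a longer sequence for a tiny, language-independent vocabulary:

```python
# UTF-8 byte decomposition of a kanji place name.  Each kanji in the
# Basic Multilingual Plane encodes to three bytes, so a byte-level model
# sees a longer input sequence over a vocabulary of only 256 symbols.
name = "上野"  # Ueno
byte_seq = list(name.encode("utf-8"))
print(byte_seq)       # [228, 184, 138, 233, 135, 142]
print(len(byte_seq))  # 6: three bytes per kanji
```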
The above approaches generally treat the prob-
lem in isolation in the sense that the problem is cast
as one where the task is to predict a pronunciation
independent of context. Different pronunciations
for the same string in different linguistic contexts
comes under the rubric of homograph disambigua-
tion, and there is a long tradition of work in this
area; for an early example see (Yarowsky, 1996)
and for a recent incarnation see (Gorman et al.,
2018). Not surprisingly, there has been recent in-
terest in neural models for predicting homograph
pronunciations: see (Park and Lee, 2020; Shi et al.,
2021) for recent examples focused on Mandarin.
The present task is different, since what disam-
biguates the possible pronunciations of Japanese
features is not generally linguistic, but geograph-
ical context, which can be thought of as a way of
biasing the decision as to which pronunciation to
use, given evidence from the local context. Our approach is similar in spirit to that of Pundak et al. (2018), who propose the use of a bias encoder in a “listen-attend-and-spell” (Chan et al., 2016) Automatic Speech Recognition architecture. The bias encoder takes a set of “bias phrases”, which can be used to guide the model towards a particular decoding. Pundak et al. (2018)’s model is shown schematically in Figure 1.

Figure 1: The biasing LAS model from Pundak et al. (2018), Figure 1a.
3 Data
Features in Google Maps are stored in a data rep-
resentation that includes a variety of information
about each feature including: its location as a
bounding box in latitude-longitude; the type of
the feature—street, building, municipality, topo-
graphic feature, etc.; name(s) of the feature in
the native language as well as in many (mostly
automatically generated) transliterations; an ad-
dress if there is an address associated with this
feature; road signs that may be associated; and
so forth. Each feature is identified with a unique
hexadecimal feature id. Features may have ad-
ditional names besides the primary names. For
example in English, street names are often ab-
breviated (Main St.) and these abbreviations are
typically expanded (Main Street) as an additional
name. Many Japanese features have pronuncia-
tions of the names added as additional names in
katakana. Some of these have been carefully hand
curated, but many were generated automatically
and are therefore potentially errorful, as we will
see. Since the katakana version is used as the basis
for transliterations into other languages, localized
pronunciations for text-to-speech, as well as search
suggestions, it is important that it be correct.
We started by extracting from the database all
features that include a broad (but not exhaustive)
set of feature types from a bounding box that cov-
ers the four main islands of Japan. We then ex-
tracted feature summaries for names that included
both kanji original names, and katakana rendi-
tions. These summaries include the feature name,
the hiragana version of the name converted from
katakana, and the bounding box for the feature.
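The katakana-to-hiragana conversion mentioned above is, at its core, a fixed Unicode codepoint shift; the following is a minimal sketch (not the production converter), which only shifts the main katakana range and passes everything else through:

```python
def kata_to_hira(text: str) -> str:
    """Map katakana to hiragana via the fixed Unicode offset of 0x60.

    Only the main katakana range U+30A1..U+30F6 is shifted; other
    characters (kanji, the long-vowel mark 'ー', Latin) pass through.
    """
    return "".join(
        chr(ord(c) - 0x60) if 0x30A1 <= ord(c) <= 0x30F6 else c
        for c in text
    )

print(kata_to_hira("ウエノ"))  # うえの
```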
We then find, for each feature in the feature sum-
maries, a bucket of other features that are within
a given radius (10 kilometers in our experiments).
Then, for each feature in each bucket, we desig-
nate that feature a target feature, and we build
neighborhoods around that feature. We attempt, for each feature, to find interesting neighboring features whose name shares a kanji bigram with the
target feature’s name. The intuition here is that a
feature that is likely to be useful in determining the
pronunciation of another feature should be nearby
geographically, and should share at least some of
the name. In any case we cap the number of ‘non-
interesting’ neighbors to a limit—5 in our experi-
ments. This means that some neighborhoods will
have target features that lack useful neighbors; this
is a realistic situation in that while it is often the
case that one can find hints for a name’s pronunci-
ation in the immediate neighbors, it is not always
the case. While such neighborhoods are not useful
from the point of view of neighbor-based evidence
for a target feature’s pronunciation, they still pro-
vide useful data for training the target sequence-
to-sequence model. Our final dataset consists of
about 2.7M feature neighborhoods, including the
information from the summary for each target fea-
ture as described above, the associated neighbor-
ing features and their summaries, along with the
distance (in kilometers) from the target feature.
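The neighborhood construction just described can be sketched as follows. The feature records, the distance approximation, and the helper names are all illustrative; only the 10 km radius, the kanji-bigram "interestingness" test, and the cap of 5 non-interesting neighbors come from the description above:

```python
import math

RADIUS_KM = 10.0  # bucketing radius used in the experiments
MAX_PLAIN = 5     # cap on 'non-interesting' neighbors

def dist_km(a, b):
    # Equirectangular approximation; adequate at ~10 km scales.
    lat1, lon1 = a
    lat2, lon2 = b
    x = math.radians(lon2 - lon1) * math.cos(math.radians((lat1 + lat2) / 2))
    y = math.radians(lat2 - lat1)
    return 6371.0 * math.hypot(x, y)

def bigrams(name):
    return {name[i:i + 2] for i in range(len(name) - 1)}

def build_neighborhood(target, features):
    """Collect neighbors within RADIUS_KM; keep all 'interesting' ones
    (sharing a name bigram with the target) and at most MAX_PLAIN others."""
    interesting, plain = [], []
    for f in features:
        if f is target:
            continue
        d = dist_km(target["loc"], f["loc"])
        if d > RADIUS_KM:
            continue
        shared = bigrams(target["name"]) & bigrams(f["name"])
        (interesting if shared else plain).append((f["name"], f["pron"], round(d, 2)))
    return interesting + plain[:MAX_PLAIN]

feats = [
    {"name": "上野第1公園", "pron": "うわのだいいちこうえん", "loc": (34.80, 135.60)},
    {"name": "上野", "pron": "うわの", "loc": (34.81, 135.61)},
    {"name": "郵便局", "pron": "ゆうびんきょく", "loc": (34.82, 135.62)},
]
hood = build_neighborhood(feats[0], feats)  # 上野 first (shares 上野), then 郵便局
```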
Figure 2 shows parts of one such neighborhood.
4 Model
Despite the differences noted above, the problem
we are interested in can still be characterized at its
core as a sequence-to-sequence problem. The in-
put is a sequence of tokens representing the feature
name in its original Japanese written form. The
output is a sequence of hiragana characters repre-
senting the correct pronunciation. The difference
between this and a more conventional sequence-
to-sequence problem is that we provide additional
biasing information in the form of geographical
neighbors, such as their pronunciation and geographical location.

Figure 2: A small example of a neighborhood. Main name: セラヴィ反町, pronunciation seravi sorimachi (i.e., C’est la Vie …); neighbor 反町 tanmachi (pink area on map); neighbor 上反町 kamitanmachi (green area on map). The store, circled on the map, has a pronunciation listed as C’est la Vie Sorimachi, but the neighboring areas are Tanmachi and Kamitanmachi; Sorimachi is therefore wrong.

This neighbor information is
provided as additional input sequences to aid the
model in making its prediction. In our experi-
ments, we limit the number of neighbors to at most
30 (it is usually far fewer), each consisting of two sequences, namely the neighbor’s name and the corresponding pronunciation.
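Packing these variable-length inputs into fixed-size batches can be sketched as below. The lengths (name_len=20, pron_len=40, at most 30 neighbors) are the ones reported in the experiments; the token ids and the PAD id are purely illustrative:

```python
# Pad/truncate neighbor name and pronunciation token sequences to the
# fixed sizes used in the experiments.  PAD=0 is an assumed pad id.
NAME_LEN, PRON_LEN, MAX_NEIGH, PAD = 20, 40, 30, 0

def pad(seq, length):
    """Right-pad with PAD, then truncate, to exactly `length` tokens."""
    return (seq + [PAD] * length)[:length]

def pack_neighbors(neighbors):
    """neighbors: list of (name_ids, pron_ids) pairs.  Returns a
    [MAX_NEIGH, NAME_LEN] and a [MAX_NEIGH, PRON_LEN] matrix, padding
    out entirely absent neighbors with all-PAD rows."""
    neighbors = neighbors[:MAX_NEIGH]
    names = [pad(n, NAME_LEN) for n, _ in neighbors]
    prons = [pad(p, PRON_LEN) for _, p in neighbors]
    while len(names) < MAX_NEIGH:
        names.append([PAD] * NAME_LEN)
        prons.append([PAD] * PRON_LEN)
    return names, prons

names, prons = pack_neighbors([([5, 6], [7, 8, 9])])
```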
4.1 Model architecture
Due to many recent successes in other NLP appli-
cations, we experiment with a transformer model
(Vaswani et al.,2017). Our transformer model
uses a standard encoder-decoder architecture as the
backbone. The inputs to the model are the input
name with unknown pronunciation 𝑥inp, the neigh-
bor names 𝑥name (of length name_len) and asso-
ciated pronunciations 𝑥pron (of length pron_len).
First, these input tokens are embedded with size
emb_size. The embeddings are shared between the feature names and the pronunciations; i.e., the same embeddings are used for the input
name tokens and the neighbor tokens, and sim-
ilarly between the target pronunciation (decoder
output) and the neighbors’ pronunciations:
emb_inp = Embed_name(x_inp),
emb_name = Embed_name(x_name),
emb_pron = Embed_pron(x_pron).
These embedded tokens are then processed separately by their respective encoders. No parameters are shared between these encoders, or with the decoder:

h_inp = Encoder_inp(emb_inp),
h_name = Encoder_name(emb_name),
h_pron = Encoder_pron(emb_pron).
Since each example has nneigh neighbors, h_inp is of shape [inp_len, emb_size], but the processed neighbor spelling and pronunciation outputs are of shape [nneigh, name_len, emb_size] and [nneigh, pron_len, emb_size].
One of the simplest ways to incorporate the
neighboring information is to concatenate the fea-
ture names and pronunciation embeddings into the
main input sequence, allowing the transformer to
attend directly to all the relevant information. Un-
fortunately, this is not possible with a vanilla trans-
former with a quadratic attention mechanism if we
want to attend to, say, 30 neighbors. In our experi-
ments name_len is set to 20 and pron_len is set to
40, yielding (20 + 40) × 30 = 1800 input tokens,
far too many for a vanilla transformer decoder to
attend to. To mitigate this, we average the encoder outputs to give a single vector per neighbor to attend to:
s_name = Ave(h_name),
s_pron = Ave(h_pron),
c = Concat(h_inp, s_name, s_pron).
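The shape bookkeeping of this averaging-and-concatenation step can be checked with a toy sketch; constant nested lists stand in for real encoder outputs, and the dimensions are the ones used in the experiments:

```python
# Toy check of the averaging/concatenation shapes.  Plain nested lists
# stand in for the encoder outputs h_inp, h_name, h_pron.
inp_len, name_len, pron_len, nneigh, emb_size = 12, 20, 40, 30, 8

h_inp = [[0.0] * emb_size for _ in range(inp_len)]                      # [inp_len, emb]
h_name = [[[1.0] * emb_size for _ in range(name_len)] for _ in range(nneigh)]
h_pron = [[[2.0] * emb_size for _ in range(pron_len)] for _ in range(nneigh)]

def ave(seq):
    """Mean over the token axis: [len, emb_size] -> [emb_size]."""
    n = len(seq)
    return [sum(tok[d] for tok in seq) / n for d in range(emb_size)]

s_name = [ave(n) for n in h_name]   # [nneigh, emb_size]
s_pron = [ave(p) for p in h_pron]   # [nneigh, emb_size]

# Concatenate along the sequence axis: the decoder attends over
# inp_len + 2 * nneigh vectors instead of 1800+ raw neighbor tokens.
c = h_inp + s_name + s_pron
assert len(c) == inp_len + 2 * nneigh  # 12 + 60 = 72 rows
```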
The vectors are concatenated along the neigh-
bor dimension to give a sequence of size
[inp_len+2*nneigh, emb_size]. Optionally, if
embeddings representing the latitudinal and lon-
gitudinal position of the feature (which we refer
to as Lat-Long embeddings, discussed later) are
used, then these are also concatenated here. This combined sequence is then attended over by the transformer decoder. There are no positional embeddings added
to this sequence, so they are unordered from the
point of view of decoder attention. Therefore, we
help the decoder match the neighbor names to their