Helpful Neighbors:
Leveraging Neighbors in Geographic Feature Pronunciation
Llion Jones† Richard Sproat† Haruko Ishikawa† Alexander Gutkin‡
†Google Japan ‡Google UK
{llion,rws,ishikawa,agutkin}@google.com
Abstract
If one sees the place name Houston Mer-
cer Dog Run in New York, how does one
know how to pronounce it? Assuming one
knows that Houston in New York is pro-
nounced /ˈhaʊstən/ and not like the Texas
city (/ˈhjuːstən/), then one can probably guess
that /ˈhaʊstən/ is also used in the name of
the dog park. We present a novel architec-
ture that learns to use the pronunciations of
neighboring names in order to guess the pro-
nunciation of a given target feature. Ap-
plied to Japanese place names, we demonstrate the utility of the model for finding and
proposing corrections for errors in Google
Maps.
To demonstrate the utility of this approach
to structurally similar problems, we also re-
port on an application to a totally different
task: cognate reflex prediction in compara-
tive historical linguistics. A version of the
code has been open-sourced.1
1 Introduction
In many parts of the world, pronunciation of
toponyms and establishments can require local
knowledge. Many visitors to New York, for exam-
ple, get tripped up by Houston Street, which they
assume is pronounced the same as the city in Texas.
If they do not know how to pronounce Houston
Street, they would likely also not know how to pro-
nounce the nearby Houston Mercer Dog Run. But
if one knows one, that can (usually) be used as a
clue to how to pronounce the other.
Before we proceed further, a bit of terminology.
Technically, the term toponym refers to the name
of a geographical or administrative feature, such
as a river, lake, town or state. In most of what fol-
lows, we will use the term feature to refer to these
1 https://github.com/google-research/google-research/tree/master/cognate_inpaint_neighbors
and other entities such as roads, buildings, schools
etc. In practice we will not make a major distinc-
tion between the two, but since there is a sense in
which toponyms are more basic, and the names of
the more general features are often derived from a
toponym (as in the Houston Mercer Dog Run ex-
ample above), we will retain the distinction where
it is needed.
While feature names cause problems not infrequently in the US, they become a truly serious issue in Japan.
Japan is notorious for having toponyms whose pro-
nunciation is so unexpected that even native speak-
ers may not know how to pronounce a given case.
Most toponyms in Japanese are written in kanji
(Chinese characters) with a possible intermixing
of one of the two syllabaries, hiragana or katakana.
Thus 上野 Ueno is entirely in kanji; 虎ノ門 Tora
no mon has two kanji and one katakana symbol
(the second); and 吹割の滝 Fukiwari Waterfalls
has three kanji and one hiragana symbol (the third).
Features more generally tend to have more charac-
ters in one of the syllabaries—especially katakana
if, for example, the feature is a building that in-
cludes the name of a company as part of its name.
The syllabaries are basically phonemic scripts
so there is generally no ambiguity in how to pro-
nounce those portions of names, but kanji present a
serious problem in that the pronunciation of a kanji
string in a toponym is frequently something one
just has to know. To take the example 上野 Ueno
above, that pronunciation (for the well-known area
in Tokyo) is indeed the most common one, but
there are places in Japan with the same spelling
but with pronunciations such as Uwano, Kamino, and Wano, among others.2

2 Different pronunciations of kanji are often referred to as readings, but in this paper we will use the more general term pronunciation.

It is well known that many kanji have both a native (kun) Japanese pronunciation (e.g., 山 yama ‘mountain’) as well as one or more Chinese-derived on pronunciations (e.g., san ‘mountain’), but the issue with toponyms goes well beyond this, since there are nanori pronunciations of kanji that are only found in names (Ogihara, 2021): 山 also has the nanori pronunciation taka, for example. The kun-on-nanori variants relate to an important property of how kanji are used
in Japanese: among all modern writing systems,
the Japanese use of kanji comes closest to being
semasiographic—i.e. representing meaning rather
than specific morphemes. The common toponym
component kawa ‘river’ is usually written 川, but can also be written as 河, which also means ‘river’.
That kanji in turn has other pronunciations, such as ka, a Sino-Japanese word for ‘river’. This freedom
to spell words with a range of kanji that have the
same meaning, or to read kanji with any of a num-
ber of morphemes having the same meaning, is a
particular characteristic of Japanese. Thus, while
reading place names can be tricky in many parts
of the world, the problem is particularly acute in
Japan.
Since the variation is largely unpredictable, one
therefore simply needs to know for a given to-
ponym what the pronunciation is. But once one
knows, for instance, that a name written 上野 is
read as Uwano, as with the Houston case, one
ought to be able to deduce that in the name of the
local 上野第1公園 ‘Uwano First Public Park’, this
is read as Uwano and not Ueno. If a digital assistant is reading this name to a user, or needs to understand the user’s pronunciation of the name, it needs to know the correct pronunciation. While
one might expect a complete and correct maps
database to have all of this information correctly
entered, in practice maps data contain many errors,
especially for less frequently accessed features.
In this paper we propose a model that learns to
use information from the geographical context to
guide the pronunciation of features. We demon-
strate its application to detecting and correcting er-
rors in Google Maps. In addition, in Section 8 we
show that the model can be applied to a different
but structurally similar problem, namely the prob-
lem of cognate reflex prediction in comparative
historical linguistics. In this case the ‘neighbors’
are related word forms in a set of languages from a
given language family, and the pronunciation to be
predicted is the corresponding form in a language
from the same family.
2 Background
Pronouncing written geographical feature names
involves a combination of text normalization (if
the names contain expressions such as numbers
or abbreviations), and word pronunciation, often
termed “grapheme-to-phoneme conversion”. Both
of these are typically cast as sequence-to-sequence
problems, and neural approaches to both are now
common. For neural approaches to grapheme-to-phoneme conversion see (Yao and Zweig, 2015; Rao et al., 2015; Toshniwal and Livescu, 2016; Peters et al., 2017; Yolchuyeva et al., 2019), and for text normalization see (Sproat and Jaitly, 2017; Zhang et al., 2019; Yolchuyeva et al., 2018; Pramanik and Hussain, 2019; Mansfield et al., 2019; Kawamura et al., 2020; Tran and Bui, 2021). For
languages that use the Chinese script, grapheme-
to-phoneme conversion may benefit from the fact
that Chinese characters can mostly be decomposed
into a component that relates to the meaning of the
character and another that relates to the pronunci-
ation. The latter information is potentially useful,
in particular in Chinese and in the Sino-Japanese
readings of characters in Japanese. Recent neural
models that have taken advantage of this include
(Dai and Cai, 2017; Nguyen et al., 2020). On the
other hand, it should be pointed out that other more
‘brute force’ decompositions of characters seem to
be useful. Thus Yu et al. (2020) propose a byte de-
composition for (UTF-8) character encodings for
a model that covers a wide variety of languages,
including Chinese and Japanese.
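To illustrate the kind of byte decomposition that such models build on (the actual model of Yu et al. (2020) is not reproduced here), a kanji string can be split into its UTF-8 bytes, trading a longer sequence for a tiny, language-independent vocabulary:

```python
# UTF-8 byte decomposition of a kanji place name.  Each kanji in the
# Basic Multilingual Plane encodes to three bytes, so a byte-level model
# sees a longer input sequence over a vocabulary of only 256 symbols.
name = "上野"  # Ueno
byte_seq = list(name.encode("utf-8"))
print(byte_seq)       # [228, 184, 138, 233, 135, 142]
print(len(byte_seq))  # 6: three bytes per kanji
```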
The above approaches generally treat the prob-
lem in isolation in the sense that the problem is cast
as one where the task is to predict a pronunciation
independent of context. Different pronunciations
for the same string in different linguistic contexts
comes under the rubric of homograph disambigua-
tion, and there is a long tradition of work in this
area; for an early example see (Yarowsky, 1996)
and for a recent incarnation see (Gorman et al.,
2018). Not surprisingly, there has been recent in-
terest in neural models for predicting homograph
pronunciations: see (Park and Lee, 2020; Shi et al.,
2021) for recent examples focused on Mandarin.
The present task is different, since what disam-
biguates the possible pronunciations of Japanese
features is not generally linguistic, but geograph-
ical context, which can be thought of as a way of
biasing the decision as to which pronunciation to
use, given evidence from the local context. Our approach is similar in spirit to that of Pundak et al. (2018), who propose the use of a bias encoder in a “listen-attend-and-spell” (Chan et al., 2016) Automatic Speech Recognition architecture. The bias encoder takes a set of “bias phrases”, which can be used to guide the model towards a particular decoding. Pundak et al. (2018)’s model is shown schematically in Figure 1.

Figure 1: The biasing LAS model from Pundak et al. (2018), Figure 1a.
3 Data
Features in Google Maps are stored in a data rep-
resentation that includes a variety of information
about each feature including: its location as a
bounding box in latitude-longitude; the type of
the feature—street, building, municipality, topo-
graphic feature, etc.; name(s) of the feature in
the native language as well as in many (mostly
automatically generated) transliterations; an ad-
dress if there is an address associated with this
feature; road signs that may be associated; and
so forth. Each feature is identified with a unique
hexadecimal feature id. Features may have ad-
ditional names besides the primary names. For
example in English, street names are often ab-
breviated (Main St.) and these abbreviations are
typically expanded (Main Street) as an additional
name. Many Japanese features have pronuncia-
tions of the names added as additional names in
katakana. Some of these have been carefully hand
curated, but many were generated automatically
and are therefore potentially errorful, as we will
see. Since the katakana version is used as the basis
for transliterations into other languages, localized
pronunciations for text-to-speech, as well as search
suggestions, it is important that it be correct.
We started by extracting from the database all
features that include a broad (but not exhaustive)
set of feature types from a bounding box that cov-
ers the four main islands of Japan. We then ex-
tracted feature summaries for names that included
both kanji original names, and katakana rendi-
tions. These summaries include the feature name,
the hiragana version of the name converted from
katakana, and the bounding box for the feature.
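The katakana-to-hiragana conversion mentioned above is, at its core, a fixed Unicode codepoint shift; the following is a minimal sketch (not the production converter), which only shifts the main katakana range and passes everything else through:

```python
def kata_to_hira(text: str) -> str:
    """Map katakana to hiragana via the fixed Unicode offset of 0x60.

    Only the main katakana range U+30A1..U+30F6 is shifted; other
    characters (kanji, the long-vowel mark 'ー', Latin) pass through.
    """
    return "".join(
        chr(ord(c) - 0x60) if 0x30A1 <= ord(c) <= 0x30F6 else c
        for c in text
    )

print(kata_to_hira("ウエノ"))  # うえの
```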
We then find, for each feature in the feature sum-
maries, a bucket of other features that are within
a given radius (10 kilometers in our experiments).
Then, for each feature in each bucket, we desig-
nate that feature a target feature, and we build
neighborhoods around that feature. We attempt, for each feature, to find interesting neighboring features whose name shares a kanji bigram with the
target feature’s name. The intuition here is that a
feature that is likely to be useful in determining the
pronunciation of another feature should be nearby
geographically, and should share at least some of
the name. In any case we cap the number of ‘non-
interesting’ neighbors to a limit—5 in our experi-
ments. This means that some neighborhoods will
have target features that lack useful neighbors; this
is a realistic situation in that while it is often the
case that one can find hints for a name’s pronunci-
ation in the immediate neighbors, it is not always
the case. While such neighborhoods are not useful
from the point of view of neighbor-based evidence
for a target feature’s pronunciation, they still pro-
vide useful data for training the target sequence-
to-sequence model. Our final dataset consists of
about 2.7M feature neighborhoods, including the
information from the summary for each target fea-
ture as described above, the associated neighbor-
ing features and their summaries, along with the
distance (in kilometers) from the target feature.
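The neighborhood construction just described can be sketched as follows. The feature records, the distance approximation, and the helper names are all illustrative; only the 10 km radius, the kanji-bigram "interestingness" test, and the cap of 5 non-interesting neighbors come from the description above:

```python
import math

RADIUS_KM = 10.0  # bucketing radius used in the experiments
MAX_PLAIN = 5     # cap on 'non-interesting' neighbors

def dist_km(a, b):
    # Equirectangular approximation; adequate at ~10 km scales.
    lat1, lon1 = a
    lat2, lon2 = b
    x = math.radians(lon2 - lon1) * math.cos(math.radians((lat1 + lat2) / 2))
    y = math.radians(lat2 - lat1)
    return 6371.0 * math.hypot(x, y)

def bigrams(name):
    return {name[i:i + 2] for i in range(len(name) - 1)}

def build_neighborhood(target, features):
    """Collect neighbors within RADIUS_KM; keep all 'interesting' ones
    (sharing a name bigram with the target) and at most MAX_PLAIN others."""
    interesting, plain = [], []
    for f in features:
        if f is target:
            continue
        d = dist_km(target["loc"], f["loc"])
        if d > RADIUS_KM:
            continue
        shared = bigrams(target["name"]) & bigrams(f["name"])
        (interesting if shared else plain).append((f["name"], f["pron"], round(d, 2)))
    return interesting + plain[:MAX_PLAIN]

feats = [
    {"name": "上野第1公園", "pron": "うわのだいいちこうえん", "loc": (34.80, 135.60)},
    {"name": "上野", "pron": "うわの", "loc": (34.81, 135.61)},
    {"name": "郵便局", "pron": "ゆうびんきょく", "loc": (34.82, 135.62)},
]
hood = build_neighborhood(feats[0], feats)  # 上野 first (shares 上野), then 郵便局
```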
Figure 2 shows parts of one such neighborhood.
4 Model
Despite the differences noted above, the problem
we are interested in can still be characterized at its
core as a sequence-to-sequence problem. The in-
put is a sequence of tokens representing the feature
name in its original Japanese written form. The
output is a sequence of hiragana characters repre-
senting the correct pronunciation. The difference
between this and a more conventional sequence-
to-sequence problem is that we provide additional
biasing information in the form of geographical
neighbors, such as their pronunciation and geographical location.

Figure 2: A small example of a neighborhood. Main name: セラヴィ反町, pronunciation seravi sorimachi (i.e., C’est la Vie …); neighbor 反町 tanmachi (pink area on map); neighbor 上反町 kamitanmachi (green area on map). The store, circled on the map, has a pronunciation listed as C’est la Vie Sorimachi, but the neighboring areas are Tanmachi and Kamitanmachi; Sorimachi is therefore wrong.

This neighbor information is
provided as additional input sequences to aid the
model in making its prediction. In our experi-
ments, we limit the number of neighbors to at most
30 (it is usually far fewer), each consisting of two sequences, namely the neighbor’s name and the corresponding pronunciation.
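Packing these variable-length inputs into fixed-size batches can be sketched as below. The lengths (name_len=20, pron_len=40, at most 30 neighbors) are the ones reported in the experiments; the token ids and the PAD id are purely illustrative:

```python
# Pad/truncate neighbor name and pronunciation token sequences to the
# fixed sizes used in the experiments.  PAD=0 is an assumed pad id.
NAME_LEN, PRON_LEN, MAX_NEIGH, PAD = 20, 40, 30, 0

def pad(seq, length):
    """Right-pad with PAD, then truncate, to exactly `length` tokens."""
    return (seq + [PAD] * length)[:length]

def pack_neighbors(neighbors):
    """neighbors: list of (name_ids, pron_ids) pairs.  Returns a
    [MAX_NEIGH, NAME_LEN] and a [MAX_NEIGH, PRON_LEN] matrix, padding
    out entirely absent neighbors with all-PAD rows."""
    neighbors = neighbors[:MAX_NEIGH]
    names = [pad(n, NAME_LEN) for n, _ in neighbors]
    prons = [pad(p, PRON_LEN) for _, p in neighbors]
    while len(names) < MAX_NEIGH:
        names.append([PAD] * NAME_LEN)
        prons.append([PAD] * PRON_LEN)
    return names, prons

names, prons = pack_neighbors([([5, 6], [7, 8, 9])])
```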
4.1 Model architecture
Due to many recent successes in other NLP appli-
cations, we experiment with a transformer model
(Vaswani et al.,2017). Our transformer model
uses a standard encoder-decoder architecture as the
backbone. The inputs to the model are the input
name with unknown pronunciation 𝑥inp, the neigh-
bor names 𝑥name (of length name_len) and asso-
ciated pronunciations 𝑥pron (of length pron_len).
First, these input tokens are embedded with size
emb_size. The embeddings are shared between the feature names and the pronunciations; i.e., the same embeddings are used for the input
name tokens and the neighbor tokens, and sim-
ilarly between the target pronunciation (decoder
output) and the neighbors’ pronunciations:
emb_inp = Embed_name(x_inp),
emb_name = Embed_name(x_name),
emb_pron = Embed_pron(x_pron).
These embedded tokens are then processed separately by their respective encoders. No parameters are shared between these encoders, or with the decoder:

h_inp = Encoder_inp(emb_inp),
h_name = Encoder_name(emb_name),
h_pron = Encoder_pron(emb_pron).
Since each example has nneigh neighbors, h_inp is of shape [inp_len, emb_size], but the processed neighbor spelling and pronunciation outputs are of shape [nneigh, name_len, emb_size] and [nneigh, pron_len, emb_size].
One of the simplest ways to incorporate the
neighboring information is to concatenate the fea-
ture names and pronunciation embeddings into the
main input sequence, allowing the transformer to
attend directly to all the relevant information. Un-
fortunately, this is not possible with a vanilla trans-
former with a quadratic attention mechanism if we
want to attend to, say, 30 neighbors. In our experi-
ments name_len is set to 20 and pron_len is set to
40, yielding (20 + 40) × 30 = 1800 input tokens,
far too many for a vanilla transformer decoder to
attend to. To mitigate this, we average the encoder outputs to give a single vector per neighbor to attend to:
s_name = Ave(h_name),
s_pron = Ave(h_pron),
c = Concat(h_inp, s_name, s_pron).
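The shape bookkeeping of this averaging-and-concatenation step can be checked with a toy sketch; constant nested lists stand in for real encoder outputs, and the dimensions are the ones used in the experiments:

```python
# Toy check of the averaging/concatenation shapes.  Plain nested lists
# stand in for the encoder outputs h_inp, h_name, h_pron.
inp_len, name_len, pron_len, nneigh, emb_size = 12, 20, 40, 30, 8

h_inp = [[0.0] * emb_size for _ in range(inp_len)]                      # [inp_len, emb]
h_name = [[[1.0] * emb_size for _ in range(name_len)] for _ in range(nneigh)]
h_pron = [[[2.0] * emb_size for _ in range(pron_len)] for _ in range(nneigh)]

def ave(seq):
    """Mean over the token axis: [len, emb_size] -> [emb_size]."""
    n = len(seq)
    return [sum(tok[d] for tok in seq) / n for d in range(emb_size)]

s_name = [ave(n) for n in h_name]   # [nneigh, emb_size]
s_pron = [ave(p) for p in h_pron]   # [nneigh, emb_size]

# Concatenate along the sequence axis: the decoder attends over
# inp_len + 2 * nneigh vectors instead of 1800+ raw neighbor tokens.
c = h_inp + s_name + s_pron
assert len(c) == inp_len + 2 * nneigh  # 12 + 60 = 72 rows
```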
The vectors are concatenated along the neigh-
bor dimension to give a sequence of size
[inp_len+2*nneigh, emb_size]. Optionally, if
embeddings representing the latitudinal and lon-
gitudinal position of the feature (which we refer
to as Lat-Long embeddings, discussed later) are
used, then these are also concatenated here. This combined sequence is then attended over by the transformer decoder. There are no positional embeddings added
to this sequence, so they are unordered from the
point of view of decoder attention. Therefore, we
help the decoder match the neighbor names to their