
training data in one locale improves performance in another for rea-
sons orthogonal to linguistics.
(iii) Through ablation studies, we highlight the impact of modeling
choices, including multilingual pre-training and model capacity, on
SQuId's performance.
Additional Related Work. Automatic MOS prediction has a
long history in the TTS literature [24]. In addition to the systems
cited above, recent work includes [25–29], none of which tackle
multilinguality. Our work directly builds upon [12], which experiments
with a wide range of pre-trained models. Our methodology is similar,
but we scale up the data (from about 30K to 1.3M samples), the number
of locales (from 2 to 65), and the model size (from 300M to 600M
parameters), and we contribute a novel analysis of multilinguality.
Cross-lingual transfer and massively multilingual NLP have a
rich history in MT [15–17] and pre-trained models [18,30]. Authors
have studied cross-lingual transfer for at least a decade in speech
recognition [31–35] and TTS [22, 36, 37].
2. MULTILINGUAL MOS PREDICTION IN THE WILD
We wish to predict MOS Naturalness ratings for both human and
synthetic speech. Broadly speaking, MOS Naturalness describes
how human-like an utterance sounds. Our main resource is an in-
house corpus, aggregating approximately 1.9 million ratings in 66
locales across 2,092 research and commercial annotation projects
completed between January 2021 and March 2022. Most of the au-
dio is generated by TTS systems including both concatenative and
neural systems, and the annotators are asked to select a rating based
on how natural or unnatural the samples sound. The test sets used
for the evaluations are primarily focused on TTS applications such
as virtual assistant responses, driving directions, book passages, and
news articles; general text from web-crawled corpora is also used.
Sentences are typically rated in isolation (that is, outside of the con-
text in which they originally appeared), though entire paragraphs
are occasionally rated as well. Listening tests were conducted with
crowd-sourced raters on an internal ratings platform, using a 9-point
Likert scale with 0.5-point increments (i.e., ratings from 1 to 5 in
steps of 0.5). This variety in test
sets and TTS technologies means the stimuli contain a diverse set
of errors, including pronunciation errors, text normalization errors,
unnatural prosody, and acoustic artifacts such as discontinuities and
signal-correlated noise.
We apply two splits to the dataset. First, we split the data by
time. We use the ratings collected between January 1st and Decem-
ber 1st 2021 for training and development, and the rest for test. The
motivation is to simulate a realistic use case, whereby a TTS engi-
neer would want to use the past annotations to predict future ones.
The second split is based on region: we hold out 24 locales for which
we have exceptionally few ratings (fewer than 8,000 each, adding up
to about 5% of the data) and use them for test. The rationale is that
such small tasks yield little improvement during training but are
particularly useful for analysis. Since they have no training data, we
refer to these locales as zero-shot locales, as opposed to fine-tuned locales.
The dataset is skewed towards US English (18%), followed by UK
English (12%), and ES Spanish (4.2%). To build the development
set, we sample 2.5% of the training set without replacement. Table 1
provides additional statistics.
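To make the two splits concrete, the sketch below shows one way they could be implemented, assuming the ratings live in a pandas dataframe; the column names ("timestamp", "locale"), the cutoff handling, and the zero-shot locale list are illustrative assumptions, not the exact pipeline used in this work.

```python
import pandas as pd

def split_ratings(df: pd.DataFrame, zero_shot_locales: set, dev_frac: float = 0.025, seed: int = 0):
    # Zero-shot locales are held out entirely for test.
    zero_shot_test = df[df["locale"].isin(zero_shot_locales)]
    remaining = df[~df["locale"].isin(zero_shot_locales)]

    # Time-based split: the data spans Jan. 2021 to Mar. 2022; ratings
    # collected before Dec. 1st, 2021 go to training/dev, the rest to test.
    cutoff = pd.Timestamp("2021-12-01")
    train_dev = remaining[remaining["timestamp"] < cutoff]
    time_test = remaining[remaining["timestamp"] >= cutoff]

    # Development set: 2.5% of the training pool, sampled without replacement.
    dev = train_dev.sample(frac=dev_frac, replace=False, random_state=seed)
    train = train_dev.drop(dev.index)

    test = pd.concat([time_test, zero_shot_test])
    return train, dev, test
```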
Challenges and Caveats. Due to the nature of the splits, we do not
expect the data to be i.i.d.: the TTS systems and annotators used
to produce and rate the utterances in February 2021 are usually not
those of January 2022. The number, listening conditions, and quality
of the annotations also vary across projects and locales, as do the
input texts chosen to test the systems. Furthermore, it is generally
understood that the term naturalness is underspecified and may be
interpreted differently by different raters [38, 39]. In short, these
conditions reflect TTS evaluation "in the wild".

Table 1: MOS Prediction dataset statistics.
Num. training utt. / systems / locales: 969,589 / 1,476 / 42
Num. dev utt. / systems / locales: 34,042 / 1,474 / 42
Num. test utt. / systems / locales: 381,323 / 605 / 65
Avg. utterance duration (seconds) / num. ratings: 4.5 / 1.4
3. THE SQUID MODEL
The most important design decision behind our study is to fine-tune
a single model on all locales rather than keeping separate models.
This offers convenience, since we have one model to maintain rather
than 65. More importantly, we assume that if the model has enough
capacity, positive transfer will emerge between the locales [17].
SQuId is based on mSLAM [13], a recently published multi-
modal pre-trained model trained on unlabelled speech (429K hours
in 51 languages), text (15TB in 101 languages), and speech-text pairs
(2.3K hours). We chose this model because it produced state-of-the-
art results in many languages at the time of writing. It is based on the
Conformer architecture, with 600M parameters by default. SQuId’s
input is the spectrogram of a 16 kHz utterance, along with an optional
locale tag. The output is a scalar. We fine-tune the model end-to-end
with a simple regression loss. After optionally resampling the audio to
16 kHz, we compute an 80-dimensional log Mel spectrogram and extract
mSLAM embeddings $e^i_1, \ldots, e^i_T$ for each time step. We mean-pool
across the time dimension, returning an embedding $e^i_*$, and apply a
fully connected layer to obtain the prediction $\hat{y}_i = M \cdot e^i_* + b$.
By default we use $T = 3{,}200$ time steps, and the embeddings have
dimension 1,024. The target MOS ratings are linearly rescaled from
[1, 5] to [0, 1]. By default we use batch size 32 and learning rate $10^{-5}$,
obtained with hyper-parameter search during a preliminary set of ex-
periments. We train the models for 100k steps, save a snapshot every
10k steps, and export the snapshot that performs best on our
development set. We run experiments on Cloud TPU v3, using the
Adam optimizer with 1,500 warmup steps.
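For illustration, the following minimal numpy sketch reproduces the regression head described above: mean-pooling the encoder embeddings over time, a single fully connected projection to a scalar, and a squared-error loss against MOS targets rescaled from [1, 5] to [0, 1]. The mSLAM encoder itself is omitted, and the weight initialization and the exact form of the regression loss are assumptions made for the sketch.

```python
import numpy as np

EMBED_DIM = 1024  # mSLAM embedding size

class RegressionHead:
    """Fully connected head mapping pooled embeddings to a scalar prediction."""
    def __init__(self, rng=np.random.default_rng(0)):
        self.M = rng.normal(scale=0.02, size=(EMBED_DIM,))  # projection weights
        self.b = 0.0                                         # bias

    def predict(self, embeddings: np.ndarray) -> float:
        # embeddings: (T, EMBED_DIM) array of per-time-step encoder outputs.
        e_star = embeddings.mean(axis=0)        # mean-pool across time
        return float(self.M @ e_star + self.b)  # scalar prediction in [0, 1]

def rescale_mos(y: float) -> float:
    """Linearly rescale a MOS rating from [1, 5] to [0, 1]."""
    return (y - 1.0) / 4.0

def regression_loss(y_hat: float, y: float) -> float:
    """Squared-error regression loss on the rescaled rating (an assumption)."""
    return (y_hat - rescale_mos(y)) ** 2
```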
Additionally, two optimizations lead to slight but consistent per-
formance improvements. (i) We embed the locale tag of each utterance
into a 64-dimensional vector $e_{\ell_i}$ and concatenate it to $e^i_*$,
forming the vector $[e^i_*, e_{\ell_i}]$. For 5% of the data, we use a
wildcard identifier ANY-LOC, which we use for inference on locales unseen
during training. (ii) We sample the data with temperature to rebalance the
relative proportions of the training locales. As described in [17], we
resample each locale $\ell$ with probability $p_\ell^{1/\tau}$. We use
$\tau = 10$, obtained by hyper-parameter search on the development set.
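As an illustration of these two optimizations, the sketch below computes the temperature-based resampling probabilities from raw per-locale counts and shows the locale-embedding concatenation with the ANY-LOC fallback; the function names, the lookup-table representation, and the example counts are hypothetical.

```python
import numpy as np

def temperature_sampling_probs(locale_counts: dict, tau: float = 10.0) -> dict:
    """Resample each locale l with probability proportional to p_l ** (1 / tau),
    where p_l is the locale's empirical share of the training data [17]."""
    total = sum(locale_counts.values())
    unnorm = {loc: (n / total) ** (1.0 / tau) for loc, n in locale_counts.items()}
    z = sum(unnorm.values())
    return {loc: w / z for loc, w in unnorm.items()}

def concat_locale_embedding(e_star: np.ndarray, locale_table: dict, locale: str) -> np.ndarray:
    """Concatenate the pooled utterance embedding with a 64-dim locale embedding.
    Locales unseen during training fall back to the wildcard ANY-LOC entry."""
    e_loc = locale_table.get(locale, locale_table["ANY-LOC"])
    return np.concatenate([e_star, e_loc])

# Illustrative counts (not from the paper): with tau = 10 the sampling
# distribution is close to uniform, down-weighting high-resource locales.
probs = temperature_sampling_probs({"en-US": 180_000, "en-GB": 120_000, "th-TH": 5_000})
```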
4. PERFORMANCE
Let us now present SQuId’s overall performance. To validate our ap-
proach, we first ensure that it performs well on VoiceMOS’22, cur-
rently the main benchmark for MOS prediction. We then scale up
the test set and analyze its performance on 65 locales. Throughout
the section, we will compare SQuId to Big-SSL-MOS, a competitive
baseline in the spirit of SSL-MOS [12]. We fine-tune w2v-BERT
on the main VoiceMOS dataset with a regression objective, using a
600M-parameter version of the pre-trained model to ensure a fair
comparison ([12] uses up to 317M). The model comes in two variants:
English-only and multilingual. The architecture and dataset of
our w2v-BERT implementation are described in detail in [13].