SQUID: MEASURING SPEECH NATURALNESS IN MANY LANGUAGES
Thibault Sellam, Ankur Bapna, Joshua Camp, Diana Mackinnon, Ankur P. Parikh, Jason Riesa
Google
ABSTRACT
Much of text-to-speech research relies on human evaluation. This incurs heavy costs and slows down the development process, especially in heavily multilingual applications where recruiting and polling annotators can take weeks. We introduce SQuId (Speech Quality Identification), a multilingual naturalness prediction model trained on over a million ratings and tested in 65 locales—the largest effort of this type to date. The main insight is that training one model on many locales consistently surpasses mono-locale baselines. We show that the model outperforms a competitive baseline based on w2v-BERT and VoiceMOS by 50.0%. We then demonstrate the effectiveness of cross-locale transfer during fine-tuning and highlight its effect on zero-shot locales, for which there is no fine-tuning data. We highlight the role of non-linguistic effects such as sound artifacts in cross-locale transfer. Finally, we present the effect of model size and pre-training diversity with ablation experiments.
1. INTRODUCTION
Evaluation is a major bottleneck for speech synthesis tasks like text-to-speech (TTS) and speech-to-speech translation. In principle, there are infinitely many spoken renditions of a piece of text, and there is no universally agreed upon definition of what makes an utterance “correct”. Thus, researchers rely heavily on human evaluation, more specifically listening tests, during their day-to-day development cycles. The most popular type of listening test is MOS (Mean Opinion Score), during which several annotators listen to audio segments and rate them on a Likert scale between 1 and 5 (examples of foundational studies that use it include Tacotron [1], Parallel WaveNet [2], or FastSpeech 2 [3]). Listening tests can produce reliable results [4], since humans usually excel at detecting speech quality, and the scheme can be adapted to the needs of every specific task. But they are also impractical and expensive: recruiting and polling annotators increases the cost of running experiments, slows down model research, and makes it impossible to compare results across time and institutions. The problem gets exacerbated in the multilingual setup: it may be challenging for researchers to find speakers of languages that are neither spoken by many, nor geographically close to them. Ultimately, this hinders their progress, skews the literature towards high-resource languages, and prevents them from engaging in heavily multilingual research. This comes in contrast to text-based generation tasks, such as Machine Translation, for which the research community has long adopted automatic metrics (such as BLEU [5] or more recently COMET [6] and BLEURT [7]) as a complement to human assessment.
To address these issues, there has been a growing interest in developing automatic metrics for speech synthesis, spearheaded by systems such as AutoMOS [8], MOSNet [9], or LDNet [10]. The idea is to cast quality evaluation as a regression or classification problem: these systems predict a quality score from an utterance, using past listening tests as a source of training data. The task is difficult because the target domain is complex, even in a monolingual setup: synthesis artifacts come in many forms and can affect all levels of speech production, including pronunciation, prosody, voice, and audio quality. And the task is getting harder over time [11]: as systems progress, the focus has shifted from obvious artifacts (e.g., robotic voices) to more subtle errors, such as inappropriate prosody or mispronunciations. Yet, the same problem that motivates the task plagues its solution—data is expensive to collect, especially outside high-resource languages, and so existing studies tend to use relatively limited training and testing sets. Early MOS predictors have been shown to be brittle when used out of domain [12], and there are few studies outside English and Chinese.
In this paper, we study MOS prediction at scale, and in a massively multilingual setup. We introduce SQuId (Speech Quality Identifier), a speech quality detector based on mSLAM (multilingual Speech and LAnguage Model), a recently published pre-trained model [13]. SQuId is trained on over a million ratings, an order of magnitude more than most recent studies [9]. More importantly, it is, to the best of our knowledge, the first massively multilingual model for MOS prediction: we trained the model on 42 locales¹ and tested it on 65. For comparison, VoiceMOS, the most comprehensive benchmark to date, covers two locales only [14]. We describe our dataset and our model, and show that SQuId outperforms a competitive baseline based on SSL and VoiceMOS by up to 50.0%. Most improvements come from the additional supervised multilingual data, complemented by minor optimizations that target the multilingual case. We then conduct several studies to highlight the factors that contribute to MOS prediction quality in this massively multilingual, in-the-wild setup. Key insights from our work include:
(i) Training one model on a diverse dataset consisting of data from many locales consistently outperforms the monolingual approach as a result of cross-locale transfer, an effect well known in the NLP [15–18], ASR [19–21], and TTS literature [22, 23]. The most spectacular manifestation of this phenomenon is the model's strong performance on zero-shot locales, where there is no labelled MOS prediction data. Cross-locale transfer allows us to increase the model's language coverage dramatically and has significant implications for the evaluation of multilingual speech synthesis.
(ii) We conduct analyses to understand the nature and mechanism of cross-locale transfer for MOS prediction. We demonstrate that locale diversity has a large influence on the model's performance during fine-tuning, but transfer is driven less by language similarity and more by the presence of language-agnostic phenomena (possibly including diversity of audio quality, voices, and TTS systems) in the dataset. We highlight the role of para-lingual transfer, by which training data in one locale improves performance in another for reasons orthogonal to linguistics.
(iii) Through ablation studies, we highlight the impact of modeling choices, including multilingual pre-training and model capacity, on SQuId's performance.

¹ Compared to languages, locales take regional variation into account. For instance, English is covered by five locales in our dataset: US English, UK English, Indian English, Nigerian English, and Australian English. Each of these variants should be rendered differently by a TTS engine.
Additional Related Work. Automatic MOS prediction has a long history in the TTS literature [24]. In addition to the systems cited above, recent work includes [25–29], none of which tackle multilinguality. Our work directly builds upon [12], which experiments with a wide range of pre-trained models. Our methodology is similar, but we scale up the data (from about 30K to 1.3M samples), the number of locales (from 2 to 65), and the model size (from 300M to 600M parameters), and we contribute a novel analysis of multilinguality. Cross-lingual transfer and massively multilingual NLP have a rich history in MT [15–17] and pre-trained models [18, 30]. Authors have studied cross-lingual transfer for at least a decade in speech recognition [31–35] and TTS [22, 36, 37].
2. MULTILINGUAL MOS PREDICTION IN THE WILD
We wish to predict MOS Naturalness ratings for both human and synthetic speech. Broadly speaking, MOS Naturalness describes how human-like an utterance sounds. Our main resource is an in-house corpus, aggregating approximately 1.9 million ratings in 66 locales across 2,092 research and commercial annotation projects completed between January 2021 and March 2022. Most of the audio is generated by TTS systems, including both concatenative and neural systems, and the annotators are asked to select a rating based on how natural or unnatural the samples sound. The test sets used for the evaluations are primarily focused on TTS applications such as virtual assistant responses, driving directions, book passages, and news articles; general text from web-crawled corpora is also used. Sentences are typically rated in isolation (that is, outside of the context in which they originally appeared), though entire paragraphs are occasionally rated as well. Listening tests were conducted with crowd-sourced raters on an internal ratings platform, using a 9-point Likert scale with 0.5-point increments. This variety in test sets and TTS technologies means the stimuli contain a diverse set of errors, including pronunciation errors, text normalization errors, unnatural prosody, and acoustic artifacts such as discontinuities and signal-correlated noise.
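To make the rating scheme concrete, the following minimal Python sketch shows how individual Likert ratings of this kind could be aggregated into per-utterance MOS targets; the utterance IDs and scores are hypothetical and not drawn from the corpus.

```python
import statistics
from collections import defaultdict

# Hypothetical (utterance_id, rating) pairs on the 9-point Likert scale
# described above: 1.0 to 5.0 in 0.5-point increments.
raw_ratings = [
    ("utt_001", 4.0), ("utt_001", 4.5), ("utt_001", 3.5),
    ("utt_002", 2.0), ("utt_002", 2.5),
]

# Group the ratings by utterance and average them; the resulting MOS is
# the target that a naturalness predictor regresses against.
by_utterance = defaultdict(list)
for utt_id, rating in raw_ratings:
    by_utterance[utt_id].append(rating)

mos = {utt_id: statistics.mean(scores) for utt_id, scores in by_utterance.items()}
print(mos)  # {'utt_001': 4.0, 'utt_002': 2.25}
```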
We apply two splits to the dataset. First, we split the data by time. We use the ratings collected between January 1st and December 1st, 2021 for training and development, and the rest for test. The motivation is to simulate a realistic use case, whereby a TTS engineer would want to use past annotations to predict future ones. The second split is based on region: we hold out 24 locales for which we have exceptionally few ratings (less than 8,000 each, adding up to about 5% of the data) and use them for test. The rationale is that small tasks yield little improvement during training but are particularly useful for analysis. Since there is no training data for them, we refer to these locales as zero-shot locales, as opposed to fine-tuned locales. The dataset is skewed towards US English (18%), followed by UK English (12%) and ES Spanish (4.2%). To build the development set, we sample 2.5% of the training set without replacement. Table 1 provides additional statistics.
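The sketch below illustrates the two splits under stated assumptions: the table schema, column names, locale codes, and cutoff handling are illustrative only, not the actual pipeline.

```python
import pandas as pd

# Hypothetical ratings table; column names and locale codes are illustrative.
df = pd.DataFrame({
    "utterance_id": ["u1", "u2", "u3", "u4"],
    "locale": ["en-US", "en-GB", "yo-NG", "en-US"],
    "rating": [4.0, 3.5, 4.5, 2.0],
    "date": pd.to_datetime(["2021-03-01", "2021-11-15", "2021-06-20", "2022-02-01"]),
})

# Locales with exceptionally few ratings are held out entirely ("zero-shot").
zero_shot_locales = {"yo-NG"}  # illustrative; the paper holds out 24 such locales

is_zero_shot = df["locale"].isin(zero_shot_locales)
in_train_window = (df["date"] >= "2021-01-01") & (df["date"] < "2021-12-01")

# Time split for the fine-tuned locales; everything else goes to test.
train_dev = df[in_train_window & ~is_zero_shot]  # later sampled into train (97.5%) / dev (2.5%)
test = df[~in_train_window | is_zero_shot]       # future ratings plus all zero-shot locales
```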
Challenges and Caveats. Due to the nature of the splits, we do not expect the data to be i.i.d.—the TTS systems and annotators used to produce and rate the utterances in February 2021 are usually not those of January 2022. The number, listening conditions, and quality of the annotations also vary across projects and locales, as do the input texts chosen to test the systems. Furthermore, it is generally understood that the term naturalness is underspecified and may be interpreted differently by different raters [38, 39]. In short, these conditions reflect TTS evaluation “in the wild”.

Table 1: MOS Prediction dataset statistics.
Num. training utt. / systems / locales: 969,589 / 1,476 / 42
Num. dev utt. / systems / locales: 34,042 / 1,474 / 42
Num. test utt. / systems / locales: 381,323 / 605 / 65
Avg. utterance duration (seconds) / avg. num. ratings per utterance: 4.5s / 1.4
3. THE SQUID MODEL
The most important design decision behind our study is to fine-tune a single model on all locales rather than keeping separate models. This offers convenience, since we have one model to maintain rather than 65. More importantly, we assume that if the model has enough capacity, positive transfer will emerge between the locales [17].

SQuId is based on mSLAM [13], a recently published multimodal pre-trained model trained on unlabelled speech (429K hours in 51 languages), text (15TB in 101 languages), and speech-text pairs (2.3K hours). We chose this model because it produced state-of-the-art results in many languages at the time of writing. It is based on the Conformer architecture, with 600M parameters by default. SQuId's input is the spectrogram of a 16 kHz utterance, along with an optional locale tag. The output is a scalar. We fine-tune the model end-to-end with a simple regression loss. After optional resampling to 16 kHz, we compute an 80-dimensional log Mel spectrogram and extract mSLAM embeddings $e^i_1, \ldots, e^i_T$, one per time step. We mean-pool across the time dimension, returning an embedding $\bar{e}^i$, and apply a fully connected layer to obtain the prediction $\hat{y}^i = M \bar{e}^i + b$. By default we use $T = 3{,}200$ time steps, and the embeddings have dimension 1,024. The target MOS ratings are linearly rescaled from [1, 5] to [0, 1]. By default we use batch size 32 and learning rate $10^{-5}$, obtained with hyper-parameter search during a preliminary set of experiments. We train the models for 100k steps, save a snapshot every 10k steps, and export the snapshot that yields the best performance on our development set. We run experiments on Cloud TPU v3, using the Adam optimizer with 1,500 warmup steps.
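As a rough illustration of the regression head described above, here is a minimal numpy sketch; the frame embeddings are random stand-ins for mSLAM outputs, and the squared-error loss is one plausible instantiation of the simple regression loss mentioned in the text, not necessarily the exact objective used in our experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-ins for the mSLAM frame embeddings e^i_1 ... e^i_T
# (T = 3,200 time steps, 1,024 dimensions, as in the paper).
T, D = 3200, 1024
frame_embeddings = rng.normal(size=(T, D)).astype(np.float32)

# Regression head: mean-pool over time, then a fully connected layer.
M = rng.normal(scale=0.01, size=(D,)).astype(np.float32)  # weights (untrained here)
b = 0.5                                                    # bias

pooled = frame_embeddings.mean(axis=0)  # \bar{e}^i
y_hat = float(M @ pooled + b)           # scalar prediction

# Targets: MOS ratings linearly rescaled from [1, 5] to [0, 1].
def rescale_mos(mos: float) -> float:
    return (mos - 1.0) / 4.0

target = rescale_mos(4.0)
loss = (y_hat - target) ** 2  # squared-error regression loss (one plausible choice)
print(y_hat, loss)
```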
Additionally, two optimizations lead to slight but consistent performance improvements. (i) We embed the locale tag of each utterance into a 64-dimensional vector $e^i_{\mathrm{loc}}$ and concatenate it to $\bar{e}^i$, forming the vector $[\bar{e}^i, e^i_{\mathrm{loc}}]$. For 5% of the data, we use a wildcard identifier ANY-LOC, which we use for inference on locales unseen during training. (ii) We sample the data with temperature to rebalance the relative proportion of the training locales. As described in [17], we resample each locale with probability proportional to $p_l^{1/\tau}$, where $p_l$ is the locale's proportion of the training data. We use $\tau = 10$, obtained by hyper-parameter search on the development set.
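Below is a minimal sketch of temperature-based resampling, assuming the standard formulation of [17]; the per-locale rating counts are hypothetical (the real training set covers 42 locales).

```python
import numpy as np

# Hypothetical per-locale rating counts.
counts = {"en-US": 180_000, "en-GB": 120_000, "es-ES": 42_000, "da-DK": 3_000}
tau = 10.0  # the temperature selected by hyper-parameter search in the paper

p = np.array(list(counts.values()), dtype=np.float64)
p /= p.sum()               # original locale proportions p_l
q = p ** (1.0 / tau)
q /= q.sum()               # temperature-adjusted sampling probabilities

# A high temperature flattens the distribution, upsampling small locales.
for locale, original, resampled in zip(counts, p, q):
    print(f"{locale}: {original:.3f} -> {resampled:.3f}")
```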
4. PERFORMANCE
Let us now present SQuId's overall performance. To validate our approach, we first ensure that it performs well on VoiceMOS'22, currently the main benchmark for MOS prediction. We then scale up the test set and analyze its performance on 65 locales. Throughout the section, we compare SQuId to Big-SSL-MOS, a competitive baseline in the spirit of SSL-MOS [12]. We fine-tune w2v-BERT on the main VoiceMOS dataset with a regression objective, using a 600M-parameter version of the pre-trained model to ensure a fair comparison ([12] uses up to 317M). The baseline comes in two variants: English-only and multilingual. The architecture and dataset of our w2v-BERT implementation are described in detail in [13].