

CAN WE USE COMMON VOICE TO TRAIN A MULTI-SPEAKER TTS SYSTEM?

Sewade Ogun, Vincent Colotte, Emmanuel Vincent

Université de Lorraine, CNRS, Inria, LORIA, F-54000 Nancy, France

sewade.ogun@inria.fr

ABSTRACT

Training of multi-speaker text-to-speech (TTS) systems relies

on curated datasets based on high-quality recordings or au-

diobooks. Such datasets often lack speaker diversity and are

expensive to collect. As an alternative, recent studies have

leveraged the availability of large, crowdsourced automatic

speech recognition (ASR) datasets. A major problem with

such datasets is the presence of noisy and/or distorted sam-

ples, which degrade TTS quality. In this paper, we propose

to automatically select high-quality training samples using

a non-intrusive mean opinion score (MOS) estimator, WV-

MOS. We show the viability of this approach for training a

multi-speaker GlowTTS model on the Common Voice En-

glish dataset. Our approach improves the overall quality of

generated utterances by 1.26 MOS points with respect to training

on all the samples and by 0.35 MOS points with respect to

training on the LibriTTS dataset. This opens the door to au-

tomatic TTS dataset curation for a wider range of languages.

Index Terms—Multi-speaker text-to-speech, Common

Voice, crowdsourced corpus, non-intrusive quality estimation

1. INTRODUCTION

Research on text-to-speech (TTS) is increasingly focusing on

multi-speaker TTS as it is more challenging and often re-

quires explicit modelling of speaker characteristics. This in-

terest has helped improve the performance of multi-speaker

TTS in terms of prosody [1], expressiveness [2], new speaker

generation [3], zero-shot training [4], and synthetic data gen-

eration for downstream tasks like automatic speech recogni-

tion (ASR) [5], among others. Depending on the application,

different characteristics of speech need to be modeled. For ex-

ample, for synthetic ASR training data generation, it is neces-

sary for the model to have seen diverse speakers and accents.

Datasets currently used for TTS system training fall into

two categories, namely studio-quality TTS datasets such as

VCTK [6], and TTS datasets curated from audiobooks such as

LibriTTS [7]. However, the VCTK dataset includes only 110

speakers, and LibriTTS has a concentration of US English

accents, which are not representative of the entire spectrum of

speakers and accents. On top of that, the collection of datasets

such as VCTK may be too expensive for some languages.

Large, crowdsourced ASR datasets are good candidates

for driving TTS research in future directions, as they inher-

ently exhibit the larger speaker variability (in terms of accent,

speaking style, speaking rate, etc.) required for TTS systems

to model diverse speakers. However, problems such as noise,

low bandwidth, mispronunciation, variation in recording con-

ditions, etc., hinder their usability for TTS training.

In this paper, we focus on automatically selecting high-

quality training samples from a crowdsourced dataset, using

Common Voice English [8] as an example. In this context,

quality cannot be estimated via subjective listening tests, which

are intractable with 1.4 M utterances, or objective metrics like

PESQ [9] that require a reference signal. Instead, we leverage

the increasing accuracy of deep learning based, non-intrusive

quality estimators. Specifically, we use a self-supervised

model fine-tuned for mean opinion score (MOS) estimation,

WV-MOS [10], and select the speakers whose average WV-

MOS score across all utterances is above a threshold. We

evaluate the intelligibility, audio quality and speaker similar-

ity of the utterances generated by a multi-speaker GlowTTS

model trained on the resulting dataset, and also briefly explore

the other factors not captured by WV-MOS.

Section 2 describes related works on TTS dataset creation,

TTS training on noisy speech, and MOS estimation. Section 3

describes the Common Voice dataset, its properties and limi-

tations. Sections 4 and 5 describe our method and the experi-

ments performed to validate it. We conclude in Section 6.

2. RELATED WORK

Several multi-speaker datasets have been collected in recent

years for TTS applications [6, 7]. To create these corpora, re-

searchers either record utterances in semi-anechoic chambers

for good signal quality, or utilise various methods to select

utterances from audiobooks, as this is less cumbersome. For

example, the LibriTTS dataset was derived from the popular

LibriSpeech ASR dataset [11] by trimming silences and filtering

out utterances with low estimated signal-to-noise ratio

(SNR). Although this filtering step is not perfect and lets

a few noisy samples through, the resulting dataset

is believed to be good enough for TTS since the original LibriSpeech

is of higher quality than Common Voice on average.

arXiv:2210.06370v1 [eess.AS] 12 Oct 2022

A few works have used other metrics such as the word error

rate or the Mel cepstral distortion to automatically select

good training utterances from noisy speech datasets [12–14];

however, they have only been demonstrated on small datasets

so far. Research on multi-speaker TTS training using noisy

speech has also focused on directly modelling the noise in

order to factor it out during inference [15, 16], and on encod-

ing all the environmental characteristics of speech for novel

speech generation in different conditions [17].

Recently, several methods have been proposed to automatically

measure the quality of speech utterances [18, 19], but

they do not always generalise well outside of the training cor-

pus. Self-supervised pretrained models followed by a shallow

MOS regression head result in higher correlation with human

evaluators’ scores [20] than previous architectures. They also

generalise better to unseen speakers and utterances, and can

be used to evaluate the performance of speech processing sys-

tems for a variety of tasks [10].

3. THE COMMON VOICE DATASET

Common Voice [8] is a crowdsourced, Creative Commons

Zero licensed, read speech dataset currently available in over

93 languages. It contains recordings from volunteers who

read a text transcript sourced from public domain text. Each

utterance is up-voted or down-voted by volunteers according

to a list of criteria.¹ These criteria are not very restrictive, e.g.,

various kinds of background noises are allowed. Utterances

with more than two up-votes are marked as validated. The

validated utterances are then split into train, development and

test sets, with non-overlapping speakers and sentences.

3.1. Analysing Common Voice Dataset Quality for TTS

Although the validated set has been widely exploited for ASR

[21], we observe some undesirable properties for TTS:

• Noise: Speech quality may be degraded by electromag-

netic noise or acoustic noise such as mouse clicks, low

frequency noise, background speakers and background

music, among others. Since the utterances are stored as

mp3, quantization noise can also sometimes be heard.

• Low bandwidth: Due to recording choices or high compression,

some audio files are low-pass filtered, with a

cutoff frequency that varies from one file to another.

• Mispronunciation: We observe mispronunciations of

“unfamiliar” words, variations in the pronunciation of

certain other words, and some utterances in other lan-

guages (e.g., German utterances in the English corpus).

• Unavailable speaker metadata: Age, gender and accent

information are not available for all speakers, while

some TTS systems require this information as input.

• Other factors include variable recording characteristics

(microphone, room, recording device), speaking rate,

and volume. These recording characteristics must be

ignored by models, and the speaking rate and volume,

while being inherent characteristics of the speaker, can

enlarge the space of variables to be considered.

¹ https://commonvoice.mozilla.org/en/criteria

These characteristics are generally not a hindrance for ASR

training, and they can even be desirable for robustness. However,

this is not the case for TTS training [7, 12].

3.2. Dataset Preparation

In the following, we use the English subset of Common Voice

(version 7.0). We exclude the predefined development and

test sets, and utterances longer than 16.7 s to allow large batch

sizes. We consider all other utterances in the 2015 h vali-

dated set as candidate TTS training samples. The samples are

preprocessed by resampling from 32 or 48 kHz to 16 kHz,

and removing leading and trailing silences using pydub² with

a threshold of -50 dBFS. The range of speaker duration is also

limited to between 20 min and 10 h by randomly selecting a

10 h subset of utterances for speakers with longer duration

and discarding speakers with less than 20 min total duration.
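The duration constraints described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the `(speaker_id, utterance_id, duration_seconds)` tuple layout and the fixed random seed are hypothetical, and the paper does not say whether a speaker's total duration is measured before or after dropping utterances longer than 16.7 s (the sketch measures it after).

```python
import random
from collections import defaultdict

MAX_UTTERANCE_SECONDS = 16.7      # longer utterances are excluded
MIN_SPEAKER_SECONDS = 20 * 60     # speakers below 20 min are discarded
MAX_SPEAKER_SECONDS = 10 * 3600   # speakers above 10 h are randomly subsampled


def limit_speaker_duration(utterances, seed=0):
    """Apply the per-utterance and per-speaker duration limits.

    `utterances` is an iterable of (speaker_id, utterance_id, duration_seconds)
    tuples; this layout is an assumption made for the example.
    Returns a dict mapping speaker_id to a list of (utterance_id, duration).
    """
    rng = random.Random(seed)
    by_speaker = defaultdict(list)
    for speaker_id, utt_id, duration in utterances:
        if duration <= MAX_UTTERANCE_SECONDS:
            by_speaker[speaker_id].append((utt_id, duration))

    selected = {}
    for speaker_id, utts in by_speaker.items():
        total = sum(duration for _, duration in utts)
        if total < MIN_SPEAKER_SECONDS:
            continue  # too little data for this speaker
        if total > MAX_SPEAKER_SECONDS:
            # Random subset: shuffle, then keep utterances while they fit in 10 h.
            rng.shuffle(utts)
            kept, running = [], 0.0
            for utt_id, duration in utts:
                if running + duration <= MAX_SPEAKER_SECONDS:
                    kept.append((utt_id, duration))
                    running += duration
            utts = kept
        selected[speaker_id] = utts
    return selected
```

Shuffling with a seeded generator keeps the 10 h subset random but reproducible across runs, which is a design choice of this sketch rather than something stated in the paper.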

Furthermore, the training utterances are denoised using

the pretrained DPTNet model of Asteroid [22]. We run sep-

arate experiments for the original and denoised utterances to

evaluate the impact of denoising on the resulting TTS model.

4. METHODOLOGY

We filter the dataset by only selecting speakers with high automatically

estimated MOS scores. We believe that utterances

from these speakers are of high quality and devoid of noise

and missing frequency bands. To ascertain this, we train different

TTS models on the same dataset filtered at different

estimated MOS thresholds.
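As a rough illustration of this filtering step, the sketch below builds one speaker subset per threshold from a mapping of speaker-level scores. The function names and the dictionary layout are hypothetical; the default threshold values are the ones listed in Section 4.1, and treating "above a threshold" as a strict inequality is an assumption.

```python
def select_speakers(speaker_scores, threshold):
    """Return the set of speakers whose score is strictly above `threshold`.

    `speaker_scores` maps speaker_id -> speaker-level estimated MOS
    (hypothetical layout for this example).
    """
    return {spk for spk, score in speaker_scores.items() if score > threshold}


def build_subsets(speaker_scores, thresholds=(4.0, 3.8, 3.5, 3.0, 2.0)):
    """Build one speaker subset per threshold, one per TTS model to train."""
    return {t: select_speakers(speaker_scores, t) for t in thresholds}
```

Because every subset is derived from the same scores, the subsets are nested: lowering the threshold can only add speakers, so differences between the trained models can be attributed to the added lower-scoring speakers.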

4.1. MOS Estimation

MOS estimation is performed using WV-MOS [10], a pretrained

MOS estimation model.³ The model combines a pretrained

wav2vec2.0 feature extractor and a 2-layer multi-layer

perceptron (MLP) head, which are jointly fine-tuned on the

subjective evaluation scores of the Voice Conversion Chal-

lenge 2018 using a mean squared error loss. It was shown to

correlate well with human quality judgment regarding noise

and low bandwidth [10, App. C].
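For concreteness, the fine-tuning objective can be written as follows. The notation is an assumption introduced here, not taken from the paper: $f_\theta$ denotes the wav2vec2.0 feature extractor followed by the MLP head with jointly trained parameters $\theta$, $x_i$ is the $i$-th training waveform, $y_i$ its subjective score, and $N$ the number of training utterances. Whether $y_i$ is an individual rating or an averaged score is not specified in the text above.

```latex
\mathcal{L}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \bigl( f_\theta(x_i) - y_i \bigr)^2
```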

Every speaker is assigned a single, speaker-level WV-

MOS score by averaging the estimated utterance-level scores.

We assume that recording and environmental conditions for

each speaker remain relatively constant.
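The speaker-level averaging can be sketched as below. The sketch assumes that utterance-level WV-MOS scores have already been computed and are available as `(speaker_id, score)` pairs; the function name and this layout are hypothetical, and the call to the WV-MOS model itself is deliberately left out.

```python
from collections import defaultdict
from statistics import mean


def speaker_level_scores(utterance_scores):
    """Average utterance-level scores into one score per speaker.

    `utterance_scores` is an iterable of (speaker_id, estimated_mos) pairs
    (hypothetical layout; the scores are assumed to be precomputed).
    """
    per_speaker = defaultdict(list)
    for speaker_id, score in utterance_scores:
        per_speaker[speaker_id].append(score)
    return {spk: mean(scores) for spk, scores in per_speaker.items()}
```

Averaging over all utterances of a speaker smooths out errors on individual utterances, which is reasonable only under the assumption stated above that each speaker's recording conditions stay roughly constant.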

We select all utterances from those speakers whose

speaker-level WV-MOS score is above a threshold of 4.0,

3.8, 3.5, 3.0, or 2.0, and compare the resulting TTS systems

² https://github.com/jiaaro/pydub

³ https://github.com/AndreevP/WV-MOS
