
ror rate or the Mel cepstral distortion to automatically select
good training utterances from noisy speech datasets [12–14];
however, they have only been demonstrated on small datasets
so far. Research on multi-speaker TTS training using noisy
speech has also focused on directly modelling the noise in
order to factor it out during inference [15, 16], and on encod-
ing all the environmental characteristics of speech for novel
speech generation in different conditions [17].
Recently, several methods have been proposed to automatically
measure the quality of speech utterances [18, 19], but
they do not always generalise well outside of the training cor-
pus. Self-supervised pretrained models followed by a shallow
MOS regression head result in higher correlation with human
evaluators’ scores [20] than previous architectures. They also
generalise better to unseen speakers and utterances, and can
be used to evaluate the performance of speech processing sys-
tems for a variety of tasks [10].
3. THE COMMON VOICE DATASET
Common Voice [8] is a crowdsourced, Creative Commons
Zero licensed, read speech dataset currently available in over
93 languages. It contains recordings from volunteers who
read a text transcript sourced from public domain text. Each
utterance is up-voted or down-voted by volunteers according
to a list of criteria.¹ These criteria are not very restrictive, e.g.,
various kinds of background noises are allowed. Utterances
with more than two up-votes are marked as validated. The
validated utterances are then split into train, development and
test sets, with non-overlapping speakers and sentences.
3.1. Analysing Common Voice Dataset Quality for TTS
Although the validated set has been widely exploited for ASR
[21], we observe some undesirable properties for TTS:
• Noise: Speech quality may be degraded by electromag-
netic noise or acoustic noise such as mouse clicks, low
frequency noise, background speakers and background
music, among others. Since the utterances are stored as
MP3, quantization noise can also sometimes be heard.
• Low bandwidth: Due to recording choices or high com-
pression, some audio files are low-pass filtered, with a
cutoff frequency that varies from one file to another.
• Mispronunciation: We observe mispronunciations of
“unfamiliar” words, variations in the pronunciation of
certain other words, and some utterances in other lan-
guages (e.g., German utterances in the English corpus).
• Unavailable speaker metadata: Age, gender and accent
information are not available for all speakers, while
some TTS systems require this information as input.
• Other factors include variable recording characteristics
(microphone, room, recording device), speaking rate,
and volume. These recording characteristics must be
ignored by models, and the speaking rate and volume,
while being inherent characteristics of the speaker, can
enlarge the space of variables to be considered.
These characteristics are generally not a hindrance for ASR
training, and they can even be desirable for robustness. However,
this is not the case for TTS training [7, 12].
¹ https://commonvoice.mozilla.org/en/criteria
3.2. Dataset Preparation
In the following, we use the English subset of Common Voice
(version 7.0). We exclude the predefined development and
test sets, and utterances longer than 16.7 s to allow large batch
sizes. We consider all other utterances in the 2015 h vali-
dated set as candidate TTS training samples. The samples are
preprocessed by resampling from 32 or 48 kHz to 16 kHz,
and removing leading and trailing silences using pydub² with
a threshold of -50 dBFS. The range of speaker duration is also
limited to between 20 min and 10 h by randomly selecting a
10 h subset of utterances for speakers with longer duration
and discarding speakers with less than 20 min total duration.
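
The utterance and speaker duration limits described above can be sketched as follows. This is a minimal illustration and not the authors' code: the function name `limit_speaker_duration` and the `(utt_id, speaker_id, duration_s)` tuple layout are hypothetical. We also assume that durations are measured after silence trimming, and that the random 10 h subset is built by shuffling a speaker's utterances and keeping those that still fit under the cap; the text only states that the subset is selected randomly.

```python
import random
from collections import defaultdict

MAX_UTT_S = 16.7        # utterances longer than this are excluded
MIN_SPK_S = 20 * 60     # speakers with less total speech are discarded
MAX_SPK_S = 10 * 3600   # speakers with more total speech are subsampled


def limit_speaker_duration(utterances, seed=0):
    """Filter a list of (utt_id, speaker_id, duration_s) tuples.

    Hypothetical helper: the tuple layout and the fill strategy for
    the 10 h subset are assumptions made for illustration.
    """
    rng = random.Random(seed)

    # Group utterances by speaker, dropping over-long utterances.
    by_speaker = defaultdict(list)
    for utt_id, spk, dur in utterances:
        if dur <= MAX_UTT_S:
            by_speaker[spk].append((utt_id, dur))

    kept = []
    for spk, utts in by_speaker.items():
        total = sum(dur for _, dur in utts)
        if total < MIN_SPK_S:
            continue  # too little data for this speaker
        if total > MAX_SPK_S:
            # Random subset: shuffle, then keep utterances that fit.
            rng.shuffle(utts)
            subset, acc = [], 0.0
            for utt_id, dur in utts:
                if acc + dur <= MAX_SPK_S:
                    subset.append((utt_id, dur))
                    acc += dur
            utts = subset
        kept.extend((utt_id, spk, dur) for utt_id, dur in utts)
    return kept
```

Fixing the seed keeps the selected subset reproducible across runs, which matters when the original and denoised versions of the same utterances are to be compared.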
Furthermore, the training utterances are denoised using
the pretrained DPTNet model of Asteroid [22]. We run sep-
arate experiments for the original and denoised utterances to
evaluate the impact of denoising on the resulting TTS model.
4. METHODOLOGY
We filter the dataset by only selecting speakers with high au-
tomatically estimated MOS scores. We believe that utterances
from these speakers are of high quality and devoid of noise
and missing frequency bands. To ascertain this, we train dif-
ferent TTS models on the same dataset filtered at different
estimated MOS thresholds.
4.1. MOS Estimation
MOS estimation is performed using WV-MOS [10], a pretrained
MOS estimation model.³ The model combines a pretrained
wav2vec 2.0 feature extractor and a 2-layer multi-layer
perceptron (MLP) head, which are jointly fine-tuned on the
subjective evaluation scores of the Voice Conversion Chal-
lenge 2018 using a mean squared error loss. It was shown to
correlate well with human quality judgment regarding noise
and low bandwidth [10, App. C].
Every speaker is assigned a single, speaker-level WV-
MOS score by averaging the estimated utterance-level scores.
We assume that recording and environmental conditions for
each speaker remain relatively constant.
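
The speaker-level averaging, together with the threshold-based speaker selection introduced at the start of this section, can be sketched as below. The function names and the `(speaker_id, utt_id, estimated_mos)` tuple layout are hypothetical. The scores are assumed to have been computed beforehand by the MOS estimator, so the sketch does not call WV-MOS itself.

```python
from collections import defaultdict
from statistics import mean


def speaker_level_scores(utt_scores):
    """Average utterance-level scores per speaker.

    utt_scores: list of (speaker_id, utt_id, estimated_mos) tuples
    (hypothetical layout, assumed precomputed).
    """
    per_speaker = defaultdict(list)
    for spk, _utt_id, score in utt_scores:
        per_speaker[spk].append(score)
    return {spk: mean(scores) for spk, scores in per_speaker.items()}


def select_utterances(utt_scores, threshold):
    """Keep every utterance of each speaker whose mean score is above threshold."""
    spk_scores = speaker_level_scores(utt_scores)
    return [utt_id for spk, utt_id, _ in utt_scores
            if spk_scores[spk] > threshold]
```

Because selection is made at the speaker level, an individual utterance scoring below the threshold is still kept when its speaker's average is above it, and a high-scoring utterance is dropped when its speaker's average is not.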
We select all utterances from those speakers whose
speaker-level WV-MOS score is above a threshold of 4.0,
3.8, 3.5, 3.0, or 2.0, and compare the resulting TTS systems
² https://github.com/jiaaro/pydub
³ https://github.com/AndreevP/WV-MOS