
training data in one locale improves performance in another for rea-
sons orthogonal to linguistics.
(iii) Through ablation studies, we highlight the impact of modeling
choices, including multilingual pre-training and model capacity, on
SQuId's performance.
Additional Related Work. Automatic MOS prediction has a
long history in the TTS literature [24]. In addition to the systems
cited above, recent work includes [25–29], none of which tackle
multilinguality. Our work directly builds upon [12], which experiments
with a wide range of pre-trained models. Our methodology is similar,
but we scale up the data (from about 30K to 1.3M samples), the number
of locales (from 2 to 65), and the model size (from 300M to 600M
parameters), and we contribute a novel analysis of multilinguality.
Cross-lingual transfer and massively multilingual NLP have a
rich history in MT [15–17] and pre-trained models [18,30]. Authors
have studied cross-lingual transfer for at least a decade in speech
recognition [31–35] and TTS [22, 36, 37].
2. MULTILINGUAL MOS PREDICTION IN THE WILD
We wish to predict MOS Naturalness ratings for both human and
synthetic speech. Broadly speaking, MOS Naturalness describes
how human-like an utterance sounds. Our main resource is an in-
house corpus, aggregating approximately 1.9 million ratings in 66
locales across 2,092 research and commercial annotation projects
completed between January 2021 and March 2022. Most of the au-
dio is generated by TTS systems including both concatenative and
neural systems, and the annotators are asked to select a rating based
on how natural or unnatural the samples sound. The test sets used
for the evaluations are primarily focused on TTS applications such
as virtual assistant responses, driving directions, book passages, and
news articles; general text from web-crawled corpora is also used.
Sentences are typically rated in isolation (that is, outside of the con-
text in which they originally appeared), though entire paragraphs
are occasionally rated as well. Listening tests were conducted with
crowd-sourced raters on an internal ratings platform, using a 9-point
Likert scale with 0.5-point increments (i.e., ratings from 1 to 5 in
steps of 0.5). This variety in test
sets and TTS technologies means the stimuli contain a diverse set
of errors, including pronunciation errors, text normalization errors,
unnatural prosody, and acoustic artifacts such as discontinuities and
signal-correlated noise.
We apply two splits to the dataset. First, we split the data by
time. We use the ratings collected between January 1st and Decem-
ber 1st 2021 for training and development, and the rest for test. The
motivation is to simulate a realistic use case, whereby a TTS engi-
neer would want to use the past annotations to predict future ones.
The second split is based on region: we hold out 24 locales for which
we have exceptionally few ratings (fewer than 8,000 each, adding up
to about 5% of the data) and use them for test. The rationale is that
such small tasks yield little improvement during training but are
particularly useful for analysis. Since they have no training data, we
refer to these locales as zero-shot locales, as opposed to fine-tuned locales.
The dataset is skewed towards US English (18%), followed by UK
English (12%), and ES Spanish (4.2%). To build the development
set, we sample 2.5% of the training set without replacement. Table 1
provides additional statistics.
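To make the two splits concrete, the sketch below shows one way they could be implemented, assuming the ratings live in a pandas dataframe; the column names ("timestamp", "locale"), the cutoff handling, and the zero-shot locale list are illustrative assumptions, not the exact pipeline used in this work.

```python
import pandas as pd

def split_ratings(df: pd.DataFrame, zero_shot_locales: set, dev_frac: float = 0.025, seed: int = 0):
    # Zero-shot locales are held out entirely for test.
    zero_shot_test = df[df["locale"].isin(zero_shot_locales)]
    remaining = df[~df["locale"].isin(zero_shot_locales)]

    # Time-based split: the data spans Jan. 2021 to Mar. 2022; ratings
    # collected before Dec. 1st, 2021 go to training/dev, the rest to test.
    cutoff = pd.Timestamp("2021-12-01")
    train_dev = remaining[remaining["timestamp"] < cutoff]
    time_test = remaining[remaining["timestamp"] >= cutoff]

    # Development set: 2.5% of the training pool, sampled without replacement.
    dev = train_dev.sample(frac=dev_frac, replace=False, random_state=seed)
    train = train_dev.drop(dev.index)

    test = pd.concat([time_test, zero_shot_test])
    return train, dev, test
```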
Challenges and Caveats. Due to the nature of the splits, we do not
expect the data to be i.i.d.: the TTS systems and annotators used
to produce and rate the utterances in February 2021 are usually not
those of January 2022. The number, listening conditions, and quality
of the annotations also vary across projects and locales, as do the
input texts chosen to test the systems. Furthermore, it is generally
understood that the term naturalness is underspecified and may be
interpreted differently by different raters [38, 39]. In short, these
conditions reflect TTS evaluation "in the wild".

Table 1: MOS Prediction dataset statistics.
Num. training utt. / systems / locales: 969,589 / 1,476 / 42
Num. dev utt. / systems / locales: 34,042 / 1,474 / 42
Num. test utt. / systems / locales: 381,323 / 605 / 65
Avg. utterance duration (seconds) / num. ratings: 4.5 / 1.4
3. THE SQUID MODEL
The most important design decision behind our study is to fine-tune
a single model on all locales rather than keeping separate models.
This offers convenience, since we have one model to maintain rather
than 65. More importantly, we assume that if the model has enough
capacity, positive transfer will emerge between the locales [17].
SQuId is based on mSLAM [13], a recently published multi-
modal pre-trained model trained on unlabelled speech (429K hours
in 51 languages), text (15TB in 101 languages), and speech-text pairs
(2.3K hours). We chose this model because it produced state-of-the-
art results in many languages at the time of writing. It is based on the
Conformer architecture, with 600M parameters by default. SQuId’s
input is the spectrogram of a 16 kHz utterance, along with an optional
locale tag. The output is a scalar. We fine-tune the model end-to-end
with a simple regression loss. After optionally resampling the audio to
16 kHz, we compute an 80-dimensional log Mel spectrogram and extract
mSLAM embeddings $e^i_1, \ldots, e^i_T$ for each time step. We mean-pool
across the time dimension, returning an embedding $e^i_*$, and apply a
fully connected layer to obtain the prediction $\hat{y}_i = M \cdot e^i_* + b$.
By default we use $T = 3{,}200$ time steps, and the embeddings have
dimension 1,024. The target MOS ratings are linearly rescaled from
[1, 5] to [0, 1]. By default we use batch size 32 and learning rate $10^{-5}$,
obtained with hyper-parameter search during a preliminary set of ex-
periments. We train the models for 100k steps, save a snapshot every
10k steps, and export the snapshot that performs best on our
development set. We run experiments on Cloud TPU v3, using the
Adam optimizer with 1,500 warmup steps.
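For illustration, the following minimal numpy sketch reproduces the regression head described above: mean-pooling the encoder embeddings over time, a single fully connected projection to a scalar, and a squared-error loss against MOS targets rescaled from [1, 5] to [0, 1]. The mSLAM encoder itself is omitted, and the weight initialization and the exact form of the regression loss are assumptions made for the sketch.

```python
import numpy as np

EMBED_DIM = 1024  # mSLAM embedding size

class RegressionHead:
    """Fully connected head mapping pooled embeddings to a scalar prediction."""
    def __init__(self, rng=np.random.default_rng(0)):
        self.M = rng.normal(scale=0.02, size=(EMBED_DIM,))  # projection weights
        self.b = 0.0                                         # bias

    def predict(self, embeddings: np.ndarray) -> float:
        # embeddings: (T, EMBED_DIM) array of per-time-step encoder outputs.
        e_star = embeddings.mean(axis=0)        # mean-pool across time
        return float(self.M @ e_star + self.b)  # scalar prediction in [0, 1]

def rescale_mos(y: float) -> float:
    """Linearly rescale a MOS rating from [1, 5] to [0, 1]."""
    return (y - 1.0) / 4.0

def regression_loss(y_hat: float, y: float) -> float:
    """Squared-error regression loss on the rescaled rating (an assumption)."""
    return (y_hat - rescale_mos(y)) ** 2
```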
Additionally, two optimizations lead to slight but consistent per-
formance improvements. (i) We embed the locale tag of each utterance
into a 64-dimensional vector $e_{\ell_i}$ and concatenate it to $e^i_*$,
forming the vector $[e^i_*, e_{\ell_i}]$. For 5% of the data, we use a
wildcard identifier ANY-LOC, which we use for inference on locales unseen
during training. (ii) We sample the data with temperature to rebalance the
relative proportions of the training locales. As described in [17], we
resample each locale $\ell$ with probability $p_\ell^{1/\tau}$. We use
$\tau = 10$, obtained by hyper-parameter search on the development set.
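As an illustration of these two optimizations, the sketch below computes the temperature-based resampling probabilities from raw per-locale counts and shows the locale-embedding concatenation with the ANY-LOC fallback; the function names, the lookup-table representation, and the example counts are hypothetical.

```python
import numpy as np

def temperature_sampling_probs(locale_counts: dict, tau: float = 10.0) -> dict:
    """Resample each locale l with probability proportional to p_l ** (1 / tau),
    where p_l is the locale's empirical share of the training data [17]."""
    total = sum(locale_counts.values())
    unnorm = {loc: (n / total) ** (1.0 / tau) for loc, n in locale_counts.items()}
    z = sum(unnorm.values())
    return {loc: w / z for loc, w in unnorm.items()}

def concat_locale_embedding(e_star: np.ndarray, locale_table: dict, locale: str) -> np.ndarray:
    """Concatenate the pooled utterance embedding with a 64-dim locale embedding.
    Locales unseen during training fall back to the wildcard ANY-LOC entry."""
    e_loc = locale_table.get(locale, locale_table["ANY-LOC"])
    return np.concatenate([e_star, e_loc])

# Illustrative counts (not from the paper): with tau = 10 the sampling
# distribution is close to uniform, down-weighting high-resource locales.
probs = temperature_sampling_probs({"en-US": 180_000, "en-GB": 120_000, "th-TH": 5_000})
```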
4. PERFORMANCE
Let us now present SQuId’s overall performance. To validate our ap-
proach, we first ensure that it performs well on VoiceMOS’22, cur-
rently the main benchmark for MOS prediction. We then scale up
the test set and analyze its performance on 65 locales. Throughout
the section, we will compare SQuId to Big-SSL-MOS, a competitive
baseline in the spirit of SSL-MOS [12]. We fine-tune w2v-BERT
on the main VoiceMOS dataset with a regression objective, using a
600M-parameter version of the pre-trained model to ensure a fair
comparison ([12] uses up to 317M). The model comes in two variants:
English-only and multilingual. The architecture and dataset of
our w2v-BERT implementation are described in detail in [13].