Semi-Supervised Learning Based on Reference
Model for Low-resource TTS
Xulong Zhang, Jianzong Wang∗, Ning Cheng, Jing Xiao
Ping An Technology (Shenzhen) Co., Ltd.
Abstract—Most previous neural text-to-speech (TTS) methods are based on supervised learning, which means they depend on a large training dataset and struggle to achieve comparable performance under low-resource conditions. To address this issue, we propose a semi-supervised learning method for neural TTS in which labeled target data is limited, and which can also resolve the exposure bias problem of previous auto-regressive models. Specifically, we pre-train a reference model based on FastSpeech2 on a large source dataset and then fine-tune it on a limited target dataset. Meanwhile, pseudo labels generated by the original reference model are used to further guide the training of the fine-tuned model, providing a regularization effect and reducing overfitting during training on the limited target data. Experimental results show that our proposed semi-supervised learning scheme with limited target data significantly improves the voice quality on test data, achieving naturalness and robustness in speech synthesis.
Index Terms—semi-supervised learning, pseudo labels, low-
resource, TTS, knowledge distillation
I. INTRODUCTION
Text-to-speech (TTS) aims to convert linguistic features, such as phoneme sequences, into acoustic features, such as spectrograms, in order to synthesize intelligible and natural audio that is indistinguishable from human recordings. TTS is widely used in applications such as voice navigation, telephone banking, voice translation, e-commerce voice customer service, and smart speakers. Generally speaking, most neural TTS methods [1]–[7] use two steps to deal with the TTS problem: first, they generate a mel-spectrogram from the input text; then, a vocoder converts the mel-spectrogram into a waveform. TTS's primary challenge is the lack of training data. The recordings of target speakers are often quite limited, a problem that urgently needs to be solved. Exposure bias is a main issue for auto-regressive models; it is produced by the mismatch between the ground-truth data used during training and the model-generated data used at inference. Many existing methods [8]–[12] suffer from exposure bias in the decoder module of the auto-regressive model [13], [14].
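To make this mismatch concrete, here is a minimal PyTorch sketch (the toy decoder and all names are illustrative, not this paper's model) contrasting the teacher-forced conditioning used at training time with the free-running conditioning used at inference time:

```python
import torch
import torch.nn as nn

class ToyARDecoder(nn.Module):
    """Toy auto-regressive mel decoder: predicts frame t from frame t-1."""
    def __init__(self, n_mels: int = 80, hidden: int = 256):
        super().__init__()
        self.rnn = nn.GRUCell(n_mels, hidden)
        self.proj = nn.Linear(hidden, n_mels)
        self.hidden = hidden

    def step(self, prev_frame, state):
        state = self.rnn(prev_frame, state)
        return self.proj(state), state

    def teacher_forced(self, mel_gt):
        """Training: each step is conditioned on the GROUND-TRUTH
        previous frame (teacher forcing)."""
        batch = mel_gt.size(0)
        state = mel_gt.new_zeros(batch, self.hidden)
        prev = mel_gt.new_zeros(batch, mel_gt.size(2))
        outputs = []
        for t in range(mel_gt.size(1)):
            frame, state = self.step(prev, state)
            outputs.append(frame)
            prev = mel_gt[:, t]          # ground truth, not the prediction
        return torch.stack(outputs, dim=1)

    def free_running(self, batch, n_frames, n_mels=80):
        """Inference: each step is conditioned on the model's OWN previous
        output, so early errors compound -- the source of exposure bias."""
        state = torch.zeros(batch, self.hidden)
        prev = torch.zeros(batch, n_mels)
        outputs = []
        for _ in range(n_frames):
            prev, state = self.step(prev, state)
            outputs.append(prev)
        return torch.stack(outputs, dim=1)
```

Because the free-running path never sees ground-truth frames, any early prediction error feeds back into all later steps, which is exactly the train/inference mismatch described above.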
A traditional TTS system is mainly built up of two modules: the front end and the back end. Text preprocessing, such as text analysis and linguistic feature extraction, is the main function of the front end. The back end converts the linguistic features into a spectrum or directly into raw waveforms; its output is constructed from the linguistic features produced by the front end and used for speech synthesis. Traditional TTS technology [15]–[22] is complex and requires expert knowledge of phonetics and linguistics.
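As a rough sketch of this two-module structure (all function names and the toy behavior are illustrative, not from any real system):

```python
from typing import List

def front_end(text: str) -> List[str]:
    """Front end: text analysis and linguistic feature extraction.
    A real front end performs text normalization, grapheme-to-phoneme
    conversion, prosody prediction, etc.; this toy version only maps
    characters to pseudo-phoneme symbols."""
    return [ch for ch in text.lower() if ch.isalpha()]

def back_end(linguistic_features: List[str]) -> List[List[float]]:
    """Back end: converts linguistic features into acoustic output.
    A real back end emits spectrogram frames or raw waveforms; this
    stub emits one zero-valued 80-dim mel frame per input symbol."""
    return [[0.0] * 80 for _ in linguistic_features]

def synthesize(text: str) -> List[List[float]]:
    """Full pipeline: the front end feeds the back end."""
    return back_end(front_end(text))
```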
∗Corresponding author: Jianzong Wang, jzwang@188.com.
Neural TTS has attracted much attention in the deep learning and speech communities in recent years. Most research uses deep neural network-based methods to deal with TTS tasks. WaveNet [23], a probabilistic auto-regressive model, takes linguistic features extracted from the input text as input. However, a huge dataset, on the scale of tens of thousands of samples, is needed to train the model. Tacotron [6] can directly generate waveform signals from input text. Its experimental results achieved 3.82 in terms of mean opinion score (MOS), surpassing production parametric systems in the naturalness of the generated speech.
Shen et al. [5] proposed Tacotron2, using WaveNet as the vocoder instead of Griffin-Lim [24], which achieved a MOS of 4.53. Tacotron and Tacotron2 are conditioned on sufficient data; with limited data, the models do not work well. As far as we know, it takes at least ten hours of recordings to build a natural TTS system. Specifically, there are strict requirements on the recording environment, such as a professional studio for sound collection. Besides, the content of the recordings should cover enough phonemes, and the distribution of the phonemes should be well balanced. It is very costly and hard to build such a vast, high-quality dataset covering different speakers. Therefore, it is still a critical task to synthesize any utterance in a target speaker's voice from a few minutes of audio recordings, that is, to implement TTS under few-shot conditions.
Generally, there will be a degradation in sound quality and robustness when training a TTS model with a limited dataset [25]. To enlarge the capacity of the model for adding new speakers, a pre-trained TTS model is fine-tuned with the voice of the new speaker, a research topic named few-shot TTS [26], [27] and also known as speaker adaptation [4], [28]–[33]. However, these methods need an additional fine-tuning process with several minutes or more of recordings of the new speaker, and a limited amount of labeled target data can easily lead to overfitting of the model. Therefore, this approach has certain limitations: although fine-tuning can adapt the pre-trained model to new speakers and achieve multi-speaker TTS, training the model with few samples of the target speaker may lead to errors, for example in cross-lingual speaking.
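As background for the scheme introduced next, here is a minimal PyTorch sketch of pseudo-label regularization (the function, the L1 losses, and the weight `alpha` are illustrative assumptions, not the paper's exact formulation): a frozen copy of the pre-trained model supplies pseudo labels that keep the fine-tuned model from overfitting the few labeled target samples.

```python
import torch
import torch.nn.functional as F

def finetune_step(model, reference, batch, optimizer, alpha=0.5):
    """One fine-tuning step on the limited labeled target data.

    model:     pre-trained TTS model being fine-tuned
    reference: frozen pre-trained reference model (pseudo-label source)
    alpha:     weight of the pseudo-label term (illustrative value)
    """
    text, mel_gt = batch                  # phoneme ids and target mel frames
    mel_pred = model(text)                # fine-tuned model's prediction

    with torch.no_grad():
        mel_pseudo = reference(text)      # pseudo labels from the reference

    # Supervised loss against the few ground-truth target recordings.
    loss_sup = F.l1_loss(mel_pred, mel_gt)
    # Pseudo-label loss: keeps the fine-tuned model close to the reference,
    # acting as a regularizer that reduces overfitting.
    loss_distill = F.l1_loss(mel_pred, mel_pseudo)

    loss = loss_sup + alpha * loss_distill
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In such a setup, the reference model would typically be a frozen `copy.deepcopy` of the pre-trained network, with `requires_grad_(False)` set on all of its parameters.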
In this paper, we focus on a semi-supervised learning scheme based on a reference model for few-shot neural TTS, which performs well at inference on out-of-domain samples. In this method, the reference model based on the backbone network