Semi-Supervised Learning Based on Reference
Model for Low-resource TTS
Xulong Zhang, Jianzong Wang, Ning Cheng, Jing Xiao
Ping An Technology (Shenzhen) Co., Ltd.
Abstract—Most previous neural text-to-speech (TTS) methods are based on supervised learning, which means they depend on a large training dataset and find it hard to achieve comparable performance under low-resource conditions. To address this issue, we propose a semi-supervised learning method for neural TTS in which labeled target data is limited, and which can also resolve the problem of exposure bias in previous auto-regressive models. Specifically, we pre-train a reference model based on Fastspeech2 with a large amount of source data and fine-tune it on a limited target dataset. Meanwhile, pseudo labels generated by the original reference model are used to further guide the training of the fine-tuned model, achieving a regularization effect and reducing overfitting of the fine-tuned model during training on the limited target data. Experimental results show that our proposed semi-supervised learning scheme with limited target data significantly improves the voice quality on test data, achieving naturalness and robustness in speech synthesis.
Index Terms—semi-supervised learning, pseudo labels, low-
resource, TTS, knowledge distillation
I. INTRODUCTION
Text-to-speech (TTS) converts linguistic features, from phonemes to the acoustic features of the spectrum, to synthesize understandable, natural audio indistinguishable from human recordings. TTS is widely used in applications such as voice navigation, telephone banking, voice translation, e-commerce voice customer service, and smart speakers. Generally speaking, most neural TTS methods [1]–[7] deal with the TTS problem in two steps: first, they generate a mel-spectrogram from the input text; then, a vocoder converts the mel-spectrogram into a waveform. TTS's primary challenge is the lack of training data. The recording material of target speakers is quite limited, a problem that urgently needs to be solved. Exposure bias is the main issue for auto-regressive models; it is produced by the mismatch between the ground-truth data seen during training and the generated data used at inference. Many existing methods [8]–[12] encounter exposure bias in the decoder module of the auto-regressive model [13], [14].
A traditional TTS system is mainly built up of two modules: a front end and a back end. Preprocessing of the text, such as text analysis and linguistic feature extraction, is the main function of the front end. The back end converts the linguistic features into spectra or directly into raw waveforms. The output is constructed according to the linguistic functions of the front end and used for speech synthesis. Traditional TTS technology [15]–[22] is complex and requires professional knowledge of phonetics and linguistics.
Corresponding author: Jianzong Wang, jzwang@188.com.
Neural TTS has attracted much attention in the deep learning and speech community in recent years. Most research uses deep neural network-based methods to deal with TTS tasks. WaveNet [23], a probabilistic auto-regressive model, takes linguistic features extracted from input texts as input; however, huge data on the scale of tens of thousands of samples was needed to train the model. Tacotron [6] directly generates waveform signals from input text. Its experimental results achieved 3.82 in terms of mean opinion score (MOS), surpassing production parametric systems in the naturalness of the generated speech.
Shen et al. [5] proposed Tacotron2, using WaveNet as the vocoder instead of Griffin-Lim [24], which achieved a MOS of 4.53. Tacotron and Tacotron2 were conditioned on sufficient data; with limited data, the models do not work well. As far as we know, it takes at least ten hours of recordings to build a natural TTS system. Specifically, there are strict requirements on the recording environment, such as a professional studio for sound collection. Besides, the content of the recordings should cover enough phonemes, and the distribution of the phonemes should be well tuned. It is very costly and hard to build such a vast, high-quality dataset covering different speakers. Therefore, it is still a critical task to utilize a few minutes of audio recordings to synthesize any speech in the target's voice, i.e., to implement TTS under few-shot conditions.
Generally, there will be a degradation in sound quality and robustness when training a TTS model with a limited dataset [25]. To enlarge the capacity of the model for adding new speakers, a pre-trained TTS model is fine-tuned with the voice of the new speaker, a research topic named few-shot TTS [26], [27], also known as speaker adaptation [4], [28]–[33]. However, these methods need an additional fine-tuning process with several minutes or more of recordings of the new speakers, and the limited amount of labeled target data can easily lead to overfitting of the model. Therefore, they have certain limitations: although fine-tuning can adapt the pre-trained model to new speakers and achieve multi-speaker TTS, training the model with few samples from the target speaker may lead to errors, for example in cross-lingual speech.
In this paper, we focus on a semi-supervised learning scheme: semi-supervised learning based on a reference model for few-shot neural TTS, which performs well at inference on out-of-domain samples. In this method, a reference model based on the backbone network of Fastspeech2 is pre-trained on a large amount of multi-speaker recordings. The reference model is then transferred to the low-data target speaker dataset to be fine-tuned. Meanwhile, pseudo labels generated by the original reference model are used to further guide the training of the fine-tuned model, achieving a regularization effect and reducing overfitting of the fine-tuned model during training on the limited target data.
II. RELATED WORKS
A. Knowledge Distillation
Knowledge distillation (KD) [34] lets a student model acquire information from a teacher model. Its success is usually attributed to the privileged information in the similarity between the class distributions of the teacher model and the student model. It was first proposed by Hinton et al. [34] to transfer knowledge from large teacher networks to smaller student networks. It works by training students to predict the target classification labels and to imitate the teachers' class probabilities, because these probabilities contain additional information about how teachers generalize [34].
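As a concrete illustration, a minimal sketch of this distillation objective is given below; the temperature T and mixing weight alpha are illustrative hyper-parameters of the standard formulation, not values taken from the paper.

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Hard term: the student predicts the target classification labels.
    hard = F.cross_entropy(student_logits, labels)
    # Soft term: the student imitates the teacher's softened class
    # probabilities; the T*T factor keeps gradient magnitudes comparable
    # across temperatures, as suggested in [34].
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    return alpha * hard + (1.0 - alpha) * soft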
Liu et al. [35] applied the teacher-student approach to resolve the problem of exposure bias. Exposure bias exists in autoregressive models due to the mismatch between the training and inference phases. This problem can lead to unpredictable errors during inference, with the errors accumulating frame by frame along the time axis.
B. Pseudo Label
Pseudo labels [36] are the labels predicted by a model with maximum probability for unlabeled data samples, and they may not be the real target classes. Pseudo labels can alleviate the need for handcrafted labels from humans. During the training phase, pseudo labels and real labels are applied together to train the new model in a supervised mode. For unlabeled data, each weight update recalculates the pseudo labels, which are used to supervise the model training task with the same loss function. Because labeled and unlabeled data can differ hugely in scale, balancing the two kinds of data is very important to the performance of the final trained model.
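The sketch below shows a minimal pseudo-label training step in the style of [36]; the ramp-up schedule alpha(t) that balances the labeled and unlabeled terms, and its constants, are illustrative assumptions.

import torch
import torch.nn.functional as F

def alpha(step, ramp_start=100, ramp_end=600, alpha_max=3.0):
    # Slowly increase the weight of the unlabeled term so that early,
    # unreliable pseudo labels do not dominate training.
    if step < ramp_start:
        return 0.0
    if step > ramp_end:
        return alpha_max
    return alpha_max * (step - ramp_start) / (ramp_end - ramp_start)

def pseudo_label_step(model, x_labeled, y_labeled, x_unlabeled, step):
    # Supervised term on the labeled data.
    loss_l = F.cross_entropy(model(x_labeled), y_labeled)
    # Pseudo labels: the class with maximum predicted probability,
    # recomputed at every weight update and treated as ground truth.
    with torch.no_grad():
        pseudo = model(x_unlabeled).argmax(dim=-1)
    loss_u = F.cross_entropy(model(x_unlabeled), pseudo)
    # Balance the two terms with the schedule alpha(t).
    return loss_l + alpha(step) * loss_u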
Higuchi et al. [37] used pseudo labels for automatic speech recognition (ASR), and the results show an improvement from using text generated from untranscribed audio. The task of TTS, in contrast, has usually been treated in a supervised mode. A semi-supervised learning method based on generated labels could reduce the cost of collecting paired data for training.
III. PROPOSED METHOD
Our method is a semi-supervised learning scheme. It works well when labeled data is not abundant, and it can address the problem of exposure bias caused by the mismatched autoregressive processes between the inference and training phases. First, we pre-train a backbone network based on Fastspeech2; this pre-trained network serves as the reference model and introduces a reference loss. The total loss is obtained by configuring an appropriate trade-off parameter ω, where the reference model is fixed during the training iterations and the fine-tuned model, initialized as a copy of the reference model, is adapted to the target data. We illustrate the overall architecture of the proposed semi-supervised learning scheme in Figure 1.
The training objective is mainly built from a hard loss and a reference loss. The hard loss is the MSE loss between the mel-spectrogram predicted by the self-training model and the ground-truth spectrum. The reference loss is the MSE loss between the spectrum predicted by the pre-trained reference model and the spectrum predicted by the self-training model.
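A minimal sketch of one fine-tuning step under this objective is shown below; the model interfaces and the value of omega are illustrative assumptions, not the paper's exact implementation.

import copy
import torch
import torch.nn.functional as F

def semi_supervised_loss(adapted_model, reference_model, text, mel_gt, omega=0.5):
    mel_pred = adapted_model(text)
    # Hard loss: MSE against the ground-truth mel-spectrogram of the
    # limited target data.
    hard_loss = F.mse_loss(mel_pred, mel_gt)
    # Reference loss: MSE against the pseudo label, i.e. the spectrum
    # predicted by the fixed pre-trained reference model.
    with torch.no_grad():
        mel_ref = reference_model(text)
    reference_loss = F.mse_loss(mel_pred, mel_ref)
    # Total loss with the trade-off parameter omega.
    return hard_loss + omega * reference_loss

# Setup sketch: the adapted model starts as a copy of the reference
# model, whose weights stay frozen throughout fine-tuning.
#   adapted_model = copy.deepcopy(reference_model)
#   for p in reference_model.parameters():
#       p.requires_grad = False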
Fig. 1. Diagram of the semi-supervised learning method based on the backbone network, in two steps: Step 1, pre-training the reference model with abundant source data; Step 2, fine-tuning the original reference model with a limited target dataset, while pseudo labels generated by the original reference model are used to further guide the training of the adapted model.
A. The Backbone Network
We follow the architecture of the main components of Fastspeech2 and use feed-forward Transformer blocks to build our model. The method uses the encoder and decoder network structure of the Fastspeech2 [38] model. It is a sequence-to-sequence feature prediction network, where the encoder converts the input phoneme sequence into a latent vector, and the decoder predicts the output mel-spectrogram from the latent vector of linguistic features. The HiFi-GAN [39] vocoder is used for audio generation from the mel-spectrum. The choice of waveform generation technology does not affect the validity of the proposed training scheme.
Figure 2 shows the overall network structure of the model. The input of the model is a text sequence from a speaker in the training set. After being mapped into a learned 512-dimensional phoneme embedding, it is passed into an encoder of three feed-forward Transformer modules; positional encoding and a speaker embedding are added to the input of the encoder. After the encoder, a fixed-length context vector is obtained. The latent vector is fed to three predictors that predict energy, pitch, and duration separately. Together with the predicted energy, pitch, and duration, the speaker embedding and positional encoding are fed into the decoder to decode the mel-spectrum. The decoder consists of four feed-forward Transformer layers and generates the mel-spectrogram from the encoded input text sequence. The speaker embedding module uses the x-vector for speaker representation, which is added in both the encoder and the decoder.
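To make the data flow concrete, the sketch below approximates this architecture; standard Transformer encoder layers stand in for the feed-forward Transformer (FFT) blocks, length regulation by the predicted durations is omitted, and the head count, filter size, and 80-bin mel dimension are illustrative assumptions rather than values from the paper.

import math
import torch
import torch.nn as nn

def positional_encoding(length, d_model):
    # Standard sinusoidal positional encoding.
    pos = torch.arange(length).unsqueeze(1).float()
    div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe = torch.zeros(length, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

class BackboneSketch(nn.Module):
    def __init__(self, n_phonemes=100, d_model=512, n_mels=80, xvec_dim=512):
        super().__init__()
        self.phoneme_emb = nn.Embedding(n_phonemes, d_model)  # learned 512-dim embedding
        self.spk_proj = nn.Linear(xvec_dim, d_model)          # x-vector speaker representation
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=2, dim_feedforward=1024, batch_first=True)
        dec_layer = nn.TransformerEncoderLayer(d_model, nhead=2, dim_feedforward=1024, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=3)  # three FFT blocks
        self.decoder = nn.TransformerEncoder(dec_layer, num_layers=4)  # four FFT blocks
        # One small predictor each for duration, pitch, and energy.
        self.duration_pred = nn.Linear(d_model, 1)
        self.pitch_pred = nn.Linear(d_model, 1)
        self.energy_pred = nn.Linear(d_model, 1)
        self.pitch_emb = nn.Linear(1, d_model)
        self.energy_emb = nn.Linear(1, d_model)
        self.mel_out = nn.Linear(d_model, n_mels)

    def forward(self, phonemes, xvector):
        # Encoder input: phoneme embedding + positional encoding + speaker embedding.
        spk = self.spk_proj(xvector).unsqueeze(1)
        h = self.phoneme_emb(phonemes)
        h = h + positional_encoding(h.size(1), h.size(2)) + spk
        latent = self.encoder(h)
        # Variance predictors on the latent vector.
        dur = self.duration_pred(latent)
        pitch = self.pitch_pred(latent)
        energy = self.energy_pred(latent)
        # Decoder input: latent plus variance embeddings, with speaker and
        # positional information added again before decoding the mel-spectrum.
        d = latent + self.pitch_emb(pitch) + self.energy_emb(energy)
        d = d + positional_encoding(d.size(1), d.size(2)) + spk
        mel = self.mel_out(self.decoder(d))
        return mel, dur, pitch, energy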