Semi-Supervised Learning Based on Reference
Model for Low-resource TTS
Xulong Zhang, Jianzong Wang∗, Ning Cheng, Jing Xiao
Ping An Technology (Shenzhen) Co., Ltd.
Abstract—Most previous neural text-to-speech (TTS) methods are based on supervised learning, which means they depend on a large training dataset and struggle to achieve comparable performance under low-resource conditions. To address this issue, we propose a semi-supervised learning method for neural TTS in which labeled target data is limited, and which can also resolve the exposure bias problem of previous auto-regressive models. Specifically, we pre-train a reference model based on FastSpeech2 on a large source dataset and then fine-tune it on a limited target dataset. Meanwhile, pseudo labels generated by the original reference model are used to further guide the training of the fine-tuned model, providing a regularization effect and reducing overfitting during training on the limited target data. Experimental results show that our proposed semi-supervised learning scheme with limited target data significantly improves the voice quality on test data, achieving naturalness and robustness in speech synthesis.
Index Terms—semi-supervised learning, pseudo labels, low-
resource, TTS, knowledge distillation
I. INTRODUCTION
Text-to-speech (TTS) aims to convert linguistic features, such as phoneme sequences, into acoustic features, such as spectrograms, in order to synthesize intelligible and natural audio that is indistinguishable from human recordings. TTS is widely used in applications such as voice navigation, telephone banking, voice translation, e-commerce voice customer service, and smart speakers. Generally speaking, most neural TTS methods [1]–[7] use two steps to deal with the TTS problem: first, they generate a mel-spectrogram from the input text; then, a vocoder converts the mel-spectrogram into a waveform. TTS's primary challenge is the lack of training data. The recordings of target speakers are often quite limited, a problem that urgently needs to be solved. Exposure bias is a main issue for auto-regressive models; it is produced by the mismatch between the ground-truth data used during training and the model-generated data used at inference. Many existing methods [8]–[12] suffer from exposure bias in the decoder module of the auto-regressive model [13], [14].
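To make this mismatch concrete, here is a minimal PyTorch sketch (the toy decoder and all names are illustrative, not this paper's model) contrasting the teacher-forced conditioning used at training time with the free-running conditioning used at inference time:

```python
import torch
import torch.nn as nn

class ToyARDecoder(nn.Module):
    """Toy auto-regressive mel decoder: predicts frame t from frame t-1."""
    def __init__(self, n_mels: int = 80, hidden: int = 256):
        super().__init__()
        self.rnn = nn.GRUCell(n_mels, hidden)
        self.proj = nn.Linear(hidden, n_mels)
        self.hidden = hidden

    def step(self, prev_frame, state):
        state = self.rnn(prev_frame, state)
        return self.proj(state), state

    def teacher_forced(self, mel_gt):
        """Training: each step is conditioned on the GROUND-TRUTH
        previous frame (teacher forcing)."""
        batch = mel_gt.size(0)
        state = mel_gt.new_zeros(batch, self.hidden)
        prev = mel_gt.new_zeros(batch, mel_gt.size(2))
        outputs = []
        for t in range(mel_gt.size(1)):
            frame, state = self.step(prev, state)
            outputs.append(frame)
            prev = mel_gt[:, t]          # ground truth, not the prediction
        return torch.stack(outputs, dim=1)

    def free_running(self, batch, n_frames, n_mels=80):
        """Inference: each step is conditioned on the model's OWN previous
        output, so early errors compound -- the source of exposure bias."""
        state = torch.zeros(batch, self.hidden)
        prev = torch.zeros(batch, n_mels)
        outputs = []
        for _ in range(n_frames):
            prev, state = self.step(prev, state)
            outputs.append(prev)
        return torch.stack(outputs, dim=1)
```

Because the free-running path never sees ground-truth frames, any early prediction error feeds back into all later steps, which is exactly the train/inference mismatch described above.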
A traditional TTS system is mainly built up of two modules: the front end and the back end. Text preprocessing, such as text analysis and linguistic feature extraction, is the main function of the front end. The back end converts the linguistic features into a spectrum or directly into raw waveforms; its output is constructed from the linguistic features produced by the front end and used for speech synthesis. Traditional TTS technology [15]–[22] is complex and requires expert knowledge of phonetics and linguistics.
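As a rough sketch of this two-module structure (all function names and the toy behavior are illustrative, not from any real system):

```python
from typing import List

def front_end(text: str) -> List[str]:
    """Front end: text analysis and linguistic feature extraction.
    A real front end performs text normalization, grapheme-to-phoneme
    conversion, prosody prediction, etc.; this toy version only maps
    characters to pseudo-phoneme symbols."""
    return [ch for ch in text.lower() if ch.isalpha()]

def back_end(linguistic_features: List[str]) -> List[List[float]]:
    """Back end: converts linguistic features into acoustic output.
    A real back end emits spectrogram frames or raw waveforms; this
    stub emits one zero-valued 80-dim mel frame per input symbol."""
    return [[0.0] * 80 for _ in linguistic_features]

def synthesize(text: str) -> List[List[float]]:
    """Full pipeline: the front end feeds the back end."""
    return back_end(front_end(text))
```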
∗Corresponding author: Jianzong Wang, jzwang@188.com.
Neural TTS has attracted much attention in the deep learning and speech communities in recent years. Most research uses deep neural network-based methods to deal with TTS tasks. WaveNet [23], a probabilistic auto-regressive model, takes linguistic features extracted from the input text as input. However, a huge dataset, on the scale of tens of thousands of samples, is needed to train the model. Tacotron [6] can directly generate waveform signals from input text. Its experimental results achieved 3.82 in terms of mean opinion score (MOS), surpassing production parametric systems in the naturalness of the generated speech.
Shen et al. [5] proposed Tacotron2, using WaveNet as the vocoder instead of Griffin-Lim [24], which achieved a MOS of 4.53. Tacotron and Tacotron2 are conditioned on sufficient data; with limited data, the models do not work well. As far as we know, it takes at least ten hours of recordings to build a natural TTS system. Specifically, there are strict requirements on the recording environment, such as a professional studio for sound collection. Besides, the content of the recordings should cover enough phonemes, and the distribution of the phonemes should be well balanced. It is very costly and hard to build such a vast, high-quality dataset covering different speakers. Therefore, it is still a critical task to synthesize any utterance in a target speaker's voice from a few minutes of audio recordings, that is, to implement TTS under few-shot conditions.
Generally, there will be a degradation in sound quality and robustness when training a TTS model with a limited dataset [25]. To enlarge the capacity of the model for adding new speakers, a pre-trained TTS model is fine-tuned with the voice of the new speaker, a research topic named few-shot TTS [26], [27] and also known as speaker adaptation [4], [28]–[33]. However, these methods need an additional fine-tuning process with several minutes or more of recordings of the new speaker, and a limited amount of labeled target data can easily lead to overfitting of the model. Therefore, this approach has certain limitations: although fine-tuning can adapt the pre-trained model to new speakers and achieve multi-speaker TTS, training the model with few samples of the target speaker may lead to errors, for example in cross-lingual speaking.
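As background for the scheme introduced next, here is a minimal PyTorch sketch of pseudo-label regularization (the function, the L1 losses, and the weight `alpha` are illustrative assumptions, not the paper's exact formulation): a frozen copy of the pre-trained model supplies pseudo labels that keep the fine-tuned model from overfitting the few labeled target samples.

```python
import torch
import torch.nn.functional as F

def finetune_step(model, reference, batch, optimizer, alpha=0.5):
    """One fine-tuning step on the limited labeled target data.

    model:     pre-trained TTS model being fine-tuned
    reference: frozen pre-trained reference model (pseudo-label source)
    alpha:     weight of the pseudo-label term (illustrative value)
    """
    text, mel_gt = batch                  # phoneme ids and target mel frames
    mel_pred = model(text)                # fine-tuned model's prediction

    with torch.no_grad():
        mel_pseudo = reference(text)      # pseudo labels from the reference

    # Supervised loss against the few ground-truth target recordings.
    loss_sup = F.l1_loss(mel_pred, mel_gt)
    # Pseudo-label loss: keeps the fine-tuned model close to the reference,
    # acting as a regularizer that reduces overfitting.
    loss_distill = F.l1_loss(mel_pred, mel_pseudo)

    loss = loss_sup + alpha * loss_distill
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In such a setup, the reference model would typically be a frozen `copy.deepcopy` of the pre-trained network, with `requires_grad_(False)` set on all of its parameters.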
In this paper, we focus on a semi-supervised learning scheme based on a reference model for few-shot neural TTS, which performs well at inference on out-of-domain samples. In this method, the reference model based on the backbone network