Adapitch: Adaption Multi-Speaker Text-to-Speech
Conditioned on Pitch Disentangling with
Untranscribed Data
Xulong Zhang, Jianzong Wang∗, Ning Cheng, Jing Xiao
Ping An Technology (Shenzhen) Co., Ltd.
∗Corresponding author: Jianzong Wang, jzwang@188.com.
Abstract—In this paper, we propose Adapitch, a multi-speaker TTS method that adapts its supervised module with untranscribed data. We design two self-supervised modules that train the text encoder and the mel decoder separately on untranscribed data to enhance the representations of text and mel spectrogram. To better handle the prosody of the synthesized voice, the supervised TTS module is conditioned on the disentanglement of pitch, text, and speaker. Training proceeds in two phases: the text encoder and mel decoder are first pretrained in self-supervised mode and then frozen, after which the supervised module is trained on the disentangled representations. Experimental results show that Adapitch achieves much better quality than baseline methods.
Index Terms—text-to-speech (TTS), multi-speaker modeling, pitch embedding, self-supervised, adaptation
I. INTRODUCTION
Early speech synthesis methods mainly include concatenative synthesis and parametric synthesis [1]. Concatenative synthesis builds a large database of speech audio for different phonemes; for a given sentence, it retrieves the corresponding phoneme units and concatenates them to produce the output. While building such a speech database is complicated, the synthesized voice sounds natural and human-like. Parametric synthesis avoids concatenating waveforms directly: it generates acoustic parameters from which the waveform is reconstructed. Parametric synthesis has a low data cost, but the result sounds robotic and can easily be recognized as non-human speech.
Recently, neural network based text-to-speech has achieved great success, allowing us to synthesize more natural and realistic speech [2], [3]. Beyond the demand for naturalness, the growing need for personalized speech synthesis also calls for higher speech quality. However, in many practical application scenarios, only a small amount of speech data from the target speaker can be obtained, and personalized speech synthesis for the target speaker remains a big challenge. This task is the so-called low-resource TTS, which requires only a small amount of reference corpus for speech synthesis.
To solve the low-resource TTS task, previous studies [4], [5] have proposed transfer-learning adaptation: first train a complete TTS model on speakers with a large amount of corpus, and then fine-tune the trained model with the corpus of the target speaker. The problem with this method is that a certain amount of reference corpus is still required for fine-tuning, and the fine-tuned model suffers some loss in naturalness and speech quality. In addition, fine-tuning cannot be applied to real-time speech synthesis scenarios, which makes this method unattractive.
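For concreteness, the snippet below is a minimal sketch of this style of adaptation: pretrain on a large multi-speaker corpus, then briefly fine-tune on the small target-speaker corpus. The model, data loaders, loss, and hyperparameters are our own illustrative placeholders, not the exact setups of [4], [5].

import torch
from itertools import cycle

def pretrain(model, multi_speaker_loader, epochs=100, lr=1e-3):
    # Stage 1: train the full TTS model on a large transcribed multi-speaker corpus.
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for text, mel in multi_speaker_loader:
            loss = torch.nn.functional.l1_loss(model(text), mel)
            opt.zero_grad()
            loss.backward()
            opt.step()

def finetune(model, target_speaker_loader, steps=500, lr=1e-4):
    # Stage 2: adapt to the target speaker with a small reference corpus,
    # using a lower learning rate and only a few steps.
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _, (text, mel) in zip(range(steps), cycle(target_speaker_loader)):
        loss = torch.nn.functional.l1_loss(model(text), mel)
        opt.zero_grad()
        loss.backward()
        opt.step()

The offline second stage is exactly what prevents this approach from serving a new speaker in real time.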
Besides fine-tuning, previous studies [6]–[8] have proposed the use of speaker embeddings. During training, a speaker embedding is added for joint training so that no target data is needed to fine-tune the model. However, accents are often mismatched and nuances such as characteristic prosody are lost: it is difficult for a single feature to represent all the voice data and pronunciation style of the target speaker.
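One common realization of this idea, sketched below with hypothetical shapes, conditions the acoustic model on a fixed-size speaker vector (from a lookup table or an external speaker encoder) that is broadcast over time and concatenated with the text-encoder states during joint multi-speaker training.

import torch

def condition_on_speaker(encoder_states: torch.Tensor,
                         speaker_embedding: torch.Tensor) -> torch.Tensor:
    """Broadcast a per-utterance speaker vector over time and concatenate it with
    the text-encoder states: (batch, time, d_text) -> (batch, time, d_text + d_spk)."""
    batch, time, _ = encoder_states.shape
    spk = speaker_embedding.unsqueeze(1).expand(batch, time, -1)
    return torch.cat([encoder_states, spk], dim=-1)

The appeal is that no target-speaker fine-tuning is needed at inference; the drawback, as noted above, is that a single vector struggles to carry accent and prosodic nuance.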
To address the limitations of a single speaker embedding, some studies [9]–[11] have proposed embedding a variety of features beyond speaker identity, each representing a different attribute of the speaker corpus, such as prosody embedding and style embedding. However, this approach merely combines more single-dimensional features to characterize the speech of the target speaker, and the small number of features obtained from limited target data can hardly cover all of the target speaker's pronunciation.
We utilize untranscribed data to pretrain a pluggable text encoder and mel decoder in self-supervised mode, together with a supervised module that disentangles the content of speech from pitch and speaker embeddings. The supervised module consists of a phoneme-sequence encoder, a variance adaptor that disentangles pitch, text, and speaker, and a decoder that produces the final mel spectrogram. A pitch regressor and a speaker Look-Up Table (LUT) are introduced in the disentangling module: the LUT conditions the encoding to strengthen speaker-specific feature embeddings, while the pitch regressor controls the pitch information of the synthesized speech independently. The phoneme encoder thus captures content without speaker-related features, and the mel decoder, through self-supervised pretraining, provides robust decoding of the mel spectrogram.
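As a rough sketch of how such a disentangling path can be wired together, the module below combines a speaker LUT and a pitch regressor with a pitch embedding on top of the content encoding. The layer sizes, pitch-bucketing scheme, and tensor shapes are our own illustrative assumptions, not the exact Adapitch layers.

import torch
import torch.nn as nn

class VarianceAdaptor(nn.Module):
    """Illustrative disentangling module: speaker-independent content from the
    text encoder is combined with a speaker embedding (LUT) and a pitch embedding
    driven by a pitch regressor."""

    def __init__(self, d_model=256, n_speakers=128, n_pitch_bins=256):
        super().__init__()
        self.speaker_lut = nn.Embedding(n_speakers, d_model)   # speaker Look-Up Table
        self.pitch_regressor = nn.Sequential(                  # predicts one pitch value per position
            nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, 1)
        )
        self.pitch_embedding = nn.Embedding(n_pitch_bins, d_model)
        self.register_buffer("pitch_bins", torch.linspace(0.0, 1.0, n_pitch_bins - 1))

    def forward(self, content, speaker_id, pitch_target=None):
        # content: (batch, time, d_model) speaker-independent text encoding
        spk = self.speaker_lut(speaker_id).unsqueeze(1)         # (batch, 1, d_model)
        h = content + spk
        pitch_pred = self.pitch_regressor(h).squeeze(-1)        # (batch, time)
        pitch = pitch_target if pitch_target is not None else pitch_pred
        pitch_emb = self.pitch_embedding(torch.bucketize(pitch, self.pitch_bins))
        return h + pitch_emb, pitch_pred                        # decoder input, pitch prediction

During supervised training, ground-truth pitch would be fed as pitch_target and the regressor trained with an auxiliary loss; at inference the predicted (or manually adjusted) pitch is used, which is what allows pitch to be controlled independently of content and speaker.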
Our contributions are as follows: 1) Two pluggable self-supervised submodules are designed to utilize untranscribed data to enhance the representations of the text encoder and mel decoder. 2) An adaptation between the supervised input and the self-supervised input of the mel decoder is added to make use of the pretrained mel decoder (a sketch of this idea follows below).
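Contribution 2 can be pictured as a small adaptation layer that maps the supervised path's hidden states into the input space expected by the frozen, self-supervised pretrained mel decoder. The sketch below is only our reading of that idea, with hypothetical dimensions and a generic mel_decoder stub.

import torch.nn as nn

class DecoderAdaptor(nn.Module):
    """Hypothetical bridge mapping supervised hidden states into the input space
    of the frozen, self-supervised pretrained mel decoder."""

    def __init__(self, mel_decoder: nn.Module, d_supervised=256, d_decoder=256):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(d_supervised, d_decoder),
                                  nn.LayerNorm(d_decoder))
        self.mel_decoder = mel_decoder
        for p in self.mel_decoder.parameters():   # keep the pretrained decoder fixed
            p.requires_grad_(False)

    def forward(self, h):
        # h: (batch, time, d_supervised) output of the supervised variance adaptor
        return self.mel_decoder(self.proj(h))

Only the projection (and the rest of the supervised path) is trained, so the decoder keeps the robustness it gained from untranscribed data.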