Adapitch: Adaption Multi-Speaker Text-to-Speech
Conditioned on Pitch Disentangling with
Untranscribed Data
Xulong Zhang, Jianzong Wang, Ning Cheng, Jing Xiao
Ping An Technology (Shenzhen) Co., Ltd.
Corresponding author: Jianzong Wang, jzwang@188.com.
Abstract—In this paper, we propose Adapitch, a multi-speaker TTS method that adapts its supervised module with untranscribed data. We design two self-supervised modules that train the text encoder and the mel decoder separately on untranscribed data, enhancing their representations of text and mel spectrograms. To better handle prosody in the synthesized voice, a supervised TTS module is conditioned on a disentanglement of content into pitch, text, and speaker. The training phase is split into two parts: the text encoder and mel decoder are first pretrained and frozen in self-supervised mode, and the disentangling TTS module is then trained in supervised mode. Experimental results show that Adapitch achieves much better quality than baseline methods.
Index Terms—text-to-speech (TTS), multi-speaker modeling, pitch embedding, self-supervised learning, adaptation
I. INTRODUCTION
Early speech synthesis methods mainly comprised concatenative synthesis and parametric synthesis [1]. Concatenative synthesis builds a large database of speech audio for different phonemes; for a given sentence, it retrieves the corresponding phoneme units and concatenates them to produce the output. Although building such a speech database is complicated, the synthesized voice sounds naturally human. Parametric synthesis avoids concatenating raw waveforms directly; instead, it generates acoustic parameters from which the waveform is reconstructed. Parametric synthesis has a low data cost, but its output sounds robotic and is easily recognized as non-human speech.
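To make the retrieval-and-concatenation idea concrete, here is a minimal Python sketch; the unit database, phoneme labels, and sample counts are hypothetical placeholders, not data from any cited system.

```python
# Minimal sketch of concatenative synthesis: look up a stored waveform
# unit for each phoneme and join the units end to end. The database
# below is a hypothetical stand-in (random samples at 16 kHz).
import numpy as np

unit_db: dict[str, np.ndarray] = {
    "HH": np.random.randn(800),
    "AH": np.random.randn(1600),
    "L":  np.random.randn(900),
    "OW": np.random.randn(1800),
}

def concatenative_tts(phonemes: list[str]) -> np.ndarray:
    """Retrieve each phoneme's waveform unit and concatenate them."""
    return np.concatenate([unit_db[p] for p in phonemes])

waveform = concatenative_tts(["HH", "AH", "L", "OW"])  # "hello"-like sequence
print(waveform.shape)
```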
Recently, neural network based text-to-speech has achieved great success, allowing us to synthesize more natural and realistic speech [2], [3]. Beyond naturalness, the growing demand for personalized speech synthesis requires high speech quality at the same time. However, in many practical application scenarios, only a small amount of speech data from the target speaker can be obtained, and personalized speech synthesis for that speaker remains a major challenge. This is the so-called low-resource TTS task, in which only a small reference corpus is available for speech synthesis.
To solve the low-resource TTS task, previous studies [4], [5] have proposed transfer-learning adaptation: first train a complete TTS model on a large corpus from data-rich speakers, then fine-tune the trained model on the target speaker's corpus. The problem with this method is that a certain amount of reference corpus is still required for fine-tuning, and the fine-tuned model loses some naturalness and speech quality. In addition, fine-tuning cannot be applied to real-time speech synthesis scenarios, making this method unattractive.
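As a rough illustration of this pretrain-then-fine-tune recipe, the following PyTorch sketch fine-tunes a toy acoustic model on a small target-speaker corpus; the `TinyTTS` module, checkpoint path, and fake data are our assumptions, not the cited systems.

```python
# Hedged sketch of transfer-learning adaptation: load a model pretrained
# on a large multi-speaker corpus, then fine-tune it on a few target
# utterances with a small learning rate.
import torch
import torch.nn as nn

class TinyTTS(nn.Module):
    """Toy stand-in for a full TTS acoustic model (text ids -> mel frames)."""
    def __init__(self, vocab=64, mel_dim=80):
        super().__init__()
        self.embed = nn.Embedding(vocab, 128)
        self.proj = nn.Linear(128, mel_dim)

    def forward(self, text_ids):
        return self.proj(self.embed(text_ids))

model = TinyTTS()
# In practice: model.load_state_dict(torch.load("pretrained_multispeaker.pt"))

optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
text = torch.randint(0, 64, (1, 20))   # fake phoneme ids (target speaker)
mel_target = torch.randn(1, 20, 80)    # fake target mel frames
for step in range(100):                # few-shot fine-tuning loop
    loss = nn.functional.l1_loss(model(text), mel_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```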
Beyond fine-tuning, previous studies [6]–[8] have proposed the use of speaker embeddings: during training, a speaker embedding is jointly learned so that no target-speaker data is needed to fine-tune the model. However, accents are often mismatched, and nuances such as characteristic prosody are lost; it is difficult for a single feature to represent all of the target speaker's voice data and pronunciation style.
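The conditioning itself is simple; here is a minimal sketch, assuming an additive speaker lookup table over the text encoding (module names and sizes are illustrative, not from the cited works).

```python
# Speaker-embedding conditioning: a learned lookup table maps a speaker
# id to a vector that is added to the text encoding, so one model covers
# many voices without per-speaker fine-tuning.
import torch
import torch.nn as nn

class SpeakerConditionedEncoder(nn.Module):
    def __init__(self, vocab=64, dim=128, n_speakers=100):
        super().__init__()
        self.text_embed = nn.Embedding(vocab, dim)
        self.speaker_lut = nn.Embedding(n_speakers, dim)  # speaker LUT

    def forward(self, text_ids, speaker_id):
        h = self.text_embed(text_ids)                  # (B, T, dim)
        s = self.speaker_lut(speaker_id).unsqueeze(1)  # (B, 1, dim)
        return h + s                                   # broadcast over time

enc = SpeakerConditionedEncoder()
out = enc(torch.randint(0, 64, (2, 20)), torch.tensor([3, 7]))
print(out.shape)  # torch.Size([2, 20, 128])
```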
To address the limitations of a single speaker embedding, some studies [9]–[11] have proposed embedding several features beyond speaker identity, each representing a different attribute of the speaker's corpus, such as prosody embeddings and style embeddings. However, this approach merely combines more one-dimensional descriptors to characterize the target speaker's speech, and the few features obtainable from limited target data can hardly cover all of the target speaker's pronunciation.
We utilize untranscribed data to pretrain a pluggable text encoder and mel decoder in self-supervised mode, together with a supervised module that disentangles the content of speech from pitch and speaker embeddings. The supervised module consists of a phoneme-sequence encoder, a variance adaptor that disentangles pitch, text, and speaker, and a decoder that produces the final mel spectrogram. A pitch regressor and a speaker Look-Up Table (LUT) are introduced in the disentangling module: the LUT conditions the encoding to strengthen the per-speaker feature embeddings, while the pitch regressor controls the pitch information in the synthesized speech independently of the other factors. The phoneme encoder captures the content without speaker-related features, and the mel decoder, trained in self-supervised mode, provides robust decoding of the mel spectrogram.
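The following PyTorch sketch illustrates this conditioning scheme under our own assumptions about layer types and dimensions; it is not the paper's exact architecture, only the disentangled sum of content, pitch, and speaker embeddings driving the mel decoder.

```python
# Hedged sketch of the Adapitch-style conditioning: a speaker-independent
# phoneme encoding, a pitch regressor whose output is re-embedded, and a
# speaker LUT are summed to condition the mel decoder. All layer choices
# and sizes are illustrative assumptions.
import torch
import torch.nn as nn

class AdapitchSketch(nn.Module):
    def __init__(self, vocab=64, dim=128, n_speakers=100, mel_dim=80):
        super().__init__()
        self.text_encoder = nn.Embedding(vocab, dim)      # content only
        self.speaker_lut = nn.Embedding(n_speakers, dim)  # speaker identity
        self.pitch_regressor = nn.Linear(dim, 1)          # per-phoneme pitch
        self.pitch_embed = nn.Linear(1, dim)              # pitch -> embedding
        self.mel_decoder = nn.Linear(dim, mel_dim)        # pretrained part

    def forward(self, text_ids, speaker_id):
        h = self.text_encoder(text_ids)                       # (B, T, dim)
        pitch = self.pitch_regressor(h)                       # (B, T, 1)
        cond = (h + self.pitch_embed(pitch)
                  + self.speaker_lut(speaker_id).unsqueeze(1))  # disentangled sum
        return self.mel_decoder(cond), pitch  # mel frames + pitch for its loss

model = AdapitchSketch()
mel, pitch = model(torch.randint(0, 64, (2, 20)), torch.tensor([0, 1]))
print(mel.shape, pitch.shape)
```

Because each factor enters as a separate additive term, pitch or speaker identity can be swapped at inference time without touching the content encoding.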
Our contributions are as follows: 1) Two pluggable self-supervised submodules are designed to exploit untranscribed data, enhancing the representations of the text encoder and mel decoder. 2) An adaptation between the supervised input and the self-supervised input of the mel decoder is added, so that the pretrained mel decoder can be reused.
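A minimal sketch of this two-phase schedule follows, assuming placeholder reconstruction objectives and a linear stand-in for the supervised disentangling module; the text-encoder pretraining objective is analogous and omitted for brevity.

```python
# Two-phase training: (1) pretrain the mel decoder self-supervised on
# untranscribed mels, (2) freeze the pretrained modules and train only
# the supervised adaptor. Objectives and modules are illustrative.
import torch
import torch.nn as nn

dim, mel_dim = 128, 80
text_encoder = nn.Embedding(64, dim)
mel_decoder = nn.Linear(dim, mel_dim)
adaptor = nn.Linear(dim, dim)        # supervised disentangling stand-in
mel_proj = nn.Linear(mel_dim, dim)   # assumed encoder for mel autoencoding

# Phase 1: self-supervised pretraining on untranscribed audio features.
opt1 = torch.optim.Adam(list(mel_proj.parameters()) + list(mel_decoder.parameters()))
mel = torch.randn(4, 50, mel_dim)    # untranscribed mels (fake data)
loss = nn.functional.l1_loss(mel_decoder(mel_proj(mel)), mel)
opt1.zero_grad(); loss.backward(); opt1.step()

# Phase 2: freeze pretrained modules; train only the adaptor supervised.
for p in list(text_encoder.parameters()) + list(mel_decoder.parameters()):
    p.requires_grad = False
opt2 = torch.optim.Adam(adaptor.parameters())
text = torch.randint(0, 64, (4, 50))       # transcribed phoneme ids
mel_target = torch.randn(4, 50, mel_dim)   # paired mels (fake data)
loss = nn.functional.l1_loss(mel_decoder(adaptor(text_encoder(text))), mel_target)
opt2.zero_grad(); loss.backward(); opt2.step()
```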