Adapitch: Adaption Multi-Speaker Text-to-Speech
Conditioned on Pitch Disentangling with
Untranscribed Data
Xulong Zhang, Jianzong Wang∗, Ning Cheng, Jing Xiao
Ping An Technology (Shenzhen) Co., Ltd.
∗Corresponding author: Jianzong Wang, jzwang@188.com.
Abstract—In this paper, we propose Adapitch, a multi-speaker TTS method that adapts its supervised module with untranscribed data. We design two self-supervised modules that train the text encoder and the mel decoder separately on untranscribed data to enhance the representations of text and mel spectrogram. To better handle the prosody of the synthesized voice, the supervised TTS module is conditioned on the disentanglement of pitch, text, and speaker. Training proceeds in two phases: the text encoder and mel decoder are first pretrained in self-supervised mode and then frozen, after which the supervised module is trained on the disentangled representations. Experimental results show that Adapitch achieves much better quality than baseline methods.
Index Terms—text-to-speech (TTS), multi-speaker modeling, pitch embedding, self-supervised, adaptation
I. INTRODUCTION
Early speech synthesis methods mainly include concatenative synthesis and parametric synthesis [1]. Concatenative synthesis builds a large database of speech audio for different phonemes; for a given sentence, it retrieves the corresponding phoneme units and concatenates them to produce the output. While building such a speech database is complicated, the synthesized voice sounds natural and human-like. Parametric synthesis avoids concatenating waveforms directly: it generates acoustic parameters from which the waveform is reconstructed. Parametric synthesis has a low data cost, but the result sounds robotic and can easily be recognized as non-human speech.
Recently, neural network based text-to-speech has achieved great success, allowing us to synthesize more natural and realistic speech [2], [3]. Beyond the demand for naturalness, the growing need for personalized speech synthesis also calls for higher speech quality. However, in many practical application scenarios, only a small amount of speech data from the target speaker can be obtained, and personalized speech synthesis for the target speaker remains a big challenge. This task is the so-called low-resource TTS, which requires only a small amount of reference corpus for speech synthesis.
To solve the low-resource TTS task, previous studies [4], [5] have proposed transfer-learning adaptation: first train a complete TTS model on speakers with a large amount of corpus, and then fine-tune the trained model with the corpus of the target speaker. The problem with this method is that a certain amount of reference corpus is still required for fine-tuning, and the fine-tuned model suffers some loss in naturalness and speech quality. In addition, fine-tuning cannot be applied to real-time speech synthesis scenarios, which makes this method unattractive.
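For concreteness, the snippet below is a minimal sketch of this style of adaptation: pretrain on a large multi-speaker corpus, then briefly fine-tune on the small target-speaker corpus. The model, data loaders, loss, and hyperparameters are our own illustrative placeholders, not the exact setups of [4], [5].

import torch
from itertools import cycle

def pretrain(model, multi_speaker_loader, epochs=100, lr=1e-3):
    # Stage 1: train the full TTS model on a large transcribed multi-speaker corpus.
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for text, mel in multi_speaker_loader:
            loss = torch.nn.functional.l1_loss(model(text), mel)
            opt.zero_grad()
            loss.backward()
            opt.step()

def finetune(model, target_speaker_loader, steps=500, lr=1e-4):
    # Stage 2: adapt to the target speaker with a small reference corpus,
    # using a lower learning rate and only a few steps.
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _, (text, mel) in zip(range(steps), cycle(target_speaker_loader)):
        loss = torch.nn.functional.l1_loss(model(text), mel)
        opt.zero_grad()
        loss.backward()
        opt.step()

The offline second stage is exactly what prevents this approach from serving a new speaker in real time.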
Besides fine-tuning, previous studies [6]–[8] have proposed the use of speaker embeddings. During training, a speaker embedding is added for joint training so that no target data is needed to fine-tune the model. However, accents are often mismatched and nuances such as characteristic prosody are lost: it is difficult for a single feature to represent all the voice data and pronunciation style of the target speaker.
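One common realization of this idea, sketched below with hypothetical shapes, conditions the acoustic model on a fixed-size speaker vector (from a lookup table or an external speaker encoder) that is broadcast over time and concatenated with the text-encoder states during joint multi-speaker training.

import torch

def condition_on_speaker(encoder_states: torch.Tensor,
                         speaker_embedding: torch.Tensor) -> torch.Tensor:
    """Broadcast a per-utterance speaker vector over time and concatenate it with
    the text-encoder states: (batch, time, d_text) -> (batch, time, d_text + d_spk)."""
    batch, time, _ = encoder_states.shape
    spk = speaker_embedding.unsqueeze(1).expand(batch, time, -1)
    return torch.cat([encoder_states, spk], dim=-1)

The appeal is that no target-speaker fine-tuning is needed at inference; the drawback, as noted above, is that a single vector struggles to carry accent and prosodic nuance.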
To address the limitations of a single speaker embedding, some studies [9]–[11] have proposed embedding a variety of features beyond speaker identity, each representing a different attribute of the speaker corpus, such as prosody embedding and style embedding. However, this approach merely combines more single-dimensional features to characterize the speech of the target speaker, and the small number of features obtained from limited target data can hardly cover all of the target speaker's pronunciation.
We utilize untranscribed data to pretrain a pluggable text encoder and mel decoder in self-supervised mode, together with a supervised module that disentangles the content of speech from pitch and speaker embeddings. The supervised module consists of a phoneme-sequence encoder, a variance adaptor that disentangles pitch, text, and speaker, and a decoder that produces the final mel spectrogram. A pitch regressor and a speaker Look-Up Table (LUT) are introduced in the disentangling module: the LUT conditions the encoding to strengthen speaker-specific feature embeddings, while the pitch regressor controls the pitch information of the synthesized speech independently. The phoneme encoder thus captures content without speaker-related features, and the mel decoder, through self-supervised pretraining, provides robust decoding of the mel spectrogram.
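As a rough sketch of how such a disentangling path can be wired together, the module below combines a speaker LUT and a pitch regressor with a pitch embedding on top of the content encoding. The layer sizes, pitch-bucketing scheme, and tensor shapes are our own illustrative assumptions, not the exact Adapitch layers.

import torch
import torch.nn as nn

class VarianceAdaptor(nn.Module):
    """Illustrative disentangling module: speaker-independent content from the
    text encoder is combined with a speaker embedding (LUT) and a pitch embedding
    driven by a pitch regressor."""

    def __init__(self, d_model=256, n_speakers=128, n_pitch_bins=256):
        super().__init__()
        self.speaker_lut = nn.Embedding(n_speakers, d_model)   # speaker Look-Up Table
        self.pitch_regressor = nn.Sequential(                  # predicts one pitch value per position
            nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, 1)
        )
        self.pitch_embedding = nn.Embedding(n_pitch_bins, d_model)
        self.register_buffer("pitch_bins", torch.linspace(0.0, 1.0, n_pitch_bins - 1))

    def forward(self, content, speaker_id, pitch_target=None):
        # content: (batch, time, d_model) speaker-independent text encoding
        spk = self.speaker_lut(speaker_id).unsqueeze(1)         # (batch, 1, d_model)
        h = content + spk
        pitch_pred = self.pitch_regressor(h).squeeze(-1)        # (batch, time)
        pitch = pitch_target if pitch_target is not None else pitch_pred
        pitch_emb = self.pitch_embedding(torch.bucketize(pitch, self.pitch_bins))
        return h + pitch_emb, pitch_pred                        # decoder input, pitch prediction

During supervised training, ground-truth pitch would be fed as pitch_target and the regressor trained with an auxiliary loss; at inference the predicted (or manually adjusted) pitch is used, which is what allows pitch to be controlled independently of content and speaker.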
Our contributions are as follows: 1) Two pluggable self-supervised submodules are designed to utilize untranscribed data to enhance the representations of the text encoder and mel decoder. 2) An adaptation between the supervised input and the self-supervised input of the mel decoder is added to make use of the pretrained mel decoder (a sketch of this idea follows below).
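Contribution 2 can be pictured as a small adaptation layer that maps the supervised path's hidden states into the input space expected by the frozen, self-supervised pretrained mel decoder. The sketch below is only our reading of that idea, with hypothetical dimensions and a generic mel_decoder stub.

import torch.nn as nn

class DecoderAdaptor(nn.Module):
    """Hypothetical bridge mapping supervised hidden states into the input space
    of the frozen, self-supervised pretrained mel decoder."""

    def __init__(self, mel_decoder: nn.Module, d_supervised=256, d_decoder=256):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(d_supervised, d_decoder),
                                  nn.LayerNorm(d_decoder))
        self.mel_decoder = mel_decoder
        for p in self.mel_decoder.parameters():   # keep the pretrained decoder fixed
            p.requires_grad_(False)

    def forward(self, h):
        # h: (batch, time, d_supervised) output of the supervised variance adaptor
        return self.mel_decoder(self.proj(h))

Only the projection (and the rest of the supervised path) is trained, so the decoder keeps the robustness it gained from untranscribed data.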