In some applications, such as online communication, it is desirable to preserve the original
rise/fall pitch pattern of the input, since it strongly affects the nuance of speech. Inappropriate
modification of the relative pitch pattern can cause miscommunication regarding the speaker's
intent and/or emotion. One way to reliably preserve the relative pitch pattern of the input speech
is to take the PV approach, which allows us to explicitly specify the $F_0$ pattern of the converted
speech. However, one disadvantage is that it generally cannot achieve sound quality as high as that
of the NV approach. Recently, several $F_0$-controllable NV methods have been proposed, with which
both the $F_0$ and timbre conditionings can be specified explicitly [17, 22, 27]. However, even with
these models, the generated speech does not necessarily follow the same $F_0$ pattern as the input
conditioning. This is because the relationship between the specified $F_0$ pattern and the conversion
result is not taken into account in the training criteria; the decoder may therefore disregard the
specified $F_0$ pattern and fail to reproduce it faithfully in the reconstruction process.
In this paper, we develop a disentangled and $F_0$-controllable neural VC model (DisC-VC). DisC-VC
consists of a VAE-based generator and an auxiliary network. The auxiliary network, composed of an
$F_0$ extractor and a speaker classifier, ensures that the conversion result correctly reflects the
specified $F_0$/timbre information. We demonstrate the effectiveness of DisC-VC through both objective
and subjective evaluations.
2 Proposed Method
2.1 Preprocessing of training samples
Figure 1 shows the architecture of the DisC-VC model. During the training phase, we use the
standardized log mel-spectrogram $x \in \mathbb{R}^{F \times T}$, the log $F_0$ pattern
$\lambda \in \mathbb{R}^{T}$, and the speaker index $s \in \{1, 2, \ldots, S\}$ as the input features,
where $T$, $F$, and $S$ denote the numbers of time bins, frequency bins, and speakers, respectively.
To obtain $x$, we first compute the log mel-spectrogram $x^{(0)}$. Then, each entry of the column
vectors of $x^{(0)}$ is standardized according to the mean and standard deviation computed over all
time bins and training samples. To obtain $\lambda$, we extract the $F_0$ pattern from the input
speech signal with the WORLD $F_0$ estimator [25] and apply a logarithmic transformation to its
non-zero entries.
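As a concrete illustration, below is a minimal preprocessing sketch in Python. It assumes librosa for mel-spectrogram extraction and the pyworld wrapper of the WORLD estimator; the function name, the precomputed per-bin statistics mel_mean/mel_std, and all hyperparameters are illustrative assumptions rather than the paper's exact settings.

```python
import numpy as np
import librosa   # assumed here for mel-spectrogram extraction
import pyworld   # Python wrapper of the WORLD F0 estimator

def preprocess(wav, sr, mel_mean, mel_std, n_mels=80):
    """Return the standardized log mel-spectrogram x (F x T) and the
    log-F0 pattern lambda (T,). mel_mean / mel_std are per-frequency-bin
    statistics precomputed over all time bins of all training samples,
    each of shape (n_mels, 1)."""
    # Log mel-spectrogram x^(0) of shape (F, T).
    mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_mels=n_mels)
    log_mel = np.log(mel + 1e-8)
    # Standardize each frequency bin with the global training statistics.
    x = (log_mel - mel_mean) / mel_std

    # F0 pattern from WORLD; unvoiced frames are returned as 0.
    wav64 = wav.astype(np.float64)
    f0, t = pyworld.dio(wav64, sr)
    f0 = pyworld.stonemask(wav64, f0, t, sr)
    # Logarithmic transform applied to the non-zero (voiced) entries only.
    log_f0 = np.where(f0 > 0, np.log(np.maximum(f0, 1e-8)), 0.0)
    return x, log_f0
```

Note that the mel-spectrogram hop length and the WORLD frame period must be chosen so that the two feature sequences are time-aligned; this alignment step is omitted here for brevity.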
2.2 Generator
The generator consists of a content encoder $\mathrm{ENC}_C$ and a decoder $\mathrm{DEC}$. To reflect
the discrete nature of the phonemic units of speech, we adopt the encoder of the Vector
Quantised-Variational AutoEncoder (VQ-VAE) [39] as $\mathrm{ENC}_C$. In the generator, we first
extract the content feature $c = \mathrm{ENC}_C(x) \in \mathbb{R}^{N_C \times T}$ from $x$, where we
set $N_C = 16$ in the experiments. Then, the decoder reconstructs the mean and variance of the log
mel-spectrogram, $(\hat{\mu}, \hat{\sigma}) = \mathrm{DEC}(c, \lambda, s)$, where
$\hat{\mu}, \hat{\sigma} \in \mathbb{R}^{F \times T}$.
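For concreteness, here is a minimal PyTorch sketch of such a generator. The layer choices, codebook size, straight-through vector quantization, and the concatenative conditioning on $(\lambda, s)$ are illustrative assumptions, not the paper's exact architecture; the decoder below also outputs a log-variance instead of $\hat{\sigma}$ directly, a common parameterization for numerical stability.

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Sketch of a VQ-VAE-style content encoder plus a decoder that predicts
    the mean and (log-)variance of the log mel-spectrogram."""

    def __init__(self, n_mels=80, n_c=16, codebook_size=512, n_speakers=4):
        super().__init__()
        # Single conv layers stand in for the real (deeper) networks.
        self.encoder = nn.Conv1d(n_mels, n_c, kernel_size=5, padding=2)
        self.codebook = nn.Embedding(codebook_size, n_c)  # VQ code vectors
        self.spk_emb = nn.Embedding(n_speakers, n_c)
        # Decoder input: quantized content + log-F0 + speaker embedding.
        self.decoder = nn.Conv1d(2 * n_c + 1, 2 * n_mels, kernel_size=5, padding=2)

    def quantize(self, c):
        # Nearest-code lookup with a straight-through gradient estimator.
        d = torch.cdist(c.transpose(1, 2), self.codebook.weight[None])  # (B, T, K)
        cq = self.codebook(d.argmin(-1)).transpose(1, 2)                # (B, n_c, T)
        return c + (cq - c).detach()

    def forward(self, x, log_f0, spk):
        # x: (B, F, T), log_f0: (B, T), spk: (B,) long tensor.
        c = self.quantize(self.encoder(x))                   # content feature
        s = self.spk_emb(spk)[:, :, None].expand(-1, -1, x.size(-1))
        h = torch.cat([c, log_f0[:, None, :], s], dim=1)     # condition on (lambda, s)
        mu, log_var = self.decoder(h).chunk(2, dim=1)        # each (B, F, T)
        return mu, log_var, c
```

The straight-through estimator copies gradients from the quantized output back to the encoder, which is the standard way to train through the non-differentiable codebook lookup in VQ-VAE.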
2.3 Training objective
2.3.1 Reconstruction error
The first training objective is the reconstruction error of the generator. Let $q_\phi(c \mid x)$ be
the probability density function (PDF) of the content feature $c$ conditioned on the log
mel-spectrogram $x$, according to which the content encoder with parameters $\phi$ generates the
content feature. Let $p_\theta(x \mid c, \lambda, s)$ be the PDF of $x$ conditioned on the set of
features $(c, \lambda, s)$, according to which the decoder with parameters $\theta$ generates the
output mel-spectrogram. We assume the prior distribution of $c$ to be a discrete uniform distribution
independent of the $F_0$/timbre features (i.e., $p(c \mid \lambda, s) = p(c)$). By Jensen's
inequality, the negative log marginal likelihood can be upper bounded as
$$
-\log p_\theta(x \mid \lambda, s) \le D - \mathbb{E}_{c \sim q_\phi(c \mid x)}\bigl[\log p_\theta(x \mid c, \lambda, s)\bigr], \tag{1}
$$
where $D := D_{\mathrm{KL}}\bigl[q_\phi(c \mid x) \,\|\, p(c)\bigr]$ and $D_{\mathrm{KL}}(q \,\|\, p)$
denotes the KL divergence of $q$ from $p$. It can be shown that the difference between the right- and
left-hand sides of (1) equals $D_{\mathrm{KL}}\bigl(q_\phi(c \mid x) \,\|\, p_\theta(c \mid x, \lambda, s)\bigr)$.
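In implementation terms, the expectation term on the right-hand side of (1) is estimated as a Gaussian negative log-likelihood under the decoder's output distribution. A minimal sketch, assuming the diagonal-Gaussian decoder and the log-variance parameterization used in the generator sketch above:

```python
import math
import torch

def reconstruction_loss(x, mu, log_var):
    """Estimate of -E_{c ~ q(c|x)}[log p_theta(x | c, lambda, s)] for a
    diagonal Gaussian p_theta, i.e. the second term on the right-hand side
    of Eq. (1). Summed over frequency/time bins, averaged over the batch."""
    nll = 0.5 * (log_var + (x - mu) ** 2 / log_var.exp() + math.log(2.0 * math.pi))
    return nll.sum(dim=(1, 2)).mean()
```

Since the VQ-VAE posterior $q_\phi(c \mid x)$ is deterministic and the prior $p(c)$ is uniform, the KL term $D$ reduces to a constant; standard VQ-VAE training therefore replaces it with codebook and commitment losses. This remark describes the usual VQ-VAE recipe and is an assumption here, not a statement of the paper's exact objective.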