DisC-VC: Disentangled and F0-Controllable Neural Voice
Conversion
Chihiro Watanabe1 and Hirokazu Kameoka1
1NTT Communication Science Laboratories, 3-1, Morinosato Wakamiya, Atsugi-shi, Kanagawa Pref.
243-0198 Japan
Abstract
Voice conversion is the task of converting a non-linguistic feature of a given utterance. Since
the naturalness of speech strongly depends on its pitch pattern, in some applications it is desirable
to keep the original rise/fall pitch pattern while changing the speaker identity. Some existing
methods address this problem either by using a source-filter model or by developing a neural
network that takes an F0 pattern as an input to the model. Although the latter approach can achieve
relatively high sound quality compared to the former, the discrepancy between the target and
generated F0 patterns is not taken into account in its training process. In this paper, we propose a
new variational-autoencoder-based voice conversion model accompanied by an auxiliary network,
which ensures that the conversion result correctly reflects the specified F0/timbre information. We
show the effectiveness of the proposed method through objective and subjective evaluations.
Keywords. Voice conversion, F0 conversion, timbre conversion, neural network, variational autoencoder
1 Introduction
Voice conversion (VC) is the task of converting a non-linguistic feature of a given utterance, and it can
be used for various applications [24], including personalizing or changing the emotion of text-to-speech
systems [8, 11, 31]. In this paper, we focus on the conversion of the F0 pattern and timbre (i.e., speaker
identity).
A basic VC framework consists of feature extraction from a speech signal, feature conversion, and
speech synthesis based on the converted feature. With regard to feature conversion, some of the existing
methods require a parallel dataset (e.g., Gaussian-mixture-model-based [10, 35, 38], deep-neural-network-based
[3, 4], and sequence-to-sequence-based [30, 41] methods), while others do not (e.g., recognition-synthesis-based
[7, 21], phonetic-posteriorgram-based [36], variational-autoencoder (VAE) [18]-based
[12, 28, 29], generative-adversarial-network-based [6, 15, 16], flow-based [23, 34], and score-based [13, 26]
methods). As for feature extraction and speech synthesis, the existing methods can be classified
roughly into two approaches. One is the parametric-vocoder-based (PV) approach [4, 12, 35], where
we first apply timbre conversion based on mel-cepstral coefficients (MCCs) and F0 conversion (e.g., a
linear transformation) separately, and then synthesize a waveform with a parametric vocoder such
as WORLD [25]. The other is the neural-vocoder-based (NV) approach [2, 14, 29], where a
neural network is used for speech synthesis. While early neural vocoders required spectral envelope- and
excitation-related features as input [5, 37], the recent mainstream is to synthesize a waveform from a
mel-spectrogram [20, 40].
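A minimal sketch of the kind of linear F0 transformation used in the PV approach, applied in the log-F0 domain over voiced frames (the function name and statistics arguments are illustrative, not from the paper):

```python
import numpy as np

def convert_f0_linear(f0, src_mean, src_std, tgt_mean, tgt_std):
    """Map voiced log-F0 values from source- to target-speaker statistics.

    src_mean/src_std and tgt_mean/tgt_std are the mean and standard
    deviation of log F0 over voiced frames of the source and target
    training data. Unvoiced frames (f0 == 0) are passed through unchanged.
    """
    out = np.zeros_like(f0)
    voiced = f0 > 0
    out[voiced] = np.exp(
        tgt_mean + (np.log(f0[voiced]) - src_mean) * tgt_std / src_std
    )
    return out
```

With identical source and target statistics this is the identity map on voiced frames; shifting the target mean by log 2 doubles every voiced F0 value.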
chihiro.watanabe.xz@hco.ntt.co.jp
This work was supported by JST CREST Grant Number JPMJCR19A3, Japan.
arXiv:2210.11059v1 [eess.AS] 20 Oct 2022
In some applications, such as online communication, it would be desirable to keep the original
rise/fall pitch pattern, since it strongly affects the nuance of speech. Inappropriate modification of
the relative pitch pattern could cause miscommunication with regard to the speaker’s intent and/or
emotion. One way to reliably keep the relative pitch pattern of the input speech would be to take the PV
approach, which allows us to explicitly specify the F0 pattern of the converted speech. However, one
disadvantage is that it generally cannot achieve as high sound quality as the NV approach. Recently,
some F0-controllable NV methods have been proposed, with which we can specify both the F0 and timbre
conditionings explicitly [17, 22, 27]. However, even with these models, the generated speech does not
necessarily have the same F0 pattern as the input conditioning. This is because the relationship
between the specified F0 pattern and the conversion result is not considered in the training
criteria, and thus the decoder may disregard the specified F0 pattern and fail to reproduce it faithfully
in the reconstruction process.
In this paper, we develop a disentangled and F0-controllable neural VC model (DisC-VC). DisC-VC consists
of a VAE-based generator and an auxiliary network. The auxiliary network, which is composed of an
F0 extractor and a speaker classifier, ensures that the conversion result correctly reflects the specified
F0/timbre information. We show the effectiveness of DisC-VC through both objective and subjective evaluations.
2 Proposed Method
2.1 Preprocessing of training samples
Figure 1 shows the architecture of the DisC-VC model. During the training phase, we use the standardized
log mel-spectrogram x ∈ ℝ^{F×T}, the log F0 pattern λ ∈ ℝ^{T}, and the speaker index s ∈ {1, 2, . . . , S}
as the input features, where T, F, and S indicate the numbers of time bins, frequency bins, and speakers,
respectively. To define x, we first compute the log mel-spectrogram x^{(0)}. Then, each entry of the
column vectors of x^{(0)} is standardized according to the mean and standard deviation computed over all the time
bins and training samples. To compute λ, we extract the F0 pattern from the input speech signal with
the WORLD F0 estimator [25] and apply a logarithmic transformation to its non-zero entries.
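A minimal NumPy sketch of this preprocessing, under the interpretation that each frequency bin is standardized with statistics pooled over all time bins of all training samples (function names are illustrative):

```python
import numpy as np

def standardize_mel(mel_log, bin_mean, bin_std):
    """Standardize each frequency bin of a log mel-spectrogram (F x T).

    bin_mean and bin_std (length F) are computed over all time bins of
    all training samples, so every entry of a column vector is shifted
    and scaled by its own bin's global statistics.
    """
    return (mel_log - bin_mean[:, None]) / bin_std[:, None]

def log_f0(f0):
    """Log-transform only the voiced (non-zero) entries of an F0 contour,
    leaving unvoiced frames at zero."""
    out = np.zeros_like(f0)
    voiced = f0 > 0
    out[voiced] = np.log(f0[voiced])
    return out
```

For a single training sample standardized with its own statistics, every frequency bin then has zero mean and unit variance over time.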
2.2 Generator
The generator consists of a content encoder ENC_C and a decoder DEC. To reflect the discrete nature
of the phonemic units of speech, we adopt the encoder of the Vector Quantised-Variational AutoEncoder
(VQ-VAE) [39] as ENC_C. In the generator, we first extract the content feature c = ENC_C(x) ∈ ℝ^{N_C×T}
from x, where we set N_C = 16 in the experiments. Then, the decoder reconstructs the mean and
variance of the log mel-spectrogram, (μ̂, σ̂) = DEC(c, λ, s), where μ̂, σ̂ ∈ ℝ^{F×T}.
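The quantization step at the heart of the VQ-VAE encoder can be sketched as a nearest-neighbour codebook lookup (a simplified NumPy illustration; the actual encoder also involves learned convolutions and a straight-through gradient estimator, which are omitted here):

```python
import numpy as np

def vector_quantize(z, codebook):
    """Nearest-neighbour vector quantization, the core operation of a
    VQ-VAE encoder: each N_C-dimensional column of z (N_C x T) is
    replaced by its closest entry of the codebook (K x N_C).

    Returns the quantized features (N_C x T) and the chosen indices (T,).
    """
    # squared Euclidean distance between every time step and codebook entry
    dist = ((z.T[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)  # (T, K)
    idx = dist.argmin(axis=1)
    return codebook[idx].T, idx
```

Because the output columns are restricted to the K codebook entries, the content feature carries only discrete, phoneme-like information, which is what motivates its use here.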
2.3 Training objective
2.3.1 Reconstruction error
The first training objective is the reconstruction error of the generator. Let q_φ(c|x) be the probability
density function (PDF) of the content feature c conditioned on the log mel-spectrogram x, based on which
the content encoder with parameter φ generates the content feature. Let p_θ(x|c, λ, s) be the PDF of
x conditioned on the set of features (c, λ, s), based on which the decoder with parameter θ generates the
output mel-spectrogram. We assume the prior distribution of c to be a discrete uniform distribution
independent of the F0/timbre features (i.e., p(c|λ, s) = p(c)). By using Jensen’s inequality, the negative
log marginal likelihood can be upper bounded as

−log p_θ(x|λ, s) ≤ D − E_{c∼q_φ(c|x)}[log p_θ(x|c, λ, s)],   (1)

where D := D_KL[q_φ(c|x) || p(c)] and D_KL(q || p) denotes the KL divergence of q from p. It can be shown
that the difference between the right- and left-hand sides of (1) is equal to D_KL(q_φ(c|x) || p_θ(c|x, λ, s)),
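This bound follows from the marginalization p_θ(x|λ, s) = Σ_c p_θ(x|c, λ, s) p(c|λ, s), the assumption p(c|λ, s) = p(c), and Jensen’s inequality; a short derivation:

```latex
-\log p_\theta(x \mid \lambda, s)
  = -\log \mathbb{E}_{c \sim q_\phi(c \mid x)}
      \left[ \frac{p_\theta(x \mid c, \lambda, s)\, p(c)}{q_\phi(c \mid x)} \right]
  \leq \mathbb{E}_{c \sim q_\phi(c \mid x)}
      \left[ \log \frac{q_\phi(c \mid x)}{p(c)} \right]
    - \mathbb{E}_{c \sim q_\phi(c \mid x)}
      \left[ \log p_\theta(x \mid c, \lambda, s) \right]
  = D - \mathbb{E}_{c \sim q_\phi(c \mid x)}
      \left[ \log p_\theta(x \mid c, \lambda, s) \right].
```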