In some applications, such as online communication, it is desirable to preserve the original
rise/fall pitch pattern of the input, since it strongly affects the nuance of speech. Inappropriate
modification of the relative pitch pattern can cause miscommunication regarding the speaker's
intent and/or emotion. One way to reliably preserve the relative pitch pattern of the input speech
is to take the PV approach, which allows us to explicitly specify the $F_0$ pattern of the converted
speech. However, one disadvantage is that it generally cannot achieve sound quality as high as that
of the NV approach. Recently, several $F_0$-controllable NV methods have been proposed, with which
both the $F_0$ and timbre conditionings can be specified explicitly [17, 22, 27]. However, even with
these models, the generated speech does not necessarily follow the same $F_0$ pattern as the input
conditioning. This is because the relationship between the specified $F_0$ pattern and the conversion
result is not taken into account in the training criteria; the decoder may therefore disregard the
specified $F_0$ pattern and fail to reproduce it faithfully in the reconstruction process.
In this paper, we develop a disentangled and $F_0$-controllable neural VC model (DisC-VC). DisC-VC
consists of a VAE-based generator and an auxiliary network. The auxiliary network, composed of an
$F_0$ extractor and a speaker classifier, ensures that the conversion result correctly reflects the
specified $F_0$/timbre information. We demonstrate the effectiveness of DisC-VC through both objective
and subjective evaluations.
2 Proposed Method
2.1 Preprocessing of training samples
Figure 1 shows the architecture of the DisC-VC model. During the training phase, we use the
standardized log mel-spectrogram $x \in \mathbb{R}^{F \times T}$, the log $F_0$ pattern
$\lambda \in \mathbb{R}^{T}$, and the speaker index $s \in \{1, 2, \ldots, S\}$ as the input features,
where $T$, $F$, and $S$ denote the numbers of time bins, frequency bins, and speakers, respectively.
To obtain $x$, we first compute the log mel-spectrogram $x^{(0)}$. Then, each entry of the column
vectors of $x^{(0)}$ is standardized according to the mean and standard deviation computed over all
time bins and training samples. To obtain $\lambda$, we extract the $F_0$ pattern from the input
speech signal with the WORLD $F_0$ estimator [25] and apply a logarithmic transformation to its
non-zero entries.
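As a concrete illustration, below is a minimal preprocessing sketch in Python. It assumes librosa for mel-spectrogram extraction and the pyworld wrapper of the WORLD estimator; the function name, the precomputed per-bin statistics mel_mean/mel_std, and all hyperparameters are illustrative assumptions rather than the paper's exact settings.

```python
import numpy as np
import librosa   # assumed here for mel-spectrogram extraction
import pyworld   # Python wrapper of the WORLD F0 estimator

def preprocess(wav, sr, mel_mean, mel_std, n_mels=80):
    """Return the standardized log mel-spectrogram x (F x T) and the
    log-F0 pattern lambda (T,). mel_mean / mel_std are per-frequency-bin
    statistics precomputed over all time bins of all training samples,
    each of shape (n_mels, 1)."""
    # Log mel-spectrogram x^(0) of shape (F, T).
    mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_mels=n_mels)
    log_mel = np.log(mel + 1e-8)
    # Standardize each frequency bin with the global training statistics.
    x = (log_mel - mel_mean) / mel_std

    # F0 pattern from WORLD; unvoiced frames are returned as 0.
    wav64 = wav.astype(np.float64)
    f0, t = pyworld.dio(wav64, sr)
    f0 = pyworld.stonemask(wav64, f0, t, sr)
    # Logarithmic transform applied to the non-zero (voiced) entries only.
    log_f0 = np.where(f0 > 0, np.log(np.maximum(f0, 1e-8)), 0.0)
    return x, log_f0
```

Note that the mel-spectrogram hop length and the WORLD frame period must be chosen so that the two feature sequences are time-aligned; this alignment step is omitted here for brevity.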
2.2 Generator
The generator consists of a content encoder $\mathrm{ENC}_C$ and a decoder $\mathrm{DEC}$. To reflect
the discrete nature of the phonemic units of speech, we adopt the encoder of the Vector
Quantised-Variational AutoEncoder (VQ-VAE) [39] as $\mathrm{ENC}_C$. In the generator, we first
extract the content feature $c = \mathrm{ENC}_C(x) \in \mathbb{R}^{N_C \times T}$ from $x$, where we
set $N_C = 16$ in the experiments. Then, the decoder reconstructs the mean and variance of the log
mel-spectrogram, $(\hat{\mu}, \hat{\sigma}) = \mathrm{DEC}(c, \lambda, s)$, where
$\hat{\mu}, \hat{\sigma} \in \mathbb{R}^{F \times T}$.
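For concreteness, here is a minimal PyTorch sketch of such a generator. The layer choices, codebook size, straight-through vector quantization, and the concatenative conditioning on $(\lambda, s)$ are illustrative assumptions, not the paper's exact architecture; the decoder below also outputs a log-variance instead of $\hat{\sigma}$ directly, a common parameterization for numerical stability.

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Sketch of a VQ-VAE-style content encoder plus a decoder that predicts
    the mean and (log-)variance of the log mel-spectrogram."""

    def __init__(self, n_mels=80, n_c=16, codebook_size=512, n_speakers=4):
        super().__init__()
        # Single conv layers stand in for the real (deeper) networks.
        self.encoder = nn.Conv1d(n_mels, n_c, kernel_size=5, padding=2)
        self.codebook = nn.Embedding(codebook_size, n_c)  # VQ code vectors
        self.spk_emb = nn.Embedding(n_speakers, n_c)
        # Decoder input: quantized content + log-F0 + speaker embedding.
        self.decoder = nn.Conv1d(2 * n_c + 1, 2 * n_mels, kernel_size=5, padding=2)

    def quantize(self, c):
        # Nearest-code lookup with a straight-through gradient estimator.
        d = torch.cdist(c.transpose(1, 2), self.codebook.weight[None])  # (B, T, K)
        cq = self.codebook(d.argmin(-1)).transpose(1, 2)                # (B, n_c, T)
        return c + (cq - c).detach()

    def forward(self, x, log_f0, spk):
        # x: (B, F, T), log_f0: (B, T), spk: (B,) long tensor.
        c = self.quantize(self.encoder(x))                   # content feature
        s = self.spk_emb(spk)[:, :, None].expand(-1, -1, x.size(-1))
        h = torch.cat([c, log_f0[:, None, :], s], dim=1)     # condition on (lambda, s)
        mu, log_var = self.decoder(h).chunk(2, dim=1)        # each (B, F, T)
        return mu, log_var, c
```

The straight-through estimator copies gradients from the quantized output back to the encoder, which is the standard way to train through the non-differentiable codebook lookup in VQ-VAE.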
2.3 Training objective
2.3.1 Reconstruction error
The first training objective is the reconstruction error of the generator. Let $q_\phi(c \mid x)$ be
the probability density function (PDF) of the content feature $c$ conditioned on the log
mel-spectrogram $x$, according to which the content encoder with parameters $\phi$ generates the
content feature. Let $p_\theta(x \mid c, \lambda, s)$ be the PDF of $x$ conditioned on the set of
features $(c, \lambda, s)$, according to which the decoder with parameters $\theta$ generates the
output mel-spectrogram. We assume the prior distribution of $c$ to be a discrete uniform distribution
independent of the $F_0$/timbre features (i.e., $p(c \mid \lambda, s) = p(c)$). By Jensen's
inequality, the negative log marginal likelihood can be upper bounded as
$$
-\log p_\theta(x \mid \lambda, s) \le D - \mathbb{E}_{c \sim q_\phi(c \mid x)}\bigl[\log p_\theta(x \mid c, \lambda, s)\bigr], \tag{1}
$$
where $D := D_{\mathrm{KL}}\bigl[q_\phi(c \mid x) \,\|\, p(c)\bigr]$ and $D_{\mathrm{KL}}(q \,\|\, p)$
denotes the KL divergence of $q$ from $p$. It can be shown that the difference between the right- and
left-hand sides of (1) equals $D_{\mathrm{KL}}\bigl(q_\phi(c \mid x) \,\|\, p_\theta(c \mid x, \lambda, s)\bigr)$.
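In implementation terms, the expectation term on the right-hand side of (1) is estimated as a Gaussian negative log-likelihood under the decoder's output distribution. A minimal sketch, assuming the diagonal-Gaussian decoder and the log-variance parameterization used in the generator sketch above:

```python
import math
import torch

def reconstruction_loss(x, mu, log_var):
    """Estimate of -E_{c ~ q(c|x)}[log p_theta(x | c, lambda, s)] for a
    diagonal Gaussian p_theta, i.e. the second term on the right-hand side
    of Eq. (1). Summed over frequency/time bins, averaged over the batch."""
    nll = 0.5 * (log_var + (x - mu) ** 2 / log_var.exp() + math.log(2.0 * math.pi))
    return nll.sum(dim=(1, 2)).mean()
```

Since the VQ-VAE posterior $q_\phi(c \mid x)$ is deterministic and the prior $p(c)$ is uniform, the KL term $D$ reduces to a constant; standard VQ-VAE training therefore replaces it with codebook and commitment losses. This remark describes the usual VQ-VAE recipe and is an assumption here, not a statement of the paper's exact objective.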