Part of this work on the hierarchical diffusion model-based
neural vocoder was published as a conference paper [13],
which focused on the vocoding task for ground-truth acoustic
features. In this paper, we propose a novel one-shot SVC
method that is robust against distortions and investigate the
hierarchical diffusion model-based neural vocoder in the
challenging SVC scenario.
Audio samples are available at https://t-naoya.github.io/rosvc/.
II. RELATED WORK
A. Singing voice conversion
The unique challenges in SVC are to accurately recover the
target pitch and to handle a wide variety of pitch, loudness,
and musical expression. Initial SVC approaches tackled the problem
by utilizing parallel data [1], [14]. Several methods have been
proposed to overcome the necessity of the expensive parallel
data by using deep generative models such as autoregressive
models [2], [3], [7], [15], variational autoencoders [5], GANs
[4], [8], [9], [11], and diffusion models [6]. However, they
are limited to many(any)-to-many or many-to-one cases and
cannot handle unseen target singers. Other approaches leverage
a speaker recognition network (SRN) to extract the speaker
embeddings from reference audio [15]. Li et al. [16] investi-
gated a hierarchical speaker representation for one-shot SVC.
Our approach differs in that their training objective is to
reconstruct the input voice from disentangled features, with
conversion performed only at inference time by changing the
speaker embeddings; in our approach, the input voices are
converted during training, so the model does not suffer from
a training-inference mode gap and the converted samples are
more directly constrained. Moreover, all previous approaches
focus on clean singing voices and are vulnerable to distortions. In contrast,
our work aims at any-to-any SVC on possibly distorted singing
voice without parallel data.
B. One-shot voice conversion
One-shot VC has been actively investigated in the speech
domain [17]–[20]. AdaIN-VC [17] uses a speaker encoder
to extract speaker embeddings and condition the decoder
using adaptive instance normalization (AdaIN) layers. VQVC+
[18] extracts speaker-independent content embeddings us-
ing vector quantization and utilizes the residual information
as speaker information. Fragment-VC [20] utilizes a cross-
attention mechanism to use fragments from reference samples
to produce a converted voice. Although these approaches have
shown promising results on speech, they do not scale to
singing voices due to their simplicity, as shown in our
experiments.
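The AdaIN conditioning used by AdaIN-VC [17] can be illustrated with a short sketch. This is not the authors' implementation; the function name and the toy shapes are our own assumptions. The idea is to normalize each channel of the content features over time and then apply a scale and bias predicted from the speaker embedding:

```python
import numpy as np

def adain(content, style_scale, style_bias, eps=1e-5):
    """Adaptive instance normalization (illustrative sketch).

    Each channel of `content` (shape: channels x time) is normalized to
    zero mean and unit variance over time, then re-scaled and shifted by
    per-channel statistics derived from a speaker embedding.
    """
    mean = content.mean(axis=1, keepdims=True)
    std = content.std(axis=1, keepdims=True)
    normalized = (content - mean) / (std + eps)
    return style_scale[:, None] * normalized + style_bias[:, None]

# Toy usage: 4-channel content features over 10 frames, with a
# hypothetical speaker style that doubles the channel variance.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 10))
out = adain(x, style_scale=np.full(4, 2.0), style_bias=np.zeros(4))
```

In AdaIN-VC, `style_scale` and `style_bias` are produced by the speaker encoder, so the content pathway carries no speaker statistics of its own.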
C. Noise robust voice conversion
There have been a few attempts to improve the robustness
of speech VC against noise by using clean-noisy speech
pairs to learn noise-robust representations [21], [22]. These
approaches are not directly applicable to our setting because
our model converts the singer identity during training, so there
is no denoised target that we can use to train the model via
reconstruction losses. Xie et al. proposed leveraging a
pre-trained denoising model and directly using noisy speech
as the target signal [23]. In these studies, environmental sounds
are used as noise, which is uncorrelated to voice. In contrast,
we consider reverb and accompaniment music as noise, which
can be more challenging because accompaniment music often
contains a similar harmonic structure to the singing voice, and
strong reverb effects on the singing voice make robust feature
extraction more difficult.
D. Neural vocoder
Voice conversion models often operate in acoustic feature
domains to efficiently model the mapping of speech charac-
teristics. Neural vocoders are often used for generating high-
quality waveform from acoustic features [24]–[29]. A number
of generative models have been adopted to neural vocoders
including autoregressive models [24], [25], [30], generative
adversarial networks (GANs) [28], [31]–[33], and flow-based
models [26], [27]. Recently, diffusion models [34], [35] have
been shown to generate high-fidelity samples in a wide range
of areas [36] and have been adopted for neural vocoders in
the speech domain [29], [37]. Although they are efficiently
trained by maximizing the evidence lower bound (ELBO) and
can produce high-quality speech data, the inference speed is
relatively slow compared to other non-autoregressive model-
based vocoders as they require many iterations to generate
the data. To address this problem, PriorGrad [38] introduces
a data-dependent prior, i.e., a Gaussian distribution with a
diagonal covariance matrix whose entries are the frame-wise energies
of the mel-spectrogram. As the noise drawn from the data-
dependent prior provides an initial waveform closer to the
target than the noise from a standard Gaussian, PriorGrad
achieves faster convergence and inference with superior
performance. SpecGrad [39] further improves the prior by
incorporating the spectral envelope of the mel-spectrogram to
introduce noise that is more similar to the target signal.
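As a rough sketch (not PriorGrad's actual implementation), the data-dependent prior can be illustrated as follows. The function name, the exact frame-energy definition, and the toy mel-spectrogram are illustrative assumptions; the point is that the starting noise is scaled per frame by the mel energy, so loud regions start from louder noise:

```python
import numpy as np

def data_dependent_prior(mel, hop_length, eps=1e-6, rng=None):
    """Sample waveform-length noise whose per-sample variance follows
    the frame-wise energy of a mel-spectrogram (simplified sketch of a
    PriorGrad-style data-dependent prior).

    mel: (n_mels, n_frames) magnitude mel-spectrogram.
    """
    rng = np.random.default_rng() if rng is None else rng
    frame_energy = (mel ** 2).mean(axis=0) + eps   # (n_frames,)
    frame_std = np.sqrt(frame_energy)
    # Upsample the per-frame std to the waveform sampling rate.
    sample_std = np.repeat(frame_std, hop_length)
    return rng.normal(size=sample_std.shape) * sample_std

# Toy usage: a quiet first half followed by a loud second half.
mel = np.concatenate([np.full((80, 50), 0.01),
                      np.full((80, 50), 1.0)], axis=1)
noise = data_dependent_prior(mel, hop_length=256,
                             rng=np.random.default_rng(0))
```

Because this initial noise already tracks the target's energy contour, the reverse diffusion process has less work to do than when starting from a standard Gaussian, which is the intuition behind the faster convergence reported for PriorGrad.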
However, we found that state-of-the-art neural vocoders
provide insufficient quality when applied to singing
voices. In this work, we address this problem by proposing a
hierarchical diffusion model.
III. PROPOSED ONE-SHOT SVC FRAMEWORK
We base our SVC model on the generative adversar-
ial network (GAN)-based voice conversion model called
StarGANv2-VC [40]. Although StarGANv2-VC yields excellent
sample quality in speech voice conversion, it has several
limitations when applied to SVC: (i) It is limited to the
many(any)-to-many case and cannot generate the singing voices
of unseen speakers. (ii) When applied to a singing voice,
the converted voice often sounds off-pitch. (iii) It does not have
pitch controllability, which is important for SVC as converted
singing voices are often played with an accompaniment and
thus the pitch should be aligned with the accompaniment
music. (iv) The conversion quality is severely degraded when
the singing voices contain reverb and accompaniment music.