Robust One-Shot Singing Voice Conversion
Naoya Takahashi, Member, IEEE, Mayank Kumar Singh, Member, IEEE, and Yuki Mitsufuji, Senior Member, IEEE
Abstract—Recent progress in deep generative models has
improved the quality of voice conversion in the speech domain.
However, high-quality singing voice conversion (SVC) of unseen
singers remains challenging due to the wider variety of musical
expressions in pitch, loudness, and pronunciation. Moreover,
singing voices are often recorded with reverb and accompaniment
music, which make SVC even more challenging. In this work,
we present a robust one-shot SVC (ROSVC) that performs
any-to-any SVC robustly even on such distorted singing voices.
To this end, we first propose a one-shot SVC model based
on generative adversarial networks that generalizes to unseen
singers via partial domain conditioning and learns to accurately
recover the target pitch via pitch distribution matching and
AdaIN-skip conditioning. We then propose a two-stage training
method called Robustify that trains the one-shot SVC model on clean data in the first stage to ensure high-quality conversion, and introduces enhancement modules to the encoders of the model in the second stage to improve feature extraction from distorted singing voices. To further improve the voice quality and
pitch reconstruction accuracy, we finally propose a hierarchical
diffusion model for singing voice neural vocoders. Experimental
results show that the proposed method outperforms state-of-the-
art one-shot SVC baselines for both seen and unseen singers and
significantly improves the robustness against distortions.
Index Terms—one-shot singing voice conversion, noise robust,
neural vocoder, diffusion models
I. INTRODUCTION
The aim of singing voice conversion (SVC) is to convert
a source singing voice into another singer’s voice while
maintaining the melody and lyrical content of the given source.
SVC has attracted increasing attention due to its potential
applications in a wide range of areas including content creation,
education, and entertainment. Despite recent advancements in voice conversion in the speech domain, SVC remains challenging
owing to the following reasons: (1) singing voices have a wider variety of pitch, loudness, and pronunciation owing to different styles of musical expression, which makes them more
challenging to model; (2) human perception is often sensitive to pitch errors in a singing voice, which is then perceived as off-pitch and fails to maintain the melodic content; (3)
the scarcity of large-scale clean singing voice datasets hinders
generalization of SVC models; and (4) SVC models are vulnerable to distortions in the input singing voices. As a result, many SVC
approaches have focused on converting a singing voice into voices seen during training (known as a many-to-many case) using relatively small and clean datasets [1]–[12]. However, it is
often difficult or even impossible to collect a clean singing
voice from the target singer in advance. Thus, extending SVC
models to unseen target singers (known as an any-to-any case)
is an inevitable requirement for many practical applications.
Moreover, in many cases a singing voice will be modified
with a reverb effect and face interference from music, as a
singer will often sing along with accompaniment music. The
distortions caused by the interference of music and reverb
contaminate the singing voice and hinder the extraction of
the acoustic features required for SVC (e.g., pitch, linguistic
content, and singer’s voice characteristics), thus leading to
a severe degradation in the SVC performance. One way
to mitigate this problem is to use music source separation
and dereverberation algorithms to remove music and reverb
from recordings. However, they often produce non-negligible
artefacts, and using such processed samples as input to an
SVC system will still considerably degrade the SVC quality.
In this paper, we propose a robust one-shot singing voice
conversion (ROSVC) method that robustly generates the singing voice of any target singer from any source singing content, even from a distorted voice. The proposed model takes as reference
less than ten seconds of the singing voice from a (possibly
unseen) target singer and converts the source singer’s voice
robustly in a one-shot manner even when the singing voice
has interference from accompaniment music and is modified with reverb effects. To this end, we propose three components:
(i) a neural network architecture and training framework that
enables one-shot SVC with accurate pitch control, (ii) a two-stage training method called Robustify that improves the robustness
of the feature extraction against the distortions of the input
singing voice, and (iii) a hierarchical diffusion model for a
singing voice vocoder that learns multiple diffusion models
at different sampling rates to improve the quality and pitch
stability.
Our extensive experiments on the NUS48E, NHSS, and
MUSDB18 datasets show that the proposed method outper-
forms five one-shot VC baselines on both seen and unseen
singers and significantly improves the robustness against dis-
tortions caused by reverb and accompaniment music.
Our contributions are summarized as follows:
1) We propose a network architecture and training method
for one-shot singing voice conversion that enables the generation of high-quality singing voices with accurate pitch recovery.
2) To consider a more practical and challenging scenario
of singing voices containing accompaniment music and
reverb, we propose a framework called Robustify that
significantly improves the robustness of the SVC model
against such distortions.
3) We further propose a hierarchical diffusion model-based
neural vocoder to generate a high-quality singing voice.
4) We conduct extensive experiments using various singing
voice datasets and show that the proposed method
outperforms state-of-the-art one-shot SVC models and
significantly improves the robustness against distortions
caused by interference from accompaniment music and
reverb.
Part of this work on the hierarchical diffusion model-based
neural vocoder was published as a conference paper [13], which focused on the vocoding task for ground-truth acoustic features. In this paper, we newly propose a one-shot SVC method that is robust against distortions and investigate the hierarchical diffusion model-based neural vocoder in the challenging SVC scenario.
Audio samples are available at our website: https://t-naoya.github.io/rosvc/.
II. RELATED WORKS
A. Singing voice conversion
A unique challenge in SVC is to accurately recover the target pitch and handle the wide variety of pitch, loudness, and musical
expression. Initial SVC approaches tackled the SVC problem
by utilizing parallel data [1], [14]. Several methods have been
proposed to overcome the necessity of expensive parallel data by using deep generative models such as autoregressive
models [2], [3], [7], [15], variational autoencoders [5], GANs
[4], [8], [9], [11], and diffusion models [6]. However, they
are limited to many(any)-to-many or many-to-one cases and
cannot handle unseen target singers. Other approaches leverage
a speaker recognition network (SRN) to extract the speaker
embeddings from reference audio [15]. Li et al. [16] investi-
gated a hierarchical speaker representation for one-shot SVC.
Our approach differs in that their training objective is to reconstruct the input voice from disentangled features, with the conversion performed only at inference time by changing the speaker embeddings, whereas in our approach the input voices are converted during training, so the model does not suffer from a training-inference mode gap and the converted samples are more directly constrained. Moreover, all previous approaches focus on clean singing voices and are prone to distortions. In contrast,
our work aims at any-to-any SVC on possibly distorted singing
voices without parallel data.
B. One-shot voice conversion
One-shot VC has been actively investigated in the speech
domain [17]–[20]. AdaIN-VC [17] uses a speaker encoder
to extract speaker embeddings and condition the decoder
using adaptive instance normalization (AdaIN) layers. VQVC+
[18] extracts speaker-independent content embeddings us-
ing vector quantization and utilizes the residual information
as speaker information. Fragment-VC [20] utilizes a cross-
attention mechanism to use fragments from reference samples
to produce a converted voice. Although these approaches have
shown promising results on speech, they do not scale to singing voices due to their simplicity, as shown in our experiments.
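To make the AdaIN-style conditioning concrete (AdaIN-VC conditions its decoder this way, and our generator also uses AdaIN blocks, as shown in Fig. 1), below is a minimal PyTorch sketch of an adaptive instance normalization layer; the module name, dimensions, and affine projection are illustrative assumptions rather than the exact architecture of [17] or of our model.

```python
import torch
import torch.nn as nn

class AdaIN(nn.Module):
    """Adaptive instance normalization: normalize content features per channel,
    then re-scale and shift them with statistics predicted from a style embedding."""

    def __init__(self, num_channels: int, style_dim: int):
        super().__init__()
        self.norm = nn.InstanceNorm1d(num_channels, affine=False)
        # Predict a per-channel gain and bias from the global style embedding.
        self.affine = nn.Linear(style_dim, 2 * num_channels)

    def forward(self, content: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        # content: (batch, channels, frames), style: (batch, style_dim)
        gamma, beta = self.affine(style).chunk(2, dim=-1)
        normalized = self.norm(content)
        return (1 + gamma.unsqueeze(-1)) * normalized + beta.unsqueeze(-1)
```

Swapping the style embedding at inference time is then enough to render the same content with a different voice, which is the property these one-shot VC models rely on.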
C. Noise robust voice conversion
There have been a few attempts to improve the robustness
of speech VC against noise by using clean–noisy speech pairs to learn noise-robust representations [21], [22]. These
approaches are not directly applicable because our model
converts the singer identity during training and there is no denoised target that we can utilize to train the model via reconstruction losses. Xie et al. propose leveraging a
pre-trained denoising model and directly using noisy speech
as a target signal [23]. In these studies, environmental sounds
are used as noise, which is uncorrelated with the voice. In contrast,
we consider reverb and accompaniment music as noise, which
can be more challenging because accompaniment music often
contains a similar harmonic structure to the singing voice, and
strong reverb effects on the singing voice make robust feature
extraction more difficult.
D. Neural vocoder
Voice conversion models often operate in acoustic feature
domains to efficiently model the mapping of speech charac-
teristics. Neural vocoders are often used for generating high-quality waveforms from acoustic features [24]–[29]. A number
of generative models have been adopted for neural vocoders,
including autoregressive models [24], [25], [30], generative
adversarial networks (GANs) [28], [31]–[33], and flow-based
models [26], [27]. Recently, diffusion models [34], [35] have
been shown to generate high-fidelity samples in a wide range
of areas [36] and have been adopted for neural vocoders in
the speech domain [29], [37]. Although they are efficiently
trained by maximizing the evidence lower bound (ELBO) and
can produce high-quality speech data, the inference speed is
relatively slow compared to other non-autoregressive model-
based vocoders as they require many iterations to generate
the data. To address this problem, PriorGrad [38] introduces
a data-dependent prior, i.e., a Gaussian distribution with a diag-
onal covariance matrix whose entries are frame-wise energies
of the mel-spectrogram. As the noise drawn from the data-
dependent prior provides an initial waveform closer to the
target than the noise from a standard Gaussian, PriorGrad
achieves faster convergence and inference with superior performance. SpecGrad [39] further improves the prior by
incorporating the spectral envelope of the mel-spectrogram to
introduce noise that is more similar to the target signal.
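As a rough illustration of such a data-dependent prior, the sketch below derives per-sample standard deviations for the prior $\mathcal{N}(0, \mathrm{diag}(\sigma^2))$ from frame-wise mel-spectrogram energies; the normalization and the flooring constant are our own assumptions, not the exact recipe of PriorGrad [38] or SpecGrad [39].

```python
import numpy as np

def data_dependent_prior_std(mel: np.ndarray, hop_length: int, floor: float = 1e-4) -> np.ndarray:
    """Per-sample standard deviations for a diffusion prior N(0, diag(sigma^2)).

    mel: (n_mels, n_frames) mel-spectrogram in linear magnitude.
    Returns a vector of length n_frames * hop_length.
    """
    # Frame-wise energy of the mel-spectrogram, normalized to at most 1.
    energy = np.sqrt(np.mean(mel ** 2, axis=0))
    energy = energy / (energy.max() + 1e-12)
    # Floor the energies so silent frames keep a non-degenerate prior.
    energy = np.maximum(energy, floor)
    # Upsample frame-level scales to sample level: each waveform sample then starts
    # from noise whose magnitude roughly follows the target's temporal envelope.
    return np.repeat(energy, hop_length)

# Sampling the initial noise then reduces to: x_T = std * np.random.randn(std.shape[0])
```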
However, we found that state-of-the-art neural vocoders
provide insufficient quality when they are applied to a singing
voice. In this work, we address this problem by proposing a
hierarchical diffusion model.
III. PROPOSED ONE-SHOT SVC FRAMEWORK
We base our SVC model on the generative adversar-
ial network (GAN)-based voice conversion model called
StarGANv2-VC [40]. Although StarGANv2-VC yields excel-
lent sample quality in speech voice conversion, it has several
limitations when applied to SVC: (i) It is limited to the
many(any)-to-many case and cannot generate the singing voice
of unseen speakers. (ii) When it is applied to a singing voice,
the converted voice often sounds off-pitch. (iii) It does not have
pitch controllability, which is important for SVC as converted
singing voices are often played with an accompaniment and
thus the pitch should be aligned with the accompaniment
music. (iv) The conversion quality is severely degraded when
the singing voices contain reverb and accompaniment music.
[Fig. 1: block diagram showing the source and reference inputs, encoder, style encoder, mapping network, pitch extractor, scale sampler, classifier, domain-specific discriminator, and a generator composed of stacked Conv/LReLU/AdaIN blocks, with panels (a)-(d).]
Fig. 1. Robust one-shot SVC framework. (a) The generator is conditioned on the style vector and the target f0 to be reconstructed in the converted sample. The target f0 is obtained by scaling the source f0 to match the target distribution based on the domain statistics. (b), (c) Unlike the mapping network, the style encoder is domain-independent and yet trained to fool the domain-specific discriminators. The encoders are refined to improve the robustness against input distortions in the second stage of the training.
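The pitch distribution matching mentioned in the caption (panel (a)) can be sketched, under our own simplifying assumptions, as scaling the source F0 contour by a factor alpha derived from the target domain's pitch statistics; the choice of the median statistic and the jitter range below are illustrative, not the model's exact scale sampler.

```python
import numpy as np

def sample_pitch_scale(src_f0: np.ndarray, trg_domain_median_f0: float,
                       jitter_semitones: float = 1.0, rng=None):
    """Scale a source F0 contour toward the target domain's pitch range.

    src_f0: source F0 contour in Hz (0 for unvoiced frames).
    trg_domain_median_f0: median voiced F0 of the target singer/domain in Hz.
    Returns the scale factor alpha and the scaled contour alpha * src_f0.
    """
    rng = rng or np.random.default_rng()
    voiced = src_f0 > 0
    src_median = float(np.median(src_f0[voiced]))
    # Base scale moves the source median pitch onto the target-domain median.
    alpha = trg_domain_median_f0 / src_median
    # Small random jitter (in semitones) so training sees a distribution of scales.
    alpha *= 2.0 ** (rng.uniform(-jitter_semitones, jitter_semitones) / 12.0)
    return alpha, alpha * src_f0
```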
To address these problems, we first extend StarGANv2-VC
by introducing a domain-independent style encoder to enable one-shot voice conversion (Sec. III-A) and introduce
AdaIN-skip pitch conditioning to enable accurate pitch control
(Sec. III-B). We then introduce a two-stage training framework called Robustify to improve the robustness of the feature extraction against distortions (Sec. IV). Finally, we introduce a hierarchical diffusion model for the singing voice vocoder to enable high-quality singing voice waveform generation (Sec. V).
A. One-shot SVC framework with domain-specific and
domain-independent modules
The overview of the proposed one-shot SVC framework
is shown in Figure 1. The generator $G(h, f_0^{trg}, s)$ converts the source mel-spectrogram $X_{src}$ into a sample in the target domain $X_{trg}$ based on an encoder output $h = E(X_{src})$, a target fundamental frequency (F0) $f_0^{trg}$, and a style embedding $s$, where the domain in our case is the singer identity. $h$ is a time-varying feature expected to contain the linguistic contents, while $s$ is a global feature expected to contain the singer's voice characteristics. The discriminator $D$ consists of shared layers followed by domain-specific heads to classify whether the input is real or fake on each target domain via the adversarial loss

$$\mathcal{L}_{adv} = \mathbb{E}_{X, f_0^{trg}, s}\left[\log D(X, y_{src}) + \log\bigl(1 - D(G(h, f_0^{trg}, s), y_{trg})\bigr)\right], \quad (1)$$
where $y_{trg} \in \mathcal{Y}$ denotes the target domain. Although the domain-specific discriminator helps make the generator's outputs realistic and similar to the target domain, we further promote the conversion by introducing an additional classifier $C$. The classifier takes the generated sample as input and is trained to identify the source domain $y_{src}$ via the classification loss $\mathcal{L}_{cl}$, while the generator is trained to fool the classifier via the adversarial classification loss $\mathcal{L}_{ac}$:
$$\mathcal{L}_{cl} = \mathbb{E}_{X, f_0^{trg}, s}\left[\mathrm{CE}\bigl(C(G(h, f_0^{trg}, s)), y_{src}\bigr)\right], \quad (2)$$
$$\mathcal{L}_{ac} = \mathbb{E}_{X, f_0^{trg}, s}\left[\mathrm{CE}\bigl(C(G(h, f_0^{trg}, s)), y_{trg}\bigr)\right], \quad (3)$$
where CE denotes the cross-entropy loss. The style embedding $s$ is obtained by either the style encoder or the mapping network. Given the target domain $y_{trg}$, the mapping network $M$ transforms a random latent code $z \sim \mathcal{N}(0, 1)$ into the style embedding as $s = M(z, y_{trg})$. In the original StarGANv2
[40], [41], the mapping network, style encoder, and discrim-
inator have domain-specific projection heads to enable the
model to easily handle domain-specific information and focus
on diversity within the domain. However, this architecture
limits the conversion to the pre-defined domains, which is many-to-many SVC in our case. To enable any-to-any
one-shot SVC, we propose using a domain-independent style
encoder $S(X)$ while keeping the domain-specific heads for
the mapping network and discriminator. By doing so, the
style encoder does not require the domain code and can thus transform any singer's voice at inference time, while
the domain-specific mapping network and discriminator still
guide the model to handle the domain-specific information. We
empirically demonstrate that this design does not deteriorate
the conversion quality compared to the original many-to-many
model. Note that the mapping network is not utilized for
inference, as our goal is one-shot SVC, but it is still useful for guiding the model to learn domain-specific characteristics
and the diversity within the domains.
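For concreteness, the following PyTorch-style sketch shows how the objectives in Eqs. (1)-(3) could be computed for a single conversion step. The module interfaces (a generator G, a domain-conditioned discriminator D returning logits, and a classifier C) are our own assumptions, and the adversarial term is written in the non-saturating logit form rather than literally as Eq. (1); it is an illustrative sketch, not the actual training code.

```python
import torch
import torch.nn.functional as F

def conversion_losses(G, D, C, h, f0_trg, style, x_src, y_src, y_trg):
    """Sketch of the objectives in Eqs. (1)-(3) for one conversion step.

    h:       encoder output E(x_src) for the source mel-spectrogram x_src
    f0_trg:  target F0 contour, style: style embedding s
    y_src / y_trg: source / target domain (singer) indices, shape (batch,)
    """
    x_fake = G(h, f0_trg, style)

    # Eq. (1): domain-specific adversarial loss (non-saturating logit form).
    d_real = D(x_src, y_src)            # discriminator logits for real samples
    d_fake = D(x_fake.detach(), y_trg)  # logits for converted samples, target domain
    loss_d = F.softplus(-d_real).mean() + F.softplus(d_fake).mean()
    loss_g_adv = F.softplus(-D(x_fake, y_trg)).mean()

    # Eq. (2): the classifier C learns to recover the *source* singer from the fake.
    loss_cl = F.cross_entropy(C(x_fake.detach()), y_src)
    # Eq. (3): the generator tries to make C predict the *target* singer instead.
    loss_ac = F.cross_entropy(C(x_fake), y_trg)

    return loss_d, loss_cl, loss_g_adv + loss_ac
```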
B. Pitch conditioning
Unlike speech voice conversion, accurate pitch reconstruc-
tion is essential for SVC to maintain the melodic content.
Although the StarGANv2-VC model in [40] uses the f0 feature
extracted from the source by an F0 estimation network to
guide the generation, the output is only weakly constrained
to have a normalized F0 trajectory similar to that of the
source. Therefore, the absolute pitch of the converted sample