Part of this work on the hierarchical diffusion model-based
neural vocoder was published as a conference paper [13],
which focused on the vocoding task for ground-truth acoustic
features. In this paper, we propose a novel one-shot SVC
method that is robust against distortions and investigate the
hierarchical diffusion model-based neural vocoder in the
challenging SVC scenario.
Audio samples are available at https://t-naoya.github.io/rosvc/.
II. RELATED WORK
A. Singing voice conversion
The unique challenges in SVC are to accurately recover the
target pitch and to handle a wide variety of pitch, loudness,
and musical expression. Initial SVC approaches tackled the problem
by utilizing parallel data [1], [14]. Several methods have been
proposed to overcome the necessity of the expensive parallel
data by using deep generative models such as autoregressive
models [2], [3], [7], [15], variational autoencoders [5], GANs
[4], [8], [9], [11], and diffusion models [6]. However, they
are limited to many(any)-to-many or many-to-one cases and
cannot handle unseen target singers. Other approaches leverage
a speaker recognition network (SRN) to extract the speaker
embeddings from reference audio [15]. Li et al. [16] investi-
gated a hierarchical speaker representation for one-shot SVC.
Our approach differs in that their training objective is to
reconstruct the input voice from disentangled features, with
conversion performed only at inference time by changing the
speaker embeddings; in our approach, the input voices are
converted during training, so the model does not suffer from
a training-inference mode gap and the converted samples are
more directly constrained. Moreover, all previous approaches
focus on clean singing voices and are vulnerable to distortions. In contrast,
our work aims at any-to-any SVC on possibly distorted singing
voice without parallel data.
B. One-shot voice conversion
One-shot VC has been actively investigated in the speech
domain [17]–[20]. AdaIN-VC [17] uses a speaker encoder
to extract speaker embeddings and condition the decoder
using adaptive instance normalization (AdaIN) layers. VQVC+
[18] extracts speaker-independent content embeddings us-
ing vector quantization and utilizes the residual information
as speaker information. Fragment-VC [20] utilizes a cross-
attention mechanism to use fragments from reference samples
to produce a converted voice. Although these approaches have
shown promising results on speech, they do not scale to
singing voices due to their simplicity, as shown in our
experiments.
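The AdaIN conditioning used by AdaIN-VC [17] can be illustrated with a short sketch. This is not the authors' implementation; the function name and the toy shapes are our own assumptions. The idea is to normalize each channel of the content features over time and then apply a scale and bias predicted from the speaker embedding:

```python
import numpy as np

def adain(content, style_scale, style_bias, eps=1e-5):
    """Adaptive instance normalization (illustrative sketch).

    Each channel of `content` (shape: channels x time) is normalized to
    zero mean and unit variance over time, then re-scaled and shifted by
    per-channel statistics derived from a speaker embedding.
    """
    mean = content.mean(axis=1, keepdims=True)
    std = content.std(axis=1, keepdims=True)
    normalized = (content - mean) / (std + eps)
    return style_scale[:, None] * normalized + style_bias[:, None]

# Toy usage: 4-channel content features over 10 frames, with a
# hypothetical speaker style that doubles the channel variance.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 10))
out = adain(x, style_scale=np.full(4, 2.0), style_bias=np.zeros(4))
```

In AdaIN-VC, `style_scale` and `style_bias` are produced by the speaker encoder, so the content pathway carries no speaker statistics of its own.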
C. Noise robust voice conversion
There have been a few attempts to improve the robustness
of speech VC against noise by using clean-noisy speech
pairs to learn noise-robust representations [21], [22]. These
approaches are not directly applicable to our setting because
our model converts the singer identity during training, so there
is no denoised target that we can use to train the model via
reconstruction losses. Xie et al. proposed leveraging a
pre-trained denoising model and directly using noisy speech
as the target signal [23]. In these studies, environmental sounds
are used as noise, which is uncorrelated to voice. In contrast,
we consider reverb and accompaniment music as noise, which
can be more challenging because accompaniment music often
contains a similar harmonic structure to the singing voice, and
strong reverb effects on the singing voice make robust feature
extraction more difficult.
D. Neural vocoder
Voice conversion models often operate in acoustic feature
domains to efficiently model the mapping of speech charac-
teristics. Neural vocoders are often used for generating high-
quality waveform from acoustic features [24]–[29]. A number
of generative models have been adopted to neural vocoders
including autoregressive models [24], [25], [30], generative
adversarial networks (GANs) [28], [31]–[33], and flow-based
models [26], [27]. Recently, diffusion models [34], [35] have
been shown to generate high-fidelity samples in a wide range
of areas [36] and have been adopted for neural vocoders in
the speech domain [29], [37]. Although they are efficiently
trained by maximizing the evidence lower bound (ELBO) and
can produce high-quality speech data, the inference speed is
relatively slow compared to other non-autoregressive model-
based vocoders as they require many iterations to generate
the data. To address this problem, PriorGrad [38] introduces
a data-dependent prior, i.e., a Gaussian distribution with a
diagonal covariance matrix whose entries are the frame-wise energies
of the mel-spectrogram. As the noise drawn from the data-
dependent prior provides an initial waveform closer to the
target than the noise from a standard Gaussian, PriorGrad
achieves faster convergence and inference with superior
performance. SpecGrad [39] further improves the prior by
incorporating the spectral envelope of the mel-spectrogram to
introduce noise that is more similar to the target signal.
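As a rough sketch (not PriorGrad's actual implementation), the data-dependent prior can be illustrated as follows. The function name, the exact frame-energy definition, and the toy mel-spectrogram are illustrative assumptions; the point is that the starting noise is scaled per frame by the mel energy, so loud regions start from louder noise:

```python
import numpy as np

def data_dependent_prior(mel, hop_length, eps=1e-6, rng=None):
    """Sample waveform-length noise whose per-sample variance follows
    the frame-wise energy of a mel-spectrogram (simplified sketch of a
    PriorGrad-style data-dependent prior).

    mel: (n_mels, n_frames) magnitude mel-spectrogram.
    """
    rng = np.random.default_rng() if rng is None else rng
    frame_energy = (mel ** 2).mean(axis=0) + eps   # (n_frames,)
    frame_std = np.sqrt(frame_energy)
    # Upsample the per-frame std to the waveform sampling rate.
    sample_std = np.repeat(frame_std, hop_length)
    return rng.normal(size=sample_std.shape) * sample_std

# Toy usage: a quiet first half followed by a loud second half.
mel = np.concatenate([np.full((80, 50), 0.01),
                      np.full((80, 50), 1.0)], axis=1)
noise = data_dependent_prior(mel, hop_length=256,
                             rng=np.random.default_rng(0))
```

Because this initial noise already tracks the target's energy contour, the reverse diffusion process has less work to do than when starting from a standard Gaussian, which is the intuition behind the faster convergence reported for PriorGrad.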
However, we found that state-of-the-art neural vocoders
provide insufficient quality when applied to singing
voices. In this work, we address this problem by proposing a
hierarchical diffusion model.
III. PROPOSED ONE-SHOT SVC FRAMEWORK
We base our SVC model on the generative adversar-
ial network (GAN)-based voice conversion model called
StarGANv2-VC [40]. Although StarGANv2-VC yields excellent
sample quality in speech voice conversion, it has several
limitations when applied to SVC: (i) It is limited to the
many(any)-to-many case and cannot generate the singing voices
of unseen speakers. (ii) When applied to a singing voice,
the converted voice often sounds off-pitch. (iii) It does not have
pitch controllability, which is important for SVC as converted
singing voices are often played with an accompaniment and
thus the pitch should be aligned with the accompaniment
music. (iv) The conversion quality is severely degraded when
the singing voices contain reverb and accompaniment music.