
Fig. 1. (a) Spectrogram and (b) waveform generated by PWG.
2. HIFI-WAVEGAN
Before introducing our proposed HiFi-WaveGAN, we first analyzed the limitations of TTS neural vocoders when synthesizing 48 kHz singing voices, using the PWG vocoder as an example.
In Figure 1 (a), we observed glitches in the low-frequency part that
disrupt the continuity of the spectrogram. These glitches arise be-
cause the receptive field of PWG is insufficient to cover the long
continuous pronunciation required for singing voices. Furthermore,
the high-frequency harmonics in Figure 1 (a) appear blurry, indicat-
ing that the PWG neural vocoder struggles to accurately reconstruct
the high-frequency components necessary for SVS tasks. We further
investigated the source of these problems in the PWG spectrogram,
as well as in other neural TTS vocoders. Figure 1 (b) reveals periodic
distortions in the waveform generated by the PWG vocoder, leading
to trembling and low-quality singing voices.
To address the aforementioned issues, we describe our HiFi-WaveGAN in detail in the remainder of this section. As depicted in Figure 2, our HiFi-WaveGAN consists of
an ExWaveNet generator, responsible for generating high-quality
singing voices. Additionally, we employ two independent discriminators: a Multi-Period Discriminator (MPD) and a Multi-Resolution Spectrogram Discriminator (MRSD). These discriminators are designed to distinguish real from fake waveforms based on periodic patterns and consecutive long-term dependencies, respectively.
2.1. Extended WaveNet
Similar to the generator in PWG [8], we also adopt a WaveNet-based
model as the generator. However, recognizing that singing voices ex-
hibit longer continuous pronunciation compared to speech [19], we
enhance the architecture of WaveNet by utilizing an 18-layer one-
dimensional CNN with larger kernel sizes, resulting in an improved
model called Extended WaveNet (ExWaveNet). Specifically, we
evenly divide the 18 layers of the generator into three stacks, and the
kernel sizes in each stack are set to {3,3,9,9,17,17}, determined
through neural architecture search [24]. This modification enables the network to effectively capture the longer continuous pronunciation in singing voices at high sampling rates, owing to the increased receptive field.
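To make this structure concrete, the following PyTorch-style sketch stacks 18 dilated one-dimensional convolutions in three stacks with the kernel-size pattern {3, 3, 9, 9, 17, 17}; the channel width, dilation schedule, and residual wiring are illustrative assumptions, not the exact ExWaveNet configuration.

```python
import torch
import torch.nn as nn

class ExWaveNetStack(nn.Module):
    """Illustrative dilated Conv1d stack: 3 stacks x 6 layers = 18 layers,
    kernel sizes {3, 3, 9, 9, 17, 17} per stack. Dilations and channel
    width are assumptions made for this sketch."""
    def __init__(self, channels=64, kernel_sizes=(3, 3, 9, 9, 17, 17), num_stacks=3):
        super().__init__()
        layers = []
        for _ in range(num_stacks):
            for j, k in enumerate(kernel_sizes):
                d = 2 ** j  # hypothetical dilation schedule (not given in the paper)
                layers.append(nn.Conv1d(channels, channels, kernel_size=k,
                                        dilation=d, padding=(k - 1) // 2 * d))
        self.layers = nn.ModuleList(layers)
        self.act = nn.LeakyReLU(0.2)

    def forward(self, x):  # x: (batch, channels, time)
        for conv in self.layers:
            x = x + self.act(conv(x))  # residual connection per layer
        return x
```

The large kernels in the later layers of each stack are what enlarge the receptive field enough to span long sustained vowels at 48 kHz.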
In addition to modeling long continuous pronunciation, restor-
ing expressiveness in singing voices is also crucial. To achieve this,
we concatenate the pitch with the mel-spectrogram as the input to
the upsampling network, following the same approach as in PWG.
The upsampled representation is then concatenated with a pulse se-
quence T, which will be explained in detail in the next paragraph,
serving as the conditional input for generating singing voices with
strong expressiveness. Moreover, we observed that upsampling the
random input noise using an identical network also improves the ex-
pressiveness of the synthesized audio.
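A minimal sketch of how such a conditional input could be assembled is given below; the single transposed convolution, the hop size of 256, and the tensor shapes are placeholders standing in for PWG's upsampling network rather than the exact implementation.

```python
import torch
import torch.nn as nn

class ConditionEncoder(nn.Module):
    """Sketch: concatenate pitch with the mel-spectrogram, upsample to the
    waveform rate, then concatenate the pulse sequence T. Layer choices
    (single ConvTranspose1d, hop=256, hidden=64) are assumptions."""
    def __init__(self, n_mels=80, hop=256, hidden=64):
        super().__init__()
        # stand-in for PWG's upsampling network
        self.upsample = nn.ConvTranspose1d(n_mels + 1, hidden,
                                           kernel_size=2 * hop, stride=hop,
                                           padding=hop // 2)

    def forward(self, mel, f0, pulse):
        # mel: (B, n_mels, frames), f0: (B, 1, frames), pulse: (B, 1, samples)
        c = torch.cat([mel, f0], dim=1)      # pitch concatenated with mel, as in PWG
        c = self.upsample(c)                 # frame rate -> sample rate
        c = c[..., :pulse.size(-1)]          # trim to the waveform length
        return torch.cat([c, pulse], dim=1)  # append the pulse sequence T
```

The random input noise mentioned above would be passed through an upsampling network of the same form before being fed to the generator.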
As mentioned at the beginning of this section, there are periodic distortions in the waveforms generated by TTS neural vocoders, leading to low-quality synthesized singing voices. To address this,
we propose an additional Pulse Extractor (PE) to generate a pulse
sequence as a constraint condition during waveform synthesis. As
shown in Fig. 2, the extractor takes the mel-spectrogram and pitch as
input. The pulse is extracted at each extreme point of the waveform envelope, determined by the voiced/unvoiced (V/UV) decision and the mel-spectrogram, and can be formulated as:
\[
T[i] =
\begin{cases}
\|M[i]\|_F, & UV = 1,\ i = \frac{s}{f_0} \\
0, & UV = 1,\ i \neq \frac{s}{f_0} \\
\text{noise}, & UV = 0
\end{cases}
\tag{1}
\]
where $M$ represents the mel-spectrogram, $i$ is the time index, $\|\cdot\|_F$ denotes the Frobenius norm, $T[i]$ indicates the pulse value at index $i$, and $s$ and $f_0$ denote the sampling rate and pitch, respectively. The noise in the formula is generated from a Gaussian distribution.
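The sketch below illustrates Eq. (1), assuming that the V/UV decision and f0 are given per mel frame and mapped to samples via a hop size; the frame-to-sample mapping, the hop of 256, and the per-frame Frobenius norm are illustrative choices, not the exact Pulse Extractor.

```python
import torch

def pulse_sequence(mel, f0, uv, sample_rate=48000, hop=256):
    """Sketch of Eq. (1): a pulse every s/f0 samples in voiced regions,
    zeros elsewhere in voiced frames, Gaussian noise in unvoiced frames.
    mel: (n_mels, frames), f0/uv: (frames,). Hop-based frame mapping is
    an assumption of this sketch."""
    frames = mel.size(-1)
    n_samples = frames * hop
    T = torch.zeros(n_samples)
    i = 0
    while i < n_samples:
        frame = min(i // hop, frames - 1)
        if uv[frame] > 0 and f0[frame] > 0:              # voiced: place a pulse
            T[i] = torch.linalg.norm(mel[:, frame])      # ||M[i]||_F for that frame
            i += max(1, int(round(sample_rate / float(f0[frame]))))  # step by s/f0
        else:                                            # unvoiced: fill with noise
            T[i:i + hop] = torch.randn(min(hop, n_samples - i))
            i += hop
    return T
```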
2.2. Discriminators
Identifying both consecutive long-term dependencies and periodic
patterns plays a crucial role in modeling realistic audio [9]. In the
proposed HiFi-WaveGAN, we employ two independent discrimina-
tors to evaluate singing voices from these two aspects.
The first discriminator is MRSD, adapted from UnivNet [21],
which identifies consecutive long-term dependencies in singing
voices from the spectrogram. We transform both real and fake
singing voices into spectrograms using different combinations of
FFT size, window length, and shift size. Then, two-dimensional
convolutional layers are applied to the spectrograms. As depicted
in Fig. 2, the model employs K sub-discriminators, each utilizing a specific combination of spectrogram inputs. In our implementation, K is set to four.
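A simplified sketch of such a multi-resolution spectrogram discriminator is shown below; the four (FFT size, shift size, window length) combinations and the convolutional widths are assumptions, since the exact settings follow UnivNet [21].

```python
import torch
import torch.nn as nn

class MRSD(nn.Module):
    """Sketch of a multi-resolution spectrogram discriminator: K = 4
    sub-discriminators, each applied to an STFT magnitude computed with a
    different (n_fft, hop, win) combination (values are assumptions)."""
    def __init__(self, resolutions=((512, 128, 512), (1024, 256, 1024),
                                    (2048, 512, 2048), (4096, 1024, 4096))):
        super().__init__()
        self.resolutions = resolutions
        self.subs = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(1, 32, (3, 3), padding=1), nn.LeakyReLU(0.2),
                nn.Conv2d(32, 32, (3, 3), stride=(2, 2), padding=1), nn.LeakyReLU(0.2),
                nn.Conv2d(32, 1, (3, 3), padding=1),
            ) for _ in resolutions
        ])

    def forward(self, wav):  # wav: (B, T), real or generated audio
        outputs = []
        for (n_fft, hop, win), sub in zip(self.resolutions, self.subs):
            spec = torch.stft(wav, n_fft, hop_length=hop, win_length=win,
                              window=torch.hann_window(win, device=wav.device),
                              return_complex=True).abs()   # magnitude spectrogram
            outputs.append(sub(spec.unsqueeze(1)))          # add a channel dimension
        return outputs
```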
The second discriminator is MPD, identical to the one used in
HiFiGAN [9]. It transforms the one-dimensional waveform of length T into 2-D data with height T/p and width p by setting the period p to M different values, resulting in M independent sub-discriminators within MPD. In this paper, we set M to five and p to [2, 3, 5, 7, 11]. As described in [9], this design allows the discriminator to capture distinct implicit structures by examining different parts of the input audio.
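The reshaping step can be sketched as follows; zero padding to a multiple of p is used here for brevity and is an assumption of this sketch, not necessarily HiFiGAN's exact padding scheme.

```python
import torch
import torch.nn.functional as F

def to_period_2d(wav, p):
    """Sketch of MPD's reshaping: pad a 1-D waveform of length T so it is
    divisible by the period p, then view it as 2-D data of height T/p and
    width p. The paper uses p in [2, 3, 5, 7, 11]."""
    b, t = wav.shape
    if t % p != 0:
        wav = F.pad(wav, (0, p - t % p))  # zero-pad to a multiple of p (assumption)
        t = wav.shape[-1]
    return wav.view(b, t // p, p)         # (B, T/p, p), fed to a 2-D sub-discriminator
```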
2.3. Loss function
Similar to other models [9, 19], we adopt a weighted combination of
multiple loss terms as the final loss function formulated by Eq. (2)
and Eq. (3) to supervise the training process of our HiFi-WaveGAN.
\[
\mathcal{L}_D = \mathcal{L}_{adv}(D; G), \tag{2}
\]
\[
\mathcal{L}_G = \lambda_1 \mathcal{L}_{adv}(G; D) + \lambda_2 \mathcal{L}_{aux} + \lambda_3 \mathcal{L}_{fm}, \tag{3}
\]
where $\mathcal{L}_{adv}$, $\mathcal{L}_{aux}$, and $\mathcal{L}_{fm}$ denote the adversarial loss, auxiliary spectrogram-phase loss, and feature matching loss, respectively. In this paper, $\lambda_1$, $\lambda_2$, and $\lambda_3$ are set to 1, 120, and 10, respectively.
2.3.1. Adversarial loss
For the adversarial loss, we adopt the least-squares formulation of LS-GAN [25] to avoid gradient vanishing. The formulas are shown as
\[
\mathcal{L}_{adv}(G; D) = \mathbb{E}_{z \sim \mathcal{N}(0,1)}\big[(1 - D(G(z)))^2\big], \tag{4}
\]
\[
\mathcal{L}_{adv}(D; G) = \mathbb{E}_{y \sim p_{\mathrm{data}}}\big[(1 - D(y))^2\big] + \mathbb{E}_{z \sim \mathcal{N}(0,1)}\big[D(G(z))^2\big], \tag{5}
\]
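For concreteness, a minimal sketch of Eqs. (2)-(5) follows; d_real and d_fake stand for discriminator outputs on real and generated audio, while l_aux and l_fm are assumed to be computed elsewhere, so the function names and signatures are illustrative only.

```python
import torch

def lsgan_d_loss(d_real, d_fake):
    """Sketch of Eqs. (2)/(5): least-squares discriminator loss (LS-GAN)."""
    return ((1.0 - d_real) ** 2).mean() + (d_fake ** 2).mean()

def lsgan_g_loss(d_fake):
    """Sketch of Eq. (4): least-squares adversarial loss for the generator."""
    return ((1.0 - d_fake) ** 2).mean()

def total_g_loss(d_fake, l_aux, l_fm, lambdas=(1.0, 120.0, 10.0)):
    """Sketch of Eq. (3): weighted sum of adversarial, auxiliary
    spectrogram-phase, and feature matching terms with weights (1, 120, 10)."""
    l1, l2, l3 = lambdas
    return l1 * lsgan_g_loss(d_fake) + l2 * l_aux + l3 * l_fm
```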