SOURCE-FILTER HIFI-GAN: FAST AND PITCH CONTROLLABLE
HIGH-FIDELITY NEURAL VOCODER
Reo Yoneyama1, Yi-Chiao Wu2, and Tomoki Toda1
1Nagoya University, Japan, 2Meta Reality Labs Research, USA
ABSTRACT
Our previous work, the unified source-filter GAN (uSFGAN) vocoder, introduced a novel
architecture based on the source-filter theory into the parallel waveform generative
adversarial network to achieve high voice quality and pitch controllability. However,
its high temporal resolution inputs result in high computation costs. Although the
HiFi-GAN vocoder achieves fast high-fidelity voice generation thanks to its efficient
upsampling-based generator architecture, its pitch controllability is severely limited.
To realize a fast and pitch-controllable high-fidelity neural vocoder, we introduce the
source-filter theory into HiFi-GAN by hierarchically conditioning the resonance
filtering network on well-estimated source excitation information. According to the
experimental results, our proposed method outperforms HiFi-GAN and uSFGAN on a singing
voice generation task in terms of voice quality and synthesis speed on a single CPU.
Furthermore, unlike the uSFGAN vocoder, the proposed method can be easily
adopted/integrated into real-time applications and end-to-end systems.
Index Terms—Speech synthesis, neural vocoder, source-
filter model, generative adversarial networks
1. INTRODUCTION
A neural vocoder is a generative deep neural network (DNN) that generates raw waveforms
from input acoustic features. The vocoder's capability (e.g., voice quality,
controllability, and synthesis speed) dramatically affects the overall performance of
speech generation applications such as text-to-speech (TTS), singing voice synthesis
(SVS), and voice conversion. In particular, controllability of the fundamental
frequency (F0) is essential for flexibly generating desired intonation and musical
pitch patterns.
HiFi-GAN [1] is one of the most popular neural vocoders
due to its convincing voice quality and efficient synthesis.
HiFi-GAN gradually upsamples the low temporal resolution
acoustic features to match the high temporal resolution raw
waveform. As the computational cost depends on the sequence
length, the upsampling architecture, which works from much
lower temporal resolutions, facilitates fast synthesis. Although
HiFi-GAN and recent high-fidelity neural vocoders [2–5] have
achieved convincing voice quality, they lack F0 controllability.
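As an informal illustration of why working at frame rate is cheap, the following
back-of-the-envelope sketch compares the number of time steps a convolutional layer
processes before and after upsampling. The sampling rate and hop size are assumed
values for illustration only, not settings from this paper.

```python
# Back-of-the-envelope illustration: the cost of a convolutional layer grows
# with the number of time steps it processes.  The sampling rate and hop size
# below are assumed values, not settings from the paper.
sample_rate = 24000          # waveform samples per second (assumption)
hop_size = 240               # waveform samples per acoustic feature frame (assumption)

frame_rate = sample_rate // hop_size   # 100 feature frames per second
ratio = sample_rate / frame_rate       # 240.0

# A layer applied before upsampling sees `ratio` times fewer time steps than the
# same layer applied directly at waveform resolution, as in models that take
# waveform-rate inputs.
print(f"{ratio:.0f}x fewer time steps per layer at frame rate")
```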
To solve the aforementioned problem, unified source-filter GAN (uSFGAN) [6, 7] attempts
to introduce a reasonable inductive bias of human voice production into DNNs to
simultaneously achieve high voice quality and F0 controllability. uSFGAN factorizes the
generator network of Quasi-Periodic Parallel WaveGAN [8] into a source excitation
generation network (source-network) and a resonance filtering network (filter-network).
However, because of its very high temporal resolution inputs, uSFGAN suffers from slow
synthesis, which makes it challenging to adopt/integrate it in real-time applications
and end-to-end systems [9–11].
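The following is a conceptual sketch of this source-filter factorization, not the
actual uSFGAN architecture: plain dilated-convolution stacks stand in for the
quasi-periodic WaveNet-style blocks of [6-8], and the input layout is a hypothetical
waveform-rate excitation prior plus upsampled acoustic features. It is meant only to
show the source-network/filter-network split and why processing every layer at full
waveform resolution is costly.

```python
# Conceptual sketch (not the actual uSFGAN architecture) of factorizing a
# waveform generator into a source-network and a filter-network.  Every layer
# runs at full waveform resolution, which is what makes this class of models
# slow compared to upsampling-based generators.
import torch
import torch.nn as nn


def dilated_stack(channels, num_layers=4):
    # Stack of length-preserving dilated 1-D convolutions (simplified stand-in
    # for the quasi-periodic WaveNet-style blocks).
    return nn.Sequential(*[
        nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=3,
                      padding=2 ** i, dilation=2 ** i),
            nn.ReLU(),
        )
        for i in range(num_layers)
    ])


class SourceFilterVocoder(nn.Module):
    def __init__(self, cond_channels=80, channels=64):
        super().__init__()
        self.embed = nn.Conv1d(1 + cond_channels, channels, kernel_size=1)
        self.source_net = dilated_stack(channels)  # models source excitation generation
        self.filter_net = dilated_stack(channels)  # models resonance filtering
        self.to_excitation = nn.Conv1d(channels, 1, kernel_size=1)
        self.to_waveform = nn.Conv1d(channels, 1, kernel_size=1)

    def forward(self, prior, cond):
        # prior: (B, 1, T) waveform-rate excitation prior (e.g., noise or sine signal)
        # cond:  (B, C, T) acoustic features already upsampled to waveform resolution
        h = self.source_net(self.embed(torch.cat([prior, cond], dim=1)))
        # In uSFGAN, the intermediate source output is regularized to resemble
        # a source excitation signal.
        excitation = self.to_excitation(h)
        return self.to_waveform(self.filter_net(h)), excitation
```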
For fast high-fidelity generation and F0 controllability, we propose source-filter
HiFi-GAN (SiFi-GAN), which introduces source-filter modeling into HiFi-GAN. The
proposed model is based on the V1 configuration described in the HiFi-GAN paper [1]
because of its superior voice quality. SiFi-GAN has two separate upsampling networks, a
source-network and a filter-network, which are connected in series to simulate a pseudo
cascade mechanism of the source-filter theory [12]. The parameters of HiFi-GAN V1 are
pruned to compensate for the additional computation costs of the source-filter
modeling. According to the experimental results, SiFi-GAN achieves faster synthesis
than HiFi-GAN V1 on a single CPU with better voice quality. Also, SiFi-GAN achieves F0
controllability comparable to that of the uSFGAN-based model with much faster
synthesis.
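The sketch below illustrates how two upsampling networks can be connected in series as
a pseudo cascade, with the filter path conditioned on the source path at each matching
temporal resolution. The input layout, the 1x1 conditioning convolutions, the channel
widths, and the upsample rates are illustrative assumptions rather than the pruned
HiFi-GAN V1 configuration used in this paper, and the MRF modules are omitted for
brevity.

```python
# High-level sketch of a pseudo cascade of two upsampling networks connected
# in series: the filter path is conditioned on the source path at each
# resolution.  All module choices are illustrative assumptions.
import torch
import torch.nn as nn


class PseudoCascade(nn.Module):
    def __init__(self, feat_channels=80, channels=256, rates=(8, 8, 2, 2)):
        super().__init__()
        self.src_pre = nn.Conv1d(feat_channels + 1, channels, 7, padding=3)
        self.flt_pre = nn.Conv1d(feat_channels, channels, 7, padding=3)
        self.src_ups = nn.ModuleList()
        self.flt_ups = nn.ModuleList()
        self.cond = nn.ModuleList()
        c = channels
        for r in rates:
            self.src_ups.append(nn.ConvTranspose1d(c, c // 2, 2 * r, stride=r, padding=r // 2))
            self.flt_ups.append(nn.ConvTranspose1d(c, c // 2, 2 * r, stride=r, padding=r // 2))
            self.cond.append(nn.Conv1d(c // 2, c // 2, 1))  # injects source features into the filter path
            c //= 2
        self.src_out = nn.Conv1d(c, 1, 7, padding=3)  # excitation-like intermediate output
        self.flt_out = nn.Conv1d(c, 1, 7, padding=3)  # final speech waveform

    def forward(self, feats, f0):
        # feats: (B, C, N) frame-rate acoustic features; f0: (B, 1, N) frame-rate F0
        # (hypothetical input layout for this sketch).
        s = self.src_pre(torch.cat([feats, f0], dim=1))
        f = self.flt_pre(feats)
        for src_up, flt_up, cond in zip(self.src_ups, self.flt_ups, self.cond):
            s = src_up(torch.relu(s))
            f = flt_up(torch.relu(f)) + cond(s)  # hierarchical conditioning on source information
        return torch.tanh(self.flt_out(f)), self.src_out(s)
```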
2. BASELINE HIFI-GAN AND USFGAN
This section describes two neural vocoders, HiFi-GAN [1] and uSFGAN [6, 7], which form
the basis of the proposed SiFi-GAN and serve as our baselines.
2.1. HiFi-GAN
HiFi-GAN [1] is a generative adversarial network (GAN) [13] based high-fidelity neural
vocoder with a carefully designed generator and multi-period and multi-scale
discriminators. The generator receives a mel-spectrogram as input and upsamples it to
match the temporal resolution of the target raw waveform using transposed convolutions
followed by multi-receptive field fusion (MRF) modules. An MRF module comprises several
residual blocks, each of which has customized kernel and dilation sizes to capture the
input features over different receptive fields. The hyperparameters of the MRF modules
can be chosen to trade off synthesis efficiency against voice quality.
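To make the MRF structure concrete, the following is a minimal sketch of parallel
residual blocks with different kernel and dilation sizes whose outputs are averaged;
the specific kernel sizes and dilations are illustrative assumptions and are not
guaranteed to match the V1 configuration.

```python
# Sketch of a multi-receptive field fusion (MRF) module in the spirit of
# HiFi-GAN: parallel residual blocks with different kernel and dilation sizes,
# whose outputs are averaged.  Kernel/dilation choices are illustrative.
import torch
import torch.nn as nn


class ResBlock(nn.Module):
    def __init__(self, channels, kernel_size, dilations):
        super().__init__()
        self.convs = nn.ModuleList([
            nn.Conv1d(channels, channels, kernel_size,
                      dilation=d, padding=d * (kernel_size - 1) // 2)
            for d in dilations
        ])

    def forward(self, x):
        for conv in self.convs:
            x = x + conv(torch.relu(x))  # residual connection keeps the time resolution
        return x


class MRF(nn.Module):
    """Average of residual blocks, each observing a different receptive field."""
    def __init__(self, channels, kernel_sizes=(3, 7, 11), dilations=(1, 3, 5)):
        super().__init__()
        self.blocks = nn.ModuleList(
            [ResBlock(channels, k, dilations) for k in kernel_sizes])

    def forward(self, x):
        return sum(block(x) for block in self.blocks) / len(self.blocks)
```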
The generator learns to fool the discriminators while the
discriminators learn to distinguish between the generated and