SOURCE-FILTER HIFI-GAN: FAST AND PITCH CONTROLLABLE
HIGH-FIDELITY NEURAL VOCODER
Reo Yoneyama1, Yi-Chiao Wu2, and Tomoki Toda1
1Nagoya University, Japan, 2Meta Reality Labs Research, USA
ABSTRACT
Our previous work, the unified source-filter GAN (uSFGAN) vocoder, introduced a novel
architecture based on the source-filter theory into the parallel waveform generative
adversarial network to achieve high voice quality and pitch controllability. However,
its high temporal resolution inputs result in high computation costs. Although the
HiFi-GAN vocoder achieves fast high-fidelity voice generation thanks to its efficient
upsampling-based generator architecture, its pitch controllability is severely limited.
To realize a fast and pitch-controllable high-fidelity neural vocoder, we introduce the
source-filter theory into HiFi-GAN by hierarchically conditioning the resonance
filtering network on well-estimated source excitation information. According to the
experimental results, our proposed method outperforms HiFi-GAN and uSFGAN on a singing
voice generation task in terms of voice quality and synthesis speed on a single CPU.
Furthermore, unlike the uSFGAN vocoder, the proposed method can be easily
adopted/integrated into real-time applications and end-to-end systems.
Index Terms—Speech synthesis, neural vocoder, source-
filter model, generative adversarial networks
1. INTRODUCTION
A neural vocoder is a generative deep neural network (DNN) that generates raw waveforms
from input acoustic features. The vocoder's capability (e.g., voice quality,
controllability, and synthesis speed) dramatically affects the overall performance of
speech generation applications such as text-to-speech (TTS), singing voice synthesis
(SVS), and voice conversion. In particular, controllability of the fundamental
frequency (F0) is essential for flexibly generating desired intonation and musical
pitch patterns.
HiFi-GAN [1] is one of the most popular neural vocoders
due to its convincing voice quality and efficient synthesis.
HiFi-GAN gradually upsamples the low temporal resolution
acoustic features to match the high temporal resolution raw
waveform. As the computational cost depends on the sequence
length, the upsampling architecture, which works from much
lower temporal resolutions, facilitates fast synthesis. Although
HiFi-GAN and recent high-fidelity neural vocoders [2–5] have
achieved convincing voice quality, they lack F0 controllability.
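As an informal illustration of why working at frame rate is cheap, the following
back-of-the-envelope sketch compares the number of time steps a convolutional layer
processes before and after upsampling. The sampling rate and hop size are assumed
values for illustration only, not settings from this paper.

```python
# Back-of-the-envelope illustration: the cost of a convolutional layer grows
# with the number of time steps it processes.  The sampling rate and hop size
# below are assumed values, not settings from the paper.
sample_rate = 24000          # waveform samples per second (assumption)
hop_size = 240               # waveform samples per acoustic feature frame (assumption)

frame_rate = sample_rate // hop_size   # 100 feature frames per second
ratio = sample_rate / frame_rate       # 240.0

# A layer applied before upsampling sees `ratio` times fewer time steps than the
# same layer applied directly at waveform resolution, as in models that take
# waveform-rate inputs.
print(f"{ratio:.0f}x fewer time steps per layer at frame rate")
```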
To solve the aforementioned problem, unified source-filter GAN (uSFGAN) [6, 7] attempts
to introduce a reasonable inductive bias of human voice production into DNNs to
simultaneously achieve high voice quality and F0 controllability. uSFGAN factorizes the
generator network of Quasi-Periodic Parallel WaveGAN [8] into a source excitation
generation network (source-network) and a resonance filtering network (filter-network).
However, because of its very high temporal resolution inputs, uSFGAN suffers from slow
synthesis, which makes it challenging to adopt/integrate it in real-time applications
and end-to-end systems [9–11].
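The following is a conceptual sketch of this source-filter factorization, not the
actual uSFGAN architecture: plain dilated-convolution stacks stand in for the
quasi-periodic WaveNet-style blocks of [6-8], and the input layout is a hypothetical
waveform-rate excitation prior plus upsampled acoustic features. It is meant only to
show the source-network/filter-network split and why processing every layer at full
waveform resolution is costly.

```python
# Conceptual sketch (not the actual uSFGAN architecture) of factorizing a
# waveform generator into a source-network and a filter-network.  Every layer
# runs at full waveform resolution, which is what makes this class of models
# slow compared to upsampling-based generators.
import torch
import torch.nn as nn


def dilated_stack(channels, num_layers=4):
    # Stack of length-preserving dilated 1-D convolutions (simplified stand-in
    # for the quasi-periodic WaveNet-style blocks).
    return nn.Sequential(*[
        nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=3,
                      padding=2 ** i, dilation=2 ** i),
            nn.ReLU(),
        )
        for i in range(num_layers)
    ])


class SourceFilterVocoder(nn.Module):
    def __init__(self, cond_channels=80, channels=64):
        super().__init__()
        self.embed = nn.Conv1d(1 + cond_channels, channels, kernel_size=1)
        self.source_net = dilated_stack(channels)  # models source excitation generation
        self.filter_net = dilated_stack(channels)  # models resonance filtering
        self.to_excitation = nn.Conv1d(channels, 1, kernel_size=1)
        self.to_waveform = nn.Conv1d(channels, 1, kernel_size=1)

    def forward(self, prior, cond):
        # prior: (B, 1, T) waveform-rate excitation prior (e.g., noise or sine signal)
        # cond:  (B, C, T) acoustic features already upsampled to waveform resolution
        h = self.source_net(self.embed(torch.cat([prior, cond], dim=1)))
        # In uSFGAN, the intermediate source output is regularized to resemble
        # a source excitation signal.
        excitation = self.to_excitation(h)
        return self.to_waveform(self.filter_net(h)), excitation
```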
For fast high-fidelity generation and F0 controllability, we propose source-filter
HiFi-GAN (SiFi-GAN), which introduces source-filter modeling into HiFi-GAN. The
proposed model is based on the V1 configuration described in the HiFi-GAN paper [1]
because of its superior voice quality. SiFi-GAN has two separate upsampling networks, a
source-network and a filter-network, which are connected in series to simulate a pseudo
cascade mechanism of the source-filter theory [12]. The parameters of HiFi-GAN V1 are
pruned to compensate for the additional computation costs of the source-filter
modeling. According to the experimental results, SiFi-GAN achieves faster synthesis
than HiFi-GAN V1 on a single CPU with better voice quality. Also, SiFi-GAN achieves F0
controllability comparable to that of the uSFGAN-based model with much faster
synthesis.
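The sketch below illustrates how two upsampling networks can be connected in series as
a pseudo cascade, with the filter path conditioned on the source path at each matching
temporal resolution. The input layout, the 1x1 conditioning convolutions, the channel
widths, and the upsample rates are illustrative assumptions rather than the pruned
HiFi-GAN V1 configuration used in this paper, and the MRF modules are omitted for
brevity.

```python
# High-level sketch of a pseudo cascade of two upsampling networks connected
# in series: the filter path is conditioned on the source path at each
# resolution.  All module choices are illustrative assumptions.
import torch
import torch.nn as nn


class PseudoCascade(nn.Module):
    def __init__(self, feat_channels=80, channels=256, rates=(8, 8, 2, 2)):
        super().__init__()
        self.src_pre = nn.Conv1d(feat_channels + 1, channels, 7, padding=3)
        self.flt_pre = nn.Conv1d(feat_channels, channels, 7, padding=3)
        self.src_ups = nn.ModuleList()
        self.flt_ups = nn.ModuleList()
        self.cond = nn.ModuleList()
        c = channels
        for r in rates:
            self.src_ups.append(nn.ConvTranspose1d(c, c // 2, 2 * r, stride=r, padding=r // 2))
            self.flt_ups.append(nn.ConvTranspose1d(c, c // 2, 2 * r, stride=r, padding=r // 2))
            self.cond.append(nn.Conv1d(c // 2, c // 2, 1))  # injects source features into the filter path
            c //= 2
        self.src_out = nn.Conv1d(c, 1, 7, padding=3)  # excitation-like intermediate output
        self.flt_out = nn.Conv1d(c, 1, 7, padding=3)  # final speech waveform

    def forward(self, feats, f0):
        # feats: (B, C, N) frame-rate acoustic features; f0: (B, 1, N) frame-rate F0
        # (hypothetical input layout for this sketch).
        s = self.src_pre(torch.cat([feats, f0], dim=1))
        f = self.flt_pre(feats)
        for src_up, flt_up, cond in zip(self.src_ups, self.flt_ups, self.cond):
            s = src_up(torch.relu(s))
            f = flt_up(torch.relu(f)) + cond(s)  # hierarchical conditioning on source information
        return torch.tanh(self.flt_out(f)), self.src_out(s)
```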
2. BASELINE HIFI-GAN AND USFGAN
This section describes two neural vocoders, HiFi-GAN [1] and uSFGAN [6, 7], which form
the basis of the proposed SiFi-GAN and serve as our baselines.
2.1. HiFi-GAN
HiFi-GAN [1] is a generative adversarial network (GAN) [13] based high-fidelity neural
vocoder with a carefully designed generator and multi-period and multi-scale
discriminators. The generator receives a mel-spectrogram as input and upsamples it to
match the temporal resolution of the target raw waveform using transposed convolutions
followed by multi-receptive field fusion (MRF) modules. An MRF module comprises several
residual blocks, each of which has customized kernel and dilation sizes to capture the
input features over different receptive fields. The hyperparameters of the MRF modules
can be chosen to trade off synthesis efficiency against voice quality.
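To make the MRF structure concrete, the following is a minimal sketch of parallel
residual blocks with different kernel and dilation sizes whose outputs are averaged;
the specific kernel sizes and dilations are illustrative assumptions and are not
guaranteed to match the V1 configuration.

```python
# Sketch of a multi-receptive field fusion (MRF) module in the spirit of
# HiFi-GAN: parallel residual blocks with different kernel and dilation sizes,
# whose outputs are averaged.  Kernel/dilation choices are illustrative.
import torch
import torch.nn as nn


class ResBlock(nn.Module):
    def __init__(self, channels, kernel_size, dilations):
        super().__init__()
        self.convs = nn.ModuleList([
            nn.Conv1d(channels, channels, kernel_size,
                      dilation=d, padding=d * (kernel_size - 1) // 2)
            for d in dilations
        ])

    def forward(self, x):
        for conv in self.convs:
            x = x + conv(torch.relu(x))  # residual connection keeps the time resolution
        return x


class MRF(nn.Module):
    """Average of residual blocks, each observing a different receptive field."""
    def __init__(self, channels, kernel_sizes=(3, 7, 11), dilations=(1, 3, 5)):
        super().__init__()
        self.blocks = nn.ModuleList(
            [ResBlock(channels, k, dilations) for k in kernel_sizes])

    def forward(self, x):
        return sum(block(x) for block in self.blocks) / len(self.blocks)
```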
The generator learns to fool the discriminators while the
discriminators learn to distinguish between the generated and