
Fig. 1. (a) Spectrogram and (b) waveform generated by PWG.
2. HIFI-WAVEGAN
Before introducing our proposed HiFi-WaveGAN, we first analyzed the limitations of TTS neural vocoders when synthesizing 48 kHz singing voices, using the PWG vocoder as an example.
In Figure 1 (a), we observed glitches in the low-frequency part that
disrupt the continuity of the spectrogram. These glitches arise be-
cause the receptive field of PWG is insufficient to cover the long
continuous pronunciation required for singing voices. Furthermore,
the high-frequency harmonics in Figure 1 (a) appear blurry, indicat-
ing that the PWG neural vocoder struggles to accurately reconstruct
the high-frequency components necessary for SVS tasks. We further
investigated the source of these problems in the PWG spectrogram,
as well as in other neural TTS vocoders. Figure 1 (b) reveals periodic
distortions in the waveform generated by the PWG vocoder, leading
to trembling and low-quality singing voices.
To address the aforementioned issues, we describe our HiFi-WaveGAN in detail in the remainder of this section. As depicted in Figure 2, our HiFi-WaveGAN consists of
an ExWaveNet generator, responsible for generating high-quality
singing voices. Additionally, we employ two independent discriminators: a Multi-Period Discriminator (MPD) and a Multi-Resolution Spectrogram Discriminator (MRSD). These discriminators are designed to distinguish real from fake waveforms based on periodic patterns and consecutive long-term dependencies, respectively.
2.1. Extended WaveNet
Similar to the generator in PWG [8], we also adopt a WaveNet-based
model as the generator. However, recognizing that singing voices ex-
hibit longer continuous pronunciation compared to speech [19], we
enhance the architecture of WaveNet by utilizing an 18-layer one-
dimensional CNN with larger kernel sizes, resulting in an improved
model called Extended WaveNet (ExWaveNet). Specifically, we
evenly divide the 18 layers of the generator into three stacks, and the
kernel sizes in each stack are set to {3,3,9,9,17,17}, determined
through neural architecture search [24]. This modification enables the network to effectively capture the longer continuous pronunciation in singing voices at high sampling rates, owing to the increased receptive field.
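To make this structure concrete, the following PyTorch-style sketch stacks 18 dilated one-dimensional convolutions in three stacks with the kernel-size pattern {3, 3, 9, 9, 17, 17}; the channel width, dilation schedule, and residual wiring are illustrative assumptions, not the exact ExWaveNet configuration.

```python
import torch
import torch.nn as nn

class ExWaveNetStack(nn.Module):
    """Illustrative dilated Conv1d stack: 3 stacks x 6 layers = 18 layers,
    kernel sizes {3, 3, 9, 9, 17, 17} per stack. Dilations and channel
    width are assumptions made for this sketch."""
    def __init__(self, channels=64, kernel_sizes=(3, 3, 9, 9, 17, 17), num_stacks=3):
        super().__init__()
        layers = []
        for _ in range(num_stacks):
            for j, k in enumerate(kernel_sizes):
                d = 2 ** j  # hypothetical dilation schedule (not given in the paper)
                layers.append(nn.Conv1d(channels, channels, kernel_size=k,
                                        dilation=d, padding=(k - 1) // 2 * d))
        self.layers = nn.ModuleList(layers)
        self.act = nn.LeakyReLU(0.2)

    def forward(self, x):  # x: (batch, channels, time)
        for conv in self.layers:
            x = x + self.act(conv(x))  # residual connection per layer
        return x
```

The large kernels in the later layers of each stack are what enlarge the receptive field enough to span long sustained vowels at 48 kHz.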
In addition to modeling long continuous pronunciation, restor-
ing expressiveness in singing voices is also crucial. To achieve this,
we concatenate the pitch with the mel-spectrogram as the input to
the upsampling network, following the same approach as in PWG.
The upsampled representation is then concatenated with a pulse se-
quence T, which will be explained in detail in the next paragraph,
serving as the conditional input for generating singing voices with
strong expressiveness. Moreover, we observed that upsampling the
random input noise using an identical network also improves the ex-
pressiveness of the synthesized audio.
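A minimal sketch of how such a conditional input could be assembled is given below; the single transposed convolution, the hop size of 256, and the tensor shapes are placeholders standing in for PWG's upsampling network rather than the exact implementation.

```python
import torch
import torch.nn as nn

class ConditionEncoder(nn.Module):
    """Sketch: concatenate pitch with the mel-spectrogram, upsample to the
    waveform rate, then concatenate the pulse sequence T. Layer choices
    (single ConvTranspose1d, hop=256, hidden=64) are assumptions."""
    def __init__(self, n_mels=80, hop=256, hidden=64):
        super().__init__()
        # stand-in for PWG's upsampling network
        self.upsample = nn.ConvTranspose1d(n_mels + 1, hidden,
                                           kernel_size=2 * hop, stride=hop,
                                           padding=hop // 2)

    def forward(self, mel, f0, pulse):
        # mel: (B, n_mels, frames), f0: (B, 1, frames), pulse: (B, 1, samples)
        c = torch.cat([mel, f0], dim=1)      # pitch concatenated with mel, as in PWG
        c = self.upsample(c)                 # frame rate -> sample rate
        c = c[..., :pulse.size(-1)]          # trim to the waveform length
        return torch.cat([c, pulse], dim=1)  # append the pulse sequence T
```

The random input noise mentioned above would be passed through an upsampling network of the same form before being fed to the generator.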
As mentioned at the beginning of this section, there are periodic distortions in the waveforms generated by TTS neural vocoders, leading to low-quality synthesized singing voices. To address this,
we propose an additional Pulse Extractor (PE) to generate a pulse
sequence as a constraint condition during waveform synthesis. As
shown in Fig. 2, the extractor takes the mel-spectrogram and pitch as
input. The pulse is extracted at each extreme point of the waveform envelope, determined by the voiced/unvoiced (V/UV) decision and the mel-spectrogram, and can be formulated as:
\[
T[i] =
\begin{cases}
\|M[i]\|_F, & UV = 1,\ i = \frac{s}{f_0} \\
0, & UV = 1,\ i \neq \frac{s}{f_0} \\
\text{noise}, & UV = 0
\end{cases}
\tag{1}
\]
where $M$ represents the mel-spectrogram, $i$ is the time index, $\|\cdot\|_F$ denotes the Frobenius norm, $T[i]$ indicates the pulse value at index $i$, and $s$ and $f_0$ denote the sampling rate and pitch, respectively. The noise in the formula is generated from a Gaussian distribution.
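The sketch below illustrates Eq. (1), assuming that the V/UV decision and f0 are given per mel frame and mapped to samples via a hop size; the frame-to-sample mapping, the hop of 256, and the per-frame Frobenius norm are illustrative choices, not the exact Pulse Extractor.

```python
import torch

def pulse_sequence(mel, f0, uv, sample_rate=48000, hop=256):
    """Sketch of Eq. (1): a pulse every s/f0 samples in voiced regions,
    zeros elsewhere in voiced frames, Gaussian noise in unvoiced frames.
    mel: (n_mels, frames), f0/uv: (frames,). Hop-based frame mapping is
    an assumption of this sketch."""
    frames = mel.size(-1)
    n_samples = frames * hop
    T = torch.zeros(n_samples)
    i = 0
    while i < n_samples:
        frame = min(i // hop, frames - 1)
        if uv[frame] > 0 and f0[frame] > 0:              # voiced: place a pulse
            T[i] = torch.linalg.norm(mel[:, frame])      # ||M[i]||_F for that frame
            i += max(1, int(round(sample_rate / float(f0[frame]))))  # step by s/f0
        else:                                            # unvoiced: fill with noise
            T[i:i + hop] = torch.randn(min(hop, n_samples - i))
            i += hop
    return T
```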
2.2. Discriminators
Identifying both consecutive long-term dependencies and periodic
patterns plays a crucial role in modeling realistic audio [9]. In the
proposed HiFi-WaveGAN, we employ two independent discrimina-
tors to evaluate singing voices from these two aspects.
The first discriminator is MRSD, adapted from UnivNet [21],
which identifies consecutive long-term dependencies in singing
voices from the spectrogram. We transform both real and fake
singing voices into spectrograms using different combinations of
FFT size, window length, and shift size. Then, two-dimensional
convolutional layers are applied to the spectrograms. As depicted
in Fig. 2, the model employs K sub-discriminators, each utilizing a specific combination of spectrogram inputs. In our implementation, K is set to four.
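A simplified sketch of such a multi-resolution spectrogram discriminator is shown below; the four (FFT size, shift size, window length) combinations and the convolutional widths are assumptions, since the exact settings follow UnivNet [21].

```python
import torch
import torch.nn as nn

class MRSD(nn.Module):
    """Sketch of a multi-resolution spectrogram discriminator: K = 4
    sub-discriminators, each applied to an STFT magnitude computed with a
    different (n_fft, hop, win) combination (values are assumptions)."""
    def __init__(self, resolutions=((512, 128, 512), (1024, 256, 1024),
                                    (2048, 512, 2048), (4096, 1024, 4096))):
        super().__init__()
        self.resolutions = resolutions
        self.subs = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(1, 32, (3, 3), padding=1), nn.LeakyReLU(0.2),
                nn.Conv2d(32, 32, (3, 3), stride=(2, 2), padding=1), nn.LeakyReLU(0.2),
                nn.Conv2d(32, 1, (3, 3), padding=1),
            ) for _ in resolutions
        ])

    def forward(self, wav):  # wav: (B, T), real or generated audio
        outputs = []
        for (n_fft, hop, win), sub in zip(self.resolutions, self.subs):
            spec = torch.stft(wav, n_fft, hop_length=hop, win_length=win,
                              window=torch.hann_window(win, device=wav.device),
                              return_complex=True).abs()   # magnitude spectrogram
            outputs.append(sub(spec.unsqueeze(1)))          # add a channel dimension
        return outputs
```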
The second discriminator is MPD, identical to the one used in
HiFiGAN [9]. It transforms the one-dimensional waveform of length T into 2-D data with height T/p and width p by setting the period p to M different values, resulting in M independent sub-discriminators within MPD. In this paper, we set M to five and p to [2, 3, 5, 7, 11]. As described in [9], this design allows the discriminator to capture distinct implicit structures by examining different parts of the input audio.
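The reshaping step can be sketched as follows; zero padding to a multiple of p is used here for brevity and is an assumption of this sketch, not necessarily HiFiGAN's exact padding scheme.

```python
import torch
import torch.nn.functional as F

def to_period_2d(wav, p):
    """Sketch of MPD's reshaping: pad a 1-D waveform of length T so it is
    divisible by the period p, then view it as 2-D data of height T/p and
    width p. The paper uses p in [2, 3, 5, 7, 11]."""
    b, t = wav.shape
    if t % p != 0:
        wav = F.pad(wav, (0, p - t % p))  # zero-pad to a multiple of p (assumption)
        t = wav.shape[-1]
    return wav.view(b, t // p, p)         # (B, T/p, p), fed to a 2-D sub-discriminator
```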
2.3. Loss function
Similar to other models [9, 19], we adopt a weighted combination of
multiple loss terms as the final loss function formulated by Eq. (2)
and Eq. (3) to supervise the training process of our HiFi-WaveGAN.
\[
\mathcal{L}_D = \mathcal{L}_{adv}(D; G), \tag{2}
\]
\[
\mathcal{L}_G = \lambda_1 \mathcal{L}_{adv}(G; D) + \lambda_2 \mathcal{L}_{aux} + \lambda_3 \mathcal{L}_{fm}, \tag{3}
\]
where $\mathcal{L}_{adv}$, $\mathcal{L}_{aux}$, and $\mathcal{L}_{fm}$ denote the adversarial loss, auxiliary spectrogram-phase loss, and feature matching loss, respectively. In this paper, $\lambda_1$, $\lambda_2$, and $\lambda_3$ are set to 1, 120, and 10, respectively.
2.3.1. Adversarial loss
For the adversarial loss, we adopt the least-squares formulation of LS-GAN [25] to avoid gradient vanishing. The formulas are shown as
\[
\mathcal{L}_{adv}(G; D) = \mathbb{E}_{z \sim \mathcal{N}(0,1)}\big[(1 - D(G(z)))^2\big], \tag{4}
\]
\[
\mathcal{L}_{adv}(D; G) = \mathbb{E}_{y \sim p_{\mathrm{data}}}\big[(1 - D(y))^2\big] + \mathbb{E}_{z \sim \mathcal{N}(0,1)}\big[D(G(z))^2\big], \tag{5}
\]
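For concreteness, a minimal sketch of Eqs. (2)-(5) follows; d_real and d_fake stand for discriminator outputs on real and generated audio, while l_aux and l_fm are assumed to be computed elsewhere, so the function names and signatures are illustrative only.

```python
import torch

def lsgan_d_loss(d_real, d_fake):
    """Sketch of Eqs. (2)/(5): least-squares discriminator loss (LS-GAN)."""
    return ((1.0 - d_real) ** 2).mean() + (d_fake ** 2).mean()

def lsgan_g_loss(d_fake):
    """Sketch of Eq. (4): least-squares adversarial loss for the generator."""
    return ((1.0 - d_fake) ** 2).mean()

def total_g_loss(d_fake, l_aux, l_fm, lambdas=(1.0, 120.0, 10.0)):
    """Sketch of Eq. (3): weighted sum of adversarial, auxiliary
    spectrogram-phase, and feature matching terms with weights (1, 120, 10)."""
    l1, l2, l3 = lambdas
    return l1 * lsgan_g_loss(d_fake) + l2 * l_aux + l3 * l_fm
```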