HIFI-WAVEGAN: GENERATIVE ADVERSARIAL NETWORK WITH AUXILIARY
SPECTROGRAM-PHASE LOSS FOR HIGH-FIDELITY SINGING VOICE GENERATION
Chunhui Wang1*, Chang Zeng2,3*, Jun Chen4, Xing He1
1Beijing Bombax XiaoIce Technology Co., Ltd, China
2National Institute of Informatics, Japan 3SOKENDAI, Japan
4Shenzhen International Graduate School, Tsinghua University, Shenzhen, China
*These authors contributed equally to this work.
ABSTRACT
Entertainment-oriented singing voice synthesis (SVS) requires a vocoder to generate high-fidelity (e.g., 48kHz) audio. However, most text-to-speech (TTS) vocoders cannot reconstruct the waveform well in this scenario. In this paper, we propose HiFi-WaveGAN to synthesize 48kHz high-quality singing voices in real time. Specifically, it consists of an Extended WaveNet serving as the generator, the multi-period discriminator proposed in HiFiGAN, and a multi-resolution spectrogram discriminator borrowed from UnivNet. To better reconstruct the high-frequency part of the full-band mel-spectrogram, we incorporate a pulse extractor to generate a constraint for the synthesized waveform. Additionally, an auxiliary spectrogram-phase loss is utilized to further approximate the real distribution. The experimental results show that our proposed HiFi-WaveGAN obtains 4.23 in the mean opinion score (MOS) metric for the 48kHz SVS task, significantly outperforming other neural vocoders.
Index Terms: generative adversarial network, vocoder, high-fidelity, singing voice generation
1. INTRODUCTION
Speech synthesis has been a prominent area of research in the speech community for an extended period. With the advent of deep learning, neural networks have replaced various components in the speech synthesis pipeline, including front-end text analysis [1, 2], the acoustic model [3, 4, 5, 6, 7], and the vocoder [8, 9, 10]. Singing voice synthesis (SVS), given its similarity to speech synthesis in workflow, can benefit significantly from advances in text-to-speech (TTS) [11, 12, 13, 14].
For instance, [13] proposed an acoustic model for SVS based on FastSpeech [4, 15], originally designed for speech synthesis. However, when it comes to the vocoder, most studies [13, 16] have simply borrowed existing models such as WORLD [17] and WaveRNN [18] from the speech synthesis domain without customizing them to account for the unique characteristics of singing voices. Consequently, these vocoders struggle to accurately reconstruct the high-frequency components of the mel-spectrogram, particularly in entertainment-oriented scenarios that require generating high-fidelity (e.g., 48kHz) singing voices.
The SingGAN vocoder [19] designed for SVS highlighted several reasons why vanilla TTS vocoders are not well suited to generating natural singing voices, including:
• Long continuous pronunciation,
• Strong expressiveness,
• Higher sampling rate than speech.
SingGAN addressed these characteristics by designing a generative adversarial network capable of generating singing voices at a 24kHz sampling rate. However, this sampling rate limits the quality of the singing voices, particularly in the high-frequency parts, as a 24kHz signal cannot capture information beyond 12kHz, in accordance with the Nyquist sampling theorem. Therefore, there is room to further enhance quality by generating singing voices at a higher sampling rate.
To address this, we propose a novel HiFi-WaveGAN in this paper, capable of generating high-fidelity singing voices at human-level quality with a 48kHz sampling rate. Our approach consists of an Extended WaveNet [20] (ExWaveNet) serving as the generator, the multi-period discriminator (MPD) proposed in HiFiGAN [9], and the multi-resolution spectrogram discriminator (MRSD) borrowed from UnivNet [21].
For the ExWaveNet generator, we expand the kernel size of the convolutional layers to obtain a larger receptive field, enabling the model to handle the long continuous pronunciation in SVS. Recognizing that pitch carries greater expressiveness in SVS than in TTS, we concatenate it with the mel-spectrogram and feed the result into an upsampling network.
To address the challenge of accurately reconstructing the high-frequency regions from the full-band mel-spectrogram, we introduce an additional Pulse Extractor (PE) to generate a pulse sequence, which is then incorporated as a conditional constraint input to the ExWaveNet. This operation plays a crucial role in rectifying unexpected distortions in the waveform. Furthermore, we leverage the auxiliary spectrogram-phase loss [22], combined with the adversarial training loss and the feature match loss, to supervise the training process of our HiFi-WaveGAN. The phase-related term in the auxiliary loss enhances our model's ability to approximate the real distribution more effectively than models that lack this component.
In the experiments, Xiaoicesing2 [23] is combined with the proposed HiFi-WaveGAN and with other neural vocoders, which serve as baseline models. The experimental results show that the quality of the singing voice synthesized by our proposed HiFi-WaveGAN not only outperforms all baselines but also comes very close to human level under the MOS metric.
The rest of this paper is organized as follows. HiFi-WaveGAN is described in detail in Section 2, including the ExWaveNet generator, the discriminators, and the training loss functions. The experimental settings and results are reported in Section 3. Finally, we conclude our paper in Section 4.
Demo page: https://wavelandspeech.github.io/hifi-wavegan/
Fig. 1. Spectrogram (a) and waveform (b) generated by PWG.
2. HIFI-WAVEGAN
Before introducing our proposed HiFi-WaveGAN, we analyze the limitations of TTS neural vocoders when synthesizing 48kHz singing voices, using the Parallel WaveGAN (PWG) vocoder as an example. In Figure 1 (a), we observe glitches in the low-frequency part that disrupt the continuity of the spectrogram. These glitches arise because the receptive field of PWG is insufficient to cover the long continuous pronunciation required for singing voices. Furthermore, the high-frequency harmonics in Figure 1 (a) appear blurry, indicating that the PWG neural vocoder struggles to accurately reconstruct the high-frequency components necessary for SVS tasks. We further investigated the source of these problems in the PWG spectrogram, as well as in other neural TTS vocoders. Figure 1 (b) reveals periodic distortions in the waveform generated by the PWG vocoder, leading to trembling and low-quality singing voices.
To address the aforementioned issues, we present a detailed illustration of our HiFi-WaveGAN in the remainder of this section. As depicted in Figure 2, our HiFi-WaveGAN consists of an ExWaveNet generator, responsible for generating high-quality singing voices, and two independent discriminators: the multi-period discriminator (MPD) and the multi-resolution spectrogram discriminator (MRSD). These discriminators are designed to distinguish real from fake waveforms based on periodic patterns and consecutive long-term dependencies, respectively.
2.1. Extended WaveNet
Similar to the generator in PWG [8], we adopt a WaveNet-based model as the generator. However, recognizing that singing voices exhibit longer continuous pronunciation than speech [19], we enhance the WaveNet architecture by utilizing an 18-layer one-dimensional CNN with larger kernel sizes, resulting in an improved model called Extended WaveNet (ExWaveNet). Specifically, we evenly divide the 18 layers of the generator into three stacks, and the kernel sizes of the six layers in each stack are set to {3, 3, 9, 9, 17, 17}, determined through neural architecture search [24]. This modification empowers the network to effectively capture the long continuous pronunciation in singing voices with high sampling rates, courtesy of the increased receptive field.
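To make this concrete, the following is a minimal PyTorch sketch of such a backbone: 18 layers split into three stacks whose per-stack kernel sizes follow {3, 3, 9, 9, 17, 17}. The gated residual layout, channel count, and dilation choice are our assumptions (the paper specifies only the layer count and kernel sizes), not the authors' exact implementation.

```python
import torch
import torch.nn as nn

KERNEL_SIZES = [3, 3, 9, 9, 17, 17]  # per-stack kernel sizes from the paper

class ResidualBlock(nn.Module):
    """WaveNet-style gated residual layer (layout assumed, as in PWG)."""
    def __init__(self, channels: int, kernel_size: int, dilation: int = 1):
        super().__init__()
        pad = (kernel_size - 1) * dilation // 2  # keep the time axis length fixed
        self.conv = nn.Conv1d(channels, 2 * channels, kernel_size,
                              dilation=dilation, padding=pad)
        self.skip = nn.Conv1d(channels, channels, 1)

    def forward(self, x):
        a, b = self.conv(x).chunk(2, dim=1)      # gated activation
        return x + self.skip(torch.tanh(a) * torch.sigmoid(b))

class ExWaveNetBackbone(nn.Module):
    """18 layers = 3 stacks x 6 layers with kernel sizes {3, 3, 9, 9, 17, 17}."""
    def __init__(self, channels: int = 64, num_stacks: int = 3):
        super().__init__()
        self.layers = nn.ModuleList(
            ResidualBlock(channels, k)
            for _ in range(num_stacks) for k in KERNEL_SIZES)

    def forward(self, x):                        # x: (batch, channels, time)
        for layer in self.layers:
            x = layer(x)
        return x
```

Larger kernels widen the receptive field linearly per layer, which is what lets the stack span the long continuous pronunciation at a 48kHz sampling rate.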
In addition to modeling long continuous pronunciation, restoring expressiveness in singing voices is also crucial. To achieve this, we concatenate the pitch with the mel-spectrogram as the input to the upsampling network, following the same approach as in PWG. The upsampled representation is then concatenated with a pulse sequence T, which is explained in detail in the next paragraph, and serves as the conditional input for generating singing voices with strong expressiveness. Moreover, we observed that upsampling the random input noise using an identical network also improves the expressiveness of the synthesized audio.
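A minimal sketch of this conditioning path is given below, assuming a frame-to-sample hop size of 256 (the paper does not state the hop size) and using nearest-neighbor upsampling as a stand-in for the learned upsampling network.

```python
import torch
import torch.nn.functional as F

def build_condition(mel, f0, pulse, hop_size=256):
    """mel: (B, n_mels, frames); f0: (B, 1, frames); pulse: (B, 1, samples).
    Assumes samples == frames * hop_size so the two rates align."""
    cond = torch.cat([mel, f0], dim=1)                 # pitch + mel at frame rate
    cond = F.interpolate(cond, scale_factor=hop_size)  # to sample rate (nearest)
    return torch.cat([cond, pulse], dim=1)             # append pulse sequence T
```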
As analyzed at the beginning of this section, there are periodic distortions in the waveform generated by TTS neural vocoders, leading to low-quality synthesized singing voices. To address this, we propose an additional Pulse Extractor (PE) to generate a pulse sequence as a constraint condition during waveform synthesis. As shown in Fig. 2, the extractor takes the mel-spectrogram and pitch as input. The pulse is extracted at each extreme point of the waveform envelope, determined by the V/UV decision and the mel-spectrogram, and can be formulated as
$$
T[i] =
\begin{cases}
\|M[i]\|_F, & UV = 1,\; i = \frac{s}{f_0} \\
0, & UV = 1,\; i \neq \frac{s}{f_0} \\
\text{noise}, & UV = 0
\end{cases}
\tag{1}
$$
where $M$ represents the mel-spectrogram, $i$ is the time index, $\|\cdot\|_F$ denotes the Frobenius norm, $T[i]$ indicates the pulse value at index $i$, and $s$ and $f_0$ denote the sampling rate and pitch, respectively. The noise in the formula is generated from a Gaussian distribution.
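The following is a sketch of Eq. (1), assuming frame-level $f_0$ and V/UV inputs and a hypothetical hop size of 256 to map frames to sample indices; reading $i = s/f_0$ as "one pulse per pitch period" is our interpretation of the formula, not something the paper states explicitly.

```python
import torch

def pulse_sequence(mel, f0, uv, sr=48000, hop=256):
    """mel: (n_mels, frames); f0, uv: (frames,). Returns T of frames*hop samples."""
    n_frames = f0.numel()
    t = torch.randn(n_frames * hop)                 # Gaussian noise where UV = 0
    for frame in range(n_frames):
        if uv[frame] == 0:
            continue                                # unvoiced: keep the noise
        start, end = frame * hop, (frame + 1) * hop
        t[start:end] = 0.0                          # voiced, non-pulse positions
        period = max(int(sr / float(f0[frame])), 1) # s / f0 in samples
        for i in range(start, end):
            if i % period == 0:                     # one pulse per pitch period
                t[i] = torch.linalg.norm(mel[:, frame])  # Frobenius norm of M[i]
    return t
```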
2.2. Discriminators
Identifying both consecutive long-term dependencies and periodic patterns plays a crucial role in modeling realistic audio [9]. In the proposed HiFi-WaveGAN, we employ two independent discriminators to evaluate singing voices from these two aspects.
The first discriminator is the MRSD, adapted from UnivNet [21], which identifies consecutive long-term dependencies in singing voices from the spectrogram. We transform both real and fake singing voices into spectrograms using different combinations of FFT size, window length, and shift size, and then apply two-dimensional convolutional layers to the spectrograms. As depicted in Fig. 2, the model employs K sub-discriminators, each operating on the spectrogram computed with one specific parameter combination; a sketch of these inputs follows. In our implementation, K is set to four.
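As a sketch, the K = 4 multi-resolution inputs can be computed as below; the four (FFT size, hop, window length) combinations are placeholders, since the paper does not list the exact values it uses.

```python
import torch

# Hypothetical resolutions; the paper does not specify its four combinations.
RESOLUTIONS = [(512, 128, 512), (1024, 256, 1024),
               (2048, 512, 2048), (4096, 1024, 4096)]  # K = 4

def multi_resolution_spectrograms(wav):
    """wav: (batch, samples) -> list of K magnitude spectrograms for the MRSD."""
    specs = []
    for n_fft, hop, win in RESOLUTIONS:
        stft = torch.stft(wav, n_fft=n_fft, hop_length=hop, win_length=win,
                          window=torch.hann_window(win, device=wav.device),
                          return_complex=True)
        specs.append(stft.abs())  # each fed to its own 2-D conv sub-discriminator
    return specs
```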
The second discriminator is the MPD, identical to the one used in HiFiGAN [9]. It reshapes the one-dimensional waveform of length $T$ into 2-D data of height $T/p$ and width $p$ by setting the period $p$ to $M$ different values, resulting in $M$ independent sub-discriminators within the MPD, as sketched below. In this paper, we set $M$ to five and $p$ to $[2, 3, 5, 7, 11]$. As described in [9], this design allows the discriminator to capture distinct implicit structures by examining different parts of the input audio.
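A sketch of this folding operation; the reflection padding used to make $T$ a multiple of $p$ is an implementation detail assumed here (HiFiGAN does something similar, but the paper does not restate it).

```python
import torch
import torch.nn.functional as F

PERIODS = [2, 3, 5, 7, 11]  # the M = 5 periods used by the MPD

def fold_by_period(wav, p):
    """wav: (batch, 1, T) -> (batch, 1, T/p, p) for one MPD sub-discriminator."""
    b, c, t = wav.shape
    if t % p != 0:                                  # pad so T divides evenly by p
        wav = F.pad(wav, (0, p - t % p), mode="reflect")
        t = wav.shape[-1]
    return wav.reshape(b, c, t // p, p)
```

Each period $p$ exposes a different stride of the waveform to 2-D convolutions, so the sub-discriminators see complementary periodic structure.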
2.3. Loss function
Similar to other models [9, 19], we adopt a weighted combination of multiple loss terms as the final loss function, formulated by Eq. (2) and Eq. (3), to supervise the training process of our HiFi-WaveGAN.
$$\mathcal{L}_D = \mathcal{L}_{adv}(D; G), \tag{2}$$
$$\mathcal{L}_G = \lambda_1 \mathcal{L}_{adv}(G; D) + \lambda_2 \mathcal{L}_{aux} + \lambda_3 \mathcal{L}_{fm}, \tag{3}$$
where $\mathcal{L}_{adv}$, $\mathcal{L}_{aux}$, and $\mathcal{L}_{fm}$ denote the adversarial loss, the auxiliary spectrogram-phase loss, and the feature match loss, respectively. In this paper, $\lambda_1$, $\lambda_2$, and $\lambda_3$ are set to 1, 120, and 10, respectively.
2.3.1. Adversarial loss
For the adversarial loss, we adopt the least-squares formulation of LS-GAN [25] to mitigate vanishing gradients. The formulas are
$$\mathcal{L}_{adv}(G; D) = \mathbb{E}_{z \sim \mathcal{N}(0,1)}\big[(1 - D(G(z)))^2\big], \tag{4}$$
$$\mathcal{L}_{adv}(D; G) = \mathbb{E}_{y \sim p_{data}}\big[(1 - D(y))^2\big] + \mathbb{E}_{z \sim \mathcal{N}(0,1)}\big[D(G(z))^2\big], \tag{5}$$
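A sketch of these objectives, with the generator side weighted as in Eq. (3); `aux_loss` and `fm_loss` are placeholders for the spectrogram-phase and feature match terms, which are not fully specified in this excerpt.

```python
import torch

LAMBDA_ADV, LAMBDA_AUX, LAMBDA_FM = 1.0, 120.0, 10.0  # weights from Eq. (3)

def discriminator_loss(d_real, d_fake):
    # Eq. (5): push D(y) toward 1 for real audio and D(G(z)) toward 0 for fakes.
    return torch.mean((1.0 - d_real) ** 2) + torch.mean(d_fake ** 2)

def generator_loss(d_fake, aux_loss, fm_loss):
    # Eq. (4) combined via Eq. (3): push D(G(z)) toward 1, plus auxiliary terms.
    adv = torch.mean((1.0 - d_fake) ** 2)
    return LAMBDA_ADV * adv + LAMBDA_AUX * aux_loss + LAMBDA_FM * fm_loss
```

The least-squares form penalizes samples by their distance from the decision boundary rather than saturating like the sigmoid cross-entropy loss, which is why it helps against vanishing gradients.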