GAN YOU HEAR ME?
RECLAIMING UNCONDITIONAL SPEECH SYNTHESIS FROM DIFFUSION MODELS
Matthew Baas and Herman Kamper
MediaLab, Electrical & Electronic Engineering, Stellenbosch University, South Africa
ABSTRACT
We propose AudioStyleGAN (ASGAN), a new generative adversar-
ial network (GAN) for unconditional speech synthesis. As in the
StyleGAN family of image synthesis models, ASGAN maps sam-
pled noise to a disentangled latent vector which is then mapped to
a sequence of audio features so that signal aliasing is suppressed at
every layer. To successfully train ASGAN, we introduce a num-
ber of new techniques, including a modification to adaptive dis-
criminator augmentation to probabilistically skip discriminator up-
dates. ASGAN achieves state-of-the-art results in unconditional
speech synthesis on the Google Speech Commands dataset. It is
also substantially faster than the top-performing diffusion models.
Through a design that encourages disentanglement, ASGAN is able
to perform voice conversion and speech editing without being ex-
plicitly trained to do so. ASGAN demonstrates that GANs are still
highly competitive with diffusion models. Code, models, samples:
https://github.com/RF5/simple-asgan/.
Index Terms— Unconditional speech synthesis, generative adversarial networks, speech disentanglement, voice conversion.
1. INTRODUCTION
Unconditional speech synthesis is the task of generating coherent
speech without any conditioning inputs such as text or speaker la-
bels [1]. As in image synthesis [2], a well-performing unconditional
speech synthesis model would have several useful applications: from
latent interpolations between utterances and fine-grained tuning of
different aspects of the generated speech, to audio compression and
better probability density estimation of speech.
Spurred on by recent improvements in diffusion models [3] for
images [4–6], there has been a substantial improvement in uncon-
ditional speech synthesis in the last few years. The current best-
performing approaches are all trained as diffusion models [7,8]. Be-
fore this, most studies used generative adversarial networks (GANs)
[9] that map a latent vector to a sequence of speech features with a
single forward pass through the model. However, performance was
limited [1, 10], leading to GANs falling out of favour for this task.
Motivated by the StyleGAN literature [11–13] for image synthe-
sis, we aim to reinvigorate GANs for unconditional speech synthesis.
To this end, we propose AudioStyleGAN (ASGAN): a convolutional
GAN which maps a single latent vector to a sequence of audio fea-
tures, and is designed to have a disentangled latent space. The model
is based in large part on StyleGAN3 [13], which we adapt for audio
synthesis. Concretely, we adapt the style layers to remove signal
aliasing caused by the non-linearities in the network. This is accom-
plished with anti-aliasing filters to ensure that the Nyquist-Shannon sampling limits are met in each layer. We also propose a modification to adaptive discriminator augmentation [14] to stabilize training by randomly dropping discriminator updates based on a guiding signal. (All experiments were performed on Stellenbosch University's High Performance Computing (HPC) cluster.)
Using objective metrics to measure the quality and diversity of
generated samples [2, 15, 16], we show that ASGAN sets a new state-
of-the-art in unconditional speech synthesis on the Google Speech
Commands digits dataset [17]. It not only outperforms the best
existing models, but is also faster to train and faster in inference.
Mean opinion scores (MOS) also indicate that ASGAN’s generated
utterances sound more natural (MOS: 3.68) than the existing best
model (SaShiMi [7], MOS: 3.33).
Through ASGAN’s design, the model’s latent space is disentan-
gled during training, enabling the model – without any additional
training – to also perform voice conversion and speech editing in a
zero-shot fashion. Objective metrics that measure latent space disen-
tanglement indicate that ASGAN has smoother latent representations
compared to existing diffusion models.
2. RELATED WORK
We start by distinguishing what we call unconditional speech syn-
thesis from the related but different task of generative spoken language
modeling (GSLM). In GSLM, a large autoregressive language model
is typically trained on some discrete units (e.g. HuBERT [18] clusters
or clustered spectrogram features), similar to how a language model
is trained on text [19, 20]. While this also enables the generation
of speech without any conditioning input, GSLM implies a model
structure consisting of an encoder to discretize speech, a language
model, and a decoder [21]. This means that during generation, one
is bound by the discrete units in the model. For example, it is not possible
to interpolate between two utterances in a latent space or to directly
control speaker characteristics during generation. If this is desired,
additional components must be explicitly built into the model [20].
In contrast, in unconditional speech synthesis we do not assume
any knowledge of particular aspects of speech beforehand. Instead
of using some intermediate discretization step, such models typically
use noise to directly generate speech, often via some latent repre-
sentation. The latent space should ideally be disentangled, allowing
for modelling and control of the generated speech. In contrast to
GSLM, the synthesis model should learn to disentangle without being
explicitly designed to control specific speech characteristics. In some
sense this is a more challenging task than GSLM, which is why most
unconditional speech synthesis models are still evaluated on short
utterances of isolated spoken words [1] (as we also do here).
Within unconditional speech synthesis, a substantial body of work
focuses on either autoregressive [22] models – generating a current
sample based on previous outputs – or diffusion models [8]. Diffusion
models iteratively de-noise a sampled signal into a waveform through
a Markov chain with a constant number of steps [3]. At each inference
step, the original noise signal is slightly de-noised until – in the last
step – it resembles coherent speech. Autoregressive and diffusion models are relatively slow because they require repeated forward passes through the model during inference.

Fig. 1: The ASGAN generator (left) and discriminator (right). FF, LPF, Conv1D indicate Fourier feature [13], low-pass filter, and 1D convolution layers, respectively. The numbers above linear and convolutional layers indicate the number of output features/channels for that layer. Stacked blocks indicate a layer repeated sequentially, with the number of repeats indicated above the block (e.g. "x3").
Earlier studies [1, 10] attempted to use GANs [9] for uncondi-
tional speech synthesis, which has the advantage of requiring only
a single pass through the model. While results showed some initial
promise, performance was poor in terms of speech quality and di-
versity, with the more recent diffusion models performing much
better [7]. However, there have been substantial improvements
in GAN-based modelling for image synthesis in the intervening
years [11, 12, 14]. Our goal is to improve the performance of the
earlier GAN-based unconditional speech synthesis models by adapt-
ing lessons from these recent image synthesis studies.
Some of these innovations in GANs are modality-agnostic: $R_1$
regularization [23] and exponential moving averaging of generator
weights [24] can be directly transferred from the vision domain to
speech. Other techniques, such as the carefully designed anti-aliasing
filters between layers in StyleGAN3 [13], require specific adaptation;
in contrast to images, there is little meaningful information in speech
below 300 Hz, necessitating a redesign of the anti-aliasing filters.
In a closely related research direction, Beguš [10, 25] has been
studying how GAN-based unconditional speech synthesis models
internally perform lexical and phonological learning, and how this
relates to human learning. These studies, however, have been relying
on the older GAN synthesis models. We hope that by developing
better performing GANs for unconditional speech synthesis, such
investigations will also be improved. Recently, [26] attempted to
directly use StyleGAN2 for conditional and unconditional synthesis
of emotional vocal bursts. This further motivates a reinvestigation
of GANs, but here we look specifically at the generation of speech
rather than paralinguistic sounds.
3. ASGAN: AUDIO STYLE GAN
Our model is based on the StyleGAN family of models [11] for image
synthesis. We adapt and extend the approach to audio, and therefore
dub our model AudioStyleGAN (ASGAN). The model follows the setup of a standard GAN with a single generator network $G$ and a single discriminator $D$ [9]. The generator $G$ accepts a vector $z$ sampled from a normal distribution and processes it into a sequence of speech features $X$. In this work, we restrict the sequence of speech features $X$ to always have a fixed pre-specified duration. The discriminator $D$ accepts a sequence of speech features $X$ and yields a scalar output. $D$ is optimized to raise its output for $X$ sampled from real data and lower its output for $X$ produced by the generator. Meanwhile, $G$ is optimized to maximize $D(X)$ for $X$ sampled from the generator, i.e. when $X = G(z)$. The features $X$ are converted to a waveform using a pretrained HiFi-GAN vocoder [27]. During training, a new adaptive discriminator updating technique is added to ensure stability and convergence, as discussed in Sec. 4.
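The exact training objective is not given in this excerpt, so the following is only a minimal PyTorch sketch of one adversarial update under two assumptions: the non-saturating logistic GAN loss used elsewhere in the StyleGAN family, and a placeholder probability `p_skip` standing in for the guiding-signal-based skipping of discriminator updates discussed in Sec. 4. All names here are illustrative rather than the paper's.

```python
import torch
import torch.nn.functional as F

def asgan_train_step(G, D, opt_G, opt_D, real_X, z_dim=512, p_skip=0.0):
    """One adversarial update on a batch of real speech features real_X (B, C, L)."""
    z = torch.randn(real_X.size(0), z_dim, device=real_X.device)
    fake_X = G(z)  # sequence of speech features, later vocoded by HiFi-GAN

    # Discriminator update: raise D(X) for real features, lower it for generated
    # ones. With probability p_skip the update is dropped entirely, mimicking the
    # probabilistic skipping of discriminator updates described in the text.
    if torch.rand(1).item() >= p_skip:
        opt_D.zero_grad()
        d_loss = (F.softplus(-D(real_X)).mean()             # -log sigmoid(D(real))
                  + F.softplus(D(fake_X.detach())).mean())  # -log(1 - sigmoid(D(fake)))
        d_loss.backward()
        opt_D.step()

    # Generator update: maximise D(G(z)) via the non-saturating loss.
    opt_G.zero_grad()
    g_loss = F.softplus(-D(fake_X)).mean()
    g_loss.backward()
    opt_G.step()
    return g_loss.item()
```

In the paper the skip probability is driven by a guiding signal; here it is a fixed argument purely for illustration.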
3.1. Generator
The architecture of the generator $G$ is shown on the left of Fig. 1. It consists of a latent mapping network $W$ that converts $z$ to a disentangled latent space, a special Fourier feature (FF) layer which converts a single vector from this latent space into a sequence of cosine features of fixed length, and finally a convolutional encoder which iteratively refines the cosine features into the final speech features $X$.
Mapping network: The mapping network $W$ is a multi-layer perceptron with leaky ReLU activations. As input it takes in a vector sampled from a normal distribution $z \sim \mathcal{Z} = \mathcal{N}(0, I)$; we use a 512-dimensional multi-variate normal vector, $z \in \mathbb{R}^{512}$. Passing $z$ through the mapping network produces a latent vector $w = W(z)$ of the same dimensionality as $z$. As explained in [11], the primary purpose of $W$ is to learn to map noise to a linearly disentangled space, as this will allow for controllable and understandable synthesis. $W$ is coaxed into learning such a disentangled representation because it can only linearly modulate channels of the cosine features in each layer of the convolutional encoder (see details below). This means that $W$ must learn to map the random normal $\mathcal{Z}$-space into a $\mathcal{W}$-space that linearly disentangles common factors of speech variation.
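As a concrete illustration of the mapping network described above, here is a minimal PyTorch sketch. Only the 512-dimensional input/output size and the leaky ReLU activations come from the text; the depth (`n_layers`) and the negative slope are assumptions, and the class name is hypothetical.

```python
import torch
import torch.nn as nn

class MappingNetwork(nn.Module):
    """W: maps z ~ N(0, I) to the disentangled W-space, w = W(z)."""

    def __init__(self, dim: int = 512, n_layers: int = 4):
        super().__init__()
        layers = []
        for _ in range(n_layers):
            layers += [nn.Linear(dim, dim), nn.LeakyReLU(0.2)]
        self.net = nn.Sequential(*layers)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.net(z)  # w has the same dimensionality as z

# Example: a batch of 8 latent vectors mapped to W-space.
w = MappingNetwork()(torch.randn(8, 512))  # shape (8, 512)
```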
Convolutional encoder: The convolutional encoder begins by linearly projecting $w$ as the input to an FF layer [28]. We use the Gaussian Fourier feature mapping [28] and incorporate the transformation from StyleGAN3 [13]. The Gaussian FF layer samples a frequency and phase from a Gaussian distribution for each output channel. The layer then linearly projects the input vector to a vector of phases which are added to the random phases. The output is calculated as the cosine functions of these frequencies and phases, one frequency/phase for each output channel. The result is that $w$ is converted into a sequence of vectors at the output of the FF layer. This is iteratively passed through several Style Blocks. In each Style Block layer, the input sequence is passed through a modulated convolution layer [12] whereby the final convolution kernel is computed by multiplying the layer's learnt kernel with the style vector derived from $w$, broadcasted over the length of the kernel. To ensure the signal does not experience aliasing due to the non-linearity, the leaky ReLU layers are surrounded by layers responsible for anti-aliasing (explained below). All these layers comprise a Style Block, which is repeated in groups of 5, 4, 3, and finally 2 blocks. The last block in each group upsamples by $4\times$ instead of $2\times$, thereby increasing the sequence length by a factor of 2 for each group. A final 1D convolution projects the output from the last group into the audio feature space (e.g. log mel-spectrogram or HuBERT features [18]), as illustrated in the middle of Fig. 1.
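The two layer types described above can be sketched as follows. This is an illustrative PyTorch rendering, not the released implementation (see the repository linked in the abstract): the channel count, sequence length, frequency scale, and kernel size are assumptions, and weight demodulation and the anti-aliasing wrappers are omitted here.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianFourierFeatures(nn.Module):
    """Converts a single latent vector w into a fixed-length sequence of cosine
    features: one random frequency/phase per output channel, plus a per-channel
    phase offset linearly predicted from w."""

    def __init__(self, w_dim=512, channels=1024, seq_len=16, freq_std=1.0):
        super().__init__()
        self.register_buffer("freqs", torch.randn(channels) * freq_std)
        self.register_buffer("phases", torch.randn(channels))
        self.proj = nn.Linear(w_dim, channels)  # w -> per-channel phase offsets
        self.seq_len = seq_len

    def forward(self, w):                                   # w: (B, w_dim)
        t = torch.arange(self.seq_len, dtype=w.dtype, device=w.device)
        phase = (self.proj(w) + self.phases).unsqueeze(-1)  # (B, C, 1)
        angle = self.freqs.view(1, -1, 1) * t.view(1, 1, -1) + phase
        return torch.cos(2 * math.pi * angle)               # (B, C, L)

class ModulatedConv1d(nn.Module):
    """1D convolution whose kernel is the learnt kernel scaled by a style vector
    derived from w, broadcast over the kernel length (cf. [12])."""

    def __init__(self, w_dim, in_ch, out_ch, kernel_size=5):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, kernel_size) * 0.02)
        self.to_style = nn.Linear(w_dim, in_ch)

    def forward(self, x, w):                    # x: (B, in_ch, L), w: (B, w_dim)
        B, in_ch, L = x.shape
        style = self.to_style(w)                                  # (B, in_ch)
        kernel = self.weight.unsqueeze(0) * style.view(B, 1, in_ch, 1)
        # Grouped-conv trick: each batch item is convolved with its own kernel.
        out = F.conv1d(x.reshape(1, B * in_ch, L),
                       kernel.reshape(-1, in_ch, kernel.size(-1)),
                       padding=kernel.size(-1) // 2, groups=B)
        return out.reshape(B, -1, L)

# Shape check with hypothetical sizes:
w = torch.randn(2, 512)
feats = GaussianFourierFeatures()(w)              # (2, 1024, 16)
x = ModulatedConv1d(512, 1024, 512)(feats, w)     # (2, 512, 16)
```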
Anti-aliasing filters: From image synthesis with GANs [13], we know that the generator must include anti-aliasing filters for the signal propagating through the network to satisfy the Nyquist-Shannon sampling theorem. This is why, before and after a non-linearity, we include upsampling, low-pass filter (LPF), and downsampling layers (Fig. 1).
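A minimal sketch of such an anti-aliased non-linearity is given below, assuming a simple windowed-sinc low-pass filter, a fixed 2x up/down factor, and nearest-neighbour upsampling as a crude stand-in for zero-insertion upsampling; the paper's actual, speech-specific filter design (cutoffs, transition bands, per-layer factors) is not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AntiAliasedLeakyReLU(nn.Module):
    """Leaky ReLU wrapped in upsample -> LPF -> non-linearity -> LPF -> downsample,
    so the harmonics introduced by the non-linearity stay below the Nyquist limit."""

    def __init__(self, factor: int = 2, taps: int = 13):
        super().__init__()
        self.factor = factor
        # Windowed-sinc low-pass filter with cutoff at the original Nyquist rate.
        t = torch.arange(taps) - (taps - 1) / 2
        h = torch.sinc(t / factor) * torch.hann_window(taps, periodic=False)
        self.register_buffer("lpf", (h / h.sum()).view(1, 1, -1))

    def _lowpass(self, x):
        C = x.size(1)
        return F.conv1d(x, self.lpf.expand(C, -1, -1),
                        padding=self.lpf.size(-1) // 2, groups=C)

    def forward(self, x):                        # x: (B, C, L)
        x = F.interpolate(x, scale_factor=self.factor, mode="nearest")
        x = self._lowpass(x)                     # LPF before the non-linearity
        x = F.leaky_relu(x, 0.2)
        x = self._lowpass(x)                     # LPF after the non-linearity
        return x[..., :: self.factor]            # decimate back to length L
```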