
Fig. 1: The ASGAN generator (left) and discriminator (right). FF, LPF, Conv1D indicate Fourier feature [13], low-pass filter, and 1D convolution layers, respectively. The numbers above linear and convolutional layers indicate the number of output features/channels for that layer. Stacked blocks indicate a layer repeated sequentially, with the number of repeats indicated above the block (e.g. “x3”).
step – it resembles coherent speech. Autoregressive and diffusion
models are relatively slow because they require repeated forward
passes through the model during inference.
Earlier studies [1, 10] attempted to use GANs [9] for unconditional speech synthesis, which has the advantage of requiring only a single pass through the model. While results showed some initial promise, performance was poor in terms of speech quality and diversity, with the more recent diffusion models performing much better [7]. However, there have been substantial improvements in GAN-based modelling for image synthesis in the intervening years [11, 12, 14]. Our goal is to improve the performance of the earlier GAN-based unconditional speech synthesis models by adapting lessons from these recent image synthesis studies.
Some of these innovations in GANs are modality-agnostic: R1 regularization [23] and exponential moving averaging of generator weights [24] can be directly transferred from the vision domain to speech. Other techniques, such as the carefully designed anti-aliasing filters between layers in StyleGAN3 [13], require specific adaptation; in contrast to images, there is little meaningful information in speech below 300 Hz, necessitating a redesign of the anti-aliasing filters.
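To make the first of these techniques concrete, the sketch below shows one common way to compute the R1 penalty (a gradient penalty on real samples only [23]) in PyTorch; the function name and the coefficient `r1_gamma` are illustrative, not taken from this paper.

```python
import torch

def r1_penalty(discriminator, real_x, r1_gamma=10.0):
    """R1 regularization: penalize the gradient of D w.r.t. real inputs.

    A minimal sketch; `discriminator` is any module mapping a batch of real
    feature sequences `real_x` to per-sample scores.
    """
    real_x = real_x.detach().requires_grad_(True)
    scores = discriminator(real_x)
    # Gradient of the summed scores w.r.t. the real inputs.
    (grads,) = torch.autograd.grad(
        outputs=scores.sum(), inputs=real_x, create_graph=True
    )
    penalty = grads.pow(2).flatten(start_dim=1).sum(dim=1).mean()
    return 0.5 * r1_gamma * penalty
```

In StyleGAN-style training this term is typically added to the discriminator loss only every few steps (lazy regularization) to reduce cost.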
In a closely related research direction, Beguš [10, 25] has been studying how GAN-based unconditional speech synthesis models internally perform lexical and phonological learning, and how this relates to human learning. These studies, however, have relied on the older GAN synthesis models. We hope that by developing better-performing GANs for unconditional speech synthesis, such investigations will also be improved. Recently, [26] attempted to directly use StyleGAN2 for conditional and unconditional synthesis of emotional vocal bursts. This further motivates a reinvestigation of GANs, but here we look specifically at the generation of speech rather than paralinguistic sounds.
3. ASGAN: AUDIO STYLE GAN
Our model is based on the StyleGAN family of models [11] for image synthesis. We adapt and extend the approach to audio, and therefore dub our model AudioStyleGAN (ASGAN). The model follows the setup of a standard GAN with a single generator network G and a single discriminator D [9]. The generator G accepts a vector z sampled from a normal distribution and processes it into a sequence of speech features X. In this work, we restrict the sequence of speech features X to always have a fixed pre-specified duration. The discriminator D accepts a sequence of speech features X and yields a scalar output. D is optimized to raise its output for X sampled from real data and lower its output for X produced by the generator. Meanwhile, G is optimized to maximize D(X) for X sampled from the generator, i.e. when X = G(z). The features X are converted to a waveform using a pretrained HiFi-GAN vocoder [27]. During training, a new adaptive discriminator updating technique is added to ensure stability and convergence, as discussed in Sec. 4.
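For concreteness, a minimal sketch of such an adversarial objective is shown below, using the non-saturating logistic loss common in the StyleGAN family; the exact loss used by ASGAN is not specified in this excerpt, and the function and variable names are illustrative.

```python
import torch.nn.functional as F

def discriminator_loss(D, G, real_X, z):
    """Non-saturating logistic GAN loss for D (a sketch, not the paper's
    exact formulation): raise D's output on real X, lower it on G(z)."""
    fake_X = G(z).detach()
    loss_real = F.softplus(-D(real_X)).mean()
    loss_fake = F.softplus(D(fake_X)).mean()
    return loss_real + loss_fake

def generator_loss(D, G, z):
    """G maximizes D(G(z)), i.e. minimizes softplus(-D(G(z)))."""
    return F.softplus(-D(G(z))).mean()
```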
3.1. Generator
The architecture of the generator G is shown on the left of Fig. 1. It consists of a latent mapping network W that converts z to a disentangled latent space, a special Fourier feature (FF) layer which converts a single vector from this latent space into a sequence of cosine features of fixed length, and finally a convolutional encoder which iteratively refines the cosine features into the final speech features X.
Mapping network: The mapping network W is a multi-layer perceptron with leaky ReLU activations. As input it takes a vector sampled from a normal distribution z ~ Z = N(0, I); we use a 512-dimensional multivariate normal vector, z ∈ R^512. Passing z through the mapping network produces a latent vector w = W(z) of the same dimensionality as z. As explained in [11], the primary purpose of W is to learn to map noise to a linearly disentangled space, as this will allow for controllable and understandable synthesis. W is coaxed into learning such a disentangled representation because it can only linearly modulate channels of the cosine features in each layer of the convolutional encoder (see details below). This means that W must learn to map the random normal Z-space into a W-space that linearly disentangles common factors of speech variation.
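A mapping network of this form can be written in a few lines of PyTorch, as sketched below; the number of layers, the leaky-ReLU slope, and the input normalization are illustrative choices, not taken from the paper.

```python
import torch
import torch.nn as nn

class MappingNetwork(nn.Module):
    """W: maps z ~ N(0, I) to a disentangled latent w of the same size.

    Minimal sketch; depth and leaky-ReLU slope are illustrative assumptions.
    """

    def __init__(self, dim=512, num_layers=4, negative_slope=0.2):
        super().__init__()
        layers = []
        for _ in range(num_layers):
            layers += [nn.Linear(dim, dim), nn.LeakyReLU(negative_slope)]
        self.net = nn.Sequential(*layers)

    def forward(self, z):
        # Normalizing z first (as in StyleGAN) is common but optional.
        z = z / z.norm(dim=-1, keepdim=True).clamp_min(1e-8) * (z.shape[-1] ** 0.5)
        return self.net(z)

# Usage: w = MappingNetwork()(torch.randn(8, 512))
```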
Convolutional encoder: The convolutional encoder begins by linearly projecting w as the input to an FF layer [28]. We use the Gaussian Fourier feature mapping [28] and incorporate the transformation from StyleGAN3 [13]. The Gaussian FF layer samples a frequency and phase from a Gaussian distribution for each output channel. The layer then linearly projects the input vector to a vector of phases, which are added to the random phases. The output is calculated as the cosine functions of these frequencies and phases, one frequency/phase pair for each output channel. The result is that w is converted into a sequence of vectors at the output of the FF layer. This is iteratively passed through several Style Blocks.
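The sketch below shows one way to implement such a Gaussian Fourier feature layer in PyTorch. The class name, sequence length, channel count, and frequency scale are illustrative assumptions; the exact parameterization in ASGAN and StyleGAN3 may differ.

```python
import math
import torch
import torch.nn as nn

class GaussianFourierFeatures(nn.Module):
    """Map a single latent vector w to a (seq_len, channels) cosine sequence.

    Sketch of a Gaussian FF layer: each output channel gets a random
    frequency and phase (sampled once at initialization); w only shifts the
    phases through a linear projection. Names and scales are illustrative.
    """

    def __init__(self, w_dim=512, channels=1024, seq_len=64, freq_std=1.0):
        super().__init__()
        self.register_buffer("freqs", torch.randn(channels) * freq_std)
        self.register_buffer("phases", torch.randn(channels))
        self.to_phase = nn.Linear(w_dim, channels)  # w -> per-channel phase shift
        self.seq_len = seq_len

    def forward(self, w):
        # Time axis normalized to [0, 1); shape (seq_len, 1).
        t = torch.arange(self.seq_len, device=w.device).unsqueeze(1) / self.seq_len
        phase = self.phases + self.to_phase(w).unsqueeze(1)        # (B, 1, C)
        return torch.cos(2 * math.pi * (self.freqs * t + phase))   # (B, L, C)
```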
In each Style Block layer, the input sequence is passed through a modulated convolution layer [12], whereby the final convolution kernel is computed by multiplying the layer’s learnt kernel with the style vector derived from w, broadcast over the length of the kernel. To ensure the signal does not experience aliasing due to the non-linearity, the leaky ReLU layers are surrounded by layers responsible for anti-aliasing (explained below). All these layers comprise a Style Block, which is repeated in groups of 5, 4, 3, and finally 2 blocks. The last block in each group upsamples by 4× instead of 2×, thereby increasing the sequence length by a factor of 2 for each group. A final 1D convolution projects the output from the last group into the audio feature space (e.g. log mel-spectrogram or HuBERT features [18]), as illustrated in the middle of Fig. 1.
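A minimal sketch of a modulated 1D convolution in this spirit is given below, following the modulation/demodulation scheme of StyleGAN2 [12]; whether ASGAN uses demodulation, and the exact kernel size, are assumptions here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModulatedConv1d(nn.Module):
    """Conv1D whose kernel is scaled per sample by a style derived from w.

    Sketch following StyleGAN2-style modulation [12]; demodulation and the
    kernel size are illustrative choices, not confirmed by the paper.
    """

    def __init__(self, w_dim, in_ch, out_ch, kernel_size=5):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, kernel_size))
        self.to_style = nn.Linear(w_dim, in_ch)  # per-input-channel scales

    def forward(self, x, w):
        # x: (B, in_ch, L), w: (B, w_dim)
        b = x.shape[0]
        style = self.to_style(w).view(b, 1, -1, 1)             # (B, 1, in, 1)
        weight = self.weight.unsqueeze(0) * style              # (B, out, in, k)
        # Demodulate so output magnitudes stay roughly unit-scale.
        demod = torch.rsqrt(weight.pow(2).sum(dim=[2, 3], keepdim=True) + 1e-8)
        weight = weight * demod
        # Grouped-conv trick: fold the batch into the channel dimension.
        weight = weight.view(-1, weight.shape[2], weight.shape[3])
        x = x.reshape(1, -1, x.shape[-1])
        out = F.conv1d(x, weight, padding="same", groups=b)
        return out.view(b, -1, out.shape[-1])
```

Folding the batch into the channel dimension with a grouped convolution is one standard way to apply a different (style-scaled) kernel to each sample in the batch.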
Anti-aliasing filters: From image synthesis with GANs [13], we know that the generator must include anti-aliasing filters for the signal propagating through the network to satisfy the Nyquist-Shannon sampling theorem. This is why, before and after a non-linearity, we include upsampling, low-pass filter (LPF), and downsampling layers