
Fig. 1: The ASGAN generator (left) and discriminator (right). FF, LPF, Conv1D indicate Fourier feature [13], low-pass filter, and 1D convolution layers, respectively. The numbers above linear and convolutional layers indicate the number of output features/channels for that layer. Stacked blocks indicate a layer repeated sequentially, with the number of repeats indicated above the block (e.g. “x3”).
step – it resembles coherent speech. Autoregressive and diffusion
models are relatively slow because they require repeated forward
passes through the model during inference.
Earlier studies [1, 10] attempted to use GANs [9] for unconditional speech synthesis, which has the advantage of requiring only a single pass through the model. While results showed some initial promise, performance was poor in terms of speech quality and diversity, with the more recent diffusion models performing much better [7]. However, there have been substantial improvements in GAN-based modelling for image synthesis in the intervening years [11, 12, 14]. Our goal is to improve the performance of the earlier GAN-based unconditional speech synthesis models by adapting lessons from these recent image synthesis studies.
Some of these innovations in GANs are modality-agnostic: R1 regularization [23] and exponential moving averaging of generator weights [24] can be directly transferred from the vision domain to speech. Other techniques, such as the carefully designed anti-aliasing filters between layers in StyleGAN3 [13], require specific adaptation; in contrast to images, there is little meaningful information in speech below 300 Hz, necessitating a redesign of the anti-aliasing filters.
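To make the first of these techniques concrete, the sketch below shows one common way to compute the R1 penalty (a gradient penalty on real samples only [23]) in PyTorch; the function name and the coefficient `r1_gamma` are illustrative, not taken from this paper.

```python
import torch

def r1_penalty(discriminator, real_x, r1_gamma=10.0):
    """R1 regularization: penalize the gradient of D w.r.t. real inputs.

    A minimal sketch; `discriminator` is any module mapping a batch of real
    feature sequences `real_x` to per-sample scores.
    """
    real_x = real_x.detach().requires_grad_(True)
    scores = discriminator(real_x)
    # Gradient of the summed scores w.r.t. the real inputs.
    (grads,) = torch.autograd.grad(
        outputs=scores.sum(), inputs=real_x, create_graph=True
    )
    penalty = grads.pow(2).flatten(start_dim=1).sum(dim=1).mean()
    return 0.5 * r1_gamma * penalty
```

In StyleGAN-style training this term is typically added to the discriminator loss only every few steps (lazy regularization) to reduce cost.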
In a closely related research direction, Beguš [10, 25] has been studying how GAN-based unconditional speech synthesis models internally perform lexical and phonological learning, and how this relates to human learning. These studies, however, have relied on the older GAN synthesis models. We hope that by developing better-performing GANs for unconditional speech synthesis, such investigations will also be improved. Recently, [26] attempted to directly use StyleGAN2 for conditional and unconditional synthesis of emotional vocal bursts. This further motivates a reinvestigation of GANs, but here we look specifically at the generation of speech rather than paralinguistic sounds.
3. ASGAN: AUDIO STYLE GAN
Our model is based on the StyleGAN family of models [11] for image synthesis. We adapt and extend the approach to audio, and therefore dub our model AudioStyleGAN (ASGAN). The model follows the setup of a standard GAN with a single generator network G and a single discriminator D [9]. The generator G accepts a vector z sampled from a normal distribution and processes it into a sequence of speech features X. In this work, we restrict the sequence of speech features X to always have a fixed pre-specified duration. The discriminator D accepts a sequence of speech features X and yields a scalar output. D is optimized to raise its output for X sampled from real data and lower its output for X produced by the generator. Meanwhile, G is optimized to maximize D(X) for X sampled from the generator, i.e. when X = G(z). The features X are converted to a waveform using a pretrained HiFi-GAN vocoder [27]. During training, a new adaptive discriminator updating technique is added to ensure stability and convergence, as discussed in Sec. 4.
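For concreteness, a minimal sketch of such an adversarial objective is shown below, using the non-saturating logistic loss common in the StyleGAN family; the exact loss used by ASGAN is not specified in this excerpt, and the function and variable names are illustrative.

```python
import torch.nn.functional as F

def discriminator_loss(D, G, real_X, z):
    """Non-saturating logistic GAN loss for D (a sketch, not the paper's
    exact formulation): raise D's output on real X, lower it on G(z)."""
    fake_X = G(z).detach()
    loss_real = F.softplus(-D(real_X)).mean()
    loss_fake = F.softplus(D(fake_X)).mean()
    return loss_real + loss_fake

def generator_loss(D, G, z):
    """G maximizes D(G(z)), i.e. minimizes softplus(-D(G(z)))."""
    return F.softplus(-D(G(z))).mean()
```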
3.1. Generator
The architecture of the generator G is shown on the left of Fig. 1. It consists of a latent mapping network W that converts z to a disentangled latent space, a special Fourier feature (FF) layer which converts a single vector from this latent space into a sequence of cosine features of fixed length, and finally a convolutional encoder which iteratively refines the cosine features into the final speech features X.
Mapping network: The mapping network W is a multi-layer perceptron with leaky ReLU activations. As input it takes a vector sampled from a normal distribution z ~ Z = N(0, I); we use a 512-dimensional multivariate normal vector, z ∈ R^512. Passing z through the mapping network produces a latent vector w = W(z) of the same dimensionality as z. As explained in [11], the primary purpose of W is to learn to map noise to a linearly disentangled space, as this will allow for controllable and understandable synthesis. W is coaxed into learning such a disentangled representation because it can only linearly modulate channels of the cosine features in each layer of the convolutional encoder (see details below). This means that W must learn to map the random normal Z-space into a W-space that linearly disentangles common factors of speech variation.
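A mapping network of this form can be written in a few lines of PyTorch, as sketched below; the number of layers, the leaky-ReLU slope, and the input normalization are illustrative choices, not taken from the paper.

```python
import torch
import torch.nn as nn

class MappingNetwork(nn.Module):
    """W: maps z ~ N(0, I) to a disentangled latent w of the same size.

    Minimal sketch; depth and leaky-ReLU slope are illustrative assumptions.
    """

    def __init__(self, dim=512, num_layers=4, negative_slope=0.2):
        super().__init__()
        layers = []
        for _ in range(num_layers):
            layers += [nn.Linear(dim, dim), nn.LeakyReLU(negative_slope)]
        self.net = nn.Sequential(*layers)

    def forward(self, z):
        # Normalizing z first (as in StyleGAN) is common but optional.
        z = z / z.norm(dim=-1, keepdim=True).clamp_min(1e-8) * (z.shape[-1] ** 0.5)
        return self.net(z)

# Usage: w = MappingNetwork()(torch.randn(8, 512))
```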
Convolutional encoder: The convolutional encoder begins by linearly projecting w as the input to an FF layer [28]. We use the Gaussian Fourier feature mapping [28] and incorporate the transformation from StyleGAN3 [13]. The Gaussian FF layer samples a frequency and phase from a Gaussian distribution for each output channel. The layer then linearly projects the input vector to a vector of phases, which are added to the random phases. The output is calculated as the cosine functions of these frequencies and phases, one frequency/phase pair for each output channel. The result is that w is converted into a sequence of vectors at the output of the FF layer. This is iteratively passed through several Style Blocks.
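The sketch below shows one way to implement such a Gaussian Fourier feature layer in PyTorch. The class name, sequence length, channel count, and frequency scale are illustrative assumptions; the exact parameterization in ASGAN and StyleGAN3 may differ.

```python
import math
import torch
import torch.nn as nn

class GaussianFourierFeatures(nn.Module):
    """Map a single latent vector w to a (seq_len, channels) cosine sequence.

    Sketch of a Gaussian FF layer: each output channel gets a random
    frequency and phase (sampled once at initialization); w only shifts the
    phases through a linear projection. Names and scales are illustrative.
    """

    def __init__(self, w_dim=512, channels=1024, seq_len=64, freq_std=1.0):
        super().__init__()
        self.register_buffer("freqs", torch.randn(channels) * freq_std)
        self.register_buffer("phases", torch.randn(channels))
        self.to_phase = nn.Linear(w_dim, channels)  # w -> per-channel phase shift
        self.seq_len = seq_len

    def forward(self, w):
        # Time axis normalized to [0, 1); shape (seq_len, 1).
        t = torch.arange(self.seq_len, device=w.device).unsqueeze(1) / self.seq_len
        phase = self.phases + self.to_phase(w).unsqueeze(1)        # (B, 1, C)
        return torch.cos(2 * math.pi * (self.freqs * t + phase))   # (B, L, C)
```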
In each Style Block layer, the input sequence is passed through a modulated convolution layer [12], whereby the final convolution kernel is computed by multiplying the layer’s learnt kernel with the style vector derived from w, broadcast over the length of the kernel. To ensure the signal does not experience aliasing due to the non-linearity, the leaky ReLU layers are surrounded by layers responsible for anti-aliasing (explained below). All these layers comprise a Style Block, which is repeated in groups of 5, 4, 3, and finally 2 blocks. The last block in each group upsamples by 4× instead of 2×, thereby increasing the sequence length by a factor of 2 for each group. A final 1D convolution projects the output from the last group into the audio feature space (e.g. log mel-spectrogram or HuBERT features [18]), as illustrated in the middle of Fig. 1.
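A minimal sketch of a modulated 1D convolution in this spirit is given below, following the modulation/demodulation scheme of StyleGAN2 [12]; whether ASGAN uses demodulation, and the exact kernel size, are assumptions here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModulatedConv1d(nn.Module):
    """Conv1D whose kernel is scaled per sample by a style derived from w.

    Sketch following StyleGAN2-style modulation [12]; demodulation and the
    kernel size are illustrative choices, not confirmed by the paper.
    """

    def __init__(self, w_dim, in_ch, out_ch, kernel_size=5):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, kernel_size))
        self.to_style = nn.Linear(w_dim, in_ch)  # per-input-channel scales

    def forward(self, x, w):
        # x: (B, in_ch, L), w: (B, w_dim)
        b = x.shape[0]
        style = self.to_style(w).view(b, 1, -1, 1)             # (B, 1, in, 1)
        weight = self.weight.unsqueeze(0) * style              # (B, out, in, k)
        # Demodulate so output magnitudes stay roughly unit-scale.
        demod = torch.rsqrt(weight.pow(2).sum(dim=[2, 3], keepdim=True) + 1e-8)
        weight = weight * demod
        # Grouped-conv trick: fold the batch into the channel dimension.
        weight = weight.view(-1, weight.shape[2], weight.shape[3])
        x = x.reshape(1, -1, x.shape[-1])
        out = F.conv1d(x, weight, padding="same", groups=b)
        return out.view(b, -1, out.shape[-1])
```

Folding the batch into the channel dimension with a grouped convolution is one standard way to apply a different (style-scaled) kernel to each sample in the batch.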
Anti-aliasing filters: From image synthesis with GANs [13], we know that the generator must include anti-aliasing filters for the signal propagating through the network to satisfy the Nyquist-Shannon sampling theorem. This is why, before and after a non-linearity, we include upsampling, low-pass filter (LPF), and downsampling layers