
ARTICULATION GAN: UNSUPERVISED MODELING OF ARTICULATORY LEARNING
Gašper Beguš1∗, Alan Zhou2∗, Peter Wu1†, Gopala K. Anumanchipalli1†
1University of California, Berkeley, 2Johns Hopkins University
ABSTRACT
Generative deep neural networks are widely used for speech synthesis, but most existing models directly generate waveforms or spectral outputs. Humans, however, produce speech by controlling articulators, and the resulting speech sounds arise through the physical properties of sound propagation. We introduce the Articulatory Generator to the Generative Adversarial Network paradigm, a new unsupervised generative model of speech production/synthesis. The Articulatory Generator more closely mimics human speech production by learning to generate articulatory representations (electromagnetic articulography or EMA) in a fully unsupervised manner. A separate pre-trained physical model (ema2wav) then transforms the generated EMA representations into speech waveforms, which are sent to the Discriminator for evaluation. Articulatory analysis suggests that the network learns to control articulators in a manner similar to humans during speech production. Acoustic analysis of the outputs suggests that the network learns to generate words that are both present and absent in the training distribution. We additionally discuss implications of articulatory representations for cognitive models of human language and for speech technology in general.
Index Terms—articulatory phonetics, unsupervised learning,
electromagnetic articulography, deep generative learning
1. INTRODUCTION
Humans produce spoken language with articulatory gestures [1]. Speech sounds are generated by airflow from the lungs passing through the articulators, which causes the air pressure fluctuations that constitute speech. The main mechanism in speech production is thus control of the articulators and airflow [1]. During language acquisition, children need to learn to control articulators and produce articulatory gestures such that the generated sounds correspond to the sounds of the language they are exposed to.
This learning is complicated by the fact that sound is an entirely different modality from articulatory gestures. Children need to learn to control and move articulators from sound input alone, without direct access to the articulatory data of their caregivers. While some articulators are visible (such as the lips, the tongue tip, and jaw movement), many are not (the vocal folds, the tongue dorsum). There is debate on whether spoken language acquisition is fully unsupervised, due to direct and indirect negative evidence [2]. Articulatory learning, however, is likely fully unsupervised: caregivers ordinarily do not provide any explicit feedback about articulatory gestures to language-acquiring children.
Most models of human speech production output audio data of speech without articulatory representations. In actual speech, however, humans control articulators and airflow, while a separate physical process results in the sounds of speech.

∗Gašper Beguš and Alan Zhou contributed equally to this work. Corresponding author: Gašper Beguš (begus@berkeley.edu).
†G.K.A. and P.W. are supported by NSF #2106928.
To build a more realistic model of human spoken language, we propose a new deep learning architecture within the GAN framework [3, 4, 5, 6, 7, 8]. In our proposal, the decoder (the synthesizer or Generator network) learns to output approximations of human articulatory gestures while never accessing articulatory data. The generated articulatory gestures are represented with thirteen channels that match the twelve channels used to record human articulators during electromagnetic articulography (EMA), plus an additional channel for voicing. The generated articulatory movements are then passed through a separate physical model of sound generation that takes the articulatory channels and converts them into waveforms. This physical model is taken from a pre-trained EMA-to-speech model (ema2wav), which transforms electromagnetic articulography recordings into speech waveforms [9]. This physical component models sound propagation and is cognitively irrelevant, which is why its weights are not updated during training.
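To make the pipeline concrete, the following is a minimal sketch of the forward pass. The module definitions, layer sizes, and the toy EMA2Wav stand-in are illustrative assumptions rather than our implementation; in particular, the real physical model is the pre-trained ema2wav of [9], not the one-layer stand-in shown here.

```python
# Illustrative sketch of the pipeline (hypothetical modules, not the paper's code).
import torch
import torch.nn as nn

N_CHANNELS = 13   # 12 EMA articulator channels + 1 voicing channel
N_FRAMES = 200    # assumed number of generated articulatory frames

class ArticulatoryGenerator(nn.Module):
    """Maps latent noise to 13 articulatory channels over time."""
    def __init__(self, latent_dim=100):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 512), nn.ReLU(),
            nn.Linear(512, N_CHANNELS * N_FRAMES),
        )
    def forward(self, z):
        return self.net(z).view(-1, N_CHANNELS, N_FRAMES)

class EMA2Wav(nn.Module):
    """Stand-in for the pre-trained EMA-to-speech physical model;
    its weights are frozen and never updated during GAN training."""
    def __init__(self, wav_len=16000):
        super().__init__()
        self.proj = nn.Conv1d(N_CHANNELS, 1, kernel_size=1)
        self.wav_len = wav_len
    def forward(self, ema):
        wav = self.proj(ema)                                   # (batch, 1, N_FRAMES)
        return nn.functional.interpolate(wav, self.wav_len)    # upsample to waveform length

class Discriminator(nn.Module):
    """Scores waveforms; it never sees the articulatory channels."""
    def __init__(self, wav_len=16000):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(wav_len, 1))
    def forward(self, wav):
        return self.net(wav)

G, ema2wav, D = ArticulatoryGenerator(), EMA2Wav(), Discriminator()
for p in ema2wav.parameters():      # the physical model is fixed
    p.requires_grad_(False)

z = torch.randn(4, 100)             # latent noise
fake_wav = ema2wav(G(z))            # articulatory channels -> speech waveform
score = D(fake_wav)                 # the Discriminator evaluates audio only
```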
Articulatory learning in this model needs to happen in a fully unsupervised manner. The Articulatory Generator needs to transform random noise in the latent space into the thirteen channels such that the independent pre-trained EMA-to-speech physical model will generate speech. The Discriminator receives waveform data synthesized from the Articulatory Generator's generated channels. The Generator in our model never directly accesses articulatory data. Like humans, it needs to learn to control articulators without ever directly accessing them (e.g., the vocal folds or tongue dorsum are never visible during speech acquisition). The only information available to humans during acquisition, and to our model during training, is the auditory feedback from the perception component of speech, which corresponds to the Discriminator network in our model.
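Continuing the sketch above, one adversarial training step could look as follows. The WGAN-style losses, Adam optimizers, learning rates, and the real_wav batch are assumptions for illustration; the point is that only the Generator and Discriminator receive parameter updates, while gradients still flow through the frozen physical model back to the Generator.

```python
# Continuation of the previous sketch (assumes G, ema2wav, D defined above).
import torch

opt_G = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=1e-4)
# No optimizer is created for ema2wav, so the physical model stays fixed.

def train_step(real_wav, latent_dim=100):
    # Discriminator update: distinguish real speech from synthesized speech.
    z = torch.randn(real_wav.size(0), latent_dim)
    fake_wav = ema2wav(G(z)).detach()
    d_loss = D(fake_wav).mean() - D(real_wav).mean()
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()

    # Generator update: gradients flow back through the frozen ema2wav model
    # into the Articulatory Generator, which never sees articulatory data.
    z = torch.randn(real_wav.size(0), latent_dim)
    g_loss = -D(ema2wav(G(z))).mean()
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()
    return d_loss.item(), g_loss.item()

real_wav = torch.randn(4, 1, 16000)   # placeholder for a batch of real speech audio
train_step(real_wav)
```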
1.1. Prior work
Speech synthesis from articulatory representations has recently been performed using deep neural networks [10, 11, 12, 13, 14, 9]. The objective in most existing proposals, however, is to synthesize waveforms from articulatory representations in a supervised setting, rather than the fully unsupervised generation of the articulatory representations themselves. [15, 16] propose an autoencoder model that learns to encode and decode between motor parameters and auditory representations in an unsupervised manner. However, that model trains the encoding and decoding components simultaneously and focuses on the relationship between auditory representations and a motor latent space. By contrast, our GAN model is trained with a static pre-trained articulatory model, similar to how children learn to speak with a full set of articulators. In addition, rather than decoding back and forth between motor and auditory information, our model is able to generate articulatory parameters directly by sampling from a general-purpose latent space. To our knowledge, this paper presents the first architecture in