
ARTICULATION GAN: UNSUPERVISED MODELING OF ARTICULATORY LEARNING
Gašper Beguš1∗, Alan Zhou2∗, Peter Wu1†, Gopala K. Anumanchipalli1†
1University of California, Berkeley, 2Johns Hopkins University
ABSTRACT
Generative deep neural networks are widely used for speech synthesis, but most existing models directly generate waveforms or spectral outputs. Humans, however, produce speech by controlling articulators, and the resulting speech sounds arise through the physical properties of sound propagation. We introduce the Articulatory Generator to the Generative Adversarial Network paradigm, a new unsupervised generative model of speech production/synthesis. The Articulatory Generator more closely mimics human speech production by learning to generate articulatory representations (electromagnetic articulography or EMA) in a fully unsupervised manner. A separate pre-trained physical model (ema2wav) then transforms the generated EMA representations into speech waveforms, which are sent to the Discriminator for evaluation. Articulatory analysis suggests that the network learns to control articulators in a manner similar to humans during speech production. Acoustic analysis of the outputs suggests that the network learns to generate words that are both present and absent in the training distribution. We additionally discuss implications of articulatory representations for cognitive models of human language and for speech technology in general.
Index Terms—articulatory phonetics, unsupervised learning,
electromagnetic articulography, deep generative learning
1. INTRODUCTION
Humans produce spoken language with articulatory gestures [1]. Speech sounds are generated by airflow from the lungs passing through the articulators, which causes the air pressure fluctuations that constitute speech. The main mechanism in speech production is thus control of the articulators and airflow [1]. During language acquisition, children need to learn to control articulators and produce articulatory gestures such that the generated sounds correspond to the sounds of the language they are exposed to.
This learning is complicated by the fact that sound is an entirely different modality from articulatory gestures. Children need to learn to control and move articulators from sound input alone, without direct access to the articulatory data of their caregivers. While some articulators are visible (such as the lips, the tongue tip, and jaw movement), many are not (the vocal folds, the tongue dorsum). There is debate on whether spoken language acquisition is fully unsupervised, due to direct and indirect negative evidence [2]. Articulatory learning, however, is likely fully unsupervised: caregivers ordinarily do not provide any explicit feedback about articulatory gestures to language-acquiring children.
Most models of human speech production output audio data of speech without articulatory representations. In actual speech, however, humans control articulators and airflow, while a separate physical process results in the sounds of speech.

∗Gašper Beguš and Alan Zhou contributed equally to this work. Corresponding author: Gašper Beguš (begus@berkeley.edu).
†G.K.A. and P.W. are supported by NSF #2106928.
To build a more realistic model of human spoken language, we propose a new deep learning architecture within the GAN framework [3, 4, 5, 6, 7, 8]. In our proposal, the decoder (the synthesizer or Generator network) learns to output approximations of human articulatory gestures while never accessing articulatory data. The generated articulatory gestures are represented with thirteen channels that match the twelve channels used to record human articulators during electromagnetic articulography (EMA), plus an additional channel for voicing. The generated articulatory movements are then passed through a separate physical model of sound generation that takes the articulatory channels and converts them into waveforms. This physical model is taken from a pre-trained EMA-to-speech model (ema2wav), which transforms electromagnetic articulography recordings into speech waveforms [9]. This physical component models sound propagation and is cognitively irrelevant, which is why its weights are not updated during training.
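To make the pipeline concrete, the following is a minimal sketch of the forward pass. The module definitions, layer sizes, and the toy EMA2Wav stand-in are illustrative assumptions rather than our implementation; in particular, the real physical model is the pre-trained ema2wav of [9], not the one-layer stand-in shown here.

```python
# Illustrative sketch of the pipeline (hypothetical modules, not the paper's code).
import torch
import torch.nn as nn

N_CHANNELS = 13   # 12 EMA articulator channels + 1 voicing channel
N_FRAMES = 200    # assumed number of generated articulatory frames

class ArticulatoryGenerator(nn.Module):
    """Maps latent noise to 13 articulatory channels over time."""
    def __init__(self, latent_dim=100):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 512), nn.ReLU(),
            nn.Linear(512, N_CHANNELS * N_FRAMES),
        )
    def forward(self, z):
        return self.net(z).view(-1, N_CHANNELS, N_FRAMES)

class EMA2Wav(nn.Module):
    """Stand-in for the pre-trained EMA-to-speech physical model;
    its weights are frozen and never updated during GAN training."""
    def __init__(self, wav_len=16000):
        super().__init__()
        self.proj = nn.Conv1d(N_CHANNELS, 1, kernel_size=1)
        self.wav_len = wav_len
    def forward(self, ema):
        wav = self.proj(ema)                                   # (batch, 1, N_FRAMES)
        return nn.functional.interpolate(wav, self.wav_len)    # upsample to waveform length

class Discriminator(nn.Module):
    """Scores waveforms; it never sees the articulatory channels."""
    def __init__(self, wav_len=16000):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(wav_len, 1))
    def forward(self, wav):
        return self.net(wav)

G, ema2wav, D = ArticulatoryGenerator(), EMA2Wav(), Discriminator()
for p in ema2wav.parameters():      # the physical model is fixed
    p.requires_grad_(False)

z = torch.randn(4, 100)             # latent noise
fake_wav = ema2wav(G(z))            # articulatory channels -> speech waveform
score = D(fake_wav)                 # the Discriminator evaluates audio only
```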
Articulatory learning in this model needs to happen in a fully unsupervised manner. The Articulatory Generator needs to transform random noise in the latent space into the thirteen channels such that the independent pre-trained EMA-to-speech physical model will generate speech. The Discriminator receives waveform data synthesized from the Articulatory Generator's generated channels. The Generator in our model never directly accesses articulatory data. Like humans, it needs to learn to control articulators without ever directly accessing them (e.g., the vocal folds or tongue dorsum are never visible during speech acquisition). The only information available to humans during acquisition, and to our model during training, is the auditory feedback from the perception component of speech, which corresponds to the Discriminator network in our model.
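Continuing the sketch above, one adversarial training step could look as follows. The WGAN-style losses, Adam optimizers, learning rates, and the real_wav batch are assumptions for illustration; the point is that only the Generator and Discriminator receive parameter updates, while gradients still flow through the frozen physical model back to the Generator.

```python
# Continuation of the previous sketch (assumes G, ema2wav, D defined above).
import torch

opt_G = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=1e-4)
# No optimizer is created for ema2wav, so the physical model stays fixed.

def train_step(real_wav, latent_dim=100):
    # Discriminator update: distinguish real speech from synthesized speech.
    z = torch.randn(real_wav.size(0), latent_dim)
    fake_wav = ema2wav(G(z)).detach()
    d_loss = D(fake_wav).mean() - D(real_wav).mean()
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()

    # Generator update: gradients flow back through the frozen ema2wav model
    # into the Articulatory Generator, which never sees articulatory data.
    z = torch.randn(real_wav.size(0), latent_dim)
    g_loss = -D(ema2wav(G(z))).mean()
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()
    return d_loss.item(), g_loss.item()

real_wav = torch.randn(4, 1, 16000)   # placeholder for a batch of real speech audio
train_step(real_wav)
```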
1.1. Prior work
Speech synthesis from articulatory representations has recently been performed using deep neural networks [10, 11, 12, 13, 14, 9]. The objective in most existing proposals, however, is to synthesize waveforms from articulatory representations in a supervised setting, rather than the fully unsupervised generation of the articulatory representations themselves. [15, 16] propose an autoencoder model that learns to encode and decode between motor parameters and auditory representations in an unsupervised manner. However, that model trains the encoding and decoding components simultaneously and focuses on the relationship between auditory representations and a motor latent space. By contrast, our GAN model is trained with a static pre-trained articulatory model, similar to how children learn to speak with a full set of articulators. In addition, rather than decoding back and forth between motor and auditory information, our model is able to generate articulatory parameters directly by sampling from a general-purpose latent space. To our knowledge, this paper presents the first architecture in