EVIDENCE OF VOCAL TRACT ARTICULATION IN SELF-SUPERVISED LEARNING OF SPEECH
Cheol Jun Cho1, Peter Wu1, Abdelrahman Mohamed2, Gopala K. Anumanchipalli1
1UC Berkeley, EECS, CA
2Meta AI
ABSTRACT
Recent self-supervised learning (SSL) models have proven to learn rich representations of speech, which can readily be utilized by diverse downstream tasks. To understand this utility, various analyses have been conducted on speech SSL models to reveal what information is encoded in the learned representations and how. Although previous analyses extensively cover acoustic, phonetic, and semantic perspectives, the physical grounding of these representations in speech production has not yet received full attention. To bridge this gap, we conduct a comprehensive analysis linking speech representations to articulatory trajectories measured by electromagnetic articulography (EMA). Our analysis is based on a linear probing approach in which we measure an articulatory score, defined as the average correlation between a linear mapping of the representations and the EMA traces. We analyze a set of SSL models selected from the leaderboard of the SUPERB benchmark [1] and perform further layer-wise analyses on the two most successful models, Wav2Vec 2.0 [2] and HuBERT [3]. Surprisingly, representations from recent speech SSL models are highly correlated with EMA traces (best: r = 0.81), and only 5 minutes of data are sufficient to train a linear model with high performance (r = 0.77). Our findings suggest that SSL models learn representations that align closely with continuous articulations, providing a novel insight into speech SSL.
Index Terms— Speech, Self-supervised learning, Electromagnetic articulography (EMA), Speech representation, Probing analysis, Acoustic-to-articulatory inversion
1. INTRODUCTION
Self-supervised learning (SSL) has emerged as a pre-training method for learning representations without requiring large amounts of labeled data. Recently proposed SSL models for speech provide rich representations that can readily be utilized for a broad range of spoken language tasks. When fine-tuned on downstream tasks, SSL-based models are able to surpass supervised-only models [4, 1]. Understanding how SSL models work is crucial to explaining this success and to improving speech SSL. Previous studies have revealed acoustic, phonetic, and semantic information encoded in the representations of speech SSL models, using diverse analytic tools including (non-)linear probing, mutual information, and canonical correlation analysis [5, 3, 6, 7, 8]. Although those analyses have provided insights into the information processing within speech SSL models, the models still largely remain black boxes. Since the advent of data-driven deep learning strategies for spoken language engineering, models have grown increasingly disconnected from human speech mechanisms and from insights into speech production.

[Fig. 1. General framework of our analysis approach: linear probing of representations from pre-trained SSL models on EMA. Tracked articulators: Upper Lip (UL), Lower Lip (LL), Lower Incisor (LI), Tongue Tip (TT), Tongue Blade (TB), Tongue Dorsum (TD).]
To bridge this gap, we ask how well SSL representations align with the principles of speech production. We conduct a linear probing analysis of SSL representations against electromagnetic articulography (EMA) (Fig. 1). EMA measures the real-time, continuous displacements of 6 articulators (Fig. 1), temporally synchronized with the produced speech [9, 10]. As EMA tracks the actual physical dynamics of the vocal tract, the modality is well suited to investigating the physical grounding of speech representations. Moreover, a large portion of speech features is naturally subsumed by articulatory trajectories, and full speech can be reconstructed from EMA [11]. The underlying neurobiological process of speech production can also be explained by vocal tract articulation [12]. We therefore target EMA traces as a principled, physically grounded reference for probing the representations of SSL models. We first introduce the articulatory score, the average correlation between a linear prediction and the EMA traces. We then conduct a comprehensive analysis by controlling the model space, the data size, and the layers used for probing. Our work bridges speech SSL and speech production.
arXiv:2210.11723v3 [eess.AS] 21 Jul 2023
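The articulatory-score setup described above can be sketched as a ridge-regularized linear probe scored by Pearson correlation averaged over EMA channels. The following is a minimal illustration on synthetic data; the function name, the ridge regularization, and the 12-channel layout (x/y coordinates for the 6 articulators) are assumptions of this sketch, not the paper's exact implementation, and a real experiment would use frame-level SSL features in place of the random matrix.

```python
import numpy as np

def articulatory_score(H, Y, lam=1e-3):
    """Fit a ridge-regularized linear map from features H (T x D) to
    EMA traces Y (T x 12: x/y for 6 articulators), then return the
    Pearson correlation averaged over the EMA channels."""
    D = H.shape[1]
    # Closed-form ridge regression: W = (H^T H + lam * I)^-1 H^T Y
    W = np.linalg.solve(H.T @ H + lam * np.eye(D), H.T @ Y)
    Y_hat = H @ W
    # Per-channel Pearson correlation between prediction and target
    corrs = [np.corrcoef(Y_hat[:, k], Y[:, k])[0, 1]
             for k in range(Y.shape[1])]
    return float(np.mean(corrs))

# Toy usage: targets are a noisy linear function of the features,
# so the probe should recover a correlation close to 1.
rng = np.random.default_rng(0)
H = rng.standard_normal((200, 16))                      # stand-in features
Y = H @ rng.standard_normal((16, 12)) \
    + 0.1 * rng.standard_normal((200, 12))              # synthetic EMA
print(round(articulatory_score(H, Y), 3))
```

In practice the linear map would be fit on a training split and the score reported on held-out utterances; the in-sample fit above is only meant to make the scoring metric concrete.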