MASKED AUTOENCODERS ARE ARTICULATORY LEARNERS
Ahmed Adel Attia and Carol Y. Espy-Wilson
University Of Maryland, College Park
Institute For Systems Research
Maryland, USA
ABSTRACT
Articulatory recordings track the positions and motion of different articulators along the vocal tract and are widely used to study speech production and to develop speech technologies such as articulatory-based speech synthesizers and speech inversion systems. The University of Wisconsin X-Ray Microbeam (XRMB) dataset is one of several datasets that provide articulatory recordings synced with audio recordings. The XRMB articulatory recordings employ pellets placed on a number of articulators, which can be tracked by the microbeam. However, a significant portion of the articulatory recordings are mistracked and have so far been unusable. In this work, we present a deep-learning-based approach using Masked Autoencoders to accurately reconstruct the mistracked articulatory recordings for 41 of the 47 speakers in the XRMB dataset. Our model reconstructs articulatory trajectories that closely match the ground truth, even when three out of eight articulators are mistracked, and retrieves 3.28 out of 3.4 hours of previously unusable recordings.
Index Terms— X-ray microbeam, articulatory, data reconstruction, masked autoencoder, deep learning
1. INTRODUCTION
Articulatory data have been foundational in helping speech scientists study the articulatory habits of speakers, which can differ between individuals and across languages and dialects of the same language. At present, specialized equipment such as X-ray Microbeam (XRMB) [1], Electromagnetic Articulometry (EMA) [2], and real-time Magnetic Resonance Imaging (rt-MRI) [3] is needed to observe articulatory movements directly and help researchers and clinicians understand these habits. Regardless of the technique, current technologies used in articulatory recordings sometimes fail to correctly capture one or more of the articulators, resulting in mistracked segments. Given the considerable time and cost involved in data collection and the potential prolonged exposure to harmful signals, these mistracked pellets are seldom re-recorded, leaving some recordings unusable.
(This work was supported by the National Science Foundation grant IIS1764010.)

There has been considerable work to develop techniques to reconstruct missing articulatory recordings. In [4], the authors proposed an algorithm for recovering missing data by learning a density model of the vocal tract shapes. Their model was limited to recovering only one mistracked articulator at a time, and their approach requires handcrafting the model parameters for every speaker and every articulator. In [5, 6], a Kalman smoother and a maximum a-posteriori estimator fill in the missing samples. Their work was also limited to one articulator at a time and was additionally dependent on the length of the mistracked samples.
Our approach does not suffer from the limitations of previous work. The model hyperparameters do not require any handcrafting for different speakers, and a single model can predict any missing articulator. As a result, our model is easily applicable to the majority of the dataset. We are also not limited to reconstructing one missing articulator, or to a certain duration of mistracked segments: we can reconstruct up to three missing articulators, with trajectories that follow the ground truth closely, for the entire duration of a recording.
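To make this setting concrete, the sketch below (our own illustration, not code from the paper) draws a random mask that hides up to three of the eight XRMB articulators. The channel layout — each articulator contributing one x and one y trajectory, for 16 channels total — is an assumption for illustration:

```python
import numpy as np

N_ARTICULATORS = 8   # pellets tracked in XRMB
COORDS = 2           # x and y per pellet (assumed layout)

def random_articulator_mask(max_masked=3, rng=None):
    """Return a boolean mask over the 16 trajectory channels.

    Between 1 and `max_masked` articulators are hidden; both the
    x and y channels of a hidden articulator are masked together,
    mimicking a fully mistracked pellet.
    """
    rng = rng or np.random.default_rng()
    n_masked = int(rng.integers(1, max_masked + 1))
    hidden = rng.choice(N_ARTICULATORS, size=n_masked, replace=False)
    mask = np.zeros(N_ARTICULATORS * COORDS, dtype=bool)
    for a in hidden:
        mask[a * COORDS : (a + 1) * COORDS] = True
    return mask

mask = random_articulator_mask()
# 1-3 articulators hidden, 2 channels each
assert mask.shape == (16,) and mask.sum() in (2, 4, 6)
```

Masking whole articulators (rather than arbitrary samples) matches the failure mode described above, where a pellet is mistracked for a stretch of the recording.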
We approach the problem as a masked reconstruction problem. Recently, there have been major breakthroughs in image reconstruction using masked autoencoders [7], where an image can be realistically reconstructed with 90% of it masked in patches. This work follows a similar approach with articulatory data: a portion of an input frame is masked and then fed to an autoencoder, which reconstructs the entire frame, including the masked portions. We explain our approach in greater detail in Section 3. Section 2 describes the dataset. We present test set results in Section 4 and discuss the limitations of the proposed approach in Section 5. We end with a conclusion and a discussion of future directions in Section 6.
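The masked-reconstruction objective described above can be sketched as follows. This is a minimal illustration under our own assumptions (frames of shape time × channels, zero-filled masking, MSE over the full frame); the paper's actual architecture and loss are detailed in Section 3:

```python
import numpy as np

def masked_reconstruction_loss(model, frame, mask):
    """MSE loss for one masked-reconstruction step.

    frame : (T, C) window of articulatory trajectories
    mask  : (C,) boolean, True where channels are hidden
    model : callable mapping a masked frame to a reconstruction

    The loss is computed over the *entire* frame, so the model must
    both reproduce the visible channels and infer the masked ones.
    """
    masked_frame = frame.copy()
    masked_frame[:, mask] = 0.0          # hide mistracked articulators
    recon = model(masked_frame)
    return float(np.mean((recon - frame) ** 2))

# Toy check with an identity "model": the error comes entirely from
# the masked channels, which the identity model cannot recover.
frame = np.ones((10, 16))
mask = np.zeros(16, dtype=bool)
mask[:2] = True                          # hide one articulator (2 channels)
loss = masked_reconstruction_loss(lambda x: x, frame, mask)
# 20 masked entries with error 1 out of 160 entries -> loss == 0.125
```

Penalizing the visible channels as well (rather than only the masked ones) is one reasonable design choice for trajectory data, since it anchors the reconstruction to the observed articulators.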
2. DATASET DESCRIPTION
In this study, we use the University of Wisconsin XRMB
dataset. However, there is no reason the proposed approach
cannot be applied to other datasets.
2.1. XRMB dataset
The XRMB dataset contains articulatory recordings synced with audio recordings. Each speaker had 8 gold pellets glued to a number of articulators, which were tracked by the microbeam.
arXiv:2210.15195v3 [eess.AS] 18 May 2023