
MASKED AUTOENCODERS ARE ARTICULATORY LEARNERS
Ahmed Adel Attia and Carol Y. Espy-Wilson
University of Maryland, College Park
Institute for Systems Research
Maryland, USA
This work was supported by the National Science Foundation grant IIS1764010.
ABSTRACT
Articulatory recordings track the positions and motion of dif-
ferent articulators along the vocal tract and are widely used to
study speech production and to develop speech technologies
such as articulatory-based speech synthesizers and speech
inversion systems. The University of Wisconsin X-Ray Mi-
crobeam (XRMB) dataset is one of various datasets that
provide articulatory recordings synced with audio recordings.
The XRMB articulatory recordings employ pellets placed on
a number of articulators which can be tracked by the mi-
crobeam. However, a significant portion of the articulatory
recordings are mistracked and have so far been unusable.
In this work, we present a deep-learning-based approach
using Masked Autoencoders to accurately reconstruct the
mistracked articulatory recordings for 41 out of 47 speakers
of the XRMB dataset. Our model is able to reconstruct articu-
latory trajectories that closely match ground truth, even when
three out of eight articulators are mistracked, and retrieve
3.28 out of 3.4 hours of previously unusable recordings.
Index Terms—X-ray microbeam, articulatory, data re-
construction, masked autoencoder, deep learning
1. INTRODUCTION
Articulatory data have been foundational in helping speech
scientists study the articulatory habits of speakers, which can
differ between individuals and across languages and dialects of
the same language. At present, specialized equipment such
as X-ray Microbeam (XRMB) [1], Electromagnetic Articulometry
(EMA) [2], and real-time Magnetic Resonance Imaging (rt-MRI)
[3] is needed to observe articulatory movements
directly to help researchers and clinicians understand these
habits. Regardless of the technique, current technologies used
in articulatory recordings sometimes fail to correctly capture
one or more of the articulators, resulting in mistracked seg-
ments. Given the considerable time and cost involved in data
collection and the potential prolonged exposure to harmful
signals, these mistracked pellets are seldom re-recorded,
leaving some recordings unusable.
There has been considerable work to develop techniques
to reconstruct missing articulatory recordings. In [4], the
authors proposed an algorithm for recovering missing data by
learning a density model of the vocal tract shapes. Their
model was limited to recovering only one mistracked articulator
at a time. Moreover, their approach required handcrafting the
model parameters for every speaker and every articulator.
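To make the general idea of density-based recovery concrete, the following is a minimal sketch that fits a single Gaussian over the stacked pellet coordinates of clean frames and imputes missing coordinates with the conditional mean. The single-Gaussian choice, the `fit_gaussian` and `impute_missing` helpers, and the toy dimensions are our illustrative assumptions, not the density model of [4].

```python
# A minimal sketch of density-based recovery: fit a density over vocal
# tract shapes, then impute a mistracked pellet from the tracked ones.
# A single Gaussian stands in for whatever density model [4] fits.
import numpy as np

def fit_gaussian(X):
    """X: (n_frames, d) matrix of fully tracked frames."""
    return X.mean(axis=0), np.cov(X, rowvar=False)

def impute_missing(frame, mu, cov, missing):
    """Condition the Gaussian on the observed coordinates and replace
    the missing ones with the conditional mean."""
    obs = ~missing
    cov_mo = cov[np.ix_(missing, obs)]
    cov_oo = cov[np.ix_(obs, obs)]
    cond_mean = mu[missing] + cov_mo @ np.linalg.solve(
        cov_oo, frame[obs] - mu[obs])
    out = frame.copy()
    out[missing] = cond_mean
    return out

# Toy usage: d = 16 coordinates (8 pellets x 2); one pellet (dims 4:6)
# is mistracked. Random data stands in for real XRMB frames.
X = np.random.randn(1000, 16)
mu, cov = fit_gaussian(X)
missing = np.zeros(16, dtype=bool)
missing[4:6] = True
recovered = impute_missing(X[0], mu, cov, missing)
```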
In [5, 6], a Kalman smoother and a maximum a posteriori
estimator fill in the missing samples. Their work was also
limited to one articulator at a time, and its performance
additionally depended on the duration of the mistracked segments.
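For intuition, the sketch below shows this style of smoothing-based imputation on a single trajectory: a constant-velocity Kalman filter runs forward, predicting through mistracked gaps, and a Rauch-Tung-Striebel backward pass smooths the estimates using future samples. The motion model and the noise variances `q` and `r` are illustrative placeholders, not the estimator of [5, 6].

```python
# A minimal sketch of imputing missing samples in one articulator
# trajectory with a Kalman smoother (constant-velocity model).
import numpy as np

def kalman_smooth_impute(y, dt=1.0, q=1e-2, r=1e-1):
    """y: 1-D trajectory with np.nan marking mistracked samples."""
    F = np.array([[1.0, dt], [0.0, 1.0]])   # state transition (pos, vel)
    H = np.array([[1.0, 0.0]])              # we observe position only
    Q = q * np.array([[dt**3 / 3, dt**2 / 2], [dt**2 / 2, dt]])
    R = np.array([[r]])

    n = len(y)
    x = np.zeros((n, 2)); P = np.zeros((n, 2, 2))
    xp = np.zeros((n, 2)); Pp = np.zeros((n, 2, 2))
    x_prev = np.array([np.nanmean(y), 0.0]); P_prev = np.eye(2)

    # Forward pass: standard Kalman filter, skipping the update on gaps.
    for t in range(n):
        xp[t] = F @ x_prev; Pp[t] = F @ P_prev @ F.T + Q
        if np.isnan(y[t]):
            x[t], P[t] = xp[t], Pp[t]       # predict only through the gap
        else:
            S = H @ Pp[t] @ H.T + R
            K = Pp[t] @ H.T @ np.linalg.inv(S)
            x[t] = xp[t] + (K @ (y[t] - H @ xp[t])).ravel()
            P[t] = (np.eye(2) - K @ H) @ Pp[t]
        x_prev, P_prev = x[t], P[t]

    # Backward (Rauch-Tung-Striebel) pass: refine estimates with future data.
    for t in range(n - 2, -1, -1):
        G = P[t] @ F.T @ np.linalg.inv(Pp[t + 1])
        x[t] = x[t] + G @ (x[t + 1] - xp[t + 1])

    out = y.copy()
    out[np.isnan(y)] = x[np.isnan(y), 0]    # fill gaps with smoothed position
    return out
```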
Our approach does not suffer from the same limitations
as previous works. The model hyper-parameters do not re-
quire any handcrafting for different speakers, and one model
can predict any missing articulator. As a result, our model is
easily applicable to the majority of the dataset. We are also
not limited to reconstructing one missing articulator or to a
certain duration of mistracked segments: we can reconstruct
up to three missing articulators with trajectories that closely
follow the ground truth for the entire duration of the recording.
We approach the problem as a masked reconstruction
problem. Recently, there have been major breakthroughs
in image reconstruction using masked autoencoders [7],
where an image can be realistically reconstructed with 90%
of it masked in patches. This work follows a similar
approach with articulatory data, where a portion of an in-
put frame is masked and then fed to an autoencoder, which
then reconstructs the entire frame including the masked por-
tions. We explain our approach in greater detail in section 3.
Section 2 describes the dataset. We present test-set results in
section 4 and discuss the limitations of the proposed approach
in section 5. We conclude with a discussion of future
directions in section 6.
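As a rough illustration of the masking idea above (not the architecture described in section 3), the sketch below zeros out whole articulator channels in a frame, passes the masked frame through a small autoencoder, and trains it to reconstruct the full frame. The frame shape, layer sizes, and the choice to mask three of eight articulators are our assumptions for the example.

```python
# A minimal sketch of masked reconstruction for articulatory frames,
# assuming frames of 8 articulators x 2 coordinates (x, y) over T samples.
import torch
import torch.nn as nn

N_ART, N_DIM, T = 8, 2, 100          # articulators, coords, frame length
D = N_ART * N_DIM * T                # flattened frame dimension

class MaskedAutoencoder(nn.Module):
    def __init__(self, hidden=512):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(D, hidden), nn.ReLU(),
                                     nn.Linear(hidden, hidden), nn.ReLU())
        self.decoder = nn.Linear(hidden, D)

    def forward(self, frame):        # frame: (batch, N_ART, N_DIM, T)
        return self.decoder(self.encoder(frame.flatten(1))).view_as(frame)

def mask_articulators(frame, n_masked=3):
    """Zero out n_masked randomly chosen articulators per example,
    mimicking mistracked pellets; returns masked frame and mask."""
    masked = frame.clone()
    mask = torch.zeros(frame.shape[:2], dtype=torch.bool)
    for b in range(frame.shape[0]):
        idx = torch.randperm(N_ART)[:n_masked]
        masked[b, idx] = 0.0
        mask[b, idx] = True
    return masked, mask

# One illustrative training step on random stand-in data.
model = MaskedAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
frame = torch.randn(4, N_ART, N_DIM, T)       # stand-in for XRMB frames
masked, _ = mask_articulators(frame)
recon = model(masked)
loss = nn.functional.mse_loss(recon, frame)   # reconstruct the full frame
opt.zero_grad(); loss.backward(); opt.step()
```

In practice the loss could be restricted to the masked channels, as in [7]; the sketch penalizes the full frame for simplicity.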
2. DATASET DESCRIPTION
In this study, we use the University of Wisconsin XRMB
dataset. However, there is no reason the proposed approach
cannot be applied to other datasets.
2.1. XRMB dataset
The XRMB dataset contains articulatory recordings synced
with audio recordings. Each speaker had 8 gold pellets glued