
MASKED AUTOENCODERS ARE ARTICULATORY LEARNERS
Ahmed Adel Attia and Carol Y. Espy-Wilson
University of Maryland, College Park
Institute for Systems Research
Maryland, USA
This work was supported by the National Science Foundation grant IIS1764010.
ABSTRACT
Articulatory recordings track the positions and motion of dif-
ferent articulators along the vocal tract and are widely used to
study speech production and to develop speech technologies
such as articulatory-based speech synthesizers and speech
inversion systems. The University of Wisconsin X-Ray Mi-
crobeam (XRMB) dataset is one of various datasets that
provide articulatory recordings synced with audio recordings.
The XRMB articulatory recordings employ pellets placed on
a number of articulators which can be tracked by the mi-
crobeam. However, a significant portion of the articulatory
recordings are mistracked and have so far been unusable.
In this work, we present a deep-learning-based approach
using Masked Autoencoders to accurately reconstruct the
mistracked articulatory recordings for 41 out of 47 speakers
of the XRMB dataset. Our model is able to reconstruct articu-
latory trajectories that closely match ground truth, even when
three out of eight articulators are mistracked, and retrieve
3.28 out of 3.4 hours of previously unusable recordings.
Index Terms—X-ray microbeam, articulatory, data re-
construction, masked autoencoder, deep learning
1. INTRODUCTION
Articulatory data have been foundational in helping speech
scientists study the articulatory habits of speakers, which can
differ between individuals and across languages and dialects of
the same language. At present, specialized equipment such
as X-ray Microbeam (XRMB) [1], Electromagnetic Articulometry
(EMA) [2], and real-time Magnetic Resonance Imaging (rt-MRI)
[3] is needed to observe articulatory movements
directly to help researchers and clinicians understand these
habits. Regardless of the technique, current technologies used
in articulatory recordings sometimes fail to correctly capture
one or more of the articulators, resulting in mistracked seg-
ments. Given the considerable time and cost involved in data
collection and the potential prolonged exposure to harmful
signals, these mistracked pellets are seldom re-recorded,
leaving some recordings unusable.
There has been considerable work to develop techniques
to reconstruct missing articulatory recordings. In [4], the
authors proposed an algorithm for recovering missing data by
learning a density model of the vocal tract shapes. Their
model was limited to recovering only one mistracked articulator
at a time. Moreover, their approach required handcrafting the
model parameters for every speaker and every articulator.
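To make the general idea of density-based recovery concrete, the following is a minimal sketch that fits a single Gaussian over the stacked pellet coordinates of clean frames and imputes missing coordinates with the conditional mean. The single-Gaussian choice, the `fit_gaussian` and `impute_missing` helpers, and the toy dimensions are our illustrative assumptions, not the density model of [4].

```python
# A minimal sketch of density-based recovery: fit a density over vocal
# tract shapes, then impute a mistracked pellet from the tracked ones.
# A single Gaussian stands in for whatever density model [4] fits.
import numpy as np

def fit_gaussian(X):
    """X: (n_frames, d) matrix of fully tracked frames."""
    return X.mean(axis=0), np.cov(X, rowvar=False)

def impute_missing(frame, mu, cov, missing):
    """Condition the Gaussian on the observed coordinates and replace
    the missing ones with the conditional mean."""
    obs = ~missing
    cov_mo = cov[np.ix_(missing, obs)]
    cov_oo = cov[np.ix_(obs, obs)]
    cond_mean = mu[missing] + cov_mo @ np.linalg.solve(
        cov_oo, frame[obs] - mu[obs])
    out = frame.copy()
    out[missing] = cond_mean
    return out

# Toy usage: d = 16 coordinates (8 pellets x 2); one pellet (dims 4:6)
# is mistracked. Random data stands in for real XRMB frames.
X = np.random.randn(1000, 16)
mu, cov = fit_gaussian(X)
missing = np.zeros(16, dtype=bool)
missing[4:6] = True
recovered = impute_missing(X[0], mu, cov, missing)
```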
In [5, 6], a Kalman smoother and a maximum a posteriori
estimator fill in the missing samples. Their work was also
limited to one articulator at a time, and its performance
additionally depended on the duration of the mistracked segments.
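For intuition, the sketch below shows this style of smoothing-based imputation on a single trajectory: a constant-velocity Kalman filter runs forward, predicting through mistracked gaps, and a Rauch-Tung-Striebel backward pass smooths the estimates using future samples. The motion model and the noise variances `q` and `r` are illustrative placeholders, not the estimator of [5, 6].

```python
# A minimal sketch of imputing missing samples in one articulator
# trajectory with a Kalman smoother (constant-velocity model).
import numpy as np

def kalman_smooth_impute(y, dt=1.0, q=1e-2, r=1e-1):
    """y: 1-D trajectory with np.nan marking mistracked samples."""
    F = np.array([[1.0, dt], [0.0, 1.0]])   # state transition (pos, vel)
    H = np.array([[1.0, 0.0]])              # we observe position only
    Q = q * np.array([[dt**3 / 3, dt**2 / 2], [dt**2 / 2, dt]])
    R = np.array([[r]])

    n = len(y)
    x = np.zeros((n, 2)); P = np.zeros((n, 2, 2))
    xp = np.zeros((n, 2)); Pp = np.zeros((n, 2, 2))
    x_prev = np.array([np.nanmean(y), 0.0]); P_prev = np.eye(2)

    # Forward pass: standard Kalman filter, skipping the update on gaps.
    for t in range(n):
        xp[t] = F @ x_prev; Pp[t] = F @ P_prev @ F.T + Q
        if np.isnan(y[t]):
            x[t], P[t] = xp[t], Pp[t]       # predict only through the gap
        else:
            S = H @ Pp[t] @ H.T + R
            K = Pp[t] @ H.T @ np.linalg.inv(S)
            x[t] = xp[t] + (K @ (y[t] - H @ xp[t])).ravel()
            P[t] = (np.eye(2) - K @ H) @ Pp[t]
        x_prev, P_prev = x[t], P[t]

    # Backward (Rauch-Tung-Striebel) pass: refine estimates with future data.
    for t in range(n - 2, -1, -1):
        G = P[t] @ F.T @ np.linalg.inv(Pp[t + 1])
        x[t] = x[t] + G @ (x[t + 1] - xp[t + 1])

    out = y.copy()
    out[np.isnan(y)] = x[np.isnan(y), 0]    # fill gaps with smoothed position
    return out
```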
Our approach does not suffer from the same limitations
as previous works. The model hyper-parameters do not re-
quire any handcrafting for different speakers, and one model
can predict any missing articulator. As a result, our model is
easily applicable to the majority of the dataset. We are also
not limited to reconstructing one missing articulator or to a
certain duration of mistracked segments: we can reconstruct
up to three missing articulators with trajectories that closely
follow the ground truth for the entire duration of the recording.
We approach the problem as a masked reconstruction
problem. Recently, there have been major breakthroughs
in image reconstruction using masked autoencoders [7],
where an image can be realistically reconstructed with 90%
of it masked in patches. This work follows a similar
approach with articulatory data, where a portion of an in-
put frame is masked and then fed to an autoencoder, which
then reconstructs the entire frame including the masked por-
tions. We explain our approach in greater detail in section 3.
Section 2 describes the dataset. We present test-set results in
section 4 and discuss the limitations of the proposed approach
in section 5. We conclude with a discussion of future
directions in section 6.
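As a rough illustration of the masking idea above (not the architecture described in section 3), the sketch below zeros out whole articulator channels in a frame, passes the masked frame through a small autoencoder, and trains it to reconstruct the full frame. The frame shape, layer sizes, and the choice to mask three of eight articulators are our assumptions for the example.

```python
# A minimal sketch of masked reconstruction for articulatory frames,
# assuming frames of 8 articulators x 2 coordinates (x, y) over T samples.
import torch
import torch.nn as nn

N_ART, N_DIM, T = 8, 2, 100          # articulators, coords, frame length
D = N_ART * N_DIM * T                # flattened frame dimension

class MaskedAutoencoder(nn.Module):
    def __init__(self, hidden=512):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(D, hidden), nn.ReLU(),
                                     nn.Linear(hidden, hidden), nn.ReLU())
        self.decoder = nn.Linear(hidden, D)

    def forward(self, frame):        # frame: (batch, N_ART, N_DIM, T)
        return self.decoder(self.encoder(frame.flatten(1))).view_as(frame)

def mask_articulators(frame, n_masked=3):
    """Zero out n_masked randomly chosen articulators per example,
    mimicking mistracked pellets; returns masked frame and mask."""
    masked = frame.clone()
    mask = torch.zeros(frame.shape[:2], dtype=torch.bool)
    for b in range(frame.shape[0]):
        idx = torch.randperm(N_ART)[:n_masked]
        masked[b, idx] = 0.0
        mask[b, idx] = True
    return masked, mask

# One illustrative training step on random stand-in data.
model = MaskedAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
frame = torch.randn(4, N_ART, N_DIM, T)       # stand-in for XRMB frames
masked, _ = mask_articulators(frame)
recon = model(masked)
loss = nn.functional.mse_loss(recon, frame)   # reconstruct the full frame
opt.zero_grad(); loss.backward(); opt.step()
```

In practice the loss could be restricted to the masked channels, as in [7]; the sketch penalizes the full frame for simplicity.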
2. DATASET DESCRIPTION
In this study, we use the University of Wisconsin XRMB
dataset. However, there is no reason the proposed approach
cannot be applied to other datasets.
2.1. XRMB dataset
The XRMB dataset contains articulatory recordings synced
with audio recordings. Each speaker had 8 gold pellets glued