WEAK-SUPERVISED DYSARTHRIA-INVARIANT FEATURES FOR SPOKEN LANGUAGE
UNDERSTANDING USING AN FHVAE AND ADVERSARIAL TRAINING
Jinzi Qi1, Hugo Van hamme1
1KU Leuven, Department of Electrical Engineering ESAT-PSI,
Kasteelpark Arenberg 10, Leuven, Belgium
ABSTRACT
The scarcity of training data and the large speaker variation
in dysarthric speech lead to poor accuracy and poor speaker
generalization of spoken language understanding systems for
dysarthric speech. Working at the level of the speech features,
we focus on improving model generalization with limited
dysarthric data. Factorized Hierarchical Variational Auto-
Encoders (FHVAEs), trained without supervision, have shown
their advantage in disentangling content and speaker
representations. However, earlier work showed that dysarthria
is reflected in both feature vectors. Here, we add adversarial
training to bridge the gap between the control and dysarthric
speech data domains, and we extract dysarthria- and speaker-
invariant features using weak supervision. The extracted
features are evaluated on a spoken language understanding
task and yield a higher accuracy on unseen speakers with
more severe dysarthria compared to features from the basic
FHVAE model or plain filterbank features.
Index Terms—Dysarthric speech, FHVAE, adversarial
training, weak supervision, end-to-end spoken language un-
derstanding.
1. INTRODUCTION
Dysarthric speech data, especially accurately labeled data, are
scarce due to difficulties in recruitment, collection and label-
ing. This data insufficiency problem has been a constant bar-
rier for automatic speech recognition (ASR) model training.
Recent ASR models [1, 2, 3] tend to be more complex and
employ more parameters, which makes training from limited
dysarthric data even more difficult.
Current research mainly addresses the dysarthric speech data
deficiency problem in three ways. One solution is to exploit
abundant canonical (control) speech data during training,
e.g. pretraining on control speech and finetuning on
dysarthric speech [4, 5, 6]. A second approach is to decrease
the model size [7], or to train a small inserted module in-
stead of finetuning the whole model [8, 9], so that the number
of parameters learned on the dysarthric data is limited. Thirdly,
and differently from solutions that work on the training strat-
egy or model structure, [10, 11, 12, 13] focus directly on the
data and apply augmentation to generate more dysarthric speech
for use in training.
The features used in most of these methods preserve speaker
characteristics and dysarthria information, and thus limit
the model's generalization ability. In practical scenarios,
however, we would like the ASR model to generalize well
and work reliably on unseen speakers (users), to reduce any
training effort for dysarthric users. In this work, we focus
on improving the generalization ability of the automatic
recognition model with limited dysarthric data. Instead of
training and testing the model with speech features that pre-
serve variability between dysarthric speakers, we propose to
use dysarthria-invariant and speaker-independent speech fea-
tures for automatic recognition, which can both boost the
model's generalization to unseen speakers and make it possible
to include abundant canonical speech in training.
Factorized Hierarchical Variational Auto-Encoders (FH-
VAE) [14, 15] have shown their ability to disentangle the con-
tent space and the speaker space in input speech. An FHVAE
models the generative process of a sequence of segments in
a hierarchical structure. It encodes a speech utterance into a
segment-related variable (short time scale) and a sequence-
related variable (long time scale) via two linked encoders.
The segment-related latent variable represents the informa-
tion that only appears in a single segment in the sequence,
such as the content of the segment, and is conditioned on the
sequence-related variable. The sequence-related variable re-
flects the features of the whole sequence, like the acoustic
environment or speaker characteristics. The model hence of-
fers a separation between speaker characteristics (sequence)
and content (segment). The content variable (segment-related
variable) appears to fit the speaker-independence desideratum
outlined above. However, our previous work [16] shows that
dysarthria is reflected in both latent variables.
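The two linked encoders described above can be caricatured as follows. This is only an illustrative sketch, not the authors' implementation: random projections stand in for the trained encoder networks, and all names and dimensions (`encode_utterance`, `d_latent`) are hypothetical. The point is the hierarchy: one sequence-level variable per utterance, and per-segment variables whose encoder is conditioned on it.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_utterance(segments, d_latent=2):
    """Toy sketch of the two linked FHVAE encoders.

    segments: array of shape (n_segments, d_in), one feature
    vector per short-time segment of the utterance.
    Returns (z1, z2): per-segment content variables and one
    sequence-level variable.
    """
    n_seg, d_in = segments.shape

    # Sequence encoder: one z2 per utterance, summarizing all
    # segments (speaker / acoustic-environment information).
    W2 = rng.normal(size=(d_in, d_latent))
    z2 = np.tanh(segments.mean(axis=0) @ W2)

    # Segment encoder: one z1 per segment, conditioned on z2 so
    # that sequence-level information can be factored out of the
    # content code (here: z2 is concatenated to each segment).
    W1 = rng.normal(size=(d_in + d_latent, d_latent))
    cond = np.concatenate([segments, np.tile(z2, (n_seg, 1))], axis=1)
    z1 = np.tanh(cond @ W1)
    return z1, z2

# Usage: a 5-segment utterance with 8-dimensional features
z1, z2 = encode_utterance(rng.normal(size=(5, 8)))
```

In the trained model, the decoder reconstructs each segment from its z1 together with the utterance's z2, which is what pushes content into z1 and sequence-level factors into z2.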
To achieve a greater degree of dysarthria-invariance in the
content variable, instead of using forced regularization with
phoneme alignment [16], we introduce adversarial training [18]
to the FHVAE model, inspired by [17]. We feed the FHVAE model
with both control and dysarthric speech, regarding the content
variable encoder as the generator. We add a
discriminator that tries to distinguish control from dysarthric
content variables. With adversarial training between the gen-
Copyright 2023 IEEE. This is a preprint version. The original version is to be published in the 2022 IEEE Spoken Language Technology Workshop (SLT) (SLT 2022), scheduled for 09-12 January 2023 in Doha, Qatar. Personal use of this material is
permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other
works, must be obtained from the IEEE. Contact: Manager, Copyrights and Permissions / IEEE Service Center / 445 Hoes Lane / P.O. Box 1331 / Piscataway, NJ 08855-1331, USA. Telephone: + Intl. 908-562-3966.
arXiv:2210.13144v1 [eess.AS] 24 Oct 2022