
AUTOMATIC SEVERITY CLASSIFICATION OF DYSARTHRIC SPEECH
BY USING SELF-SUPERVISED MODEL WITH MULTI-TASK LEARNING
Eun Jung Yeo1∗, Kwanghee Choi2∗, Sunhee Kim3, Minhwa Chung1
Department of Linguistics, Seoul National University, Republic of Korea1
Department of Computer Science and Engineering, Sogang University, Republic of Korea2
Department of French Language Education, Seoul National University, Republic of Korea3
∗Equal contributors.
ABSTRACT
Automatic assessment of dysarthric speech is essential for sustained treatment and rehabilitation. However, obtaining atypical speech is challenging, often leading to data scarcity issues. To tackle the problem, we propose a novel automatic severity assessment method for dysarthric speech, using a self-supervised model in conjunction with multi-task learning. Wav2vec 2.0 XLS-R is jointly trained for two different tasks: severity classification and auxiliary automatic speech recognition (ASR). For the baseline experiments, we employ hand-crafted acoustic features and machine learning classifiers such as SVM, MLP, and XGBoost. Evaluated on the Korean dysarthric speech QoLT database, our model outperforms the traditional baseline methods, with a relative percentage increase of 1.25% in F1-score. In addition, the proposed model surpasses the model trained without the ASR head, achieving a 10.61% relative percentage improvement. Furthermore, we present how multi-task learning affects severity classification performance by analyzing the latent representations and the regularization effect.
Index Terms—dysarthric speech, automatic assessment,
self-supervised learning, multi-task learning
1. INTRODUCTION
Dysarthria is a group of motor speech disorders resulting from disturbances in neuromuscular control, affecting diverse speech dimensions such as respiration, phonation, resonance, articulation, and prosody [1]. Accordingly, people
with dysarthria often suffer from degraded speech intelligi-
bility, repeated communication failures, and, consequently,
poor quality of life. Hence, accurate and reliable speech as-
sessment is essential in the clinical field, as it helps track the
condition of patients and the effectiveness of treatments.
The most common way of assessing the severity of dysarthria is to conduct standardized tests such as the Frenchay Dysarthria Assessment (FDA) [2]. However, these tests heavily rely on human perceptual evaluations, which can be subjective and laborious. Therefore, automatic assessments that are highly consistent with expert judgments hold great potential for assisting clinicians in diagnosis and therapy.
Research on automatic assessment of dysarthria can be
grouped into two approaches. The first is to investigate a
novel feature set. For instance, paralinguistic features such as eGeMAPS were explored for their usability in atypical speech analysis [3]. On the other hand, common symptoms of dysarthric speech provided insights into new feature sets: glottal [4], resonance [5], pronunciation [6, 7], and prosody features [8, 9]. Furthermore, representations extracted from deep neural networks were also examined, such as spectro-temporal subspaces [10], i-vectors [11], and DeepSpeech posteriors [12]. While this approach can provide intuitive descriptions of the acoustic cues used in assessments, it has the drawback of discarding information that may be valuable to the task.
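
To make the feature-based approach concrete, the following is a minimal sketch, assuming the opensmile and scikit-learn Python packages, of extracting eGeMAPS functionals and training an SVM severity classifier in the spirit of the baselines used later in this paper; the file paths and labels are hypothetical placeholders, not data from this work.

# Minimal sketch of a hand-crafted feature baseline (not the paper's exact
# setup): eGeMAPS functionals via openSMILE + an SVM classifier.
import opensmile
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# 88-dimensional eGeMAPSv02 functionals, one feature vector per utterance
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.Functionals,
)

wav_paths = ["spk01_utt01.wav", "spk02_utt01.wav"]  # hypothetical files
severity = [0, 3]  # hypothetical severity labels, e.g., 0 (mildest) to 4 (most severe)

X = [smile.process_file(p).values.squeeze() for p in wav_paths]

# SVM baseline; an MLP or XGBoost classifier can be swapped in the same way
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
clf.fit(X, severity)
predictions = clf.predict(X)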
The second approach is to explore network architectures that take raw waveforms as input. These include, but are not limited to, distance-based neural networks [13], LSTM-based models [14, 15], and CNN-RNN hybrid models [16, 17]. As neural networks are often data-hungry, research in this direction suffers from the scarcity of atypical speech data. Consequently, studies have often been limited to dysarthria detection, a binary classification task. However, multi-class classification should also be considered for more detailed diagnoses. Recently, self-supervised representation learning has emerged to alleviate such problems, showing success in various downstream tasks with small amounts of data [18, 19]. Promising results have also been reported for different atypical speech tasks, including automatic speech recognition (ASR) [20, 21] and assessment [22, 23, 24]. However, the severity assessment of dysarthric speech remains underexplored.
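
As context for the approach proposed below, the following is a minimal sketch, assuming the PyTorch and HuggingFace transformers packages, of fine-tuning a pre-trained wav2vec 2.0 XLS-R encoder with a severity classification head and an auxiliary CTC head for ASR; the checkpoint, pooling, vocabulary size, and loss weighting are illustrative assumptions rather than this paper's exact configuration.

# Illustrative multi-task fine-tuning sketch (not the paper's exact configuration):
# wav2vec 2.0 XLS-R encoder + severity classification head + auxiliary CTC head.
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class SeverityMTL(nn.Module):
    def __init__(self, num_severity=5, vocab_size=70, asr_weight=0.1):
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-xls-r-300m")
        hidden = self.encoder.config.hidden_size
        self.cls_head = nn.Linear(hidden, num_severity)  # main task: severity levels
        self.asr_head = nn.Linear(hidden, vocab_size)    # auxiliary task: ASR (CTC)
        self.asr_weight = asr_weight
        self.ctc = nn.CTCLoss(blank=0, zero_infinity=True)

    def forward(self, waveform, severity=None, tokens=None, token_lens=None):
        h = self.encoder(waveform).last_hidden_state      # (batch, frames, hidden)
        cls_logits = self.cls_head(h.mean(dim=1))         # mean-pooled utterance logits
        loss = h.new_zeros(())                             # accumulated multi-task loss
        if severity is not None:
            loss = loss + nn.functional.cross_entropy(cls_logits, severity)
        if tokens is not None:                             # add weighted auxiliary CTC loss
            log_probs = self.asr_head(h).log_softmax(-1).transpose(0, 1)  # (frames, batch, vocab)
            frame_lens = torch.full((h.size(0),), h.size(1), dtype=torch.long)
            loss = loss + self.asr_weight * self.ctc(log_probs, tokens, frame_lens, token_lens)
        return cls_logits, loss

# Usage (hypothetical tensors): logits, loss = model(waves, severity=y, tokens=t, token_lens=tl)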
This paper proposes a novel automatic severity classifi-
cation method for dysarthric speech using a self-supervised
learning model fine-tuned with multi-task learning (MTL).
The model handles 1) a five-way multi-class classification of
dysarthria severity levels as the main task and 2) automatic
speech recognition as the auxiliary task. We expect MTL to