
A KNOWLEDGE-DRIVEN VOWEL-BASED APPROACH OF DEPRESSION
CLASSIFICATION FROM SPEECH USING DATA AUGMENTATION
Kexin Feng and Theodora Chaspari
Computer Science and Engineering
Texas A&M University
{kexin, chaspari}@tamu.edu
ABSTRACT
We propose a novel explainable machine learning (ML)
model that identifies depression from speech, by modeling
the temporal dependencies across utterances and utilizing the
spectrotemporal information at the vowel level. Our method
first maps the variable-length utterances at the local level
into a fixed-size vowel-based embedding using a convolu-
tional neural network with a spatial pyramid pooling layer
(“vowel CNN”). Depression is then classified at the global
level from a group of vowel CNN embeddings that serve as
the input to another 1D CNN (“depression CNN”). Different
data augmentation methods are designed for training both
the vowel CNN and the depression CNN. We
investigate the performance of the proposed system at var-
ious temporal granularities when modeling short, medium,
and long analysis windows, corresponding to 10, 21, and
42 utterances, respectively. The proposed method achieves
performance comparable to previous state-of-the-art ap-
proaches and exhibits explainable properties with respect to
the depression outcome. The findings from this work may
benefit clinicians by providing additional intuition during
joint human-ML decision-making tasks.
Index Terms—Mental health, speech vowel, knowledge-
driven, convolutional neural network, data augmentation
1. INTRODUCTION
Depression is a mental health (MH) condition with large
worldwide prevalence [1], whose diagnosis and treatment
are challenging due to the lack of access to MH care resources
and stigma [2]. Speech-based machine learning (ML) systems
have shown promising results in identifying depression due
to their ability to learn clinically relevant acoustic patterns,
such as monotonous pitch and reduced loudness [3]. In addi-
tion, these systems can potentially mitigate social stigma and
increase accessibility to MH care resources, since they can
run locally on users’ smartphone devices. Various ML mod-
els, including support vector machines (SVMs), convolutional
neural networks (CNNs), and long short-term memory (LSTM)
networks, have been explored for depression estimation [4].
However, the majority of these methods are designed inde-
pendently of MH clinicians, thus posing challenges in terms
of transparency and explainability.

This work is supported by the National Science Foundation (CAREER:
Enabling Trustworthy Speech Technologies for Mental Health Care: From
Speech Anonymization to Fair Human-centered Machine Intelligence,
#2046118). The code is available at:
https://github.com/HUBBS-Lab-TAMU/ICASSP-2023-Augmented-Knowledge-Driven-Speech-Based-Method-of-Depression-Detection
Interactions between humans and ML are evolving into
collaborative relationships, where the two parties work to-
gether to achieve a set of common goals, especially when
it comes to complex and highly subjective decision-making
tasks, such as the ones pertaining to MH care. An explainable
ML model of depression estimation would allow clinicians
to gain insights into the ML logic and decision-making pro-
cesses and contribute toward better calibrating their trust in
the model output [5]. Previously proposed conceptual frame-
works for building human-centered explainable ML suggest
that users may be able to develop a mental model of the
algorithm based on a collection of “how explanations” that
demonstrate how the model works based on multiple in-
stances [6]. In addition, it is important to provide both global
explanations that describe holistically how the model works,
and local explanations that demonstrate the relationship be-
tween inputs and outputs [7].
Here, we design an explainable ML model for depression
classification based on speech. We leverage knowledge from
speech production indicating that depression can influence
motor control and, consequently, the formant frequencies and
spectrotemporal variations at the vowel level [8]. We propose
a vowel-dependent CNN (vowel CNN) with a spatial pyra-
mid pooling (SPP) layer that learns the spectrotemporal infor-
mation of short-term speech segments (i.e., 250 ms) through-
out the utterance. Depression is estimated from a group of
vowel CNN embeddings using another 1D CNN (depression
CNN). The vowel CNN captures depression information at
the local level from parts of speech that are theoretically pos-
tulated to be most affected by the MH condition [8]. The SPP
layer maps utterances of any size into a fixed-size embedding
that contributes to modeling explanations at the utterance
level, which can provide a global view of the depression outcome.
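To make the two-stage architecture concrete, the sketch below shows how an
SPP layer can map a variable-length utterance spectrogram to a fixed-size
embedding, and how a 1D CNN can then classify a window of such embeddings.
This is a minimal illustration written in PyTorch; the framework, layer
widths, pyramid levels, and embedding dimension are assumptions made for
exposition, not the authors' reported configuration.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SpatialPyramidPooling(nn.Module):
        # Pools a (batch, channels, freq, time) feature map into a
        # fixed-size vector, regardless of the variable time dimension.
        def __init__(self, levels=(1, 2, 4)):
            super().__init__()
            self.levels = levels

        def forward(self, x):
            feats = []
            for n in self.levels:
                # Adaptive pooling yields an n x n grid for any input size.
                feats.append(F.adaptive_max_pool2d(x, (n, n)).flatten(1))
            return torch.cat(feats, dim=1)  # (batch, channels * (1+4+16))

    class VowelCNN(nn.Module):
        # Maps a variable-length spectrogram of one utterance to a
        # fixed-size vowel-based embedding (dimensions are illustrative).
        def __init__(self, n_channels=32, embed_dim=128):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(1, n_channels, 3, padding=1), nn.ReLU(),
                nn.Conv2d(n_channels, n_channels, 3, padding=1), nn.ReLU(),
            )
            self.spp = SpatialPyramidPooling()
            self.fc = nn.Linear(n_channels * (1 + 4 + 16), embed_dim)

        def forward(self, spec):  # spec: (batch, 1, n_freq_bins, n_frames)
            return self.fc(self.spp(self.conv(spec)))  # (batch, embed_dim)

    class DepressionCNN(nn.Module):
        # 1D CNN over a window of utterance embeddings
        # (e.g., 10, 21, or 42 utterances per window).
        def __init__(self, embed_dim=128, n_classes=2):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv1d(embed_dim, 64, 3, padding=1), nn.ReLU(),
                nn.AdaptiveMaxPool1d(1),
            )
            self.fc = nn.Linear(64, n_classes)

        def forward(self, embs):  # embs: (batch, n_utterances, embed_dim)
            return self.fc(self.conv(embs.transpose(1, 2)).squeeze(-1))

    # Example: a spectrogram with 80 frequency bins and any number of
    # frames yields a 128-d embedding, so utterances of different
    # durations become comparable inputs to the depression CNN.
    # emb = VowelCNN()(torch.randn(1, 1, 80, 57))  # -> shape (1, 128)

The key property illustrated is that adaptive pooling decouples the
embedding size from the utterance duration, which is what allows the
depression CNN to operate on fixed-size groups of utterance embeddings.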