
Toward Knowledge-Driven Speech-Based Models of
Depression: Leveraging Spectrotemporal Variations
in Speech Vowels
Kexin Feng
Computer Science and Engineering
Texas A&M University
kexin@tamu.edu
Theodora Chaspari
Computer Science and Engineering
Texas A&M University
chaspari@tamu.edu
Abstract—Psychomotor retardation associated with depression
has been linked with tangible differences in vowel production.
This paper investigates a knowledge-driven machine learning
(ML) method that integrates spectrotemporal information of
speech at the vowel level to identify depression. Low-level
speech descriptors are learned by a convolutional neural network
(CNN) that is trained for vowel classification. The temporal
evolution of those low-level descriptors is modeled at a high
level within and across utterances via a long short-term memory
(LSTM) model that makes the final depression decision.
A modified version of Local Interpretable Model-agnostic
Explanations (LIME) is further used to identify the impact of the
low-level spectrotemporal vowel variation on the decisions and
observe the high-level temporal change of the depression like-
lihood. The proposed method outperforms baselines that model
the spectrotemporal information in speech without integrating
the vowel-based information, as well as ML models trained
with conventional prosodic and spectrotemporal features. The
conducted explainability analysis indicates that spectrotemporal
information corresponding to non-vowel segments is less important
than the vowel-based information. Explainability of the high-
level information capturing the segment-by-segment decisions is
further inspected for participants with and without depression.
The findings from this work can provide a foundation for
knowledge-driven, interpretable decision-support systems that
assist clinicians in better understanding fine-grain temporal changes
in speech data, ultimately augmenting mental health diagnosis
and care.
Index Terms—Mental health, depression, speech vowels, con-
volutional neural network (CNN), long short-term memory
(LSTM), explainable machine learning
I. INTRODUCTION
Depression is the most common mental health (MH) condition,
affecting approximately 280 million people worldwide [1].
Patients with depression experience, among other symptoms,
feelings of sadness, irritability, and emptiness; loss of
pleasure or interest in activities; and feelings of guilt or
low self-worth. Despite the high prevalence of depression and
its impact on patients’ functioning and quality of life, only
approximately half of respondents with depressive or anxiety
disorders receive treatment [2] due to lack of insurance
coverage, unequal access to evidence-based practices, stigma,
MH workforce shortages, and geographical maldistribution of
providers. These challenges are particularly prevalent among
racial-ethnic minority groups and underserved communities
that also frequently suffer from diagnostic assessment bias and
diagnostic errors [3].

This work is supported by the National Science Foundation
(CAREER: Enabling Trustworthy Speech Technologies for Mental
Health Care: From Speech Anonymization to Fair Human-centered
Machine Intelligence, #2046118). The code developed as part of
this work is publicly available at
https://github.com/HUBBS-Lab-TAMU/2dCNN-LSTM-depression-identification
Qualitative and quantitative evidence suggests that speech
production mechanisms are influenced by depression, and
that patients with depression exhibit slowed speech rate,
monotonous pitch, and reduced loudness, which are reflected
in prosodic measures such as fundamental frequency (F0),
intensity, and speaking rate [4]. In addition, speech technologies
in tandem with machine learning (ML) can be deployed ubiquitously,
providing access to in-situ, ecologically valid data and
allowing just-in-time support during states of opportunity and
states of vulnerability [5]. Yet, converting ML-derived deci-
sions into effective action remains a challenge [6]. Clinicians
may find it difficult to trust complex ML algorithms over their
own intuition, unless they are provided with explanations that
can facilitate alert interpretation and promote the transparency
of the “black-box” system [7].
Depression can influence motor control and, consequently,
speech production. Speech is produced as sound generated by
the glottis and modulated by the vocal tract, which acts as a
resonant filter, giving rise to formant frequencies and
spectrotemporal variations [8]. Physiological and cognitive
impairments associated with MH conditions affect the
phonological loop [9] and produce noticeable energy variations
in speech vowels. An early study by Shimizu et al. suggests
that people with depression exhibit decreased laryngeal vagal
function, which regulates vocal fold motion; this deficit is
manifested as increased chaotic patterns in vowel sounds [10].
Scherer et al. found that individuals with depression exhibit reduced
vowel space, defined as the frequency range between the first
and second formant (i.e., F1 and F2) of the vowels, compared
to their healthy counterparts [11]. Vlasenko et al. analyzed the
effect of depression on formant dynamics for female and male
speakers separately [12]. Their results demonstrated the
potential of gender-dependent vowel-based features for
depression detection, which outperformed a range of turn-level
acoustic features extracted without considering vowel-based
information. Finally, Stasak et al. examined linguistic stress
differences between non-depressed and clinically depressed
individuals and found statistically significant differences in
vowel articulatory parameters, with shorter vowel durations and
less variance for ’low’, ’back’, and ’rounded’ vowel positions
among individuals with depression [13]. Evidence from these
studies suggests that the effects of psychomotor retardation
associated with depression are linked to tangible differences
in vowel production, whose measures can be particularly
informative when detecting depression from speech.
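To ground these measures, the sketch below estimates F1 and F2
at vowel midpoints and a simple vowel-space proxy in the spirit
of Scherer et al. [11]. It assumes the praat-parselmouth
library; the audio path, vowel timestamps, and the range-based
proxy are hypothetical illustrations, not the cited papers'
exact procedure.

```python
# Illustrative sketch (not from the paper): formant-based vowel-space
# measures in the spirit of Scherer et al. [11], via praat-parselmouth.
import parselmouth

snd = parselmouth.Sound("utterance.wav")          # hypothetical file
formants = snd.to_formant_burg(time_step=0.01)    # Burg-method formant track

# Hypothetical mid-vowel timestamps (seconds) for /i/, /u/, /a/ tokens,
# e.g., obtained from a forced aligner.
vowel_times = {"i": [0.42, 1.93], "u": [0.88], "a": [1.35, 2.71]}

f1_vals, f2_vals = [], []
for vowel, times in vowel_times.items():
    for t in times:
        f1_vals.append(formants.get_value_at_time(1, t))  # F1 in Hz
        f2_vals.append(formants.get_value_at_time(2, t))  # F2 in Hz

# A simple vowel-space proxy: the F1 and F2 ranges across corner vowels;
# reduced ranges would correspond to the compressed vowel space reported
# for individuals with depression [11].
print("F1 range (Hz):", max(f1_vals) - min(f1_vals))
print("F2 range (Hz):", max(f2_vals) - min(f2_vals))
```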
Leveraging this evidence, we propose a vowel-dependent
deep learning approach for classifying depression from speech.
We extract low-level vowel-dependent feature embeddings via
a convolutional neural network (CNN), which is trained for the
task of vowel classification, thus, capturing energy variations
of the speech spectrogram within the corresponding vowel
region. The learned vowel-based embeddings comprise the
input to a long short-term memory (LSTM) model that models
high-level temporal dependencies across speech segments and
outputs the depression outcome. We compare the proposed
approach against prior deep learning models that do not
leverage vowel-dependent information and find that the proposed
method outperforms the considered baselines. We further
conduct an empirical analysis
using a modified version of Local Interpretable Model-
agnostic Explanations (LIME) to identify the parts of the
speech spectrogram that contribute most to the decision about
the depression condition. Results from this work could po-
tentially contribute to reliable and interpretable speech-based
ML models of depression that consider low-level fine-grain
spectrotemporal information along with high-level transitions
within and across an utterance, and could be eventually used
to assist MH clinicians in decision-making.
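To make the proposed pipeline concrete, the following is a
minimal PyTorch sketch of the two-stage design: a 2-D CNN
pretrained for vowel classification supplies segment-level
embeddings, and an LSTM aggregates them into a depression
decision. Layer sizes, kernel choices, and the number of vowel
classes are illustrative assumptions; the authors' actual
implementation is available at the repository linked above.

```python
# Minimal sketch of the vowel-CNN + LSTM pipeline; all sizes are
# illustrative assumptions, not the paper's exact configuration.
import torch
import torch.nn as nn

class VowelCNN(nn.Module):
    """Low-level encoder trained for vowel classification on spectrogram
    segments; its penultimate layer serves as the vowel-based embedding."""
    def __init__(self, n_vowels=10, emb_dim=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.embed = nn.Linear(32 * 4 * 4, emb_dim)
        self.vowel_head = nn.Linear(emb_dim, n_vowels)  # used in pretraining only

    def forward(self, x):               # x: (batch, 1, freq, time)
        z = self.features(x).flatten(1)
        return self.embed(z)            # vowel-based embedding

class DepressionLSTM(nn.Module):
    """High-level model over the sequence of segment embeddings within and
    across utterances; outputs the depression decision."""
    def __init__(self, emb_dim=128, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)  # depressed vs. non-depressed

    def forward(self, seq):             # seq: (batch, n_segments, emb_dim)
        _, (h, _) = self.lstm(seq)
        return self.head(h[-1])

cnn, lstm = VowelCNN(), DepressionLSTM()
segments = torch.randn(1, 20, 1, 64, 64)   # 20 spectrogram segments (toy sizes)
embs = torch.stack([cnn(s) for s in segments.unbind(1)], dim=1)
logits = lstm(embs)                        # (1, 2) depression logits
```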
II. PREVIOUS WORK
Early work in depression estimation from speech has ex-
plored acoustic biomarkers of depression that quantify char-
acteristics of the vocal source, vocal tract, formants, and prosody.
Psychomotor retardation associated with depression can slow
the movement of articulatory muscles, therefore causing de-
creases in speech rate and phoneme rate, as well as decreases
in pitch and energy variability [4], [14]. Other work has
quantified changes in coordination of the vocal tract motion
by measuring correlation coefficients over different temporal
scales across formant frequencies and the first-order deriva-
tives of the Mel-frequency cepstral coefficients (MFCC) [15].
Formant information has been considered by measuring the
distance between F1 and F2 (the first and second formant)
coordinates in certain vowels (i.e., /i/, /u/, /a/), and showing
that this distance is reduced for patients with depression
compared to healthy individuals [11]. Finally, other work has
modeled the acoustic variability of MFCCs at the utterance
level, indicating that patients with depression exhibit reduced
variability with regard to this spectral measure [16]. The above
measures along with commonly used acoustic features of
prosody, energy, intonation, and spectral information have
been employed as the input to various ML models for the
task of depression recognition [17], [18].
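As a brief illustration of such hand-crafted inputs, the sketch
below computes utterance-level descriptors of the kind surveyed
above (F0 statistics, energy, and MFCC variability [16]) using
the librosa library; the file name and parameter values are
assumptions for illustration, not settings from the cited
studies.

```python
# Sketch of conventional utterance-level acoustic descriptors; parameter
# values and the input file are illustrative assumptions.
import librosa
import numpy as np

y, sr = librosa.load("utterance.wav", sr=16000)

# Prosody: fundamental frequency (F0) via probabilistic YIN.
f0, voiced_flag, _ = librosa.pyin(y, fmin=65, fmax=400, sr=sr)
f0_voiced = f0[voiced_flag]                      # keep voiced frames only

# Energy: short-term root-mean-square amplitude.
rms = librosa.feature.rms(y=y)[0]

# Spectral: MFCCs; per-coefficient std captures the utterance-level
# variability reported to be reduced in depression [16].
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

features = np.concatenate([
    [np.nanmean(f0_voiced), np.nanstd(f0_voiced)],  # pitch level, variability
    [rms.mean(), rms.std()],                        # loudness level, variability
    mfcc.std(axis=1),                               # MFCC variability (13 values)
])
print(features.shape)  # (17,) vector for a downstream ML classifier
```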
Other work has leveraged recent advances in deep learn-
ing to reliably estimate depression from speech. Ma et al.
proposed a deep model called “DepAudioNet” that encodes
the Mel-scale filter bank representation of speech via a
1-dimensional CNN followed by an LSTM [19]. The convolutional transformations
implemented in the 1D CNN encode low-level short-term
spectrotemporal speech variations, the max-pooling layers of
the 1D CNN capture mid-level variations, while the LSTM
extracts long-term information within an utterance. Muzammel
et al. incorporated speech spectrograms at the phoneme level
as the input to 2-dimensional CNNs [20]. Vowels and conso-
nants were separated from speech signals and served as the
input to a 2-D CNN that performed depression classification.
Results demonstrated that fusing consonant-based and
vowel-based embeddings yields the best performance. Saidi
et al. used the speech spectrogram as the input of a 2-D
CNN that was trained for depression classification [21]. The
learned embeddings of the CNN were flattened and comprised
the input to a support vector machine (SVM). The CNN-
SVM combination yielded an absolute increase of 10% in
depression classification accuracy compared to the CNN alone.
Srimadhur & Lalitha compared a 2-D CNN whose input was
the speech spectrogram with an end-to-end 1D CNN whose
input was the raw speech signal, with the latter yielding
improved results [22]. Zhao et al. proposed the hierarchical
attention transfer network (HATN), a two-level hierarchical
network with an attention mechanism that models speech
variations at the frame and sentence level [23]. This model
learns attention weights for a speech recognition task (source
task), implemented via a deep teacher network, and transfers
those weights to the depression severity estimation task (target
task), implemented via a shallow student network.
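To make the DepAudioNet-style layering concrete, below is a
rough PyTorch sketch of a 1D CNN over Mel filter-bank frames
followed by max-pooling and an LSTM, mirroring the low-, mid-,
and long-term stages described above [19]; all sizes are
illustrative assumptions rather than the published
configuration.

```python
# Rough DepAudioNet-style sketch: 1D convolution (low-level, short-term),
# max-pooling (mid-level), LSTM over the utterance (long-term). Sizes are
# illustrative assumptions, not the published configuration.
import torch
import torch.nn as nn

class DepAudioNetSketch(nn.Module):
    def __init__(self, n_mels=40, channels=64, hidden=128):
        super().__init__()
        self.conv = nn.Conv1d(n_mels, channels, kernel_size=3, padding=1)
        self.pool = nn.MaxPool1d(kernel_size=3, stride=3)
        self.lstm = nn.LSTM(channels, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)   # binary depression score

    def forward(self, x):                  # x: (batch, n_mels, frames)
        z = self.pool(torch.relu(self.conv(x)))
        _, (h, _) = self.lstm(z.transpose(1, 2))  # to (batch, frames', channels)
        return self.head(h[-1])

model = DepAudioNetSketch()
logits = model(torch.randn(2, 40, 300))    # 2 utterances, 300 Mel frames each
print(logits.shape)                        # torch.Size([2, 1])
```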
The contributions of this work compared to the previous
research are: (1) In contrast to the majority of prior work
in deep learning, this paper proposes a CNN-based architec-
ture that learns vowel-based embeddings, therefore explicitly
incorporating low-level spectrotemporal information that is
valuable for identifying depression [10]–[13], [24]. While
phoneme-based information has been modeled before [20],
it was only incorporated during the extraction of spectrogram
features, without further deriving high-level representations;
(2) High-level information is extracted from the vowel-based
embeddings via an LSTM that models the evolution of this
information within and across utterances, a technique that is
fairly unexplored in depression detection and has not been
combined with vowel-based information [19]; and (3) The
utility of the learned embeddings is qualitatively explored in
terms of their interpretability with respect to the depression outcome.
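As a concrete illustration of contribution (3), the following
is a minimal sketch of a LIME-style occlusion analysis over a
speech spectrogram: time regions are randomly masked, the model
is re-queried, and a linear surrogate assigns each region an
importance weight. The region grid, sample count, and
`model_fn` are hypothetical placeholders; the modified LIME
used in this work may differ in its specifics.

```python
# Conceptual LIME-style sketch: mask spectrogram time regions, re-query the
# model, and fit a linear surrogate over the masks. Hypothetical interface.
import numpy as np
from sklearn.linear_model import Ridge

def lime_spectrogram(spec, model_fn, n_regions_t=8, n_samples=200, seed=0):
    rng = np.random.default_rng(seed)
    width = spec.shape[1] // n_regions_t    # split the time axis into regions
    masks = rng.integers(0, 2, size=(n_samples, n_regions_t))
    preds = np.empty(n_samples)
    for i, m in enumerate(masks):
        perturbed = spec.copy()
        for r, keep in enumerate(m):
            if not keep:                    # zero out the masked region
                perturbed[:, r * width:(r + 1) * width] = 0.0
        preds[i] = model_fn(perturbed)      # depression likelihood in [0, 1]
    surrogate = Ridge(alpha=1.0).fit(masks, preds)
    return surrogate.coef_                  # per-region importance weights

# Toy usage: a dummy model driven by columns 16-32, i.e., regions 2 and 3.
toy_model = lambda s: s[:, 16:32].mean()
weights = lime_spectrogram(np.random.rand(64, 64), toy_model)
print(np.round(weights, 3))                # regions 2 and 3 get large weights
```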