speakers separately [12]. Results demonstrate the potential of
using gender-dependent vowel-based features for depression
detection, which outperformed a range of turn-level acoustic
features that were extracted without considering vowel-based
information. Finally, Stasak et al. examined linguistic stress
differences between non-depressed and clinically depressed
individuals and found statistically significant differences in
vowel articulatory parameters, with shorter vowel durations and
less variance for 'low', 'back', and 'rounded' vowel positions
for individuals with depression [13]. Evidence from these
studies suggests that the effects of psychomotor retardation
associated with depression are linked to tangible differences
in vowel production, measures of which can be particularly
informative for detecting depression from speech.
Leveraging this evidence, we propose a vowel-dependent
deep learning approach for classifying depression from speech.
We extract low-level vowel-dependent feature embeddings via
a convolutional neural network (CNN) trained for the task of
vowel classification, thus capturing energy variations of the
speech spectrogram within the corresponding vowel region.
The learned vowel-based embeddings form the input to a long
short-term memory (LSTM) network that captures high-level
temporal dependencies across speech segments and outputs the
depression outcome. We compare the proposed approach against
prior deep learning models that do not leverage vowel-dependent
information and find that it outperforms the considered
baselines. We further conduct an empirical analysis
using a modified version of Local Interpretable Model-agnostic
Explanations (LIME) to identify the parts of the speech
spectrogram that contribute most to the decision regarding the
depression condition. Results from this work could potentially
contribute to reliable and interpretable speech-based ML models
of depression that consider low-level, fine-grained
spectrotemporal information along with high-level transitions
within and across an utterance, and could eventually be used
to assist MH clinicians in decision-making.
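As a concrete illustration, the sketch below outlines the two-stage design in PyTorch; the layer sizes, kernel widths, embedding dimension, and number of vowel classes are assumptions chosen for readability, not the configuration used in our experiments.

```python
# Minimal PyTorch sketch of the two-stage design described above; all
# hyperparameters (layer sizes, kernel widths, embedding dimension, number
# of vowel classes) are illustrative assumptions.
import torch
import torch.nn as nn


class VowelEmbeddingCNN(nn.Module):
    """2-D CNN trained for vowel classification; its penultimate layer acts
    as the low-level vowel-dependent embedding of a spectrogram segment."""

    def __init__(self, n_vowels: int = 10, emb_dim: int = 128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.embed = nn.Linear(32 * 4 * 4, emb_dim)
        self.vowel_head = nn.Linear(emb_dim, n_vowels)   # used for vowel pre-training

    def forward(self, spec_segment):                     # (batch, 1, freq, time)
        z = self.features(spec_segment).flatten(1)
        emb = torch.relu(self.embed(z))
        return emb, self.vowel_head(emb)


class DepressionLSTM(nn.Module):
    """LSTM over the sequence of vowel embeddings of an utterance,
    producing a binary depression decision."""

    def __init__(self, emb_dim: int = 128, hidden: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, emb_seq):                          # (batch, n_segments, emb_dim)
        _, (h, _) = self.lstm(emb_seq)
        return torch.sigmoid(self.out(h[-1]))            # probability of depression
```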
II. PREVIOUS WORK
Early work in depression estimation from speech has explored
acoustic biomarkers of depression that quantify characteristics
of the vocal source, vocal tract, formants, and prosody.
Psychomotor retardation associated with depression can slow
the movement of the articulatory muscles, thereby causing
decreases in speech rate and phoneme rate, as well as decreases
in pitch and energy variability [4], [14]. Other work has
quantified changes in the coordination of vocal tract motion
by measuring correlation coefficients over different temporal
scales across formant frequencies and the first-order derivatives
of the Mel-frequency cepstral coefficients (MFCCs) [15].
Formant information has been considered by measuring the
distance between F1 and F2 (the first and second formant)
coordinates in certain vowels (i.e., /i/, /u/, /a/), and showing
that this distance is reduced for patients with depression
compared to healthy individuals [11]. Finally, other work has
modeled the acoustic variability of MFCCs at the utterance
level, indicating that patients with depression exhibit reduced
variability with respect to this spectral measure [16]. The above
measures, along with commonly used acoustic features of
prosody, energy, intonation, and spectral information, have
been employed as input to various ML models for the
task of depression recognition [17], [18].
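For illustration, two of these measures can be approximated with standard tooling, as sketched below; the frame settings, number of coefficients, and file name are assumptions and do not reproduce the setups of the cited studies.

```python
# Approximation of two measures discussed above: utterance-level MFCC
# variability [16] and cross-channel correlations of first-order MFCC
# derivatives [15]. Frame settings, number of coefficients, and the file
# name are illustrative assumptions.
import librosa
import numpy as np

y, sr = librosa.load("utterance.wav", sr=16000)       # hypothetical recording
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)    # (13, n_frames)
mfcc_delta = librosa.feature.delta(mfcc)              # first-order derivatives

# Utterance-level variability: standard deviation of each coefficient over time.
mfcc_variability = mfcc.std(axis=1)                   # (13,)

# Coordination-style measure: correlations across the delta-MFCC channels.
delta_corr = np.corrcoef(mfcc_delta)                  # (13, 13)
```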
Other work has leveraged recent advances in deep learning
to reliably estimate depression from speech. Ma et al.
proposed a deep model, called “DepAudioNet,” that encodes
the Mel-scale filter bank features of speech via a 1-dimensional
(1-D) CNN followed by an LSTM [19]. The convolutional
transformations implemented in the 1-D CNN encode low-level,
short-term spectrotemporal speech variations, its max-pooling
layers capture mid-level variations, and the LSTM
extracts long-term information within an utterance. Muzammel
et al. incorporated speech spectrograms at the phoneme level
as the input to 2-dimensional CNNs [20]. Vowels and consonants
were separated from the speech signals and served as the
input to a 2-D CNN that performed depression classification.
Results demonstrate that the fusion of consonant-based and
vowel-based embeddings yields the best performance. Saidi
et al. used the speech spectrogram as the input to a 2-D
CNN that was trained for depression classification [21]. The
learned embeddings of the CNN were flattened and formed
the input to a support vector machine (SVM). The CNN-SVM
combination yielded an absolute increase of 10% in depression
classification accuracy compared to the CNN alone.
Srimadhur & Lalitha compared a 2-D CNN whose input was
the speech spectrogram with an end-to-end 1-D CNN whose
input was the raw speech signal, with the latter yielding
improved results [22]. Zhao et al. proposed the hierarchical
attention transfer network (HATN), a two-level hierarchical
network with an attention mechanism that models speech
variations at the frame and sentence level [23]. This model
learns attention weights for a speech recognition task (source
task), implemented via a deep teacher network, and transfers
those weights to the depression severity estimation task (target
task), implemented via a shallow student network.
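A simplified sketch of the 1-D CNN followed by an LSTM, the pattern underlying DepAudioNet-style models, is given below; the filter counts, kernel sizes, and pooling factors are assumptions for illustration and do not reproduce the published configuration.

```python
# Simplified sketch of the 1-D CNN + LSTM pattern underlying DepAudioNet-style
# models [19]; filter counts, kernel sizes, and pooling factors are assumed
# for illustration only.
import torch
import torch.nn as nn


class CnnLstmBaseline(nn.Module):
    def __init__(self, n_mels: int = 40, hidden: int = 128):
        super().__init__()
        # 1-D convolution over time with Mel bands as input channels:
        # encodes low-level, short-term spectrotemporal variation.
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool1d(3),            # mid-level pooling over time
        )
        # The LSTM aggregates long-term information within the utterance.
        self.lstm = nn.LSTM(64, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, mel):             # (batch, n_mels, n_frames)
        z = self.conv(mel)              # (batch, 64, n_frames // 3)
        z = z.transpose(1, 2)           # (batch, time steps, channels)
        _, (h, _) = self.lstm(z)
        return torch.sigmoid(self.out(h[-1]))
```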
The contributions of this work compared to previous
research are: (1) In contrast to the majority of prior work
in deep learning, this paper proposes a CNN-based architecture
that learns vowel-based embeddings, thereby explicitly
incorporating low-level spectrotemporal information that is
valuable for identifying depression [10]–[13], [24]. While
phoneme-based information has been modeled before [20],
it was only incorporated during the extraction of spectrogram
features, without further deriving high-level representations;
(2) High-level information is extracted from the vowel-based
embeddings via the LSTM that models the evolution of this
information within and across utterances, a technique that is
fairly unexplored in depression detection and has not been
combined with vowel-based information [19]; and (3) The
utility of the learned embeddings is qualitatively explored in
terms of their interpretability with respect to the depression outcome.
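As an illustration of the interpretability analysis in contribution (3), the sketch below applies the standard LIME image explainer to a spectrogram; the analysis in this work relies on a modified LIME variant, and the prediction function and spectrogram here are placeholders for the trained model and a real input.

```python
# Sketch of applying the standard LIME image explainer to a spectrogram; this
# conveys the general idea only, since the analysis in this work uses a
# modified LIME variant. The prediction function and spectrogram below are
# placeholders for the trained CNN-LSTM pipeline and a real input.
import numpy as np
from lime import lime_image

rng = np.random.default_rng(0)
spectrogram = rng.random((128, 256))            # placeholder log-Mel spectrogram


def predict_proba(images: np.ndarray) -> np.ndarray:
    """Stand-in for the trained model: maps (n, H, W, 3) perturbed images to
    class probabilities for [non-depressed, depressed]."""
    p = rng.random(len(images))                 # replace with the model's forward pass
    return np.stack([1.0 - p, p], axis=1)


explainer = lime_image.LimeImageExplainer()
explanation = explainer.explain_instance(
    spectrogram,                                # 2-D (freq x time); LIME converts to RGB
    predict_proba,
    top_labels=1,
    num_samples=1000,
)
label = explanation.top_labels[0]
_, mask = explanation.get_image_and_mask(
    label, positive_only=True, num_features=5, hide_rest=False
)
# `mask` highlights the spectrogram regions contributing most to the decision.
```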