
A KNOWLEDGE-DRIVEN VOWEL-BASED APPROACH OF DEPRESSION
CLASSIFICATION FROM SPEECH USING DATA AUGMENTATION
Kexin Feng and Theodora Chaspari
Computer Science and Engineering
Texas A&M University
{kexin, chaspari}@tamu.edu
ABSTRACT
We propose a novel explainable machine learning (ML)
model that identifies depression from speech, by modeling
the temporal dependencies across utterances and utilizing the
spectrotemporal information at the vowel level. Our method
first maps the variable-length utterances at the local level
into a fixed-size vowel-based embedding using a convolu-
tional neural network with a spatial pyramid pooling layer
(“vowel CNN”). Depression is then classified at the global
level from a group of vowel CNN embeddings that serve as
the input to another 1D CNN (“depression CNN”). Different
data augmentation methods are designed for training both
the vowel CNN and the depression CNN. We
investigate the performance of the proposed system at var-
ious temporal granularities when modeling short, medium,
and long analysis windows, corresponding to 10, 21, and
42 utterances, respectively. The proposed method achieves
performance comparable to previous state-of-the-art ap-
proaches and exhibits explainable properties with respect to
the depression outcome. The findings from this work may
benefit clinicians by providing additional intuition during
joint human-ML decision-making tasks.
Index Terms—Mental health, speech vowel, knowledge-
driven, convolutional neural network, data augmentation
1. INTRODUCTION
Depression is a mental health (MH) condition with large
worldwide prevalence [1], whose diagnosis and treatment
are challenging due to the lack of access to MH care resources
and stigma [2]. Speech-based machine learning (ML) systems
have shown promising results in identifying depression due
to their ability to learn clinically relevant acoustic patterns,
such as monotonous pitch and reduced loudness [3]. In addi-
tion, these systems can potentially mitigate social stigma and
increase accessibility to MH care resources, since they can
run locally on users’ smartphone devices. Various ML mod-
els, including support vector machines (SVMs), convolutional
neural networks (CNNs), and long short-term memory (LSTM)
networks, have been explored for depression estimation [4].
However, the majority of these methods are designed inde-
pendently of MH clinicians, thus posing challenges in terms
of transparency and explainability.

This work is supported by the National Science Foundation (CAREER:
Enabling Trustworthy Speech Technologies for Mental Health Care: From
Speech Anonymization to Fair Human-centered Machine Intelligence,
#2046118). The code is available at:
https://github.com/HUBBS-Lab-TAMU/ICASSP-2023-Augmented-Knowledge-Driven-Speech-Based-Method-of-Depression-Detection
Interactions between humans and ML are evolving into
collaborative relationships, where the two parties work to-
gether to achieve a set of common goals, especially when
it comes to complex and highly subjective decision-making
tasks, such as the ones pertaining to MH care. An explainable
ML model of depression estimation would allow clinicians
to gain insights into the ML logic and decision-making pro-
cesses and contribute toward better calibrating their trust in
the model output [5]. Previously proposed conceptual frame-
works for building human-centered explainable ML suggest
that users may be able to develop a mental model of the
algorithm based on a collection of “how explanations” that
demonstrate how the model works based on multiple in-
stances [6]. In addition, it is important to provide both global
explanations that describe holistically how the model works,
and local explanations that demonstrate the relationship be-
tween inputs and outputs [7].
Here, we design an explainable ML model for depression
classification based on speech. We leverage knowledge from
speech production indicating that depression can influence
motor control and, consequently, the formant frequencies and
spectrotemporal variations at the vowel level [8]. We propose
a vowel-dependent CNN (vowel CNN) with a spatial pyra-
mid pooling (SPP) layer that learns the spectrotemporal infor-
mation of short-term speech segments (i.e., 250 ms) through-
out the utterance. Depression is estimated from a group of
vowel CNN embeddings using another 1D CNN (depression
CNN). The vowel CNN captures depression information at
the local level from parts of speech that are theoretically pos-
tulated to be most affected by the MH condition [8]. The SPP
layer maps utterances of any size into a fixed-size embedding
that contributes to modeling explanations at the utterance
level, which can provide a global view of the depression outcome.
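To make the two-stage architecture concrete, the sketch below shows how an
SPP layer can map a variable-length utterance spectrogram to a fixed-size
embedding, and how a 1D CNN can then classify a window of such embeddings.
This is a minimal illustration written in PyTorch; the framework, layer
widths, pyramid levels, and embedding dimension are assumptions made for
exposition, not the authors' reported configuration.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SpatialPyramidPooling(nn.Module):
        # Pools a (batch, channels, freq, time) feature map into a
        # fixed-size vector, regardless of the variable time dimension.
        def __init__(self, levels=(1, 2, 4)):
            super().__init__()
            self.levels = levels

        def forward(self, x):
            feats = []
            for n in self.levels:
                # Adaptive pooling yields an n x n grid for any input size.
                feats.append(F.adaptive_max_pool2d(x, (n, n)).flatten(1))
            return torch.cat(feats, dim=1)  # (batch, channels * (1+4+16))

    class VowelCNN(nn.Module):
        # Maps a variable-length spectrogram of one utterance to a
        # fixed-size vowel-based embedding (dimensions are illustrative).
        def __init__(self, n_channels=32, embed_dim=128):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(1, n_channels, 3, padding=1), nn.ReLU(),
                nn.Conv2d(n_channels, n_channels, 3, padding=1), nn.ReLU(),
            )
            self.spp = SpatialPyramidPooling()
            self.fc = nn.Linear(n_channels * (1 + 4 + 16), embed_dim)

        def forward(self, spec):  # spec: (batch, 1, n_freq_bins, n_frames)
            return self.fc(self.spp(self.conv(spec)))  # (batch, embed_dim)

    class DepressionCNN(nn.Module):
        # 1D CNN over a window of utterance embeddings
        # (e.g., 10, 21, or 42 utterances per window).
        def __init__(self, embed_dim=128, n_classes=2):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv1d(embed_dim, 64, 3, padding=1), nn.ReLU(),
                nn.AdaptiveMaxPool1d(1),
            )
            self.fc = nn.Linear(64, n_classes)

        def forward(self, embs):  # embs: (batch, n_utterances, embed_dim)
            return self.fc(self.conv(embs.transpose(1, 2)).squeeze(-1))

    # Example: a spectrogram with 80 frequency bins and any number of
    # frames yields a 128-d embedding, so utterances of different
    # durations become comparable inputs to the depression CNN.
    # emb = VowelCNN()(torch.randn(1, 1, 80, 57))  # -> shape (1, 128)

The key property illustrated is that adaptive pooling decouples the
embedding size from the utterance duration, which is what allows the
depression CNN to operate on fixed-size groups of utterance embeddings.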