A KNOWLEDGE-DRIVEN VOWEL-BASED APPROACH OF DEPRESSION
CLASSIFICATION FROM SPEECH USING DATA AUGMENTATION
Kexin Feng and Theodora Chaspari
Computer Science and Engineering
Texas A&M University
{kexin, chaspari}@tamu.edu
ABSTRACT
We propose a novel explainable machine learning (ML) model that identifies depression from speech by modeling the temporal dependencies across utterances and utilizing the spectrotemporal information at the vowel level. Our method first models the variable-length utterances at the local level into a fixed-size vowel-based embedding using a convolutional neural network with a spatial pyramid pooling layer ("vowel CNN"). Following that, depression is classified at the global level from a group of vowel CNN embeddings that serve as the input of another 1D CNN ("depression CNN"). Different data augmentation methods are designed for training both the vowel CNN and the depression CNN. We investigate the performance of the proposed system at various temporal granularities when modeling short, medium, and long analysis windows, corresponding to 10, 21, and 42 utterances, respectively. The proposed method reaches comparable performance with previous state-of-the-art approaches and exhibits explainable properties with respect to the depression outcome. The findings from this work may benefit clinicians by providing additional intuition during joint human-ML decision-making tasks.
Index Terms— Mental health, speech vowel, knowledge-driven, convolutional neural network, data augmentation
1. INTRODUCTION
Depression is a mental health (MH) condition with large worldwide prevalence [1], whose diagnosis and treatment are challenging due to the lack of access to MH care resources and to stigma [2]. Speech-based machine learning (ML) systems have shown promising results in identifying depression due to their ability to learn clinically-relevant acoustic patterns, such as monotonous pitch and reduced loudness [3]. In addition, these systems can potentially mitigate social stigma and increase accessibility to MH care resources, since they can run locally on users' smartphone devices. Various ML models, including support vector machines (SVM), convolutional neural networks (CNN), and long short-term memory (LSTM) networks, have been explored for depression estimation [4]. However, the majority of these methods are designed independently of MH clinicians and thus present challenges in transparency and explainability.

This work is supported by the National Science Foundation (CAREER: Enabling Trustworthy Speech Technologies for Mental Health Care: From Speech Anonymization to Fair Human-centered Machine Intelligence, #2046118). The code is available at: https://github.com/HUBBS-Lab-TAMU/ICASSP-2023-Augmented-Knowledge-Driven-Speech-Based-Method-of-Depression-Detection.
Interactions between humans and ML are evolving into collaborative relationships, where the two parties work together to achieve a set of common goals, especially when it comes to complex and highly subjective decision-making tasks, such as the ones pertaining to MH care. An explainable ML model of depression estimation would allow clinicians to gain insights into the ML logic and decision-making processes and would contribute toward better calibrating their trust in the model output [5]. Previously proposed conceptual frameworks for building human-centered explainable ML suggest that users may be able to develop a mental model of the algorithm based on a collection of "how explanations" that demonstrate how the model works across multiple instances [6]. In addition, it is important to provide both global explanations that describe holistically how the model works and local explanations that demonstrate the relationship between inputs and outputs [7].
Here, we design an explainable ML model for depression classification based on speech. We leverage knowledge from speech production indicating that depression can influence motor control and, consequently, the formant frequencies and spectrotemporal variations at the vowel level [8]. We propose a vowel-dependent CNN (vowel CNN) with a spatial pyramid pooling (SPP) layer that learns the spectrotemporal information of short-term speech segments (i.e., 250ms) throughout the utterance. Depression is then estimated from a group of vowel CNN embeddings using another 1D CNN (depression CNN). The vowel CNN captures depression information at the local level from parts of speech that are theoretically postulated to be most affected by the MH condition [8]. The SPP layer maps utterances of any size into a fixed-size embedding that contributes to modeling explanations at the utterance level, which can provide a global view of the depression outcome.
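To make the two-stage pipeline concrete, the sketch below illustrates how an SPP layer can map variable-length utterance spectrograms into fixed-size embeddings that a 1D depression CNN then aggregates over a window of utterances. This is a minimal PyTorch illustration only: the layer counts, kernel sizes, pyramid levels (1, 2, 4), 40 frequency bins, embedding dimension, and the names spatial_pyramid_pool, VowelCNN, and DepressionCNN are our own placeholder assumptions, not the configuration used in this paper.

```python
# Minimal sketch of the vowel CNN + depression CNN pipeline (PyTorch).
# All sizes and hyperparameters below are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


def spatial_pyramid_pool(x, levels=(1, 2, 4)):
    """Pool a (batch, channels, freq, time) feature map into a fixed-size
    vector regardless of the time dimension, by adaptive max-pooling at
    several pyramid levels and concatenating the results."""
    feats = []
    for level in levels:
        pooled = F.adaptive_max_pool2d(x, output_size=(level, level))
        feats.append(pooled.flatten(start_dim=1))
    return torch.cat(feats, dim=1)


class VowelCNN(nn.Module):
    """Local model: maps a variable-length vowel-segment spectrogram
    to a fixed-size embedding via the SPP layer."""
    def __init__(self, n_channels=32, levels=(1, 2, 4), embed_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, n_channels, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(n_channels, n_channels, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.levels = levels
        self.proj = nn.Linear(n_channels * sum(l * l for l in levels), embed_dim)

    def forward(self, spec):                # spec: (batch, 1, freq, time)
        h = self.conv(spec)
        h = spatial_pyramid_pool(h, self.levels)
        return self.proj(h)                 # (batch, embed_dim)


class DepressionCNN(nn.Module):
    """Global model: a 1D CNN over a sequence of vowel CNN embeddings
    (one per utterance) that outputs a binary depression logit."""
    def __init__(self, embed_dim=128, n_filters=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(embed_dim, n_filters, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),
        )
        self.head = nn.Linear(n_filters, 1)

    def forward(self, embeddings):          # (batch, n_utterances, embed_dim)
        h = self.conv(embeddings.transpose(1, 2)).squeeze(-1)
        return self.head(h)                 # depression logit per window


# Example: a medium window of 21 variable-length utterances -> one prediction.
vowel_cnn, dep_cnn = VowelCNN(), DepressionCNN()
utterances = [torch.randn(1, 1, 40, torch.randint(20, 80, (1,)).item())
              for _ in range(21)]
embeddings = torch.cat([vowel_cnn(u) for u in utterances]).unsqueeze(0)
logit = dep_cnn(embeddings)
```

Because the SPP pooling grid is fixed while the input time axis is not, every utterance yields an embedding of identical size, which is what allows the depression CNN to treat a window of 10, 21, or 42 utterances as a regular sequence.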