Audio MFCC-gram Transformers for respiratory insufficiency detection in COVID-19

Marcelo Matheus Gauy1, Marcelo Finger1

1Instituto de Matemática e Estatística – Universidade de São Paulo (USP)
Abstract. This work explores speech as a biomarker and investigates the detection of respiratory insufficiency (RI) by analyzing speech samples. Previous work [Casanova et al. 2021] constructed a dataset of utterances from COVID-19 patients with respiratory insufficiency and analyzed it by means of a convolutional neural network, achieving an accuracy of 87.04% and validating the hypothesis that one can detect RI through speech. Here, we study how Transformer neural network architectures can improve performance on RI detection. This approach enables the construction of an acoustic model. By choosing the correct pretraining technique, we generate a self-supervised acoustic model, leading to improved performance (96.53%) of Transformers for RI detection.
1. Introduction
COVID-19 is the cause of a major pandemic that threatens to collapse the healthcare systems of many regions of the world. Respiratory insufficiency (RI) is one of the symptoms of COVID-19; it often requires hospitalization and is aggravated by a common COVID-19 condition called silent hypoxia: low blood oxygen concentration without shortness of breath [Tobin et al. 2020]. This work aims to help deal with the COVID-19 pandemic by providing an automated system, based on deep learning techniques, capable of detecting RI in COVID-19 patients. Such an automated system could, for example, support cellphone-based patient triage procedures, alleviating the burden on health personnel.
We explore the view of speech as a biomarker, building on a recently established result: it is possible to detect respiratory insufficiency by analyzing spoken utterances recorded in real-life conditions (typically a moderately long sentence). This hypothesis was previously verified [Casanova et al. 2021] using a CNN-based deep neural network. This CNN received a moderately long sentence spoken in real-life conditions and had to predict whether it came from a patient with RI or from the control group. In this work, we further analyze that hypothesis by studying other network architectures (namely, Transformers [Vaswani et al. 2017]), in an attempt to improve the results previously obtained in [Casanova et al. 2021], with a view to extending it in the future to RI originating from other causes, such as influenza, heart disease or mental illness.
In this work we find that Transformers can be used to detect respiratory insufficiency with an accuracy of 96.38%, up from the 87.04% of [Casanova et al. 2021]. To reach that level of performance, we feed the Transformers a sequence of Mel Frequency Cepstral Coefficients (MFCC) obtained from the patients' audio (hence the name MFCC-gram Transformers). As with the CNN-based detection of [Casanova et al. 2021], Transformer performance drops significantly (to 82.87%) if we feed it standard spectrogram coefficients (called Spectrogram Transformers, after [Gong et al. 2021]).
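For illustration, an MFCC sequence of the kind such a model consumes can be computed as in the following toy numpy sketch (the 25 ms frame size, 10 ms hop, and coefficient counts are common defaults, not necessarily the settings used in this work):

```python
import numpy as np

def mfcc_gram(signal, sr=16000, n_fft=400, hop=160, n_mels=26, n_mfcc=13):
    """Toy MFCC extractor: framing -> power spectrum -> mel filterbank -> log -> DCT-II.
    Parameter values are illustrative assumptions, not the paper's settings."""
    # slice the waveform into overlapping Hann-windowed frames
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop:i * hop + n_fft] for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames * np.hanning(n_fft), axis=1)) ** 2

    # triangular mel filterbank spanning 0 .. sr/2
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mels = np.linspace(0.0, hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    logmel = np.log(power @ fbank.T + 1e-10)

    # DCT-II over the mel axis; keep the first n_mfcc cepstral coefficients
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_mfcc), 2 * n + 1) / (2 * n_mels))
    return logmel @ dct.T  # shape: (time frames, n_mfcc)

sig = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s of a 440 Hz tone
print(mfcc_gram(sig).shape)  # (98, 13)
```

The resulting (frames, coefficients) matrix is the "MFCC-gram" that plays the role of the token sequence in a text Transformer.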
arXiv:2210.14085v1 [cs.SD] 25 Oct 2022
Transformers [Vaswani et al. 2017] were shown to be very effective when training is divided into two phases [Devlin et al. 2018]. The pretraining phase generates a language-based acoustic model through unsupervised (or self-supervised) learning, by optimizing a generic language prediction task on a large amount of generic data. Then, the acoustic model undergoes a task-specific refinement phase in which both the acoustic model and additional task-specific neural modules are trained on smaller-scale application data. A baseline Transformer is one whose pretraining is replaced by a random assignment of weights.
Here, we find that MFCC-gram Transformers benefit from being pretrained on large quantities of spoken Brazilian Portuguese audio; the resulting acoustic model is later refined for the target task of detecting respiratory insufficiency. For pretraining, we explore three known techniques from the literature [Liu et al. 2020b, Liu et al. 2020a] and find that they generally lead to some performance improvement over baseline Transformers. Performance reaches 96.53% using the best of the available techniques.
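A minimal sketch of the masked-reconstruction idea behind such self-supervised pretraining (the 15% mask ratio, zero-masking, and L1 loss below are assumptions modeled on Mockingjay/TERA, not necessarily this paper's exact configuration):

```python
import numpy as np

rng = np.random.default_rng(0)

def masked_reconstruction_loss(frames, model, mask_ratio=0.15):
    """One Cloze-style pretraining step: hide a fraction of the MFCC frames
    and score how well the model reconstructs them (L1 loss on hidden frames only)."""
    masked = rng.random(len(frames)) < mask_ratio  # choose frames to hide
    if not masked.any():
        masked[0] = True                           # guarantee at least one hidden frame
    inp = frames.copy()
    inp[masked] = 0.0                              # the model never sees these frames
    pred = model(inp)                              # a real model is a Transformer encoder
    return np.abs(pred[masked] - frames[masked]).mean()

# stand-in "model" that just echoes its (masked) input
identity = lambda x: x
frames = rng.standard_normal((100, 13))            # pretend MFCC-gram: 100 frames x 13 coeffs
loss = masked_reconstruction_loss(frames, identity)
```

Because the loss is computed only on the hidden frames, the model must exploit the surrounding acoustic context to reconstruct them, which is what yields a reusable acoustic representation without any labels.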
2. Related Work
In addition to [Casanova et al. 2021], there have been other works [Pinkas et al. 2020, Laguarta et al. 2020] which study COVID-19 with deep learning using voice-related data. [Pinkas et al. 2020] attempt to detect SARS-CoV-2 (the virus that causes COVID-19) from voice audio data, while this work and [Casanova et al. 2021] attempt to detect RI. Furthermore, previous works support the view of speech as a biomarker [Botelho et al. 2019, Nevler et al. 2019, Robin et al. 2020].
Transformers were designed for NLP [Vaswani et al. 2017, Devlin et al. 2018] and were later also used in audio processing tasks [Liu et al. 2020b, Liu et al. 2020a, Schneider et al. 2019, Baevski et al. 2020, Baevski et al. 2019, Song et al. 2019]. In Mockingjay and TERA [Liu et al. 2020b, Liu et al. 2020a], Transformers were applied to phoneme classification and speaker recognition tasks, where it was shown that audio variants of the Cloze task [Taylor 1953, Devlin et al. 2018] could be used for unsupervised pretraining. In Wav2Vec and its variants [Schneider et al. 2019, Baevski et al. 2020, Baevski et al. 2019], a contrastive loss enables unsupervised pretraining, which is later finetuned for speech and phoneme recognition tasks. In Speech-XLNet [Song et al. 2019], a speech-based version of XLNet [Yang et al. 2019] was proposed; XLNet maximizes the expected log-likelihood of a sequence of words over all possible autoregressive factorization orders.
3. Methodology
3.1. Datasets
For the task of respiratory insufficiency detection, the data used in the refinement phase is the same as in [Casanova et al. 2021]. There, COVID-19 patient utterances were collected by medical students in COVID wards from patients with a blood oxygenation level below 92%, taken as an indication of RI. Control data was collected through voice donations over the internet, without access to blood oxygenation measurements; the donors were therefore assumed healthy. As COVID wards are noisy locations, an extra collection was made, consisting of samples of pure background noise (no voice). This is a crucial step in preventing the network from overfitting to the background noise differences between the two data collections. The gathered audios contained 3 utterances:
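One way such noise samples can be used to equalize recording conditions across classes (a sketch under assumptions; this excerpt does not specify the exact procedure) is to mix ward background noise into control utterances at a chosen signal-to-noise ratio:

```python
import numpy as np

def inject_noise(clean, noise, snr_db):
    """Mix background noise into a clean utterance at a target signal-to-noise
    ratio (in dB), so both classes share similar recording conditions."""
    noise = np.resize(noise, clean.shape)  # tile/crop the noise to the utterance length
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12  # avoid division by zero for silent noise
    scale = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise

rng = np.random.default_rng(1)
clean = rng.standard_normal(16000)  # stand-in for a 1 s control utterance
noise = rng.standard_normal(4000)   # stand-in for a ward background-noise sample
noisy = inject_noise(clean, noise, snr_db=10.0)
```

This prevents the trivial shortcut of classifying recordings by the presence of ward noise rather than by the speaker's respiratory condition.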