Transformers [Vaswani et al. 2017] were shown to be very effective when training is
divided into two phases [Devlin et al. 2018]. The pretraining phase generates a language-
based acoustic model through unsupervised (or self-supervised) learning, optimizing a
generic language prediction task on a large amount of generic data. Then, the acoustic
model undergoes a task-specific refinement phase in which both the acoustic model and
additional task-specific neural modules are trained on a smaller amount of application data. A
baseline Transformer is one whose pretraining is replaced by a random assignment of weights.
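As an illustration of this two-phase scheme, the sketch below (in PyTorch) pretrains a Transformer encoder on a generic task and then refines it together with a small task head. All names and the reconstruction objective are our own placeholder choices, not the implementation used in this paper:

```python
import torch
import torch.nn as nn

# Dummy stand-ins for real data (batch, time, feature).
pretrain_frames = torch.randn(8, 100, 128)   # large unlabeled generic corpus
task_frames = torch.randn(8, 100, 128)       # small labeled application data
task_labels = torch.randint(0, 2, (8,))      # e.g., RI vs. healthy

layer = nn.TransformerEncoderLayer(d_model=128, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=3)

# Phase 1: unsupervised pretraining on a generic prediction task
# (frame reconstruction with an L1 loss is used here as a stand-in).
recon_head = nn.Linear(128, 128)
opt = torch.optim.Adam(list(encoder.parameters()) + list(recon_head.parameters()))
loss = nn.functional.l1_loss(recon_head(encoder(pretrain_frames)), pretrain_frames)
opt.zero_grad(); loss.backward(); opt.step()

# Phase 2: task-specific refinement; the encoder is not frozen, so the
# acoustic model and the new classification head are trained together.
clf_head = nn.Linear(128, 2)
opt = torch.optim.Adam(list(encoder.parameters()) + list(clf_head.parameters()))
logits = clf_head(encoder(task_frames).mean(dim=1))   # mean-pool over time
loss = nn.functional.cross_entropy(logits, task_labels)
opt.zero_grad(); loss.backward(); opt.step()
```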
Here, we find that MFCC-gram Transformers benefit from being pretrained on
large quantities of spoken Brazilian Portuguese audio and then refined for the target
task of detecting respiratory insufficiency (RI). For pretraining, we explore three known tech-
niques from the literature [Liu et al. 2020b, Liu et al. 2020a] and find that they generally
lead to some performance improvement over baseline Transformers. Performance reaches
96.53% with the best of the available techniques.
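For concreteness, an MFCC-gram is the sequence of MFCC frames extracted from an utterance, which the Transformer consumes as its input tokens. A minimal sketch using librosa follows; the filename and parameter values are illustrative, not the paper's exact configuration:

```python
import librosa
import numpy as np

# Load an utterance and compute its MFCC-gram.
y, sr = librosa.load("utterance.wav", sr=16000)      # mono waveform
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # shape: (n_mfcc, n_frames)
mfccgram = mfcc.T.astype(np.float32)                 # shape: (n_frames, n_mfcc)
# Each row is one acoustic frame, fed to the Transformer as a token.
```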
2. Related Work
In addition to [Casanova et al. 2021], other works [Pinkas et al. 2020,
Laguarta et al. 2020] have studied COVID-19 with deep learning on voice-related data.
[Pinkas et al. 2020] attempt to detect SARS-CoV-2 (the virus that causes COVID-19)
from voice audio data, whereas this work and [Casanova et al. 2021] attempt to detect RI.
Furthermore, previous works support the view of speech as a
biomarker [Botelho et al. 2019, Nevler et al. 2019, Robin et al. 2020].
Transformers were designed for NLP [Vaswani et al. 2017, Devlin et al. 2018],
and were later also applied to audio processing tasks [Liu et al. 2020b, Liu et al. 2020a,
Schneider et al. 2019, Baevski et al. 2020, Baevski et al. 2019, Song et al. 2019]. In
Mockingjay and TERA [Liu et al. 2020b, Liu et al. 2020a], Transformers were used for
phoneme classification and speaker recognition tasks, and it was shown that variants of the
Cloze task [Taylor 1953, Devlin et al. 2018] adapted to audio could be used for unsuper-
vised pretraining of Transformers. In Wav2Vec and its variants [Schneider et al. 2019,
Baevski et al. 2020, Baevski et al. 2019], a contrastive loss enables unsupervised
pretraining, and the pretrained model is later fine-tuned on speech and phoneme recognition
tasks. Speech-XLNet [Song et al. 2019] proposed a speech-based version of
XLNet [Yang et al. 2019], a network that maximizes the expected log-likelihood of a se-
quence of words over all possible autoregressive factorization orders.
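In symbols, the XLNet pretraining objective can be written as (notation following [Yang et al. 2019]):

\[
\max_{\theta} \; \mathbb{E}_{\mathbf{z} \sim \mathcal{Z}_T} \Big[ \sum_{t=1}^{T} \log p_{\theta}\big(x_{z_t} \mid \mathbf{x}_{\mathbf{z}_{<t}}\big) \Big]
\]

where $\mathcal{Z}_T$ is the set of all permutations of the index sequence $[1, \dots, T]$, $z_t$ is the $t$-th element of a permutation $\mathbf{z}$, and $\mathbf{x}_{\mathbf{z}_{<t}}$ denotes the tokens preceding position $t$ under that factorization order.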
3. Methodology
3.1. Datasets
For the task of respiratory insufficiency detection, the data used in the refinement phase
is the same as in [Casanova et al. 2021]. There, utterances of COVID patients were
collected by medical students in COVID wards, from patients with blood oxygenation levels
below 92%, taken as an indication of RI. Control data were collected through voice donations
over the internet, without any access to blood oxygenation measurements; the donors were
therefore assumed healthy. As COVID wards are noisy locations, an extra collection was made
consisting of samples of pure background noise (no voice). This is a crucial step in
preventing the network from overfitting to the background noise differences between the two data collection procedures.
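One common way to use such pure-noise recordings is to mix them into the control audio during training, so both classes share similar acoustic conditions. The sketch below illustrates this idea; it is an assumed illustration at a chosen signal-to-noise ratio, not necessarily the authors' exact procedure:

```python
import numpy as np

def mix_background_noise(voice, noise, snr_db=10.0):
    """Overlay ward background noise onto a control utterance at a target
    signal-to-noise ratio (in dB), equalizing acoustic conditions."""
    noise = np.resize(noise, voice.shape)            # tile/crop noise to length
    voice_power = np.mean(voice ** 2) + 1e-10
    noise_power = np.mean(noise ** 2) + 1e-10
    # Scale noise so that 10*log10(voice_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(voice_power / (noise_power * 10 ** (snr_db / 10)))
    return voice + scale * noise

# Example with synthetic signals:
voice = np.random.randn(16000).astype(np.float32)    # 1 s of audio at 16 kHz
noise = np.random.randn(4000).astype(np.float32)     # shorter noise clip
augmented = mix_background_noise(voice, noise, snr_db=10.0)
```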
The gathered audios contained 3 utterances: