Transformers [Vaswani et al. 2017] were shown to be very effective when training is
divided into two phases [Devlin et al. 2018]. The pretraining phase generates a language-
based acoustic model through unsupervised (or self-supervised) learning, optimizing a
generic language prediction task on a large amount of generic data. Then, the acoustic
model undergoes a task-specific refinement phase in which both the acoustic model and
additional task-specific neural modules are trained on a smaller amount of application data. A
baseline Transformer is one whose pretraining is replaced by a random assignment of weights.
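As an illustration of this two-phase scheme, the sketch below (in PyTorch) pretrains a Transformer encoder on a generic task and then refines it together with a small task head. All names and the reconstruction objective are our own placeholder choices, not the implementation used in this paper:

```python
import torch
import torch.nn as nn

# Dummy stand-ins for real data (batch, time, feature).
pretrain_frames = torch.randn(8, 100, 128)   # large unlabeled generic corpus
task_frames = torch.randn(8, 100, 128)       # small labeled application data
task_labels = torch.randint(0, 2, (8,))      # e.g., RI vs. healthy

layer = nn.TransformerEncoderLayer(d_model=128, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=3)

# Phase 1: unsupervised pretraining on a generic prediction task
# (frame reconstruction with an L1 loss is used here as a stand-in).
recon_head = nn.Linear(128, 128)
opt = torch.optim.Adam(list(encoder.parameters()) + list(recon_head.parameters()))
loss = nn.functional.l1_loss(recon_head(encoder(pretrain_frames)), pretrain_frames)
opt.zero_grad(); loss.backward(); opt.step()

# Phase 2: task-specific refinement; the encoder is not frozen, so the
# acoustic model and the new classification head are trained together.
clf_head = nn.Linear(128, 2)
opt = torch.optim.Adam(list(encoder.parameters()) + list(clf_head.parameters()))
logits = clf_head(encoder(task_frames).mean(dim=1))   # mean-pool over time
loss = nn.functional.cross_entropy(logits, task_labels)
opt.zero_grad(); loss.backward(); opt.step()
```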
Here, we find that MFCC-gram Transformers benefit from being pretrained on
large quantities of spoken Brazilian Portuguese audio and then refined for the target
task of detecting respiratory insufficiency (RI). For pretraining, we explore three known tech-
niques from the literature [Liu et al. 2020b, Liu et al. 2020a] and find that they generally
lead to some performance improvement over baseline Transformers. Performance reaches
96.53% with the best of the available techniques.
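For concreteness, an MFCC-gram is the sequence of MFCC frames extracted from an utterance, which the Transformer consumes as its input tokens. A minimal sketch using librosa follows; the filename and parameter values are illustrative, not the paper's exact configuration:

```python
import librosa
import numpy as np

# Load an utterance and compute its MFCC-gram.
y, sr = librosa.load("utterance.wav", sr=16000)      # mono waveform
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # shape: (n_mfcc, n_frames)
mfccgram = mfcc.T.astype(np.float32)                 # shape: (n_frames, n_mfcc)
# Each row is one acoustic frame, fed to the Transformer as a token.
```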
2. Related Work
In addition to [Casanova et al. 2021], other works [Pinkas et al. 2020,
Laguarta et al. 2020] have studied COVID-19 with deep learning on voice-related data.
[Pinkas et al. 2020] attempt to detect SARS-CoV-2 (the virus that causes COVID-19)
from voice audio data, whereas this work and [Casanova et al. 2021] attempt to detect RI.
Furthermore, previous works support the view of speech as a
biomarker [Botelho et al. 2019, Nevler et al. 2019, Robin et al. 2020].
Transformers were designed for NLP [Vaswani et al. 2017, Devlin et al. 2018],
and were later also applied to audio processing tasks [Liu et al. 2020b, Liu et al. 2020a,
Schneider et al. 2019, Baevski et al. 2020, Baevski et al. 2019, Song et al. 2019]. In
Mockingjay and TERA [Liu et al. 2020b, Liu et al. 2020a], Transformers were used for
phoneme classification and speaker recognition tasks, and it was shown that variants of the
Cloze task [Taylor 1953, Devlin et al. 2018] adapted to audio could be used for unsuper-
vised pretraining of Transformers. In Wav2Vec and its variants [Schneider et al. 2019,
Baevski et al. 2020, Baevski et al. 2019], a contrastive loss enables unsupervised
pretraining, and the pretrained model is later fine-tuned on speech and phoneme recognition
tasks. Speech-XLNet [Song et al. 2019] proposed a speech-based version of
XLNet [Yang et al. 2019], a network that maximizes the expected log-likelihood of a se-
quence of words over all possible autoregressive factorization orders.
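In symbols, the XLNet pretraining objective can be written as (notation following [Yang et al. 2019]):

\[
\max_{\theta} \; \mathbb{E}_{\mathbf{z} \sim \mathcal{Z}_T} \Big[ \sum_{t=1}^{T} \log p_{\theta}\big(x_{z_t} \mid \mathbf{x}_{\mathbf{z}_{<t}}\big) \Big]
\]

where $\mathcal{Z}_T$ is the set of all permutations of the index sequence $[1, \dots, T]$, $z_t$ is the $t$-th element of a permutation $\mathbf{z}$, and $\mathbf{x}_{\mathbf{z}_{<t}}$ denotes the tokens preceding position $t$ under that factorization order.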
3. Methodology
3.1. Datasets
For the task of respiratory insufficiency detection, the data used in the refinement phase
is the same as in [Casanova et al. 2021]. There, utterances of COVID patients were
collected by medical students in COVID wards, from patients with blood oxygenation levels
below 92%, taken as an indication of RI. Control data were collected through voice donations
over the internet, without any access to blood oxygenation measurements; the donors were
therefore assumed healthy. As COVID wards are noisy locations, an extra collection was made
consisting of samples of pure background noise (no voice). This is a crucial step in
preventing the network from overfitting to the background noise differences between the two data collection procedures.
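One common way to use such pure-noise recordings is to mix them into the control audio during training, so both classes share similar acoustic conditions. The sketch below illustrates this idea; it is an assumed illustration at a chosen signal-to-noise ratio, not necessarily the authors' exact procedure:

```python
import numpy as np

def mix_background_noise(voice, noise, snr_db=10.0):
    """Overlay ward background noise onto a control utterance at a target
    signal-to-noise ratio (in dB), equalizing acoustic conditions."""
    noise = np.resize(noise, voice.shape)            # tile/crop noise to length
    voice_power = np.mean(voice ** 2) + 1e-10
    noise_power = np.mean(noise ** 2) + 1e-10
    # Scale noise so that 10*log10(voice_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(voice_power / (noise_power * 10 ** (snr_db / 10)))
    return voice + scale * noise

# Example with synthetic signals:
voice = np.random.randn(16000).astype(np.float32)    # 1 s of audio at 16 kHz
noise = np.random.randn(4000).astype(np.float32)     # shorter noise clip
augmented = mix_background_noise(voice, noise, snr_db=10.0)
```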
The gathered audios contained 3 utterances: