speakers separately [12]. Results demonstrate the potential of
using gender-dependent vowel-based features for depression
detection, which outperformed a range of turn-level acoustic
features that were extracted without considering vowel-based
information. Finally, Stasak et al. examined linguistic stress
differences between non-depressed and clinically depressed
individuals and found statistically significant differences in
vowel articulatory parameters, with shorter vowel durations and
less variance for 'low', 'back', and 'rounded' vowel positions
for individuals with depression [13]. Evidence from these
studies suggests that the effects of psychomotor retardation
associated with depression are linked to tangible differences
in vowel production, measures of which can be particularly
informative for detecting depression from speech.
Leveraging this evidence, we propose a vowel-dependent
deep learning approach for classifying depression from speech.
We extract low-level vowel-dependent feature embeddings via
a convolutional neural network (CNN) trained for the task of
vowel classification, thus capturing energy variations of the
speech spectrogram within the corresponding vowel region.
The learned vowel-based embeddings form the input to a long
short-term memory (LSTM) network that captures high-level
temporal dependencies across speech segments and outputs the
depression outcome. We compare the proposed approach against
prior deep learning models that do not leverage vowel-dependent
information and find that it outperforms the considered
baselines. We further conduct an empirical analysis
using a modified version of Local Interpretable Model-agnostic
Explanations (LIME) to identify the parts of the speech
spectrogram that contribute most to the decision regarding the
depression condition. Results from this work could potentially
contribute to reliable and interpretable speech-based ML models
of depression that consider low-level, fine-grained
spectrotemporal information along with high-level transitions
within and across an utterance, and could eventually be used
to assist MH clinicians in decision-making.
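As a concrete illustration, the sketch below outlines the two-stage design in PyTorch; the layer sizes, kernel widths, embedding dimension, and number of vowel classes are assumptions chosen for readability, not the configuration used in our experiments.

```python
# Minimal PyTorch sketch of the two-stage design described above; all
# hyperparameters (layer sizes, kernel widths, embedding dimension, number
# of vowel classes) are illustrative assumptions.
import torch
import torch.nn as nn


class VowelEmbeddingCNN(nn.Module):
    """2-D CNN trained for vowel classification; its penultimate layer acts
    as the low-level vowel-dependent embedding of a spectrogram segment."""

    def __init__(self, n_vowels: int = 10, emb_dim: int = 128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.embed = nn.Linear(32 * 4 * 4, emb_dim)
        self.vowel_head = nn.Linear(emb_dim, n_vowels)   # used for vowel pre-training

    def forward(self, spec_segment):                     # (batch, 1, freq, time)
        z = self.features(spec_segment).flatten(1)
        emb = torch.relu(self.embed(z))
        return emb, self.vowel_head(emb)


class DepressionLSTM(nn.Module):
    """LSTM over the sequence of vowel embeddings of an utterance,
    producing a binary depression decision."""

    def __init__(self, emb_dim: int = 128, hidden: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, emb_seq):                          # (batch, n_segments, emb_dim)
        _, (h, _) = self.lstm(emb_seq)
        return torch.sigmoid(self.out(h[-1]))            # probability of depression
```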
II. PREVIOUS WORK
Early work in depression estimation from speech has explored
acoustic biomarkers of depression that quantify characteristics
of the vocal source, vocal tract, formants, and prosody.
Psychomotor retardation associated with depression can slow
the movement of the articulatory muscles, thereby causing
decreases in speech rate and phoneme rate, as well as decreases
in pitch and energy variability [4], [14]. Other work has
quantified changes in the coordination of vocal tract motion
by measuring correlation coefficients over different temporal
scales across formant frequencies and the first-order derivatives
of the Mel-frequency cepstral coefficients (MFCCs) [15].
Formant information has been considered by measuring the
distance between F1 and F2 (the first and second formant)
coordinates in certain vowels (i.e., /i/, /u/, /a/), and showing
that this distance is reduced for patients with depression
compared to healthy individuals [11]. Finally, other work has
modeled the acoustic variability of MFCCs at the utterance
level, indicating that patients with depression exhibit reduced
variability with respect to this spectral measure [16]. The above
measures, along with commonly used acoustic features of
prosody, energy, intonation, and spectral information, have
been employed as input to various ML models for the
task of depression recognition [17], [18].
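For illustration, two of these measures can be approximated with standard tooling, as sketched below; the frame settings, number of coefficients, and file name are assumptions and do not reproduce the setups of the cited studies.

```python
# Approximation of two measures discussed above: utterance-level MFCC
# variability [16] and cross-channel correlations of first-order MFCC
# derivatives [15]. Frame settings, number of coefficients, and the file
# name are illustrative assumptions.
import librosa
import numpy as np

y, sr = librosa.load("utterance.wav", sr=16000)       # hypothetical recording
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)    # (13, n_frames)
mfcc_delta = librosa.feature.delta(mfcc)              # first-order derivatives

# Utterance-level variability: standard deviation of each coefficient over time.
mfcc_variability = mfcc.std(axis=1)                   # (13,)

# Coordination-style measure: correlations across the delta-MFCC channels.
delta_corr = np.corrcoef(mfcc_delta)                  # (13, 13)
```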
Other work has leveraged recent advances in deep learning
to reliably estimate depression from speech. Ma et al.
proposed a deep model, called “DepAudioNet,” that encodes
the Mel-scale filter bank features of speech via a 1-dimensional
(1-D) CNN followed by an LSTM [19]. The convolutional
transformations implemented in the 1-D CNN encode low-level,
short-term spectrotemporal speech variations, its max-pooling
layers capture mid-level variations, and the LSTM
extracts long-term information within an utterance. Muzammel
et al. incorporated speech spectrograms at the phoneme level
as the input to 2-dimensional CNNs [20]. Vowels and consonants
were separated from the speech signals and served as the
input to a 2-D CNN that performed depression classification.
Results demonstrate that the fusion of consonant-based and
vowel-based embeddings yields the best performance. Saidi
et al. used the speech spectrogram as the input to a 2-D
CNN that was trained for depression classification [21]. The
learned embeddings of the CNN were flattened and formed
the input to a support vector machine (SVM). The CNN-SVM
combination yielded an absolute increase of 10% in depression
classification accuracy compared to the CNN alone.
Srimadhur & Lalitha compared a 2-D CNN whose input was
the speech spectrogram with an end-to-end 1-D CNN whose
input was the raw speech signal, with the latter yielding
improved results [22]. Zhao et al. proposed the hierarchical
attention transfer network (HATN), a two-level hierarchical
network with an attention mechanism that models speech
variations at the frame and sentence level [23]. This model
learns attention weights for a speech recognition task (source
task), implemented via a deep teacher network, and transfers
those weights to the depression severity estimation task (target
task), implemented via a shallow student network.
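A simplified sketch of the 1-D CNN followed by an LSTM, the pattern underlying DepAudioNet-style models, is given below; the filter counts, kernel sizes, and pooling factors are assumptions for illustration and do not reproduce the published configuration.

```python
# Simplified sketch of the 1-D CNN + LSTM pattern underlying DepAudioNet-style
# models [19]; filter counts, kernel sizes, and pooling factors are assumed
# for illustration only.
import torch
import torch.nn as nn


class CnnLstmBaseline(nn.Module):
    def __init__(self, n_mels: int = 40, hidden: int = 128):
        super().__init__()
        # 1-D convolution over time with Mel bands as input channels:
        # encodes low-level, short-term spectrotemporal variation.
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool1d(3),            # mid-level pooling over time
        )
        # The LSTM aggregates long-term information within the utterance.
        self.lstm = nn.LSTM(64, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, mel):             # (batch, n_mels, n_frames)
        z = self.conv(mel)              # (batch, 64, n_frames // 3)
        z = z.transpose(1, 2)           # (batch, time steps, channels)
        _, (h, _) = self.lstm(z)
        return torch.sigmoid(self.out(h[-1]))
```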
The contributions of this work compared to previous
research are: (1) In contrast to the majority of prior work
in deep learning, this paper proposes a CNN-based architecture
that learns vowel-based embeddings, thereby explicitly
incorporating low-level spectrotemporal information that is
valuable for identifying depression [10]–[13], [24]. While
phoneme-based information has been modeled before [20],
it was only incorporated during the extraction of spectrogram
features, without further deriving high-level representations;
(2) High-level information is extracted from the vowel-based
embeddings via the LSTM that models the evolution of this
information within and across utterances, a technique that is
fairly unexplored in depression detection and has not been
combined with vowel-based information [19]; and (3) The
utility of the learned embeddings is qualitatively explored in
terms of their interpretability with respect to the depression outcome.
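As an illustration of the interpretability analysis in contribution (3), the sketch below applies the standard LIME image explainer to a spectrogram; the analysis in this work relies on a modified LIME variant, and the prediction function and spectrogram here are placeholders for the trained model and a real input.

```python
# Sketch of applying the standard LIME image explainer to a spectrogram; this
# conveys the general idea only, since the analysis in this work uses a
# modified LIME variant. The prediction function and spectrogram below are
# placeholders for the trained CNN-LSTM pipeline and a real input.
import numpy as np
from lime import lime_image

rng = np.random.default_rng(0)
spectrogram = rng.random((128, 256))            # placeholder log-Mel spectrogram


def predict_proba(images: np.ndarray) -> np.ndarray:
    """Stand-in for the trained model: maps (n, H, W, 3) perturbed images to
    class probabilities for [non-depressed, depressed]."""
    p = rng.random(len(images))                 # replace with the model's forward pass
    return np.stack([1.0 - p, p], axis=1)


explainer = lime_image.LimeImageExplainer()
explanation = explainer.explain_instance(
    spectrogram,                                # 2-D (freq x time); LIME converts to RGB
    predict_proba,
    top_labels=1,
    num_samples=1000,
)
label = explanation.top_labels[0]
_, mask = explanation.get_image_and_mask(
    label, positive_only=True, num_features=5, hide_rest=False
)
# `mask` highlights the spectrogram regions contributing most to the decision.
```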