
EXTRACTING SPEAKER AND EMOTION INFORMATION FROM SELF-SUPERVISED
SPEECH MODELS VIA CHANNEL-WISE CORRELATIONS
Themos Stafylakis1†, Ladislav Mošner2†, Sofoklis Kakouros3†, Oldřich Plchot2, Lukáš Burget2, Jan Černocký2
1Omilia - Conversational Intelligence, Athens, Greece
2Brno University of Technology, Faculty of Information Technology, Speech@FIT, Czechia
3University of Helsinki, Finland
†Equal contribution
ABSTRACT
Self-supervised learning of speech representations from large amounts of unlabeled data has enabled state-of-the-art results in several speech processing tasks. Aggregating these speech representations across time is typically approached using descriptive statistics, in particular the first- and second-order statistics of the representation coefficients. In this paper, we examine an alternative way of extracting speaker and emotion information from self-supervised models, based on the correlations between the coefficients of the representations — correlation pooling. We show improvements over mean pooling and further gains when the pooling methods are combined via fusion. The code is available at github.com/Lamomal/s3prl_correlation.
Index Terms—Speaker identification, speaker verification, emotion recognition, self-supervised models
1. INTRODUCTION
Large speech models trained in a self-supervised manner, such as Wav2Vec 2.0, HuBERT, and WavLM, have shown exceptional performance when finetuned on downstream tasks [1, 2, 3]. However, finetuning the weights of such models separately for each task is not a scalable solution for production systems that perform several of these downstream tasks in real time. For such systems, the preferable solution would be to extract a single set of speech features from a shared model, followed by a task-specific lightweight classifier.
To this end, the Speech processing Universal PERformance Benchmark (SUPERB) challenge [4] was recently introduced, with the goal of benchmarking the performance of such speech models on a variety of speech tasks (e.g. ASR, keyword spotting, spoken language understanding, emotion recognition, speaker recognition, identification and diarization, among others), while keeping the models' weights frozen and employing a lightweight task-specific classifier [4].
Among the interesting findings of SUPERB is the ability of such models to encode speaker and emotion information in their intermediate layers. The models are trained using an implicitly phonetic loss (typically a masked-language-model-style loss over quantized vector representations), which means that speaker and emotion modeling is not directly encouraged. However, the models must be performing some kind of speaker and emotion modeling in the intermediate layers in order to suppress this nuisance variability in the output layer.
For those tasks requiring utterance-level classification or representation learning, the SUPERB benchmark employs a lightweight trainable classifier that incorporates pooling. The pooling methods used are (a) mean pooling, and (b) statistics pooling (concatenated mean and standard deviation, std, vectors) in speaker verification (SV), the latter being the standard pooling method for x-vectors [5].
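To make these baselines concrete, the sketch below shows mean and statistics pooling over the frozen model's frame-level features, assuming PyTorch tensors of shape (batch, time, channels); it is a minimal illustration of the operations, not the exact SUPERB downstream classifier code.

```python
import torch

def mean_pooling(h):
    """Mean pooling: average frame-level features over the time axis.

    h: tensor of shape (batch, time, channels) holding the frozen model's
    frame-level representations. Returns a (batch, channels) vector.
    """
    return h.mean(dim=1)

def statistics_pooling(h, eps=1e-8):
    """Statistics pooling: concatenate the per-channel mean and standard
    deviation over time, as used for x-vectors.
    Returns a (batch, 2 * channels) vector.
    """
    mean = h.mean(dim=1)
    std = h.std(dim=1).clamp(min=eps)  # guard against zero variance
    return torch.cat([mean, std], dim=-1)
```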
Although mean pooling typically yields good performance in several speaker and emotion modeling tasks [6], we should emphasize a crucial difference between these models and the SUPERB setup [4]: the models are frozen, after being pretrained with a loss function that does not directly encourage speaker or emotion modeling. As a result, methods using mean pooling make simplifying assumptions about the way this information is encoded in the network's internal representations.
Mean pooling, including the statistics pooling variant, assumes that the correlations between different feature dimensions (or channels) are of little or no importance. This might be true if the model is trained or finetuned in this way, i.e. with a classifier that extracts fixed-length representations via mean pooling (e.g. a softmax classifier with cross-entropy loss); after all, deep neural networks have the capacity to extract task-relevant information in a useful form. However, in the case of pretrained frozen models, such an assumption is at least questionable, and methods that take into account the correlations between feature channels should be considered.
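As an illustration of what such a correlation-based pooling can look like, the following sketch computes the channel-by-channel correlation matrix over time and vectorizes its upper triangle into an utterance-level representation. This is only an assumed minimal form; the layer weighting, normalization, and fusion details of the method evaluated here follow the released code (github.com/Lamomal/s3prl_correlation) rather than this snippet.

```python
import torch

def correlation_pooling(h, eps=1e-8):
    """Channel-wise correlation pooling (illustrative sketch).

    h: tensor of shape (batch, time, channels).
    Computes the channels-by-channels correlation matrix over the time
    axis and returns its upper triangle (excluding the diagonal) as a
    (batch, channels * (channels - 1) / 2) utterance-level vector.
    """
    h = h - h.mean(dim=1, keepdim=True)          # center each channel over time
    h = h / (h.std(dim=1, keepdim=True) + eps)   # scale each channel to unit variance
    t = h.shape[1]
    corr = torch.einsum('btc,btd->bcd', h, h) / (t - 1)  # (batch, channels, channels)
    iu = torch.triu_indices(corr.shape[1], corr.shape[2], offset=1)
    return corr[:, iu[0], iu[1]]                 # vectorized upper triangle
```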
In this paper, we focus on three SUPERB tasks that require sentence-level representations, namely speaker verification and identification, and emotion recognition. We show that a significant portion of the information related to speaker and emotion is encoded in the channel-wise correlations of the intermediate layers. The idea is based on the correlation