2. RELATED WORK
Large-scale speech models trained with SSL have become integral to speech processing tasks. Most can be divided into generative and discriminative approaches. Generative approaches predict future frames from past frames, or masked frames from unmasked frames, by minimizing a reconstruction loss [16, 22, 23]. Discriminative approaches, on the other hand, distinguish positive samples from negative samples by minimizing a contrastive prediction loss [24, 17, 18]. These self-supervised
models are generally trained on Librispeech [25], a corpus
based on public domain audio books primarily used for ASR
research. Although self-supervised objectives are general, the
design of popular SSL models has been primarily driven by
the goal of improving automatic transcription.
Unlike traditional speech modeling approaches, which have been researched extensively, these SSL models have only recently begun to be explored, with wav2vec 2.0 (W2V2) attracting the most attention for its wide application potential. For example, Pasad et al. [26] conducted a layer-wise analysis of W2V2 using a suite of tools and found that
1) acoustic and linguistic properties are encoded in different
layers; 2) the pre-trained model follows an autoencoder-style
behavior; 3) the model encodes some non-trivial word mean-
ing information. Fan et al. [20] showed that W2V2 can discriminate between speakers as well as languages, and that this distinction is more pronounced in lower layers. They therefore proposed multi-task learning of speaker verification and language identification and verified its feasibility. Li et al. [27] observed that the recognition of longer emotional utterances, which contain more contextual information, benefits from the contextual modeling of W2V2, and proposed a joint training scheme that hierarchically fuses multiple W2V2 outputs for SER. Yang et al. [28] established benchmark performance of self-supervised speech models on a range of tasks.
Nevertheless, these self-supervised speech models remain understudied, and the above-mentioned works have limitations. For example, [26] did not extend their exploration to non-ASR downstream tasks. In [20] and [27], only part of the layer-wise differences was shown, so a thorough layer-wise analysis is missing. In [28], downstream task performance was presented without further explanation. Furthermore, none
of those studies investigated paralinguistic characteristics in
W2V2 representations. As such, in the current work, we
build on previous work while adding new perspectives from
detailed quantitative analysis on emotional corpora.
3. CORPORA AND MODEL DESCRIPTION
3.1. Corpora
IEMOCAP (IEM) [29] consists of five dyadic sessions with ten actors (five male and five female), each containing scripted and improvised multimodal interactions. The corpus contains approximately 12 hours of speech annotated by three annotators with ten emotion classes. Following prior research [27], we combined Happy and Excited and removed utterances without transcripts, bringing the total number of utterances used in this study to 5,500, each labeled with one of four classes: Angry, Happy, Neutral, and Sad.
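As a concrete illustration, a minimal sketch of this label preparation is given below; the categorical label abbreviations and the (utt_id, label, transcript) structure are assumptions made for illustration, and parsing of the corpus files is omitted.

# Minimal sketch of the IEMOCAP label preparation described above.
# The label abbreviations and the (utt_id, label, transcript) tuples are
# illustrative assumptions; reading the corpus annotation files is omitted.

LABEL_MAP = {
    "ang": "Angry",
    "hap": "Happy",
    "exc": "Happy",   # Excited is merged with Happy
    "neu": "Neutral",
    "sad": "Sad",
}

def filter_iemocap(utterances):
    """Keep utterances that have a transcript and map to one of the four classes."""
    kept = []
    for utt_id, label, transcript in utterances:
        if not transcript:            # drop utterances without transcripts
            continue
        mapped = LABEL_MAP.get(label)
        if mapped is None:            # drop all other emotion classes
            continue
        kept.append((utt_id, mapped, transcript))
    return kept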
RAVDESS (RAV) [30] contains a speech set and a song set.
We only use the speech set, which has 1,440 utterances from
24 actors (12 female, 12 male) in eight emotions: Calm,
Happy, Sad, Angry, Fear, Surprise, Disgust, and Neutral.
Ratings were provided by untrained individuals. During data collection, the actors spoke two fixed sentences in each emotion class, so the corpus is well balanced across emotions. Each actor was given two trials per utterance and produced 60 speech clips in total. The main reason we choose RAV is that, even though other corpora may be larger, it provides fixed sentences spoken with different emotional expressions. Such a setting excludes lexical influence by “forcing” different emotions to share the same linguistic content, helping us better explore the acoustic properties of W2V2 representations by eliminating effects introduced by lexical content (e.g., word pronunciation causing prosody variation).
3.2. Model
We look at W2V2 [17], an SSL framework comprising three major components: a CNN-based local encoder that extracts a sequence of embeddings from raw audio as latent speech representation Z, a Transformer network for obtaining context representation C, and a quantization module for discretizing Z into Q. Following previous work [26], we focus our attention on the latent representations learned by the Transformer module of W2V2. In this work, we use the wav2vec2-base, wav2vec2-base-100h, and wav2vec2-base-960h models, which are the pre-trained model and the models fine-tuned on 100 and 960 hours of Librispeech, respectively. We refer to them as PT, FT100, and FT960. We choose W2V2 because it is the most widely used SSL speech model, with the expectation that the exploratory approach can be generalized to similar SSL models.
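For reference, layer-wise Transformer outputs can be obtained, for example, with the HuggingFace transformers implementation of W2V2. This is a minimal sketch, assuming the hub checkpoint names below correspond to PT, FT100, and FT960; the mean pooling over time is an illustrative choice, not a claim about our probing setup.

# Sketch of extracting layer-wise Transformer outputs from W2V2 using the
# HuggingFace transformers library; checkpoint names mirror PT / FT100 / FT960.
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

CHECKPOINTS = {
    "PT": "facebook/wav2vec2-base",
    "FT100": "facebook/wav2vec2-base-100h",
    "FT960": "facebook/wav2vec2-base-960h",
}

extractor = Wav2Vec2FeatureExtractor.from_pretrained(CHECKPOINTS["PT"])
model = Wav2Vec2Model.from_pretrained(CHECKPOINTS["PT"])
model.eval()

def layerwise_embeddings(waveform, sample_rate=16000):
    """Return one utterance-level vector per hidden state (mean-pooled over time)."""
    inputs = extractor(waveform, sampling_rate=sample_rate, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    # outputs.hidden_states is a tuple: the projected CNN features followed by
    # the output of each of the 12 Transformer layers of the base model.
    return [h.mean(dim=1).squeeze(0) for h in outputs.hidden_states]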
4. EXPERIMENTS AND RESULTS
We perform a set of probing experiments, including the fol-
lowing quantitative measures:
Probing SER performance. We first implement a layer-wise
analysis by using the output of every individual layer within
the Transformer network to demonstrate how information en-
coded by W2V2 contributes to SER. Next, as there is no common practice for utilizing W2V2 representations as input features for downstream tasks, we compare the performance of three commonly used approaches: 1) taking the last layer output [31, 32, 33]; 2) taking the average of all layer outputs [34];