there is an abundance of data, with the goal of using the acquired knowledge to obtain
better performance on a related problem with limited data available. Transfer learning
has been effectively used in many fields of deep learning, such as computer vision [5] and
language modelling [6]. Data augmentation increases the amount of available data by
creating slightly modified copies of existing samples. This can be done, for example,
by masking parts of the input or by adding Gaussian noise to it.
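As an illustration of the two augmentations just mentioned, the following is a minimal sketch that operates on a raw waveform stored as a NumPy array. The function names, the 20 dB signal-to-noise ratio, the mask length, and the synthetic sine wave are all assumptions made for this example; they are not settings used in this work.

```python
import numpy as np


def add_gaussian_noise(wave, snr_db, rng):
    """Return a copy of `wave` with white Gaussian noise at roughly `snr_db` dB SNR."""
    signal_power = np.mean(wave ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=wave.shape)
    return wave + noise


def mask_random_span(wave, max_len, rng):
    """Return a copy of `wave` with one random contiguous span set to zero."""
    out = wave.copy()
    span = int(rng.integers(0, max_len + 1))           # span length in samples
    start = int(rng.integers(0, len(out) - span + 1))  # span start index
    out[start:start + span] = 0.0
    return out


# Hypothetical input: one second of a 220 Hz tone sampled at 16 kHz.
rng = np.random.default_rng(0)
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
wave = 0.5 * np.sin(2 * np.pi * 220.0 * t)

noisy = add_gaussian_noise(wave, snr_db=20.0, rng=rng)
masked = mask_random_span(wave, max_len=1600, rng=rng)
```

Each call returns a new array and leaves the original untouched, so several augmented copies can be drawn from the same training sample.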
In this paper, we use transfer learning and data augmentation techniques to study
SER in Brazilian Portuguese speech. We participate in the SER shared task, a challenge
on Brazilian Portuguese speech emotion recognition. The challenge made available a
labeled dataset of 625 audio files as the training set and a further 308 files as the
test set. Both sets consist of short snippets of Brazilian Portuguese speech, usually
less than 15 s long, labeled neutral, non-neutral female, or non-neutral male
(non-neutral denotes audio containing laughs, cries, etc.).
For transfer learning, we employ Pretrained Audio Neural Networks (PANNs) [7], which
are convolutional neural networks trained on a large audio dataset (AudioSet [8])
consisting of 1.9 million audio clips distributed across 527 sound classes. By using the
pretrained models made available by the developers and finetuning them on the SER dataset
for Brazilian Portuguese speech, we are able to beat the proposed baselines of prosodic
features and wav2vec features. We achieve (via CNN10) an F1-score of 0.73, up from 0.54
for the baselines. During finetuning, we employ a data augmentation technique called
SpecAugment [9].
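To make the masking idea behind SpecAugment concrete, here is a minimal sketch of frequency and time masking applied to a spectrogram of shape (frequency bins, time frames). It is a simplified illustration, not the implementation used in this work: the time-warping step of SpecAugment is omitted, and the function name, mask counts, mask widths, and the random stand-in spectrogram are assumptions chosen for the example.

```python
import numpy as np


def spec_augment(spec, num_freq_masks, max_freq_width,
                 num_time_masks, max_time_width, rng):
    """Zero out random frequency bands and time spans of a (freq, time) array."""
    out = spec.copy()
    n_freq, n_time = out.shape

    # Frequency masking: blank out horizontal bands of random width.
    for _ in range(num_freq_masks):
        width = int(rng.integers(0, max_freq_width + 1))
        start = int(rng.integers(0, n_freq - width + 1))
        out[start:start + width, :] = 0.0

    # Time masking: blank out vertical spans of random width.
    for _ in range(num_time_masks):
        width = int(rng.integers(0, max_time_width + 1))
        start = int(rng.integers(0, n_time - width + 1))
        out[:, start:start + width] = 0.0

    return out


# Hypothetical stand-in for a log-mel spectrogram: 64 mel bins, 300 frames.
rng = np.random.default_rng(0)
log_mel = rng.normal(size=(64, 300))
augmented = spec_augment(log_mel, num_freq_masks=2, max_freq_width=8,
                         num_time_masks=2, max_time_width=32, rng=rng)
```

Because fresh masks are drawn on every call, applying the function inside the training loop shows the network a different corrupted view of each clip at every epoch.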
We also tested Transformer neural networks pretrained on a large amount
of Brazilian Portuguese audio data [10]. However, we find that, with the current amount
of available data for SER, Transformers do not generalize their training performance to
the validation and test sets. This holds even when using the most common techniques to
prevent overfitting. The same behaviour was also observed for more complex PANNs,
such as CNN14.
2. Related Work
There is a large literature on SER in English [11, 12, 13, 14, 15, 16, 17, 18]. Moreover,
there are many small datasets for SER in English, such as RAVDESS [2], SAVEE [3] and
IEMOCAP [4]. To the best of our knowledge, the SER dataset for Brazilian Portuguese
speech is the only dataset available for the language. In addition, English datasets
usually use a different set of labels. RAVDESS [2], for example, has the classes
calm, happy, angry, sad, fearful, surprise and disgust. This contrasts with the classes
neutral, non-neutral female and non-neutral male present in the SER dataset for
Brazilian Portuguese speech. As such, our results cannot be directly compared with the
state of the art for English. Nevertheless, the authors of [18], the most
recent work, obtain an average recall of 84.3% on RAVDESS using wav2vec 2.0 [19].
On IEMOCAP, they obtain an average recall of 67.2%, also using wav2vec 2.0.
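For readers unfamiliar with the metric, the sketch below shows one common way to compute an average recall: the recall of each class is computed separately and the per-class values are averaged with equal weight. That this is the exact definition used in [18] is an assumption; the function name and the toy labels are hypothetical.

```python
import numpy as np


def average_recall(y_true, y_pred, num_classes):
    """Mean of per-class recalls, giving every class equal weight."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    recalls = []
    for c in range(num_classes):
        in_class = y_true == c
        if in_class.sum() == 0:
            continue  # skip classes absent from the reference labels
        recalls.append(np.mean(y_pred[in_class] == c))
    return float(np.mean(recalls))


# Toy example with three classes (labels are made up for illustration).
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 2, 2, 2]
uar = average_recall(y_true, y_pred, num_classes=3)
# Per-class recalls are 3/4, 1/2 and 2/2, so uar is 0.75.
```

Unlike plain accuracy, this average is not dominated by the most frequent class, which matters when class sizes are unbalanced.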
Transfer learning is a very common technique in situations where the dataset available