Pretrained audio neural networks for Speech
emotion recognition in Portuguese
Marcelo Matheus Gauy1, Marcelo Finger1
1Universidade de São Paulo, Rua do Matão 1010, São Paulo, Brazil
Abstract
The goal of speech emotion recognition (SER) is to identify the emotional aspects of speech.
The SER challenge for Brazilian Portuguese speech was proposed with short snippets of
Portuguese which are classified as neutral, non-neutral female and non-neutral male according
to paralinguistic elements (laughing, crying, etc). This dataset contains about 50 minutes of
Brazilian Portuguese speech. As the dataset is on the small side, we investigate whether
a combination of transfer learning and data augmentation techniques can produce positive
results. Thus, by combining a data augmentation technique called SpecAugment with the
use of Pretrained Audio Neural Networks (PANNs) for transfer learning, we are able to obtain
interesting results. The PANNs (CNN6, CNN10 and CNN14) are pretrained on a large dataset
called AudioSet containing more than 5000 hours of audio. They were finetuned on the SER
dataset and the best performing model (CNN10) on the validation set was submitted to the
challenge, achieving an F1-score of 0.73, up from 0.54 for the baselines provided by the
challenge. Moreover, we also tested the use of a Transformer neural architecture, pretrained on
about 600 hours of Brazilian Portuguese audio data. Transformers, as well as the more complex
PANN model (CNN14), fail to generalize to the test set of the SER dataset and do not
beat the baseline. Given the limited dataset sizes, currently the best approach
for SER is using PANNs (specifically, CNN6 and CNN10).
Keywords
Speech emotion recognition, Pretrained audio neural networks, Transfer learning, Transformers
1. Introduction
Speech emotion recognition (SER) aims at identifying the emotional aspects of speech
independently from the actual semantic content. SER can be used to identify the
emotions of humans, e.g., when using mobile phones, an ability that may become crucial
in improving human-machine interactions in the future [1]. Several efforts to acquire
speech data classified with different emotional labels have been undertaken [2, 3, 4].
These datasets are typically small in size, even for languages such as English. In order to
tackle these datasets, the use of transfer learning and data augmentation techniques may
be instrumental.
Transfer learning is the method of training a network on a particular problem where
there is an abundance of data, with the goal of using the acquired knowledge to obtain
better performance on a related problem with limited data available. Transfer learning
has been effectively used in many fields of deep learning such as computer vision [5] and
language modelling [6]. Data augmentation is the method of increasing the amount of
data available by slightly modifying copies of the data. This can be done, for example,
by masking parts of the input or by adding Gaussian noise to it.
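To make the second idea concrete, the following is a minimal sketch (not the code used in this paper) of a waveform-level augmentation that adds Gaussian noise; the noise level noise_std is an illustrative value, not a tuned parameter.

    import torch

    def add_gaussian_noise(waveform: torch.Tensor, noise_std: float = 0.005) -> torch.Tensor:
        # waveform: tensor of shape (channels, samples), values roughly in [-1, 1].
        # noise_std: standard deviation of the added noise (illustrative value).
        return waveform + torch.randn_like(waveform) * noise_std

    # Usage: create an augmented copy of a (fake) 1-second clip at 16 kHz.
    clip = torch.rand(1, 16000) * 2 - 1
    augmented = add_gaussian_noise(clip)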
In this paper, we use transfer learning and data augmentation techniques to study
SER in Brazilian Portuguese speech. We participate in the SER challenge, a shared task
for Brazilian Portuguese speech emotion recognition. This challenge made
available a labeled dataset of 625 audio files as the training set for SER. Moreover, a dataset
of 308 files was made available as the test set. The training and test datasets consisted of
short snippets of Brazilian Portuguese speech, usually less than 15 s long, labeled neutral,
non-neutral female and non-neutral male (non-neutral for audios containing laughs, cries,
etc).
For transfer learning, we employ Pretrained Audio Neural Networks (PANNs) [7], which
are convolutional neural networks trained on a large dataset of audios (AudioSet [8]),
consisting of 1.9 million audio clips distributed across 527 sound classes. By using the
pretrained models made available by the developers, and finetuning on the SER dataset
for Brazilian Portuguese speech, we are able to beat the proposed baselines of prosodic
features and wav2vec features. We achieve (via CNN10) an F1-score of 0.73, up from 0.54
for the baselines. During finetuning, we employ a data augmentation technique called
SpecAugment [9].
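For concreteness, below is a minimal sketch of SpecAugment-style masking using torchaudio's built-in transforms; the mask sizes are illustrative and are not the values used in our experiments.

    import torch
    import torchaudio

    # Log-mel front end followed by SpecAugment-style masking.
    mel = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=64)
    freq_mask = torchaudio.transforms.FrequencyMasking(freq_mask_param=8)  # mask up to 8 mel bins
    time_mask = torchaudio.transforms.TimeMasking(time_mask_param=25)      # mask up to 25 frames

    waveform = torch.rand(1, 16000) * 2 - 1      # placeholder 1-second clip at 16 kHz
    spec = mel(waveform)                         # shape: (channel, n_mels, frames)
    spec = time_mask(freq_mask(spec))            # apply masking during training only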
We also tested the use of Transformer neural networks, pretrained on a large amount
of Brazilian Portuguese audio data [10]. However, we find that, with the current amount
of available data for SER, Transformers do not generalize their training performance to
the validation and test sets. This holds even when using the most common techniques to
prevent overfitting. The same behaviour was also observed for more complex PANNs,
such as CNN14.
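As an illustration, a typical recipe of such techniques (a sketch with illustrative values, not our exact configuration) combines dropout, weight decay and early stopping:

    import torch
    import torch.nn as nn

    # Dropout in the classification head and weight decay in the optimizer.
    head = nn.Sequential(nn.Dropout(p=0.5), nn.Linear(512, 3))  # 3 SER classes
    optimizer = torch.optim.AdamW(head.parameters(), lr=1e-4, weight_decay=1e-2)

    def early_stop(val_losses, patience=5):
        # Stop when the best validation loss is older than `patience` epochs.
        best = min(range(len(val_losses)), key=val_losses.__getitem__)
        return len(val_losses) - 1 - best >= patience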
2. Related Work
There is a large literature on SER in English [11, 12, 13, 14, 15, 16, 17, 18]. Moreover,
there are many small datasets for SER in English, such as RAVDESS [2], SAVEE [3] and
IEMOCAP [4]. To the best of our knowledge, the SER dataset for Brazilian Portuguese
speech is the only available dataset for the language. In addition, English datasets are
usually annotated with a different set of labels. RAVDESS [2], for example, has the classes
of calm, happy, angry, sad, fearful, surprise and disgust. This contrasts with the classes
of neutral, non-neutral female and non-neutral male present in the SER dataset for
Brazilian Portuguese speech. As such, comparisons of our work with the state of the art
in the English language are not really possible. Nevertheless, the authors of [18], the most
recent work, obtain an average recall on RAVDESS of 84.3 percent using wav2vec 2.0 [19].
On IEMOCAP, they obtain an average recall of 67.2 percent, also using wav2vec 2.0.
Transfer learning is a very common technique in situations where the dataset available
is small in size. It has been effectively employed in computer vision [5, 20], language
modelling [6, 21] and audio tasks [7, 22, 18]. In the original PANN paper [7], the authors
propose several convolutional neural networks pretrained on AudioSet which can be
finetuned on other smaller datasets. In [18], the authors use wav2vec 2.0 pretrained
on Librispeech and finetuned on either RAVDESS or IEMOCAP for speech emotion
recognition. Finally, in [22], the authors provide a comprehensive review of transfer
learning methods used for speech and language processing tasks.
3. Methodology
The code for this paper can be found on GitHub. Below we describe the dataset and
architectures used.
3.1. SER Dataset
To perform SER on Brazilian Portuguese speech, we use the training dataset (CORAA
SER version 1.0) provided for the challenge. This dataset was built from the C-ORAL-
BRASIL I corpus [23], with 625 audio files, typically less than 15 s long, containing
informal spontaneous Brazilian Portuguese speech. These audio files are labeled neutral,
non-neutral female or non-neutral male. An audio is labeled non-neutral male if the
speaker is male and the speech contains paralinguistic elements (such as laughing or
crying). Similarly, an audio is labeled non-neutral female if the speaker is female
and the speech contains such paralinguistic elements.
We split the official training dataset into training (80%), validation (10%) and test sets
(10%). The split was done so as to ensure that the three sets were
balanced (i.e., contained approximately the same proportion of neutral, non-neutral female
and non-neutral male files). The training dataset consisted of 500 files, the validation
dataset consisted of 63 files and the test set of 62 files. The results we report are for the
validation and test set performance.
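One way to reproduce such a balanced split programmatically is a stratified split; the sketch below assumes a pandas DataFrame with hypothetical "file" and "label" columns, and uses placeholder class counts, not the real distribution.

    import pandas as pd
    from sklearn.model_selection import train_test_split

    # Placeholder metadata: one row per audio file with its SER label.
    df = pd.DataFrame({
        "file": [f"audio_{i}.wav" for i in range(625)],
        "label": (["neutral"] * 437 + ["non-neutral female"] * 102
                  + ["non-neutral male"] * 86),  # placeholder counts
    })

    # 80/10/10 split, stratified by label so class proportions are preserved.
    train, rest = train_test_split(df, test_size=0.2, stratify=df["label"], random_state=0)
    val, test = train_test_split(rest, test_size=0.5, stratify=rest["label"], random_state=0)
    print(len(train), len(val), len(test))  # roughly 500, 63, 62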
As the official test dataset made available did not have labels, we have labeled it
ourselves, out of curiosity and to enable more consistent tests of the performance of the
networks. While the labels may not be perfect, they provide a close enough picture that
the performance of the models can be measured as an average over multiple experiments
(we were observing high variance across runs). As such, we also provide results for the official test
set with our unofficial labels. We stress that we did not use the test set labels for any
form of model or parameter selection.
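For reference, the metric we report can be computed with scikit-learn; the sketch below assumes a macro-averaged F1, a common choice for imbalanced multi-class tasks, and uses illustrative toy labels.

    from sklearn.metrics import f1_score

    # 0 = neutral, 1 = non-neutral female, 2 = non-neutral male (illustrative data).
    y_true = [0, 0, 1, 2, 1, 0, 2, 0]
    y_pred = [0, 1, 1, 2, 1, 0, 0, 0]
    print(f1_score(y_true, y_pred, average="macro"))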
Lastly, the PANNs we use have been trained on the AudioSet [8] dataset containing
more than 5000 hours of audio distributed across 527 classes.
3.2. PANN Architectures
Table 1 describes the three architectures we use. They are named CNN6, CNN10 and
CNN14 after the 6-layer, 10-layer and 14-layer CNNs they represent. These are the same
architectures proposed in the original PANN paper [7].
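To finetune one of these architectures for the three SER classes, one loads the published AudioSet checkpoint and replaces the classification head. The sketch below assumes the Cnn10 class and constructor arguments from the public PANNs reference implementation (github.com/qiuqiangkong/audioset_tagging_cnn); the checkpoint file name is a placeholder.

    import torch
    import torch.nn as nn
    from models import Cnn10  # models.py from the PANNs reference implementation

    # Build the model with its AudioSet pretraining configuration (527 classes).
    model = Cnn10(sample_rate=32000, window_size=1024, hop_size=320,
                  mel_bins=64, fmin=50, fmax=14000, classes_num=527)

    # Load the published AudioSet checkpoint (placeholder file name).
    checkpoint = torch.load("Cnn10_checkpoint.pth", map_location="cpu")
    model.load_state_dict(checkpoint["model"])

    # Replace the 527-way AudioSet head with a 3-way SER head
    # (neutral, non-neutral female, non-neutral male).
    model.fc_audioset = nn.Linear(512, 3)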