Pretrained audio neural networks for Speech
emotion recognition in Portuguese
Marcelo Matheus Gauy1, Marcelo Finger1
1Universidade de São Paulo, Rua do Matão 1010, São Paulo, Brazil
Abstract
The goal of speech emotion recognition (SER) is to identify the emotional aspects of speech.
The SER challenge for Brazilian Portuguese speech was proposed with short snippets of
Portuguese which are classified as neutral, non-neutral female and non-neutral male according
to paralinguistic elements (laughing, crying, etc). This dataset contains about 50 minutes of
Brazilian Portuguese speech. As the dataset is on the small side, we investigate whether
a combination of transfer learning and data augmentation techniques can produce positive
results. Thus, by combining a data augmentation technique called SpecAugment with the
use of Pretrained Audio Neural Networks (PANNs) for transfer learning, we are able to obtain
interesting results. The PANNs (CNN6, CNN10 and CNN14) are pretrained on a large dataset
called AudioSet containing more than 5000 hours of audio. They were finetuned on the SER
dataset and the best performing model (CNN10) on the validation set was submitted to the
challenge, achieving an F1-score of 0.73, up from 0.54 for the baselines provided by the
challenge. Moreover, we also tested the use of a Transformer neural architecture, pretrained on
about 600 hours of Brazilian Portuguese audio data. Transformers, as well as the more complex
PANN model (CNN14), fail to generalize to the test set of the SER dataset and do not
beat the baseline. Given the limited dataset sizes, currently the best approach
for SER is using PANNs (specifically, CNN6 and CNN10).
Keywords
Speech emotion recognition, Pretrained audio neural networks, Transfer learning, Transformers
1. Introduction
Speech emotion recognition (SER) aims at identifying the emotional aspects of speech
independently from the actual semantic content. SER can be used to identify the
emotions of humans, e.g., when using mobile phones, an ability that may become crucial
in improving human-machine interactions in the future [1]. Several efforts to acquire
speech data classified with different emotional labels have been undertaken [2, 3, 4].
These datasets are typically small in size, even for languages such as English. In order to
tackle these datasets, the use of transfer learning and data augmentation techniques may
be instrumental.
Transfer learning is the method of training a network on a particular problem where
there is an abundance of data, with the goal of using the acquired knowledge to obtain
better performance on a related problem with limited data available. Transfer learning
has been effectively used in many fields of deep learning such as computer vision [5] and
language modelling [6]. Data augmentation is the method of increasing the amount of
data available by slightly modifying copies of the data. This can be done, for example,
by masking parts of the input or by adding Gaussian noise to it.
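To make the second idea concrete, the following is a minimal sketch (not the code used in this paper) of a waveform-level augmentation that adds Gaussian noise; the noise level noise_std is an illustrative value, not a tuned parameter.

    import torch

    def add_gaussian_noise(waveform: torch.Tensor, noise_std: float = 0.005) -> torch.Tensor:
        # waveform: tensor of shape (channels, samples), values roughly in [-1, 1].
        # noise_std: standard deviation of the added noise (illustrative value).
        return waveform + torch.randn_like(waveform) * noise_std

    # Usage: create an augmented copy of a (fake) 1-second clip at 16 kHz.
    clip = torch.rand(1, 16000) * 2 - 1
    augmented = add_gaussian_noise(clip)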
In this paper, we use transfer learning and data augmentation techniques to study
SER in Brazilian Portuguese speech. We participate in the SER challenge, a shared task
for Brazilian Portuguese speech emotion recognition. This challenge made
available a labeled dataset of 625 audio files as the training set for SER. Moreover, a dataset
of 308 files was made available as the test set. The training and test datasets consisted of
short snippets of Brazilian Portuguese speech, usually less than 15 s long, labeled neutral,
non-neutral female and non-neutral male (non-neutral for audios containing laughs, cries,
etc).
For transfer learning, we employ Pretrained Audio Neural Networks (PANNs) [7], which
are convolutional neural networks trained on a large dataset of audios (AudioSet [8]),
consisting of 1.9 million audio clips distributed across 527 sound classes. By using the
pretrained models made available by the developers, and finetuning on the SER dataset
for Brazilian Portuguese speech, we are able to beat the proposed baselines of prosodic
features and wav2vec features. We achieve (via CNN10) an F1-score of 0.73, up from 0.54
for the baselines. During finetuning, we employ a data augmentation technique called
SpecAugment [9].
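For concreteness, below is a minimal sketch of SpecAugment-style masking using torchaudio's built-in transforms; the mask sizes are illustrative and are not the values used in our experiments.

    import torch
    import torchaudio

    # Log-mel front end followed by SpecAugment-style masking.
    mel = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=64)
    freq_mask = torchaudio.transforms.FrequencyMasking(freq_mask_param=8)  # mask up to 8 mel bins
    time_mask = torchaudio.transforms.TimeMasking(time_mask_param=25)      # mask up to 25 frames

    waveform = torch.rand(1, 16000) * 2 - 1      # placeholder 1-second clip at 16 kHz
    spec = mel(waveform)                         # shape: (channel, n_mels, frames)
    spec = time_mask(freq_mask(spec))            # apply masking during training only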
We also tested the use of Transformer neural networks, pretrained on a large amount
of Brazilian Portuguese audio data [10]. However, we find that, with the current amount
of available data for SER, Transformers do not generalize their training performance to
the validation and test sets. This holds even when using the most common techniques to
prevent overfitting. The same behaviour was also observed for more complex PANNs,
such as CNN14.
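As an illustration, a typical recipe of such techniques (a sketch with illustrative values, not our exact configuration) combines dropout, weight decay and early stopping:

    import torch
    import torch.nn as nn

    # Dropout in the classification head and weight decay in the optimizer.
    head = nn.Sequential(nn.Dropout(p=0.5), nn.Linear(512, 3))  # 3 SER classes
    optimizer = torch.optim.AdamW(head.parameters(), lr=1e-4, weight_decay=1e-2)

    def early_stop(val_losses, patience=5):
        # Stop when the best validation loss is older than `patience` epochs.
        best = min(range(len(val_losses)), key=val_losses.__getitem__)
        return len(val_losses) - 1 - best >= patience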
2. Related Work
There is a large literature on SER in English [11, 12, 13, 14, 15, 16, 17, 18]. Moreover,
there are many small datasets for SER in English, such as RAVDESS [2], SAVEE [3] and
IEMOCAP [4]. To the best of our knowledge, the SER dataset for Brazilian Portuguese
speech is the only available dataset for the language. In addition, English datasets are
usually annotated with a different set of labels. RAVDESS [2], for example, has the classes
of calm, happy, angry, sad, fearful, surprise and disgust. This contrasts with the classes
of neutral, non-neutral female and non-neutral male present in the SER dataset for
Brazilian Portuguese speech. As such, comparisons of our work with the state of the art
in the English language are not really possible. Nevertheless, the authors of [18], the most
recent work, obtain an average recall on RAVDESS of 84.3 percent using wav2vec 2.0 [19].
On IEMOCAP, they obtain an average recall of 67.2 percent, also using wav2vec 2.0.
Transfer learning is a very common technique in situations where the dataset available
is small in size. It has been effectively employed in computer vision [5, 20], language
modelling [6, 21] and audio tasks [7, 22, 18]. In the original PANN paper [7], the authors
propose several convolutional neural networks pretrained on AudioSet which can be
finetuned on other smaller datasets. In [18], the authors use wav2vec 2.0 pretrained
on Librispeech and finetuned on either RAVDESS or IEMOCAP for speech emotion
recognition. Finally, in [22], the authors provide a comprehensive review of transfer
learning methods used for speech and language processing tasks.
3. Methodology
The code for this paper can be found on GitHub. Below we describe the dataset and
architectures used.
3.1. SER Dataset
To perform SER on Brazilian Portuguese speech, we use the training dataset (CORAA
SER version 1.0) provided for the challenge. This dataset was built from the C-ORAL-
BRASIL I corpus [23], with 625 audio files, typically less than 15 s long, containing
informal spontaneous Brazilian Portuguese speech. These audio files are labeled neutral,
non-neutral female or non-neutral male. An audio is labeled non-neutral male if the
speaker is male and the speech contains paralinguistic elements (such as laughing or
crying). Similarly, an audio is labeled non-neutral female if the speaker is female
and the speech contains such paralinguistic elements.
We split the official training dataset into training (80%), validation (10%) and test sets
(10%). The split was done so as to ensure that the three sets were
balanced (i.e., contained approximately the same proportion of neutral, non-neutral female
and non-neutral male files). The training dataset consisted of 500 files, the validation
dataset consisted of 63 files and the test set of 62 files. The results we report are for the
validation and test set performance.
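One way to reproduce such a balanced split programmatically is a stratified split; the sketch below assumes a pandas DataFrame with hypothetical "file" and "label" columns, and uses placeholder class counts, not the real distribution.

    import pandas as pd
    from sklearn.model_selection import train_test_split

    # Placeholder metadata: one row per audio file with its SER label.
    df = pd.DataFrame({
        "file": [f"audio_{i}.wav" for i in range(625)],
        "label": (["neutral"] * 437 + ["non-neutral female"] * 102
                  + ["non-neutral male"] * 86),  # placeholder counts
    })

    # 80/10/10 split, stratified by label so class proportions are preserved.
    train, rest = train_test_split(df, test_size=0.2, stratify=df["label"], random_state=0)
    val, test = train_test_split(rest, test_size=0.5, stratify=rest["label"], random_state=0)
    print(len(train), len(val), len(test))  # roughly 500, 63, 62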
As the official test dataset made available did not have labels, we have labeled it
ourselves, out of curiosity and to enable more consistent tests of the performance of the
networks. While the labels may not be perfect, they provide a close enough picture that
the performance of the models can be measured as an average over multiple experiments
(we were observing high variance across runs). As such, we also provide results for the official test
set with our unofficial labels. We stress that we did not use the test set labels for any
form of model or parameter selection.
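For reference, the metric we report can be computed with scikit-learn; the sketch below assumes a macro-averaged F1, a common choice for imbalanced multi-class tasks, and uses illustrative toy labels.

    from sklearn.metrics import f1_score

    # 0 = neutral, 1 = non-neutral female, 2 = non-neutral male (illustrative data).
    y_true = [0, 0, 1, 2, 1, 0, 2, 0]
    y_pred = [0, 1, 1, 2, 1, 0, 0, 0]
    print(f1_score(y_true, y_pred, average="macro"))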
Lastly, the PANNs we use have been trained on the AudioSet [8] dataset containing
more than 5000 hours of audio distributed across 527 classes.
3.2. PANN Architectures
Table 1 describes the three architectures we use. They are named CNN6, CNN10 and
CNN14 after the 6-layer, 10-layer and 14-layer CNNs they represent. These are the same
architectures proposed in the original PANN paper [7].
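To finetune one of these architectures for the three SER classes, one loads the published AudioSet checkpoint and replaces the classification head. The sketch below assumes the Cnn10 class and constructor arguments from the public PANNs reference implementation (github.com/qiuqiangkong/audioset_tagging_cnn); the checkpoint file name is a placeholder.

    import torch
    import torch.nn as nn
    from models import Cnn10  # models.py from the PANNs reference implementation

    # Build the model with its AudioSet pretraining configuration (527 classes).
    model = Cnn10(sample_rate=32000, window_size=1024, hop_size=320,
                  mel_bins=64, fmin=50, fmax=14000, classes_num=527)

    # Load the published AudioSet checkpoint (placeholder file name).
    checkpoint = torch.load("Cnn10_checkpoint.pth", map_location="cpu")
    model.load_state_dict(checkpoint["model"])

    # Replace the 527-way AudioSet head with a 3-way SER head
    # (neutral, non-neutral female, non-neutral male).
    model.fc_audioset = nn.Linear(512, 3)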