arXiv:2210.14495v1 [cs.SD] 26 Oct 2022
Two-stage dimensional emotion recognition by fusing
predictions of acoustic and text networks using SVM
Bagus Tris Atmaja a,b,*, Masato Akagi a
a Japan Advanced Institute of Science and Technology, 1-1 Asahidai, Nomi, Ishikawa 923-1292, Japan
b Sepuluh Nopember Institute of Technology, Sukolilo, Surabaya 60111, Indonesia
Abstract
Automatic speech emotion recognition (SER) by a computer is a critical compo-
nent for more natural human-machine interaction. As in human-human interac-
tion, the capability to perceive emotion correctly is essential to take further steps
in a particular situation. One issue in SER is whether it is necessary to combine
acoustic features with other data, such as facial expressions, text, and motion
capture. This research proposes to combine acoustic and text information by
applying a late-fusion approach consisting of two steps. First, acoustic and text
features are trained separately in deep learning systems. Second, the prediction
results from the deep learning systems are fed into a support vector machine
(SVM) to predict the final regression score. Furthermore, the task in this re-
search is dimensional emotion modeling, because it can enable a deeper analysis
of affective states. Experimental results show that this two-stage, late-fusion
approach obtains higher performance than any one-stage processing, with a
linear correlation between the one-stage and two-stage results. This late-fusion
approach also improves on previous early-fusion results as measured by the
concordance correlation coefficient (CCC) score.
Keywords: automatic speech emotion recognition, affective computing, late
fusion, multimodal fusion, dimensional emotion
Corresponding author, E-mail address: bagus@ep.its.ac.id
Preprint submitted to Speech Communication October 27, 2022
1. Introduction
Understanding human emotion is important for responding properly in a
particular situation for both human-human communication and future machine-
human communication. Emotion can be recognized from many modalities: fa-
cial expressions, speech, and motion of body parts. In the absence of visual
features, speech is the only way to recognize emotion, as in the case of a tele-
phone call or a call-center application (Petrushin, 1999). By automatically
identifying caller emotions, such a system can provide appropriate feedback
quickly and precisely.
Speech is a modality in which both acoustic and verbal information can be
extracted to recognize human emotion. Unfortunately, most speech emotion
recognition (SER) systems use only acoustic features for predicting categorical
emotions. In contrast, this research proposes to use both acoustic and text fea-
tures to improve dimensional SER performance. Text can be extracted from
speech, and it may contribute to emotion recognition. For example, an inter-
locutor can perceive emotion not only from prosodic information but also from
semantics. Grice (2002) stated in his implicature theory that what is implied de-
rives from what is said. For example, if someone says that he is angry but looks
happy, then the implication is that he is indeed angry. Hence, it is necessary to
use linguistic information to determine expressed emotion from speech. A fusion
of acoustic and linguistic information from speech is viable since (spoken) text
can be obtained from speech-to-text technology. This bimodal feature-fusion
strategy may improve the performance of SER over acoustic-only SER.
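As a minimal illustration of this speech-to-text step, the sketch below runs an off-the-shelf ASR model over a single utterance. Whisper is an assumed choice for illustration only, not necessarily the transcription source used in this study.

```python
# Minimal sketch of obtaining (spoken) text from speech with an off-the-shelf
# ASR model. Whisper is an assumed choice for illustration only, not
# necessarily the transcription source used in this study.
import whisper  # pip install openai-whisper

model = whisper.load_model("base")           # small pretrained ASR model
result = model.transcribe("utterance.wav")   # hypothetical path to one utterance
text = result["text"]                        # transcript used as the text modality
print(text)
```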
Besides the categorical approach, emotion can also be analyzed via a di-
mensional approach. In dimensional emotion, affective states are points in a
continuous space. Some researchers have used a two-dimensional (2D) space
comprising valence (positive or negative) and arousal (excited or apathetic).
Other researchers have proposed a 3D emotional space by adding either domi-
nance (degree of power over emotion) or liking/disliking. Although it is rare, a
4D emotional space has also been studied by adding expectancy or naturalness.
While some researchers, e.g., Russell (1980), argue that a 2D emotion model
is enough to characterize all categorical emotions, in this research, we choose a
3D emotion model with valence, arousal, and dominance as the emotion dimen-
sions/attributes.
Darwin argued that the biological category of a species, like emotion cat-
egories, does not have an essence due to the high variability of individuals
(Darwin, 1872). Mehrabian and Russell (1974) developed a pleasure, arousal,
and dominance (PAD) model to assess environmental perception, experience,
and psychological responses, as an alternative to categorical emotion. The latter,
also called dimensional emotion, may represent human emotion better than
categorical emotion. This dimensional view of emotion is also known as the
circumplex model of affect, and the pleasure dimension is often replaced by
valence with the same meaning (the VAD model). Although most research has
used the 2D model (valence and arousal), recent research shows that four
dimensions are needed to represent the meaning of emotion words (Fontaine et al., 2017).
However, current datasets lack labels for the fourth dimension (i.e., expectancy).
We evaluate the VAD emotion model since the datasets used here provide their
labels in this 3D space.
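To make the chosen representation concrete, a dimensional label is simply a small fixed-length vector rather than a class index. The sketch below is a minimal illustration, assuming a 1-5 annotation scale as in IEMOCAP-style ratings; the actual label ranges are those provided by the datasets used in this study.

```python
# Minimal sketch of a dimensional (VAD) emotion label, assuming a 1-5
# annotation scale as in IEMOCAP-style ratings; the actual ranges are those
# provided by the datasets used in this study.
from typing import NamedTuple

class VADLabel(NamedTuple):
    valence: float    # negative (1) ... positive (5)
    arousal: float    # calm/apathetic (1) ... excited (5)
    dominance: float  # weak (1) ... strong (5)

label = VADLabel(valence=3.5, arousal=2.0, dominance=2.5)  # hypothetical rating
```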
Deep neural networks (DNN) have recently gained more interest in model-
ing human cognitive processing for several tasks. Fayek et al. (2017) evaluated
some DNN architectures for categorical SER. They found fully connected (FC)
networks and recurrent neural networks (RNN) worked well for SER tasks us-
ing acoustic features only. In neuropsychological science, the neural mechanism
that integrates acoustic (non-verbal) and linguistic (verbal) information remains
unclear (Berckmoes and Vingerhoets, 2004). That study also stated that “the
various parameters of prosody [acoustics] are processed separately in specific
brain areas,” while no information is given for linguistic processing. Under this
view, the separate processing of acoustic and linguistic/text information is better
modeled by late fusion than by early fusion. This research makes use of a
support vector machine (SVM) to fuse the predictions of DNN-based acoustic
and linguistic emotion recognizers. Only a small amount of data remains for the
second stage after the DNNs have used most of it, which is one reason to choose
an SVM over another DNN for the fusion.
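As a concrete illustration of this two-stage idea, the sketch below trains two first-stage networks (one on acoustic feature sequences, one on text embedding sequences) to predict valence, arousal, and dominance, then fits an SVM regressor on their concatenated predictions and scores the result with the concordance correlation coefficient (CCC). The layer sizes, feature shapes, and the use of Keras LSTMs with scikit-learn's SVR are illustrative assumptions on synthetic data, not the exact configuration reported in this paper; in the actual setting the second stage would be fitted on predictions for data held out from first-stage training, as noted above.

```python
# Minimal sketch of two-stage dimensional SER (assumed shapes and layer sizes,
# synthetic data): stage 1 trains separate acoustic and text networks, stage 2
# fuses their predictions with an SVM regressor.
import numpy as np
import tensorflow as tf
from sklearn.multioutput import MultiOutputRegressor
from sklearn.svm import SVR

def ccc(y_true, y_pred):
    """Concordance correlation coefficient for one emotion dimension."""
    mean_t, mean_p = y_true.mean(), y_pred.mean()
    cov = ((y_true - mean_t) * (y_pred - mean_p)).mean()
    return 2 * cov / (y_true.var() + y_pred.var() + (mean_t - mean_p) ** 2)

def make_lstm(time_steps, n_features):
    """First-stage regressor: LSTM over a feature sequence -> 3 emotion dimensions."""
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(time_steps, n_features)),
        tf.keras.layers.LSTM(64),
        tf.keras.layers.Dense(3),          # valence, arousal, dominance
    ])

# Synthetic stand-ins for frame-based acoustic features, word-embedding
# sequences, and VAD labels (hypothetical shapes).
X_acoustic = np.random.randn(200, 100, 34).astype("float32")  # 100 frames x 34 LLDs
X_text = np.random.randn(200, 50, 300).astype("float32")      # 50 tokens x 300-d embeddings
y = np.random.rand(200, 3).astype("float32")                   # VAD targets scaled to [0, 1]

# Stage 1: train the two unimodal networks separately.
acoustic_net, text_net = make_lstm(100, 34), make_lstm(50, 300)
for net, X in [(acoustic_net, X_acoustic), (text_net, X_text)]:
    net.compile(optimizer="adam", loss="mse")
    net.fit(X, y, epochs=2, batch_size=32, verbose=0)

# Stage 2: concatenate the unimodal predictions (6 values per utterance) and
# fit an SVM regressor on them to produce the final VAD scores.
stage1 = np.hstack([acoustic_net.predict(X_acoustic, verbose=0),
                    text_net.predict(X_text, verbose=0)])
fusion = MultiOutputRegressor(SVR(kernel="rbf")).fit(stage1, y)
y_hat = fusion.predict(stage1)

for i, name in enumerate(["valence", "arousal", "dominance"]):
    print(name, round(float(ccc(y[:, i], y_hat[:, i])), 3))
```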
This study aims to evaluate the combination of acoustic and text features
to improve the performance of dimensional automatic SER by using two-stage
processing. Current research on pattern recognition has also shown that the use
of multimodal features from audio, visual, and motion-capture data increases
performance as compared to using a single modality (Hu and Flaxman, 2018;
Yoon et al., 2018; Tripathi and Beigi, 2018). Meanwhile, research on big data
has revealed that the use of more data will improve performance for results
from the same algorithm (Halevy et al., 2009). By using both acoustic and text
features, SER should obtain improved performance over acoustic-only and text-
only recognition. This assumption is also motivated by the fact that human
emotion perception uses multimodal sensing, particularly verbal and non-verbal
information. Many technologies, such as human-robot interaction, can poten-
tially benefit from such improvement in emotion recognition.
The main contributions of this study then are: (1) a proposal of two-stage
processing for dimensional emotion recognition from acoustic and text features
using LSTM and SVM, and a comparison of the results with unimodal re-
sults and another fusion method on the same metric and dataset scenario; (2)
an evaluation of different acoustic and text features to find the best acoustic-text
feature pair among those evaluated, including frame-based acoustic features and
utterance-based statistical functions with and without silent-pause features; (3)
an evaluation of speaker-dependent vs. speaker-independent scenarios in
dimensional speech emotion recognition from text features; and (4) an evaluation
of text features on a dataset that originally contains target sentences, which we
removed to avoid their effect.
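As an illustration of the feature granularities mentioned in contribution (2), the sketch below extracts frame-based acoustic features and derives utterance-level statistical functions plus a simple silent-pause ratio from them. MFCCs via librosa and the specific silence threshold are assumptions for illustration, not the exact feature set used in this paper.

```python
# Minimal sketch of frame-based vs. utterance-based acoustic features.
# MFCCs and the silence threshold are illustrative assumptions, not the exact
# feature set used in this paper.
import librosa
import numpy as np

def acoustic_features(path, sr=16000, n_mfcc=13):
    y, sr = librosa.load(path, sr=sr)

    # Frame-based (low-level) features: one vector per frame, fed to a sequence model.
    frame_feats = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T  # (n_frames, n_mfcc)

    # Utterance-based statistical functions over the frame features.
    functionals = np.concatenate([frame_feats.mean(axis=0), frame_feats.std(axis=0)])

    # A simple silent-pause feature: fraction of samples below an energy threshold.
    voiced = librosa.effects.split(y, top_db=30)            # non-silent intervals
    voiced_samples = sum(end - start for start, end in voiced)
    silence_ratio = 1.0 - voiced_samples / len(y)

    return frame_feats, np.append(functionals, silence_ratio)

# frames, stats = acoustic_features("utterance.wav")  # hypothetical file path
```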
The rest of this paper is organized as follows. “Related work” reviews work closely
related to this research, including the differences between this study and previous
research; “Datasets and features” outlines the datasets and feature sets used in
this research; “Two-stage bimodal emotion recognition” explains the method used
to obtain the results; “Results and discussion” presents the results and their
discussion; and finally “Conclusions” concludes this study and proposes
future work.
2. Related work
Speech emotion recognition (SER) began to be seriously researched as part
of human-computer interaction with work such as that of Kleine-Cosack (2006). The
amount of research has grown as datasets have become publicly available, includ-
ing the Berlin EMO-DB, IEMOCAP, MSP-IMPROV, and RAVDESS datasets.
To enable analysis and comparison with previous research, we review the
related work below. We focus on comparing previous
work that used the same or similar datasets as this work does (specifically,
IEMOCAP, MSP-IMPROV, or both), and especially on research that focused
on dimensional rather than categorical emotion. While the focus here is on
bimodal emotion recognition using both acoustic and text data, some work on
speech-only or text-only emotion recognition is briefly described.
2.1. Acoustic emotion recognition
Recognition of emotion within speech signals has been actively developed
since the success of recognizing emotion via facial expressions. From categor-
ical emotion detection, the paradigm of SER has shifted to predicting degrees
of emotion attributes or dimensional emotions. One of the earliest papers on
(categorical) SER (Petrushin, 1999) explored how well humans and computers
recognize emotion in speech. Since then, research on categorical emotion recog-
nition has grown following the development of affective research in psychology.
Jin and Wang (2005) reported a first trial on SER in categorical and two-
dimensional (2D) spaces. They found that acoustic features are helpful in de-
scribing and distinguishing emotion through the concept of emotion modeling
(2D space). Giannakopoulos et al. (2009) re-investigated the associa-
tion of speech signals with an emotion wheel (continuous space). They proposed
a method to estimate the degrees of valence and arousal. Their method, includ-
ing a proposed feature set, could estimate both valence and arousal, with an