arXiv:2210.14495v1 [cs.SD] 26 Oct 2022
Two-stage dimensional emotion recognition by fusing
predictions of acoustic and text networks using SVM
Bagus Tris Atmaja a,b,*, Masato Akagi a
a Japan Advanced Institute of Science and Technology, 1-1 Asahidai, Nomi, Ishikawa 923-1292, Japan
b Sepuluh Nopember Institute of Technology, Sukolilo, Surabaya 60111, Indonesia
Abstract
Automatic speech emotion recognition (SER) by a computer is a critical compo-
nent for more natural human-machine interaction. As in human-human interac-
tion, the capability to perceive emotion correctly is essential to take further steps
in a particular situation. One issue in SER is whether it is necessary to combine
acoustic features with other data, such as facial expressions, text, and motion
capture. This research proposes to combine acoustic and text information by
applying a late-fusion approach consisting of two steps. First, acoustic and text
features are trained separately in deep learning systems. Second, the prediction
results from the deep learning systems are fed into a support vector machine
(SVM) to predict the final regression score. Furthermore, the task in this re-
search is dimensional emotion modeling, because it can enable a deeper analysis
of affective states. Experimental results show that this two-stage, late-fusion
approach obtains higher performance than any one-stage processing, with a
linear correlation between the one-stage and two-stage results. This late-fusion
approach also improves on previous early-fusion results as measured by the
concordance correlation coefficient (CCC) score.
Keywords: automatic speech emotion recognition, affective computing, late
fusion, multimodal fusion, dimensional emotion
Corresponding author, E-mail address: bagus@ep.its.ac.id
Preprint submitted to Speech Communication October 27, 2022
1. Introduction
Understanding human emotion is important for responding properly in a
particular situation for both human-human communication and future machine-
human communication. Emotion can be recognized from many modalities: fa-
cial expressions, speech, and motion of body parts. In the absence of visual
features, speech is the only way to recognize emotion, as in the case of a tele-
phone call or a call-center application (Petrushin, 1999). By automatically
identifying caller emotions, such a system can provide appropriate feedback
quickly and precisely.
Speech is a modality in which both acoustic and verbal information can be
extracted to recognize human emotion. Unfortunately, most speech emotion
recognition (SER) systems use only acoustic features for predicting categorical
emotions. In contrast, this research proposes to use both acoustic and text fea-
tures to improve dimensional SER performance. Text can be extracted from
speech, and it may contribute to emotion recognition. For example, an inter-
locutor can perceive emotion not only from prosodic information but also from
semantics. Grice (2002) stated in his implicature theory that what is implied de-
rives from what is said. For example, if someone says that he is angry but looks
happy, then the implication is that he is indeed angry. Hence, it is necessary to
use linguistic information to determine expressed emotion from speech. A fusion
of acoustic and linguistic information from speech is viable since (spoken) text
can be obtained from speech-to-text technology. This bimodal feature-fusion
strategy may improve the performance of SER over acoustic-only SER.
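As a minimal illustration of this speech-to-text step, the sketch below runs an off-the-shelf ASR model over a single utterance. Whisper is an assumed choice for illustration only, not necessarily the transcription source used in this study.

```python
# Minimal sketch of obtaining (spoken) text from speech with an off-the-shelf
# ASR model. Whisper is an assumed choice for illustration only, not
# necessarily the transcription source used in this study.
import whisper  # pip install openai-whisper

model = whisper.load_model("base")           # small pretrained ASR model
result = model.transcribe("utterance.wav")   # hypothetical path to one utterance
text = result["text"]                        # transcript used as the text modality
print(text)
```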
Besides the categorical approach, emotion can also be analyzed via a di-
mensional approach. In dimensional emotion, affective states are points in a
continuous space. Some researchers have used a two-dimensional (2D) space
comprising valence (positive or negative) and arousal (excited or apathetic).
Other researchers have proposed a 3D emotional space by adding either domi-
nance (degree of power over emotion) or liking/disliking. Although it is rare, a
4D emotional space has also been studied by adding expectancy or naturalness.
While some researchers, e.g., Russell (1980), argue that a 2D emotion model
is enough to characterize all categorical emotions, in this research, we choose a
3D emotion model with valence, arousal, and dominance as the emotion dimen-
sions/attributes.
Darwin argued that the biological category of a species, like emotion cat-
egories, does not have an essence due to the high variability of individuals
(Darwin, 1872). Mehrabian and Russell (1974) developed a pleasure, arousal,
and dominance (PAD) model to assess environmental perception, experience,
and psychological responses, as an alternative to categorical emotion. The latter,
also called dimensional emotion, may represent human emotion better than
categorical emotion. This dimensional view of emotion is also known as the
circumplex model of affect, and the pleasure dimension is often replaced by
valence with the same meaning (the VAD model). Although most research has
used the 2D model (valence and arousal), recent research shows that four
dimensions are needed to represent the meaning of emotion words (Fontaine et al., 2017).
However, current datasets lack labels for the fourth dimension (i.e., expectancy).
We evaluate the VAD emotion model since the datasets used here provide their
labels in this 3D space.
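To make the chosen representation concrete, a dimensional label is simply a small fixed-length vector rather than a class index. The sketch below is a minimal illustration, assuming a 1-5 annotation scale as in IEMOCAP-style ratings; the actual label ranges are those provided by the datasets used in this study.

```python
# Minimal sketch of a dimensional (VAD) emotion label, assuming a 1-5
# annotation scale as in IEMOCAP-style ratings; the actual ranges are those
# provided by the datasets used in this study.
from typing import NamedTuple

class VADLabel(NamedTuple):
    valence: float    # negative (1) ... positive (5)
    arousal: float    # calm/apathetic (1) ... excited (5)
    dominance: float  # weak (1) ... strong (5)

label = VADLabel(valence=3.5, arousal=2.0, dominance=2.5)  # hypothetical rating
```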
Deep neural networks (DNN) have recently gained more interest in model-
ing human cognitive processing for several tasks. Fayek et al. (2017) evaluated
some DNN architectures for categorical SER. They found fully connected (FC)
networks and recurrent neural networks (RNN) worked well for SER tasks us-
ing acoustic features only. In neuropsychological science, the neural mechanism
that integrates acoustic (non-verbal) and linguistic (verbal) information remains
unclear (Berckmoes and Vingerhoets, 2004). That study also stated that “the
various parameters of prosody [acoustics] are processed separately in specific
brain areas,” while no information is given for linguistic processing. Under this
view, the separate processing of acoustic and linguistic/text information is better
modeled by late fusion than by early fusion. This research makes use of a
support vector machine (SVM) to fuse the predictions of DNN-based acoustic
and linguistic emotion recognizers. Only a small amount of data remains for the
second stage after the DNNs have used most of it, which is one reason to choose
an SVM over another DNN for the fusion.
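As a concrete illustration of this two-stage idea, the sketch below trains two first-stage networks (one on acoustic feature sequences, one on text embedding sequences) to predict valence, arousal, and dominance, then fits an SVM regressor on their concatenated predictions and scores the result with the concordance correlation coefficient (CCC). The layer sizes, feature shapes, and the use of Keras LSTMs with scikit-learn's SVR are illustrative assumptions on synthetic data, not the exact configuration reported in this paper; in the actual setting the second stage would be fitted on predictions for data held out from first-stage training, as noted above.

```python
# Minimal sketch of two-stage dimensional SER (assumed shapes and layer sizes,
# synthetic data): stage 1 trains separate acoustic and text networks, stage 2
# fuses their predictions with an SVM regressor.
import numpy as np
import tensorflow as tf
from sklearn.multioutput import MultiOutputRegressor
from sklearn.svm import SVR

def ccc(y_true, y_pred):
    """Concordance correlation coefficient for one emotion dimension."""
    mean_t, mean_p = y_true.mean(), y_pred.mean()
    cov = ((y_true - mean_t) * (y_pred - mean_p)).mean()
    return 2 * cov / (y_true.var() + y_pred.var() + (mean_t - mean_p) ** 2)

def make_lstm(time_steps, n_features):
    """First-stage regressor: LSTM over a feature sequence -> 3 emotion dimensions."""
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(time_steps, n_features)),
        tf.keras.layers.LSTM(64),
        tf.keras.layers.Dense(3),          # valence, arousal, dominance
    ])

# Synthetic stand-ins for frame-based acoustic features, word-embedding
# sequences, and VAD labels (hypothetical shapes).
X_acoustic = np.random.randn(200, 100, 34).astype("float32")  # 100 frames x 34 LLDs
X_text = np.random.randn(200, 50, 300).astype("float32")      # 50 tokens x 300-d embeddings
y = np.random.rand(200, 3).astype("float32")                   # VAD targets scaled to [0, 1]

# Stage 1: train the two unimodal networks separately.
acoustic_net, text_net = make_lstm(100, 34), make_lstm(50, 300)
for net, X in [(acoustic_net, X_acoustic), (text_net, X_text)]:
    net.compile(optimizer="adam", loss="mse")
    net.fit(X, y, epochs=2, batch_size=32, verbose=0)

# Stage 2: concatenate the unimodal predictions (6 values per utterance) and
# fit an SVM regressor on them to produce the final VAD scores.
stage1 = np.hstack([acoustic_net.predict(X_acoustic, verbose=0),
                    text_net.predict(X_text, verbose=0)])
fusion = MultiOutputRegressor(SVR(kernel="rbf")).fit(stage1, y)
y_hat = fusion.predict(stage1)

for i, name in enumerate(["valence", "arousal", "dominance"]):
    print(name, round(float(ccc(y[:, i], y_hat[:, i])), 3))
```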
This study aims to evaluate the combination of acoustic and text features
to improve the performance of dimensional automatic SER by using two-stage
processing. Current research on pattern recognition has also shown that the use
of multimodal features from audio, visual, and motion-capture data increases
performance as compared to using a single modality (Hu and Flaxman, 2018;
Yoon et al., 2018; Tripathi and Beigi, 2018). Meanwhile, research on big data
has revealed that the use of more data will improve performance for results
from the same algorithm (Halevy et al., 2009). By using both acoustic and text
features, SER should obtain improved performance over acoustic-only and text-
only recognition. This assumption is also motivated by the fact that human
emotion perception uses multimodal sensing, particularly verbal and non-verbal
information. Many technologies, such as human-robot interaction, can poten-
tially benefit from such improvement in emotion recognition.
The main contributions of this study then are: (1) a proposal of two-stage
processing for dimensional emotion recognition from acoustic and text features
using LSTM and SVM, and a comparison of the results with unimodal re-
sults and another fusion method on the same metric and dataset scenario; (2)
an evaluation of different acoustic and text features to find the best acoustic-text
feature pair among those evaluated, including frame-based acoustic features and
utterance-based statistical functions with and without silent-pause features; (3)
an evaluation of speaker-dependent vs. speaker-independent scenarios in
dimensional speech emotion recognition from text features; and (4) an evaluation
of text features on a dataset that originally contains target sentences, which we
removed to avoid their effect.
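As an illustration of the feature granularities mentioned in contribution (2), the sketch below extracts frame-based acoustic features and derives utterance-level statistical functions plus a simple silent-pause ratio from them. MFCCs via librosa and the specific silence threshold are assumptions for illustration, not the exact feature set used in this paper.

```python
# Minimal sketch of frame-based vs. utterance-based acoustic features.
# MFCCs and the silence threshold are illustrative assumptions, not the exact
# feature set used in this paper.
import librosa
import numpy as np

def acoustic_features(path, sr=16000, n_mfcc=13):
    y, sr = librosa.load(path, sr=sr)

    # Frame-based (low-level) features: one vector per frame, fed to a sequence model.
    frame_feats = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T  # (n_frames, n_mfcc)

    # Utterance-based statistical functions over the frame features.
    functionals = np.concatenate([frame_feats.mean(axis=0), frame_feats.std(axis=0)])

    # A simple silent-pause feature: fraction of samples below an energy threshold.
    voiced = librosa.effects.split(y, top_db=30)            # non-silent intervals
    voiced_samples = sum(end - start for start, end in voiced)
    silence_ratio = 1.0 - voiced_samples / len(y)

    return frame_feats, np.append(functionals, silence_ratio)

# frames, stats = acoustic_features("utterance.wav")  # hypothetical file path
```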
The rest of this paper is organized as follows. “Related work” reviews work closely
related to this research, including the differences between this study and previous
research; “Datasets and features” outlines the datasets and feature sets used in
this research; “Two-stage bimodal emotion recognition” explains the method used
to obtain the results; “Results and discussion” presents the results and their
discussion; and finally “Conclusions” concludes this study and proposes
future work.
2. Related work
Speech emotion recognition (SER) began to be seriously researched as part
of human-computer interaction with work such as that of Kleine-Cosack (2006). The
amount of research has grown as datasets have become publicly available, includ-
ing the Berlin EMO-DB, IEMOCAP, MSP-IMPROV, and RAVDESS datasets.
To enable analysis and comparison with previous research, we review the
related work below. We focus on comparing previous
work that used the same or similar datasets as this work does (specifically,
IEMOCAP, MSP-IMPROV, or both), and especially on research that focused
on dimensional rather than categorical emotion. While the focus here is on
bimodal emotion recognition using both acoustic and text data, some work on
speech-only or text-only emotion recognition is briefly described.
2.1. Acoustic emotion recognition
Recognition of emotion within speech signals has been actively developed
since the success of recognizing emotion via facial expressions. From categor-
ical emotion detection, the paradigm of SER has shifted to predicting degrees
of emotion attributes or dimensional emotions. One of the earliest papers on
(categorical) SER (Petrushin, 1999) explored how well humans and computers
recognize emotion in speech. Since then, research on categorical emotion recog-
nition has grown following the development of affective research in psychology.
Jin and Wang (2005) reported a first trial on SER in categorical and two-
dimensional (2D) spaces. They found that acoustic features are helpful in de-
scribing and distinguishing emotion through the concept of emotion modeling
(2D space). Giannakopoulos et al. (2009) re-investigated the associa-
tion of speech signals with an emotion wheel (continuous space). They proposed
a method to estimate the degrees of valence and arousal. Their method, includ-
ing a proposed feature set, could estimate both valence and arousal, with an