Effect of different splitting criteria on the
performance of speech emotion recognition
Bagus Tris Atmaja*
National Institute of Advanced Industrial Science and Technology, Japan
b-atmaja@aist.go.jp
Akira Sasou
National Institute of Advanced Industrial Science and Technology, Japan
a-sasou@aist.go.jp
Abstract—Traditional speech emotion recognition (SER) evaluations have been performed merely under a speaker-independent condition; some did not even evaluate their results under this condition. This paper highlights the importance of splitting training and test data for SER by script, a condition known as sentence-open or text-independent criteria. The results show that employing the sentence-open criterion degraded the performance of SER. This finding implies the difficulty of recognizing emotion from speech when the linguistic information embedded in the acoustic information differs between training and test data. Surprisingly, the text-independent criterion consistently performed worse than the speaker+text-independent criterion. The full ranking of splitting criteria by difficulty on SER performance, from the most difficult to the easiest, is text-independent, speaker+text-independent, speaker-independent, and speaker+text-dependent. The gap between the speaker+text-independent and text-independent criteria was smaller than the gaps between the other criteria, reinforcing the difficulty of recognizing emotion from speech across different sentences.
Index Terms—Speech emotion recognition, data partition,
text-independent, speaker-independent, splitting criteria
I. INTRODUCTION
Speech emotion recognition (SER) is one topic of interest in automatic speech recognition and understanding. In contrast to automatic speech recognition (ASR), which attempts to extract linguistic information from speech, SER attempts to extract non-linguistic information. More concretely, SER aims to infer the affective state of the speaker solely from speech data.
SER can be designed to recognize discrete emotions, continuous emotions, or both emotion models. Recent research suggested that emotion is ordinal by nature [1], which is closer to the categorical than to the continuous model. In the categorical model, several emotion categories exist, from the simplest two categories (positive and negative emotions) to 27 categories [2]. The choice of emotion model in SER depends on the availability of the labels in the dataset.
Data-driven methods, which most SER systems employ, rely on the configuration or selection of the data used to build the model. In SER, it is common to split the data by assigning different speakers to the training and test partitions. This approach, known as the speaker-independent criterion, is a gold standard for building SER models because it minimizes speaker variability in the training phase, as sketched below.
*Corresponding author, on leave from Department of Engineering
Physics, Institut Teknologi Sepuluh Nopember, Surabaya, Indonesia (email:
bagus@ep.its.ac.id).
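As an illustration, the following minimal sketch shows how such a speaker-independent split could be implemented with scikit-learn's GroupShuffleSplit. The metadata layout and column names (path, speaker, label) are our own illustrative assumptions, not taken from any particular SER corpus.

```python
# A minimal sketch of a speaker-independent split, assuming utterance
# metadata in a pandas DataFrame; the column names here are hypothetical.
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

meta = pd.DataFrame({
    "path":    ["u1.wav", "u2.wav", "u3.wav", "u4.wav", "u5.wav", "u6.wav"],
    "speaker": ["spk01", "spk01", "spk02", "spk02", "spk03", "spk03"],
    "label":   ["happy", "sad", "angry", "happy", "sad", "angry"],
})

# Grouping by speaker keeps every utterance of a speaker in a single
# partition, so no test speaker is ever seen during training.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.33, random_state=0)
train_idx, test_idx = next(splitter.split(meta, groups=meta["speaker"]))
train, test = meta.iloc[train_idx], meta.iloc[test_idx]

# Sanity check: speaker sets are disjoint between partitions.
assert set(train["speaker"]).isdisjoint(set(test["speaker"]))
```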
Recent research in SER, particularly on fusing acoustic and linguistic information, has found that different criteria for splitting the training data result in different performances [3]. When fusing acoustic and linguistic information, it is sensible to train and test the model on different scripts, a condition known as sentence-open or text-independent. This strategy is intended to avoid the effect of predicting emotion from the same linguistic information when the same sentences appear in both the training and test partitions. Since linguistic information is extracted from text or scripts, this splitting condition is necessary to evaluate the discrepancies of using different features or types of information. When using merely acoustic features for SER, one may argue that this evaluation is unnecessary since no linguistic features are involved in building the SER model.
In 2002, Fujisaki proposed a scheme in which various types of information, including emotion, are manifested in the segmental and suprasegmental features of speech [4]. Following this argument that emotional information is manifested directly in speech, without the need to convert speech into text, there is a possibility that different sentences will yield different SER performances under the same acoustic-only system. Current SER research shows no evaluation of the differences among splitting criteria, particularly comparing data with and without shared linguistic information.
The contribution of this paper is an evaluation of the effect of splitting criteria for the training data on SER performance. As argued previously, linguistic information is embedded in acoustic features; hence, evaluating the text-independent criterion, i.e., using different sentences for training and test, is necessary to observe such effects. We evaluated four splitting criteria, speaker-dependent (including text-dependent data), speaker-independent, text-independent, and speaker+text-independent, and traced their SER performances across three different experiments. The results show a consistent pattern of difficulty for the four splitting criteria: the text-independent criterion obtained the worst result, followed by the speaker+text-independent, speaker-independent, and speaker-dependent criteria. A sketch contrasting the four criteria is given below.
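To make the four criteria concrete, the sketch below expresses them as disjointness constraints on speaker and script (sentence) identifiers. The function name, metadata columns, and split ratio are illustrative assumptions, not the paper's actual implementation.

```python
# A hedged sketch of the four splitting criteria as disjointness
# constraints on hypothetical "speaker" and "script" metadata columns.
import pandas as pd

def partition(meta, test_speakers=None, test_scripts=None, seed=0):
    """Return (train, test) under one of four criteria:
    - neither given  -> speaker+text-dependent (plain random split)
    - speakers only  -> speaker-independent
    - scripts only   -> text-independent
    - both given     -> speaker+text-independent (utterances pairing a
      training speaker with a test script, or vice versa, are dropped
      so both attributes stay disjoint across partitions)
    """
    if test_speakers is None and test_scripts is None:
        test = meta.sample(frac=0.2, random_state=seed)
        return meta.drop(test.index), test
    spk = meta["speaker"].isin(test_speakers) if test_speakers else None
    txt = meta["script"].isin(test_scripts) if test_scripts else None
    if txt is None:
        return meta[~spk], meta[spk]
    if spk is None:
        return meta[~txt], meta[txt]
    return meta[~spk & ~txt], meta[spk & txt]
```

For example, under these assumptions, partition(meta, test_speakers={"spk03"}, test_scripts={"s10"}) would yield a speaker+text-independent split, while passing only one of the two arguments yields the corresponding single-attribute-independent split.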
II. RELATED WORK
Evaluating the effect of data selection on speech processing is not a new research topic. Different data for training