
KNOWLEDGE TRANSFER FOR ON-DEVICE SPEECH EMOTION RECOGNITION
WITH NEURAL STRUCTURED LEARNING
Yi Chang1, Zhao Ren2, Thanh Tam Nguyen3, Kun Qian4, and Björn W. Schuller1,5
1GLAM – Group on Language, Audio, & Music, Imperial College London, United Kingdom
2L3S Research Center, Leibniz University Hannover, Germany
3Griffith University, Australia
4School of Medical Technology, Beijing Institute of Technology, China
5Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg, Germany

This research was partially funded by the Federal Ministry of Education and Research (BMBF), Germany, under the project LeibnizKILabor (grant No. 01DD20003); the research project “IIP-Ecosphere”, granted by the German Federal Ministry for Economics and Climate Action (BMWK) via funding code No. 01MK20006A; the Ministry of Science and Technology of the People’s Republic of China (No. 2021ZD0201900); the National Natural Science Foundation of China (No. 62272044); the National High-Level Young Talent Project; and the BIT Teli Young Fellow Program from the Beijing Institute of Technology, China.
ABSTRACT
Speech emotion recognition (SER) has been a popular research topic
in human-computer interaction (HCI). As edge devices rapidly proliferate, deploying SER on edge devices is promising for a wide range of HCI applications. Although deep learning has been investigated to improve the performance of SER by training complex models, the memory space and computational capability of edge devices represent a constraint for embedding deep learning models.
We propose a neural structured learning (NSL) framework that builds synthesized graphs. An SER model is trained on a source
dataset and used to build graphs on a target dataset. A relatively
lightweight model is then trained with the speech samples and graphs
together as the input. Our experiments demonstrate that training a
lightweight SER model on the target dataset with speech samples
and graphs can not only produce small SER models, but also en-
hance the model performance compared to models with speech sam-
ples only and those using classic transfer learning strategies.
Index Terms—Speech emotion recognition, neural structured
learning, edge device, lightweight deep learning
1. INTRODUCTION
Speech emotion recognition (SER), which aims to recognise emo-
tional states from speech, has been a popular research topic in the
domain of human-computer interaction (HCI) [1]. SER has been
applied to a range of applications, including call centres, education,
mental health, computer games, and many others [2]. In particu-
lar, speech signals provide rich and complementary information to
other modalities, e. g., images, biosignals, social media, etc. [3]. SER
can not only improve the performance of emotion recognition when
combined with other modalities in a multimodal system, but also en-
able machines to perceive human emotions when other modalities
are not available, such as in audio-only call centres [1].
Deep learning has been widely applied to SER with many model architectures, including spectrum-based and end-to-end models. Spectrum-based models process spectrum features from speech signals [4, 5], while end-to-end models directly process raw speech signals [6, 7]. Convolutional neural networks (CNNs), recurrent neural networks (RNNs), and their variants (e. g., transformers) have been commonly used to build either end-to-end or spectrum-based models for SER [8, 9].
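To make the distinction between the two input representations concrete, the following is a minimal Python sketch assuming the librosa library; the file name and parameter values are illustrative, not taken from the cited works:

import librosa
import numpy as np

# Raw waveform: the direct input of an end-to-end model.
waveform, sr = librosa.load("speech.wav", sr=16000)  # hypothetical file

# Log-mel spectrogram: a typical input of a spectrum-based model.
mel = librosa.feature.melspectrogram(
    y=waveform, sr=sr, n_fft=400, hop_length=160, n_mels=64)
log_mel = librosa.power_to_db(mel, ref=np.max)

# The spectrogram is far more compact than the raw signal, which is
# one reason spectrum-based models can be shallower.
print(waveform.shape, log_mel.shape)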
Improving the performance of SER faces two challenges. First,
creating emotional speech datasets with high-quality annotations
is a time-consuming and potentially biased process, leading to
small-scale datasets [10]. Second, with the increasing demand for
SER in Internet-of-Things (IoT) applications, it is essential to train
lightweight neural networks for efficient model development. How-
ever, directly fine-tuning complex SER models pre-trained on large-
scale data places a high demand on computing systems [11].
More recently, neural structured learning (NSL) was proposed to
add structured signals (e. g., graphs) as the model input in addition to
the original data [12]. In particular, NSL was developed to solve the
data labelling problem in semi-supervised learning and to enable adversarial training as a defence against adversarial attacks [12]. Inspired by NSL, it is promising to construct a graph with a pre-trained model to overcome the bottlenecks caused by small-scale labelled data and the limited resources of edge devices.
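In its generic form, NSL augments a supervised loss with a neighbour-based regulariser; a sketch of the training objective in our own notation (not reproduced from [12]) is:

\mathcal{L} = \sum_{i} \ell\bigl(y_i, f(x_i)\bigr) + \alpha \sum_{i} \sum_{j \in \mathcal{N}(i)} w_{ij} \, d\bigl(h(x_i), h(x_j)\bigr),

where \ell is the supervised loss for sample x_i with label y_i, \mathcal{N}(i) denotes the graph neighbours of x_i, w_{ij} is the edge weight, h(\cdot) is a model embedding, d(\cdot,\cdot) is a distance such as the squared L2 norm, and \alpha controls the strength of the structured regularisation.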
In this study, we propose an NSL framework to transfer the
knowledge of a large, pre-trained SER model to a smaller model via graphs. To the best of the authors’ knowledge, there have been only a few studies using NSL for SER [13]. The contributions of our work are
twofold: i) transferring model knowledge through an NSL-generated
graph can improve performance by leveraging multiple databases; ii)
the proposed NSL framework can train lightweight neural networks
without a high requirement of computing resources. Evaluated on
an emotional speech dataset, our NSL framework outperforms mod-
els trained on the original data only and those trained with classic
transfer learning strategies.
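As a concrete illustration of the intended pipeline, the following self-contained Python sketch uses synthetic arrays in place of real speech features and source-model embeddings, and a plain Keras training loop in place of the actual implementation (the TensorFlow Neural Structured Learning library offers equivalent utilities); all names and hyperparameters are hypothetical:

import numpy as np
import tensorflow as tf

# Step 1: embeddings of the target samples, as produced by a model
# pre-trained on a source dataset (random projections stand in here).
rng = np.random.default_rng(0)
n, d, n_classes = 200, 64, 4
features = rng.normal(size=(n, d)).astype("float32")
labels = rng.integers(0, n_classes, size=n)
embeddings = (features @ rng.normal(size=(d, 32))).astype("float32")

# Step 2: synthesize a graph by linking each sample to its most
# similar neighbour in the embedding space (cosine similarity).
normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
sim = normed @ normed.T
np.fill_diagonal(sim, -1.0)
nbr_features = features[sim.argmax(axis=1)]  # one neighbour per sample

# Step 3: train a lightweight model; the loss couples every sample
# with its graph neighbour (graph regularisation).
alpha = 0.1
model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(n_classes),
])
optimizer = tf.keras.optimizers.Adam(1e-3)
ce = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
data = tf.data.Dataset.from_tensor_slices(
    (features, nbr_features, labels)).shuffle(n).batch(32)

for epoch in range(5):
    for x, x_nbr, y in data:
        with tf.GradientTape() as tape:
            logits = model(x, training=True)
            nbr_logits = model(x_nbr, training=True)
            supervised = ce(y, logits)
            # Neighbour term: predictions of linked samples stay close.
            neighbour = tf.reduce_mean(
                tf.reduce_sum(tf.square(logits - nbr_logits), axis=-1))
            loss = supervised + alpha * neighbour
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))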
Related Works. Deep Learning for SER. As previously mentioned,
deep learning is mainly applied to SER with spectrum-based and end-to-end models. Spectrum-based models are typically shallower and more efficient than end-to-end models due to the smaller data size of extracted spectrums compared to raw speech signals. On the other hand, end-to-end models avoid the need to select suitable spectrum types and can extract complementary features beyond fixed spectrums. An end-to-end model was developed based on 1D CNNs for
learning spatial features in SER [14]. Additionally, multiple stacked transformer layers were utilised in [8] to better extract global feature
dependencies from speech. More recently, wav2vec models, which
include CNNs and transformers, were trained with self-supervised
learning on unlabelled speech data [15]. Wav2vec models have been
applied to generate speech embeddings for SER [16, 17].
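As an illustration of this last step, a minimal Python sketch with the Hugging Face transformers library follows; the checkpoint choice and the mean pooling are our assumptions, not details taken from [15, 16, 17]:

import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# Hypothetical checkpoint; the cited works may use other variants.
name = "facebook/wav2vec2-base"
extractor = Wav2Vec2FeatureExtractor.from_pretrained(name)
model = Wav2Vec2Model.from_pretrained(name).eval()

# One second of silence stands in for a real 16 kHz utterance.
speech = torch.zeros(16000).numpy()
inputs = extractor(speech, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    frames = model(**inputs).last_hidden_state  # (1, frames, 768)
embedding = frames.mean(dim=1)  # utterance-level speech embedding
print(embedding.shape)          # torch.Size([1, 768])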