
KNOWLEDGE TRANSFER FOR ON-DEVICE SPEECH EMOTION RECOGNITION
WITH NEURAL STRUCTURED LEARNING
Yi Chang1, Zhao Ren2, Thanh Tam Nguyen3, Kun Qian4, and Björn W. Schuller1,5
1GLAM – Group on Language, Audio, & Music, Imperial College London, United Kingdom
2L3S Research Center, Leibniz University Hannover, Germany
3Griffith University, Australia
4School of Medical Technology, Beijing Institute of Technology, China
5Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg, Germany

This research was partially funded by the Federal Ministry of Education and Research (BMBF), Germany, under the project LeibnizKILabor (grant No. 01DD20003); the research project “IIP-Ecosphere”, granted by the German Federal Ministry for Economics and Climate Action (BMWK) via funding code No. 01MK20006A; the Ministry of Science and Technology of the People’s Republic of China (No. 2021ZD0201900); the National Natural Science Foundation of China (No. 62272044); the National High-Level Young Talent Project; and the BIT Teli Young Fellow Program from the Beijing Institute of Technology, China.
ABSTRACT
Speech emotion recognition (SER) has been a popular research topic
in human-computer interaction (HCI). As edge devices rapidly proliferate, deploying SER on edge devices is promising for a wide range of HCI applications. Although deep learning has been investigated to improve the performance of SER by training complex models, the memory space and computational capability of edge devices represent a constraint for embedding deep learning models.
We propose a neural structured learning (NSL) framework that builds synthesized graphs. An SER model is trained on a source
dataset and used to build graphs on a target dataset. A relatively
lightweight model is then trained with the speech samples and graphs
together as the input. Our experiments demonstrate that training a
lightweight SER model on the target dataset with speech samples
and graphs can not only produce small SER models, but also en-
hance the model performance compared to models with speech sam-
ples only and those using classic transfer learning strategies.
Index Terms—Speech emotion recognition, neural structured
learning, edge device, lightweight deep learning
1. INTRODUCTION
Speech emotion recognition (SER), which aims to recognise emo-
tional states from speech, has been a popular research topic in the
domain of human-computer interaction (HCI) [1]. SER has been
applied to a range of applications, including call centres, education,
mental health, computer games, and many others [2]. In particu-
lar, speech signals provide rich and complementary information to
other modalities, e. g., images, biosignals, social media, etc. [3]. SER
can not only improve the performance of emotion recognition when
combined with other modalities in a multimodal system, but also en-
able machines to perceive human emotions when other modalities
are not available, such as in audio-only call centres [1].
Deep learning has been widely applied to SER with many model architectures, including spectrum-based and end-to-end models. Spectrum-based models process spectrum features from speech signals [4, 5], while end-to-end models directly process raw speech signals [6, 7]. Convolutional neural networks (CNNs), recurrent neural networks (RNNs), and their variants (e. g., transformers) have been commonly used to build either end-to-end or spectrum-based models for SER [8, 9].
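To make the distinction between the two input representations concrete, the following is a minimal Python sketch assuming the librosa library; the file name and parameter values are illustrative, not taken from the cited works:

import librosa
import numpy as np

# Raw waveform: the direct input of an end-to-end model.
waveform, sr = librosa.load("speech.wav", sr=16000)  # hypothetical file

# Log-mel spectrogram: a typical input of a spectrum-based model.
mel = librosa.feature.melspectrogram(
    y=waveform, sr=sr, n_fft=400, hop_length=160, n_mels=64)
log_mel = librosa.power_to_db(mel, ref=np.max)

# The spectrogram is far more compact than the raw signal, which is
# one reason spectrum-based models can be shallower.
print(waveform.shape, log_mel.shape)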
Improving the performance of SER faces two challenges. First,
creating emotional speech datasets with high-quality annotations
is a time-consuming and potentially biased process, leading to
small-scale datasets [10]. Second, with the increasing demand for
SER in Internet-of-Things (IoT) applications, it is essential to train
lightweight neural networks for efficient model development. How-
ever, directly fine-tuning complex SER models pre-trained on large-
scale data places a high demand on computing systems [11].
More recently, neural structured learning (NSL) was proposed to
add structured signals (e. g., graphs) as the model input in addition to
the original data [12]. In particular, NSL was developed to solve the
data labelling problem in semi-supervised learning and to enable adversarial training as a defence against adversarial attacks [12]. Inspired by NSL, it is promising to construct a graph with a pre-trained model to overcome the bottlenecks caused by small-scale labelled data and the limited resources of edge devices.
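In its generic form, NSL augments a supervised loss with a neighbour-based regulariser; a sketch of the training objective in our own notation (not reproduced from [12]) is:

\mathcal{L} = \sum_{i} \ell\bigl(y_i, f(x_i)\bigr) + \alpha \sum_{i} \sum_{j \in \mathcal{N}(i)} w_{ij} \, d\bigl(h(x_i), h(x_j)\bigr),

where \ell is the supervised loss for sample x_i with label y_i, \mathcal{N}(i) denotes the graph neighbours of x_i, w_{ij} is the edge weight, h(\cdot) is a model embedding, d(\cdot,\cdot) is a distance such as the squared L2 norm, and \alpha controls the strength of the structured regularisation.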
In this study, we propose an NSL framework to transfer the
knowledge of a large, pre-trained SER model to a smaller model via graphs. To the best of the authors’ knowledge, there have been only a few studies using NSL for SER [13]. The contributions of our work are
twofold: i) transferring model knowledge through an NSL-generated
graph can improve performance by leveraging multiple databases; ii)
the proposed NSL framework can train lightweight neural networks
without a high requirement of computing resources. Evaluated on
an emotional speech dataset, our NSL framework outperforms mod-
els trained on the original data only and those trained with classic
transfer learning strategies.
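As a concrete illustration of the intended pipeline, the following self-contained Python sketch uses synthetic arrays in place of real speech features and source-model embeddings, and a plain Keras training loop in place of the actual implementation (the TensorFlow Neural Structured Learning library offers equivalent utilities); all names and hyperparameters are hypothetical:

import numpy as np
import tensorflow as tf

# Step 1: embeddings of the target samples, as produced by a model
# pre-trained on a source dataset (random projections stand in here).
rng = np.random.default_rng(0)
n, d, n_classes = 200, 64, 4
features = rng.normal(size=(n, d)).astype("float32")
labels = rng.integers(0, n_classes, size=n)
embeddings = (features @ rng.normal(size=(d, 32))).astype("float32")

# Step 2: synthesize a graph by linking each sample to its most
# similar neighbour in the embedding space (cosine similarity).
normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
sim = normed @ normed.T
np.fill_diagonal(sim, -1.0)
nbr_features = features[sim.argmax(axis=1)]  # one neighbour per sample

# Step 3: train a lightweight model; the loss couples every sample
# with its graph neighbour (graph regularisation).
alpha = 0.1
model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(n_classes),
])
optimizer = tf.keras.optimizers.Adam(1e-3)
ce = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
data = tf.data.Dataset.from_tensor_slices(
    (features, nbr_features, labels)).shuffle(n).batch(32)

for epoch in range(5):
    for x, x_nbr, y in data:
        with tf.GradientTape() as tape:
            logits = model(x, training=True)
            nbr_logits = model(x_nbr, training=True)
            supervised = ce(y, logits)
            # Neighbour term: predictions of linked samples stay close.
            neighbour = tf.reduce_mean(
                tf.reduce_sum(tf.square(logits - nbr_logits), axis=-1))
            loss = supervised + alpha * neighbour
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))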
Related Works. Deep Learning for SER. As previously mentioned,
deep learning is mainly applied to SER with spectrum-based and end-to-end models. Spectrum-based models are typically shallower and more efficient than end-to-end models due to the smaller data size of extracted spectrums compared to raw speech signals. On the other hand, end-to-end models avoid the need to select suitable spectrum types and can extract complementary features beyond fixed spectrums. An end-to-end model was developed based on 1D CNNs for
learning spatial features in SER [14]. Additionally, multiple stacked transformer layers were utilised in [8] to better extract global feature
dependencies from speech. More recently, wav2vec models, which
include CNNs and transformers, were trained with self-supervised
learning on unlabelled speech data [15]. Wav2vec models have been
applied to generate speech embeddings for SER [16, 17].
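As an illustration of this last step, a minimal Python sketch with the Hugging Face transformers library follows; the checkpoint choice and the mean pooling are our assumptions, not details taken from [15, 16, 17]:

import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# Hypothetical checkpoint; the cited works may use other variants.
name = "facebook/wav2vec2-base"
extractor = Wav2Vec2FeatureExtractor.from_pretrained(name)
model = Wav2Vec2Model.from_pretrained(name).eval()

# One second of silence stands in for a real 16 kHz utterance.
speech = torch.zeros(16000).numpy()
inputs = extractor(speech, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    frames = model(**inputs).last_hidden_state  # (1, frames, 768)
embedding = frames.mean(dim=1)  # utterance-level speech embedding
print(embedding.shape)          # torch.Size([1, 768])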