FAST YET EFFECTIVE SPEECH EMOTION RECOGNITION WITH SELF-DISTILLATION
Zhao Ren1, Thanh Tam Nguyen2, Yi Chang3, Björn W. Schuller3,4
1L3S Research Center, Leibniz University Hannover, Germany
2Griffith University, Australia
3GLAM – Group on Language, Audio, & Music, Imperial College London, United Kingdom
4Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg, Germany
zren@l3s.de
ABSTRACT
Speech emotion recognition (SER) is the task of recognising human emotional states from speech. SER is extremely prevalent in helping dialogue systems to truly understand our emotions and become trustworthy human conversational partners. Due to the lengthy nature of speech, SER also suffers from a lack of abundant labelled data for powerful models such as deep neural networks. Complex models pre-trained on large-scale speech datasets have been successfully applied to SER via transfer learning. However, fine-tuning complex models still requires large memory space and results in low inference efficiency. In this paper, we argue that fast yet effective SER is possible with self-distillation, a method of simultaneously fine-tuning a pre-trained model and training shallower versions of itself. The benefits of our self-distillation framework are threefold: (1) adopting self-distillation on the acoustic modality overcomes the limited ground truth of speech data and outperforms existing models' performance on an SER dataset; (2) executing powerful models at different depths can achieve adaptive accuracy-efficiency trade-offs on resource-limited edge devices; (3) a new fine-tuning process, rather than training from scratch, for self-distillation leads to faster learning time and state-of-the-art accuracy on data with small quantities of label information.
Index Terms: self-distillation, speech emotion recognition, adaptive inference, efficient deep learning, efficient edge analytics
1. INTRODUCTION
Speech emotion recognition (SER) is nowadays an essential task in many dialogue systems, such as Siri, Cortana, and Alexa [1]. By classifying human speech signals into various emotional states (e.g., happiness, surprise, anger, disgust, fear, sadness, and neutral), SER helps human-computer systems become more personalised and trustworthy, as well as adjust their contexts accordingly in car-driving, health-diagnosis, call-centre, aircraft-cockpit, and web/mobile applications [2, 3].
Existing techniques for SER are limited by the inherent lack of labelled data due to the expensive effort of annotation (e.g., thousands of hours of speech over nearly 7,000 spoken languages [4]). They often rely on large deep neural networks that are pre-trained by unsupervised learning, contrastive learning, or self-supervised learning, such as wav2vec [5], wav2vec 2.0 [4], and vq-wav2vec [6]. However, fine-tuning large models places a high demand on memory space and inference time [7].

[Funding footnote: This research was funded by the Federal Ministry of Education and Research (BMBF), Germany, under the project LeibnizKILabor with grant No. 01DD20003, and the research project "IIP-Ecosphere", granted by the German Federal Ministry for Economics and Climate Action (BMWK) via funding code No. 01MK20006A.]
In machine learning, self-distillation has emerged as a paradigm for developing a student model with a more lightweight architecture that can even outperform its teacher [8]. It has been applied particularly successfully in computer vision [8, 9]. However, in contrast to the visual modality, the acoustic modality is significantly more challenging due to limited ground truth. Self-distillation methods cannot be applied directly to SER, since they often require large labelled datasets to simultaneously train a teacher model from scratch together with shallower student versions of itself [8].
In this paper, we present a self-distillation framework for fast yet effective speech emotion recognition. While our framework is demonstrated on wav2vec 2.0 [4], one of the state-of-the-art (SOTA) pre-trained models for speech representations, it can be applied to other models and datasets with limited ground-truth information. In our framework (see Figure 1), the pre-trained wav2vec 2.0 (i.e., the teacher model) is fine-tuned together with shallower student models built from its own parameters, while the teacher and all students predict emotional states from speech samples.
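To make the training objective concrete, below is a minimal PyTorch sketch of this kind of self-distillation: classifier heads are attached at several depths of a pre-trained encoder, the deepest head acts as the teacher, and shallower heads learn from both the labels and the teacher's softened predictions. The exit depths, head design, loss weights, and temperature are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal self-distillation sketch (hypothetical configuration: exit depths,
# head design, alpha, and T are illustrative, not taken from the paper).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfDistilledSER(nn.Module):
    def __init__(self, encoder_layers, hidden_dim=768, n_emotions=4,
                 exit_layers=(3, 6, 9, 12)):
        super().__init__()
        # encoder_layers: pre-trained transformer blocks mapping
        # (batch, time, hidden_dim) -> (batch, time, hidden_dim)
        self.layers = nn.ModuleList(encoder_layers)
        self.exit_layers = exit_layers
        # one classifier head per exit depth; the deepest head is the teacher
        self.heads = nn.ModuleList(
            nn.Linear(hidden_dim, n_emotions) for _ in exit_layers)

    def forward(self, x):
        logits = []
        for i, layer in enumerate(self.layers, start=1):
            x = layer(x)
            if i in self.exit_layers:
                # mean-pool over time, then classify at this depth
                logits.append(self.heads[len(logits)](x.mean(dim=1)))
        return logits  # shallow students first, teacher last

def self_distillation_loss(logits, labels, alpha=0.3, T=2.0):
    teacher = logits[-1]
    loss = F.cross_entropy(teacher, labels)  # teacher learns from hard labels
    for student in logits[:-1]:
        # students learn from the labels and mimic the (detached) teacher
        loss = loss + (1 - alpha) * F.cross_entropy(student, labels)
        loss = loss + alpha * T * T * F.kl_div(
            F.log_softmax(student / T, dim=-1),
            F.softmax(teacher.detach() / T, dim=-1),
            reduction="batchmean")
    return loss
```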
To the best of our knowledge, this is the first attempt to develop a self-distillation framework for SER. The contributions of our self-distillation framework include: (1) the application of self-distillation to speech data overcomes the difficulty caused by limited annotations and outperforms existing models' performance on an SER dataset; (2) executing powerful models at different depths enables adaptive accuracy-efficiency trade-offs on resource-limited edge devices, as the inference sketch below illustrates; (3) a new fine-tuning process, rather than training from scratch, for self-distillation leads to faster learning time and SOTA accuracy on data with limited ground truth.
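Contribution (2) amounts to early-exit inference: because every exit head is a full classifier, a deployment can stop at the first student whose prediction is confident enough, or run deeper when resources allow. The following hedged sketch builds on the hypothetical model class above; the confidence-threshold stopping rule is one possible policy and is not stated in the paper.

```python
# Hedged sketch of adaptive early-exit inference with the model above.
# Assumes batch size 1 and that the deepest layer has an exit head.
import torch
import torch.nn.functional as F

@torch.no_grad()
def adaptive_predict(model, x, confidence=0.9):
    head_idx = 0
    for i, layer in enumerate(model.layers, start=1):
        x = layer(x)
        if i in model.exit_layers:
            probs = F.softmax(model.heads[head_idx](x.mean(dim=1)), dim=-1)
            head_idx += 1
            conf, pred = probs.max(dim=-1)
            # stop at the first exit that is confident enough;
            # the final (teacher) exit always returns
            if conf.item() >= confidence or i == model.exit_layers[-1]:
                return pred.item(), i  # predicted emotion and depth used
```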
Related Works. Spectrum features [10, 11] have often been used as the input of deep neural networks for SER [12], but selecting appropriate spectrum features is time-consuming. Moreover, the performance of SER is limited by expensive human annotation, which leaves a lack of labelled data for deep learning. More recently, self-supervised learning on speech data has shown promise in learning effective representations, and the resulting pre-trained models have been successfully fine-tuned for SER tasks [13-15]. Therefore, we apply an end-to-end self-supervised learning model, wav2vec 2.0, to SER.
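As a point of reference, the sketch below shows one common way to fine-tune a pre-trained wav2vec 2.0 for utterance-level classification with the HuggingFace transformers library; the checkpoint name, mean-pooling strategy, and linear head are assumptions, since the paper does not prescribe this exact setup.

```python
# Hedged sketch: wav2vec 2.0 as an SER backbone via HuggingFace transformers.
# Checkpoint, pooling, and the linear head are illustrative choices.
import torch.nn as nn
from transformers import Wav2Vec2Model, Wav2Vec2FeatureExtractor

class Wav2Vec2ForSER(nn.Module):
    def __init__(self, checkpoint="facebook/wav2vec2-base", n_emotions=4):
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained(checkpoint)
        self.classifier = nn.Linear(self.encoder.config.hidden_size, n_emotions)

    def forward(self, input_values):
        hidden = self.encoder(input_values).last_hidden_state  # (batch, time, dim)
        return self.classifier(hidden.mean(dim=1))             # pool over time

# raw 16 kHz waveform -> normalised model input
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
# inputs = extractor(waveform, sampling_rate=16000, return_tensors="pt").input_values
```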
Knowledge distillation is one of the popular methods for achieving high efficiency by transferring knowledge from a teacher model to a smaller student model [7]. Similar to other model compression approaches such as pruning and quantisation, it incurs information loss (and thus an accuracy penalty) and cannot overcome the accuracy-efficiency trade-off.