FAST YET EFFECTIVE SPEECH EMOTION RECOGNITION WITH SELF-DISTILLATION
Zhao Ren1, Thanh Tam Nguyen2, Yi Chang3, Björn W. Schuller3,4
1L3S Research Center, Leibniz University Hannover, Germany
2Griffith University, Australia
3GLAM – Group on Language, Audio, & Music, Imperial College London, United Kingdom
4Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg, Germany
zren@l3s.de
ABSTRACT
Speech emotion recognition (SER) is the task of recognising human emotional states from speech. SER is extremely prevalent in helping dialogue systems to truly understand our emotions and become trustworthy human conversational partners. Due to the lengthy nature of speech, SER also suffers from a lack of abundant labelled data for powerful models such as deep neural networks. Complex models pre-trained on large-scale speech datasets have been successfully applied to SER via transfer learning. However, fine-tuning complex models still requires large memory space and results in low inference efficiency. In this paper, we argue that fast yet effective SER is possible with self-distillation, a method of simultaneously fine-tuning a pre-trained model and training shallower versions of itself. The benefits of our self-distillation framework are threefold: (1) adopting self-distillation on the acoustic modality overcomes the limited ground truth of speech data and outperforms existing models' performance on an SER dataset; (2) executing powerful models at different depths can achieve adaptive accuracy-efficiency trade-offs on resource-limited edge devices; (3) a new fine-tuning process, rather than training from scratch, for self-distillation leads to faster learning time and state-of-the-art accuracy on data with small quantities of label information.
Index Terms: self-distillation, speech emotion recognition, adaptive inference, efficient deep learning, efficient edge analytics
1. INTRODUCTION
Speech emotion recognition (SER) is nowadays an essential task in many dialogue systems, such as Siri, Cortana, and Alexa [1]. By classifying human speech signals into various emotional states (e.g., happiness, surprise, anger, disgust, fear, sadness, and neutral), SER helps human-computer systems become more personalised and trustworthy, as well as adjust their contexts accordingly in car-driving, health-diagnosis, call-centre, aircraft-cockpit, and web/mobile applications [2, 3].
Existing techniques for SER are limited by the inherent lack of labelled data due to the expensive effort of annotation (e.g., thousands of hours of speech over nearly 7,000 spoken languages [4]). They often rely on large deep neural networks that are pre-trained by unsupervised learning, contrastive learning, or self-supervised learning, such as wav2vec [5], wav2vec 2.0 [4], and vq-wav2vec [6]. However, fine-tuning large models places a high demand on memory space and inference time [7].

[Funding footnote: This research was funded by the Federal Ministry of Education and Research (BMBF), Germany, under the project LeibnizKILabor with grant No. 01DD20003, and the research project "IIP-Ecosphere", granted by the German Federal Ministry for Economics and Climate Action (BMWK) via funding code No. 01MK20006A.]
In machine learning, self-distillation has emerged as a paradigm for developing a student model with a more lightweight architecture that can even outperform its teacher [8]. It has been applied particularly successfully in computer vision [8, 9]. However, in contrast to the visual modality, the acoustic modality is significantly more challenging due to limited ground truth. Self-distillation methods cannot be applied directly to SER, since they often require large labelled datasets to simultaneously train a teacher model from scratch together with shallower student versions of itself [8].
In this paper, we present a self-distillation framework for fast yet effective speech emotion recognition. While our framework is demonstrated on wav2vec 2.0 [4], one of the state-of-the-art (SOTA) pre-trained models for speech representations, it can be applied to other models and datasets with limited ground-truth information. In our framework (see Figure 1), the pre-trained wav2vec 2.0 (i.e., the teacher model) is fine-tuned together with shallower student models built from its own parameters, while the teacher and all students predict emotional states from speech samples.
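To make the training objective concrete, below is a minimal PyTorch sketch of this kind of self-distillation: classifier heads are attached at several depths of a pre-trained encoder, the deepest head acts as the teacher, and shallower heads learn from both the labels and the teacher's softened predictions. The exit depths, head design, loss weights, and temperature are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal self-distillation sketch (hypothetical configuration: exit depths,
# head design, alpha, and T are illustrative, not taken from the paper).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfDistilledSER(nn.Module):
    def __init__(self, encoder_layers, hidden_dim=768, n_emotions=4,
                 exit_layers=(3, 6, 9, 12)):
        super().__init__()
        # encoder_layers: pre-trained transformer blocks mapping
        # (batch, time, hidden_dim) -> (batch, time, hidden_dim)
        self.layers = nn.ModuleList(encoder_layers)
        self.exit_layers = exit_layers
        # one classifier head per exit depth; the deepest head is the teacher
        self.heads = nn.ModuleList(
            nn.Linear(hidden_dim, n_emotions) for _ in exit_layers)

    def forward(self, x):
        logits = []
        for i, layer in enumerate(self.layers, start=1):
            x = layer(x)
            if i in self.exit_layers:
                # mean-pool over time, then classify at this depth
                logits.append(self.heads[len(logits)](x.mean(dim=1)))
        return logits  # shallow students first, teacher last

def self_distillation_loss(logits, labels, alpha=0.3, T=2.0):
    teacher = logits[-1]
    loss = F.cross_entropy(teacher, labels)  # teacher learns from hard labels
    for student in logits[:-1]:
        # students learn from the labels and mimic the (detached) teacher
        loss = loss + (1 - alpha) * F.cross_entropy(student, labels)
        loss = loss + alpha * T * T * F.kl_div(
            F.log_softmax(student / T, dim=-1),
            F.softmax(teacher.detach() / T, dim=-1),
            reduction="batchmean")
    return loss
```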
To the best of our knowledge, this is the first attempt to develop a self-distillation framework for SER. The contributions of our self-distillation framework include: (1) the application of self-distillation to speech data overcomes the difficulty caused by limited annotations and outperforms existing models' performance on an SER dataset; (2) executing powerful models at different depths enables adaptive accuracy-efficiency trade-offs on resource-limited edge devices, as the inference sketch below illustrates; (3) a new fine-tuning process, rather than training from scratch, for self-distillation leads to faster learning time and SOTA accuracy on data with limited ground truth.
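Contribution (2) amounts to early-exit inference: because every exit head is a full classifier, a deployment can stop at the first student whose prediction is confident enough, or run deeper when resources allow. The following hedged sketch builds on the hypothetical model class above; the confidence-threshold stopping rule is one possible policy and is not stated in the paper.

```python
# Hedged sketch of adaptive early-exit inference with the model above.
# Assumes batch size 1 and that the deepest layer has an exit head.
import torch
import torch.nn.functional as F

@torch.no_grad()
def adaptive_predict(model, x, confidence=0.9):
    head_idx = 0
    for i, layer in enumerate(model.layers, start=1):
        x = layer(x)
        if i in model.exit_layers:
            probs = F.softmax(model.heads[head_idx](x.mean(dim=1)), dim=-1)
            head_idx += 1
            conf, pred = probs.max(dim=-1)
            # stop at the first exit that is confident enough;
            # the final (teacher) exit always returns
            if conf.item() >= confidence or i == model.exit_layers[-1]:
                return pred.item(), i  # predicted emotion and depth used
```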
Related Works. Spectrum features [10, 11] have often been used as the input of deep neural networks for SER [12], but selecting appropriate spectrum features is time-consuming. Moreover, the performance of SER is limited by expensive human annotation, which leaves a lack of labelled data for deep learning. More recently, self-supervised learning on speech data has shown promise in learning effective representations, and the resulting pre-trained models have been successfully fine-tuned for SER tasks [13-15]. Therefore, we apply an end-to-end self-supervised learning model, wav2vec 2.0, to SER.
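As a point of reference, the sketch below shows one common way to fine-tune a pre-trained wav2vec 2.0 for utterance-level classification with the HuggingFace transformers library; the checkpoint name, mean-pooling strategy, and linear head are assumptions, since the paper does not prescribe this exact setup.

```python
# Hedged sketch: wav2vec 2.0 as an SER backbone via HuggingFace transformers.
# Checkpoint, pooling, and the linear head are illustrative choices.
import torch.nn as nn
from transformers import Wav2Vec2Model, Wav2Vec2FeatureExtractor

class Wav2Vec2ForSER(nn.Module):
    def __init__(self, checkpoint="facebook/wav2vec2-base", n_emotions=4):
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained(checkpoint)
        self.classifier = nn.Linear(self.encoder.config.hidden_size, n_emotions)

    def forward(self, input_values):
        hidden = self.encoder(input_values).last_hidden_state  # (batch, time, dim)
        return self.classifier(hidden.mean(dim=1))             # pool over time

# raw 16 kHz waveform -> normalised model input
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
# inputs = extractor(waveform, sampling_rate=16000, return_tensors="pt").input_values
```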
Knowledge distillation is one of the popular methods for achieving high efficiency by transferring knowledge from a teacher model to a smaller student model [7]. Similar to other model compression approaches such as pruning and quantisation, it incurs information loss (and thus an accuracy penalty) and cannot overcome the accuracy-efficiency trade-off.