IMPROVING LABEL-DEFICIENT KEYWORD SPOTTING THROUGH SELF-SUPERVISED
PRETRAINING
Holger Severin Bovbjerg⋆, Zheng-Hua Tan⋆†
⋆Department of Electronic Systems, Aalborg University, Denmark
†Pioneer Centre for AI, Denmark
{hsbo,zt}@es.aau.dk
ABSTRACT
Keyword Spotting (KWS) models are becoming increasingly integrated into various systems, e.g. voice assistants. To achieve satisfactory performance, these models typically rely on a large amount of labelled data, limiting their applications to situations where such data is available. Self-supervised Learning (SSL) methods can mitigate this reliance by leveraging readily available unlabelled data. Most SSL methods for speech have primarily been studied for large models, which is not ideal, as compact KWS models are generally required. This paper explores the effectiveness of SSL on small models for KWS and establishes that SSL can enhance the performance of small KWS models when labelled data is scarce. We pretrain three compact transformer-based KWS models using Data2Vec and fine-tune them on a label-deficient setup of the Google Speech Commands data set. We find that Data2Vec pretraining leads to a significant increase in accuracy, with absolute accuracy improvements of 8.22 % to 11.18 % in label-deficient scenarios.
Index Terms: Keyword Spotting, Self-Supervised, Speech Commands, Transformer
1. INTRODUCTION
Personal assistants such as Google Assistant and Apple's Siri rely on an Automatic Speech Recognition (ASR) system, which is activated by a smaller Keyword Spotting (KWS) system in order to save resources when the ASR system is not needed [1]. Modern deep-learning-based KWS models have improved the accuracy of KWS systems. However, they need to be trained on a large amount of labelled data to generalize well, and obtaining properly labelled speech data is a labour-intensive and costly process, especially for low-resource languages.
Recently, self-supervised learning methods have been shown to learn strong representations from unlabelled data, yielding good performance on a number of downstream tasks, including KWS, when fine-tuned on a limited amount of labelled data. However, current studies mainly focus on developing universal speech models [2, 3], trained on large speech corpora such as Librispeech [4] or LibriLight [5], with the goal of obtaining a model that performs well on multiple downstream tasks. These large models are commonly evaluated on benchmarks like SUPERB [6], which require fine-tuning on multiple downstream tasks. Consequently, training these models requires numerous high-end GPUs and often several weeks, making it infeasible in many cases, e.g. due to limited time or restricted computing resources. Additionally, for many use cases, such as KWS for voice assistants, it is desirable that the models are small and efficient [1].
While knowledge distillation [7] has been investigated for transferring the representations learned by a large model to a smaller model [8, 9, 10], such methods do not remove the need to train a large model in the first place. One study used a contrastive SSL method to train smaller models without distillation from a large pretrained model and found that, contrary to former assumptions, small models are able to solve self-supervised pretext tasks without overfitting [11]. The authors also improved the performance of five different small image recognition models, ranging from 2.5 to 11 million parameters, suggesting that training small self-supervised models is feasible. Other work found that the learned parameters of large speech models suffer from redundancy across layers and proposed weight sharing to reduce parameter redundancy and network size [12].
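As an illustration of this idea, the following is a minimal PyTorch sketch of cross-layer weight sharing in a transformer encoder, where a single layer's parameters are reused at every depth; the class name and dimensions are illustrative assumptions rather than details from [12].

import torch.nn as nn

class SharedLayerEncoder(nn.Module):
    """Transformer encoder that reuses one layer's weights at every depth,
    so the parameter count stays roughly that of a single layer."""

    def __init__(self, d_model=192, n_heads=3, depth=12):
        super().__init__()
        self.shared_layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.depth = depth

    def forward(self, x):  # x: (batch, time, d_model)
        for _ in range(self.depth):  # the same weights are applied `depth` times
            x = self.shared_layer(x)
        return x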
In this paper, we investigate the adaptation of the general non-contrastive SSL framework Data2Vec [13] to improve KWS performance in label-deficient scenarios. We implement three variants of the Keyword Transformer (KWT) model [14], ranging from 600k to 5.4M parameters, and pretrain them using Data2Vec (a rough sketch of the pretraining objective is given after the list below). The models are evaluated on a label-deficient setup of the Google Speech Commands data set [15], with only 20 % labelled data available for supervised training, and the results show the following:
1. Self-supervised pretraining significantly improves the KWS performance for all three models when the amount of labelled data is limited, indicating that self-supervised learning can also be beneficial for small models.
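For concreteness, the following is a minimal PyTorch sketch of Data2Vec-style pretraining for a small transformer KWS encoder: a student network processes a masked log-Mel spectrogram and regresses the average of the top-k layer outputs of an exponential-moving-average (EMA) teacher that sees the unmasked input. The class names, default dimensions, masking ratio and hyperparameters (top_k, tau) are illustrative assumptions, not the paper's exact implementation.

import copy
import torch
import torch.nn as nn
import torch.nn.functional as F


class KWSEncoder(nn.Module):
    """Small transformer encoder over log-Mel frames of shape (batch, time, n_mels)."""

    def __init__(self, n_mels=40, d_model=192, n_heads=3, n_layers=12):
        super().__init__()
        self.proj = nn.Linear(n_mels, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.layers = nn.ModuleList([copy.deepcopy(layer) for _ in range(n_layers)])

    def forward(self, x):
        x = self.proj(x)
        hidden = []
        for layer in self.layers:
            x = layer(x)
            hidden.append(x)  # keep every layer's output for target averaging
        return hidden


class Data2VecPretrainer(nn.Module):
    """Student regresses the EMA teacher's averaged top-k layer outputs
    at masked positions (non-contrastive regression objective)."""

    def __init__(self, encoder, top_k=8, tau=0.999):
        super().__init__()
        self.student = encoder
        self.teacher = copy.deepcopy(encoder)
        for p in self.teacher.parameters():
            p.requires_grad_(False)
        d_model = encoder.proj.out_features
        self.mask_emb = nn.Parameter(torch.zeros(d_model))
        self.head = nn.Linear(d_model, d_model)
        self.top_k, self.tau = top_k, tau

    @torch.no_grad()
    def ema_update(self):
        # teacher <- tau * teacher + (1 - tau) * student
        for ps, pt in zip(self.student.parameters(), self.teacher.parameters()):
            pt.data.mul_(self.tau).add_(ps.data, alpha=1 - self.tau)

    def forward(self, x, mask):  # x: (B, T, n_mels), mask: (B, T) boolean
        with torch.no_grad():  # the teacher sees the unmasked input
            targets = torch.stack(self.teacher(x)[-self.top_k:]).mean(0)
            targets = F.layer_norm(targets, targets.shape[-1:])
        feats = self.student.proj(x)  # the student sees masked frame embeddings
        feats = torch.where(mask.unsqueeze(-1), self.mask_emb.expand_as(feats), feats)
        for layer in self.student.layers:
            feats = layer(feats)
        pred = self.head(feats)
        # regression loss only over the masked positions
        return F.smooth_l1_loss(pred[mask], targets[mask])


# Usage on unlabelled utterances (random tensors here, for shapes only).
encoder = KWSEncoder()
pretrainer = Data2VecPretrainer(encoder)
x = torch.randn(4, 98, 40)          # 4 utterances, 98 frames, 40 Mel bins
mask = torch.rand(4, 98) < 0.5      # random masking; contiguous spans are typical
loss = pretrainer(x, mask)
loss.backward()
pretrainer.ema_update()

After pretraining, the teacher and regression head are discarded and the student encoder is fine-tuned with a keyword classification head on the 20 % labelled subset.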