IMPROVING LABEL-DEFICIENT KEYWORD SPOTTING THROUGH SELF-SUPERVISED
PRETRAINING
Holger Severin Bovbjerg⋆, Zheng-Hua Tan⋆†
⋆Department of Electronic Systems, Aalborg University, Denmark
†Pioneer Centre for AI, Denmark
{hsbo,zt}@es.aau.dk
ABSTRACT
Keyword Spotting (KWS) models are becoming increasingly
integrated into various systems, e.g. voice assistants. To
achieve satisfactory performance, these models typically rely
on large amounts of labelled data, limiting their application
to situations where such data is available. Self-
supervised Learning (SSL) methods can mitigate such a re-
liance by leveraging readily available unlabelled data. However,
SSL methods for speech have primarily been studied for large
models, which is not ideal, as compact KWS models are generally
required. This paper explores the effectiveness
of SSL on small models for KWS and establishes that SSL
can enhance the performance of small KWS models when la-
belled data is scarce. We pretrain three compact transformer-
based KWS models using Data2Vec, and fine-tune them on
a label-deficient setup of the Google Speech Commands data
set. We find that Data2Vec pretraining leads to a significant
increase in accuracy, with absolute accuracy improvements of
8.22 % to 11.18 % in label-deficient scenarios.
Index Terms—Keyword Spotting, Self-Supervised,
Speech Commands, Transformer
1. INTRODUCTION
Personal assistants such as Google Assistant and Apple's Siri
commonly employ an Automatic Speech Recognition (ASR) system,
which is activated by a smaller Keyword Spotting (KWS) system
to save resources when the ASR system is not needed [1]. Modern
deep-learning-based KWS models have improved the accuracy of
KWS systems. However, they need to be trained on large amounts
of labelled data to generalize well, and obtaining properly
labelled speech data is a labour-intensive and costly process,
especially for low-resource languages.
Recently, self-supervised learning methods have been shown
to learn strong representations from unlabelled data,
yielding good performance on a number of downstream tasks,
including KWS, when fine-tuned on a limited amount of la-
belled data. However, current studies mainly focus on devel-
oping universal speech models [2, 3], which are trained on
large speech corpora such as LibriSpeech [4] or LibriLight
[5], with the goal of obtaining a model that can perform well
for multiple downstream tasks. These large models are com-
monly evaluated on benchmarks like SUPERB [6], requir-
ing fine-tuning on multiple downstream tasks. Consequently,
training these models requires numerous high-end GPUs and
often several weeks of training, making such training
infeasible in many cases, e.g., due to limited time or restricted
computing resources. Additionally, for many use cases, such
as KWS for voice assistants, it is desirable that the models are
small and efficient [1].
While knowledge distillation [7] has been investigated for
transferring the representations learned by a large model to
a smaller model [8, 9, 10], such methods do not remove the
need to train a large model in the first place. One study used
a contrastive SSL method to
train smaller models without distillation from a large pre-
trained model and found that, contrary to former assump-
tions, small models are able to solve the self-supervised pre-
text tasks without overfitting [11]. Additionally, they were
able to improve the performance of five different small im-
age recognition models, ranging from 2.5 to 11 million pa-
rameters, suggesting that training small self-supervised mod-
els is feasible. Other work found that the learned parameters
of large speech models suffer from redundancy across layers,
and proposed weight sharing to reduce parameter redundancy
and network size [12].
In this paper, we investigate the adaptation of the general
non-contrastive SSL framework Data2Vec [13] to improve
KWS performance in label-deficient scenarios. We imple-
ment three variations of the Keyword Transformer (KWT)
model [14], ranging from 600k to 5.4M parameters, and pre-
train the models using Data2Vec. The models are evaluated
on a label-deficient setup of the Google Speech Commands
data set [15] with only 20 % labelled data for supervised
training, and the results show the following:
1. Self-supervised pretraining significantly improves the
KWS performance for all three models when the
amount of labelled data is limited, indicating that self-
supervised learning can also be beneficial for small
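As a concrete illustration of the pretraining setup described above, the sketch below shows Data2Vec-style self-supervised pretraining of a compact transformer encoder in PyTorch. It is a minimal sketch under stated assumptions: the encoder interface (a hypothetical KWTEncoder exposing a return_all_layers flag), the random frame masking, and the hyperparameters mask_ratio, top_k and tau are illustrative choices, not the exact implementation used in this work.

import copy
import torch
import torch.nn.functional as F

def ema_update(teacher, student, tau=0.999):
    # Exponential moving average update of the teacher parameters.
    with torch.no_grad():
        for pt, ps in zip(teacher.parameters(), student.parameters()):
            pt.mul_(tau).add_(ps, alpha=1.0 - tau)

def data2vec_step(student, teacher, features, mask_ratio=0.5, top_k=4):
    # One self-supervised step: the student sees masked input features and
    # regresses the teacher's averaged top-k layer outputs on the clean input.
    B, T, D = features.shape
    mask = torch.rand(B, T, device=features.device) < mask_ratio

    # Teacher targets: mean of the last top_k transformer layer outputs,
    # computed on the unmasked input, without gradients.
    with torch.no_grad():
        teacher_layers = teacher(features, return_all_layers=True)
        targets = torch.stack(teacher_layers[-top_k:]).mean(dim=0)
        targets = F.layer_norm(targets, targets.shape[-1:])

    # Student prediction on the masked input; loss only over masked frames.
    masked = features.masked_fill(mask.unsqueeze(-1), 0.0)
    pred = student(masked, return_all_layers=True)[-1]
    return F.smooth_l1_loss(pred[mask], targets[mask])

# Hypothetical usage with a KWTEncoder returning a list of layer outputs:
#   student = KWTEncoder(); teacher = copy.deepcopy(student)
#   for p in teacher.parameters(): p.requires_grad_(False)
#   loss = data2vec_step(student, teacher, feature_batch)  # [B, T, D] input
#   loss.backward(); optimizer.step(); optimizer.zero_grad()
#   ema_update(teacher, student)

Because the teacher is merely a slowly updated copy of the student, this setup requires neither training a separate large model nor distilling from one, which is the property that makes it attractive for small KWS models.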