IMPROVING LABEL-DEFICIENT KEYWORD SPOTTING THROUGH SELF-SUPERVISED
PRETRAINING
Holger Severin Bovbjerg⋆, Zheng-Hua Tan⋆†
⋆Department of Electronic Systems, Aalborg University, Denmark
†Pioneer Centre for AI, Denmark
{hsbo,zt}@es.aau.dk
ABSTRACT
Keyword Spotting (KWS) models are becoming increasingly
integrated into various systems, e.g. voice assistants. To
achieve satisfactory performance, these models typically rely
on large amounts of labelled data, limiting their application
to situations where such data is available. Self-
supervised Learning (SSL) methods can mitigate such a re-
liance by leveraging readily available unlabelled data. However,
SSL methods for speech have primarily been studied for large
models, which is not ideal, as compact KWS models are generally
required. This paper explores the effectiveness
of SSL on small models for KWS and establishes that SSL
can enhance the performance of small KWS models when la-
belled data is scarce. We pretrain three compact transformer-
based KWS models using Data2Vec, and fine-tune them on
a label-deficient setup of the Google Speech Commands data
set. We find that Data2Vec pretraining leads to a significant
increase in accuracy, with absolute accuracy improvements of
8.22 % to 11.18 % in label-deficient scenarios.
Index Terms—Keyword Spotting, Self-Supervised,
Speech Commands, Transformer
1. INTRODUCTION
Personal assistants such as Google Assistant and Apple's Siri
commonly employ an Automatic Speech Recognition (ASR) system,
which is activated by a smaller Keyword Spotting (KWS) system
to save resources when the ASR system is not needed [1]. Modern
deep-learning-based KWS models have improved the accuracy of
KWS systems. However, they need to be trained on large amounts
of labelled data to generalize well, and obtaining properly
labelled speech data is a labour-intensive and costly process,
especially for low-resource languages.
Recently, self-supervised learning methods have been shown
to learn strong representations from unlabelled data,
yielding good performance on a number of downstream tasks,
including KWS, when fine-tuned on a limited amount of la-
belled data. However, current studies mainly focus on devel-
oping universal speech models [2, 3], which are trained on
large speech corpora such as LibriSpeech [4] or LibriLight
[5], with the goal of obtaining a model that can perform well
for multiple downstream tasks. These large models are com-
monly evaluated on benchmarks like SUPERB [6], requir-
ing fine-tuning on multiple downstream tasks. Consequently,
training these models requires numerous high-end GPUs and
often several weeks of training, making such training
infeasible in many cases, e.g., due to limited time or restricted
computing resources. Additionally, for many use cases, such
as KWS for voice assistants, it is desirable that the models are
small and efficient [1].
While knowledge distillation [7] has been investigated for
transferring the representations learned by a large model to
a smaller model [8, 9, 10], such methods do not remove the
need to train a large model in the first place. One study used
a contrastive SSL method to
train smaller models without distillation from a large pre-
trained model and found that, contrary to former assump-
tions, small models are able to solve the self-supervised pre-
text tasks without overfitting [11]. Additionally, they were
able to improve the performance of five different small im-
age recognition models, ranging from 2.5 to 11 million pa-
rameters, suggesting that training small self-supervised mod-
els is feasible. Other work found that the learned parameters
of large speech models suffer from redundancy across layers,
and proposed weight sharing to reduce parameter redundancy
and network size [12].
In this paper, we investigate the adaptation of the general
non-contrastive SSL framework Data2Vec [13] to improve
KWS performance in label-deficient scenarios. We imple-
ment three variations of the Keyword Transformer (KWT)
model [14], ranging from 600k to 5.4M parameters, and pre-
train the models using Data2Vec. The models are evaluated
on a label-deficient setup of the Google Speech Commands
data set [15] with only 20 % labelled data for supervised
training, and the results show the following:
1. Self-supervised pretraining significantly improves the
KWS performance for all three models when the
amount of labelled data is limited, indicating that self-
supervised learning can also be beneficial for small
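As a concrete illustration of the pretraining setup described above, the sketch below shows Data2Vec-style self-supervised pretraining of a compact transformer encoder in PyTorch. It is a minimal sketch under stated assumptions: the encoder interface (a hypothetical KWTEncoder exposing a return_all_layers flag), the random frame masking, and the hyperparameters mask_ratio, top_k and tau are illustrative choices, not the exact implementation used in this work.

import copy
import torch
import torch.nn.functional as F

def ema_update(teacher, student, tau=0.999):
    # Exponential moving average update of the teacher parameters.
    with torch.no_grad():
        for pt, ps in zip(teacher.parameters(), student.parameters()):
            pt.mul_(tau).add_(ps, alpha=1.0 - tau)

def data2vec_step(student, teacher, features, mask_ratio=0.5, top_k=4):
    # One self-supervised step: the student sees masked input features and
    # regresses the teacher's averaged top-k layer outputs on the clean input.
    B, T, D = features.shape
    mask = torch.rand(B, T, device=features.device) < mask_ratio

    # Teacher targets: mean of the last top_k transformer layer outputs,
    # computed on the unmasked input, without gradients.
    with torch.no_grad():
        teacher_layers = teacher(features, return_all_layers=True)
        targets = torch.stack(teacher_layers[-top_k:]).mean(dim=0)
        targets = F.layer_norm(targets, targets.shape[-1:])

    # Student prediction on the masked input; loss only over masked frames.
    masked = features.masked_fill(mask.unsqueeze(-1), 0.0)
    pred = student(masked, return_all_layers=True)[-1]
    return F.smooth_l1_loss(pred[mask], targets[mask])

# Hypothetical usage with a KWTEncoder returning a list of layer outputs:
#   student = KWTEncoder(); teacher = copy.deepcopy(student)
#   for p in teacher.parameters(): p.requires_grad_(False)
#   loss = data2vec_step(student, teacher, feature_batch)  # [B, T, D] input
#   loss.backward(); optimizer.step(); optimizer.zero_grad()
#   ema_update(teacher, student)

Because the teacher is merely a slowly updated copy of the student, this setup requires neither training a separate large model nor distilling from one, which is the property that makes it attractive for small KWS models.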