
FULLY UNSUPERVISED TRAINING OF FEW-SHOT KEYWORD SPOTTING
Dongjune Lee∗, Minchan Kim∗, Sung Hwan Mun, Min Hyun Han, Nam Soo Kim
Department of Electrical and Computer Engineering and INMC,
Seoul National University, Seoul, South Korea
ABSTRACT
For training a few-shot keyword spotting (FS-KWS) model, a large labeled dataset containing a massive number of target keywords has been known to be essential for generalizing to arbitrary target keywords with only a few enrollment samples. To alleviate the expensive data collection and labeling, in this paper we propose a novel FS-KWS system trained only on synthetic data. The proposed system is based on metric learning, which enables target keywords to be detected using distance metrics. By exploiting a speech synthesis model that generates speech from pseudo phonemes instead of text, we easily obtain a large collection of multi-view samples sharing the same semantics. These samples are sufficient for training, since metric learning does not intrinsically require labeled data. None of the components in our framework requires supervision, making our method fully unsupervised. Experimental results on real datasets show that the proposed method is competitive even without any labeled or real training data.
Index Terms—user-defined keyword spotting, few-shot
keyword spotting, metric learning, speech synthesis
1. INTRODUCTION
Keyword spotting (KWS) is the task of identifying a target keyword in continuous audio streams, and it is widely deployed as a front end for voice assistants on edge devices such as smartphones and AI speakers. In general, KWS systems predetermine target keywords and are directly optimized for the selected keywords. Although existing predefined KWS models show high detection performance [1, 2, 3], the need for a large dataset containing the target keywords and the inflexibility in changing them hinder KWS models from expanding to various applications. In user-defined KWS, by contrast, users can customize the target keywords with only a few enrollment samples [4, 5, 6, 7] or in the form of a string [8, 9]. In particular, few-shot KWS (FS-KWS) has shown its feasibility through meta learning [4], transfer learning [5], and metric learning [6, 7], operating in the few-shot detection scenario. These approaches typically require training on a large corpus with many different keywords to generalize to unseen keywords from only a few samples. However, despite its potential for diverse applications, user-defined KWS has been underexplored due to the lack of a large, high-quality public corpus.

*These authors contributed equally to this work.
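As a rough illustration of the few-shot detection scenario (not any specific system from the cited works), a metric-learning KWS model embeds a handful of enrollment samples, averages them into a keyword prototype, and flags a query whose embedding lies close enough to that prototype. The embedding model is abstracted away here, and the cosine metric and threshold are illustrative assumptions:

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-9):
    """Project embeddings onto the unit sphere so dot products are cosine similarities."""
    x = np.asarray(x, dtype=float)
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def enroll(embeddings):
    """Average a few enrollment embeddings into a single keyword prototype."""
    return l2_normalize(np.mean(l2_normalize(embeddings), axis=0))

def detect(query_emb, prototype, threshold=0.7):
    """Declare a keyword hit when cosine similarity exceeds the threshold."""
    score = float(np.dot(l2_normalize(query_emb), prototype))
    return score, score >= threshold
```

With such a detector, changing the target keyword requires only re-running `enroll` on new samples, which is why no keyword-specific retraining is needed.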
Recently, with dramatic advances in deep generative models, several approaches have leveraged generative models as a data source to compensate for insufficient or unavailable labeled data. For example, data augmentation using synthesized data has been explored in KWS and automatic speech recognition (ASR) [10, 11, 12]. Text-to-speech (TTS) models are utilized to supplement less frequent utterances, such as named entities, which are difficult to collect in the real world. In the vision domain, Besnier et al. [13] effectively train a classifier with several learning strategies using only synthetic image data generated by a conditional GAN. Moreover, Jahanian et al. [14] use an unconditional GAN as a data source for representation learning. In [14], searching the latent space of the GAN provides multi-view data sharing the same semantics, which is necessary for training the contrastive objective of representation learning. These approaches show that a synthesized dataset has the potential to substitute for a real dataset, rather than being confined to a subsidiary role.
In this paper, we propose a novel framework for FS-KWS trained only on synthetic data. The proposed framework is based on metric learning, so that target keywords can be detected with only a few enrollment samples. We assume that a good FS-KWS model can extract phonetic information from any short utterance and form clusters of arbitrary phonetic chunks across various voices. Our motivation stems from the question of whether a labeled KWS dataset is indispensable for metric learning if a large collection of utterances with the same pronunciation can be acquired by other means. Notably, the metric learning objective in FS-KWS does not require any explicit textual supervision. Instead, we exploit a pseudo-TTS model [15], which is trained on a large-scale unlabeled speech corpus. The pseudo-TTS model takes a pseudo phoneme sequence extracted by wav2vec 2.0 [16] and a reference speech sample as inputs, and returns utterances reflecting various speakers and prosody. Using the phonetic representation of wav2vec 2.0, the pseudo-TTS system decomposes utterances into fine-grained factors and reassembles them into a form suitable for metric learning. We note that our proposed method is trained in a fully unsupervised manner.
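The data-generation idea above can be sketched schematically. The two functions below are toy placeholders, not the actual wav2vec 2.0 pseudo-phoneme extractor or the pseudo-TTS model of [15]; they illustrate only how rendering the same pseudo phoneme sequence with several reference speakers yields multi-view positive groups for metric learning without any text labels:

```python
import random

def extract_pseudo_phonemes(utterance_id, length=5):
    """Toy stand-in for wav2vec 2.0 pseudo-phoneme extraction:
    a deterministic integer sequence per utterance."""
    rng = random.Random(utterance_id)
    return tuple(rng.randrange(40) for _ in range(length))

def pseudo_tts(phonemes, speaker_id):
    """Toy stand-in for the pseudo-TTS model: the 'waveform' is the
    phonetic content plus a small speaker-dependent perturbation."""
    return [p + 0.01 * speaker_id for p in phonemes]

def build_metric_learning_batch(utterance_ids, speakers_per_keyword=4):
    """Multi-view positives: every rendering of the same pseudo phoneme
    sequence shares one class index -- no textual labels are involved."""
    batch = []
    for class_idx, uid in enumerate(utterance_ids):
        phonemes = extract_pseudo_phonemes(uid)
        for spk in range(speakers_per_keyword):
            batch.append((pseudo_tts(phonemes, spk), class_idx))
    return batch
```

A metric-learning objective (e.g., a prototypical or contrastive loss) then only needs these group indices, which is what makes the pipeline label-free end to end.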
978-1-6654-7189-3/22/$31.00 © 2023 IEEE
arXiv:2210.02732v2 [eess.AS] 7 Oct 2022