FULLY UNSUPERVISED TRAINING OF FEW-SHOT KEYWORD SPOTTING
Dongjune Lee, Minchan Kim, Sung Hwan Mun, Min Hyun Han, Nam Soo Kim
Department of Electrical and Computer Engineering and INMC,
Seoul National University, Seoul, South Korea
ABSTRACT
For training a few-shot keyword spotting (FS-KWS)
model, a large labeled dataset covering many target keywords
has been considered essential for generalizing to arbitrary
target keywords with only a few enrollment samples. To
alleviate such expensive data collection and labeling, in this
paper we propose a novel FS-KWS system trained only on
synthetic data. The proposed system is based on metric learning,
enabling target keywords to be detected using distance
metrics. Exploiting a speech synthesis model that generates
speech from pseudo phonemes instead of text, we easily obtain
a large collection of multi-view samples with the same
semantics. These samples suffice for training, since metric
learning does not intrinsically require labeled data. None of
the components in our framework requires any supervision,
making our method fully unsupervised. Experimental results
on real datasets show that our proposed method is competitive
even without any labeled or real data.
Index Terms— user-defined keyword spotting, few-shot
keyword spotting, metric learning, speech synthesis
1. INTRODUCTION
Keyword spotting (KWS) is the task of identifying a target
keyword in continuous audio streams, and it is broadly deployed as
a front door to voice assistants in several edge devices such
as smartphones and AI speakers. In general, KWS systems
predetermine target keywords and are directly optimized for
selected keywords. Although existing predefined KWS models
show high detection performance [1, 2, 3], the need
for a large dataset containing target keywords and the inflexibility
of changing target keywords hinder KWS models from
expanding to various applications. When it comes to user-
defined KWS, users can customize the target keywords with
only a few enrollment samples [4, 5, 6, 7] or in the form of
string [8, 9]. Few-shot KWS (FS-KWS) especially has shown
its feasibility through meta learning [4], transfer learning [5],
and metric learning [6, 7], operating on the few-shot detection
scenario. These approaches typically require learning from a
large corpus with many different keywords to achieve generalization
to unseen keywords from few samples. However,
user-defined KWS, despite its potential for diverse applications,
has remained underexplored due to the lack of a large,
high-quality public corpus.

*These authors contributed equally to this work.
Recently, with dramatic advances in deep generative mod-
els, there have been several approaches that leverage genera-
tive models as a data source to compensate for insufficient or
unavailable labeled data. For example, data augmentation us-
ing synthesized data has been explored in KWS and automatic
speech recognition (ASR) [10, 11, 12]. Text-to-speech (TTS)
models are utilized to supplement less frequent utterances,
such as named entities, which are difficult to collect in the real
world. In the vision domain, Besnier et al. [13] effectively
train a classifier with several learning strategies using only
synthetic image data generated by a conditional GAN. Moreover,
Jahanian et al. [14] use an unconditional GAN as a data
source for representation learning: searching the latent space
of the GAN yields multi-view data with shared semantics,
which is required for the contrastive objective of representation
learning. These approaches show that a synthesized
dataset has the potential to substitute for a real dataset, rather
than merely serving as a subsidiary one.
In this paper, we propose a novel framework for FS-KWS
trained on only synthetic data. The proposed framework
is based on metric learning so as to detect target keywords
with few-shot enrollments. We assume that a good FS-KWS
model can extract phonetic information from any short utterance
and cluster arbitrary phonetic chunks across various
voices. Our motivation stems from the question of whether
a labeled KWS dataset is indispensable for metric learning
if we can acquire a large collection of utterances with the
same pronunciation by another route. The notable point
is that the metric learning objective in FS-KWS does not require
any explicit textual supervision. Instead, we exploit the
pseudo-TTS model [15], which is trained on a large-scale
unlabeled speech corpus. The pseudo-TTS model takes a
pseudo phoneme sequence extracted by wav2vec2.0 [16] and
a reference speech as inputs, and returns utterances reflecting
various speakers and prosody. Using the phonetic representation
of wav2vec2.0, the pseudo-TTS system decomposes
utterances into fine-grained factors and reassembles them
into a form suitable for metric learning. We
notice that our proposed method is trained in a fully unsupervised
manner, as all of its components, including wav2vec2.0
and pseudo-TTS, are trained without any supervision. In
our experiments, we find that even without using any real
or labeled data, the proposed model demonstrates high performance
on KWS datasets [16, 17]. The experimental results
show the potential of unsupervised speech synthesis
for user-defined KWS.

978-1-6654-7189-3/22/$31.00 © 2023 IEEE
arXiv:2210.02732v2 [eess.AS] 7 Oct 2022
2. BACKGROUND
2.1. Metric Learning based KWS
Most current KWS models are trained and evaluated under
classification objectives, treating all the possible non-target
keywords as a single class. Although there exist countless
non-target sounds, only a limited set of non-target samples is
used during training, which can degrade detection performance.
Furthermore, customizing target keywords in the classifica-
tion scenario is difficult since the models are not trained to
distinguish the diversity of non-target sounds and only re-
spond to the original target keywords. To overcome these
problems, several metric learning based KWS models have
recently emerged. The goal of metric learning for KWS is to
acquire a general representation for KWS and detect target
keywords using distance metrics. For example, Huh et
al. [6] explore several metric learning objectives, such as triplet
loss [18] and prototypical loss [19], for training KWS. In addition,
Kim et al. [20] suggest a multiple dummy prototype
generator to handle open-set queries efficiently. Considering
that KWS is closer to a detection task than a classification
task in real-world scenarios, metric learning methods are
advantageous over classification approaches, as they effectively
handle unknown-category samples via distance metrics.
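As a concrete illustration of detection via distance metrics, the enrollment-and-threshold step might look like the following sketch. It assumes an encoder has already mapped audio to fixed-dimensional embeddings; the function names and the threshold value are illustrative, not from any cited work.

```python
import numpy as np

def enroll(embeddings):
    """Average a user's few enrollment embeddings into one keyword prototype."""
    return np.mean(embeddings, axis=0)

def detect(query_emb, prototype, threshold):
    """Flag the query as the target keyword when its Euclidean distance
    to the enrolled prototype falls below the threshold."""
    return float(np.linalg.norm(query_emb - prototype)) < threshold
```

Because detection reduces to a distance comparison, the same trained encoder can handle any user-chosen keyword without retraining.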
2.2. TTS with pseudo phoneme
In [15], TTS with pseudo phonemes was first proposed as
a transfer learning framework for low-resource TTS. A
pseudo phoneme is a phonetic token that can be obtained
without any text labels: it is extracted from unlabeled speech
by k-means clustering of wav2vec2.0 embeddings, which
contain rich phonetic information. Kim et al. [15] pre-trained
a VITS-based [21] TTS model using pseudo phonemes and
fine-tuned the pre-trained model on a small amount of transcribed
corpus with real phonemes. For convenience, we refer
to the pre-trained model in [15] as the pseudo-TTS model.
The pseudo-TTS model takes a pseudo phoneme sequence
and a reference speech as inputs and returns synthesized
speech that carries the phonetic content of the pseudo
phonemes with the speaker and prosody of the reference
speech. Using the pseudo-TTS model with only an unlabeled
speech corpus, we can obtain massive groups of utterances
with the same pronunciation spoken by various speakers with
diverse prosody.
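The pseudo phoneme extraction described above can be sketched as follows, assuming frame-level wav2vec2.0 embeddings and k-means centroids have already been computed; the function name and the collapsing of repeated cluster ids are our illustrative choices, not an exact reproduction of [15].

```python
import numpy as np

def extract_pseudo_phonemes(frame_embeddings, centroids):
    """Assign each frame embedding to its nearest k-means centroid,
    then collapse consecutive repeats into a pseudo phoneme sequence."""
    # Squared Euclidean distance from every frame to every centroid.
    d = ((frame_embeddings[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    frame_ids = d.argmin(axis=1)
    # Consecutive frames of the same phone share a cluster id; keep one.
    tokens = [int(frame_ids[0])]
    for i in frame_ids[1:]:
        if int(i) != tokens[-1]:
            tokens.append(int(i))
    return tokens
```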
3. PROPOSED METHOD
In this section, we describe the overall framework of the pro-
posed method. The proposed method uses the pseudo-TTS
model as a data source for training FS-KWS with a metric
learning objective. The entire framework is depicted in
Figure 1.
3.1. Training
3.1.1. Training objective
Among metric learning methods, we adopt prototypical
networks [19], which operate in $N$-way $K$-shot classification,
where $N$ and $K$ denote the number of classes and supports,
respectively. At each iteration, the encoder takes $N \times (K+1)$
input speech samples $x_{1:N,\,1:K+1}$ and returns output embeddings.
Here, we fix the number of queries per class to 1 for simplicity, so
that $x_{n,1:K}$ represents the support set and $x_{n,K+1}$ the
query for class $n$. The training objective is formulated as
(1)-(3).
$$c_n = \frac{1}{K}\sum_{k=1}^{K} f_\phi(x_{n,k}), \tag{1}$$

$$p(y=n \mid x) = \frac{\exp\left(-\mathrm{dist}(f_\phi(x), c_n)\right)}{\sum_{n'=1}^{N} \exp\left(-\mathrm{dist}(f_\phi(x), c_{n'})\right)}, \tag{2}$$

$$\mathcal{L} = -\frac{1}{N}\sum_{n=1}^{N} \log p(y=n \mid x_{n,K+1}). \tag{3}$$
In (1), $c_n$ is the prototype of class $n$ and $f_\phi$ denotes
the encoder parameterized by $\phi$. In (2), $\mathrm{dist}(\cdot,\cdot)$
can be any distance metric; we use the Euclidean distance. The
objective $\mathcal{L}$ is the cross-entropy loss for the given
few-shot classification.
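Equations (1)-(3) can be written compactly in code. The following is a minimal NumPy sketch assuming the encoder f_phi has already produced the support and query embeddings; the function name and array layout are our illustrative choices.

```python
import numpy as np

def prototypical_loss(support, queries):
    """support: (N, K, D) embeddings of K shots per class.
    queries: (N, D) one query embedding per class (query n belongs
    to class n). Returns the mean cross-entropy loss of Eq. (3)."""
    # Eq. (1): class prototypes are the means of the support embeddings.
    prototypes = support.mean(axis=1)                                   # (N, D)
    # Squared Euclidean distance from each query to each prototype.
    d = ((queries[:, None, :] - prototypes[None, :, :]) ** 2).sum(-1)   # (N, N)
    # Eq. (2): softmax over negative distances (log-space for stability).
    logits = -d
    logits -= logits.max(axis=1, keepdims=True)
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Eq. (3): negative log-probability of each query's true class.
    n = len(queries)
    return -log_p[np.arange(n), np.arange(n)].mean()
```

Fixing one query per class, as in the text, keeps the episode layout simple: row n of `queries` is scored against all N prototypes, and only the diagonal entries contribute to the loss.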
3.1.2. Data generation using pseudo-TTS
In the proposed method, all training data are generated by the
pseudo-TTS model [15]. Although the pseudo-TTS model
was originally designed for the pre-training stage of transfer
learning, we exploit it to generate arbitrary speech under various
conditions. Data generation proceeds as follows. First,
we sample speech from an unlabeled speech corpus and extract
pseudo phonemes. Then, we randomly crop the pseudo
phoneme sequences to obtain arbitrary pseudo keywords,
with lengths sampled from (Lmin, Lmax); both Lmin and
Lmax are hyperparameters of data generation. Next, we
sample reference speech from a speech corpus to vary the
speaker and prosody. Consequently, we can generate unlimited
amounts of pronunciation chunks in an unsupervised
manner.
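The random cropping step can be sketched as follows. Here `phoneme_seq` is a pseudo phoneme sequence extracted from a sampled utterance, and each cropped keyword would then be synthesized several times by the pseudo-TTS model with different reference utterances to form one class of an episode; the function name is our illustrative choice.

```python
import random

def crop_pseudo_keyword(phoneme_seq, l_min, l_max, rng=random):
    """Randomly crop a pseudo phoneme sequence to a pseudo keyword
    whose length is drawn uniformly from [l_min, l_max]."""
    length = rng.randint(l_min, min(l_max, len(phoneme_seq)))
    start = rng.randint(0, len(phoneme_seq) - length)
    return phoneme_seq[start:start + length]
```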
Although the multi-view samples of each pseudo keyword
can be obtained by the pseudo-TTS model, the domain mis-
match between synthesized and real audio can significantly