
FULLY UNSUPERVISED TRAINING OF FEW-SHOT KEYWORD SPOTTING
Dongjune Lee∗, Minchan Kim∗, Sung Hwan Mun, Min Hyun Han, Nam Soo Kim
Department of Electrical and Computer Engineering and INMC,
Seoul National University, Seoul, South Korea
ABSTRACT
For training a few-shot keyword spotting (FS-KWS) model, a large labeled dataset containing a massive number of target keywords has been known to be essential for generalizing to arbitrary target keywords with only a few enrollment samples. To alleviate the expensive data collection and labeling, in this paper we propose a novel FS-KWS system trained only on synthetic data. The proposed system is based on metric learning, which enables target keywords to be detected using distance metrics. By exploiting a speech synthesis model that generates speech from pseudo phonemes instead of text, we easily obtain a large collection of multi-view samples sharing the same semantics. These samples are sufficient for training, since metric learning does not intrinsically require labeled data. None of the components in our framework requires supervision, making our method fully unsupervised. Experimental results on real datasets show that the proposed method is competitive even without any labeled or real training data.
Index Terms—user-defined keyword spotting, few-shot
keyword spotting, metric learning, speech synthesis
1. INTRODUCTION
Keyword spotting (KWS) is the task of identifying a target keyword in continuous audio streams, and it is widely deployed as a front end for voice assistants on edge devices such as smartphones and AI speakers. In general, KWS systems predetermine target keywords and are directly optimized for the selected keywords. Although existing predefined KWS models show high detection performance [1, 2, 3], the need for a large dataset containing the target keywords and the inflexibility in changing them hinder KWS models from expanding to various applications. In user-defined KWS, by contrast, users can customize the target keywords with only a few enrollment samples [4, 5, 6, 7] or in the form of a string [8, 9]. In particular, few-shot KWS (FS-KWS) has shown its feasibility through meta learning [4], transfer learning [5], and metric learning [6, 7], operating in the few-shot detection scenario. These approaches typically require training on a large corpus with many different keywords to generalize to unseen keywords from only a few samples. However, despite its potential for diverse applications, user-defined KWS has been underexplored due to the lack of a large, high-quality public corpus.

*These authors contributed equally to this work.
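As a rough illustration of the few-shot detection scenario (not any specific system from the cited works), a metric-learning KWS model embeds a handful of enrollment samples, averages them into a keyword prototype, and flags a query whose embedding lies close enough to that prototype. The embedding model is abstracted away here, and the cosine metric and threshold are illustrative assumptions:

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-9):
    """Project embeddings onto the unit sphere so dot products are cosine similarities."""
    x = np.asarray(x, dtype=float)
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def enroll(embeddings):
    """Average a few enrollment embeddings into a single keyword prototype."""
    return l2_normalize(np.mean(l2_normalize(embeddings), axis=0))

def detect(query_emb, prototype, threshold=0.7):
    """Declare a keyword hit when cosine similarity exceeds the threshold."""
    score = float(np.dot(l2_normalize(query_emb), prototype))
    return score, score >= threshold
```

With such a detector, changing the target keyword requires only re-running `enroll` on new samples, which is why no keyword-specific retraining is needed.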
Recently, with dramatic advances in deep generative models, several approaches have leveraged generative models as a data source to compensate for insufficient or unavailable labeled data. For example, data augmentation using synthesized data has been explored in KWS and automatic speech recognition (ASR) [10, 11, 12]. Text-to-speech (TTS) models are utilized to supplement less frequent utterances, such as named entities, which are difficult to collect in the real world. In the vision domain, Besnier et al. [13] effectively train a classifier with several learning strategies using only synthetic image data generated by a conditional GAN. Moreover, Jahanian et al. [14] use an unconditional GAN as a data source for representation learning. In [14], searching the latent space of the GAN provides multi-view data sharing the same semantics, which is necessary for training the contrastive objective of representation learning. These approaches show that a synthesized dataset has the potential to substitute for a real dataset, rather than being confined to a subsidiary role.
In this paper, we propose a novel framework for FS-KWS trained only on synthetic data. The proposed framework is based on metric learning, so that target keywords can be detected with only a few enrollment samples. We assume that a good FS-KWS model can extract phonetic information from any short utterance and form clusters of arbitrary phonetic chunks across various voices. Our motivation stems from the question of whether a labeled KWS dataset is indispensable for metric learning if a large collection of utterances with the same pronunciation can be acquired by other means. Notably, the metric learning objective in FS-KWS does not require any explicit textual supervision. Instead, we exploit a pseudo-TTS model [15], which is trained on a large-scale unlabeled speech corpus. The pseudo-TTS model takes a pseudo phoneme sequence extracted by wav2vec 2.0 [16] and a reference speech sample as inputs, and returns utterances reflecting various speakers and prosody. Using the phonetic representation of wav2vec 2.0, the pseudo-TTS system decomposes utterances into fine-grained factors and reassembles them into a form suitable for metric learning. We note that our proposed method is trained in a fully unsupervised manner.
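The data-generation idea above can be sketched schematically. The two functions below are toy placeholders, not the actual wav2vec 2.0 pseudo-phoneme extractor or the pseudo-TTS model of [15]; they illustrate only how rendering the same pseudo phoneme sequence with several reference speakers yields multi-view positive groups for metric learning without any text labels:

```python
import random

def extract_pseudo_phonemes(utterance_id, length=5):
    """Toy stand-in for wav2vec 2.0 pseudo-phoneme extraction:
    a deterministic integer sequence per utterance."""
    rng = random.Random(utterance_id)
    return tuple(rng.randrange(40) for _ in range(length))

def pseudo_tts(phonemes, speaker_id):
    """Toy stand-in for the pseudo-TTS model: the 'waveform' is the
    phonetic content plus a small speaker-dependent perturbation."""
    return [p + 0.01 * speaker_id for p in phonemes]

def build_metric_learning_batch(utterance_ids, speakers_per_keyword=4):
    """Multi-view positives: every rendering of the same pseudo phoneme
    sequence shares one class index -- no textual labels are involved."""
    batch = []
    for class_idx, uid in enumerate(utterance_ids):
        phonemes = extract_pseudo_phonemes(uid)
        for spk in range(speakers_per_keyword):
            batch.append((pseudo_tts(phonemes, spk), class_idx))
    return batch
```

A metric-learning objective (e.g., a prototypical or contrastive loss) then only needs these group indices, which is what makes the pipeline label-free end to end.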
978-1-6654-7189-3/22/$31.00 © 2023 IEEE
arXiv:2210.02732v2 [eess.AS] 7 Oct 2022