HEiMDaL: HIGHLY EFFICIENT METHOD FOR DETECTION AND LOCALIZATION OF WAKE-WORDS

Arnav Kundu, Mohammad Samragh Razlighi, Minsik Cho, Priyanka Padmanabhan, Devang Naik

{a_kundu, m_samraghrazlighi, minsik, priyanka_padmanabhan, naik.d}@apple.com

ABSTRACT

Streaming keyword spotting is a widely used solution for activating voice assistants. Methods that combine Deep Neural Networks with Hidden Markov Models (DNN-HMM) have proven efficient and are widely adopted in this space, primarily because they can detect the wake-up word and identify its start and end at low compute cost. However, such hybrid systems suffer from loss-metric mismatch when the DNN and HMM are trained independently. Sequence discriminative training cannot fully mitigate the loss-metric mismatch because of the inherently Markovian style of operation. We propose a low-footprint CNN model, called HEiMDaL, to detect and localize keywords in streaming conditions. We introduce an alignment-based classification loss to detect the occurrence of the keyword, along with an offset loss to predict the start of the keyword. HEiMDaL shows a 73% reduction in detection error metrics, with equivalent localization accuracy and the same memory footprint as existing DNN-HMM style models for a given wake-word.

Index Terms: Keyword Spotting, voice assistants, wake-word detection, detection, localization, BC-ResNet

1. INTRODUCTION

Voice assistants allow users to control electronic devices via vocal commands. In this setting, a device waits for the user to say a wake-word, e.g., “hey Siri/ Alexa”, which indicates the user’s intention to engage with a voice assistant. Then, the device records the remainder of the user’s utterance and transmits it to a (possibly remote) voice assistant. Since a wake-word recognizer often runs continuously on a device with small SRAM storage, the recognizer has to be parameter-efficient (i.e., it should achieve high accuracy with few weights). Additionally, to respect users’ privacy, the recognizer should be accurate: it should only start recording the user’s voice when the user intends to interact with the voice assistant.

Several contemporary wake-word detection systems utilize a Deep Neural Network (DNN) together with a Hidden Markov Model (HMM) [1, 2, 3]. In this setting, the DNN component is trained to identify word fragments (a.k.a. phonemes), while the HMM component traverses the sequence of phonemes predicted by the DNN and detects the wake-word. The combination of a DNN and an HMM can detect wake-words and the exact time at which they occur. However, such a hybrid system may suffer from loss-metric mismatch: the DNN component is trained to detect the phoneme sequence, not the keyword itself, so the trained model may be sub-optimal. Sequence discriminative training is proposed in [4], where the DNN-HMM model is optimized end to end to maximize the final HMM score for positive samples and minimize it for negative samples. However, such models are difficult to optimize because of gradient loss during backpropagation through the HMM. In addition, training DNN-HMM models takes substantially longer because of their sequence-dependent nature. More recent efforts train end-to-end CNNs to detect the underlying wake-word without an HMM [5, 6]. These models can perform well in wake-word detection because they are directly optimized to detect the wake-word, but such CNNs suffer from two major limitations: a) they have higher computational complexity than DNN-HMM based systems, and b) they cannot accurately locate the exact occurrence time of the keyword.
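To make the DNN-HMM pipeline described above concrete, the following is a minimal sketch of how a left-to-right decoder might traverse per-frame phoneme posteriors to detect a wake-word and recover its start and end frames. It is an illustration only, not the method of [1, 2, 3, 4]: the function name `decode_wakeword`, the convention that the last posterior column is a background class, and the use of a log-likelihood ratio against that background as the per-frame score are all assumptions made for this example.

```python
import math


def decode_wakeword(posteriors):
    """Viterbi-style pass over a left-to-right chain of wake-word phoneme states.

    posteriors[t] holds the (hypothetical) DNN outputs for frame t: one
    probability per wake-word phoneme state, in order, followed by one
    background probability (assumed layout).
    Returns (best_score, start_frame, end_frame) of the best path that visits
    every phoneme state in order and ends in the last state.
    """
    num_states = len(posteriors[0]) - 1
    neg_inf = float("-inf")
    score = [neg_inf] * num_states  # best path score ending in state k
    start = [-1] * num_states       # start frame of that best path
    best = (neg_inf, -1, -1)
    for t, frame in enumerate(posteriors):
        log_bg = math.log(frame[-1])
        new_score = [neg_inf] * num_states
        new_start = [-1] * num_states
        for k in range(num_states):
            if k == 0:
                # A new path may begin in the first state at any frame.
                enter, enter_start = 0.0, t
            else:
                # Otherwise a path advances from the previous phoneme state.
                enter, enter_start = score[k - 1], start[k - 1]
            if score[k] >= enter:
                prev, prev_start = score[k], start[k]  # stay in state k
            else:
                prev, prev_start = enter, enter_start  # move into state k
            if prev > neg_inf:
                # Per-frame score: log-likelihood ratio against background.
                new_score[k] = prev + math.log(frame[k]) - log_bg
                new_start[k] = prev_start
        score, start = new_score, new_start
        if score[-1] > best[0]:
            best = (score[-1], start[-1], t)
    return best


# Toy input: 2 background frames, 2 frames per phoneme state, 2 background frames.
bg = [0.01, 0.01, 0.01, 0.97]
s0 = [0.90, 0.03, 0.02, 0.05]
s1 = [0.03, 0.90, 0.02, 0.05]
s2 = [0.02, 0.03, 0.90, 0.05]
frames = [bg, bg, s0, s0, s1, s1, s2, s2, bg, bg]
best_score, start_frame, end_frame = decode_wakeword(frames)
print(start_frame, end_frame)  # 2 7 for this toy input
```

A detector built this way would compare `best_score` with a threshold. The sketch also shows where the mismatch comes from: the DNN is trained on per-frame phoneme targets, while the decision is made on the path score.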

We introduce HEiMDaL, a wake-word detection system that combines the benefits of DNN-HMMs and end-to-end models while keeping an extremely low memory footprint. In particular, we train an end-to-end model that: a) does not utilize an HMM, b) is directly trained to detect the wake-word, c) is capable of predicting the start and end time of the wake-word, and d) is more efficient and accurate than existing systems. To this end, we make the following contributions:

• We formulate a discriminative setting for training an end-to-end wake-word detection model. We train the DNN to predict a binary label for a given segment of audio (the receptive field of the network) and an offset label to predict the start of the wake-word.

• We introduce a localization-enforced classification loss along with a data mining algorithm. During training, we minimize our customized loss over the samples drawn by the mining algorithm. Our mining algorithm balances the positive and negative samples.

• Compared to a DNN-HMM model [4], our model improves the False Reject Rate by 73% at the same False

arXiv:2210.15425v1 [eess.AS] 26 Oct 2022
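The binary label, offset label, and balanced sampling named in the contributions above can be sketched as follows. This excerpt does not give the exact form of the loss or of the mining algorithm, so everything here is an assumption made for illustration: the helper names (`make_target`, `combined_loss`, `balanced_batch`), the rule that a window is positive only when it fully contains the keyword, the L1 offset penalty, the `offset_weight` parameter, and uniform sampling with replacement.

```python
import math
import random


def make_target(window_start, window_len, kw_start, kw_end):
    """Return (label, offset) for one window of audio frames.

    Assumed rule: label is 1 only when the whole keyword lies inside the
    window. The offset is the keyword start relative to the window start,
    expressed as a fraction of the window length.
    """
    window_end = window_start + window_len
    if kw_start is not None and window_start <= kw_start and kw_end <= window_end:
        return 1, (kw_start - window_start) / window_len
    return 0, 0.0


def combined_loss(p_keyword, pred_offset, label, offset, offset_weight=1.0):
    """Binary cross-entropy plus an L1 offset penalty (assumed form).

    The offset term is counted only for positive windows, where a keyword
    start actually exists.
    """
    eps = 1e-7
    p = min(max(p_keyword, eps), 1.0 - eps)
    bce = -(label * math.log(p) + (1 - label) * math.log(1.0 - p))
    return bce + offset_weight * label * abs(pred_offset - offset)


def balanced_batch(targets, batch_size, rng):
    """Draw window indices so positives and negatives are equally represented."""
    pos = [i for i, (label, _) in enumerate(targets) if label == 1]
    neg = [i for i, (label, _) in enumerate(targets) if label == 0]
    half = batch_size // 2
    return rng.choices(pos, k=half) + rng.choices(neg, k=batch_size - half)


# Toy utterance: keyword spans frames 30..70, windows are 100 frames long.
targets = [make_target(s, 100, 30, 70) for s in range(0, 100, 10)]
batch = balanced_batch(targets, batch_size=8, rng=random.Random(0))
loss = sum(combined_loss(0.5, 0.0, *targets[i]) for i in batch) / len(batch)
```

Sampling positives and negatives separately keeps the classifier from being dominated by the far more numerous windows that contain no keyword, which is one plausible reading of "balances the positive and negative samples".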