HEiMDaL: HIGHLY EFFICIENT METHOD FOR DETECTION AND LOCALIZATION OF
WAKE-WORDS
Arnav Kundu, Mohammad Samragh Razlighi, Minsik Cho, Priyanka Padmanabhan, Devang Naik
{a_kundu, m_samraghrazlighi, minsik, priyanka_padmanabhan, naik.d}@apple.com
ABSTRACT
Streaming keyword spotting is a widely used solution for ac-
tivating voice assistants. Deep Neural Networks with Hidden
Markov Model (DNN-HMM) based methods have proven to
be efficient and widely adopted in this space, primarily be-
cause of the ability to detect and identify the start and end
of the wake-up word at low compute cost. However, such
hybrid systems suffer from loss metric mismatch when the
DNN and HMM are trained independently. Sequence dis-
criminative training cannot fully mitigate the loss-metric mis-
match due to the inherent Markovian style of the operation.
We propose a low-footprint CNN model, called HEiMDaL,
to detect and localize keywords in streaming conditions. We
introduce an alignment-based classification loss to detect the
occurrence of the keyword along with an offset loss to predict
the start of the keyword. HEiMDaL reduces the False Reject Rate by
73% along with equivalent localization accuracy
and with the same memory footprint as existing DNN-HMM
style models for a given wake-word.
Index Terms- Keyword Spotting, voice assistants,
wake-word detection, detection, localization, BC-ResNet
1. INTRODUCTION
Voice assistants allow users to control electronic devices via
vocal commands. In this setting, a device waits for the user
to say a wake-word, e.g., “hey Siri/ Alexa”, which indicates
the user’s intention to engage with a voice assistant. Then,
the device records the remainder of the user’s utterance and
transmits it to a (possibly remote) voice assistant. Since a
wake-word recognizer often runs continuously on a device
with small SRAM storage, the recognizer has to be parameter-
efficient (i.e., it should have high model accuracy with few
weights). Additionally, to respect users' privacy, the recognizer
should be accurate: it should only start recording the
user’s voice when the user intends to interact with the voice
assistant.
Several contemporary wake-word detection systems uti-
lize a Deep Neural Network (DNN) together with a Hid-
den Markov Model (HMM) [1, 2, 3]. In this setting, the
DNN component is trained to identify word fragments (a.k.a.
phonemes) while the HMM component traverses the se-
quence of phonemes predicted by the DNN and detects the
wake-word. The combination of a DNN and an HMM can
detect wake-words and their exact occurring time. However,
such a hybrid system may suffer from loss metric mismatch:
the DNN component is trained to detect the phoneme se-
quence, not the keyword itself, hence, the trained model may
be sub-optimal. Sequence discriminative training was proposed
in [4], where the DNN-HMM model is optimized end to end
to minimize the final HMM score for negative samples and
maximize it for positive samples. However, such models are
difficult to optimize because of gradient loss during
backpropagation through the HMM. In addition, training
DNN-HMM models takes substantially longer due to their
sequence-dependent nature. More recent efforts train end-to-end CNNs
to detect the underlying wake-word without an HMM [5, 6].
These models can perform well in wake-word detection
because they are directly optimized to detect the wake-word,
but such CNNs suffer from two major limitations: a) they have
higher computational complexity than DNN-HMM based systems,
and b) they cannot accurately locate the exact occurrence time
of the keyword.
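To make the DNN-HMM pipeline described above concrete, the following is a minimal sketch and not the implementation used in [1, 2, 3]. It assumes the DNN emits per-frame phoneme log-posteriors and that the HMM is a left-to-right chain over the keyword's phonemes, decoded with a Viterbi pass that may begin at any frame. The function name `keyword_viterbi`, the phoneme class indices, and the choice to score each frame as log-odds against a uniform prior are all illustrative assumptions.

```python
import math


def keyword_viterbi(log_post, phoneme_ids):
    """Decode a left-to-right keyword HMM over per-frame phoneme log-posteriors.

    log_post: list of frames; each frame is a list of log-probabilities,
        one per phoneme class (assumed DNN output).
    phoneme_ids: class indices of the keyword's phonemes, in order
        (hypothetical labels for illustration).
    Returns (best_score, start_frame, end_frame) of the best path that
    finishes in the last keyword state.
    """
    n_states = len(phoneme_ids)
    neg_inf = -math.inf
    score = [neg_inf] * n_states  # best path score currently in each state
    start = [-1] * n_states       # frame at which that path entered state 0
    best = (neg_inf, -1, -1)

    for t, frame in enumerate(log_post):
        # Assumption: score frames as log-odds against a uniform prior so
        # that frames matching the expected phoneme contribute positively.
        log_prior = -math.log(len(frame))
        new_score = [neg_inf] * n_states
        new_start = [-1] * n_states
        for s in range(n_states):
            emit = frame[phoneme_ids[s]] - log_prior
            if s == 0:
                # The first state may start a fresh path at this frame.
                enter, enter_start = 0.0, t
            else:
                enter, enter_start = score[s - 1], start[s - 1]
            if score[s] >= enter:
                # Self-loop: remain in the same phoneme state.
                new_score[s], new_start[s] = score[s] + emit, start[s]
            else:
                # Advance from the previous phoneme state.
                new_score[s], new_start[s] = enter + emit, enter_start
        score, start = new_score, new_start
        if score[-1] > best[0]:
            best = (score[-1], start[-1], t)
    return best
```

The returned start and end frames are what give a DNN-HMM system its localization ability: the best path records where the first phoneme state was entered and where the last one was reached. A detector would typically compare the returned score with a tuned threshold, which is outside the scope of this sketch.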
We introduce HEiMDaL, a wake-word detection system
that simultaneously inherits the benefits of DNN-HMMs and
end-to-end models, yet with an extremely low memory footprint.
In particular, we train an end-to-end model that: a) does not
utilize an HMM, b) is directly trained to detect the wake-
word, c) is capable of predicting the start and end-time of the
wake-word, and d) is more efficient and accurate than existing
systems. To this end, we make the following contributions:
• We formulate a discriminative setting for training an
  end-to-end wake-word detection model. We train the
  DNN to predict a binary label for a given segment of
  audio (the receptive field of the network) and an offset
  label to predict the start of the wake-word.
• We introduce a localization-enforced classification loss
  along with a data mining algorithm. During training,
  we minimize our customized loss over the samples
  drawn by the mining algorithm. Our mining algorithm
  balances the positive and negative samples.
• Compared to a DNN-HMM model [4], our model improves
  the False Reject Rate by 73% at the same False
arXiv:2210.15425v1 [eess.AS] 26 Oct 2022
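The contributions above describe a network trained on a binary detection label plus an offset label, with training samples drawn so that positives and negatives are balanced. The following sketch illustrates one simple way such a setup could look. It is an assumption-laden illustration, not the loss or mining algorithm of HEiMDaL: it uses binary cross-entropy on a detection logit, an L1 offset penalty applied only to positive samples, and uniform sampling with replacement. The names `detection_offset_loss`, `balanced_batch`, and `offset_weight` are hypothetical.

```python
import math
import random


def detection_offset_loss(logit, offset_pred, label, offset_target,
                          offset_weight=1.0):
    """Per-sample loss: binary cross-entropy plus an offset penalty.

    logit: raw detection score for the audio segment.
    offset_pred / offset_target: predicted and true keyword start offsets
        (units are an assumption, e.g. a fraction of the receptive field).
    label: 1 if the segment contains the wake-word, else 0.
    """
    # Numerically stable binary cross-entropy on a logit.
    bce = max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))
    # Assumption: the offset is only meaningful when a keyword is present.
    offset_loss = abs(offset_pred - offset_target) if label == 1 else 0.0
    return bce + offset_weight * offset_loss


def balanced_batch(positives, negatives, batch_size, rng):
    """Draw a batch with equal numbers of positive and negative samples.

    Sampling is uniform with replacement; this stands in for a mining
    step and is not the algorithm proposed in the paper.
    """
    half = batch_size // 2
    batch = [(x, 1) for x in rng.choices(positives, k=half)]
    batch += [(x, 0) for x in rng.choices(negatives, k=batch_size - half)]
    rng.shuffle(batch)
    return batch
```

Under these assumptions, a negative sample contributes only the classification term, so offset predictions on non-keyword audio are left unconstrained, while positive samples are penalized both for a wrong detection score and for a misplaced start.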