
HEiMDaL: HIGHLY EFFICIENT METHOD FOR DETECTION AND LOCALIZATION OF
WAKE-WORDS
Arnav Kundu, Mohammad Samragh Razlighi, Minsik Cho, Priyanka Padmanabhan, Devang Naik
{a_kundu, m_samraghrazlighi, minsik, priyanka_padmanabhan, naik.d}@apple.com
ABSTRACT
Streaming keyword spotting is a widely used solution for ac-
tivating voice assistants. Deep Neural Networks with Hidden
Markov Model (DNN-HMM) based methods have proven to
be efficient and widely adopted in this space, primarily be-
cause of the ability to detect and identify the start and end
of the wake-up word at low compute cost. However, such
hybrid systems suffer from loss metric mismatch when the
DNN and HMM are trained independently. Sequence dis-
criminative training cannot fully mitigate the loss-metric mis-
match due to the inherent Markovian style of the operation.
We propose a low-footprint CNN model, called HEiMDaL,
to detect and localize keywords in streaming conditions. We
introduce an alignment-based classification loss to detect the
occurrence of the keyword along with an offset loss to predict
the start of the keyword. HEiMDaL shows a 73% reduction in
false reject rate, with equivalent localization accuracy
and the same memory footprint as existing DNN-HMM
style models for a given wake-word.
Index Terms: Keyword Spotting, voice assistants,
wake-word detection, detection, localization, BC-ResNet
1. INTRODUCTION
Voice assistants allow users to control electronic devices via
vocal commands. In this setting, a device waits for the user
to say a wake-word, e.g., “Hey Siri” or “Alexa”, which indicates
the user’s intention to engage with a voice assistant. Then,
the device records the remainder of the user’s utterance and
transmits it to a (possibly remote) voice assistant. Since a
wake-word recognizer often runs continuously on a device
with small SRAM storage, the recognizer has to be parameter-
efficient (i.e., it should have high model accuracy with few
weights). Additionally, to respect users' privacy, the recog-
nizer should be accurate: it should only start recording the
user’s voice when the user intends to interact with the voice
assistant.
Several contemporary wake-word detection systems uti-
lize a Deep Neural Network (DNN) together with a Hid-
den Markov Model (HMM) [1, 2, 3]. In this setting, the
DNN component is trained to identify word fragments (a.k.a.
phonemes) while the HMM component traverses the se-
quence of phonemes predicted by the DNN and detects the
wake-word. The combination of a DNN and an HMM can
detect wake-words and their exact occurring time. However,
such a hybrid system may suffer from loss-metric mismatch:
the DNN component is trained to detect the phoneme sequence,
not the keyword itself; hence, the trained model may
be sub-optimal. Sequence discriminative training is proposed
in [4], where the DNN-HMM model is optimized end to end
to minimize the final HMM score for negative samples and
maximize it for positive samples. However, such models are
difficult to optimize because of gradient loss during
backpropagation through the HMM. In addition, training DNN-HMM
models takes substantially longer due to their sequence-dependent
nature. More recent efforts train end-to-end CNNs
to detect the underlying wake-word without an HMM [5, 6].
These models can perform well in wake-word detection
because they are directly optimized to detect the wake-word,
but such CNNs suffer from two major limitations: a) they have
higher computational complexity than DNN-HMM based systems, and
b) they cannot accurately locate the exact occurrence time of the
keyword.
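
To make the DNN-HMM pipeline described above concrete, the following is a minimal sketch of how a left-to-right HMM pass might traverse per-frame phoneme log-posteriors to score a wake-word and recover its start and end frames. It is an illustration under stated assumptions, not the decoder used in [1, 2, 3] or [4]: the function name decode_wake_word, the background_id class, the use of a log-likelihood ratio against that background class as the per-frame reward, and the omission of transition probabilities are all hypothetical simplifications.

```python
import math


def decode_wake_word(log_post, phoneme_ids, background_id=0):
    """Viterbi-style pass over a left-to-right phoneme chain.

    log_post: list of frames, each a list of per-class log-probabilities
        (assumed to come from the DNN).
    phoneme_ids: class indices of the wake-word phonemes, in order.
    background_id: class index treated as "not the wake-word" (assumption).
    Returns (score, start_frame, end_frame) of the best-scoring segment.
    """
    n_states = len(phoneme_ids)
    neg_inf = -math.inf
    # score[j]: best score of a path currently in state j.
    # start[j]: frame at which that path entered state 0.
    score = [neg_inf] * n_states
    start = [-1] * n_states
    best = (neg_inf, -1, -1)

    for t, frame in enumerate(log_post):
        new_score = [neg_inf] * n_states
        new_start = [-1] * n_states
        for j, ph in enumerate(phoneme_ids):
            # Assumed frame reward: log-likelihood ratio vs. background.
            reward = frame[ph] - frame[background_id]
            if j == 0:
                # Stay in the first state, or open a fresh path here.
                if score[0] > 0.0:
                    prev, prev_start = score[0], start[0]
                else:
                    prev, prev_start = 0.0, t
            elif score[j] >= score[j - 1]:
                prev, prev_start = score[j], start[j]          # self-loop
            else:
                prev, prev_start = score[j - 1], start[j - 1]  # advance
            if prev > neg_inf:
                new_score[j] = prev + reward
                new_start[j] = prev_start
        score, start = new_score, new_start
        # A path sitting in the last state has spelled the whole keyword.
        if score[-1] > best[0]:
            best = (score[-1], start[-1], t)
    return best
```

In this sketch a detection would be declared when the returned score exceeds a tuned threshold, and the returned frame indices give the localization. Note that the DNN producing log_post is trained on phoneme targets rather than on this final score, which is the loss-metric mismatch discussed above.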
We introduce HEiMDaL, a wake-word detection system
that simultaneously inherits the benefits of DNN-HMMs and
end-to-end models, yet with an extremely low memory footprint.
In particular, we train an end-to-end model that: a) does not
utilize an HMM, b) is directly trained to detect the wake-
word, c) is capable of predicting the start and end time of the
wake-word, and d) is more efficient and accurate than existing
systems. To this end, we make the following contributions:
• We formulate a discriminative setting for training an
end-to-end wake-word detection model. We train the
DNN to predict a binary label for a given segment of
audio (the receptive field of the network) and an offset
label to predict the start of the wake-word.
• We introduce a localization-enforced classification loss
along with a data mining algorithm. During training,
we minimize our customized loss over the samples
drawn by the mining algorithm. Our mining algorithm
balances the positive and negative samples.
• Compared to a DNN-HMM model [4], our model im-
proves the False Reject Rate by 73% at the same False
arXiv:2210.15425v1 [eess.AS] 26 Oct 2022