WAKEUPNET: A MOBILE-TRANSFORMER BASED FRAMEWORK
FOR END-TO-END STREAMING VOICE TRIGGER
Zixing Zhang, Thorin Farnsworth, Senling Lin, Salah Karout
Huawei Technologies Research & Development (UK) Ltd
zixingzhang@huawei.com
ABSTRACT
End-to-end models have gradually become the mainstream technology for
voice trigger, aiming to achieve the utmost prediction accuracy with a
small footprint. In this paper, we propose an end-to-end voice trigger
framework, namely WakeupNet, which is built upon a Transformer encoder.
The purpose of this framework is to exploit the context-capturing
capability of the Transformer, as sequential information is vital for
wakeup-word detection. However, the conventional Transformer encoder is
too large for our task. To address this issue, we introduce several
model compression approaches to shrink the vanilla encoder into a tiny
one, called mobile-Transformer. To evaluate the performance of
mobile-Transformer, we conduct extensive experiments on the large,
publicly available HiMia dataset. The obtained results indicate that the
introduced mobile-Transformer significantly outperforms other frequently
used models for voice trigger in both clean and noisy scenarios.
Index Terms—voice trigger, mobile-Transformer, focal loss,
separable convolution
1. INTRODUCTION
Nowadays, voice assistants have become increasingly popular in our
daily life [1]. In these systems, voice trigger (aka keyword spotting or
wakeup-word detection) is considered one of the front-end components,
responsible for triggering the voice assistant so as to initialise the
control or interaction process. Therefore, the prediction accuracy of
voice trigger has a strong impact on the user experience of voice
assistants. Besides, it is also of significant importance to keep the
voice trigger system hardware-efficient, due to its always-on
characteristics. Thus, reducing its storage and computational cost as
far as possible, to fit the memory and energy constraints, is necessary.
Over the past few years, the Transformer encoder as well as its
variants, such as BERT [2, 3, 4, 5], has been widely used in natural
language processing (NLP) [3, 4, 5]. The major advantage of the
Transformer is its efficiency in extracting context-dependent
representations [6]. It can explicitly explore the context dependence
over a long sequence via a self-attention mechanism. Compared with
Recurrent Neural Networks (RNNs), such as long short-term memory (LSTM)
or gated recurrent unit (GRU) RNNs, the Transformer avoids the recurrent
process, which is considered unfriendly to parallel computation on GPUs.
Thus, it largely facilitates both model training and inference.
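To make the mechanism concrete, scaled dot-product self-attention can be
sketched in a few lines of NumPy (an illustrative sketch only, not the
WakeupNet implementation; all dimensions here are arbitrary):

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a feature sequence.

    x: (T, d_model) acoustic frames; w_q, w_k, w_v: (d_model, d_k)
    projection matrices. Each output frame is a weighted mixture of
    all T frames, so long-range context is captured in a single,
    parallelisable step, with no recurrence.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v            # (T, d_k) each
    scores = q @ k.T / np.sqrt(k.shape[-1])        # (T, T) frame-to-frame affinities
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # softmax over the sequence
    return weights @ v                             # (T, d_k) context-aware outputs

# Toy usage: 100 frames of 40-dim features, 32-dim projections.
rng = np.random.default_rng(0)
x = rng.standard_normal((100, 40))
w_q, w_k, w_v = (rng.standard_normal((40, 32)) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)             # shape (100, 32)
```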
Encouraged by its great success in NLP, the Transformer encoder
has recently attracted increasing attention in other domains, such
as computer vision [7] and speech processing [8, 9]. For example,
in [10], the authors conducted a comprehensive comparison among
Transformer, LSTM, and Kaldi-based hybrid models for automatic speech
recognition (ASR), and found that the Transformer achieves the best
performance on most datasets.
However, the vanilla Transformer encoder was designed without
considering deployment on edge devices. This issue largely impedes its
application, because such devices normally have strict storage and
energy-consumption limitations. Recently, much effort has been made
toward compressing the model size. For example, DistilBERT [4] was
introduced by distilling knowledge from a large model into a light one
(a minimal sketch of this idea is given below). In the context of voice
trigger, nevertheless, these models are still far larger than what we
need. To this end, in this paper we propose a compressed Transformer
encoder, namely mobile-Transformer, which enables the conventional model
to fit the task of voice trigger. Besides, we adopt an end-to-end
framework that is able to detect wakeup words in a streaming fashion.
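As a point of reference, the distillation objective behind models such
as DistilBERT trains a small student model to match the
temperature-softened output distribution of a large teacher. The sketch
below assumes a PyTorch setting; the temperature and weighting values
are illustrative, not DistilBERT's exact recipe:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Student loss = soft-target KL term + hard-label CE term.

    The KL term pushes the student towards the teacher's
    temperature-softened class distribution; the CE term keeps it
    anchored to the ground-truth labels. `temperature` and `alpha`
    are illustrative hyper-parameters, not values from [4].
    """
    t = temperature
    soft = F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)                                 # rescale gradient magnitude
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```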
2. RELATED WORK
In the literature, many approaches have been introduced for voice
trigger, which can be grouped into filler-based [11] and end-to-end
[12, 13] ones. The former regard all background noise and non-wakeup
speech as fillers, and model both the wakeup words and the fillers;
whereas the latter model only the offset (i.e., the end point) of the
wakeup words against everything else.
Typical filler-based approaches rely on ASR systems, where hidden
Markov models (HMMs) are used to represent both the wakeup word (aka
keyword) and the background audio [11]. However, their performance
highly depends on the accuracy of the underlying phoneme predictions.
Besides, the complexity of ASR systems increases their deployment
difficulty due to high memory and power requirements. To overcome these
issues, purely neural network-based approaches were then proposed. They
utilise advanced deep learning models to predict the wakeup words frame
by frame, straightforwardly taking stacked acoustic frames as inputs.
Then, a sliding window is applied to average the posteriors; once the
smoothed value surpasses a pre-defined threshold, a wakeup word is
considered detected, as sketched below. Typical work can be found
in [14, 15, 16], where convolutional neural networks (CNNs), LSTM-RNNs,
and convolutional RNNs (CRNNs) were applied, respectively.
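A minimal sketch of this smoothing-and-thresholding step (the window
length and threshold here are hypothetical values, not those used
in [14, 15, 16]):

```python
import numpy as np

def detect_wakeup(posteriors, win=30, threshold=0.8):
    """Sliding-window smoothing and thresholding of frame posteriors.

    posteriors: (T,) per-frame wakeup-word probabilities from the
    acoustic model. A moving average suppresses spurious single-frame
    peaks; a detection fires once the smoothed score exceeds
    `threshold`. Returns the first triggering frame index, or None.
    """
    kernel = np.ones(win) / win
    smoothed = np.convolve(posteriors, kernel, mode="valid")  # (T - win + 1,)
    hits = np.nonzero(smoothed > threshold)[0]
    return int(hits[0]) + win - 1 if hits.size else None      # map back to frame index
```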
Nowadays, end-to-end approaches have gradually become the
mainstream technology for voice trigger; they directly estimate the
wakeup point of keywords [12]. Compared with filler-based approaches,
the end-to-end structure is simpler. Besides, it has been shown to be
more effective, as it directly optimises the detection score [12, 13].
Typical work can be found in [12, 13], where only the offset of the
wakeup words is annotated as positive. In this paper, we focus on the
end-to-end detection system, for which a novel tiny model is designed
and investigated.
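For illustration, the frame-level targets used in such offset-based
training could be constructed as follows (an illustrative sketch;
`pos_width` is a hypothetical tolerance, not a value taken
from [12, 13]):

```python
import numpy as np

def offset_targets(num_frames, keyword_end_frame, pos_width=5):
    """Frame-level targets where only the keyword offset is positive.

    Frames within `pos_width` of the annotated end point of the wakeup
    word are labelled 1; all other frames, including those inside the
    keyword itself, stay 0. `pos_width` is a hypothetical tolerance.
    """
    targets = np.zeros(num_frames, dtype=np.float32)
    lo = max(0, keyword_end_frame - pos_width)
    hi = min(num_frames, keyword_end_frame + pos_width + 1)
    targets[lo:hi] = 1.0
    return targets
```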