WAKEUPNET: A MOBILE-TRANSFORMER BASED FRAMEWORK
FOR END-TO-END STREAMING VOICE TRIGGER
Zixing Zhang, Thorin Farnsworth, Senling Lin, Salah Karout
Huawei Technologies Research & Development (UK) Ltd
zixingzhang@huawei.com
ABSTRACT
End-to-end models have gradually become the mainstream technical approach for voice trigger, aiming to achieve the utmost prediction accuracy with a small footprint. In the present paper, we propose an end-to-end voice trigger framework, namely WakeupNet, which is built on a Transformer encoder. The purpose of this framework is to exploit the context-capturing capability of the Transformer, as sequential information is vital for wakeup-word detection. However, the conventional Transformer encoder is too large to fit our task. To address this issue, we introduce different model compression approaches to shrink the vanilla model into a tiny one, called mobile-Transformer. To evaluate the performance of mobile-Transformer, we conduct extensive experiments on a large publicly available dataset, HiMia. The obtained results indicate that the introduced mobile-Transformer significantly outperforms other frequently used models for voice trigger in both clean and noisy scenarios.
Index Terms—voice trigger, mobile-Transformer, focal loss, separable convolution
1. INTRODUCTION
Nowadays, voice assistants have become increasingly popular in our daily life [1]. In these systems, voice trigger (aka keyword spotting or wakeup-word detection) is considered one of the front-end components, responsible for triggering the voice assistant so as to initialise the control or interaction process. Therefore, the prediction accuracy of voice trigger has a strong impact on the user experience of voice assistants. Besides, it is also of significant importance to keep the voice trigger system hardware-efficient, due to its always-on characteristics. Thus, reducing its storage and computational cost as much as possible to fit the memory and energy constraints is necessary.
Over the past few years, the Transformer encoder, as well as its variants such as BERT [2, 3, 4, 5], has been widely used in natural language processing (NLP) [3, 4, 5]. The major advantage of the Transformer is its efficiency in extracting context-dependent representations [6]. It can explicitly explore context dependencies over a long sequence through a self-attention mechanism. Compared with Recurrent Neural Networks (RNNs), such as long short-term memory (LSTM) or gated recurrent unit (GRU) RNNs, the Transformer avoids the recurrent process, which hinders parallel computation on GPUs. Thus, it largely facilitates the model training and inference processes.
Encouraged by its great success in NLP, the Transformer encoder has recently attracted increasing attention in other domains, such as computer vision [7] and speech processing [8, 9]. For example, in [10], the authors conducted comprehensive studies comparing Transformer, LSTM, and Kaldi-based hybrid models for automatic speech recognition (ASR), and found that the Transformer achieves the best performance on most datasets.
However, the vanilla Transformer encoder was designed without considering deployment on edge devices. This issue largely impedes its application, because such devices normally have strict storage and energy consumption limitations. Recently, much effort has been made toward compressing the model size. For example, DistilBERT [4] was introduced by distilling knowledge from a large model into a light one. In the context of voice trigger, nevertheless, these models are still far larger than the ones we need. To this end, in this paper we propose a compressed Transformer encoder, namely mobile-Transformer, to enable the conventional model to fit the voice trigger task. Besides, we adopt an end-to-end framework that is able to detect wakeup words in a streaming fashion.
2. RELATED WORK
In the literature, many approaches have been introduced for voice trigger; they can be grouped into filler-based [11] and end-to-end ones [12, 13]. The former regard all background noise and non-wakeup speech as fillers, and model both the wakeup words and the fillers; whereas the latter model the offset of the wakeup words against everything else.
Typical filler-based approaches rely on ASR systems, where hidden Markov models (HMMs) are used to represent both the wakeup word (aka keyword) and the background audio [11]. However, their performance highly depends on the accuracy of the phoneme predictions. Besides, the complexity of ASR systems increases deployment difficulty due to their high memory and power requirements. To overcome these issues, neural-network-only approaches were then proposed. They utilise advanced deep learning models to predict the wakeup words frame-wise, straightforwardly stacking multiple acoustic frames as inputs. Then, a sliding window is applied to average the posteriors. Once the smoothed value surpasses a pre-defined threshold, a wakeup word is considered detected. Typical work can be found in [14, 15, 16], where convolutional neural networks (CNNs), LSTM-RNNs, and convolutional RNNs (CRNNs) were applied, respectively.
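For concreteness, a minimal NumPy sketch of this posterior-smoothing-and-thresholding step is given below; the window size and threshold are illustrative values, not taken from the cited systems.

```python
import numpy as np

def smooth_and_detect(posteriors, window=30, threshold=0.8):
    """Average frame-wise wakeup-word posteriors over a sliding window and
    report the first frame at which the smoothed value crosses the threshold."""
    kernel = np.ones(window) / window
    smoothed = np.convolve(posteriors, kernel, mode="valid")
    hits = np.where(smoothed > threshold)[0]
    return int(hits[0]) + window - 1 if hits.size else None

# toy posteriors from a frame-wise acoustic model (values are illustrative)
posteriors = np.concatenate([np.full(200, 0.05), np.full(40, 0.95), np.full(60, 0.05)])
print(smooth_and_detect(posteriors))  # frame index of the first detection, or None
```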
Nowadays, end-to-end approaches have gradually become the mainstream technology for voice trigger; they straightforwardly estimate the wakeup point of the keywords [12]. Compared with the filler-based approaches, the end-to-end structure is simpler. Besides, it has been shown to be more effective, as it directly optimises the detection score [12, 13]. Typical work can be found in [12, 13], where only the offset of the wakeup words is annotated as positive. In this paper, we focus on the end-to-end detection system, for which a novel tiny model is designed and investigated.
Fig. 1. WakeupNet framework of the end-to-end streaming voice trigger based on a mobile-Transformer encoder: the input features x_0, x_1, ..., x_T pass through M dilated residual causal convolutional blocks and N time-restricted self-attention blocks (the mobile-Transformer), followed by a linear decoder producing predictions ŷ_0, ŷ_1, ..., ŷ_N, which are trained against frame-level targets in which the L frames before and R frames after the wakeup-word offset are marked positive.
3. TRANSFORMER-BASED END-TO-END
ARCHITECTURE
In this section, we first briefly introduce the WakeupNet framework for end-to-end streaming voice trigger. We then describe the proposed mobile-Transformer and the employed focal loss.
3.1. Overview of WakeupNet Framework
The WakeupNet framework for voice trigger, using the mobile-Transformer encoder as a backbone, is illustrated in Fig. 1. Given sequential acoustic features {x_t, t = 0, ..., T} and corresponding labels {y_i ∈ {0, 1}, i = 0, ..., I}, where T and I are the numbers of frames and labels, respectively, the system aims to find a nonlinear mapping function f between them.
Regarding the annotations, we follow the same principle as in [12], where only the end of the wakeup word is annotated as positive (y_t = 1) and the rest as negative (y_t = 0). The benefits are at least twofold: i) it directly optimises the detection task and avoids any intermediate components compared with filler-based systems [12]; ii) it largely avoids premature or delayed triggering [12]. Nonetheless, this annotation scheme leads to a severe data imbalance problem, as the negative labels far outnumber the positive ones. To deal with this, we repeat each positive label L times before and R times after its original position, so that the number of positive labels increases by L + R times, relieving the data imbalance.
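As a rough illustration of this label expansion, the following sketch repeats the single positive frame at the wakeup-word offset; the L and R values are placeholders, not the configuration used in the experiments.

```python
import numpy as np

def expand_positive_labels(labels, L=2, R=2):
    """Turn a single positive frame at the end of the wakeup word into a run
    of positives: L frames before and R frames after the original position."""
    labels = np.asarray(labels, dtype=np.int64)
    expanded = labels.copy()
    for idx in np.where(labels == 1)[0]:
        lo, hi = max(0, idx - L), min(len(labels), idx + R + 1)
        expanded[lo:hi] = 1
    return expanded

# a single positive at the wakeup-word offset becomes a short positive run
print(expand_positive_labels([0, 0, 0, 0, 1, 0, 0, 0], L=2, R=1))
# -> [0 0 1 1 1 1 0 0]
```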
In the inference stage, acoustic features are extracted from the streaming audio signal and segmented into sequential clips {S_j, j = 0, ..., J} that share the WakeupNet perception field of W frames, with a step size of P frames. Thus, generally speaking, I = J = ⌊T/P⌋. After that, a smoothing window is applied to the obtained sequential predictions {ŷ_i ∈ [0, 1], i = 0, ..., I}. Once a smoothed prediction ỹ_i is higher than a pre-defined threshold score s, the system is triggered.
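The inference loop can be sketched as follows, assuming a score_clip function standing in for the trained WakeupNet; W, P, the smoothing span, and the threshold s are illustrative values.

```python
import numpy as np

def stream_detect(features, score_clip, W=100, P=10, smooth=5, s=0.9):
    """Slide the W-frame perception field over the feature stream with step P,
    score each clip, smooth the clip scores with a short moving average, and
    trigger once the smoothed score exceeds the threshold s."""
    scores = []
    for start in range(0, max(len(features) - W + 1, 1), P):  # about floor(T / P) clips
        clip = features[start:start + W]
        scores.append(score_clip(clip))                       # y_hat_j in [0, 1]
        window = scores[-smooth:]
        if sum(window) / len(window) > s:                     # smoothed prediction y_tilde_j
            return start + W                                  # frame index at which we trigger
    return None

# toy usage with a stand-in scorer (a real system would call the trained WakeupNet)
features = np.random.randn(1000, 40)                          # (frames, feature dim)
print(stream_detect(features, score_clip=lambda clip: 0.0))   # never triggers -> None
```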
3.2. Mobile-Transformer
The introduced mobile-Transformer is composed of M stacked VGG convolutional blocks [17], N stacked time-restricted self-attention blocks [6], and a linear decoder, as illustrated in Fig. 2.
For sequence encoding, the position information of the elements is important. However, the attention mechanism in the Transformer is neither recurrent nor convolutional [6]. In NLP, one simple remedy is to add a positional encoding [6]. Different from this explicit way, in the speech processing domain an implicit approach using convolutional operations has been shown to be effective [18, 19]; it is expected to capture the positional and contextual information automatically when a deep structure is used [19]. Besides, for streaming inference, causal convolution is adopted so that only historical signals need to be collected at inference time. More details of the VGG blocks can be found in [6].
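A minimal PyTorch sketch of a dilated residual causal convolution of this kind is given below; the channel size, kernel size, and dilation are illustrative and not the exact mobile-Transformer configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConvBlock(nn.Module):
    """Dilated residual causal convolution: pad only on the left so each
    output frame depends on current and past frames, keeping the block
    usable for streaming inference."""
    def __init__(self, channels=64, kernel_size=3, dilation=2):
        super().__init__()
        self.left_pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
        self.act = nn.ReLU()

    def forward(self, x):                          # x: (batch, channels, time)
        y = F.pad(x, (self.left_pad, 0))           # causal (left-only) padding
        y = self.act(self.conv(y))
        return x + y                               # residual connection

x = torch.randn(1, 64, 200)
print(CausalConvBlock()(x).shape)                  # torch.Size([1, 64, 200])
```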
The self-attention part contains a stack of multiple identical blocks. Each block contains an attention layer and two feedforward layers. The attention layer first applies LayerNorm, then projects the input to queries (Q), keys (K), and values (V). The attention output is calculated by

\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V, \quad (1)
\]
where d_k, the dimension of the keys, provides the scaling factor. By doing this, each obtained representation implicitly contains semantic information over the whole sequence. To jointly attend to information from different representation subspaces at different positions, we also apply the multi-head strategy [6] by splitting the queries, keys, and values into several parts. After that, two feedforward layers with ReLU activation functions follow to increase the non-linear learning capability of the block. For the self-attention and feedforward layers, residual connections and layer normalisation are applied to deal with the gradient vanishing problem when increasing the network depth and the internal covariate shift problem, respectively.
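A condensed PyTorch sketch of one such block, assuming pre-LayerNorm and a two-layer ReLU feedforward network, is shown below; the model dimensions are illustrative.

```python
import torch
import torch.nn as nn

class AttentionBlock(nn.Module):
    """Pre-LayerNorm self-attention block: an attention sub-layer followed by a
    two-layer ReLU feedforward sub-layer, each wrapped in a residual connection."""
    def __init__(self, d_model=128, n_heads=4, d_ff=256):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

    def forward(self, x, attn_mask=None):          # x: (batch, time, d_model)
        h = self.norm1(x)
        a, _ = self.attn(h, h, h, attn_mask=attn_mask)
        x = x + a                                  # residual around attention
        x = x + self.ffn(self.norm2(x))            # residual around feedforward
        return x

x = torch.randn(2, 50, 128)
print(AttentionBlock()(x).shape)                   # torch.Size([2, 50, 128])
```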
As mentioned in Section 1, low latency and low computation are vital for voice trigger due to its always-on nature. Therefore, for the Transformer encoder blocks, we follow the suggestions in [19] and use truncated self-attention because of i) its capability for streaming inference and ii) its computational efficiency. Compared with the original self-attention, which depends on the entire input sequence {x_t, t = 0, ..., T}, the truncated one only accesses the sub-sequence {x_τ, τ = t − b, ..., t, ..., t + h} at time t, with h frames of look-ahead and b frames of look-back.
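One way to realise this restriction is a boolean attention mask; the sketch below builds such a mask, with look_back and look_ahead standing in for b and h.

```python
import torch

def truncated_attention_mask(T, look_back, look_ahead):
    """Boolean mask of shape (T, T): position i may only attend to positions j
    with i - look_back <= j <= i + look_ahead; True marks disallowed pairs,
    matching the attn_mask convention of nn.MultiheadAttention."""
    idx = torch.arange(T)
    rel = idx[None, :] - idx[:, None]              # rel[i, j] = j - i
    allowed = (rel >= -look_back) & (rel <= look_ahead)
    return ~allowed

mask = truncated_attention_mask(T=6, look_back=2, look_ahead=1)
print(mask.int())
```

The resulting mask can be passed as attn_mask to the attention block sketched earlier, so that each frame attends only to its local window of past and near-future frames.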
The vanilla Transformer encoder is described in [6]; its weights mainly come from the attention layers, the feedforward layers, and the stacked structure. In the following, we compress the model in three ways.
3.2.1. Cross-Layer Parameter Sharing
Encouraged by the success of ALBERT, a lite BERT structure [5], cross-layer parameter sharing is employed in mobile-Transformer, where only the attention parameters are shared across layers. We adopt neither the all-shared strategy nor the feedforward-shared strategy, since they were empirically shown to degrade the model performance considerably [5]. The motivation behind cross-layer parameter sharing is that the semantic relationships within the sequence are expected to be similar across different layers. By doing this, the number of attention weights can be significantly reduced, to 1/N of its original size [5].
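The sketch below illustrates this attention-only sharing, reusing one attention module across all N blocks while keeping per-layer feedforward weights; it follows the description above rather than a released implementation, and the dimensions are illustrative.

```python
import torch
import torch.nn as nn

class SharedAttentionEncoder(nn.Module):
    """N pre-LayerNorm blocks that reuse a single attention module (weights
    shared across layers) while keeping per-layer feedforward parameters."""
    def __init__(self, n_layers=4, d_model=128, n_heads=4, d_ff=256):
        super().__init__()
        # one attention instance shared by every layer: roughly 1/N attention weights
        self.shared_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norms1 = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(n_layers))
        self.norms2 = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(n_layers))
        self.ffns = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_layers)
        )

    def forward(self, x, attn_mask=None):           # x: (batch, time, d_model)
        for norm1, norm2, ffn in zip(self.norms1, self.norms2, self.ffns):
            h = norm1(x)
            a, _ = self.shared_attn(h, h, h, attn_mask=attn_mask)
            x = x + a                                # shared attention, residual
            x = x + ffn(norm2(x))                    # per-layer feedforward, residual
        return x

x = torch.randn(2, 50, 128)
print(SharedAttentionEncoder()(x).shape)             # torch.Size([2, 50, 128])
```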