WAKEUPNET: A MOBILE-TRANSFORMER BASED FRAMEWORK
FOR END-TO-END STREAMING VOICE TRIGGER
Zixing Zhang, Thorin Farnsworth, Senling Lin, Salah Karout
Huawei Technologies Research & Development (UK) Ltd
zixingzhang@huawei.com
ABSTRACT
End-to-end models have gradually become the mainstream technology for
voice trigger, aiming to achieve the utmost prediction accuracy with a
small footprint. In this paper, we propose an end-to-end voice trigger
framework, namely WakeupNet, which is built upon a Transformer encoder.
The purpose of this framework is to exploit the context-capturing
capability of the Transformer, as sequential information is vital for
wakeup-word detection. However, the conventional Transformer encoder is
too large for our task. To address this issue, we introduce several
model compression approaches to shrink the vanilla encoder into a tiny
one, called mobile-Transformer. To evaluate the performance of
mobile-Transformer, we conduct extensive experiments on the large,
publicly available HiMia dataset. The obtained results indicate that the
introduced mobile-Transformer significantly outperforms other frequently
used models for voice trigger in both clean and noisy scenarios.
Index Terms—voice trigger, mobile-Transformer, focal loss,
separable convolution
1. INTRODUCTION
Nowadays, voice assistants have become increasingly popular in our
daily life [1]. In these systems, voice trigger (aka keyword spotting or
wakeup-word detection) is considered one of the front-end components,
responsible for triggering the voice assistant so as to initialise the
control or interaction process. Therefore, the prediction accuracy of
voice trigger has a strong impact on the user experience of voice
assistants. Besides, it is also of significant importance to keep the
voice trigger system hardware-efficient, due to its always-on
characteristics. Thus, reducing its storage and computational cost as
far as possible, to fit the memory and energy constraints, is necessary.
Over the past few years, the Transformer encoder as well as its
variants, such as BERT [2, 3, 4, 5], has been widely used in natural
language processing (NLP) [3, 4, 5]. The major advantage of the
Transformer is its efficiency in extracting context-dependent
representations [6]. It can explicitly explore the context dependence
over a long sequence via a self-attention mechanism. Compared with
Recurrent Neural Networks (RNNs), such as long short-term memory (LSTM)
or gated recurrent unit (GRU) RNNs, the Transformer avoids the recurrent
process, which is considered unfriendly to parallel computation on GPUs.
Thus, it largely facilitates both model training and inference.
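To make the mechanism concrete, scaled dot-product self-attention can be
sketched in a few lines of NumPy (an illustrative sketch only, not the
WakeupNet implementation; all dimensions here are arbitrary):

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a feature sequence.

    x: (T, d_model) acoustic frames; w_q, w_k, w_v: (d_model, d_k)
    projection matrices. Each output frame is a weighted mixture of
    all T frames, so long-range context is captured in a single,
    parallelisable step, with no recurrence.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v            # (T, d_k) each
    scores = q @ k.T / np.sqrt(k.shape[-1])        # (T, T) frame-to-frame affinities
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # softmax over the sequence
    return weights @ v                             # (T, d_k) context-aware outputs

# Toy usage: 100 frames of 40-dim features, 32-dim projections.
rng = np.random.default_rng(0)
x = rng.standard_normal((100, 40))
w_q, w_k, w_v = (rng.standard_normal((40, 32)) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)             # shape (100, 32)
```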
Encouraged by its great success in NLP, the Transformer encoder
has recently attracted increasing attention in other domains, such
as computer vision [7] and speech processing [8, 9]. For example,
in [10], the authors conducted a comprehensive comparison among
Transformer, LSTM, and Kaldi-based hybrid models for automatic speech
recognition (ASR), and found that the Transformer achieves the best
performance on most datasets.
However, the vanilla Transformer encoder was designed without
considering deployment on edge devices. This issue largely impedes its
application, because such devices normally have strict storage and
energy-consumption limitations. Recently, much effort has been made
toward compressing the model size. For example, DistilBERT [4] was
introduced by distilling knowledge from a large model into a light one
(a minimal sketch of this idea is given below). In the context of voice
trigger, nevertheless, these models are still far larger than what we
need. To this end, in this paper we propose a compressed Transformer
encoder, namely mobile-Transformer, which enables the conventional model
to fit the task of voice trigger. Besides, we adopt an end-to-end
framework that is able to detect wakeup words in a streaming fashion.
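As a point of reference, the distillation objective behind models such
as DistilBERT trains a small student model to match the
temperature-softened output distribution of a large teacher. The sketch
below assumes a PyTorch setting; the temperature and weighting values
are illustrative, not DistilBERT's exact recipe:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Student loss = soft-target KL term + hard-label CE term.

    The KL term pushes the student towards the teacher's
    temperature-softened class distribution; the CE term keeps it
    anchored to the ground-truth labels. `temperature` and `alpha`
    are illustrative hyper-parameters, not values from [4].
    """
    t = temperature
    soft = F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)                                 # rescale gradient magnitude
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```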
2. RELATED WORK
In the literature, many approaches have been introduced for voice
trigger, which can be grouped into filler-based [11] and end-to-end
[12, 13] ones. The former regard all background noise and non-wakeup
speech as fillers, and model both the wakeup words and the fillers;
whereas the latter model only the offset (i.e., the end point) of the
wakeup words against everything else.
Typical filler-based approaches rely on ASR systems, where hidden
Markov models (HMMs) are used to represent both the wakeup word (aka
keyword) and the background audio [11]. However, their performance
highly depends on the accuracy of the underlying phoneme predictions.
Besides, the complexity of ASR systems increases their deployment
difficulty due to high memory and power requirements. To overcome these
issues, purely neural network-based approaches were then proposed. They
utilise advanced deep learning models to predict the wakeup words frame
by frame, straightforwardly taking stacked acoustic frames as inputs.
Then, a sliding window is applied to average the posteriors; once the
smoothed value surpasses a pre-defined threshold, a wakeup word is
considered detected, as sketched below. Typical work can be found
in [14, 15, 16], where convolutional neural networks (CNNs), LSTM-RNNs,
and convolutional RNNs (CRNNs) were applied, respectively.
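A minimal sketch of this smoothing-and-thresholding step (the window
length and threshold here are hypothetical values, not those used
in [14, 15, 16]):

```python
import numpy as np

def detect_wakeup(posteriors, win=30, threshold=0.8):
    """Sliding-window smoothing and thresholding of frame posteriors.

    posteriors: (T,) per-frame wakeup-word probabilities from the
    acoustic model. A moving average suppresses spurious single-frame
    peaks; a detection fires once the smoothed score exceeds
    `threshold`. Returns the first triggering frame index, or None.
    """
    kernel = np.ones(win) / win
    smoothed = np.convolve(posteriors, kernel, mode="valid")  # (T - win + 1,)
    hits = np.nonzero(smoothed > threshold)[0]
    return int(hits[0]) + win - 1 if hits.size else None      # map back to frame index
```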
Nowadays, end-to-end approaches have gradually become the
mainstream technology for voice trigger; they directly estimate the
wakeup point of keywords [12]. Compared with filler-based approaches,
the end-to-end structure is simpler. Besides, it has been shown to be
more effective, as it directly optimises the detection score [12, 13].
Typical work can be found in [12, 13], where only the offset of the
wakeup words is annotated as positive. In this paper, we focus on the
end-to-end detection system, for which a novel tiny model is designed
and investigated.
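For illustration, the frame-level targets used in such offset-based
training could be constructed as follows (an illustrative sketch;
`pos_width` is a hypothetical tolerance, not a value taken
from [12, 13]):

```python
import numpy as np

def offset_targets(num_frames, keyword_end_frame, pos_width=5):
    """Frame-level targets where only the keyword offset is positive.

    Frames within `pos_width` of the annotated end point of the wakeup
    word are labelled 1; all other frames, including those inside the
    keyword itself, stay 0. `pos_width` is a hypothetical tolerance.
    """
    targets = np.zeros(num_frames, dtype=np.float32)
    lo = max(0, keyword_end_frame - pos_width)
    hi = min(num_frames, keyword_end_frame + pos_width + 1)
    targets[lo:hi] = 1.0
    return targets
```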