
E-BRANCHFORMER: BRANCHFORMER WITH ENHANCED MERGING
FOR SPEECH RECOGNITION
Kwangyoun Kim1, Felix Wu1, Yifan Peng2∗, Jing Pan1, Prashant Sridhar1, Kyu J. Han1, Shinji Watanabe2
1ASAPP Inc., Mountain View, CA, USA
2Carnegie Mellon University, Pittsburgh, PA, USA
∗Work done during an internship at ASAPP.
ABSTRACT
Conformer, combining convolution and self-attention sequentially to
capture both local and global information, has shown remarkable
performance and is currently regarded as the state-of-the-art for auto-
matic speech recognition (ASR). Several other studies have explored
integrating convolution and self-attention, but they have not managed
to match Conformer’s performance. The recently introduced
Branchformer achieves comparable performance to Conformer by
using dedicated branches of convolution and self-attention and merg-
ing local and global context from each branch. In this paper, we
propose E-Branchformer, which enhances Branchformer by apply-
ing an effective merging method and stacking additional point-wise
modules. E-Branchformer sets new state-of-the-art word error rates
(WERs) of 1.81% and 3.65% on the LibriSpeech test-clean and test-other
sets without using any external training data.
Index Terms—Automatic speech recognition, Conformer,
Branchformer, LibriSpeech
1. INTRODUCTION
Automatic speech recognition (ASR) is a speech-to-text task, critical
to enable spoken language understanding [1, 2, 3]. Recently, there
has been a growing interest in end-to-end (E2E) ASR models for
their simplicity, efficiency, and competitive performance. Thanks
to many improvements in E2E ASR modeling such as new model
architectures [4, 5, 6], training objectives [4, 5, 7, 8], and data aug-
mentation methods [9, 10, 11], E2E ASR continues to displace
conventional hybrid ASR in various voice recognition applications.
An acoustic encoder that extracts features from audio inputs
plays a vital role in all E2E ASR models, regardless of which
training objective is used for optimization. Recurrent neural
networks [4, 12, 13, 14] used to be the de facto model choice
for an encoder. Later, many convolution-based [15, 16, 17] and
Transformer-based [6, 18, 19, 20] models have also been intro-
duced. The strength of multi-head self-attention in capturing global
context has allowed the Transformer to show competitive perfor-
mance. Transformers and their variants have been further studied to
reduce computation cost [21, 22, 23, 24, 25] or to train deep models
stably [26]. In parallel, several studies have investigated how to
merge the strengths of multi-head self-attention with those of other
neural units. The combination of a recurrent unit with self-attention,
which does not require explicit positional embeddings, shows stable
performance even in long-form speech recognition [27]. Besides,
Conformer [28] which combines convolution and self-attention se-
quentially, and Branchformer [29] which does the combination in
parallel branches, both exhibit superior performance to the Trans-
former by processing local and global context together.
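To make the contrast concrete, the following is a minimal PyTorch sketch of the two combination strategies. These are not the exact Conformer and Branchformer blocks: the sketch omits the feed-forward modules, normalization, residual connections, and, in Branchformer's case, the convolutional gating MLP (cgMLP) used in the local branch. All module names and hyper-parameters are illustrative.

import torch
import torch.nn as nn

class SequentialMix(nn.Module):
    """Conformer-style: global context (self-attention), then local (convolution)."""
    def __init__(self, d_model: int, num_heads: int = 4, kernel: int = 31):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        # depth-wise convolution captures local context along time
        self.conv = nn.Conv1d(d_model, d_model, kernel,
                              padding=kernel // 2, groups=d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, T, d_model)
        x, _ = self.attn(x, x, x)                         # global context first
        return self.conv(x.transpose(1, 2)).transpose(1, 2)  # then local

class ParallelMix(nn.Module):
    """Branchformer-style: global and local branches computed in parallel,
    then merged (here by concatenation and a linear projection)."""
    def __init__(self, d_model: int, num_heads: int = 4, kernel: int = 31):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.conv = nn.Conv1d(d_model, d_model, kernel,
                              padding=kernel // 2, groups=d_model)
        self.merge = nn.Linear(2 * d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, T, d_model)
        y_global, _ = self.attn(x, x, x)                  # global branch
        y_local = self.conv(x.transpose(1, 2)).transpose(1, 2)  # local branch
        return self.merge(torch.cat([y_global, y_local], dim=-1))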
In this work, we extend the study of combining local and
global information and propose E-Branchformer, a descendant of
Branchformer that enhances the merging mechanism. Our enhanced
merging block utilizes one additional lightweight operation,
a depth-wise convolution, which reinforces local aggregation.
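A minimal sketch of this enhanced merge follows, assuming the depth-wise convolution operates on the concatenated branch outputs and is added back residually before the final projection; module names and the kernel size are illustrative.

import torch
import torch.nn as nn

class EnhancedMerge(nn.Module):
    """Sketch of merging with extra local aggregation: a depth-wise
    convolution over the concatenated branch outputs, added back
    residually, followed by a projection to the model dimension."""
    def __init__(self, d_model: int, kernel: int = 31):
        super().__init__()
        self.dw_conv = nn.Conv1d(2 * d_model, 2 * d_model, kernel,
                                 padding=kernel // 2, groups=2 * d_model)
        self.proj = nn.Linear(2 * d_model, d_model)

    def forward(self, y_global: torch.Tensor,
                y_local: torch.Tensor) -> torch.Tensor:  # each: (B, T, d_model)
        y = torch.cat([y_global, y_local], dim=-1)       # (B, T, 2 * d_model)
        # lightweight local aggregation across time before merging
        y = y + self.dw_conv(y.transpose(1, 2)).transpose(1, 2)
        return self.proj(y)                              # (B, T, d_model)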
Our experiments show that E-Branchformer surpasses Conformer and
Branchformer on both test-clean and test-other in LibriSpeech. Our
specific contributions include:
1. The first Conformer baseline implementation using the
attention-based encoder-decoder model that matches the
accuracy of Google’s Conformer [28].
2. An improved Branchformer baseline that achieves 0.2% and
0.5% absolute WER improvements.
3. An extensive study on various ways to merge local and global
branches.
4. The new E-Branchformer encoder architecture which sets a
new state-of-the-art performance (1.81% and 3.65% WER on
test-clean and test-other) on LibriSpeech under the constraint
of not using any external data.
5. Our code will be released at https://anonymized url for repro-
ducibility.
2. RELATED WORK
2.1. End-to-end Automatic Speech Recognition Models
E2E ASR models can be roughly categorized into three main types
based on their training objectives and decoding algorithms: connectionist
temporal classification (CTC) models [7], transducer models [5], and
attention-based encoder-decoder (AED) models, a.k.a. Listen,
Attend, and Spell (LAS) models [4].
Regardless of the training objective and high-level framework,
almost all E2E ASR models share a similar backbone: an acoustic
encoder that encodes audio features into a sequence of hidden fea-
ture vectors. Many studies have explored different modeling choices
for the acoustic encoder. Recurrent neural networks [4, 5, 12, 13, 30]
encode audio signals sequentially; convolution alternatives [15, 16,
17] process local information in parallel and aggregate them as the
network goes deeper; self-attention [6, 18, 19, 20] allows long-range
interactions and achieves superior performance. In this work, we focus
only on applying the E-Branchformer encoder under the AED framework,
but, like Conformer, it can also be applied to CTC or transducer models.
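To illustrate this shared backbone, the following is a hypothetical PyTorch skeleton, not the API of any specific toolkit: the encoder is interchangeable, and only the objective-specific head on top changes.

import torch
import torch.nn as nn

class CTCModel(nn.Module):
    """Hypothetical skeleton: any acoustic encoder (recurrent, convolutional,
    Transformer, Conformer, Branchformer, ...) produces hidden vectors; here
    a CTC head consumes them, but a transducer joiner or an AED decoder
    could attend to the same `hidden` sequence instead."""
    def __init__(self, encoder: nn.Module, d_model: int, vocab_size: int):
        super().__init__()
        self.encoder = encoder
        self.ctc_head = nn.Linear(d_model, vocab_size)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:  # (B, T, feat_dim)
        hidden = self.encoder(feats)        # (B, T', d_model) hidden features
        return self.ctc_head(hidden)        # frame-wise logits for the CTC loss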