E-BRANCHFORMER: BRANCHFORMER WITH ENHANCED MERGING
FOR SPEECH RECOGNITION
Kwangyoun Kim1, Felix Wu1, Yifan Peng2, Jing Pan1, Prashant Sridhar1, Kyu J. Han1, Shinji Watanabe2
1ASAPP Inc., Mountain View, CA, USA
2Carnegie Mellon University, Pittsburgh, PA, USA
ABSTRACT
Conformer, combining convolution and self-attention sequentially to
capture both local and global information, has shown remarkable
performance and is currently regarded as the state-of-the-art for auto-
matic speech recognition (ASR). Several other studies have explored
integrating convolution and self-attention, but they have not man-
aged to match Conformer's performance. The recently introduced
Branchformer achieves comparable performance to Conformer by
using dedicated branches of convolution and self-attention and merg-
ing local and global context from each branch. In this paper, we
propose E-Branchformer, which enhances Branchformer by apply-
ing an effective merging method and stacking additional point-wise
modules. E-Branchformer sets new state-of-the-art word error rates
(WERs) of 1.81% and 3.65% on the LibriSpeech test-clean and
test-other sets without using any external training data.
Index Terms: Automatic speech recognition, Conformer, Branchformer, LibriSpeech
1. INTRODUCTION
Automatic speech recognition (ASR) is a speech-to-text task, critical
for enabling spoken language understanding [1, 2, 3]. Recently, there
has been a growing interest in end-to-end (E2E) ASR models for
their simplicity, efficiency, and competitive performance. Thanks
to many improvements in E2E ASR modeling such as new model
architectures [4, 5, 6], training objectives [4, 5, 7, 8], and data aug-
mentation methods [9, 10, 11], E2E ASR continues to displace
conventional hybrid ASR in various voice recognition applications.
An acoustic encoder that extracts features from audio inputs
plays a vital role in all E2E ASR models regardless of which
training objective is applied to optimization. Recurrent neural
networks [4, 12, 13, 14] used to be the de facto model choice
for an encoder. Many convolution-based [15, 16, 17] and
Transformer-based [6, 18, 19, 20] models have since been intro-
duced. The strength of multi-head self-attention in capturing global
context has allowed the Transformer to show competitive perfor-
mance. Transformers and their variants have been further studied to
reduce computation cost [21, 22, 23, 24, 25] or to train deep models
stably [26]. In parallel, several studies have investigated how to
merge the strengths of multi-head self-attention with those of other
neural units. The combination of a recurrent unit with self-attention,
which does not require explicit positional embeddings, shows stable
performance even in long-form speech recognition [27]. Moreover,
Conformer [28], which combines convolution and self-attention se-
quentially, and Branchformer [29], which does the combination in
parallel branches, both exhibit superior performance to the Trans-
former by processing local and global context together.
In this work, we extend the study of combining local and
global information and propose E-Branchformer, a descendant of
Branchformer which enhances the merging mechanism. Our en-
hanced merging block utilizes one additional lightweight operation,
a depth-wise convolution, which reinforces local aggregation; a
minimal sketch of this merge follows the contribution list below. Our
experiments show that E-Branchformer surpasses Conformer and
Branchformer on both test-clean and test-other in LibriSpeech. Our
specific contributions include:
1. The first Conformer baseline implementation using the
attention-based encoder-decoder model that matches the
accuracy of Google’s Conformer [28].
2. An improved Branchformer baseline with 0.2% and 0.5%
absolute WER improvements.
3. An extensive study on various ways to merge local and global
branches.
4. The new E-Branchformer encoder architecture which sets a
new state-of-the-art performance (1.81% and 3.65% WER on
test-clean and test-other) on LibriSpeech under the constraint
of not using any external data.
5. Our code will be released at https://anonymized url for reproducibility.
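As referenced above, the following minimal PyTorch sketch illustrates the enhanced merging idea, under the assumption that the depth-wise convolution is applied to the concatenated branch outputs with a residual connection before a final linear projection; the module and parameter names here are illustrative, not taken from our released code.

```python
import torch
import torch.nn as nn

class EnhancedMergeSketch(nn.Module):
    """Illustrative merge of the global (Y_G) and local (Y_L) branch
    outputs: concatenate, reinforce local aggregation with a depth-wise
    convolution plus a residual connection, then project back to d.
    Sketch only; not the authors' released implementation."""

    def __init__(self, d_model: int = 256, kernel_size: int = 31):
        super().__init__()
        self.dw_conv = nn.Conv1d(
            2 * d_model, 2 * d_model, kernel_size,
            padding=kernel_size // 2, groups=2 * d_model,  # depth-wise
        )
        self.proj = nn.Linear(2 * d_model, d_model)

    def forward(self, y_g: torch.Tensor, y_l: torch.Tensor) -> torch.Tensor:
        c = torch.cat([y_g, y_l], dim=-1)                  # (B, T, 2d)
        conv = self.dw_conv(c.transpose(1, 2)).transpose(1, 2)
        return self.proj(c + conv)                         # (B, T, d)
```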
2. RELATED WORK
2.1. End-to-end Automatic Speech Recognition Models
E2E ASR models can be roughly categorized into three main types
based on their training objectives and decoding algorithms: connec-
tionist temporal classification (CTC) models [7], transducer models [5],
and attention-based encoder-decoder (AED) models, a.k.a. Listen,
Attend, and Spell (LAS) models [4].
Regardless of the training objective and high-level framework,
almost all E2E ASR models share a similar backbone: an acoustic
encoder which encodes audio features into a sequence of hidden fea-
ture vectors.
ture vectors. Many studies have explored different modeling choices
for the acoustic encoder. Recurrent neural networks [4, 5, 12, 13, 30]
encode audio signals sequentially; convolution alternatives [15, 16,
17] process local information in parallel and aggregate it as the
network goes deeper; self-attention [6, 18, 19, 20] allows long-range
interactions and achieves superior performance. In this work, we
focus only on applying the E-Branchformer encoder under the AED
framework, but, like the Conformer, it can also be applied to CTC
or transducer models.
2.2. Combining Self-attention with Convolution
Combining the advantages of self-attention and convolution to capture
both long-range and local patterns has been studied in various prior
works. They can be categorized into two regimes: applying them
sequentially or in parallel (i.e., in a multi-branch manner).
2.2.1. Sequentially
To the best of our knowledge, QANet [31] for question answering is
the first model that combines convolution and self-attention in a se-
quential order. QANet adds two additional convolution blocks (with
residual connections) before the self-attention block in each Trans-
former layer. Evolved Transformer [32] also combines self-attention
and convolution in a sequential manner. In computer vision, non-
local neural networks [33] also show that adding a self-attention
layer after convolution layers enables the model to capture more
global information and improves performance on various vision tasks.
Recently, a series of vision Transformer variants that apply convo-
lution and self-attention sequentially have also been proposed, in-
cluding CvT [34], CoAtNet [35], ViTAEv2 [36], and MaxViT [37]. In speech,
Gulati et al. [28] introduce Conformer models for ASR and show that
adding a convolution block after the self-attention block achieves the
best performance compared to applying it before or in parallel with
the self-attention.
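As a rough illustration of this sequential pattern, a single layer can be sketched in PyTorch as below; this is a simplification (the actual Conformer block also contains two macaron-style feed-forward modules and a final layer norm), and conv_module stands in for any local convolution block:

```python
import torch
import torch.nn as nn

def sequential_layer(x: torch.Tensor,
                     mhsa: nn.MultiheadAttention,  # assumed batch_first=True
                     conv_module: nn.Module,       # placeholder local block
                     norm1: nn.LayerNorm,
                     norm2: nn.LayerNorm) -> torch.Tensor:
    """Conformer-style ordering (sketch): self-attention first,
    convolution after, each wrapped in a pre-norm residual connection."""
    y = norm1(x)
    attn_out, _ = mhsa(y, y, y, need_weights=False)
    x = x + attn_out                  # global context via self-attention
    x = x + conv_module(norm2(x))     # then local context via convolution
    return x
```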
2.2.2. In parallel
Wu et al. [38] propose Lite Transformer using Long-Short Range
Attention (LSRA), which applies multi-head attention and dynamic
convolution [39] in parallel and concatenates their outputs. Lite
Transformer is more efficient and accurate than Transformer base-
lines on various machine translation and summarization tasks. Jiang
et al. [40] extend this to large-scale language model pretraining
and introduce ConvBERT, which combines multi-head attention and
their newly proposed span-based dynamic convolution with shared
queries and values in two branches. Experiments show that Con-
vBERT outperforms attention-only baselines (ELECTRA [41] and
BERT [42]) on a variety of natural language processing tasks. In
computer vision, Pan et al. [43] share a similar concept except that
they decompose a convolution into a point-wise projection and a
shift operation. In speech, Branchformer combines self-attention
with a convolutional spatial gating unit (CSGU) [44], achieving
performance comparable to that of the Conformer. We provide
more details in Section 3.
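The two-branch pattern shared by these works can be sketched as follows; a simple concatenate-and-project merge is assumed here for illustration (Branchformer's actual merging options are described in Section 3), and the branch modules are placeholders:

```python
import torch
import torch.nn as nn

def parallel_layer(x: torch.Tensor,
                   attn_branch: nn.Module,   # e.g., pre-norm self-attention
                   conv_branch: nn.Module,   # e.g., cgMLP or dynamic conv
                   merge_proj: nn.Linear) -> torch.Tensor:
    """Two-branch pattern (sketch): run attention and convolution on the
    same input in parallel, then merge by concatenation and projection."""
    y_global = attn_branch(x)                             # (B, T, d)
    y_local = conv_branch(x)                              # (B, T, d)
    merged = merge_proj(torch.cat([y_global, y_local], dim=-1))
    return x + merged                                     # residual
```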
2.2.3. Hybrid: both sequentially and in parallel
Recently, Inception Transformer [45], which fuses three branches (av-
erage pooling, convolution, and self-attention) with a depth-wise
convolution, achieves impressive performance on several vision
tasks. Our E-Branchformer shares a similar spirit of combining local
and global information both sequentially and in parallel.
3. PRELIMINARY: BRANCHFORMER
Figure 1 shows the high-level architecture of the Branchformer en-
coder, which uses a frontend and a convolutional subsampling layer
to extract low-level speech features and then applies several Branch-
former blocks. There are three components in each Branchformer
block: a global extractor branch, a local extractor branch, and a
merge module.

Fig. 1: The Branchformer-based encoder and a single Branchformer
block. Two parallel branches extract global and local context, and
the merge module combines the outputs of the branches. (Unlike the
Lite Transformer and the in-parallel method in the Conformer, the
Branchformer does not split the inputs for each branch along the
channel dimension.)
The global extractor branch is a conventional Transformer self-
attention block. It uses the pre-norm [46] setup, where a layer
norm (LN) [47], a multi-head self-attention (MHSA), and a
dropout [48] are applied sequentially as follows:

$$\mathbf{Y}_G = \mathrm{Dropout}(\mathrm{MHSA}(\mathrm{LN}(\mathbf{X}))), \qquad (1)$$

where $\mathbf{X}, \mathbf{Y}_G \in \mathbb{R}^{T \times d}$ denote the input and the global-extractor
branch output with a length of $T$ and a hidden dimension of $d$. Like
the Conformer, Branchformer uses relative positional embeddings,
which generally show better performance than absolute positional
embeddings in ASR and NLU tasks [28, 49]. In the paper, the au-
thors also explore more efficient attention variants to reduce compu-
tational cost, which leads to some degradation in accuracy.
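For concreteness, Eq. (1) can be sketched in PyTorch as below; note that this uses torch.nn.MultiheadAttention without the relative positional embeddings discussed above, so it is an approximation of the branch rather than a faithful reimplementation:

```python
import torch
import torch.nn as nn

class GlobalExtractorBranch(nn.Module):
    """Pre-norm self-attention branch: Y_G = Dropout(MHSA(LN(X))).
    Sketch only: the real Branchformer uses relative positional
    embeddings inside MHSA, which nn.MultiheadAttention lacks."""

    def __init__(self, d_model: int = 256, num_heads: int = 4,
                 dropout: float = 0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.mhsa = nn.MultiheadAttention(d_model, num_heads,
                                          batch_first=True)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, T, d)
        y = self.norm(x)
        y, _ = self.mhsa(y, y, y, need_weights=False)
        return self.dropout(y)  # Y_G, same shape as x
```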
Fig. 2: The local extractor branch in Branchformer. This branch uses
a multi-layer perceptron (MLP) with convolutional gating
(cgMLP) [44].
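A minimal sketch of this local branch, following the cgMLP description in [44], is given below; the expansion factor, kernel size, and placement of the activation are assumptions based on the cited paper rather than values verified against the authors' code:

```python
import torch
import torch.nn as nn

class ConvolutionalSpatialGatingUnit(nn.Module):
    """CSGU (sketch): split channels in half, then gate one half with a
    layer-normed, depth-wise-convolved copy of the other half."""

    def __init__(self, dim: int, kernel_size: int = 31):  # assumed kernel
        super().__init__()
        self.norm = nn.LayerNorm(dim // 2)
        self.dw_conv = nn.Conv1d(
            dim // 2, dim // 2, kernel_size,
            padding=kernel_size // 2, groups=dim // 2,  # depth-wise
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a, b = x.chunk(2, dim=-1)                 # (B, T, dim/2) each
        b = self.norm(b).transpose(1, 2)          # Conv1d expects (B, C, T)
        b = self.dw_conv(b).transpose(1, 2)
        return a * b                              # element-wise gating

class LocalExtractorBranch(nn.Module):
    """cgMLP (sketch): up-projection, GELU, CSGU gating, down-projection."""

    def __init__(self, d_model: int = 256, expansion: int = 6,
                 dropout: float = 0.1):
        super().__init__()
        hidden = d_model * expansion              # assumed expansion factor
        self.norm = nn.LayerNorm(d_model)
        self.up = nn.Linear(d_model, hidden)
        self.act = nn.GELU()
        self.csgu = ConvolutionalSpatialGatingUnit(hidden)
        self.down = nn.Linear(hidden // 2, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.act(self.up(self.norm(x)))       # (B, T, hidden)
        y = self.csgu(y)                          # (B, T, hidden/2)
        return self.dropout(self.down(y))         # Y_L, shape of x
```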