
E-BRANCHFORMER: BRANCHFORMER WITH ENHANCED MERGING
FOR SPEECH RECOGNITION
Kwangyoun Kim1, Felix Wu1, Yifan Peng2∗, Jing Pan1, Prashant Sridhar1, Kyu J. Han1, Shinji Watanabe2
1ASAPP Inc., Mountain View, CA, USA
2Carnegie Mellon University, Pittsburgh, PA, USA
∗Work done during an internship at ASAPP.
ABSTRACT
Conformer, combining convolution and self-attention sequentially to
capture both local and global information, has shown remarkable
performance and is currently regarded as the state-of-the-art for auto-
matic speech recognition (ASR). Several other studies have explored
integrating convolution and self-attention, but they have not managed
to match Conformer’s performance. The recently introduced
Branchformer achieves comparable performance to Conformer by
using dedicated branches of convolution and self-attention and merg-
ing local and global context from each branch. In this paper, we
propose E-Branchformer, which enhances Branchformer by apply-
ing an effective merging method and stacking additional point-wise
modules. E-Branchformer sets new state-of-the-art word error rates
(WERs) of 1.81% and 3.65% on the LibriSpeech test-clean and test-other
sets without using any external training data.
Index Terms—Automatic speech recognition, Conformer,
Branchformer, LibriSpeech
1. INTRODUCTION
Automatic speech recognition (ASR) is a speech-to-text task, critical
to enable spoken language understanding [1, 2, 3]. Recently, there
has been a growing interest in end-to-end (E2E) ASR models for
their simplicity, efficiency, and competitive performance. Thanks
to many improvements in E2E ASR modeling such as new model
architectures [4, 5, 6], training objectives [4, 5, 7, 8], and data aug-
mentation methods [9, 10, 11], E2E ASR continues to displace
conventional hybrid ASR in various voice recognition applications.
An acoustic encoder that extracts features from audio inputs
plays a vital role in all E2E ASR models, regardless of which
training objective is used for optimization. Recurrent neural
networks [4, 12, 13, 14] used to be the de facto model choice
for an encoder. Later, many convolution-based [15, 16, 17] and
Transformer-based [6, 18, 19, 20] models have also been intro-
duced. The strength of multi-head self-attention in capturing global
context has allowed the Transformer to show competitive perfor-
mance. Transformers and their variants have been further studied to
reduce computation cost [21, 22, 23, 24, 25] or to train deep models
stably [26]. In parallel, several studies have investigated how to
merge the strengths of multi-head self-attention with those of other
neural units. The combination of a recurrent unit with self-attention,
which does not require explicit positional embeddings, shows stable
performance even in long-form speech recognition [27]. Besides,
Conformer [28] which combines convolution and self-attention se-
quentially, and Branchformer [29] which does the combination in
parallel branches, both exhibit superior performance to the Trans-
former by processing local and global context together.
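To make the contrast concrete, the following is a minimal PyTorch sketch of the two combination strategies. These are not the exact Conformer and Branchformer blocks: the sketch omits the feed-forward modules, normalization, residual connections, and, in Branchformer's case, the convolutional gating MLP (cgMLP) used in the local branch. All module names and hyper-parameters are illustrative.

import torch
import torch.nn as nn

class SequentialMix(nn.Module):
    """Conformer-style: global context (self-attention), then local (convolution)."""
    def __init__(self, d_model: int, num_heads: int = 4, kernel: int = 31):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        # depth-wise convolution captures local context along time
        self.conv = nn.Conv1d(d_model, d_model, kernel,
                              padding=kernel // 2, groups=d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, T, d_model)
        x, _ = self.attn(x, x, x)                         # global context first
        return self.conv(x.transpose(1, 2)).transpose(1, 2)  # then local

class ParallelMix(nn.Module):
    """Branchformer-style: global and local branches computed in parallel,
    then merged (here by concatenation and a linear projection)."""
    def __init__(self, d_model: int, num_heads: int = 4, kernel: int = 31):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.conv = nn.Conv1d(d_model, d_model, kernel,
                              padding=kernel // 2, groups=d_model)
        self.merge = nn.Linear(2 * d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, T, d_model)
        y_global, _ = self.attn(x, x, x)                  # global branch
        y_local = self.conv(x.transpose(1, 2)).transpose(1, 2)  # local branch
        return self.merge(torch.cat([y_global, y_local], dim=-1))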
In this work, we extend the study of combining local and
global information and propose E-Branchformer, a descendant of
Branchformer that enhances the merging mechanism. Our enhanced
merging block utilizes one additional lightweight operation,
a depth-wise convolution, which reinforces local aggregation.
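A minimal sketch of this enhanced merge follows, assuming the depth-wise convolution operates on the concatenated branch outputs and is added back residually before the final projection; module names and the kernel size are illustrative.

import torch
import torch.nn as nn

class EnhancedMerge(nn.Module):
    """Sketch of merging with extra local aggregation: a depth-wise
    convolution over the concatenated branch outputs, added back
    residually, followed by a projection to the model dimension."""
    def __init__(self, d_model: int, kernel: int = 31):
        super().__init__()
        self.dw_conv = nn.Conv1d(2 * d_model, 2 * d_model, kernel,
                                 padding=kernel // 2, groups=2 * d_model)
        self.proj = nn.Linear(2 * d_model, d_model)

    def forward(self, y_global: torch.Tensor,
                y_local: torch.Tensor) -> torch.Tensor:  # each: (B, T, d_model)
        y = torch.cat([y_global, y_local], dim=-1)       # (B, T, 2 * d_model)
        # lightweight local aggregation across time before merging
        y = y + self.dw_conv(y.transpose(1, 2)).transpose(1, 2)
        return self.proj(y)                              # (B, T, d_model)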
Our experiments show that E-Branchformer surpasses Conformer and
Branchformer on both test-clean and test-other in LibriSpeech. Our
specific contributions include:
1. The first Conformer baseline implementation using the
attention-based encoder-decoder model that matches the
accuracy of Google’s Conformer [28].
2. An improved Branchformer baseline that achieves 0.2% and
0.5% absolute WER improvements.
3. An extensive study on various ways to merge local and global
branches.
4. The new E-Branchformer encoder architecture which sets a
new state-of-the-art performance (1.81% and 3.65% WER on
test-clean and test-other) on LibriSpeech under the constraint
of not using any external data.
5. Our code will be released at https://anonymized url for repro-
ducibility.
2. RELATED WORK
2.1. End-to-end Automatic Speech Recognition Models
E2E ASR models can be roughly categorized into three main types
based on their training objectives and decoding algorithms: connectionist
temporal classification (CTC) models [7], transducer models [5], and
attention-based encoder-decoder (AED) models, a.k.a. Listen,
Attend, and Spell (LAS) models [4].
Regardless of the training objective and high-level framework,
almost all E2E ASR models share a similar backbone: an acoustic
encoder that encodes audio features into a sequence of hidden fea-
ture vectors. Many studies have explored different modeling choices
for the acoustic encoder. Recurrent neural networks [4, 5, 12, 13, 30]
encode audio signals sequentially; convolution alternatives [15, 16,
17] process local information in parallel and aggregate them as the
network goes deeper; self-attention [6, 18, 19, 20] allows long-range
interactions and achieves superior performance. In this work, we focus
only on applying the E-Branchformer encoder under the AED framework,
but, like Conformer, it can also be applied to CTC or transducer models.
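To illustrate this shared backbone, the following is a hypothetical PyTorch skeleton, not the API of any specific toolkit: the encoder is interchangeable, and only the objective-specific head on top changes.

import torch
import torch.nn as nn

class CTCModel(nn.Module):
    """Hypothetical skeleton: any acoustic encoder (recurrent, convolutional,
    Transformer, Conformer, Branchformer, ...) produces hidden vectors; here
    a CTC head consumes them, but a transducer joiner or an AED decoder
    could attend to the same `hidden` sequence instead."""
    def __init__(self, encoder: nn.Module, d_model: int, vocab_size: int):
        super().__init__()
        self.encoder = encoder
        self.ctc_head = nn.Linear(d_model, vocab_size)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:  # (B, T, feat_dim)
        hidden = self.encoder(feats)        # (B, T', d_model) hidden features
        return self.ctc_head(hidden)        # frame-wise logits for the CTC loss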