
MONOTONIC SEGMENTAL ATTENTION FOR AUTOMATIC SPEECH RECOGNITION
Albert Zeyer1,2, Robin Schmitt1,2, Wei Zhou1,2, Ralf Schlüter1,2, Hermann Ney1,2
1Human Language Technology and Pattern Recognition, Computer Science Department,
RWTH Aachen University, 52062 Aachen, Germany,
2AppTek GmbH, 52062 Aachen, Germany
{zeyer,zhou,schlueter,ney}@cs.rwth-aachen.de, robin.schmitt1@rwth-aachen.de
ABSTRACT
We introduce a novel segmental-attention model for auto-
matic speech recognition. We restrict the decoder attention
to segments to avoid quadratic runtime of global attention,
better generalize to long sequences, and eventually enable
streaming. We directly compare global-attention and dif-
ferent segmental-attention modeling variants. We develop
and compare two separate time-synchronous decoders, one
specifically taking the segmental nature into account, yield-
ing further improvements. Using time-synchronous decoding
for segmental models is novel and a step towards stream-
ing applications. Our experiments show the importance of
a length model to predict the segment boundaries. The final
best segmental-attention model using segmental decoding performs better than the global-attention model, in contrast to other monotonic attention approaches in the literature. Further, we
observe that the segmental model generalizes much better to
long sequences of up to several minutes.
Index Terms—Segmental attention, segmental models
1. INTRODUCTION & RELATED WORK
The attention-based encoder-decoder architecture [1] has
been very successful as an end-to-end model for many tasks
including speech recognition [2–6]. However, for every output label, the attention weights are computed over all input frames, which is referred to as global attention. This has several drawbacks: quadratic runtime complexity, potential non-monotonicity, no support for online streaming recognition, and poor generalization to sequence lengths beyond those seen in training [7–9].
There are many attempts to solve some of these issues.
Monotonic chunkwise attention (MoChA) [10–12] is one
popular approach which uses fixed-size chunks for soft at-
tention and a deterministic approach for the chunk positions,
i.e., the position is not treated as a latent variable in recognition. Many similar approaches have been proposed that use a local fixed-size window together with some heuristic or a separate neural network for position prediction [2, 13–20]. The attention sometimes also uses a Gaussian distribution, which allows for differentiable position prediction [14–16, 21, 22]. Some models add a penalty in training, or are extended to have an implicit bias, to encourage monotonicity [2, 13, 23]. Framewise defined models like CTC [24] or transducers [25] canonically allow for streaming, and there are approaches to combine such framewise models with attention [26–33].
Our probabilistic formulation with latent variables for the segment boundaries is similar to other segmental models [34–46], although attention has not been used on the segments except in [46]. Moreover, those models usually make stronger independence assumptions, such as a zero- or first-order dependency on the label and only a first-order dependency on the segment boundary, which is a critical difference. It has also been shown that transducer models and segmental models are equivalent [47].
Here, we want to make use of the attention mechanism
while solving the mentioned global attention drawbacks by
making the attention local and monotonic on segments. We
treat the segment boundaries as latent variables and thus arrive at our segmental attention models. Such segmental models are by definition monotonic, allow for streaming, and are much more efficient because they use only local attention. Our aim is to gain a better understanding of such segmental attention models: to compare them directly to the global attention model, and to study how to treat and model the latent variable, how to perform the search in recognition, how to treat silence, and how well they generalize to longer sequences, among other questions.
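As a first illustration of the core idea, the following is a minimal sketch of soft attention restricted to a single segment. The tensor names, the additive-attention parametrization, and the PyTorch setting are assumptions for illustration only, not the exact model defined in the following sections.

```python
import torch
import torch.nn.functional as F

def segmental_attention_step(enc, dec_state, seg_start, seg_end, W_enc, W_dec, v):
    """Soft attention over one segment enc[seg_start:seg_end] only.

    enc:       [T, D_enc]  encoder outputs h_1^T
    dec_state: [D_dec]     current decoder state
    seg_start, seg_end:    (latent) segment boundaries for the current label
    W_enc, W_dec, v:       parameters of an additive (MLP) attention energy
    """
    seg = enc[seg_start:seg_end]                                 # local window instead of all T frames
    energies = torch.tanh(seg @ W_enc + dec_state @ W_dec) @ v   # attention energies within the segment
    alpha = F.softmax(energies, dim=0)                           # attention weights over the segment frames
    context = alpha @ seg                                        # context vector for the current label
    return context, alpha
```

Because each output label attends only to its own segment, and consecutive labels use consecutive segments, the attention is monotonic by construction and the cost per label is bounded by the segment length instead of the full input length T.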
2. GLOBAL ATTENTION MODEL
Our starting point is the standard global-attention-based
encoder-decoder model [1] adapted for speech recognition
[2, 3, 5, 6], specifically the model as described in [4, 48, 49].
We use an LSTM-based [50] encoder which gets a sequence of audio feature frames $x_1^{T'}$ of length $T'$ as input and encodes it as a sequence
$$h_1^T = \operatorname{Encoder}(x_1^{T'})$$
of length $T$, where we apply downsampling by max-pooling in time inside the encoder by factor 6. For the output label sequence $a_1^S$ of length $S$, given the encoder output sequence $h_1^T$ of length $T$, we define
$$p(a_1^S \mid h_1^T) = \prod_{s=1}^{S} \underbrace{p(a_s \mid a_1^{s-1}, h_1^T)}_{\text{label model}},$$
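To make the quadratic cost of this global-attention baseline explicit, the following is a minimal sketch that computes the full attention weight matrix over all encoder frames for all decoder steps. The additive-attention parametrization and the PyTorch tensor layout are assumptions for illustration, not the exact configuration used in our experiments.

```python
import torch

def global_attention_contexts(enc, queries, W_enc, W_dec, v):
    """Global soft attention: every decoder step s attends over all T encoder frames.

    enc:     [T, D_enc]  encoder outputs h_1^T
    queries: [S, D_dec]  decoder states, one per output label position
    Returns the [S, T] attention weight matrix and the [S, D_enc] context vectors
    that condition the label model p(a_s | a_1^{s-1}, h_1^T).
    """
    # additive (MLP) attention energies for every (s, t) pair -> O(S * T) time and memory
    energies = torch.tanh(enc @ W_enc + (queries @ W_dec)[:, None, :]) @ v  # [S, T]
    alpha = torch.softmax(energies, dim=-1)       # weights over all T frames per output label
    contexts = alpha @ enc                        # [S, D_enc] context vectors
    return alpha, contexts
```

Since both S and T grow with the utterance length, the [S, T] energies and weights are what make global attention quadratic, which the segmental attention model avoids by restricting each step to one segment.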