MONOTONIC SEGMENTAL ATTENTION FOR AUTOMATIC SPEECH RECOGNITION
Albert Zeyer1,2, Robin Schmitt1,2, Wei Zhou1,2, Ralf Schlüter1,2, Hermann Ney1,2
1Human Language Technology and Pattern Recognition, Computer Science Department,
RWTH Aachen University, 52062 Aachen, Germany,
2AppTek GmbH, 52062 Aachen, Germany
{zeyer,zhou,schlueter,ney}@cs.rwth-aachen.de, robin.schmitt1@rwth-aachen.de
ABSTRACT
We introduce a novel segmental-attention model for auto-
matic speech recognition. We restrict the decoder attention
to segments to avoid quadratic runtime of global attention,
better generalize to long sequences, and eventually enable
streaming. We directly compare global-attention and dif-
ferent segmental-attention modeling variants. We develop
and compare two separate time-synchronous decoders, one
specifically taking the segmental nature into account, yield-
ing further improvements. Using time-synchronous decoding
for segmental models is novel and a step towards stream-
ing applications. Our experiments show the importance of
a length model to predict the segment boundaries. The final
best segmental-attention model using segmental decoding
performs better than global-attention, in contrast to other
monotonic attention approaches in the literature. Further, we
observe that the segmental model generalizes much better to
long sequences of up to several minutes.
Index Terms— Segmental attention, segmental models
1. INTRODUCTION & RELATED WORK
The attention-based encoder-decoder architecture [1] has
been very successful as an end-to-end model for many tasks
including speech recognition [2–6]. However, for every out-
put label, the attention weights are over all the input frames,
referred to as global attention. This has several drawbacks: quadratic runtime complexity, potential non-monotonicity, no support for online streaming recognition, and no generalization to longer sequence lengths than those seen in training [7–9].
There are many attempts to solve some of these issues.
Monotonic chunkwise attention (MoChA) [10–12] is one
popular approach which uses fixed-size chunks for soft at-
tention and a deterministic approach for the chunk positions,
i.e. the position is not treated as a latent variable in recog-
nition. Many similar approaches using a local fixed-sized
window and some heuristic or separate neural network for the
position prediction were proposed [2, 13–20]. The attention
sometimes also uses a Gaussian distribution, which allows for differentiable position prediction [14–16, 21, 22]. Some models add a penalty in training, or are extended with an implicit bias, to encourage monotonicity [2, 13, 23]. Framewise-defined models like CTC [24] or transducers [25] canonically allow for streaming, and there are approaches to combine such framewise models with attention [26–33].
Our probabilistic formulation using latent variables for
the segment boundaries is similar to other segmental models [34–46], although attention has not been used on the segments except in [46]. Those models usually make more independence assumptions, such as a first- or zero-order dependency on the label and only a first-order dependency on the segment boundary, which is a critical difference. It has also been shown that
transducer models and segmental models are equivalent [47].
Here, we want to make use of the attention mechanism
while solving the mentioned global attention drawbacks by
making the attention local and monotonic on segments. We
treat the segment boundaries as latent variables and thus arrive at our segmental attention models. Such segmental models are by definition monotonic, allow for streaming, and are much more efficient because they use only local attention. Our aim is to get a better understanding of such segmental attention models: to directly compare them to the global attention model, and to study how to treat and model the latent variable, how to perform the search in recognition, how to treat silence, and how well the model generalizes to longer sequences, among other questions.
2. GLOBAL ATTENTION MODEL
Our starting point is the standard global-attention-based
encoder-decoder model [1] adapted for speech recognition
[2, 3, 5, 6], specifically the model as described in [4, 48, 49].
We use an LSTM-based [50] encoder which gets a sequence of audio feature frames $x_1^{T'}$ of length $T'$ as input and encodes it as a sequence
$$h_1^T = \mathrm{Encoder}(x_1^{T'})$$
of length $T$, where we apply downsampling by max-pooling in time inside the encoder by factor 6. For the output label sequence $a_1^S$ of length $S$, given the encoder output sequence $h_1^T$ of length $T$, we define
$$p(a_1^S \mid h_1^T) = \prod_{s=1}^{S} \underbrace{p(a_s \mid a_1^{s-1}, h_1^T)}_{\text{label model}},$$
978-1-6654-7189-3/22/$31.00 ©2023 IEEE
arXiv:2210.14742v1 [cs.CL] 26 Oct 2022
Fig. 1: Example alignment of the label sequence $\ldots a_{s-1}\, a_s\, a_{s+1} \ldots$ over time $t$; the segment for $a_s$, spanning frames $t_{s-1}+1$ to $t_s$, is highlighted.
Fig. 2: Segmental attention model. The encoder maps the input $x$ to $h$; attention over the segment $[t_{s-1}+1, t_s]$ yields the context $c_s$, which feeds the label model for $a_s$; the length model predicts $t_s$ given $t_{s-1}$ and the output $a_{s-1}$ of the previous step.
The label model uses global attention on $h_1^T$ for each output step $s$. The neural structure which defines the probability distribution of the labels is also called the decoder. The decoder
of the global-attention model is almost the same as our seg-
mental attention model, which we will define below in detail.
The main difference is segmental vs. global attention.
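To make the quadratic cost concrete, the following is a minimal NumPy sketch of one global-attention step. Note this is an illustration, not the actual model: the energies in the paper's model come from an MLP over the decoder state (see the label model equations in Section 3.1), while here a plain dot product stands in for them.

```python
import numpy as np

def global_attention_context(h, q):
    """Soft attention over all T encoder frames h (T x D) for one decoder query q (D,).

    Returns the context vector c_s = sum_t alpha_{s,t} * h_t. The softmax runs over
    all T frames, so the cost is O(T) per output step and O(S*T) overall -- the
    quadratic runtime that motivates the segmental restriction.
    """
    e = h @ q                      # energies e_{s,t}; a dot product here for brevity
    alpha = np.exp(e - e.max())    # numerically stable softmax over all frames
    alpha /= alpha.sum()
    return alpha @ h               # attention-weighted sum of encoder frames

# usage: T = 50 encoder frames, D = 8 feature dims
h = np.random.randn(50, 8)
q = np.random.randn(8)
c = global_attention_context(h, q)
assert c.shape == (8,)
```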
3. OUR SEGMENTAL ATTENTION MODEL
Now, we introduce segment boundaries as a sequence of latent variables $t_1^S$. Specifically, for one output $a_s$, the segment is defined by $[t_{s-1}+1, t_s]$, with $t_0 = 0$ and $t_S = T$ fixed, and we require $t_s > t_{s-1}$ for strict monotonicity. Thus, the segmentation fully covers all frames of the sequence. One such segment is highlighted in Figure 1. The label model now uses attention only in that segment, i.e. on $h_{t_{s-1}+1}^{t_s}$. For the output label sequence $a_1^S$, we now define the segmental model as
$$p(a_1^S \mid h_1^T) = \sum_{t_1^S} \prod_{s=1}^{S} \underbrace{p(t_s \mid \ldots)^{\alpha}}_{\text{length model}} \cdot \underbrace{p(a_s \mid a_1^{s-1}, h_{t_{s-1}+1}^{t_s}, \ldots)}_{\text{label model}}.$$
In the simplest case, we do not use any length model at all ($\alpha = 0$). The intuition was that a proper dynamic search over the segment boundaries can be guided by the label model alone, as it should produce bad scores for bad segment boundaries. We also test a simple static length model and a neural length model, as we will describe later. The label model is mostly the same as in the global attention case, with the main difference that the attention is restricted to $h_{t_{s-1}+1}^{t_s}$. The whole segmental model is depicted in Figure 2.
Fig. 3: Segmental attention label model $p(a_s \mid \ldots)$: attention over the segment yields $c_s$, which together with the previous output $a_{s-1}$ and context $c_{s-1}$ feeds an LSTM; a Linear, Maxout, Linear, softmax stack then outputs $p(a_s)$.
3.1. Label model variations
The label model is depicted in Figure 3 and defined as
$$p(a_s \mid \ldots) = (\text{softmax} \circ \text{Linear} \circ \text{maxout} \circ \text{Linear})\big(\mathrm{LSTM}(c_1^{s-1}, a_1^{s-1}),\, c_s\big)$$
$$c_s = \sum_{t=t_{s-1}+1}^{t_s} \alpha_{s,t} \cdot h_t$$
$$\alpha_{s,t} = \frac{\exp(e_{s,t})}{\sum_{\tau=t_{s-1}+1}^{t_s} \exp(e_{s,\tau})}$$
$$e_{s,t} = (\text{Linear} \circ \tanh \circ \text{Linear})\big(\mathrm{LSTM}(c_1^{s-1}, a_1^{s-1}),\, h_t\big).$$
The attention weights here are only calculated on the interval $[t_{s-1}+1, t_s]$. Further, we do not have attention weight feedback, as there is no overlap between the segments. Otherwise the model is exactly the same as the global attention decoder, to allow for direct comparisons and also to import model parameters.
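The restriction of the attention to the segment can be sketched as follows. As before, this is a simplified NumPy illustration: a dot-product energy stands in for the MLP of the equations above.

```python
import numpy as np

def segmental_attention_context(h, q, t_prev, t_s):
    """Attention restricted to the segment [t_prev+1, t_s] (1-based, inclusive).

    Identical in form to global attention, but the softmax runs only over the
    segment's frames, so the cost per label is the segment length, not O(T).
    """
    seg = h[t_prev:t_s]            # frames t_prev+1 .. t_s in 0-based slicing
    e = seg @ q                    # energies e_{s,t}, t in the segment only
    alpha = np.exp(e - e.max())    # softmax normalized within the segment
    alpha /= alpha.sum()
    return alpha @ seg             # context c_s from segment frames only

h = np.random.randn(50, 8)
q = np.random.randn(8)
c = segmental_attention_context(h, q, t_prev=10, t_s=20)  # a 10-frame segment
assert c.shape == (8,)
```

With `t_prev=0` and `t_s=T` this reduces to the global attention of Section 2, which is the sense in which the two decoders differ only in the attention's support.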
Another variation is on the dependencies. In any case, we depend on the full label history $a_1^{s-1}$. When we have the dependency on $c_1^{s-1}$, as is standard for LSTM-based attention models, this implies an implicit dependency on all past segment boundaries $t_1^{s-1}$, which removes the option of an exact first-order dynamic-programming implementation for forced alignment or exact sequence-likelihood training. So, we also test the variant where we remove the $c_1^{s-1}$ dependency in the equations above.
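Without the $c_1^{s-1}$ dependency, the per-segment score depends only on the boundary pair $(t_{s-1}, t_s)$ given the label history, so a first-order dynamic program over boundaries becomes exact. The sketch below shows the Viterbi (max) variant for forced alignment with a known label sequence; the `score` callback and its signature are hypothetical stand-ins for the combined length-model and label-model log-score, not the paper's actual implementation.

```python
import numpy as np

def viterbi_segment_boundaries(score, T, S):
    """First-order DP over segment boundaries 0 < t_1 < ... < t_S = T.

    score(s, t_prev, t) returns the log-score of placing segment s (for the
    known label a_s) on frames [t_prev+1, t], i.e. length model plus label
    model. Exactness requires the label model not to depend on c_1^{s-1}.
    """
    NEG = -1e30
    Q = np.full((S + 1, T + 1), NEG)   # Q[s, t]: best log-score with segment s ending at t
    B = np.zeros((S + 1, T + 1), int)  # backpointers to the previous boundary
    Q[0, 0] = 0.0
    for s in range(1, S + 1):
        for t in range(s, T + 1):              # at least one frame per segment
            for t_prev in range(s - 1, t):     # strict monotonicity t_prev < t
                cand = Q[s - 1, t_prev] + score(s, t_prev, t)
                if cand > Q[s, t]:
                    Q[s, t], B[s, t] = cand, t_prev
    # backtrace from the fixed final boundary t_S = T
    bounds, t = [], T
    for s in range(S, 0, -1):
        bounds.append(t)
        t = B[s, t]
    return bounds[::-1]

# toy usage: a hypothetical score preferring segments of length 2
bounds = viterbi_segment_boundaries(lambda s, tp, t: -float((t - tp - 2) ** 2), T=8, S=4)
assert bounds == [2, 4, 6, 8]
```

Replacing the max by a log-sum-exp over `t_prev` gives the marginalization over $t_1^S$ needed for exact sequence-likelihood training.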
3.2. Silence modeling
The output label vocabulary of the global attention model usually does not include silence, as this is not necessarily needed. Our segments completely cover the input sequence, and thus the question arises whether the silence parts should be separate segments or not, i.e. whether we should add silence to the vocabulary. Additionally, as silence segments tend to be longer, we optionally split them up. We will perform experiments for all three variants, also for the global attention model.
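Splitting long silence segments, as mentioned above, amounts to a simple post-processing of an alignment. The sketch below illustrates one way to do it; the `(label, start, end)` tuple format with 1-based inclusive frame ranges is a hypothetical representation chosen for illustration, not the actual toolkit format.

```python
def split_silence(segments, max_len, sil="<sil>"):
    """Split silence segments longer than max_len frames into chunks.

    segments: list of (label, start, end) tuples with 1-based inclusive
    frame ranges; non-silence segments pass through unchanged.
    """
    out = []
    for label, start, end in segments:
        if label == sil and end - start + 1 > max_len:
            t = start
            while t <= end:                       # emit max_len-sized chunks
                out.append((sil, t, min(t + max_len - 1, end)))
                t += max_len
        else:
            out.append((label, start, end))
    return out

# usage: a 7-frame silence segment split into chunks of at most 3 frames
segs = split_silence([("<sil>", 1, 7), ("hello", 8, 12)], max_len=3)
assert segs == [("<sil>", 1, 3), ("<sil>", 4, 6), ("<sil>", 7, 7), ("hello", 8, 12)]
```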
Further, when we include silence, the question is whether
this should be treated just as another output label, or treated
in a special way, e.g. separating it from the softmax. We test