
MONOTONIC SEGMENTAL ATTENTION FOR AUTOMATIC SPEECH RECOGNITION
Albert Zeyer1,2, Robin Schmitt1,2, Wei Zhou1,2, Ralf Schlüter1,2, Hermann Ney1,2
1Human Language Technology and Pattern Recognition, Computer Science Department,
RWTH Aachen University, 52062 Aachen, Germany,
2AppTek GmbH, 52062 Aachen, Germany
{zeyer,zhou,schlueter,ney}@cs.rwth-aachen.de, robin.schmitt1@rwth-aachen.de
ABSTRACT
We introduce a novel segmental-attention model for auto-
matic speech recognition. We restrict the decoder attention
to segments to avoid quadratic runtime of global attention,
better generalize to long sequences, and eventually enable
streaming. We directly compare global-attention and dif-
ferent segmental-attention modeling variants. We develop
and compare two separate time-synchronous decoders, one
specifically taking the segmental nature into account, yield-
ing further improvements. Using time-synchronous decoding
for segmental models is novel and a step towards stream-
ing applications. Our experiments show the importance of
a length model to predict the segment boundaries. The final
best segmental-attention model using segmental decoding performs better than the global-attention model, in contrast to other monotonic attention approaches in the literature. Further, we
observe that the segmental model generalizes much better to
long sequences of up to several minutes.
Index Terms—Segmental attention, segmental models
1. INTRODUCTION & RELATED WORK
The attention-based encoder-decoder architecture [1] has
been very successful as an end-to-end model for many tasks
including speech recognition [2–6]. However, for every output label, the attention weights are computed over all input frames, which is referred to as global attention. This has several drawbacks: quadratic runtime complexity, potential non-monotonicity, no support for online streaming recognition, and poor generalization to sequence lengths beyond those seen in training [7–9].
There are many attempts to solve some of these issues.
Monotonic chunkwise attention (MoChA) [10–12] is one
popular approach which uses fixed-size chunks for soft at-
tention and a deterministic approach for the chunk positions,
i.e., the position is not treated as a latent variable in recognition. Many similar approaches have been proposed that use a local fixed-size window together with some heuristic or a separate neural network for position prediction [2, 13–20]. The attention sometimes also uses a Gaussian distribution, which allows for differentiable position prediction [14–16, 21, 22]. Some models add a penalty in training, or are extended to have an implicit bias, to encourage monotonicity [2, 13, 23]. Framewise defined models like CTC [24] or transducers [25] canonically allow for streaming, and there are approaches to combine such framewise models with attention [26–33].
Our probabilistic formulation with latent variables for the segment boundaries is similar to other segmental models [34–46], although attention has not been used on the segments except in [46]. Moreover, those models usually make stronger independence assumptions, such as a zero- or first-order dependency on the label and only a first-order dependency on the segment boundary, which is a critical difference. It has also been shown that transducer models and segmental models are equivalent [47].
Here, we want to make use of the attention mechanism
while solving the mentioned global attention drawbacks by
making the attention local and monotonic on segments. We
treat the segment boundaries as latent variables and thus arrive at our segmental attention models. Such segmental models are by definition monotonic, allow for streaming, and are much more efficient because they use only local attention. Our aim is to gain a better understanding of such segmental attention models: to compare them directly to the global attention model, and to study how to treat and model the latent variable, how to perform the search in recognition, how to treat silence, and how well they generalize to longer sequences, among other questions.
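As a first illustration of the core idea, the following is a minimal sketch of soft attention restricted to a single segment. The tensor names, the additive-attention parametrization, and the PyTorch setting are assumptions for illustration only, not the exact model defined in the following sections.

```python
import torch
import torch.nn.functional as F

def segmental_attention_step(enc, dec_state, seg_start, seg_end, W_enc, W_dec, v):
    """Soft attention over one segment enc[seg_start:seg_end] only.

    enc:       [T, D_enc]  encoder outputs h_1^T
    dec_state: [D_dec]     current decoder state
    seg_start, seg_end:    (latent) segment boundaries for the current label
    W_enc, W_dec, v:       parameters of an additive (MLP) attention energy
    """
    seg = enc[seg_start:seg_end]                                 # local window instead of all T frames
    energies = torch.tanh(seg @ W_enc + dec_state @ W_dec) @ v   # attention energies within the segment
    alpha = F.softmax(energies, dim=0)                           # attention weights over the segment frames
    context = alpha @ seg                                        # context vector for the current label
    return context, alpha
```

Because each output label attends only to its own segment, and consecutive labels use consecutive segments, the attention is monotonic by construction and the cost per label is bounded by the segment length instead of the full input length T.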
2. GLOBAL ATTENTION MODEL
Our starting point is the standard global-attention-based
encoder-decoder model [1] adapted for speech recognition
[2, 3, 5, 6], specifically the model as described in [4, 48, 49].
We use an LSTM-based [50] encoder which gets a sequence of audio feature frames $x_1^{T'}$ of length $T'$ as input and encodes it as a sequence
$$h_1^T = \operatorname{Encoder}(x_1^{T'})$$
of length $T$, where we apply downsampling by max-pooling in time inside the encoder by factor 6. For the output label sequence $a_1^S$ of length $S$, given the encoder output sequence $h_1^T$ of length $T$, we define
$$p(a_1^S \mid h_1^T) = \prod_{s=1}^{S} \underbrace{p(a_s \mid a_1^{s-1}, h_1^T)}_{\text{label model}},$$
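To make the quadratic cost of this global-attention baseline explicit, the following is a minimal sketch that computes the full attention weight matrix over all encoder frames for all decoder steps. The additive-attention parametrization and the PyTorch tensor layout are assumptions for illustration, not the exact configuration used in our experiments.

```python
import torch

def global_attention_contexts(enc, queries, W_enc, W_dec, v):
    """Global soft attention: every decoder step s attends over all T encoder frames.

    enc:     [T, D_enc]  encoder outputs h_1^T
    queries: [S, D_dec]  decoder states, one per output label position
    Returns the [S, T] attention weight matrix and the [S, D_enc] context vectors
    that condition the label model p(a_s | a_1^{s-1}, h_1^T).
    """
    # additive (MLP) attention energies for every (s, t) pair -> O(S * T) time and memory
    energies = torch.tanh(enc @ W_enc + (queries @ W_dec)[:, None, :]) @ v  # [S, T]
    alpha = torch.softmax(energies, dim=-1)       # weights over all T frames per output label
    contexts = alpha @ enc                        # [S, D_enc] context vectors
    return alpha, contexts
```

Since both S and T grow with the utterance length, the [S, T] energies and weights are what make global attention quadratic, which the segmental attention model avoids by restricting each step to one segment.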