Linguistic-Enhanced Transformer with CTC
Embedding for Speech Recognition
Xulong Zhang, Jianzong Wang∗, Ning Cheng, Mengyuan Zhao, Zhiyong Zhang, Jing Xiao
Ping An Technology (Shenzhen) Co., Ltd.
∗Corresponding author: Jianzong Wang, jzwang@188.com.
Abstract—The recent emergence of the joint CTC-Attention model shows significant improvement in automatic speech recognition (ASR). The improvement largely lies in the modeling of linguistic information by the decoder. The decoder, jointly optimized with an acoustic encoder, learns a language model from the ground-truth sequences in an autoregressive manner during training. However, the training corpus of the decoder is limited to the speech transcriptions, which is far smaller than the corpus needed to train an acceptable language model. This leads to poor robustness of the decoder. To alleviate this problem, we propose the linguistic-enhanced transformer, which introduces refined CTC information to the decoder during the training process, so that the decoder can be more robust. Our experiments on the AISHELL-1 speech corpus show that the character error rate (CER) is relatively reduced by up to 7%. We also find that in the joint CTC-Attention ASR model, the decoder is more sensitive to linguistic information than to acoustic information.
Index Terms—speech recognition, attention, CTC, transformer,
linguistic information
I. INTRODUCTION
The adoption of end-to-end models greatly simplifies the training process of automatic speech recognition (ASR) systems. In such models, the acoustic, pronunciation, and language modeling components are jointly optimized in a unified system. There is no need to train these components separately and then integrate them during decoding, as in a traditional hybrid system [1]. As end-to-end models have developed, they gradually exhibit performance superior to traditional hybrid systems in both accuracy and real-time factor (RTF), which has made them prevalent and a trend in the speech recognition research community.
The fundamental problem of end-to-end ASR is how to process input and output sequences of different lengths. Two main approaches have been proposed to handle this problem. One of them is connectionist temporal classification (CTC) [2], [3]. CTC introduces a special “blank” label so that the input and output sequences can be aligned; the CTC loss can then be computed efficiently using the forward-backward algorithm.
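As a reminder of the standard CTC formulation in [2], the model marginalizes over all blank-augmented alignments $\pi$ that collapse to the label sequence $\mathbf{y}$ under the mapping $\mathcal{B}$, which removes repeated labels and blanks:
$$p(\mathbf{y}\mid\mathbf{x})=\sum_{\pi\in\mathcal{B}^{-1}(\mathbf{y})}\prod_{t=1}^{T}p(\pi_t\mid\mathbf{x}),$$
where $T$ is the number of input frames; the per-frame posteriors are treated as conditionally independent given the input $\mathbf{x}$, which is exactly the assumption the attention branch avoids.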
The other framework is the attention-based encoder-decoder (AED) model [4], [5]. Furthermore, many optimizations of neural network structures and training strategies, such as Convolutional Neural Networks (CNN) [6], [7], Long Short-Term Memory (LSTM) [6], and Batch Normalization (BN) [6], have been applied to both the CTC and AED approaches.
The AED model [4], [5], [8]–[11] first achieves great success on the neural machine translation task [8], [9], [12], and then quickly expands to many other fields. Chorowski et al. introduce the AED model into speech recognition [4]. They obtain a phoneme error rate (PER) of 17.6% on the TIMIT phoneme recognition task, and solve the accuracy degradation on long utterances by adding location-awareness to the attention mechanism. LAS [5] produces character sequences without making any independence assumptions between the characters. [10] uses a WFST to decode the AED model with a word-level language model.
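In contrast to CTC, the AED decoder factorizes the posterior with the chain rule and makes no conditional independence assumption:
$$p(\mathbf{y}\mid\mathbf{x})=\prod_{l=1}^{L}p(y_l\mid y_{1},\ldots,y_{l-1},\mathbf{x}),$$
where $L$ is the output length and each factor is computed by attending over the encoder output, so every prediction is conditioned on the full label history.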
A major breakthrough is the joint CTC-Attention model [13]–[23] based on the multi-task learning (MTL) framework proposed by Watanabe, Kim et al., which fuses the advantages of the two approaches above. The CTC alignment is used as auxiliary information to assist the training of the AED model. Hence, both the robustness and the convergence speed are significantly improved. This method has gradually become the standard framework for the end-to-end ASR task. Hori et al. propose joint CTC, attention, and RNN-LM decoding [24], and introduce a word-based RNN-LM [25], which further improves the performance of end-to-end ASR.
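Concretely, the MTL objective interpolates the two losses with a weight $\lambda\in[0,1]$, following the formulation of [13]:
$$\mathcal{L}_{\mathrm{MTL}}=\lambda\,\mathcal{L}_{\mathrm{CTC}}+(1-\lambda)\,\mathcal{L}_{\mathrm{att}},$$
so the monotonic CTC alignment regularizes the attention decoder throughout training.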
Another important improvement is the Transformer [26] proposed by Vaswani et al., which dispenses with the recurrent networks in the AED model and relies solely on attention mechanisms. This significantly speeds up training and saves computing resources. Dong et al. apply the Transformer to the ASR task [27], and obtain a word error rate (WER) of 10.9% on the Wall Street Journal (WSJ) dataset. [28] proposes the Conformer, a network structure combining CNN and Transformer, which further improves model performance.
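The core operation that replaces recurrence is the scaled dot-product attention of [26]:
$$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V,$$
where $Q$, $K$, and $V$ are the query, key, and value matrices and $d_k$ is the key dimension; since all positions are computed in parallel, training no longer unrolls over time.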
The joint CTC-Attention model with the multi-task learning framework produces two decodable branches: CTC and attention. The attention branch often outperforms CTC, because CTC requires conditional independence assumptions to obtain the label sequence probabilities. However, the attention branch has the drawback that decoding is autoregressive: each token is predicted from the embeddings of the previously emitted tokens, so the computational cost is large and decoding cannot be parallelized. Many studies have been conducted to tackle this drawback.
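To make the drawback concrete, the following is a minimal sketch of greedy AED decoding; the `decoder` callable and the token values are hypothetical placeholders, not part of any cited system. The point is the sequential loop: step $l$ cannot start before token $l-1$ is produced, whereas CTC emits all frame posteriors in one parallel pass.

```python
def greedy_aed_decode(decoder, enc_out, sos=0, eos=1, max_len=100):
    """Greedy autoregressive decoding: inherently sequential.

    `decoder(enc_out, prefix)` is a hypothetical callable returning the
    most likely next token id given the encoder output and the token
    history; each step must wait for the previous one to finish.
    """
    ys = [sos]
    for _ in range(max_len):
        y = decoder(enc_out, ys)  # re-attends to the full history
        if y == eos:
            break
        ys.append(y)
    return ys[1:]  # strip the start-of-sequence token
```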
In the joint CTC-Attention model, the encoder acts like an acoustic model, and the decoder acts like a language model. Although the fusion and joint training of these two models bring quite a significant improvement, the training corpus of the decoder