Linguistic-Enhanced Transformer with CTC
Embedding for Speech Recognition
Xulong Zhang, Jianzong Wang, Ning Cheng, Mengyuan Zhao, Zhiyong Zhang, Jing Xiao
Ping An Technology (Shenzhen) Co., Ltd.
Corresponding author: Jianzong Wang, jzwang@188.com.
Abstract—The recent emergence of the joint CTC-Attention model shows significant improvement in automatic speech recognition (ASR). The improvement largely lies in the modeling of linguistic information by the decoder. The decoder, joint-optimized with an acoustic encoder, learns the language model from ground-truth sequences in an autoregressive manner during training. However, the training corpus of the decoder is limited to the speech transcriptions, which is far less than the corpus needed to train an acceptable language model. This leads to poor robustness of the decoder. To alleviate this problem, we propose the linguistic-enhanced transformer, which introduces refined CTC information to the decoder during the training process, so that the decoder can be more robust. Our experiments on the AISHELL-1 speech corpus show that the character error rate (CER) is relatively reduced by up to 7%. We also find that in the joint CTC-Attention ASR model, the decoder is more sensitive to linguistic information than to acoustic information.
Index Terms—speech recognition, attention, CTC, transformer,
linguistic information
I. INTRODUCTION
The adoption of end-to-end models greatly simplifies the training process of automatic speech recognition (ASR) systems. In such models, the acoustic, pronunciation, and language modeling components are jointly optimized in a unified system. There is no need to train these components separately and then integrate them during decoding, as in a traditional hybrid system [1]. As end-to-end models have developed, they have gradually exhibited superior performance to traditional hybrid systems in both accuracy and real-time factor (RTF), which has made them a prevailing trend in the speech recognition research community.
The fundamental problem of end-to-end ASR is how to process input and output sequences of different lengths. Two main approaches have been proposed to handle this problem. One of them is connectionist temporal classification (CTC) [2], [3]. CTC introduces a special label, "blank", so that the input and output sequences can be aligned; the CTC loss can then be computed efficiently with the forward-backward algorithm.
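As a concrete illustration (not part of this paper), the CTC loss over such alignments can be computed with PyTorch's built-in torch.nn.CTCLoss, which runs the forward-backward algorithm internally; all shapes and sizes below are assumed placeholder values:

```python
import torch
import torch.nn as nn

T, N, C = 100, 4, 30   # frames, batch size, vocabulary size (assumed values)
S = 20                 # maximum target length (assumed)

log_probs = torch.randn(T, N, C).log_softmax(dim=-1)      # mock encoder outputs
targets = torch.randint(1, C, (N, S), dtype=torch.long)   # labels; index 0 is blank
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.randint(10, S + 1, (N,), dtype=torch.long)

ctc = nn.CTCLoss(blank=0, zero_infinity=True)  # forward-backward runs inside
loss = ctc(log_probs, targets, input_lengths, target_lengths)
```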
The other framework is the attention-based encoder-decoder (AED) model [4], [5], [8]–[11]. This model first achieves great success on the neural machine translation task [8], [9], [12], then quickly expands to many other fields. Chorowski et al. introduce the AED model into speech recognition [4]. They get a PER of 17.6% on the TIMIT phoneme recognition task, and solve the problem of accuracy degradation on long utterances by adding location-awareness to the attention mechanism. LAS [5] produces character sequences without making any independence assumptions between the characters. [10] uses WFST to decode the AED model with a word-level language model. Furthermore, many optimizations of neural network structures and strategies, such as Convolutional Neural Networks (CNN) [6], [7], Long Short-Term Memory (LSTM) [6], and Batch Normalization (BN) [6], have been applied to both the CTC and AED approaches.
A major breakthrough is the joint CTC-Attention model [13]–[23] based on the multi-task learning (MTL) framework proposed by Watanabe, Kim, et al., which fuses the advantages of the two approaches above. The CTC alignment is used as auxiliary information to assist the training of the AED model. Hence, both the robustness and the convergence speed are significantly improved. This method has gradually become the standard framework for the end-to-end ASR task. Hori et al. propose joint CTC, attention, and RNN-LM decoding [24], and introduce a word-based RNN-LM [25], which further improves the performance of end-to-end ASR.
Another important improvement is the Transformer [26] proposed by Vaswani et al., which dispenses with the recurrent network in the AED model and relies solely on attention mechanisms. This significantly speeds up training and saves computing resources. Dong et al. apply the Transformer to the ASR task [27] and get a word error rate (WER) of 10.9% on the Wall Street Journal (WSJ) dataset. [28] proposes the Conformer, a network structure combining CNN and Transformer, which further improves model performance.
The joint CTC-Attention model with the multi-task learning framework produces two decodable branches: CTC and attention. The attention branch often outperforms CTC, because CTC requires conditional independence assumptions to obtain the label sequence probabilities. However, the attention branch has a drawback: decoding is autoregressive, so the computational cost is large and decoding cannot be processed in parallel. Many studies have been conducted to tackle this drawback.
In the joint CTC-Attention model, the encoder acts like an acoustic model, and the decoder acts like a language model. Although the fusion and joint training of these two models bring quite a significant improvement, the training corpus of the decoder (language model) is limited to the speech transcriptions, which is far less than the corpus needed to train an acceptable language model. We find that the involvement of CTC gives us the opportunity to generate more linguistic features to assist the training of the decoder. In this work, we propose the linguistic-enhanced transformer, a simple training method which introduces linguistic information refined from the CTC encoder to the AED decoder. In this way, the CTC branch contributes more linguistic information to the decoder during training. On the other hand, compared with the baseline approach, which takes linguistic targets only from the ground truth, this method introduces some "errors" into the linguistic targets during the training of the AED decoder. This reduces the inconsistency between the training and decoding processes. Therefore, the decoder is trained to be more robust and brings a performance improvement over the baseline system.
II. METHODOLOGY
A. Embedding Fusion
As shown in Fig. 1, the basic idea of embedding fusion (EF) is to directly push the CTC 1-best into the attention decoder's word-embedding layer, just like the ground-truth transcription.
Fig. 1. Embedding Fusion (EF): CTC 1-best and ground truth share the word embedding of the decoder.
We define the output of the acoustic encoder $h_s$ as:

$$h_s = E(x) \quad (1)$$

where $E(\cdot)$ is the encoding function. We apply a Linear layer and a Softmax layer to $h_s$ to get the CTC posterior, and invoke greedy search to get the CTC 1-best $W = (w_1, \cdots, w_L)$:

$$W = G(\mathrm{Softmax}(\mathrm{Linear}(h_s))) \quad (2)$$

where $G(\cdot)$ refers to the greedy search algorithm. Then we fuse the word embeddings of the ground-truth transcription and the CTC 1-best with a linear combination:

$$E_{\mathrm{fusion}} = \alpha \cdot \mathcal{E}(W) + (1 - \alpha) \cdot \mathcal{E}(y) \quad (3)$$

where $\mathcal{E}(\cdot)$ refers to the word-embedding function, $W$ is the CTC 1-best, $y$ is the ground-truth transcription, and $\alpha$ is a tunable parameter that controls the weights of the two text sources. Finally, we feed $E_{\mathrm{fusion}}$ into the decoder layers.
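The following is a minimal PyTorch-style sketch of equations (1)-(3) for a single utterance; the module structure, parameter names, and the value of $\alpha$ are illustrative assumptions, not the exact implementation used in the experiments:

```python
import torch
import torch.nn as nn

def ctc_greedy_search(log_probs: torch.Tensor, blank: int = 0) -> torch.Tensor:
    """G(·) in Eq. (2): per-frame argmax, collapse repeats, remove blanks."""
    best = log_probs.argmax(dim=-1)          # (T,) frame-wise best labels
    tokens, prev = [], blank
    for t in best.tolist():
        if t != blank and t != prev:
            tokens.append(t)
        prev = t
    return torch.tensor(tokens, dtype=torch.long)

class EmbeddingFusion(nn.Module):
    """Sketch of Eqs. (1)-(3); names and structure are assumptions."""
    def __init__(self, d_model: int, vocab_size: int, alpha: float = 0.5):
        super().__init__()
        self.ctc_head = nn.Linear(d_model, vocab_size)  # Linear layer of Eq. (2)
        self.embed = nn.Embedding(vocab_size, d_model)  # shared word embedding
        self.alpha = alpha                              # fusion weight α

    def forward(self, h_s: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        # h_s: (T, d_model) encoder output of Eq. (1); y: (U,) ground-truth ids.
        log_probs = self.ctc_head(h_s).log_softmax(dim=-1)  # CTC posterior
        W = ctc_greedy_search(log_probs)                    # Eq. (2): CTC 1-best
        if W.size(0) != y.size(0):
            # Lengths differ: the length rules described below decide the input;
            # here we simply fall back to the ground truth.
            return self.embed(y)
        # Eq. (3): E_fusion = α · E(W) + (1 − α) · E(y)
        return self.alpha * self.embed(W) + (1 - self.alpha) * self.embed(y)
```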
This may cause some issues. For example, if the greedy CTC output has a different length from the ground-truth transcription, the two will have different shapes after word embedding, so they cannot be combined with the equation above. This is especially true at the beginning of training, when the model is not yet well trained and its CTC output is disordered.
To solve this problem, we add some rules to our training process. The basic idea is as follows: if the CTC 1-best and the ground-truth transcription have identical lengths, we combine them using the equation above; if their lengths are not equal but relatively close, we use the CTC 1-best as the decoder input; and if their lengths are too far apart, we feed the ground truth to the decoder, just like the standard Transformer. Obviously, we need to define some rules to judge what is "relatively close" and what is "far apart". We design two types of thresholds. The first is an absolute threshold: let $T_l$ denote the maximum allowed length difference between the ground-truth transcription and the CTC 1-best; if $|L_{ctc} - L_{groundtruth}| \le T_l$, we consider their lengths relatively close. The other is a relative threshold $T_r$, which takes the lengths of the ground-truth sentences into account: $|L_{ctc} - L_{groundtruth}| / L_{groundtruth} \le T_r$ indicates that the lengths are relatively close.
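These rules can be summarized in a short sketch; the threshold values below are arbitrary placeholders, not the tuned values from the experiments:

```python
def choose_decoder_input(L_ctc: int, L_gt: int,
                         T_l: int = 2, T_r: float = 0.2) -> str:
    """Decide the decoder input from the length rules above.

    T_l is the absolute threshold, T_r the relative threshold;
    both default values here are assumptions.
    """
    diff = abs(L_ctc - L_gt)
    if diff == 0:
        return "fuse"          # identical lengths: combine embeddings via Eq. (3)
    if diff <= T_l or diff / L_gt <= T_r:
        return "ctc_1best"     # relatively close: feed the CTC 1-best
    return "ground_truth"      # too far apart: standard Transformer training
```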
B. Aligned Embedding Fusion
Inspired by [19], we find that even though we add some rules to prevent poor CTC output from being fed to the AED decoder, there is still a defect in the EF method. The problem is that the CTC 1-best and the ground-truth transcription are not aligned, so if we set them as the input and output of the AED decoder respectively, the AED decoder may see wrong training samples. If the CTC 1-best has the same length as the ground truth and contains only substitution errors, this problem does not show up; in this situation, the AED decoder is trained to be a "corrector". But when the lengths of the CTC 1-best and the ground truth differ, or they have the same length but their texts are dislocated (that is, when the CTC 1-best has insertion or deletion errors compared with the ground truth), this problem becomes obvious.
For example, suppose the ground truth is $y = \{A, B, C, A\}$ and the CTC greedy output is $W = \{A, C, A\}$. In step 2 of decoder training, in the normal case, we feed $\{sos, A, B\}$ into the decoder and expect it to predict $\{C\}$ as the next token. If we use the CTC 1-best as input instead, we feed $\{sos, A, C\}$ and still expect it to predict $\{C\}$. This is a wrong training sample.
To alleviate this shortcoming, we propose aligned embedding fusion (AEF), which aligns the CTC 1-best and the ground truth with an edit-distance algorithm and inserts a special symbol "blank" at the corresponding positions:

$$(y_{align}, W_{align}) = A_{text}(y, W) \quad (4)$$
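Below is a sketch of one standard way to implement $A_{text}(\cdot)$ via a Levenshtein alignment with backtrace; the paper's exact tie-breaking rules may differ:

```python
def align_with_blanks(y, W, blank="<b>"):
    """A_text(y, W): align ground truth y and CTC 1-best W by edit distance,
    inserting `blank` so both aligned sequences have equal length."""
    m, n = len(y), len(W)
    # Edit-distance DP table
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if y[i - 1] == W[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion error in W
                          d[i][j - 1] + 1,         # insertion error in W
                          d[i - 1][j - 1] + cost)  # match / substitution
    # Backtrace, inserting blanks where the sequences are misaligned
    y_al, W_al, i, j = [], [], m, n
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                d[i][j] == d[i - 1][j - 1] + (0 if y[i - 1] == W[j - 1] else 1)):
            y_al.append(y[i - 1]); W_al.append(W[j - 1]); i -= 1; j -= 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            y_al.append(y[i - 1]); W_al.append(blank); i -= 1
        else:
            y_al.append(blank); W_al.append(W[j - 1]); j -= 1
    return y_al[::-1], W_al[::-1]

# The example above: y = {A, B, C, A}, W = {A, C, A}
# yields y_align = [A, B, C, A], W_align = [A, <b>, C, A]
print(align_with_blanks(list("ABCA"), list("ACA")))
```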