Linguistic-Enhanced Transformer with CTC
Embedding for Speech Recognition
Xulong Zhang, Jianzong Wang∗, Ning Cheng, Mengyuan Zhao, Zhiyong Zhang, Jing Xiao
Ping An Technology (Shenzhen) Co., Ltd.
∗Corresponding author: Jianzong Wang, jzwang@188.com.
Abstract—The recent emergence of the joint CTC-Attention model shows significant improvement in automatic speech recognition (ASR). The improvement largely lies in the modeling of linguistic information by the decoder. The decoder, jointly optimized with an acoustic encoder, learns a language model from the ground-truth sequences in an autoregressive manner during training. However, the training corpus of the decoder is limited to the speech transcriptions, which is far smaller than the corpus needed to train an acceptable language model. This leads to poor robustness of the decoder. To alleviate this problem, we propose the linguistic-enhanced transformer, which introduces refined CTC information to the decoder during the training process, so that the decoder can be more robust. Our experiments on the AISHELL-1 speech corpus show that the character error rate (CER) is relatively reduced by up to 7%. We also find that in the joint CTC-Attention ASR model, the decoder is more sensitive to linguistic information than to acoustic information.
Index Terms—speech recognition, attention, CTC, transformer,
linguistic information
I. INTRODUCTION
The adoption of end-to-end models greatly simplifies the training process of automatic speech recognition (ASR) systems. In such models, the acoustic, pronunciation, and language modeling components are jointly optimized in a unified system. There is no need to train these components separately and then integrate them during decoding, as in a traditional hybrid system [1]. As end-to-end models have developed, they gradually exhibit performance superior to traditional hybrid systems in both accuracy and real-time factor (RTF), which has made them prevalent and a trend in the speech recognition research community.
The fundamental problem of end-to-end ASR is how to process input and output sequences of different lengths. Two main approaches have been proposed to handle this problem. One of them is connectionist temporal classification (CTC) [2], [3]. CTC introduces a special “blank” label so that the input and output sequences can be aligned; the CTC loss can then be computed efficiently using the forward-backward algorithm.
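As a reminder of the standard CTC formulation in [2], the model marginalizes over all blank-augmented alignments $\pi$ that collapse to the label sequence $\mathbf{y}$ under the mapping $\mathcal{B}$, which removes repeated labels and blanks:
$$p(\mathbf{y}\mid\mathbf{x})=\sum_{\pi\in\mathcal{B}^{-1}(\mathbf{y})}\prod_{t=1}^{T}p(\pi_t\mid\mathbf{x}),$$
where $T$ is the number of input frames; the per-frame posteriors are treated as conditionally independent given the input $\mathbf{x}$, which is exactly the assumption the attention branch avoids.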
The other framework is the attention-based encoder-decoder (AED) model [4], [5]. Furthermore, many optimizations of neural network structures and training strategies, such as Convolutional Neural Networks (CNN) [6], [7], Long Short-Term Memory (LSTM) [6], and Batch Normalization (BN) [6], have been applied to both the CTC and AED approaches.
The AED model [4], [5], [8]–[11] first achieves great success on the neural machine translation task [8], [9], [12], and then quickly expands to many other fields. Chorowski et al. introduce the AED model into speech recognition [4]. They obtain a phoneme error rate (PER) of 17.6% on the TIMIT phoneme recognition task, and solve the accuracy degradation on long utterances by adding location-awareness to the attention mechanism. LAS [5] produces character sequences without making any independence assumptions between the characters. [10] uses a WFST to decode the AED model with a word-level language model.
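In contrast to CTC, the AED decoder factorizes the posterior with the chain rule and makes no conditional independence assumption:
$$p(\mathbf{y}\mid\mathbf{x})=\prod_{l=1}^{L}p(y_l\mid y_{1},\ldots,y_{l-1},\mathbf{x}),$$
where $L$ is the output length and each factor is computed by attending over the encoder output, so every prediction is conditioned on the full label history.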
A major breakthrough is the joint CTC-Attention model [13]–[23] based on the multi-task learning (MTL) framework proposed by Watanabe, Kim et al., which fuses the advantages of the two approaches above. The CTC alignment is used as auxiliary information to assist the training of the AED model. Hence, both the robustness and the convergence speed are significantly improved. This method has gradually become the standard framework for the end-to-end ASR task. Hori et al. propose joint CTC, attention, and RNN-LM decoding [24], and introduce a word-based RNN-LM [25], which further improves the performance of end-to-end ASR.
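Concretely, the MTL objective interpolates the two losses with a weight $\lambda\in[0,1]$, following the formulation of [13]:
$$\mathcal{L}_{\mathrm{MTL}}=\lambda\,\mathcal{L}_{\mathrm{CTC}}+(1-\lambda)\,\mathcal{L}_{\mathrm{att}},$$
so the monotonic CTC alignment regularizes the attention decoder throughout training.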
Another important improvement is the Transformer [26] proposed by Vaswani et al., which dispenses with the recurrent networks in the AED model and relies solely on attention mechanisms. This significantly speeds up training and saves computing resources. Dong et al. apply the Transformer to the ASR task [27], and obtain a word error rate (WER) of 10.9% on the Wall Street Journal (WSJ) dataset. [28] proposes the Conformer, a network structure combining CNN and Transformer, which further improves model performance.
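The core operation that replaces recurrence is the scaled dot-product attention of [26]:
$$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V,$$
where $Q$, $K$, and $V$ are the query, key, and value matrices and $d_k$ is the key dimension; since all positions are computed in parallel, training no longer unrolls over time.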
The joint CTC-Attention model with the multi-task learning framework produces two decodable branches: CTC and attention. The attention branch often outperforms CTC, because CTC requires conditional independence assumptions to obtain the label sequence probabilities. However, the attention branch has the drawback that decoding is autoregressive: each token is predicted from the embeddings of the previously emitted tokens, so the computational cost is large and decoding cannot be parallelized. Many studies have been conducted to tackle this drawback.
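To make the drawback concrete, the following is a minimal sketch of greedy AED decoding; the `decoder` callable and the token values are hypothetical placeholders, not part of any cited system. The point is the sequential loop: step $l$ cannot start before token $l-1$ is produced, whereas CTC emits all frame posteriors in one parallel pass.

```python
def greedy_aed_decode(decoder, enc_out, sos=0, eos=1, max_len=100):
    """Greedy autoregressive decoding: inherently sequential.

    `decoder(enc_out, prefix)` is a hypothetical callable returning the
    most likely next token id given the encoder output and the token
    history; each step must wait for the previous one to finish.
    """
    ys = [sos]
    for _ in range(max_len):
        y = decoder(enc_out, ys)  # re-attends to the full history
        if y == eos:
            break
        ys.append(y)
    return ys[1:]  # strip the start-of-sequence token
```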
In the joint CTC-Attention model, the encoder acts like an acoustic model, and the decoder acts like a language model. Although the fusion and joint training of these two models bring quite a significant improvement, the training corpus of the decoder