representation learning method. Fig. 1 depicts the architecture
of the CAKT model. Extensive experiments are conducted
on the AISHELL-1 and AISHELL-2 datasets, and the CAKT
yields about 14% and 5% relative improvements over the
baseline system, respectively. Furthermore, we will release
the pre-trained Mandarin Chinese wav2vec2.0 model, the first
publicly available speech representation model trained on the
AISHELL-2 dataset, to the community.
2. RELATED WORKS
Large-scale pre-trained models have attracted much attention in recent years because they are trained on unlabeled data and achieve superior results on several downstream tasks after simple fine-tuning with only a small amount of task-oriented labeled data. In the context of natural language processing, pre-trained language models, such as bidirectional encoder representations from Transformers (BERT) [10] and its variants [11, 12], are representative examples. These models have demonstrated their effectiveness in information retrieval, text summarization, and question answering, to name just a few tasks.
Because of this success, previous studies have investigated leveraging pre-trained language models to enhance the performance of ASR. On the one hand, several studies directly employ a pre-trained language model as a component of the ASR model [13, 14, 15, 16, 17, 18, 19]. Although such designs are straightforward, they can obtain satisfactory performance. However, these models often slow down decoding and usually have a large number of model parameters. On the other hand, another line of research makes the ASR model learn linguistic information from pre-trained language models in a teacher-student training manner [20, 21, 7, 22, 23]. These models retain fast decoding speeds, but their improvements are usually incremental.
Apart from natural language processing, self-supervised speech representation learning has become a prominent research subject in the speech processing community. Representative models include wav2vec [24], vq-wav2vec [25], wav2vec2.0 [9], HuBERT [26], and so forth. These methods are usually trained on unlabeled data in a self-supervised manner and concentrate on deriving informative acoustic representations for a given speech utterance. Downstream tasks can then be handled by simply fine-tuning a few additional neural network layers with only a small amount of task-oriented labeled data. Several studies have explored novel ways to build ASR systems on top of pre-trained speech representation learning models. The most straightforward method is to employ them as an acoustic feature encoder and stack a simple neural network layer on top of the encoder to perform speech recognition [9]. Subsequently, some studies have presented various cascade methods that concatenate pre-trained language and speech representation learning models for ASR [14, 15, 17, 18]. Although these methods have proven their capabilities and effectiveness on benchmark corpora, their complicated model architectures and/or large numbers of model parameters usually make them hard to use in practice.
3. PROPOSED METHODOLOGY
3.1. Vanilla wav2vec2.0-CTC ASR Model
Among the self-supervised speech representation learning methods, wav2vec can be regarded as a pioneering study in this line of research. In this study, we therefore employ its advanced variant, wav2vec2.0, as the cornerstone. The wav2vec2.0 model consists of a CNN-based feature encoder and a contextualized acoustic representation extractor built from multi-layer Transformers. More formally, it takes a raw speech waveform $X$ as input and outputs a set of acoustic representations $H^X_L$:
$$H^X_0 = \mathrm{CNN}(X), \tag{1}$$
$$H^X_l = \mathrm{Transformer}_l(H^X_{l-1}), \tag{2}$$
where $l \in \{1, \cdots, L\}$ denotes the layer index of the Transformer stack and $H^X_l \in \mathbb{R}^{d \times T}$ represents a set of $T$ feature vectors of dimension $d$. To construct an ASR model, a layer normalization (LN) layer, a linear layer, and a softmax activation function are sequentially stacked on top of wav2vec2.0:
$$H^X = \mathrm{LN}(H^X_L), \tag{3}$$
$$\hat{Y} = \mathrm{Softmax}(\mathrm{Linear}(H^X)). \tag{4}$$
Consequently, the CTC loss $\mathcal{L}_{\mathrm{CTC}}$ is used to guide the model training toward minimizing the difference between the prediction $\hat{Y}$ and the ground truth $Y$. We denote this straightforward wav2vec2.0-CTC ASR model as w2v2-CTC.
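To make the w2v2-CTC formulation concrete, the following is a minimal PyTorch sketch assuming the HuggingFace Transformers implementation of wav2vec2.0 (Wav2Vec2Model); the checkpoint name, vocabulary size, and dummy inputs are illustrative placeholders rather than the exact settings used in our experiments.

import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class W2V2CTC(nn.Module):
    """Sketch of Eqs. (1)-(4): wav2vec2.0 encoder + LN + linear + softmax."""
    def __init__(self, pretrained="facebook/wav2vec2-base", vocab_size=5000):
        super().__init__()
        # CNN feature encoder followed by L Transformer layers (Eqs. 1-2)
        self.encoder = Wav2Vec2Model.from_pretrained(pretrained)
        d = self.encoder.config.hidden_size
        self.ln = nn.LayerNorm(d)               # Eq. 3
        self.linear = nn.Linear(d, vocab_size)  # Eq. 4

    def forward(self, waveform):                # waveform: (B, samples)
        h = self.encoder(waveform).last_hidden_state    # H^X_L: (B, T, d)
        return self.linear(self.ln(h)).log_softmax(-1)  # log of Eq. 4

# CTC training step; nn.CTCLoss expects log-probabilities shaped (T, B, V).
model = W2V2CTC()
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)
log_probs = model(torch.randn(2, 16000)).transpose(0, 1)
targets = torch.randint(1, 100, (2, 12))                # dummy token ids
loss = ctc_loss(log_probs, targets,
                torch.full((2,), log_probs.size(0)),
                torch.full((2,), 12))

In practice, the CTC blank index and the output layer must be aligned with the tokenizer's vocabulary; the dummy lengths above simply mark every frame and every label position as valid.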
3.2. Token-dependent Knowledge Transferring Module
Extending the vanilla w2v2-CTC ASR model, we present a token-dependent knowledge transferring module and a context-aware training strategy that not only transfer linguistic information from a pre-trained language model to the ASR model but also mitigate the limitations caused by the conditional independence assumption. Since the classic BERT model remains the most popular, we use it as an example to instantiate the framework.
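As a rough illustration of the teacher side of this framework, the sketch below extracts token-level contextualized representations of a gold transcript from a pre-trained BERT model; the bert-base-chinese checkpoint and the example transcript are assumptions for illustration only, and the exact representations used for distillation are determined by the token-dependent module described next.

import torch
from transformers import BertModel, BertTokenizer

# Assumed teacher checkpoint; any pre-trained Chinese BERT could be substituted.
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
teacher = BertModel.from_pretrained("bert-base-chinese")
teacher.eval()  # the teacher is kept frozen while its knowledge is transferred

transcript = "今天天气很好"  # hypothetical gold transcript Y of one utterance
inputs = tokenizer(transcript, return_tensors="pt")
with torch.no_grad():
    # Token-level contextualized representations, shape (1, N+2, 768);
    # BERT's own [CLS]/[SEP] positions differ from the [BOS]/[EOS] tokens
    # added on the ASR side below.
    teacher_states = teacher(**inputs).last_hidden_state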
In order to distill knowledge from BERT, a token-dependent knowledge transferring module, which is mainly based on multi-head attention, is introduced. First of all, for each training speech utterance $X$, special tokens [BOS] and [EOS] are added at the beginning and the end of its corresponding gold token sequence $Y = \{y_1, \cdots, y_N\}$. Then, following the seminal literature, we sum each token embedding with its absolute sinusoidal positional embedding [27], which distinguishes the order of the tokens in the sequence. The set of resulting vectors $E = \{e_{[\mathrm{BOS}]}, e_{y_1}, \cdots, e_{y_N}, e_{[\mathrm{EOS}]}\}$ and high-level