A CONTEXT-AWARE KNOWLEDGE TRANSFERRING STRATEGY FOR CTC-BASED ASR
Ke-Han Lu, and Kuan-Yu Chen
National Taiwan University of Science and Technology, Taiwan
khlu@nlp.csie.ntust.edu.tw, kychen@mail.ntust.edu.tw
ABSTRACT
Non-autoregressive automatic speech recognition (ASR)
modeling has received increasing attention recently because
of its fast decoding speed and superior performance. Among
representatives, methods based on the connectionist temporal
classification (CTC) remain a dominant stream. However, an inherent theoretical flaw, the conditional independence assumption between tokens, creates a performance barrier for this family of models. To mitigate the challenge, we propose a
context-aware knowledge transferring strategy, consisting of
a knowledge transferring module and a context-aware train-
ing strategy, for CTC-based ASR. The former is designed
to distill linguistic information from a pre-trained language
model, and the latter is devised to alleviate the limitations caused by the conditional independence assumption. As a
result, a knowledge-injected context-aware CTC-based ASR
built upon the wav2vec2.0 is presented in this paper. A series
of experiments on the AISHELL-1 and AISHELL-2 datasets
demonstrate the effectiveness of the proposed method.
Index Terms: CTC, context-aware, knowledge transfer, ASR
1. INTRODUCTION
Automatic speech recognition (ASR) systems aim at convert-
ing a given input speech signal into its corresponding token
sequence. They are required not only to model the acoustic information in the speech signal but also to generate a precise token sequence that is faithful to the speech and contextually coherent. In recent years, connectionist
temporal classification (CTC)-based ASR systems [1] have
attracted significant attention since they can achieve a much
faster decoding speed in the non-autoregressive manner and
obtain competitive or even better performance compared to
the conventional auto-regressive models [2, 3, 4, 5, 6, 7, 8].
To be specific, a standard CTC-based ASR model usually consists of a multi-layer Transformer-based acoustic encoder and a classification head built from a few simple feed-forward layers. The acoustic encoder concentrates on encapsulating important characteristics of the input speech into a set of feature vectors, while the classification head translates this set of feature vectors into a sequence of tokens. Subsequently, a CTC loss is employed to guide the model training so as to minimize the differences between the generated token sequence and the gold reference.
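To make this structure concrete, below is a minimal PyTorch sketch of such a pipeline; the feature dimension, vocabulary size, layer counts, and the use of torch.nn.TransformerEncoder are illustrative assumptions rather than the exact configuration used in this paper.

```python
import torch
import torch.nn as nn

class CTCBasedASR(nn.Module):
    """Illustrative CTC-based ASR: Transformer acoustic encoder + feed-forward classification head."""
    def __init__(self, feat_dim=80, d_model=256, num_layers=6, vocab_size=4233):
        super().__init__()
        self.input_proj = nn.Linear(feat_dim, d_model)
        encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        # Classification head: a few simple feed-forward layers applied frame by frame.
        self.head = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, vocab_size))
        self.ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

    def forward(self, feats, feat_lens, targets, target_lens):
        h = self.encoder(self.input_proj(feats))       # (B, T, d_model)
        log_probs = self.head(h).log_softmax(dim=-1)   # (B, T, vocab)
        # nn.CTCLoss expects log-probabilities shaped (T, B, vocab).
        return self.ctc_loss(log_probs.transpose(0, 1), targets, feat_lens, target_lens)
```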
Although CTC-based ASR models have demonstrated their efficiency and effectiveness on several benchmark corpora, this family of models usually suffers from the conditional independence assumption, making it difficult to consider the relationships among tokens occurring in a sequence. Taking the wav2vec2.0-CTC model (cf. Section 3.1) as an example, we found that most of the substitution errors are mispredictions toward tokens with similar pronunciations. We also observed that the output representations generated by the encoder for tokens with similar pronunciations tend to mix together. These observations indicate that CTC-based ASR systems can learn acoustic information well but are still imperfect in learning linguistic information.
Various research has been devoted to improving CTC-
based ASR. The intermediate CTC [2] introduces auxiliary
CTC losses to the intermediate layers of the acoustic encoder.
The self-conditioned CTC [3] and its variants [4, 5] take pre-
dictions from intermediate layers as additional clues for the
following layers of the encoder. These methods mainly fo-
cus on easing the conditional independence assumption from
a theoretical perspective. The contextualized CTC loss [6] is proposed to guide the model to learn contextualized information by introducing extra prediction heads to predict surrounding
tokens. Some studies aspire to improve CTC-based ASR via
knowledge transferring from pre-trained language models [7].
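For illustration, below is a minimal sketch of the auxiliary-loss idea behind intermediate CTC [2]; self-conditioned CTC additionally feeds intermediate predictions back to later layers, which this sketch omits. The layer index, mixing weight, and function names are hypothetical.

```python
def intermediate_ctc_loss(encoder_layers, head, ctc_loss, feats, feat_lens,
                          targets, target_lens, inter_layer=3, inter_weight=0.3):
    """Sketch of intermediate CTC [2]: an auxiliary CTC loss computed on an
    intermediate encoder layer is mixed with the final CTC loss.
    The intermediate layer index and mixing weight are illustrative choices."""
    h, inter_loss = feats, None
    for i, layer in enumerate(encoder_layers, start=1):
        h = layer(h)
        if i == inter_layer:
            # Auxiliary prediction from an intermediate layer, sharing the same head.
            inter_log_probs = head(h).log_softmax(dim=-1).transpose(0, 1)  # (T, B, vocab)
            inter_loss = ctc_loss(inter_log_probs, targets, feat_lens, target_lens)
    final_log_probs = head(h).log_softmax(dim=-1).transpose(0, 1)
    final_loss = ctc_loss(final_log_probs, targets, feat_lens, target_lens)
    return (1 - inter_weight) * final_loss + inter_weight * inter_loss
```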
Following this line of research, in this study, we present
a knowledge-injected context-aware CTC-based ASR built
upon the wav2vec2.0 [9]. Specific characteristics are at least
threefold. First, a knowledge transferring module is designed
to distill linguistic information from a pre-trained language
model and inject the knowledge into the ASR model. Next, a
context-aware training strategy is proposed to relax the condi-
tional independence assumption. Finally, to enjoy the merits
of a pre-trained speech representation learning method, the
wav2vec2.0 [9] is employed as the acoustic encoder for our
ASR model. As a result, a context-aware knowledge trans-
ferred wav2vec2.0-CTC ASR (CAKT) model is proposed in
this study. It not only has a similar model size and decoding
speed to the vanilla CTC-based ASR model, but also inher-
its benefits from a pre-trained language model and a speech
representation learning method. Fig.1 depicts the architecture
of the CAKT model. Extensive experiments are conducted
on the AISHELL-1 and AISHELL-2 datasets, and the CAKT
yields about 14% and 5% relative improvements over the
baseline system, respectively. Furthermore, we will release
the pre-trained Mandarin Chinese wav2vec2.0 model, the first
publicly available speech representation model trained on the
AISHELL-2 dataset, to the community.
2. RELATED WORKS
Large-scale pre-trained models have attracted much atten-
tion in recent years because they are trained using unlabeled
data and achieve superior results on several downstream
tasks by simple fine-tuning with only a few task-oriented
labeled data. In the context of natural language processing,
the pre-trained language models, such as bidirectional en-
coder representations from Transformers (BERT) [10] and
its variants [11, 12], are representatives. These models have
demonstrated their achievements in information retrieval, text
summarization, and question answering, to name just a few.
Because of this success, previous studies have investigated leveraging pre-trained language models to enhance the performance of ASR. On the one hand, several studies directly leverage a
pre-trained language model as a portion of the ASR model
[13, 14, 15, 16, 17, 18, 19]. Although such designs are
straightforward, they can obtain satisfactory performances.
However, these models often slow down the decoding speed
and usually have a large set of model parameters. On the
other hand, another school of research trains the ASR model to learn linguistic information from pre-trained language models in a teacher-student training manner [20, 21, 7, 22, 23].
These models still obtain a fast decoding speed, but their
improvements are usually incremental.
Apart from natural language processing, self-supervised speech representation learning has emerged as an important research subject in the speech processing community. Representative
models include wav2vec [24], vq-wav2vec [25], wav2vec2.0
[9], HuBERT [26], and so forth. These methods are usually trained on unlabeled data in a self-supervised manner and concentrate on deriving informative acoustic representations for a given
speech. Downstream tasks can be done by simply fine-tuning
some additional layers of neural networks with only a few
task-oriented labeled data. Several studies have explored
novel ways to build ASR systems based on the pre-trained
speech representation learning models. The most straight-
forward method is to employ them as an acoustic feature
encoder and then stack a simple layer of neural network on
top of the encoder to do speech recognition [9]. After that,
some studies present various cascade methods to concate-
nate pre-trained language and speech representation learning
models for ASR [14, 15, 17, 18]. Although these methods
have proven their capabilities and effectiveness on bench-
mark corpora, their complicated model architectures and/or large-scale model sizes usually make them hard to use in practice.
3. PROPOSED METHODOLOGY
3.1. Vanilla wav2vec2.0-CTC ASR Model
Among the self-supervised speech representation learning methods, wav2vec can be regarded as the pioneering study in this research subject. In this study, we thus employ its advanced variant, i.e., wav2vec2.0, to serve as the cornerstone.
The wav2vec2.0 consists of a CNN-based feature encoder
and a contextualized acoustic representation extractor based
on multi-layer Transformers. More formally, it takes a raw speech $X$ as input and outputs a set of acoustic representations $H^{X}_{L}$:

$$H^{X}_{0} = \mathrm{CNN}(X), \quad (1)$$

$$H^{X}_{l} = \mathrm{Transformer}_{l}(H^{X}_{l-1}), \quad (2)$$

where $l \in \{1, \cdots, L\}$ denotes the layer number of the Transformers and $H^{X}_{l} \in \mathbb{R}^{d \times T}$ represents a set of $T$ feature vectors whose dimension is $d$. To construct an ASR model, a layer normalization (LN) layer, a linear layer and a softmax activation function are sequentially stacked on top of the wav2vec2.0:

$$H^{X} = \mathrm{LN}(H^{X}_{L}), \quad (3)$$

$$\hat{Y} = \mathrm{Softmax}(\mathrm{Linear}(H^{X})). \quad (4)$$

Consequently, the CTC loss $\mathcal{L}_{\mathrm{CTC}}$ is used to guide the model training toward minimizing the differences between the prediction $\hat{Y}$ and the ground-truth $Y$. We denote this simple but straightforward wav2vec2.0-CTC ASR model as w2v2-CTC.
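For concreteness, the following is a minimal sketch of such a w2v2-CTC model using the Hugging Face wav2vec2.0 implementation; the checkpoint name and vocabulary size are placeholders, and log_softmax replaces the softmax of Eq. (4) only because PyTorch's CTC loss consumes log-probabilities.

```python
import torch.nn as nn
from transformers import Wav2Vec2Model

class W2V2CTC(nn.Module):
    """Sketch of the vanilla w2v2-CTC model: wav2vec2.0 encoder followed by
    layer normalization, a linear layer, and a softmax, as in Eqs. (1)-(4)."""
    def __init__(self, vocab_size=4233, pretrained="facebook/wav2vec2-base"):  # placeholder checkpoint
        super().__init__()
        self.wav2vec2 = Wav2Vec2Model.from_pretrained(pretrained)  # CNN feature encoder + Transformer layers
        d = self.wav2vec2.config.hidden_size
        self.layer_norm = nn.LayerNorm(d)        # Eq. (3)
        self.linear = nn.Linear(d, vocab_size)   # Eq. (4)

    def forward(self, waveform):                 # waveform: (B, num_samples)
        h_l = self.wav2vec2(waveform).last_hidden_state  # Eqs. (1)-(2): (B, T, d)
        h = self.layer_norm(h_l)
        return self.linear(h).log_softmax(dim=-1)        # frame-level log-probabilities

# Training minimizes nn.CTCLoss between these log-probabilities (transposed to
# shape (T, B, vocab)) and the ground-truth token sequence Y.
```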
3.2. Token-dependent Knowledge Transferring Module
Extending from the vanilla w2v2-CTC ASR model, we present a token-dependent knowledge transferring module and a context-aware training strategy to not only distill linguistic information from a pre-trained language model into the ASR model but also reduce the limitations caused by the conditional independence assumption. Since the classic BERT model remains the most popular pre-trained language model, we use it as an example to instantiate the framework.
In order to distill knowledge from BERT, a token-dependent knowledge transferring module, which is mainly based on multi-head attention, is introduced. First of all, for each training speech utterance $X$, the special tokens [BOS] and [EOS] are padded at the beginning and end of its corresponding gold token sequence $Y = \{y_1, \cdots, y_N\}$.
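A minimal sketch of this input-preparation step is given below; it also previews the absolute sinusoidal positional-embedding sum described in the sentence that follows. The helper names, token ids, and embedding table are hypothetical placeholders.

```python
import math
import torch
import torch.nn as nn

def sinusoidal_positions(seq_len, dim):
    """Standard absolute sinusoidal positional embeddings [27] (assumes an even dim)."""
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)
    div = torch.exp(torch.arange(0, dim, 2, dtype=torch.float32) * (-math.log(10000.0) / dim))
    pe = torch.zeros(seq_len, dim)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

def build_token_inputs(token_ids, bos_id, eos_id, embedding: nn.Embedding):
    """Pad [BOS]/[EOS] around the gold token sequence and add sinusoidal positions,
    yielding E = {e_[BOS], e_y1, ..., e_yN, e_[EOS]}.
    `bos_id`, `eos_id`, and `embedding` are hypothetical names used for illustration."""
    padded = torch.tensor([bos_id] + list(token_ids) + [eos_id])
    e = embedding(padded)                                   # (N + 2, d)
    return e + sinusoidal_positions(e.size(0), e.size(1))
```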
Then, as with the seminal literature, we sum each token embedding with its own absolute sinusoidal positional embedding [27], which is used to distinguish the order of each token in the sequence. The set of resulting vectors $E = \{e_{[\mathrm{BOS}]}, e_{y_1}, \cdots, e_{y_N}, e_{[\mathrm{EOS}]}\}$ and high-level