representation learning method. Fig. 1 depicts the architecture
of the CAKT model. Extensive experiments are conducted
on the AISHELL-1 and AISHELL-2 datasets, and the CAKT
yields about 14% and 5% relative improvements over the
baseline system, respectively. Furthermore, we will release
the pre-trained Mandarin Chinese wav2vec2.0 model, the first
publicly available speech representation model trained on the
AISHELL-2 dataset, to the community.
2. RELATED WORKS
Large-scale pre-trained models have attracted much attention in recent years because they are trained on unlabeled data and achieve superior results on several downstream tasks after simple fine-tuning with only a small amount of task-oriented labeled data. In the context of natural language processing, pre-trained language models, such as bidirectional encoder representations from Transformers (BERT) [10] and its variants [11, 12], are representative examples. These models have demonstrated their effectiveness in information retrieval, text summarization, and question answering, to name just a few tasks.
Because of this success, previous studies have investigated leveraging pre-trained language models to enhance the performance of ASR. On the one hand, several studies directly employ a pre-trained language model as a component of the ASR model [13, 14, 15, 16, 17, 18, 19]. Although such designs are straightforward, they can obtain satisfactory performance. However, these models often slow down decoding and usually have a large number of model parameters. On the other hand, another line of research makes the ASR model learn linguistic information from pre-trained language models in a teacher-student training manner [20, 21, 7, 22, 23]. These models retain fast decoding speeds, but their improvements are usually incremental.
Apart from natural language processing, self-supervised speech representation learning has become a prominent research subject in the speech processing community. Representative models include wav2vec [24], vq-wav2vec [25], wav2vec2.0 [9], HuBERT [26], and so forth. These methods are usually trained on unlabeled data in a self-supervised manner and concentrate on deriving informative acoustic representations for a given speech utterance. Downstream tasks can then be handled by simply fine-tuning a few additional neural network layers with only a small amount of task-oriented labeled data. Several studies have explored novel ways to build ASR systems on top of pre-trained speech representation learning models. The most straightforward method is to employ them as an acoustic feature encoder and stack a simple neural network layer on top of the encoder to perform speech recognition [9]. Subsequently, some studies have presented various cascade methods that concatenate pre-trained language and speech representation learning models for ASR [14, 15, 17, 18]. Although these methods have proven their capabilities and effectiveness on benchmark corpora, their complicated model architectures and/or large numbers of model parameters usually make them hard to use in practice.
3. PROPOSED METHODOLOGY
3.1. Vanilla wav2vec2.0-CTC ASR Model
Among the self-supervised speech representation learning methods, wav2vec can be regarded as a pioneering study in this line of research. In this study, we therefore employ its advanced variant, wav2vec2.0, as the cornerstone. The wav2vec2.0 model consists of a CNN-based feature encoder and a contextualized acoustic representation extractor built from multi-layer Transformers. More formally, it takes a raw speech waveform $X$ as input and outputs a set of acoustic representations $H^X_L$:
$$H^X_0 = \mathrm{CNN}(X), \tag{1}$$
$$H^X_l = \mathrm{Transformer}_l(H^X_{l-1}), \tag{2}$$
where $l \in \{1, \cdots, L\}$ denotes the layer index of the Transformer stack and $H^X_l \in \mathbb{R}^{d \times T}$ represents a set of $T$ feature vectors of dimension $d$. To construct an ASR model, a layer normalization (LN) layer, a linear layer, and a softmax activation function are sequentially stacked on top of wav2vec2.0:
$$H^X = \mathrm{LN}(H^X_L), \tag{3}$$
$$\hat{Y} = \mathrm{Softmax}(\mathrm{Linear}(H^X)). \tag{4}$$
Consequently, the CTC loss $\mathcal{L}_{\mathrm{CTC}}$ is used to guide the model training toward minimizing the difference between the prediction $\hat{Y}$ and the ground truth $Y$. We denote this straightforward wav2vec2.0-CTC ASR model as w2v2-CTC.
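To make the w2v2-CTC formulation concrete, the following is a minimal PyTorch sketch assuming the HuggingFace Transformers implementation of wav2vec2.0 (Wav2Vec2Model); the checkpoint name, vocabulary size, and dummy inputs are illustrative placeholders rather than the exact settings used in our experiments.

import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class W2V2CTC(nn.Module):
    """Sketch of Eqs. (1)-(4): wav2vec2.0 encoder + LN + linear + softmax."""
    def __init__(self, pretrained="facebook/wav2vec2-base", vocab_size=5000):
        super().__init__()
        # CNN feature encoder followed by L Transformer layers (Eqs. 1-2)
        self.encoder = Wav2Vec2Model.from_pretrained(pretrained)
        d = self.encoder.config.hidden_size
        self.ln = nn.LayerNorm(d)               # Eq. 3
        self.linear = nn.Linear(d, vocab_size)  # Eq. 4

    def forward(self, waveform):                # waveform: (B, samples)
        h = self.encoder(waveform).last_hidden_state    # H^X_L: (B, T, d)
        return self.linear(self.ln(h)).log_softmax(-1)  # log of Eq. 4

# CTC training step; nn.CTCLoss expects log-probabilities shaped (T, B, V).
model = W2V2CTC()
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)
log_probs = model(torch.randn(2, 16000)).transpose(0, 1)
targets = torch.randint(1, 100, (2, 12))                # dummy token ids
loss = ctc_loss(log_probs, targets,
                torch.full((2,), log_probs.size(0)),
                torch.full((2,), 12))

In practice, the CTC blank index and the output layer must be aligned with the tokenizer's vocabulary; the dummy lengths above simply mark every frame and every label position as valid.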
3.2. Token-dependent Knowledge Transferring Module
Extending the vanilla w2v2-CTC ASR model, we present a token-dependent knowledge transferring module and a context-aware training strategy that not only transfer linguistic information from a pre-trained language model to the ASR model but also mitigate the limitations caused by the conditional independence assumption. Since the classic BERT model remains the most popular, we use it as an example to instantiate the framework.
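As a rough illustration of the teacher side of this framework, the sketch below extracts token-level contextualized representations of a gold transcript from a pre-trained BERT model; the bert-base-chinese checkpoint and the example transcript are assumptions for illustration only, and the exact representations used for distillation are determined by the token-dependent module described next.

import torch
from transformers import BertModel, BertTokenizer

# Assumed teacher checkpoint; any pre-trained Chinese BERT could be substituted.
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
teacher = BertModel.from_pretrained("bert-base-chinese")
teacher.eval()  # the teacher is kept frozen while its knowledge is transferred

transcript = "今天天气很好"  # hypothetical gold transcript Y of one utterance
inputs = tokenizer(transcript, return_tensors="pt")
with torch.no_grad():
    # Token-level contextualized representations, shape (1, N+2, 768);
    # BERT's own [CLS]/[SEP] positions differ from the [BOS]/[EOS] tokens
    # added on the ASR side below.
    teacher_states = teacher(**inputs).last_hidden_state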
In order to distill knowledge from BERT, a token-dependent knowledge transferring module, which is mainly based on multi-head attention, is introduced. First of all, for each training speech utterance $X$, special tokens [BOS] and [EOS] are added at the beginning and the end of its corresponding gold token sequence $Y = \{y_1, \cdots, y_N\}$. Then, following the seminal literature, we sum each token embedding with its absolute sinusoidal positional embedding [27], which distinguishes the order of the tokens in the sequence. The set of resulting vectors $E = \{e_{[\mathrm{BOS}]}, e_{y_1}, \cdots, e_{y_N}, e_{[\mathrm{EOS}]}\}$ and high-level