
GUIDED CONTRASTIVE SELF-SUPERVISED PRE-TRAINING
FOR AUTOMATIC SPEECH RECOGNITION
Aparna Khare1, Minhua Wu1, Saurabhchand Bhati2*, Jasha Droppo1, Roland Maas1
1Amazon Alexa, USA
2Johns Hopkins University, USA
*The author contributed to this work during an internship at Amazon.
ABSTRACT
Contrastive Predictive Coding (CPC) is a representation
learning method that maximizes the mutual information be-
tween intermediate latent representations and the output of
a given model. It can be used to effectively initialize the
encoder of an Automatic Speech Recognition (ASR) model.
We present a novel modification of CPC called Guided Con-
trastive Predictive Coding (GCPC). Our proposed method
maximizes the mutual information between representations
from a prior-knowledge model and the output of the model
being pre-trained, allowing prior knowledge injection dur-
ing pre-training. We validate our method on 3 ASR tasks:
German, French and English. Our method outperforms CPC pre-training on all three datasets, reducing the Word Error Rate (WER) relative to training from scratch by 4.44%, 6.55% and 15.43% on the German, French and English (Librispeech) tasks, respectively, while CPC pre-training brings only 2.96%, 1.01% and 14.39% relative WER reductions.
Index Terms—Self-supervised learning, RNN-T, ASR
1. INTRODUCTION
Self-supervised Learning (SSL) has drawn a lot of recent at-
tention in the machine learning community. After its success-
ful applications in the natural language processing domain
[1, 2, 3], it has also become an active research area for speech
processing.
One of the main categories of SSL methods learns representations by reconstructing the signal, for example full reconstruction with autoencoders [4, 5], future reconstruction with Autoregressive Predictive Coding (APC) [6], and masked reconstruction [7, 8, 9]. Instead of reconstructing the exact signal, HuBERT [10] learns representations by utilizing an offline clustering step to provide aligned target labels for a masked prediction loss. Another category of SSL methods in the literature learns representations through a contrastive loss that distinguishes a true future audio sample from a set of negative examples, such as the Contrastive Predictive Coding (CPC) model [11] and wav2vec [12]. Vq-wav2vec [13] adds a vector quantization module on top of the contrastive loss to learn discrete representations, and wav2vec 2.0 [14] minimizes a contrastive loss defined over contextual representations in the masked region. In addition, w2v-BERT [15] combines the two categories by optimizing two self-supervised losses simultaneously (the contrastive loss and a masked language modeling loss).
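For reference, the contrastive objective used by CPC-style models (the InfoNCE loss) can be written as follows, where c_t denotes the context representation at time t, z_{t+k} the true latent representation k steps ahead, W_k a step-specific projection, and Z a set containing z_{t+k} together with negative samples; the notation here is generic rather than taken verbatim from any one of the cited papers:

\mathcal{L}_k = -\,\mathbb{E}\left[\log \frac{\exp\!\left(z_{t+k}^{\top} W_k\, c_t\right)}{\sum_{\tilde{z} \in Z} \exp\!\left(\tilde{z}^{\top} W_k\, c_t\right)}\right]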
All of these methods learn representations from the acous-
tic data distribution only, which may not be optimal for the
downstream ASR task. More recently, Wang et al. [16] propose two supervision-guided codebook generation approaches to obtain better pre-trained embeddings for the downstream ASR task. On top of HuBERT pre-training, one approach uses phoneme alignments as training targets; the other performs K-means clustering on supervised speech features extracted from an end-to-end CTC model [17]. However, that work focuses on masked-prediction self-supervised learning, and all ASR experiments are conducted on the Librispeech dataset with just a few hundred hours of labeled data. In our work, we instead focus on a contrastive-loss-based SSL method and experiment with large-scale datasets.
We propose to introduce weak guidance to improve align-
ment between the learned representations and the downstream
task. The weak guidance is provided in the form of posteriors
from a prior-knowledge model trained on a small labeled dataset; this is discussed in detail in Section 2.2.
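As a rough illustration (a minimal sketch only; the exact formulation follows in Section 2.2, and the function name, tensor shapes and the use of frame-level posteriors as guidance here are illustrative assumptions), the idea can be pictured as an InfoNCE-style loss in which time-aligned outputs of the frozen prior-knowledge model serve as the positive targets for the model being pre-trained:

import torch
import torch.nn.functional as F

def guided_contrastive_loss(context, guidance, num_negatives=10):
    # context:  (T, D) projected outputs of the model being pre-trained.
    # guidance: (T, D) time-aligned representations (e.g. frame-level
    #           posteriors) from the frozen prior-knowledge model.
    T = context.size(0)
    # Positive score: similarity with the time-aligned guidance vector.
    pos = torch.sum(context * guidance, dim=-1, keepdim=True)        # (T, 1)
    # Negative scores: guidance vectors sampled from other frames.
    neg_idx = torch.randint(0, T, (T, num_negatives), device=context.device)
    neg = torch.einsum('td,tnd->tn', context, guidance[neg_idx])     # (T, N)
    logits = torch.cat([pos, neg], dim=-1)                           # (T, 1 + N)
    # The positive pair is always at index 0.
    targets = torch.zeros(T, dtype=torch.long, device=context.device)
    return F.cross_entropy(logits, targets)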
To combine self-supervised and supervised training and improve performance on the final ASR task, most existing methods in the literature adopt a two-stage scheme, where only the self-supervised loss is optimized in the first (pre-training) stage and the supervised loss is optimized in the second stage.
Wav2vec [12] and vq-wav2vec [13] build the wav2letter [18] acoustic model using the pre-trained embeddings as input features in place of log-mel filterbanks. Wav2vec 2.0 [14] and HuBERT [10] pre-train a Transformer-based encoder with the self-supervised loss, add a randomly initialized output layer on top, and fine-tune with the CTC loss [17]. More
recent research has shown that joint training with both su-
pervised and unsupervised losses during the pre-training/fine-
tuning stage or as a single training process helps improve the
ASR performance. The initial UniSpeech work [19] demonstrates that representations learned during pre-training can be improved if the self-supervised contrastive loss is combined with a phonetic CTC loss, and the follow-up UniSpeech at scale work [20] demonstrates better representations from the pre-