GUIDED CONTRASTIVE SELF-SUPERVISED PRE-TRAINING
FOR AUTOMATIC SPEECH RECOGNITION
Aparna Khare1, Minhua Wu1, Saurabhchand Bhati2*, Jasha Droppo1, Roland Maas1
1Amazon Alexa, USA
2Johns Hopkins University, USA
*The author contributed to this work during an internship at Amazon.
ABSTRACT
Contrastive Predictive Coding (CPC) is a representation
learning method that maximizes the mutual information be-
tween intermediate latent representations and the output of
a given model. It can be used to effectively initialize the
encoder of an Automatic Speech Recognition (ASR) model.
We present a novel modification of CPC called Guided Con-
trastive Predictive Coding (GCPC). Our proposed method
maximizes the mutual information between representations
from a prior-knowledge model and the output of the model
being pre-trained, allowing prior knowledge injection dur-
ing pre-training. We validate our method on 3 ASR tasks:
German, French and English. Our method outperforms CPC
pre-training on all three datasets, reducing the Word Error
Rate (WER) by 4.44%, 6.55% and 15.43% relative on the
German, French and English (Librispeech) tasks respectively,
compared to training from scratch, while CPC pre-training
only brings 2.96%, 1.01% and 14.39% relative WER reduc-
tion respectively.
Index Terms: Self-supervised learning, RNN-T, ASR
1. INTRODUCTION
Self-supervised Learning (SSL) has drawn a lot of recent at-
tention in the machine learning community. After its success-
ful applications in the natural language processing domain
[1, 2, 3], it has also become an active research area for speech
processing.
One of the main categories of SSL methods learns representations by reconstructing the signal, such as full reconstruction with autoencoders [4, 5], future reconstruction with
Autoregressive Predictive Coding (APC) [6] and masked re-
constructions [7, 8, 9]. Instead of reconstructing the exact
signal, HuBERT [10] learns representations by utilizing an
offline clustering step to provide aligned target labels for a
masked prediction loss. Another category of SSL techniques in the literature learns representations through a contrastive
loss by distinguishing a true future audio sample from a set of
negative examples, such as the Contrastive Predictive Coding
(CPC) model [11] and wav2vec [12]. Vq-wav2vec [13] uses
a vector quantization module in addition to contrastive loss
to learn discrete representations and wav2vec 2.0 [14] mini-
mizes the contrastive loss defined over contextual representa-
tions in the masked region. In addition, w2v-BERT [15] com-
bines the two categories by optimizing two self-supervised
losses simultaneously (the contrastive loss and masked lan-
guage modeling loss).
All of these methods learn representations from the acous-
tic data distribution only, which may not be optimal for the
downstream ASR task. More recently, Wang et al. [16] propose two supervision-guided codebook generation approaches to obtain better pre-trained embeddings for the downstream ASR task: on top of HuBERT pre-training, they use phoneme alignments as training targets, and they also perform K-means clustering on supervised speech features extracted from an end-to-end CTC model [17]. However, this
work focuses on masked-prediction self-supervised learning, and all the ASR experiments are conducted on the Librispeech dataset with just a few hundred hours of labeled data. In our work, we instead focus on contrastive-loss-based SSL and experiment with large-scale datasets.
We propose to introduce weak guidance to improve align-
ment between the learned representations and the downstream
task. The weak guidance is provided in the form of posteriors
from a prior-knowledge model learned from a small labeled
dataset, which will be discussed in detail in Section 2.2.
To combine self-supervised and supervised training to improve performance on the final ASR task, most existing methods in the literature adopt a 2-stage scheme, where only
the self-supervised loss is optimized at the first pre-training
stage, and the supervised loss is optimized at the second stage.
Wav2vec [12] and vq-wav2vec [13] build the wav2letter [18]
acoustic model by using the pre-trained embeddings as in-
put features instead of log-mel filterbanks. Wav2vec 2.0 [14]
and HuBERT [10] pre-train the transformer based encoder us-
ing the self-supervised loss, add a randomly initialized out-
put layer on top and fine-tune with the CTC loss [17]. More
recent research has shown that joint training with both su-
pervised and unsupervised losses during the pre-training/fine-
tuning stage or as a single training process helps improve the
ASR performance. The initial UniSpeech work [19] demon-
strates that representations learned during pre-training can be
improved if the self-supervised contrastive loss is combined with a phonetic CTC loss, and the subsequent UniSpeech at scale
work [20] demonstrates better representations from the pre-
training stage for the downstream ASR task when combin-
ing the contrastive loss and the transducer loss. [21] alternately minimizes an unsupervised masked CPC loss and a
supervised CTC loss. This single-stage method is shown to
match the performance of the two-stage wav2vec 2.0 on the
Librispeech 100-hour dataset. [22] uses multitask learning comprising supervised CTC, attention and self-supervised
reconstruction losses to directly train acoustic models under
low-resource settings. [23] explores the benefit of combin-
ing the supervised RNN-T loss [24], the self-supervised con-
trastive loss and masked language modeling (MLM) losses
during different training stages. In this paper, we demonstrate the benefits of our proposed method mainly under the conventional 2-stage training scheme. We additionally try the joint training scheme on one ASR task in the ablation study and demonstrate gains similar to what is reported in the literature.
2. METHOD
2.1. Contrastive predictive coding
The left part of Figure 1 gives an overview of the conventional CPC representation learning approach. Given frames of audio features x_t ∈ X, we first apply the feature encoder network f_enc : X → Z to map the input sequence to a sequence of latent feature representations z_t ∈ Z, z_t = f_enc(x_t). An autoregressive context network f_ar : Z → C summarizes all z_{≤t} in the latent space and produces a contextual latent representation c_t = f_ar(z_{≤t}).
Both the feature encoder network and the autoregressive context network are trained to optimize the contrastive loss defined in Equation 1, based on Noise-Contrastive Estimation (NCE) [25], for each step k, which equivalently maximizes the mutual information between c_t and the latent representation z_{t+k} that is k steps in the future [11].
L_k = -\frac{1}{T-k} \sum_{t=1}^{T-k} \log \frac{\exp\left( z_{t+k}^{\top} h_k(c_t) / \kappa \right)}{\sum_{\tilde{z} \in Z} \exp\left( \tilde{z}^{\top} h_k(c_t) / \kappa \right)} \qquad (1)
where Z is a set of negative samples sampled from the same audio example to represent the imposter distribution, h_k(c_t) = W_k c_t + b_k is a step-specific affine transformation applied to c_t for each step k, and \kappa is the temperature. We optimize the final contrastive loss L_C by averaging L_k over the next K steps:
L_C = \frac{1}{K} \sum_{k=1}^{K} L_k \qquad (2)
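For concreteness, the following is a minimal PyTorch sketch of this objective. It is not the authors' implementation: the class name, the dimensions, the default temperature value, and the use of all other frames of the same utterance as negatives (rather than a sampled subset) are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CPCLoss(nn.Module):
    """Minimal sketch of the CPC contrastive loss in Equations (1)-(2)."""

    def __init__(self, context_dim: int, latent_dim: int, n_steps: int,
                 temperature: float = 0.1):
        super().__init__()
        # One step-specific affine transformation h_k(c_t) = W_k c_t + b_k per step k.
        self.step_heads = nn.ModuleList(
            [nn.Linear(context_dim, latent_dim) for _ in range(n_steps)]
        )
        self.temperature = temperature

    def forward(self, z: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
        # z: (T, latent_dim) latent representations, c: (T, context_dim) contexts.
        # Assumes T > n_steps so every step k has at least one (c_t, z_{t+k}) pair.
        T = z.size(0)
        step_losses = []
        for k, head in enumerate(self.step_heads, start=1):
            pred = head(c[: T - k])                   # h_k(c_t) for t = 1..T-k
            # Score every latent of the utterance against each prediction; the
            # entry at column t+k is the positive, all other frames act as negatives.
            logits = pred @ z.t() / self.temperature  # (T-k, T)
            labels = torch.arange(k, T, device=z.device)
            step_losses.append(F.cross_entropy(logits, labels))   # Eq. (1)
        return torch.stack(step_losses).mean()        # average over the K steps, Eq. (2)
```

In practice, z and c would be produced by f_enc and f_ar for each utterance, and the loss would additionally be averaged over the utterances in a batch.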
2.2. Guided contrastive predictive coding model
CPC learns representations from the complete data distribu-
tion, which may not be optimal for the downstream ASR task.
In this paper, we propose to provide weak guidance for the
contrastive loss. This weak guidance is provided in the form
of posteriors from a prior-knowledge model learned from a
small labeled dataset, and we use a monophone classifier for experimentation in the paper.

Fig. 1: Illustration of the conventional Contrastive Predictive Coding (CPC) representation learning approach (left part) and our proposed Guided CPC (GCPC) method (right part, in red). Parameters of the prior-knowledge model are fixed during training. p_t is a sequence of logits from a monophone classifier in our experiments.

As shown in the right part of
Figure 1, we use an additional encoder network g_enc : P → Q to map the sequence of unnormalized posteriors (logits) p_t to a sequence of latent representations q_t ∈ Q, q_t = g_enc(p_t), and then optimize the guided contrastive loss L_C^{guided} defined in Equation 4.
L_k^{guided} = -\frac{1}{T-k} \sum_{t=1}^{T-k} \log \frac{\exp\left( q_{t+k}^{\top} h_k(c_t) / \kappa \right)}{\sum_{\tilde{q} \in Q} \exp\left( \tilde{q}^{\top} h_k(c_t) / \kappa \right)} \qquad (3)
L_C^{guided} = \frac{1}{K} \sum_{k=1}^{K} L_k^{guided} \qquad (4)
During training, parameters of the prior-knowledge model are fixed. We hypothesize that representations c_t learned through this new technique could capture more phone-discriminative characteristics, since optimizing the guided contrastive loss helps maximize the mutual information between c_t and the transformation of the phone posteriors, q_{t+k}. Thus, c_t might be more aligned with the downstream ASR task and serve as a better initialization point.
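To make the wiring of the guided objective explicit, below is a hypothetical PyTorch sketch. The paper does not specify the architecture of g_enc or of the prior-knowledge model; the small MLP used for g_enc here, the class names, and the dimensions are assumptions, and the contrastive loss module is the CPC sketch from Section 2.1 with q_t used in place of z_t.

```python
import torch
import torch.nn as nn

class GuidedCPC(nn.Module):
    """Hypothetical wiring of the guided contrastive objective (Eqs. (3)-(4))."""

    def __init__(self, feature_encoder: nn.Module, context_network: nn.Module,
                 phone_classifier: nn.Module, n_phones: int, target_dim: int,
                 contrastive_loss: nn.Module):
        super().__init__()
        self.f_enc = feature_encoder        # f_enc: X -> Z
        self.f_ar = context_network         # f_ar: Z -> C
        # g_enc: P -> Q, assumed here to be a small MLP over the phone logits.
        self.g_enc = nn.Sequential(
            nn.Linear(n_phones, target_dim), nn.ReLU(),
            nn.Linear(target_dim, target_dim),
        )
        # Prior-knowledge model: its parameters stay fixed during pre-training.
        self.phone_classifier = phone_classifier
        for param in self.phone_classifier.parameters():
            param.requires_grad = False
        self.contrastive_loss = contrastive_loss   # e.g. the CPCLoss sketch above

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (T, feature_dim) acoustic features of one utterance.
        z = self.f_enc(x)                   # latent representations z_t
        c = self.f_ar(z)                    # contextual representations c_t
        with torch.no_grad():
            p = self.phone_classifier(x)    # unnormalized phone posteriors (logits) p_t
        q = self.g_enc(p)                   # guidance targets q_t
        # Same InfoNCE form as before, now with q_{t+k} as the positive target.
        return self.contrastive_loss(q, c)
```

Only the frozen phone classifier is excluded from gradient updates; f_enc, f_ar and g_enc are all trained by the guided contrastive loss, and the pre-trained networks then serve as the initialization for the ASR encoder (Section 2.3).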
2.3. Contrastive pre-training for RNN-T ASR
We use an RNN-T [24] based ASR system for our experiments. The RNN-T model consists of an encoder, a prediction network and a joint network, as shown in Figure 2a. Let D = {(X, Y)} denote a single example from a training corpus, where X = {x_1, x_2, ..., x_T} is a sequence of speech features and Y = {y_1, y_2, ..., y_U}, y_u ∈ V, is a sequence of tokens from the vocabulary V (e.g. word pieces) representing the labels. The encoder maps each frame of the input speech features x_t to a hidden state h_t^{enc}. The prediction network takes the embedding vector of the previous non-blank token y_{u-1} and generates the hidden state h_u^{pred}. The joint network is a feed-forward network that combines the outputs of the encoder and the prediction network to predict the conditional distribution over the next possible token \tilde{y}_i ∈ V ∪ {⟨blk⟩}, where ⟨blk⟩ denotes the blank symbol. The RNN-T loss is computed
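As a concrete illustration of how the encoder and prediction-network outputs are combined, here is a generic sketch of such a joint network in PyTorch. The additive combination with a tanh nonlinearity is a common choice but an assumption here, as are the class name and the layer sizes; the paper does not give these details.

```python
import torch
import torch.nn as nn

class RNNTJoint(nn.Module):
    """Generic sketch of an RNN-T joint network (details assumed, not from the paper)."""

    def __init__(self, enc_dim: int, pred_dim: int, joint_dim: int, vocab_size: int):
        super().__init__()
        self.enc_proj = nn.Linear(enc_dim, joint_dim)
        self.pred_proj = nn.Linear(pred_dim, joint_dim)
        self.out = nn.Linear(joint_dim, vocab_size + 1)   # +1 for the blank symbol <blk>

    def forward(self, h_enc: torch.Tensor, h_pred: torch.Tensor) -> torch.Tensor:
        # h_enc: (T, enc_dim) encoder states, h_pred: (U+1, pred_dim) prediction states.
        # Broadcast to a (T, U+1, joint_dim) lattice with one cell per (t, u) pair,
        # then emit logits over the vocabulary plus blank for every cell.
        joint = torch.tanh(self.enc_proj(h_enc).unsqueeze(1)
                           + self.pred_proj(h_pred).unsqueeze(0))
        return self.out(joint)                            # (T, U+1, vocab_size + 1)
```

In the setup described in this paper, the encoder producing h_enc is the component initialized from the (guided) contrastive pre-training of Sections 2.1 and 2.2.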