
GUIDED CONTRASTIVE SELF-SUPERVISED PRE-TRAINING
FOR AUTOMATIC SPEECH RECOGNITION
Aparna Khare1, Minhua Wu1, Saurabhchand Bhati2*, Jasha Droppo1, Roland Maas1
1Amazon Alexa, USA
2Johns Hopkins University, USA
*The author contributed to this work during an internship at Amazon.
ABSTRACT
Contrastive Predictive Coding (CPC) is a representation
learning method that maximizes the mutual information be-
tween intermediate latent representations and the output of
a given model. It can be used to effectively initialize the
encoder of an Automatic Speech Recognition (ASR) model.
We present a novel modification of CPC called Guided Con-
trastive Predictive Coding (GCPC). Our proposed method
maximizes the mutual information between representations
from a prior-knowledge model and the output of the model
being pre-trained, allowing prior knowledge injection dur-
ing pre-training. We validate our method on 3 ASR tasks:
German, French and English. Our method outperforms CPC pre-training on all three datasets, reducing the Word Error Rate (WER) relative to training from scratch by 4.44%, 6.55% and 15.43% on the German, French and English (Librispeech) tasks, respectively, while CPC pre-training brings only 2.96%, 1.01% and 14.39% relative WER reductions.
Index Terms—Self-supervised learning, RNN-T, ASR
1. INTRODUCTION
Self-supervised Learning (SSL) has drawn a lot of recent at-
tention in the machine learning community. After its success-
ful applications in the natural language processing domain
[1, 2, 3], it has also become an active research area for speech
processing.
One of the main categories of SSL methods learns representations by reconstructing the signal, for example full reconstruction with autoencoders [4, 5], future reconstruction with Autoregressive Predictive Coding (APC) [6], and masked reconstruction [7, 8, 9]. Instead of reconstructing the exact signal, HuBERT [10] learns representations by utilizing an offline clustering step to provide aligned target labels for a masked prediction loss. Another category of SSL methods in the literature learns representations through a contrastive loss that distinguishes a true future audio sample from a set of negative examples, such as the Contrastive Predictive Coding (CPC) model [11] and wav2vec [12]. Vq-wav2vec [13] adds a vector quantization module on top of the contrastive loss to learn discrete representations, and wav2vec 2.0 [14] minimizes a contrastive loss defined over contextual representations in the masked region. In addition, w2v-BERT [15] combines the two categories by optimizing two self-supervised losses simultaneously (the contrastive loss and a masked language modeling loss).
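For reference, the contrastive objective used by CPC-style models (the InfoNCE loss) can be written as follows, where c_t denotes the context representation at time t, z_{t+k} the true latent representation k steps ahead, W_k a step-specific projection, and Z a set containing z_{t+k} together with negative samples; the notation here is generic rather than taken verbatim from any one of the cited papers:

\mathcal{L}_k = -\,\mathbb{E}\left[\log \frac{\exp\!\left(z_{t+k}^{\top} W_k\, c_t\right)}{\sum_{\tilde{z} \in Z} \exp\!\left(\tilde{z}^{\top} W_k\, c_t\right)}\right]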
All of these methods learn representations from the acous-
tic data distribution only, which may not be optimal for the
downstream ASR task. More recently, Wang et al. [16] propose two supervision-guided codebook generation approaches to obtain better pre-trained embeddings for the downstream ASR task. On top of HuBERT pre-training, one approach uses phoneme alignments as training targets; the other performs K-means clustering on supervised speech features extracted from an end-to-end CTC model [17]. However, that work focuses on masked-prediction self-supervised learning, and all ASR experiments are conducted on the Librispeech dataset with just a few hundred hours of labeled data. In our work, we instead focus on a contrastive-loss-based SSL method and experiment with large-scale datasets.
We propose to introduce weak guidance to improve align-
ment between the learned representations and the downstream
task. The weak guidance is provided in the form of posteriors
from a prior-knowledge model trained on a small labeled dataset; this is discussed in detail in Section 2.2.
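As a rough illustration (a minimal sketch only; the exact formulation follows in Section 2.2, and the function name, tensor shapes and the use of frame-level posteriors as guidance here are illustrative assumptions), the idea can be pictured as an InfoNCE-style loss in which time-aligned outputs of the frozen prior-knowledge model serve as the positive targets for the model being pre-trained:

import torch
import torch.nn.functional as F

def guided_contrastive_loss(context, guidance, num_negatives=10):
    # context:  (T, D) projected outputs of the model being pre-trained.
    # guidance: (T, D) time-aligned representations (e.g. frame-level
    #           posteriors) from the frozen prior-knowledge model.
    T = context.size(0)
    # Positive score: similarity with the time-aligned guidance vector.
    pos = torch.sum(context * guidance, dim=-1, keepdim=True)        # (T, 1)
    # Negative scores: guidance vectors sampled from other frames.
    neg_idx = torch.randint(0, T, (T, num_negatives), device=context.device)
    neg = torch.einsum('td,tnd->tn', context, guidance[neg_idx])     # (T, N)
    logits = torch.cat([pos, neg], dim=-1)                           # (T, 1 + N)
    # The positive pair is always at index 0.
    targets = torch.zeros(T, dtype=torch.long, device=context.device)
    return F.cross_entropy(logits, targets)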
To combine self-supervised and supervised training and improve performance on the final ASR task, most existing methods in the literature adopt a two-stage scheme, where only the self-supervised loss is optimized in the first (pre-training) stage and the supervised loss is optimized in the second stage.
Wav2vec [12] and vq-wav2vec [13] build the wav2letter [18] acoustic model using the pre-trained embeddings as input features in place of log-mel filterbanks. Wav2vec 2.0 [14] and HuBERT [10] pre-train a Transformer-based encoder with the self-supervised loss, add a randomly initialized output layer on top, and fine-tune with the CTC loss [17]. More
recent research has shown that joint training with both su-
pervised and unsupervised losses during the pre-training/fine-
tuning stage or as a single training process helps improve the
ASR performance. The initial UniSpeech work [19] demonstrates that representations learned during pre-training can be improved if the self-supervised contrastive loss is combined with a phonetic CTC loss, and the follow-up UniSpeech at scale work [20] demonstrates better representations from the pre-