
terms of their similarity might represent the same phone or
class. Using such negative examples would conflict with the
representation learning task, whose primary focus should be
to discriminate between sounds or phones.
To arrive at more informative negatives for the contrastive
loss, we propose to cluster the potential negative examples
and to diminish the contribution to the loss computation of
those negatives that fall into the same cluster as the positive.
Simply put, this process identifies the weak, non-informative
negatives in our population and reduces their impact on the
loss computation.
We also demonstrate the robustness of the proposed ap-
proach through tasks such as Domain Adaptation and zero-
shot decoding on the Switchboard [20] and Wall Street Jour-
nal (WSJ) [21] datasets, respectively. To summarize, our pri-
mary contributions are as follows:
• We introduce an augmentation of the original sam-
ple and use its representations to add an auxiliary
Cross-Contrastive loss to the existing contrastive loss
in wav2vec 2.0.
• We demonstrate the usefulness of a clustering module
to segregate the negative examples and thereby control
the effect of the weak non-informative negative exam-
ples in the contrastive learning task.
• Combining the above two modules leads to the devel-
opment of ccc-wav2vec 2.0, a robust pre-training ap-
proach that consistently outperforms wav2vec 2.0 in
tasks such as ASR, Domain Adaptation, and zero-shot
decoding.
Our code and models are publicly available at
https://github.com/Speech-Lab-IITM/CCC-wav2vec-2.0.
2. RELATED WORK
SSL for speech representation learning has been prevalent in
the form of MAM. Most MAM approaches introduced in the
literature aim either to predict the class of the masked entity
using a classification objective, as in [5, 6], to reconstruct the
original frame, as in [22, 23], or to enforce similarity between
the network's prediction for the masked frame and a quantized
representation of the original masked frame by solving a
Contrastive Learning task, as in [3]. Others propose to solve
two of these tasks simultaneously [24, 4, 25, 26].
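For concreteness, the Contrastive Learning task of wav2vec
2.0 [3], on which our method builds, minimizes (in the notation
of [3])

\mathcal{L}_m = -\log \frac{\exp(sim(\mathbf{c}_t, \mathbf{q}_t)/\kappa)}{\sum_{\tilde{\mathbf{q}} \in \mathbf{Q}_t} \exp(sim(\mathbf{c}_t, \tilde{\mathbf{q}})/\kappa)}

where \mathbf{c}_t is the context network output at masked time
step t, \mathbf{q}_t the quantized representation of the original
frame, \mathbf{Q}_t the set containing \mathbf{q}_t and K
distractors sampled from other masked time steps of the same
utterance, sim(\cdot,\cdot) the cosine similarity, and \kappa a
temperature.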
2.1. Negative Sampling
Contrastive Learning dominates self-supervised speech
representation learning in various forms [3, 4], consistently
achieving new State-of-the-Art (SOTA) results on a variety of
SLP tasks [27]. However,
it is hard to find works that discuss efficient negative mining
for Contrastive Learning in SLP. In contrast, this line of
research has seen great success in CV [17, 15, 16, 28]. The
idea has also been applied successfully in metric learning,
where most works [29, 30] observe that it helps to use
negative examples that are difficult to discriminate from the
current embedding. [13] was one of the first works
to analyze the effect of efficient negative sampling, wherein
they observed a drop in performance when negatives were not
mined from the same speaker. To the best of our knowledge,
this is the first work to design the Contrastive Learning task
for speech SSL to explicitly control the choice of negatives.
2.2. Data Augmentation
The usefulness of data augmentation for robust SLP has been
explored extensively, primarily in supervised learning setups
[8, 9]. A wide range of SLP tasks, such as ASR [31, 32] and
Speaker Identification [33], have been shown to benefit from
data augmentation, with low-resource learning [31, 32] and
far-field and noisy-environment recognition [34, 35] seen to
benefit the most. Very
recently, the benefits of data augmentation have been explored
in SSL-based speech representation learning [10, 11], where
the former focuses on low-resource and the latter on improv-
ing ASR in far-field and noisy environments. In a recent work
based on wav2vec 2.0, [11] proposes a Multi-Variant
Consistency-based objective wherein multiple augmented versions
of the same audio sample are created. The original audio sam-
ple is discarded, and a contrastive loss between the multiple
augmented versions is computed. Our proposed approach dif-
fers from this work in the following ways: 1) We retain the
original audio sample and use the cross-contrastive loss with
the augmentation as an auxiliary loss in addition to the orig-
inal wav2vec 2.0 objective. 2) In our loss computation, the
effect of the various negative examples is controlled by the
clustering module, leading to an informed contrast with the
“anchor” in the contrastive loss.
3. METHODOLOGY
In the following subsections, we elaborate on the Cross-
Contrastive setup and the Clustering module, both of which
are key components of the proposed ccc-wav2vec 2.0. Fi-
nally, we present how these two components are integrated to
form ccc-wav2vec 2.0.
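As a high-level preview of that integration, the sketch below
combines the original wav2vec 2.0 objective with the auxiliary
cross-contrastive terms; the callable contrastive stands in for
the (cluster-weighted) contrastive loss, and the weight
aux_weight is an illustrative assumption rather than a value
from our setup.

    def ccc_loss(c, q, c_aug, q_aug, contrastive, aux_weight=0.5):
        # c, q         : context / quantized representations of the original X
        # c_aug, q_aug : context / quantized representations of the augmentation X'
        # contrastive  : callable computing a contrastive loss between
        #                context representations and quantized targets
        main = contrastive(c, q)  # original wav2vec 2.0 objective
        # Cross-contrastive terms: each sample's context representations
        # are contrasted against the other sample's quantized targets.
        cross = contrastive(c, q_aug) + contrastive(c_aug, q)
        return main + aux_weight * cross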
3.1. Cross-Contrastive Learning
Deciphering speech and sound in noisy environments is not
a challenging task for humans. However, the same cannot be
expected of SSL models unless they have been trained for it.
To bring robustness to the pre-training approach, we tap into
augmentations of the speech samples.
Given an audio sample X, we apply an augmentation to it
using the torchaudio-augmentations library [36] to get X′
as the augmented sample. We pass both X and X′ through
the wav2vec 2.0 model to get the quantized representations
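A minimal sketch of this augmentation step is given below.
Our pipeline uses the torchaudio-augmentations library [36];
the library-agnostic stand-in here applies two typical
augmentations of the kind that library provides, additive noise
at a target SNR and a random gain, and the specific transforms
and parameter values are illustrative assumptions.

    import torch

    def augment(x, snr_db=15.0, gain_db_range=(-6.0, 0.0)):
        # x: raw waveform as a 1-D float tensor; returns X', an
        # augmented view of the same utterance.
        # Additive Gaussian noise scaled to the requested SNR.
        signal_power = x.pow(2).mean()
        noise_power = signal_power / (10.0 ** (snr_db / 10.0))
        x_aug = x + noise_power.sqrt() * torch.randn_like(x)
        # Random gain, drawn uniformly in dB and applied as a linear factor.
        lo, hi = gain_db_range
        gain_db = torch.empty(1).uniform_(lo, hi)
        return x_aug * (10.0 ** (gain_db / 20.0))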