
ROBUST DATA2VEC: NOISE-ROBUST SPEECH REPRESENTATION LEARNING FOR ASR
BY COMBINING REGRESSION AND IMPROVED CONTRASTIVE LEARNING
Qiu-Shi Zhu1, Long Zhou2, Jie Zhang1, Shu-Jie Liu2, Yu-Chen Hu3, Li-Rong Dai1
1NERC-SLIP, University of Science and Technology of China (USTC), Hefei, China
2Microsoft Research Asia
3Nanyang Technological University, Singapore

This work is supported by the National Natural Science Foundation of
China (62101523), Hefei Municipal Natural Science Foundation (2022012),
Fundamental Research Funds for the Central Universities and the Leading
Plan of CAS (XDC08010200).
ABSTRACT
Self-supervised pre-training methods based on contrastive learning
or regression tasks can exploit large amounts of unlabeled data to improve the
performance of automatic speech recognition (ASR). However, the
impact on robustness of combining the two pre-training tasks and of
constructing different negative samples for contrastive learning remains
unclear. In this paper, we propose a noise-robust data2vec for
self-supervised speech representation learning by jointly optimizing
the contrastive learning and regression tasks in the pre-training stage.
Furthermore, we present two improved methods to facilitate con-
trastive learning. More specifically, we first propose to construct
patch-based non-semantic negative samples to boost the noise robustness
of the pre-trained model, which is achieved by dividing the features into
patches of different sizes (which then serve as negative samples). Second,
by analyzing the distribution of positive and negative samples, we propose
to remove the easily distinguishable negative samples to improve the
discriminative capacity of pre-trained
models. Experimental results on the CHiME-4 dataset show that our
method is able to improve the performance of the pre-trained model
in noisy scenarios. We find that jointly training the contrastive learning
and regression tasks can avoid model collapse to some extent, compared
with training the regression task alone.
Index Terms—Automatic speech recognition, noise robust-
ness, self-supervised pre-training, contrastive learning.
1. INTRODUCTION
Collecting labeled data is time-consuming and economically expensive,
whereas a large amount of unlabeled data can be recorded easily in practice.
How to better exploit unlabeled data for supervised learning has thus
become a topic of growing interest. In the speech community, many methods
have been proposed to improve the automatic speech recognition
(ASR) performance using unlabeled speech data, such as self-
supervised pre-training and teacher-student schemes (i.e., so-called
self-training), which were shown to be beneficial for various down-
stream speech tasks [1–6]. For example, wav2vec2.0 [1], a self-supervised
pre-training method, uses a contrastive loss to pull the predicted
representation closer to the positive sample while pushing it away from
negative samples. Because wav2vec2.0 employs local features as pre-training
targets, the contextual information is not fully leveraged. This problem
was then considered in Hu-
BERT [2], which offline clusters the representations output from
the middle layer of the pre-trained model to generate targets for
self-supervised pre-training. On the basis of HuBERT, WavLM [3]
utilizes a sentence-level mixing data augmentation approach to en-
hance the speaker information, which performs very well on the
SUPERB benchmark [7]. Unlike wav2vec2.0 and HuBERT which
employ local information and discrete contextual information as
targets, data2vec [5] follows the teacher-student scheme [8, 9] and
adopts continuous contextual representations as targets to perform
regression tasks, leading to an even better performance on down-
stream tasks.
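To make the two objectives discussed above concrete, the following PyTorch-style sketch shows an InfoNCE-style contrastive loss (as used in wav2vec2.0), a regression loss toward continuous teacher representations (as used in data2vec), and a simple weighted combination of the two. This is an illustrative sketch rather than the implementation of any of the cited models or of our method; the tensor shapes, the temperature, and the weighting factor alpha are assumptions made for clarity.

import torch
import torch.nn.functional as F

def contrastive_loss(pred, pos, negs, temperature=0.1):
    # pred, pos: (B, T, D) predicted and positive (target) features;
    # negs: (K, B, T, D) K distractor features for each masked position.
    cands = torch.cat([pos.unsqueeze(0), negs], dim=0)              # (K+1, B, T, D)
    sims = F.cosine_similarity(pred.unsqueeze(0).expand_as(cands),
                               cands, dim=-1)                       # (K+1, B, T)
    logits = (sims / temperature).permute(1, 2, 0).reshape(-1, cands.size(0))
    labels = torch.zeros(logits.size(0), dtype=torch.long,
                         device=logits.device)                      # positive sits at index 0
    return F.cross_entropy(logits, labels)

def regression_loss(pred, teacher_target):
    # data2vec-style regression toward continuous teacher representations;
    # the teacher target is not back-propagated through.
    return F.smooth_l1_loss(pred, teacher_target.detach())

def joint_loss(pred, pos, negs, teacher_target, alpha=1.0):
    # Weighted combination of the two pre-training objectives (alpha is illustrative).
    return contrastive_loss(pred, pos, negs) + alpha * regression_loss(pred, teacher_target)

Jointly optimizing a contrastive term and a regression term in this spirit is the high-level idea behind the pre-training objective studied in this paper.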
It was shown that self-supervised pre-training can improve the
noise robustness of ASR models. For example, the problem-agnostic
speech encoder (PASE+) [10] uses an online speech perturbation
module and employs multiple self-supervised tasks to improve the
noise robustness. In robust wav2vec2.0 [11], a more general case is
explored, where the domain of unlabeled data used for pre-training
is different from that of the labeled data for fine-tuning, exhibit-
ing a stronger generalization capacity for ASR models. By using
quantized clean speech features as pre-training targets, enhanced
wav2vec2.0 [12] can improve the noise robustness of ASR models.
Wav2vec-switch [13] allows the model to have consistent predictions
for both original and noisy speech by contrastive learning. In [14],
a reconstruction module was proposed based on wav2vec2.0 to im-
prove the noise robustness of the learned representations. However,
little work has investigated the robustness of speech representations
learned by combining regression and contrastive learning tasks, especially
the effect of different negative samples in contrastive learning.
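As an illustration of the consistency idea used by approaches such as [13], the sketch below encodes the same utterance from both the original and a noise-perturbed waveform and adds an auxiliary term that penalizes disagreement between the two sets of representations. It is a simplified illustration under these assumptions, not the actual wav2vec-switch objective, which instead switches the quantized targets of the original and noisy speech within its contrastive loss.

import torch
import torch.nn.functional as F

def noise_consistency_loss(encoder, wav_clean, wav_noisy):
    # encoder: any module mapping a waveform batch (B, S) to features (B, T, D);
    # wav_clean and wav_noisy are the original and noise-perturbed versions
    # of the same utterances.  A simple MSE term encourages the noisy-view
    # representations to match the clean-view ones.
    repr_clean = encoder(wav_clean).detach()   # clean view serves as the reference
    repr_noisy = encoder(wav_noisy)
    return F.mse_loss(repr_noisy, repr_clean)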
In the field of computer vision (CV), much analytical work has been done
on negative samples in contrastive learning, showing that negative samples
affect the quality of the pre-trained representations. In [15], it was found
that many negative samples are
too far away from positive samples within contrastive learning, and
hard negative mixing [15] was therefore proposed, which can im-
prove the performance and training efficiency of pre-trained models
by mixing difficult negative samples at the feature level. To alleviate
the sampling bias of negative samples in contrastive learning, debi-
ased contrastive loss was proposed in [16]. In [17], based on the observation
that representations learned by contrastive learning benefit from hard
negative samples, a hard negative sample selection approach was proposed,
in which the user can control the difficulty of the negative
samples. In [18], it was shown that only the hardest 5% of the negative
samples are both useful and sufficient for downstream tasks, the remaining
95% are unnecessary, and the hardest 0.1% are even harmful. In addition,
a false negative detection method was proposed in [19] to address the
problem that positive samples may be mistakenly treated as negatives.
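To make the idea of selecting negatives by difficulty concrete, the following sketch ranks candidate negatives by cosine similarity to an anchor representation and keeps only the hardest fraction, discarding the easily distinguishable ones. It is an illustration in the spirit of [17, 18] rather than their exact procedures or the method proposed in this paper; the 5% keep ratio merely echoes the observation reported in [18].

import torch
import torch.nn.functional as F

def select_hard_negatives(anchor, candidates, keep_ratio=0.05):
    # anchor: (D,) representation of the current (positive) sample;
    # candidates: (N, D) pool of candidate negative representations.
    sims = F.cosine_similarity(anchor.unsqueeze(0).expand_as(candidates),
                               candidates, dim=-1)                  # (N,)
    k = max(1, int(keep_ratio * candidates.size(0)))
    _, idx = torch.topk(sims, k)   # the most similar candidates are the hardest ones
    return candidates[idx]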