
ROBUST DATA2VEC: NOISE-ROBUST SPEECH REPRESENTATION LEARNING FOR ASR
BY COMBINING REGRESSION AND IMPROVED CONTRASTIVE LEARNING
Qiu-Shi Zhu1, Long Zhou2, Jie Zhang1, Shu-Jie Liu2, Yu-Chen Hu3, Li-Rong Dai1
1NERC-SLIP, University of Science and Technology of China (USTC), Hefei, China
2Microsoft Research Asia
3Nanyang Technological University, Singapore

This work is supported by the National Natural Science Foundation of
China (62101523), Hefei Municipal Natural Science Foundation (2022012),
Fundamental Research Funds for the Central Universities and the Leading
Plan of CAS (XDC08010200).
ABSTRACT
Self-supervised pre-training methods based on contrastive learning
or regression tasks can exploit large amounts of unlabeled data to improve the
performance of automatic speech recognition (ASR). However, the
impact on robustness of combining the two pre-training tasks and of
constructing different negative samples for contrastive learning remains
unclear. In this paper, we propose a noise-robust data2vec for
self-supervised speech representation learning by jointly optimizing
the contrastive learning and regression tasks in the pre-training stage.
Furthermore, we present two improved methods to facilitate con-
trastive learning. More specifically, we first propose to construct
patch-based non-semantic negative samples to boost the noise robustness
of the pre-trained model, which is achieved by dividing the features into
patches of different sizes (which then serve as negative samples). Second,
by analyzing the distribution of positive and negative samples, we propose
to remove the easily distinguishable negative samples to improve the
discriminative capacity of pre-trained
models. Experimental results on the CHiME-4 dataset show that our
method is able to improve the performance of the pre-trained model
in noisy scenarios. We find that jointly training the contrastive learning
and regression tasks can avoid model collapse to some extent, compared
with training the regression task alone.
Index Terms—Automatic speech recognition, noise robust-
ness, self-supervised pre-training, contrastive learning.
1. INTRODUCTION
Collecting labeled data is time-consuming and economically expensive,
whereas a large amount of unlabeled data can be recorded easily in practice.
How to better exploit unlabeled data for supervised learning has thus
become a topic of growing interest. In the speech community, many methods
have been proposed to improve the automatic speech recognition
(ASR) performance using unlabeled speech data, such as self-
supervised pre-training and teacher-student schemes (i.e., so-called
self-training), which were shown to be beneficial for various down-
stream speech tasks [1–6]. For example, wav2vec2.0 [1], a self-supervised
pre-training method, uses a contrastive loss to pull the predicted
representation closer to the positive sample while pushing it away from
negative samples. Because wav2vec2.0 employs local features as pre-training
targets, the contextual information is not fully leveraged. This problem
was then considered in Hu-
BERT [2], which offline clusters the representations output from
the middle layer of the pre-trained model to generate targets for
self-supervised pre-training. On the basis of HuBERT, WavLM [3]
utilizes a sentence-level mixing data augmentation approach to en-
hance the speaker information, which performs very well on the
SUPERB benchmark [7]. Unlike wav2vec2.0 and HuBERT which
employ local information and discrete contextual information as
targets, data2vec [5] follows the teacher-student scheme [8, 9] and
adopts continuous contextual representations as targets to perform
regression tasks, leading to an even better performance on down-
stream tasks.
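To make the two objectives discussed above concrete, the following PyTorch-style sketch shows an InfoNCE-style contrastive loss (as used in wav2vec2.0), a regression loss toward continuous teacher representations (as used in data2vec), and a simple weighted combination of the two. This is an illustrative sketch rather than the implementation of any of the cited models or of our method; the tensor shapes, the temperature, and the weighting factor alpha are assumptions made for clarity.

import torch
import torch.nn.functional as F

def contrastive_loss(pred, pos, negs, temperature=0.1):
    # pred, pos: (B, T, D) predicted and positive (target) features;
    # negs: (K, B, T, D) K distractor features for each masked position.
    cands = torch.cat([pos.unsqueeze(0), negs], dim=0)              # (K+1, B, T, D)
    sims = F.cosine_similarity(pred.unsqueeze(0).expand_as(cands),
                               cands, dim=-1)                       # (K+1, B, T)
    logits = (sims / temperature).permute(1, 2, 0).reshape(-1, cands.size(0))
    labels = torch.zeros(logits.size(0), dtype=torch.long,
                         device=logits.device)                      # positive sits at index 0
    return F.cross_entropy(logits, labels)

def regression_loss(pred, teacher_target):
    # data2vec-style regression toward continuous teacher representations;
    # the teacher target is not back-propagated through.
    return F.smooth_l1_loss(pred, teacher_target.detach())

def joint_loss(pred, pos, negs, teacher_target, alpha=1.0):
    # Weighted combination of the two pre-training objectives (alpha is illustrative).
    return contrastive_loss(pred, pos, negs) + alpha * regression_loss(pred, teacher_target)

Jointly optimizing a contrastive term and a regression term in this spirit is the high-level idea behind the pre-training objective studied in this paper.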
It was shown that self-supervised pre-training can improve the
noise robustness of ASR models. For example, the problem-agnostic
speech encoder (PASE+) [10] uses an online speech perturbation
module and employs multiple self-supervised tasks to improve the
noise robustness. In robust wav2vec2.0 [11], a more general case is
explored, where the domain of unlabeled data used for pre-training
is different from that of the labeled data for fine-tuning, exhibit-
ing a stronger generalization capacity for ASR models. By using
quantized clean speech features as pre-training targets, enhanced
wav2vec2.0 [12] can improve the noise robustness of ASR models.
Wav2vec-switch [13] allows the model to have consistent predictions
for both original and noisy speech by contrastive learning. In [14],
a reconstruction module was proposed based on wav2vec2.0 to im-
prove the noise robustness of the learned representations. However,
little work has investigated the robustness of speech representations
learned by combining regression and contrastive learning tasks, especially
the effect of different negative samples in contrastive learning.
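As an illustration of the consistency idea used by approaches such as [13], the sketch below encodes the same utterance from both the original and a noise-perturbed waveform and adds an auxiliary term that penalizes disagreement between the two sets of representations. It is a simplified illustration under these assumptions, not the actual wav2vec-switch objective, which instead switches the quantized targets of the original and noisy speech within its contrastive loss.

import torch
import torch.nn.functional as F

def noise_consistency_loss(encoder, wav_clean, wav_noisy):
    # encoder: any module mapping a waveform batch (B, S) to features (B, T, D);
    # wav_clean and wav_noisy are the original and noise-perturbed versions
    # of the same utterances.  A simple MSE term encourages the noisy-view
    # representations to match the clean-view ones.
    repr_clean = encoder(wav_clean).detach()   # clean view serves as the reference
    repr_noisy = encoder(wav_noisy)
    return F.mse_loss(repr_noisy, repr_clean)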
In the field of computer vision (CV), much analytical work has been done
on negative samples in contrastive learning, showing that negative samples
affect the quality of the pre-trained representations. In [15], it was found
that many negative samples are
too far away from positive samples within contrastive learning, and
hard negative mixing [15] was therefore proposed, which can im-
prove the performance and training efficiency of pre-trained models
by mixing difficult negative samples at the feature level. To alleviate
the sampling bias of negative samples in contrastive learning, debi-
ased contrastive loss was proposed in [16]. In [17], based on the observation
that representations learned by contrastive learning benefit from hard
negative samples, a hard negative sample selection approach was proposed,
in which the user can control the difficulty of the negative
samples. In [18], it was shown that only the hardest 5% of the negative
samples are both useful and sufficient for downstream tasks, the remaining
95% are unnecessary, and the hardest 0.1% are even harmful. In addition,
a false negative detection method was proposed in [19] to address the
problem that positive samples may be mistakenly treated as negatives.
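To make the idea of selecting negatives by difficulty concrete, the following sketch ranks candidate negatives by cosine similarity to an anchor representation and keeps only the hardest fraction, discarding the easily distinguishable ones. It is an illustration in the spirit of [17, 18] rather than their exact procedures or the method proposed in this paper; the 5% keep ratio merely echoes the observation reported in [18].

import torch
import torch.nn.functional as F

def select_hard_negatives(anchor, candidates, keep_ratio=0.05):
    # anchor: (D,) representation of the current (positive) sample;
    # candidates: (N, D) pool of candidate negative representations.
    sims = F.cosine_similarity(anchor.unsqueeze(0).expand_as(candidates),
                               candidates, dim=-1)                  # (N,)
    k = max(1, int(keep_ratio * candidates.size(0)))
    _, idx = torch.topk(sims, k)   # the most similar candidates are the hardest ones
    return candidates[idx]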