
terms of their similarity might represent the same phone or
class. Using such negative examples would conflict with the
representation learning task, whose primary focus should be
to discriminate between sounds or phones.
To arrive at more informative negatives for the contrastive
loss, we propose to cluster the potential negative examples
and to diminish the contribution to the loss computation of
those negatives that fall into the same cluster as the positive.
Simply put, this process identifies the weak, non-informative
negatives in our population and reduces their impact on the
loss computation.
We also demonstrate the robustness of the proposed ap-
proach through tasks such as Domain Adaptation and zero-
shot decoding on the Switchboard [20] and Wall Street Jour-
nal (WSJ) [21] datasets, respectively. To summarize, our pri-
mary contributions are as follows:
• We introduce an augmentation of the original sam-
ple and use its representations to add an auxiliary
Cross-Contrastive loss to the existing contrastive loss
in wav2vec 2.0.
• We demonstrate the usefulness of a clustering module
to segregate the negative examples and thereby control
the effect of the weak non-informative negative exam-
ples in the contrastive learning task.
• Combining the above two modules leads to the devel-
opment of ccc-wav2vec 2.0, a robust pre-training ap-
proach that consistently outperforms wav2vec 2.0 in
tasks such as ASR, Domain Adaptation, and zero-shot
decoding.
Our code and models are publicly available at
https://github.com/Speech-Lab-IITM/CCC-wav2vec-2.0.
2. RELATED WORK
SSL for speech representation learning has been prevalent in
the form of MAM. Most MAM approaches introduced in the
literature aim either to predict the class of the masked entity
using a classification objective, as in [5, 6], to reconstruct the
original frame, as in [22, 23], or to enforce similarity between
the network's prediction for the masked frame and a quantized
representation of the original masked frame by solving a
Contrastive Learning task, as in [3]. Others propose to solve
two of these tasks simultaneously [24, 4, 25, 26].
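For concreteness, the Contrastive Learning task of wav2vec
2.0 [3], on which our method builds, minimizes (in the notation
of [3])

\mathcal{L}_m = -\log \frac{\exp(sim(\mathbf{c}_t, \mathbf{q}_t)/\kappa)}{\sum_{\tilde{\mathbf{q}} \in \mathbf{Q}_t} \exp(sim(\mathbf{c}_t, \tilde{\mathbf{q}})/\kappa)}

where \mathbf{c}_t is the context network output at masked time
step t, \mathbf{q}_t the quantized representation of the original
frame, \mathbf{Q}_t the set containing \mathbf{q}_t and K
distractors sampled from other masked time steps of the same
utterance, sim(\cdot,\cdot) the cosine similarity, and \kappa a
temperature.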
2.1. Negative Sampling
Contrastive Learning dominates self-supervised speech
representation learning in various forms [3, 4], consistently
achieving new State-of-the-Art (SOTA) results on a variety of
SLP tasks [27]. However,
it is hard to find works that discuss efficient negative mining
for Contrastive Learning in SLP. In contrast, this line of
research has seen great success in CV [17, 15, 16, 28]. The
idea has also been applied successfully in metric learning,
where most works [29, 30] observe that it helps to use
negative examples that are difficult to discriminate from the
current embedding. [13] was one of the first works
to analyze the effect of efficient negative sampling, wherein
they observed a drop in performance when negatives were not
mined from the same speaker. To the best of our knowledge,
this is the first work to design the Contrastive Learning task
for speech SSL to explicitly control the choice of negatives.
2.2. Data Augmentation
The usefulness of data augmentation for robust SLP has been
explored extensively, primarily in supervised learning setups
[8, 9]. A wide range of SLP tasks, such as ASR [31, 32] and
Speaker Identification [33], have been shown to benefit from
data augmentation, with low-resource learning [31, 32] and
far-field and noisy-environment recognition [34, 35] seen to
benefit the most. Very
recently, the benefits of data augmentation have been explored
in SSL-based speech representation learning [10, 11], where
the former focuses on low-resource and the latter on improv-
ing ASR in far-field and noisy environments. In a recent work
based on wav2vec 2.0, [11] proposes a Multi-Variant
Consistency-based objective wherein multiple augmented versions
of the same audio sample are created. The original audio sam-
ple is discarded, and a contrastive loss between the multiple
augmented versions is computed. Our proposed approach dif-
fers from this work in the following ways: 1) We retain the
original audio sample and use the cross-contrastive loss with
the augmentation as an auxiliary loss in addition to the orig-
inal wav2vec 2.0 objective. 2) In our loss computation, the
effect of the various negative examples is controlled by the
clustering module, leading to an informed contrast with the
“anchor” in the contrastive loss.
3. METHODOLOGY
In the following subsections, we elaborate on the Cross-
Contrastive setup and the Clustering module, both of which
are key components of the proposed ccc-wav2vec 2.0. Fi-
nally, we present how these two components are integrated to
form ccc-wav2vec 2.0.
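As a high-level preview of that integration, the sketch below
combines the original wav2vec 2.0 objective with the auxiliary
cross-contrastive terms; the callable contrastive stands in for
the (cluster-weighted) contrastive loss, and the weight
aux_weight is an illustrative assumption rather than a value
from our setup.

    def ccc_loss(c, q, c_aug, q_aug, contrastive, aux_weight=0.5):
        # c, q         : context / quantized representations of the original X
        # c_aug, q_aug : context / quantized representations of the augmentation X'
        # contrastive  : callable computing a contrastive loss between
        #                context representations and quantized targets
        main = contrastive(c, q)  # original wav2vec 2.0 objective
        # Cross-contrastive terms: each sample's context representations
        # are contrasted against the other sample's quantized targets.
        cross = contrastive(c, q_aug) + contrastive(c_aug, q)
        return main + aux_weight * cross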
3.1. Cross-Contrastive Learning
Deciphering speech and sound in noisy environments is not
a challenging task for humans. However, the same cannot be
expected of SSL models unless they have been trained for it.
To bring robustness to the pre-training approach, we tap into
augmentations of the speech samples.
Given an audio sample X, we apply an augmentation to it
using the torchaudio-augmentations library [36] to get X′
as the augmented sample. We pass both X and X′ through
the wav2vec 2.0 model to get the quantized representations
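A minimal sketch of this augmentation step is given below.
Our pipeline uses the torchaudio-augmentations library [36];
the library-agnostic stand-in here applies two typical
augmentations of the kind that library provides, additive noise
at a target SNR and a random gain, and the specific transforms
and parameter values are illustrative assumptions.

    import torch

    def augment(x, snr_db=15.0, gain_db_range=(-6.0, 0.0)):
        # x: raw waveform as a 1-D float tensor; returns X', an
        # augmented view of the same utterance.
        # Additive Gaussian noise scaled to the requested SNR.
        signal_power = x.pow(2).mean()
        noise_power = signal_power / (10.0 ** (snr_db / 10.0))
        x_aug = x + noise_power.sqrt() * torch.randn_like(x)
        # Random gain, drawn uniformly in dB and applied as a linear factor.
        lo, hi = gain_db_range
        gain_db = torch.empty(1).uniform_(lo, hi)
        return x_aug * (10.0 ** (gain_db / 20.0))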