CCC-WAV2VEC 2.0: CLUSTERING AIDED CROSS CONTRASTIVE SELF-SUPERVISED LEARNING OF SPEECH REPRESENTATIONS

Vasista Sai Lodagala1, Sreyan Ghosh2, S. Umesh1
1Indian Institute of Technology, Madras
2University of Maryland, College Park
ABSTRACT
While Self-Supervised Learning has helped reap the benefit of scale from the available unlabeled data, the learning paradigms are continuously being bettered. We present a new pre-training strategy named ccc-wav2vec 2.0, which uses clustering and an augmentation-based cross-contrastive loss as its self-supervised objective. Through the clustering module, we scale down the influence of those negative examples that are highly similar to the positive. The Cross-Contrastive loss is computed between the encoder output of the original sample and the quantizer output of its augmentation, and vice-versa, bringing robustness to the pre-training strategy. ccc-wav2vec 2.0 achieves up to 15.6% and 12.7% relative WER improvement over the baseline wav2vec 2.0 on the test-clean and test-other sets of LibriSpeech respectively, without the use of any language model. The proposed method also achieves up to 14.9% relative WER improvement over the baseline wav2vec 2.0 when fine-tuned on Switchboard data.

Index Terms— self-supervised learning, automatic speech recognition, domain adaptation, telephone speech
1. INTRODUCTION
The use of SSL to learn high-level representations from unlabeled data has received much attention in the last few years in the domains of Computer Vision (CV) [1], Natural Language Processing (NLP) [2], and Spoken Language Processing (SLP) [3]. Though much progress has been made in CV and NLP, self-supervised learning for SLP has been relatively understudied. In a majority of prior work, SSL for SLP solves a variant of Masked Acoustic Modeling (MAM), either through instance discrimination using contrastive learning [3, 4] or masked prediction [5, 6]. There is, however, much potential to improve the existing self-supervised tasks for better representation learning. In this paper, we introduce two improvements over the standard wav2vec 2.0 that help learn better and more robust speech representations through SSL.

Though SSL has been seen to benefit from scale [7], the performance of SSL for speech with limited unlabeled data needs further attention. Considering that even unlabeled data is limited for most languages in real-world scenarios, SSL algorithms that can learn useful representations in low-resource regimes are the need of the hour [7]. Data augmentation has proven to be an effective strategy for supervised learning setups [8, 9] when the amount of labeled data is limited. Very recently, self-supervised learning in speech has also been shown to benefit from data augmentation [10, 11]. [12] showed that data augmentation benefits Contrastive Predictive Coding [13] when a limited amount of unlabeled data is available. [11] shows how introducing specific augmentations makes their speech recognition model more robust to far-field, multi-talker noisy environments. Contrastive learning that learns representations by maximizing the agreement between differently augmented views of the same data is a methodology predominant in CV and has achieved state-of-the-art results in many applications [1, 14]. Inspired by this, we add an auxiliary task to the standard wav2vec 2.0 Contrastive Learning task, wherein we contrast the anchors with negatives generated from an augmented sample and vice-versa. This makes the model more robust to augmentations, which in turn helps learn better representations.
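For context, the wav2vec 2.0 objective we build on contrasts the context-network output c_t at a masked timestep t with its quantized target q_t against a set Q_t of distractors; in the standard formulation of [3], with sim denoting cosine similarity and κ a temperature, the per-timestep loss is:

\mathcal{L}_m = -\log \frac{\exp\left(\mathrm{sim}(\mathbf{c}_t, \mathbf{q}_t)/\kappa\right)}{\sum_{\tilde{\mathbf{q}} \in \mathbf{Q}_t} \exp\left(\mathrm{sim}(\mathbf{c}_t, \tilde{\mathbf{q}})/\kappa\right)}

Our auxiliary task keeps this form but draws the targets and distractors from the other view, pairing the original sample with its augmentation and vice-versa.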
Surprisingly, the choice of negative samples in a Contrastive Learning setup for SLP has drawn much less attention in the literature. Very often, given an "anchor" point x, "negative samples" x_i are randomly sampled from the training data, independent of how informative they may be for the learned representation. Though CV has very recently seen growing attention to this line of research [15, 16, 17], to the best of our knowledge, there is no existing work in speech, despite many state-of-the-art systems solving a contrastive learning task for self-supervised speech representation learning [3, 4].
We look at Masked Acoustic Modeling (MAM) from the lens of language modeling and hypothesize that, similar to the Contrastive Learning setups in NLP [18], it is important to sample negatives that are semantically different. The need is amplified in speech representation learning, where the negative sampling strategy becomes all the more important due to the quasi-stationary nature of speech, which makes several consecutive speech frames correspond to the same phone or sound. Moreover, Contrastive Learning models that use instance discrimination as a pre-training task tend to fall into over-clustering [19] during training. Thus, we hypothesize that negative examples mapped very close to the anchor
in terms of their similarity might represent the same phone or class. Considering such negative examples would contradict the representation learning task, which should primarily focus on discriminating between sounds or phones.
To arrive at more informative negatives for the contrastive loss, we propose to cluster our potential negative examples and diminish the effect, in the loss computation, of those negatives that fall into the same cluster as the positive. Simply put, this process identifies the weak, non-informative negatives in our population and reduces their impact on the loss computation.
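As a minimal sketch of this idea (our illustration, not the authors' released implementation: the helper name, the choice of k-means from scikit-learn, the number of clusters, and the down-weighting factor alpha are all assumptions), the candidate negatives are clustered together with the positive, and negatives landing in the positive's cluster contribute to the InfoNCE denominator with a reduced weight:

# Illustrative sketch of cluster-aided down-weighting of negatives.
# Hypothetical helper; not the authors' released code.
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

def cluster_weighted_infonce(anchor, positive, negatives,
                             n_clusters=8, alpha=0.3, temperature=0.1):
    # anchor:    (D,)   context-network output at a masked timestep
    # positive:  (D,)   quantized target for that timestep
    # negatives: (K, D) quantized distractors from the same utterance
    # Cluster the positive together with the candidate negatives.
    cands = torch.cat([positive.unsqueeze(0), negatives], dim=0)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(
        cands.detach().cpu().numpy())
    pos_cluster, neg_clusters = labels[0], labels[1:]

    # Cosine-similarity logits against the positive and all negatives.
    sims = F.cosine_similarity(anchor.unsqueeze(0), cands, dim=-1) / temperature

    # Negatives in the positive's cluster likely encode the same phone;
    # scale their contribution to the denominator down by alpha.
    weights = torch.ones_like(sims[1:])
    weights[torch.as_tensor(neg_clusters == pos_cluster)] = alpha

    exp_sims = torch.exp(sims)
    denom = exp_sims[0] + (weights * exp_sims[1:]).sum()
    return -torch.log(exp_sims[0] / denom)

Down-weighting the exponentiated similarity, rather than discarding same-cluster negatives outright, keeps the denominator well-populated while limiting the contradictory training signal.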
We also demonstrate the robustness of the proposed approach through tasks such as Domain Adaptation and zero-shot decoding on the Switchboard [20] and Wall Street Journal (WSJ) [21] datasets, respectively. To summarize, our primary contributions are as follows:
• We introduce an augmentation of the original sample and use its representations to add an auxiliary Cross-Contrastive loss to the existing contrastive loss in wav2vec 2.0.
• We demonstrate the usefulness of a clustering module to segregate the negative examples and thereby control the effect of the weak, non-informative negative examples in the contrastive learning task.
• Combining the above two modules leads to the development of ccc-wav2vec 2.0, a robust pre-training approach that consistently outperforms wav2vec 2.0 in tasks such as ASR, Domain Adaptation, and zero-shot decoding.
Our code and models are publicly available on GitHub at https://github.com/Speech-Lab-IITM/CCC-wav2vec-2.0. Correspondence to Vasista Sai Lodagala: vasista.lodagala@gmail.com.
2. RELATED WORK

SSL for speech representation learning has been prevalent in the form of MAM. Most of the MAM approaches introduced in the literature aim to either predict the class of the masked entity using a classification objective as in [5, 6], reconstruct the original frame as in [22, 23], or enforce similarity between the network's prediction for the masked frame and a quantized representation of the original masked frame by solving a Contrastive Learning task as in [3]. Some approaches instead propose to solve two of these tasks simultaneously [24, 4, 25, 26].
2.1. Negative Sampling
Contrastive Learning has been observed to dominate self-supervised speech representation learning methodologies in various forms [3, 4], constantly achieving new State-of-the-Art (SOTA) results on a variety of SLP tasks [27]. However, it is hard to find works that discuss efficient negative mining
for Contrastive Learning in SLP. In contrast, this line of research has seen great success in CV [17, 15, 16, 28]. The idea has also been successfully applied in metric learning, where most works [29, 30] observe that it is helpful to use negative examples that are difficult to discriminate from the current embedding. [13] was one of the first works to analyze the effect of efficient negative sampling, wherein the authors observed a drop in performance when negatives were not mined from the same speaker. To the best of our knowledge, this is the first work to design the Contrastive Learning task for speech SSL to explicitly control the choice of negatives.
2.2. Data Augmentation
The usefulness of data augmentation for robust SLP has been explored extensively, primarily in supervised learning setups [8, 9]. A wide range of SLP tasks, such as ASR [31, 32] and Speaker Identification [33], have been seen to benefit from data augmentation. Low-resource learning [31, 32] and recognition in far-field and noisy environments [34, 35] have been seen to benefit the most. Very recently, the benefits of data augmentation have been explored in SSL-based speech representation learning [10, 11], where the former focuses on low-resource settings and the latter on improving ASR in far-field and noisy environments. In a recent work based on wav2vec 2.0, [11] proposes a Multi-Variant Consistency based objective wherein multiple augmented versions of the same audio sample are created; the original audio sample is discarded, and a contrastive loss is computed between the multiple augmented versions. Our proposed approach differs from this work in the following ways: 1) we retain the original audio sample and use the cross-contrastive loss with the augmentation as an auxiliary loss in addition to the original wav2vec 2.0 objective; 2) in our loss computation, the effect of the various negative examples is controlled by the clustering module, leading to an informed contrast with the "anchor" in the contrastive loss.
3. METHODOLOGY

In the following subsections, we elaborate on the Cross-Contrastive setup and the Clustering module, both of which are key components of the proposed ccc-wav2vec 2.0. Finally, we present how these two components are integrated to form ccc-wav2vec 2.0.
3.1. Cross-Contrastive Learning

Deciphering speech and sound in noisy environments is not a challenging task for humans. However, the same cannot be expected of SSL models unless they have been trained for it. In order to bring robustness to the pre-training approach, we tap into augmentations of the speech samples.

Given an audio sample X, we apply an augmentation over it using the torchaudio-augmentations library [36] to get X′ as the augmented sample. We pass both X and X′ through the wav2vec 2.0 model to get the quantized representations and context-network outputs of each.
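As a minimal sketch of this cross pairing (our illustration with hypothetical helper names and tensor shapes, not the authors' released code), each view's context output is contrasted against the quantized candidates of the other view, and the two directions are averaged; per the comparison in Section 2.2, this term serves as an auxiliary loss on top of the standard same-view objective:

# Illustrative sketch of the cross-contrastive pairing.
# Hypothetical helpers; not the authors' released implementation.
import torch
import torch.nn.functional as F

def info_nce(context, targets, pos_idx=0, temperature=0.1):
    # context: (D,) context-network output at a masked timestep.
    # targets: (K+1, D) quantized candidates (positive plus K distractors).
    logits = F.cosine_similarity(context.unsqueeze(0), targets, dim=-1) / temperature
    return F.cross_entropy(logits.unsqueeze(0),
                           torch.tensor([pos_idx], dtype=torch.long))

def cross_contrastive_loss(c_orig, q_orig, c_aug, q_aug):
    # c_*: (D,) context outputs for the original (X) and augmented (X') views
    # at the same masked timestep; q_*: (K+1, D) quantized candidates per view.
    loss_o2a = info_nce(c_orig, q_aug)  # X's encoder vs. X''s quantizer output
    loss_a2o = info_nce(c_aug, q_orig)  # X''s encoder vs. X's quantizer output
    return 0.5 * (loss_o2a + loss_a2o)

Because the model must match a context representation from one view to quantized targets from the other, agreement across augmentations is rewarded directly, which is what brings the robustness described above.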