Self-supervised Rewiring of Pre-trained Speech Encoders:
Towards Faster Fine-tuning with Less Labels in Speech Processing
Hao Yang  Jinming Zhao  Gholamreza Haffari  Ehsan Shareghi
Department of Data Science & AI, Monash University
firstname.lastname@monash.edu
Abstract
Pre-trained speech encoders have facilitated great success across various speech processing tasks. However, fine-tuning these encoders for downstream tasks requires sufficiently large training data to converge or to achieve state-of-the-art performance. In the text domain, this has been partly attributed to the sub-optimality of the representation space in pre-trained Transformers. In this work, we take a sober look into pre-trained speech encoders and rewire their representation space without requiring any task-specific labels. Our method utilises a neutrally synthesised version of audio inputs along with frame masking to construct positive pairs for contrastive self-supervised learning. When it is used for augmenting the WAV2VEC 2 encoder, we observe a consistent improvement of isotropy in the representation space. Our experiments on 6 speech processing tasks exhibit a significant convergence speedup during task fine-tuning as well as consistent task improvement, especially in low-resource settings.[1]
1 Introduction
Self-supervised pre-trained speech encoders (Hsu et al., 2021a; Baevski et al., 2020) are universal models that are beneficial to a wide range of speech processing tasks and domains (Liu et al., 2022; Tsai et al., 2022). Similar to other modalities such as text, these pre-trained encoders are fine-tuned towards downstream tasks (Wang et al., 2022; Gállego et al., 2021). While the fine-tuning step often benefits substantially from the presence of warm pre-trained encoders, for involved tasks such as Automatic Speech Recognition (ASR), it still requires both sufficiently large training sets and several iterations (Yang et al., 2021) to converge to an acceptable task performance.
These authors contributed equally to this work.
[1] Our code and models are available at https://github.com/YangHao97/rewireW2V2.
Side-stepping the size of the parameter space as a well-studied challenge for fine-tuning Transformer models, a confounding factor contributing to this issue, which has recently been discussed for the text domain (Su et al., 2022; Gao et al., 2021b; Liu et al., 2021; Su et al., 2021), is the sub-optimal utilisation of the representation space (e.g., anisotropy (Ethayarajh, 2019)). This is of paramount importance since speech, unlike text, carries information (e.g., prosodic and para-linguistic) beyond content, which demands a richer utilisation of the representation space (Mohamed et al., 2022). Inevitably, less expressive initial representations translate into longer training and call for more labelled data, even in the case of frozen models. Nonetheless, understanding representation space utilisation in pre-trained speech Transformers remains heavily under-explored (Pasad et al., 2021; Hsu et al., 2021b).
We move towards addressing this gap by highlighting the properties of such representation spaces, and by proposing a self-supervised learning method that improves their utilisation prior to task fine-tuning. Our contrastive learning framework constructs positive pairs by (i) encouraging invariance to local perturbations both at the input and representation levels, and (ii) enhancing sensitivity to content by using a monotonically synthesised version of speech inputs.
Our experimental findings across 6 diverse speech processing tasks (covering content, speaker and semantics tasks), built on top of the widely used WAV2VEC 2 LARGE (W2V2) encoder (Baevski et al., 2020), demonstrate that contrastive rewiring brings substantial improvement, both in task performance and fine-tuning speed. In particular, our approach shines in the low-resource condition, outperforming the W2V2 baseline with substantially fewer fine-tuning updates. For instance, in ASR with 1% of the training data, our approach achieves 1/4 of the error in 1/5 of the fine-tuning updates. Beyond task performance and convergence speed, both our qualitative and quantitative analyses of the representation space highlight the improvements injected by our rewiring strategy.
2 Self-Supervised Contrastive Rewiring
Our method builds on top of a pre-trained speech encoder by using a small (less than 7k) set of raw unlabelled audio signals to form the self-supervised learning basis for contrastive rewiring. In what follows, we detail how utterance-level speech representations are produced from the underlying encoder, and provide a brief overview of the InfoNCE objective function used for our contrastive rewiring. We finish by explaining how we construct the pairs needed for contrastive learning.
Speech Representation. Most pre-trained speech encoders, including W2V2, do not have an explicit token representing the utterance-level representation (e.g., [CLS] for BERT (Kenton and Toutanova, 2019)). Given a raw audio sequence $s$ of length $L$, W2V2 emits $m$ vectors, where $m \ll L$, at each layer (a total of 24 Transformer layers + 1 feature extractor layer). Similar to Chung et al. (2021), we take the mean of these vectors to construct the utterance-level representation used for contrastive learning.
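As an illustration, the following is a minimal PyTorch sketch of this pooling step, assuming the HuggingFace `transformers` implementation of W2V2; the checkpoint name and the choice of pooling only the final layer are our own illustrative assumptions, not details fixed by the paper.

```python
import torch
from transformers import Wav2Vec2Model

# Illustrative checkpoint; the paper uses a WAV2VEC 2 LARGE encoder.
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-large-lv60")
model.eval()

def utterance_embedding(waveform: torch.Tensor) -> torch.Tensor:
    """Mean-pool the m frame vectors of a raw audio sequence (shape [1, L])."""
    with torch.no_grad():
        out = model(input_values=waveform, output_hidden_states=True)
    # out.hidden_states holds one [batch, m, hidden] tensor per layer
    # (feature projection + 24 Transformer layers); we pool the last one over frames.
    frames = out.hidden_states[-1]
    return frames.mean(dim=1)  # shape [batch, hidden]

# Usage: embed a 1-second dummy waveform sampled at 16 kHz.
emb = utterance_embedding(torch.randn(1, 16000))
print(emb.shape)  # torch.Size([1, 1024]) for the LARGE model
```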
InfoNCE. We use the InfoNCE objective (Oord et al., 2018) to rewire speech representations by pulling positive examples, $(s_i, s'_i)$, closer and pushing away the negative pairs, $(s_i, s_j)$. The loss for a batch $b$ of size $|D_b|$ is,

$\mathcal{L} = - \sum_{i=1}^{|D_b|} \log \dfrac{\exp(\cos(f(s_i), f(s'_i)) / \tau)}{\sum_{s_j \in N_i \cup \{s'_i\}} \exp(\cos(f(s_i), f(s_j)) / \tau)},$

where $f(\cdot)$ indicates the encoder, $\tau$ denotes the temperature hyperparameter, $\cos(\cdot, \cdot)$ denotes the cosine similarity between two representations, and $N_i$ includes all negative examples for $s_i$. All parameters of the encoder are updated during optimisation.
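A compact PyTorch sketch of this loss for one batch is given below; it treats the other in-batch anchors as the base negative set $N_i$ and accepts optional strategy-specific negatives. The function name, temperature value, and tensor layout are illustrative assumptions.

```python
from typing import Optional

import torch
import torch.nn.functional as F

def info_nce(anchors: torch.Tensor,
             positives: torch.Tensor,
             extra_negatives: Optional[torch.Tensor] = None,
             tau: float = 0.1) -> torch.Tensor:
    """InfoNCE over a batch of utterance embeddings.

    anchors, positives: [B, D] embeddings f(s_i) and f(s'_i).
    extra_negatives:    optional [B, K, D] strategy-specific negatives per item.
    The other anchors in the batch act as the base negative set N_i.
    """
    B = anchors.size(0)
    a = F.normalize(anchors, dim=-1)   # unit-normalise so the dot product is cosine similarity
    p = F.normalize(positives, dim=-1)

    sim_pos = (a * p).sum(dim=-1, keepdim=True) / tau            # [B, 1]
    sim_batch = (a @ a.t()) / tau                                 # [B, B] anchor-anchor similarities
    self_mask = torch.eye(B, dtype=torch.bool, device=a.device)
    sim_batch = sim_batch.masked_fill(self_mask, float("-inf"))   # exclude s_i itself

    logits = torch.cat([sim_pos, sim_batch], dim=1)               # positive sits at column 0
    if extra_negatives is not None:
        n = F.normalize(extra_negatives, dim=-1)                  # [B, K, D]
        sim_extra = torch.einsum("bd,bkd->bk", a, n) / tau
        logits = torch.cat([logits, sim_extra], dim=1)

    # cross-entropy with target 0 == -log( exp(pos) / sum over positive + negatives )
    targets = torch.zeros(B, dtype=torch.long, device=a.device)
    return F.cross_entropy(logits, targets)
```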
2.1 Contrastive Pair Construction
Positive Pairs. We form positive pairs both at the raw and representation levels. For a given audio signal $s_i$, we deploy the following 3 strategies to construct its corresponding positive pairs, $(s_i, s'_i)$:
Figure 1: Conceptual visualisation: the vanilla representation space, which is very sensitive to surface similarity of audio signals (left), vs. the representation space rewired with the Neutral strategy, which places more emphasis on content similarity (right).

Twin. Inspired by Liu et al. (2021) and Gao et al. (2021b), given a speech sequence of length $L$, we first duplicate it. Then we randomly select a starting point for a span, and mask $p \times L$ consecutive signals from the audio, replacing them with [MASK]. We use $p = 20\%$ in our experiments. This is applied always, and only once, to each $s_i$.
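Below is a small sketch of this masking step on the raw waveform; replacing the span with zeros stands in for the paper's [MASK], since the exact fill value is not specified in this excerpt.

```python
import torch

def twin_view(waveform: torch.Tensor, p: float = 0.20,
              mask_value: float = 0.0) -> torch.Tensor:
    """Duplicate a raw audio sequence (shape [L]) and mask one contiguous span of p*L samples.

    Using a constant `mask_value` is an assumption standing in for [MASK].
    """
    twin = waveform.clone()
    L = twin.size(0)
    span = int(p * L)
    start = torch.randint(0, max(L - span, 1), (1,)).item()
    twin[start:start + span] = mask_value
    return twin

# Usage: s_i and its masked duplicate form one positive pair (s_i, s'_i).
s_i = torch.randn(16000)        # 1 second of 16 kHz audio
s_prime_i = twin_view(s_i)
```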
Neutral. For a given audio $s_i$, its monotonic neutral version is created from the available transcripts[2] using the Festival Speech Synthesis System.[3] This synthesiser is chosen because it is able to produce non-expressive speech, as demonstrated in previous studies (Lotfian and Busso, 2017). The neutral version is devoid of noise, prosody and para-linguistic features, focusing mostly on content. Figure 1 illustrates the intended effect of Neutral rewiring.
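The neutral view can be produced offline from the transcripts. A hedged sketch using Festival's text2wave command-line tool is shown below; the voice, flags, and file layout are assumptions, as the paper does not give the exact invocation.

```python
import subprocess
import tempfile

def synthesise_neutral(transcript: str, out_wav: str) -> None:
    """Render a transcript as non-expressive speech with Festival's default voice.

    Assumes Festival (and its `text2wave` script) is installed and on PATH.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
        f.write(transcript)
        text_path = f.name
    # text2wave reads the text file and writes a waveform to `out_wav`
    subprocess.run(["text2wave", text_path, "-o", out_wav], check=True)

# Usage: the synthesised file becomes the Neutral positive partner of the original audio.
synthesise_neutral("it is cheap", "neutral_0001.wav")
```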
Mixed. While the Twin strategy aims to make the representations invariant to local changes and noise, the Neutral approach tends to rewire the space based on content-level similarity. To leverage the best of both worlds, as our main strategy, we uniformly interchange Twin and Neutral in the Mixed setting.
Negative Pairs. In all strategies, given a batch $b$ and a sample $s_i \in b$, the set $N_i$ of negative examples for $s_i$ is $N_i = \{s_j \mid s_j \in b,\ j \neq i\}$. Further, we add strategy-specific negative samples to $N_i$ to construct the negative pairs, $(s_i, s_j)$:

Twin. $N_i \cup \{\text{twin}(s_j) \mid s_j \in b,\ j \neq i\}$.
Neutral. $N_i \cup \{\text{neutral}(s_j) \mid s_j \in b,\ j \neq i\}$.
Mixed. The union of the above two.
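Putting the pieces together, the sketch below shows one way to assemble the positive views and the strategy-specific negatives for the Mixed setting at the waveform level, reusing the `twin_view` helper sketched earlier. Whether the Twin/Neutral interchange happens per item or per batch is our assumption; the returned views would then be encoded with the encoder and fed to the InfoNCE loss above.

```python
import random
from typing import List, Tuple

import torch

def build_mixed_pairs(
    batch: List[torch.Tensor],          # raw waveforms s_1 .. s_B
    neutral_batch: List[torch.Tensor],  # pre-synthesised neutral renderings of the same items
) -> Tuple[List[torch.Tensor], List[torch.Tensor], List[List[torch.Tensor]]]:
    """Return anchors, positive views, and per-item strategy-specific negatives."""
    twin_views = [twin_view(s) for s in batch]

    # Positive view: uniformly interchange Twin and Neutral (per item here;
    # the granularity of the interchange is an assumption of this sketch).
    positives = [
        twin_views[i] if random.random() < 0.5 else neutral_batch[i]
        for i in range(len(batch))
    ]

    # Negatives for s_i: besides the other anchors (handled inside the loss),
    # Mixed adds the union of twin(s_j) and neutral(s_j) for every j != i.
    extra_negatives = [
        [view
         for j in range(len(batch)) if j != i
         for view in (twin_views[j], neutral_batch[j])]
        for i in range(len(batch))
    ]
    return batch, positives, extra_negatives
```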
Similar to Liu et al. (2021) and Gao et al. (2021b), in all our strategies, we apply dropout to perturb

[2] Alternatively, one can apply an off-the-shelf ASR first over speech to produce transcripts when transcripts are absent.
[3] http://festvox.org/festival