Self-supervised Rewiring of Pre-trained Speech Encoders:
Towards Faster Fine-tuning with Less Labels in Speech Processing
Hao Yang  Jinming Zhao  Gholamreza Haffari  Ehsan Shareghi
Department of Data Science & AI, Monash University
firstname.lastname@monash.edu
Abstract
Pre-trained speech encoders have facilitated great success across various speech processing tasks. However, fine-tuning these encoders for downstream tasks requires sufficiently large training data to converge or to achieve state-of-the-art performance. In the text domain, this has been partly attributed to the sub-optimality of the representation space in pre-trained Transformers. In this work, we take a sober look into pre-trained speech encoders and rewire their representation space without requiring any task-specific labels. Our method utilises a neutrally synthesised version of audio inputs along with frame masking to construct positive pairs for contrastive self-supervised learning. When it is used for augmenting the WAV2VEC 2 encoder, we observe a consistent improvement of isotropy in the representation space. Our experiments on 6 speech processing tasks exhibit a significant convergence speedup during task fine-tuning as well as consistent task improvement, especially in low-resource settings.[1]
1 Introduction
Self-supervised pre-trained speech encoders (Hsu et al., 2021a; Baevski et al., 2020) are universal models that are beneficial to a wide range of speech processing tasks and domains (Liu et al., 2022; Tsai et al., 2022). Similar to other modalities such as text, these pre-trained encoders are fine-tuned towards downstream tasks (Wang et al., 2022; Gállego et al., 2021). While the fine-tuning step often benefits substantially from the presence of warm pre-trained encoders, for involved tasks such as Automatic Speech Recognition (ASR), it still requires both sufficiently large training sets and several iterations (Yang et al., 2021) to converge to an acceptable task performance.
These authors contributed equally to this work.
[1] Our code and models are available at https://github.com/YangHao97/rewireW2V2.
Side-stepping the size of the parameter space as a well-studied challenge for fine-tuning Transformer models, a confounding factor contributing to this issue, which has recently been discussed for the text domain (Su et al., 2022; Gao et al., 2021b; Liu et al., 2021; Su et al., 2021), is the sub-optimal utilisation of the representation space (e.g., anisotropy (Ethayarajh, 2019)). This is of paramount importance since speech, unlike text, carries information (e.g., prosodic and para-linguistic) beyond content, which demands a richer utilisation of the representation space (Mohamed et al., 2022). Inevitably, less expressive initial representations translate into longer training and call for more labelled data, even in the case of frozen models. Nonetheless, understanding representation space utilisation in pre-trained speech Transformers remains heavily under-explored (Pasad et al., 2021; Hsu et al., 2021b).
We move towards addressing this gap by highlighting the properties of such representation spaces, and by proposing a self-supervised learning method that improves their utilisation prior to task fine-tuning. Our contrastive learning framework constructs positive pairs by (i) encouraging invariance to local perturbations both at the input and representation levels, and (ii) enhancing sensitivity to content by using a monotonically synthesised version of speech inputs.
Our experimental findings across 6 diverse speech processing tasks (covering content, speaker and semantics tasks), built on top of the widely used WAV2VEC 2 LARGE (W2V2) encoder (Baevski et al., 2020), demonstrate that contrastive rewiring brings substantial improvement, both in task performance and fine-tuning speed. In particular, our approach shines in the low-resource condition, outperforming the W2V2 baseline with substantially fewer fine-tuning updates. For instance, in ASR with 1% of the training data, our approach achieves 1/4 of the error in 1/5 of the fine-tuning updates. Beyond task performance and convergence speed, both our qualitative and quantitative analyses of the representation space highlight the improvements injected by our rewiring strategy.
2 Self-Supervised Contrastive Rewiring
Our method builds on top of a pre-trained speech encoder by using a small (less than 7k) set of raw unlabelled audio signals to form the self-supervised learning basis for contrastive rewiring. In what follows, we detail how utterance-level speech representations are produced from the underlying encoder, and provide a brief overview of the InfoNCE objective function used for our contrastive rewiring. We finish by explaining how we construct the pairs needed for contrastive learning.
Speech Representation. Most pre-trained speech encoders, including W2V2, do not have an explicit token representing the utterance-level representation (e.g., [CLS] for BERT (Kenton and Toutanova, 2019)). Given a raw audio sequence $s$ of length $L$, W2V2 emits $m$ vectors, where $m \ll L$, at each layer (a total of 24 Transformer layers + 1 feature extractor layer). Similar to Chung et al. (2021), we take the mean of these vectors to construct the utterance-level representation used for contrastive learning.
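As an illustration, the following is a minimal PyTorch sketch of this pooling step, assuming the HuggingFace `transformers` implementation of W2V2; the checkpoint name and the choice of pooling only the final layer are our own illustrative assumptions, not details fixed by the paper.

```python
import torch
from transformers import Wav2Vec2Model

# Illustrative checkpoint; the paper uses a WAV2VEC 2 LARGE encoder.
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-large-lv60")
model.eval()

def utterance_embedding(waveform: torch.Tensor) -> torch.Tensor:
    """Mean-pool the m frame vectors of a raw audio sequence (shape [1, L])."""
    with torch.no_grad():
        out = model(input_values=waveform, output_hidden_states=True)
    # out.hidden_states holds one [batch, m, hidden] tensor per layer
    # (feature projection + 24 Transformer layers); we pool the last one over frames.
    frames = out.hidden_states[-1]
    return frames.mean(dim=1)  # shape [batch, hidden]

# Usage: embed a 1-second dummy waveform sampled at 16 kHz.
emb = utterance_embedding(torch.randn(1, 16000))
print(emb.shape)  # torch.Size([1, 1024]) for the LARGE model
```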
InfoNCE. We use the InfoNCE objective (Oord et al., 2018) to rewire speech representations by pulling positive examples, $(s_i, s'_i)$, closer and pushing away the negative pairs, $(s_i, s_j)$. The loss for a batch $b$ of size $|D_b|$ is,

$\mathcal{L} = - \sum_{i=1}^{|D_b|} \log \dfrac{\exp(\cos(f(s_i), f(s'_i)) / \tau)}{\sum_{s_j \in N_i \cup \{s'_i\}} \exp(\cos(f(s_i), f(s_j)) / \tau)},$

where $f(\cdot)$ indicates the encoder, $\tau$ denotes the temperature hyperparameter, $\cos(\cdot, \cdot)$ denotes the cosine similarity between two representations, and $N_i$ includes all negative examples for $s_i$. All parameters of the encoder are updated during optimisation.
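A compact PyTorch sketch of this loss for one batch is given below; it treats the other in-batch anchors as the base negative set $N_i$ and accepts optional strategy-specific negatives. The function name, temperature value, and tensor layout are illustrative assumptions.

```python
from typing import Optional

import torch
import torch.nn.functional as F

def info_nce(anchors: torch.Tensor,
             positives: torch.Tensor,
             extra_negatives: Optional[torch.Tensor] = None,
             tau: float = 0.1) -> torch.Tensor:
    """InfoNCE over a batch of utterance embeddings.

    anchors, positives: [B, D] embeddings f(s_i) and f(s'_i).
    extra_negatives:    optional [B, K, D] strategy-specific negatives per item.
    The other anchors in the batch act as the base negative set N_i.
    """
    B = anchors.size(0)
    a = F.normalize(anchors, dim=-1)   # unit-normalise so the dot product is cosine similarity
    p = F.normalize(positives, dim=-1)

    sim_pos = (a * p).sum(dim=-1, keepdim=True) / tau            # [B, 1]
    sim_batch = (a @ a.t()) / tau                                 # [B, B] anchor-anchor similarities
    self_mask = torch.eye(B, dtype=torch.bool, device=a.device)
    sim_batch = sim_batch.masked_fill(self_mask, float("-inf"))   # exclude s_i itself

    logits = torch.cat([sim_pos, sim_batch], dim=1)               # positive sits at column 0
    if extra_negatives is not None:
        n = F.normalize(extra_negatives, dim=-1)                  # [B, K, D]
        sim_extra = torch.einsum("bd,bkd->bk", a, n) / tau
        logits = torch.cat([logits, sim_extra], dim=1)

    # cross-entropy with target 0 == -log( exp(pos) / sum over positive + negatives )
    targets = torch.zeros(B, dtype=torch.long, device=a.device)
    return F.cross_entropy(logits, targets)
```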
2.1 Contrastive Pair Construction
Positive Pairs. We form positive pairs both at the raw and representation levels. For a given audio signal $s_i$, we deploy the following 3 strategies to construct its corresponding positive pairs, $(s_i, s'_i)$:
Figure 1: Conceptual visualisation: the vanilla representation space, which is very sensitive to surface similarity of audio signals (left), vs. the representation space rewired with the Neutral strategy, which places more emphasis on content similarity (right).

Twin. Inspired by Liu et al. (2021) and Gao et al. (2021b), given a speech sequence of length $L$, we first duplicate it. Then we randomly select a starting point for a span, and mask $p \times L$ consecutive signals from the audio, replacing them with [MASK]. We use $p = 20\%$ in our experiments. This is applied always, and only once, to each $s_i$.
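Below is a small sketch of this masking step on the raw waveform; replacing the span with zeros stands in for the paper's [MASK], since the exact fill value is not specified in this excerpt.

```python
import torch

def twin_view(waveform: torch.Tensor, p: float = 0.20,
              mask_value: float = 0.0) -> torch.Tensor:
    """Duplicate a raw audio sequence (shape [L]) and mask one contiguous span of p*L samples.

    Using a constant `mask_value` is an assumption standing in for [MASK].
    """
    twin = waveform.clone()
    L = twin.size(0)
    span = int(p * L)
    start = torch.randint(0, max(L - span, 1), (1,)).item()
    twin[start:start + span] = mask_value
    return twin

# Usage: s_i and its masked duplicate form one positive pair (s_i, s'_i).
s_i = torch.randn(16000)        # 1 second of 16 kHz audio
s_prime_i = twin_view(s_i)
```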
Neutral. For a given audio $s_i$, its monotonic neutral version is created from the available transcripts[2] using the Festival Speech Synthesis System.[3] This synthesiser is chosen because it is able to produce non-expressive speech, as demonstrated in previous studies (Lotfian and Busso, 2017). The neutral version is devoid of noise, prosody and para-linguistic features, focusing mostly on content. Figure 1 illustrates the intended effect of Neutral rewiring.
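The neutral view can be produced offline from the transcripts. A hedged sketch using Festival's text2wave command-line tool is shown below; the voice, flags, and file layout are assumptions, as the paper does not give the exact invocation.

```python
import subprocess
import tempfile

def synthesise_neutral(transcript: str, out_wav: str) -> None:
    """Render a transcript as non-expressive speech with Festival's default voice.

    Assumes Festival (and its `text2wave` script) is installed and on PATH.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
        f.write(transcript)
        text_path = f.name
    # text2wave reads the text file and writes a waveform to `out_wav`
    subprocess.run(["text2wave", text_path, "-o", out_wav], check=True)

# Usage: the synthesised file becomes the Neutral positive partner of the original audio.
synthesise_neutral("it is cheap", "neutral_0001.wav")
```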
Mixed. While the Twin strategy aims to make the representations invariant to local changes and noise, the Neutral approach tends to rewire the space based on content-level similarity. To leverage the best of both worlds, as our main strategy, we uniformly interchange Twin and Neutral in the Mixed setting.
Negative Pairs. In all strategies, given a batch $b$ and a sample $s_i \in b$, the set $N_i$ of negative examples for $s_i$ is $N_i = \{s_j \mid s_j \in b,\ j \neq i\}$. Further, we add strategy-specific negative samples to $N_i$ to construct the negative pairs, $(s_i, s_j)$:

Twin. $N_i \cup \{\text{twin}(s_j) \mid s_j \in b,\ j \neq i\}$.
Neutral. $N_i \cup \{\text{neutral}(s_j) \mid s_j \in b,\ j \neq i\}$.
Mixed. The union of the above two.
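Putting the pieces together, the sketch below shows one way to assemble the positive views and the strategy-specific negatives for the Mixed setting at the waveform level, reusing the `twin_view` helper sketched earlier. Whether the Twin/Neutral interchange happens per item or per batch is our assumption; the returned views would then be encoded with the encoder and fed to the InfoNCE loss above.

```python
import random
from typing import List, Tuple

import torch

def build_mixed_pairs(
    batch: List[torch.Tensor],          # raw waveforms s_1 .. s_B
    neutral_batch: List[torch.Tensor],  # pre-synthesised neutral renderings of the same items
) -> Tuple[List[torch.Tensor], List[torch.Tensor], List[List[torch.Tensor]]]:
    """Return anchors, positive views, and per-item strategy-specific negatives."""
    twin_views = [twin_view(s) for s in batch]

    # Positive view: uniformly interchange Twin and Neutral (per item here;
    # the granularity of the interchange is an assumption of this sketch).
    positives = [
        twin_views[i] if random.random() < 0.5 else neutral_batch[i]
        for i in range(len(batch))
    ]

    # Negatives for s_i: besides the other anchors (handled inside the loss),
    # Mixed adds the union of twin(s_j) and neutral(s_j) for every j != i.
    extra_negatives = [
        [view
         for j in range(len(batch)) if j != i
         for view in (twin_views[j], neutral_batch[j])]
        for i in range(len(batch))
    ]
    return batch, positives, extra_negatives
```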
Similar to Liu et al. (2021) and Gao et al. (2021b), in all our strategies, we apply dropout to perturb

[2] Alternatively, one can apply an off-the-shelf ASR first over speech to produce transcripts when transcripts are absent.
[3] http://festvox.org/festival