
Self-supervised Rewiring of Pre-trained Speech Encoders:
Towards Faster Fine-tuning with Less Labels in Speech Processing
Hao Yang∗   Jinming Zhao∗   Gholamreza Haffari   Ehsan Shareghi
Department of Data Science & AI, Monash University
firstname.lastname@monash.edu
Abstract
Pre-trained speech encoders have facilitated great success across various speech processing tasks. However, fine-tuning these encoders for downstream tasks requires sufficiently large training data to converge or to achieve state-of-the-art performance. In the text domain this has been partly attributed to sub-optimality of the representation space in pre-trained Transformers. In this work, we take a sober look into pre-trained speech encoders and rewire their representation space without requiring any task-specific labels. Our method utilises a neutrally synthesised version of the audio inputs along with frame masking to construct positive pairs for contrastive self-supervised learning. When it is used for augmenting the WAV2VEC 2 encoder, we observe consistent improvement of isotropy in the representation space. Our experiments on 6 speech processing tasks exhibit a significant convergence speedup during task fine-tuning as well as consistent task improvement, especially in low-resource settings.1
1 Introduction
Self-supervised pre-trained speech encoders (Hsu et al., 2021a; Baevski et al., 2020) are universal models that are beneficial to a wide range of speech processing tasks and domains (Liu et al., 2022; Tsai et al., 2022). Similar to other modalities such as text, these pre-trained encoders are fine-tuned towards downstream tasks (Wang et al., 2022; Gállego et al., 2021). While the fine-tuning step often benefits substantially from the presence of warm pre-trained encoders, for involved tasks such as Automatic Speech Recognition (ASR), it still requires both sufficiently large training sets and several iterations (Yang et al., 2021) to converge to an acceptable task performance.
∗ These authors contributed equally to this work.
1 Our code and models are available at https://github.com/YangHao97/rewireW2V2.
Side-stepping the size of the parameter space as a well-studied challenge for fine-tuning Transformer models, a confounding factor contributing to this issue, which has been recently discussed for the text domain (Su et al., 2022; Gao et al., 2021b; Liu et al., 2021; Su et al., 2021), is the sub-optimal utilisation of the representation space (e.g., anisotropy (Ethayarajh, 2019)). This is of paramount importance since speech, unlike text, carries information (e.g., prosodic and para-linguistic) beyond content, which demands a richer utilisation of the representation space (Mohamed et al., 2022). Inevitably, less expressive initial representations translate into longer training and call for more labelled data, even in cases of frozen models. Nonetheless, understanding representation space utilisation in pre-trained speech Transformers is heavily under-explored (Pasad et al., 2021; Hsu et al., 2021b).
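To make the notion of anisotropy concrete, the sketch below estimates it as the expected cosine similarity between representations of randomly paired, unrelated utterances (in the spirit of Ethayarajh, 2019). The mean-pooling of frames and the random-pair sampling are our own illustrative assumptions, not the paper's exact measurement protocol.

```python
import torch
import torch.nn.functional as F

def anisotropy(frame_reps: list[torch.Tensor], n_pairs: int = 1000) -> float:
    """Estimate anisotropy as the mean cosine similarity between
    mean-pooled representations of randomly paired utterances.

    frame_reps: list of (time, dim) frame-level representations,
                one tensor per utterance.
    A value near 1 means the space collapses into a narrow cone;
    a value near 0 indicates a more isotropic (better utilised) space.
    """
    # Mean-pool frames into one vector per utterance: (num_utts, dim).
    pooled = torch.stack([r.mean(dim=0) for r in frame_reps])
    pooled = F.normalize(pooled, dim=-1)

    # Sample random utterance pairs and average their cosine similarity.
    i = torch.randint(0, len(pooled), (n_pairs,))
    j = torch.randint(0, len(pooled), (n_pairs,))
    mask = i != j  # discard self-pairs
    return (pooled[i[mask]] * pooled[j[mask]]).sum(dim=-1).mean().item()
```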
We move towards addressing this gap by highlighting the properties of such representation spaces, and proposing a self-supervised learning method that improves their utilisation prior to task fine-tuning. Our contrastive learning framework constructs positive pairs by (i) encouraging invariance to local perturbations both at the input and representation levels, and (ii) enhancing sensitivity to content by using a monotonically synthesised version of the speech inputs.
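As a rough illustration of this idea, the sketch below forms positive pairs from an utterance and a synthesised rendition of the same content, applies random frame masking as the local perturbation, and scores the pairs with a standard NT-Xent (InfoNCE) contrastive loss. The encoder interface, masking ratio, and temperature are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def mask_frames(frames: torch.Tensor, ratio: float = 0.15) -> torch.Tensor:
    """Randomly zero out a fraction of frames as a local perturbation.
    frames: (batch, time, dim) frame-level representations.
    """
    keep = torch.rand(frames.shape[:2], device=frames.device) > ratio
    return frames * keep.unsqueeze(-1)

def contrastive_rewiring_loss(encode, audio, synth_audio, temperature=0.1):
    """NT-Xent loss over positive pairs formed by an utterance and a
    synthesised rendition of the same content, each with frame masking.

    `encode` is assumed to map (batch, samples) waveforms to
    (batch, time, dim) frame representations (e.g., a wav2vec 2.0 forward).
    """
    z1 = mask_frames(encode(audio)).mean(dim=1)        # (batch, dim)
    z2 = mask_frames(encode(synth_audio)).mean(dim=1)  # (batch, dim)
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)

    # Diagonal entries are the positive pairs; off-diagonal entries
    # serve as in-batch negatives.
    logits = z1 @ z2.t() / temperature
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)
```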
Our experimental findings across 6 diverse speech processing tasks (covering content, speaker and semantics tasks), built on top of the widely used WAV2VEC 2 LARGE (W2V2) (Baevski et al., 2020) encoder, demonstrate that contrastive rewiring brings substantial improvement, both in task performance and fine-tuning speed. Particularly, our approach shines in the low-resource condition, outperforming the W2V2 baseline with substantially fewer fine-tuning updates. For instance, in ASR with 1% of the training data, our approach achieves 1/4 of the error in 1/5 of the fine-tuning updates. Beyond task performance and con-