
increase. Rather than requiring a separate copy of the model
for each downstream task, a single generalized upstream
model can simultaneously transfer to many different tasks.
Adapters have been shown to work well for machine transla-
tion [6], cross-lingual transfer [7], as well as transfer learning
in automatic speech recognition (ASR) [8]. However, these
efficient tuning methods have not been systematically studied with SSL speech models.
To bring efficient tuning methods to the field of SSL speech representation learning, in this work we explore their effectiveness for self-supervised speech models on the SUPERB benchmark [3]. We apply different efficient tuning methods, including adapter tuning and prompt tuning, to SSL speech models with different training objectives. We propose an adapter framework for multiple downstream speech processing tasks, including recognition, classification, and speaker tasks. To investigate the effectiveness of these efficient methods, we conduct experiments on three SSL models with different training objectives: HuBERT, Wav2vec2 [9], and DeCoAR2 [10]. The main concept of our work is shown in Fig. 1. To the best of our knowledge, this is the first comprehensive investigation of various efficient tuning methods on different speech tasks. We show that performance parity can be achieved with over 90% fewer tuned parameters. Furthermore, we discuss the pros and cons of various efficient tuning techniques; for example, the Houlsby adapter [4] offers the best trade-off between performance and the number of parameters, and weighted-sum tuning is a particularly suitable efficient method for SSL speech tasks.
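For concreteness, the sketch below (PyTorch, not from the paper; the layer count and feature dimension are illustrative assumptions) shows the weighted-sum idea: a frozen SSL encoder exposes its layer-wise hidden states, and only one scalar weight per layer is learned to mix them into the downstream feature.

```python
# A minimal sketch of weighted-sum tuning over the hidden states of a frozen
# SSL encoder. The layer count and feature dimension are illustrative
# assumptions, not values from the paper.
import torch
import torch.nn as nn


class WeightedSum(nn.Module):
    """Learn one scalar weight per layer and mix the layer outputs."""

    def __init__(self, num_layers: int):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(num_layers))  # the only new parameters

    def forward(self, hidden_states):
        # hidden_states: list of (batch, time, dim) tensors, one per encoder layer
        stacked = torch.stack(hidden_states, dim=0)        # (layers, batch, time, dim)
        norm_w = torch.softmax(self.weights, dim=0)        # normalize the layer weights
        return (norm_w.view(-1, 1, 1, 1) * stacked).sum(dim=0)


# Usage with hypothetical shapes: 13 layers (CNN output + 12 transformer layers).
mix = WeightedSum(num_layers=13)
feats = mix([torch.randn(4, 100, 768) for _ in range(13)])  # -> (4, 100, 768)
```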
2. RELATED WORKS
2.1. Adapter Approach
For NLP tasks, adapters were introduced for the transformer architecture. An adapter typically consists of a two-layer feed-forward bottleneck [4]. Adapters have been found to approach the performance of full fine-tuning of a PLM on NLP tasks with only a fraction of the parameters.
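As a concrete illustration, the following is a minimal sketch of such a bottleneck adapter, assuming a hidden dimension of 768 and a bottleneck of 64 (illustrative values, not taken from [4]); in the Houlsby design, such modules are inserted into every transformer layer and only they are trained.

```python
# A minimal sketch of a Houlsby-style bottleneck adapter: a down-projection,
# a non-linearity, an up-projection, and a residual connection. Only the
# adapter parameters are trained; the dimensions are illustrative assumptions.
import torch
import torch.nn as nn


class BottleneckAdapter(nn.Module):
    def __init__(self, dim: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)   # project to a small bottleneck
        self.up = nn.Linear(bottleneck, dim)     # project back to the model dimension
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection keeps the frozen transformer path intact.
        return x + self.up(self.act(self.down(x)))
```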
Inspired by the success of prompting methods that control PLMs through textual prompts [11], prefix tuning [5] and neural reprogramming [12] prepend additional tunable prefix tokens to the hidden layers and optimize only these soft prompts during fine-tuning.
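The sketch below illustrates the soft-prompt idea in its simplest form, prepending a trainable prefix to a hidden-state sequence; prefix tuning proper applies this to the keys and values of every attention layer, and the prefix length and dimension here are assumptions.

```python
# A minimal sketch of prepending tunable prefix (soft prompt) vectors to a
# hidden-state sequence; only the prompt embeddings would be optimized.
# Prefix length and dimension are illustrative assumptions.
import torch
import torch.nn as nn


class SoftPrefix(nn.Module):
    def __init__(self, prefix_len: int = 16, dim: int = 768):
        super().__init__()
        self.prefix = nn.Parameter(torch.randn(prefix_len, dim) * 0.02)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, time, dim); the same trainable prefix is prepended
        # to every example in the batch.
        batch = hidden.size(0)
        prefix = self.prefix.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prefix, hidden], dim=1)  # (batch, prefix_len + time, dim)
```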
More recently, LoRA [13] learns low-rank matrices to approximate the parameter updates.
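A minimal sketch of the low-rank idea is given below: a frozen linear layer is augmented with a trainable product of two small matrices, so only rank × (in + out) parameters are updated (the rank and scaling are illustrative assumptions).

```python
# A minimal sketch of a LoRA-style update: the frozen weight W is augmented
# with a trainable low-rank product B @ A. Rank and scaling are assumptions.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)   # keep the pre-trained weight frozen
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))  # starts as a no-op
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = x W^T + x (B A)^T * scale; only A and B receive gradients.
        return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scale
```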
AdapterBias [14] adds a token-dependent parameter shift to transfer from a PLM in a more parameter-efficient manner.
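The following sketch shows one way to realize such a token-dependent shift in the spirit of AdapterBias, assuming a shared shift vector scaled by a per-token scalar; the exact parameterization in [14] may differ.

```python
# A minimal sketch of a token-dependent shift: a single shared vector v is
# scaled by a per-token scalar alpha from a tiny linear layer and added to
# the hidden states. Dimensions are illustrative assumptions.
import torch
import torch.nn as nn


class TokenDependentShift(nn.Module):
    def __init__(self, dim: int = 768):
        super().__init__()
        self.v = nn.Parameter(torch.zeros(dim))   # shared shift direction
        self.alpha = nn.Linear(dim, 1)            # per-token scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim); each token receives its own scaled copy of v.
        return x + self.alpha(x) * self.v
```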
Beyond its parameter efficiency, adapter tuning has also been shown to be more robust owing to its ability to preserve pre-trained knowledge [15], and it often exhibits robustness in out-of-distribution evaluation [5].
Fig. 2. Illustration of the transformer architecture and parameter-efficient tuning methods. The blocks with dashed borderlines are the parameters added by the efficient methods. Wq, Wk, and Wv represent the weights of the query, key, and value, respectively.
In the field of speech processing, adapters have also been utilized for efficient SSL tuning. The use of adapters on Wav2vec2 for efficient ASR tuning has been proposed [8]. Moreover, the work in [16] proposes residual adapters (RAs), which are inserted into the pre-trained model to learn domain-related information with the same SSL loss as in the pre-training stage.
stage. Adapters have also been employed for efficient SSL
speech pre-training of new tasks in a continual learning set-
ting [17]. As for prompting, it has been applied to speech
tasks [18] with a prompt tuning paradigm for the Generative Spoken Language Model [19].
However, the above works either apply adapters to a single SSL speech model for a specific task or do not examine different efficient tuning methods on different downstream tasks in a comprehensive way. This leaves open the question of whether the efficient tuning methods developed for NLP yield the same effectiveness when utilized in speech processing tasks. We hypothesize that we will see the same benefits of adapters in a speech model as in an NLP model, namely parameter-efficient transfer of the pre-trained network to different downstream tasks with little performance degradation.
2.2. The SUPERB benchmark
As increasingly powerful SSL models are proposed with promising performance on various tasks, researchers continually seek comprehensive evaluation methods to assess model performance, in the hope of understanding the capability of the representations learned by these models. SUPERB [3] is a