EXPLORING EFFICIENT-TUNING METHODS IN SELF-SUPERVISED SPEECH MODELS
Zih-Ching Chen*1, Chin-Lun Fu*1, Chih-Ying Liu1, Shang-Wen (Daniel) Li2, Hung-yi Lee1
1National Taiwan University
2Amazon AI
ABSTRACT
In this study, we aim to explore efficient tuning methods
for speech self-supervised learning. Recent studies show
that self-supervised learning (SSL) can learn powerful rep-
resentations for different speech tasks. However, fine-tuning
pre-trained models for each downstream task is parameter-
inefficient since SSL models are notoriously large with mil-
lions of parameters. Adapters are lightweight modules com-
monly used in NLP to solve this problem. In downstream
tasks, the parameters of SSL models are frozen, and only
the adapters are trained. Given the lack of studies generally
exploring the effectiveness of adapters for self-supervised
speech tasks, we intend to fill this gap by adding various
adapter modules in pre-trained speech SSL models. We show
that performance parity can be achieved with over 90% parameter reduction, and we discuss the pros and cons of efficient tuning techniques. This is the first comprehensive
investigation of various adapter types across speech tasks.
1. INTRODUCTION
Recently, self-supervised learning (SSL) has gained popular-
ity in the fields of computer vision (CV) and natural language processing (NLP), as well as in speech tasks. SSL pre-trains
a shared representation model on a huge amount of unlabeled
data. The pre-trained SSL model can be used for various
downstream tasks with minimal adaptation via either fine-
tuning or utilizing the learned representation from the frozen
model [1]. Applying an SSL model to different downstream
tasks can significantly lower the entry barrier for developing
a model compared to training the model from scratch. SSL is desirable for deep learning not only for its state-of-the-art (SOTA) performance, but also for its generalizability and reusability across different tasks in various application scenarios. Transferring from pre-trained models yields strong performance not only on many NLP tasks but on speech tasks as well.
Despite the huge success and popularity SSL has gained,
there are some drawbacks when utilizing SSL models. In the
presence of various downstream tasks, fine-tuning pre-trained
models for each downstream task is still parameter-inefficient
since massively self-supervised pre-trained models are noto-
riously deep, requiring millions or even billions of parameters.

Fig. 1. The trade-off between accuracy and the number of trained task-specific parameters, for several efficient tuning methods and fine-tuning. The x-axis represents the number of trainable parameters of the upstream model, while the y-axis represents accuracy on the speaker identification (SID) task. The red point is fine-tuning (FT), and the blue points are the efficient methods.

Due to this reason, adapting the SSL speech model by
fine-tuning requires large storage space. For example, Hu-
BERT X-Large [2] contains 964M parameters. As a result, a complete set of tuned parameters must be stored for every downstream task. Furthermore, overwriting
the pre-trained model parameters may not be the best way of
utilizing the pre-trained knowledge from the SSL model.
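As a rough back-of-the-envelope illustration of this storage cost (assuming 32-bit floating-point weights and an arbitrary count of ten downstream tasks, both assumptions ours and purely illustrative):

# Rough storage estimate for one fully fine-tuned copy of HuBERT X-Large
# (964M parameters) per downstream task; fp32 weights and the task count
# are illustrative assumptions, not figures from the paper.
params = 964_000_000
bytes_per_param = 4          # 32-bit floats
num_tasks = 10               # hypothetical number of downstream tasks

per_task_gb = params * bytes_per_param / 1e9
print(f"~{per_task_gb:.1f} GB per task, ~{per_task_gb * num_tasks:.0f} GB for {num_tasks} tasks")
# -> ~3.9 GB per task, ~39 GB for 10 tasks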
To overcome these shortcomings, researchers instead utilize the SSL speech model by using only its frozen representations [3]. In NLP, efficient tuning techniques have been
proposed for leveraging SSL models. One of the most pop-
ular efficient methods is adapters [4], which introduce extra
tunable weights and freeze the original parameters of the pre-
trained language model (PLM). Adapters have demonstrated performance comparable to fully fine-tuning the entire model while being parameter-efficient. More recently, the prompting technique has been shown to be surprisingly effective on PLMs [5]. Both methods show that “freezing” pre-trained
models is appealing, especially as model size continues to
increase. Rather than requiring a separate copy of the model
for each downstream task, a single generalized upstream
model can simultaneously transfer to many different tasks.
Adapters have been shown to work well for machine transla-
tion [6], cross-lingual transfer [7], as well as transfer learning
in automatic speech recognition (ASR) [8]. However, these
efficient tuning methods have not been systematically studied with SSL speech models.
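As a minimal sketch of this frozen-upstream setup (the keyword-based filter and the helper names here are illustrative assumptions, not the exact implementation used in this work), freezing the SSL backbone and training only the inserted modules could look like:

import torch.nn as nn

def freeze_upstream(model: nn.Module, trainable_keywords=("adapter",)):
    """Freeze every parameter of the upstream SSL model except those whose
    name contains one of the given keywords (e.g. inserted adapter weights)."""
    for name, param in model.named_parameters():
        param.requires_grad = any(k in name for k in trainable_keywords)
    return model

def count_parameters(model: nn.Module):
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return total, trainable

# Usage (upstream = some pre-trained SSL encoder with adapters inserted):
# upstream = freeze_upstream(upstream)
# total, trainable = count_parameters(upstream)
# print(f"training {trainable / total:.1%} of the parameters")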
In order to bring efficient tuning methods to the field of SSL speech representation, in this work, we explore the
effectiveness of efficient tuning methods for self-supervised
speech models on the SUPERB benchmark [3]. We apply dif-
ferent efficient tuning methods, including adapter tuning and
prompt tuning, on SSL speech models with different training
objectives. We propose an adapter framework for multiple downstream speech processing tasks, including recognition, classification, and speaker tasks. To investigate the effectiveness of these efficient methods, we conduct experiments on three SSL models with different training objectives: HuBERT, Wav2vec2 [9], and DeCoAR2 [10]. The main concept of our work is shown in Fig. 1. To the best of our
knowledge, this is the first comprehensive investigation of
various efficient tuning methods on different speech tasks. We show that performance parity can be achieved with over 90% parameter reduction. Furthermore, we show the pros and cons of various efficient tuning techniques, e.g., the Houlsby adapter [4] is the most efficient in the trade-off between performance and the number of parameters, and the weighted sum is a very suitable efficient method for SSL speech tasks.
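The weighted sum mentioned here refers to combining the frozen upstream model's layer-wise hidden states with learned scalar weights, as done in SUPERB-style downstream heads; a minimal sketch (the layer count and tensor shapes are illustrative) is:

import torch
import torch.nn as nn

class WeightedSum(nn.Module):
    """Combine the hidden states of all upstream layers with learned,
    softmax-normalized scalar weights; only these scalars are trained."""
    def __init__(self, num_layers: int = 13):       # e.g. CNN output + 12 transformer layers
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(num_layers))

    def forward(self, hidden_states):                # list of (batch, time, dim) tensors
        w = torch.softmax(self.weights, dim=0)
        stacked = torch.stack(hidden_states, dim=0)           # (layers, batch, time, dim)
        return (w.view(-1, 1, 1, 1) * stacked).sum(dim=0)     # (batch, time, dim)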
2. RELATED WORKS
2.1. Adapter Approach
For NLP tasks, adapters are introduced for the transformer architecture. An adapter typically comes with a two-layer feed-forward bottleneck architecture [4]. It was found that adapters approach the performance of full fine-tuning with only a fraction of the parameters in NLP tasks using a PLM.
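A minimal PyTorch sketch of such a bottleneck adapter (the hidden size and bottleneck dimension below are illustrative choices, not the exact configuration of [4]):

import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Two-layer feed-forward bottleneck with a residual connection,
    in the spirit of the Houlsby adapter [4]."""
    def __init__(self, hidden_size: int = 768, bottleneck: int = 32):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)   # down-projection
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, hidden_size)     # up-projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))       # residual around the bottleneck

# Roughly 2 * 768 * 32 ≈ 49k tunable parameters per adapter, versus on the
# order of 7M parameters for a full 768-dimensional transformer layer.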
Inspired by the success of prompting methods that control PLMs through textual prompts [11], prefix tuning [5] and neural reprogramming [12] prepend additional tunable prefix tokens to the hidden layers and optimize only these soft prompts during fine-tuning. More recently, LoRA [13] learns low-rank matrices to approximate the parameter updates. AdapterBias [14] adds a token-dependent parameter shift to transfer from a PLM in a more parameter-efficient manner. Beyond its parameter efficiency, adapter tuning has also been shown to be more robust due to its ability to preserve the pre-trained knowledge [15], and it often exhibits robustness in out-of-distribution evaluation [5].
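For concreteness, here is a minimal sketch of a LoRA-style low-rank update wrapped around a frozen linear layer (the rank, scaling, and initialization are illustrative choices in the spirit of [13], not a definitive implementation):

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update:
    y = W x + (alpha / r) * B A x, following the general recipe of LoRA [13]."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():              # freeze the pre-trained weights
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero-init: training starts at W
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())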
In the field of speech processing tasks, adapters have also
been utilized for efficient SSL tuning. The use of adapters on Wav2vec2 for efficient ASR tuning has been proposed [8].

Fig. 2. Illustration of the transformer architecture and parameter-efficient tuning methods. The blocks with dashed borderlines are the parameters added by the efficient method. Wq, Wk, and Wv represent the weights of the query, key, and value, respectively.

Moreover, the work in [16] proposes residual adapters (RAs)
which are inserted in the pre-trained model to learn domain-
related information with the same SSL loss as the pretraining
stage. Adapters have also been employed for efficient SSL
speech pre-training of new tasks in a continual learning set-
ting [17]. As for prompting, it has been applied to speech tasks [18] with a prompt tuning paradigm for the Generative Spoken Language Model [19].
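As a minimal sketch of the prompting idea in this setting (prepending a few trainable vectors to the frozen model's input sequence; the prompt length and shapes are illustrative and this is not the exact recipe of [18]):

import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Prepend trainable prompt vectors to an input feature sequence;
    the backbone that consumes the sequence stays frozen."""
    def __init__(self, num_prompts: int = 10, dim: int = 768):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)

    def forward(self, features: torch.Tensor) -> torch.Tensor:  # (batch, time, dim)
        batch = features.size(0)
        p = self.prompts.unsqueeze(0).expand(batch, -1, -1)     # (batch, num_prompts, dim)
        return torch.cat([p, features], dim=1)                  # prepend along the time axis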
However, the above works either apply adapters to a single SSL speech model for a specific task, or they do not examine the different efficient tuning methods on different downstream tasks in a comprehensive way. This leaves open the question of whether the efficient tuning methods from NLP will yield the same effectiveness when utilized in speech processing tasks.
We hypothesize that we will see the same benefits of adapters
in a speech model as in an NLP model, namely parameter-efficient transfer of the pre-trained network to different downstream tasks with little performance degradation.
2.2. The SUPERB benchmark
As more powerful SSL models are proposed with increasingly promising performance on various tasks, researchers continually try to find extensive evaluation methods to assess model performance, in the hope of understanding the capability of the representations learned by these models. SUPERB [3] is a benchmark for evaluating SSL speech representations across a wide range of downstream speech processing tasks.