EXPLORING EFFICIENT-TUNING METHODS IN SELF-SUPERVISED SPEECH MODELS
Zih-Ching Chen*1, Chin-Lun Fu*1, Chih-Ying Liu1, Shang-Wen (Daniel) Li2, Hung-yi Lee1
1National Taiwan University
2Amazon AI
ABSTRACT
In this study, we aim to explore efficient tuning methods
for speech self-supervised learning. Recent studies show
that self-supervised learning (SSL) can learn powerful rep-
resentations for different speech tasks. However, fine-tuning
pre-trained models for each downstream task is parameter-
inefficient since SSL models are notoriously large with mil-
lions of parameters. Adapters are lightweight modules com-
monly used in NLP to solve this problem. In downstream
tasks, the parameters of SSL models are frozen, and only
the adapters are trained. Given the lack of studies generally
exploring the effectiveness of adapters for self-supervised
speech tasks, we intend to fill this gap by adding various
adapter modules in pre-trained speech SSL models. We show
that performance parity can be achieved with over 90% parameter reduction, and we discuss the pros and cons of efficient tuning techniques. This is the first comprehensive
investigation of various adapter types across speech tasks.
1. INTRODUCTION
Recently, self-supervised learning (SSL) has gained popular-
ity in the fields of computer vision (CV) and natural language processing (NLP), as well as in speech tasks. SSL pre-trains
a shared representation model on a huge amount of unlabeled
data. The pre-trained SSL model can be used for various
downstream tasks with minimal adaptation via either fine-
tuning or utilizing the learned representation from the frozen
model [1]. Applying an SSL model to different downstream
tasks can significantly lower the entry barrier for developing
a model compared to training the model from scratch. SSL is desirable for deep learning not only for its state-of-the-art (SOTA) performance, but also for its generalizability and reusability across different tasks in various application scenarios. Transferring from pre-trained models yields strong performance not only on many NLP tasks but on speech tasks as well.
Despite the huge success and popularity SSL has gained,
there are some drawbacks when utilizing SSL models. In the
presence of various downstream tasks, fine-tuning pre-trained
models for each downstream task is still parameter-inefficient
since massively self-supervised pre-trained models are noto-
riously deep, requiring millions or even billions of parameters.

Fig. 1. The trade-off between accuracy and the number of trained task-specific parameters, for several efficient tuning methods and fine-tuning. The x-axis represents the number of trainable parameters of the upstream model, while the y-axis represents accuracy on the speaker identification (SID) task. The red point is fine-tuning (FT), and the blue points are the efficient methods.

Due to this reason, adapting the SSL speech model by
fine-tuning requires large storage space. For example, Hu-
BERT X-Large [2] contains 964M parameters. As a result, a complete set of tuned parameters must be stored for every downstream task. Furthermore, overwriting
the pre-trained model parameters may not be the best way of
utilizing the pre-trained knowledge from the SSL model.
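As a rough back-of-the-envelope illustration of this storage cost (assuming 32-bit floating-point weights and an arbitrary count of ten downstream tasks, both assumptions ours and purely illustrative):

# Rough storage estimate for one fully fine-tuned copy of HuBERT X-Large
# (964M parameters) per downstream task; fp32 weights and the task count
# are illustrative assumptions, not figures from the paper.
params = 964_000_000
bytes_per_param = 4          # 32-bit floats
num_tasks = 10               # hypothetical number of downstream tasks

per_task_gb = params * bytes_per_param / 1e9
print(f"~{per_task_gb:.1f} GB per task, ~{per_task_gb * num_tasks:.0f} GB for {num_tasks} tasks")
# -> ~3.9 GB per task, ~39 GB for 10 tasks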
To overcome these shortcomings, researchers instead utilize the SSL speech model by using only its frozen representations [3]. In NLP, efficient tuning techniques have been
proposed for leveraging SSL models. One of the most pop-
ular efficient methods is adapters [4], which introduce extra
tunable weights and freeze the original parameters of the pre-
trained language model (PLM). Adapters have demonstrated performance comparable to fully fine-tuning the entire model while being parameter-efficient. More recently, the prompting technique has been shown to be surprisingly effective on PLMs [5]. Both methods show that “freezing” pre-trained
models is appealing, especially as model size continues to
increase. Rather than requiring a separate copy of the model
for each downstream task, a single generalized upstream
model can simultaneously transfer to many different tasks.
Adapters have been shown to work well for machine transla-
tion [6], cross-lingual transfer [7], as well as transfer learning
in automatic speech recognition (ASR) [8]. However, these
efficient tuning methods have not been systematically studied with SSL speech models.
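As a minimal sketch of this frozen-upstream setup (the keyword-based filter and the helper names here are illustrative assumptions, not the exact implementation used in this work), freezing the SSL backbone and training only the inserted modules could look like:

import torch.nn as nn

def freeze_upstream(model: nn.Module, trainable_keywords=("adapter",)):
    """Freeze every parameter of the upstream SSL model except those whose
    name contains one of the given keywords (e.g. inserted adapter weights)."""
    for name, param in model.named_parameters():
        param.requires_grad = any(k in name for k in trainable_keywords)
    return model

def count_parameters(model: nn.Module):
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return total, trainable

# Usage (upstream = some pre-trained SSL encoder with adapters inserted):
# upstream = freeze_upstream(upstream)
# total, trainable = count_parameters(upstream)
# print(f"training {trainable / total:.1%} of the parameters")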
In order to bring efficient tuning methods to the field of SSL speech representation, in this work, we explore the
effectiveness of efficient tuning methods for self-supervised
speech models on the SUPERB benchmark [3]. We apply dif-
ferent efficient tuning methods, including adapter tuning and
prompt tuning, on SSL speech models with different training
objectives. We propose an adapter framework for multiple downstream speech processing tasks, including recognition, classification, and speaker tasks. To investigate the effectiveness of these efficient methods, we conduct experiments on three SSL models with different training objectives: HuBERT, Wav2vec2 [9], and DeCoAR2 [10]. The main concept of our work is shown in Fig. 1. To the best of our
knowledge, this is the first comprehensive investigation of
various efficient tuning methods on different speech tasks. We show that performance parity can be achieved with over 90% parameter reduction. Furthermore, we show the pros and cons of various efficient tuning techniques, e.g., the Houlsby adapter [4] is the most efficient in the trade-off between performance and the number of parameters, and the weighted sum is a very suitable efficient method for SSL speech tasks.
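The weighted sum mentioned here refers to combining the frozen upstream model's layer-wise hidden states with learned scalar weights, as done in SUPERB-style downstream heads; a minimal sketch (the layer count and tensor shapes are illustrative) is:

import torch
import torch.nn as nn

class WeightedSum(nn.Module):
    """Combine the hidden states of all upstream layers with learned,
    softmax-normalized scalar weights; only these scalars are trained."""
    def __init__(self, num_layers: int = 13):       # e.g. CNN output + 12 transformer layers
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(num_layers))

    def forward(self, hidden_states):                # list of (batch, time, dim) tensors
        w = torch.softmax(self.weights, dim=0)
        stacked = torch.stack(hidden_states, dim=0)           # (layers, batch, time, dim)
        return (w.view(-1, 1, 1, 1) * stacked).sum(dim=0)     # (batch, time, dim)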
2. RELATED WORKS
2.1. Adapter Approach
For NLP tasks, adapters are introduced for the transformer architecture. An adapter typically comes with a two-layer feed-forward bottleneck architecture [4]. It was found that adapters approach the performance of full fine-tuning with only a fraction of the parameters in NLP tasks using a PLM.
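A minimal PyTorch sketch of such a bottleneck adapter (the hidden size and bottleneck dimension below are illustrative choices, not the exact configuration of [4]):

import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Two-layer feed-forward bottleneck with a residual connection,
    in the spirit of the Houlsby adapter [4]."""
    def __init__(self, hidden_size: int = 768, bottleneck: int = 32):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)   # down-projection
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, hidden_size)     # up-projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))       # residual around the bottleneck

# Roughly 2 * 768 * 32 ≈ 49k tunable parameters per adapter, versus on the
# order of 7M parameters for a full 768-dimensional transformer layer.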
Inspired by the success of prompting methods that control PLMs through textual prompts [11], prefix tuning [5] and neural reprogramming [12] prepend additional tunable prefix tokens to the hidden layers and optimize only these soft prompts during fine-tuning. More recently, LoRA [13] learns low-rank matrices to approximate the parameter updates. AdapterBias [14] adds a token-dependent parameter shift to transfer from a PLM in a more parameter-efficient manner. Beyond its parameter efficiency, adapter tuning has also been shown to be more robust due to its ability to preserve the pre-trained knowledge [15], and it often exhibits robustness in out-of-distribution evaluation [5].
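For concreteness, here is a minimal sketch of a LoRA-style low-rank update wrapped around a frozen linear layer (the rank, scaling, and initialization are illustrative choices in the spirit of [13], not a definitive implementation):

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update:
    y = W x + (alpha / r) * B A x, following the general recipe of LoRA [13]."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():              # freeze the pre-trained weights
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero-init: training starts at W
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())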
In the field of speech processing tasks, adapters have also
been utilized for efficient SSL tuning. The use of adapters on Wav2vec2 for efficient ASR tuning has been proposed [8].

Fig. 2. Illustration of the transformer architecture and parameter-efficient tuning methods. The blocks with dashed borderlines are the parameters added by the efficient method. Wq, Wk, and Wv represent the weights of the query, key, and value, respectively.

Moreover, the work in [16] proposes residual adapters (RAs)
which are inserted in the pre-trained model to learn domain-
related information with the same SSL loss as the pretraining
stage. Adapters have also been employed for efficient SSL
speech pre-training of new tasks in a continual learning set-
ting [17]. As for prompting, it has been applied to speech tasks [18] with a prompt tuning paradigm for the Generative Spoken Language Model [19].
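As a minimal sketch of the prompting idea in this setting (prepending a few trainable vectors to the frozen model's input sequence; the prompt length and shapes are illustrative and this is not the exact recipe of [18]):

import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Prepend trainable prompt vectors to an input feature sequence;
    the backbone that consumes the sequence stays frozen."""
    def __init__(self, num_prompts: int = 10, dim: int = 768):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)

    def forward(self, features: torch.Tensor) -> torch.Tensor:  # (batch, time, dim)
        batch = features.size(0)
        p = self.prompts.unsqueeze(0).expand(batch, -1, -1)     # (batch, num_prompts, dim)
        return torch.cat([p, features], dim=1)                  # prepend along the time axis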
However, the above works either apply adapters to a single SSL speech model for a specific task, or they do not examine the different efficient tuning methods on different downstream tasks in a comprehensive way. This leaves open the question of whether the efficient tuning methods from NLP will yield the same effectiveness when utilized in speech processing tasks.
We hypothesize that we will see the same benefits of adapters
in a speech model as in an NLP model, namely parameter-efficient transfer of the pre-trained network to different downstream tasks with little performance degradation.
2.2. The SUPERB benchmark
As more powerful SSL models are proposed with increasingly promising performance on various tasks, researchers continually try to find extensive evaluation methods to assess model performance, in the hope of understanding the capability of the representations learned by these models. SUPERB [3] is a benchmark for evaluating SSL speech representations across a wide range of downstream speech processing tasks.