Evaluating Parameter Efficient Learning for Generation
Peng Xu§, Mostofa Patwary§, Shrimai Prabhumoye§, Virginia Adams§, Ryan J. Prenger§,
Wei Ping§, Nayeon Lee‡, Mohammad Shoeybi§, Bryan Catanzaro§
‡The Hong Kong University of Science and Technology, §NVIDIA
pengx@nvidia.com
Abstract
Parameter efficient learning methods (PERMs) have recently gained significant attention as they provide an efficient way for pre-trained language models (PLMs) to adapt to a downstream task. However, conclusions about their effectiveness are mostly drawn from in-domain evaluations over the full training set. In this paper, we present comparisons between PERMs and finetuning from three new perspectives: (1) the effect of sample and model size on in-domain evaluations, (2) generalization to unseen domains and new datasets, and (3) the faithfulness of generations. Our results show that for in-domain settings (a) there is a cross point of sample size below which PERMs perform better than finetuning, and (b) larger PLMs have larger cross points. For cross-domain and cross-dataset cases, we show that (a) Adapter (Houlsby et al., 2019) performs best among all the PERMs studied here, and (b) it outperforms finetuning if the task dataset is below a certain size. We also compare the faithfulness of generations and show that PERMs can achieve better faithfulness scores than finetuning, by as much as 6%, especially for small training sets. Finally, we apply Adapter to MT-NLG 530b (Smith et al., 2022) and achieve new state-of-the-art results on Xsum (Narayan et al., 2018) for all ROUGE scores (ROUGE-1 49.17, ROUGE-2 27.20, ROUGE-L 40.98).
1 Introduction
Parameter efficient learning methods (PERMs) serve as potential alternatives to finetuning for adapting and deploying language models in real-world scenarios (Ding et al., 2022). They allow users to finetune only a small number of parameters while freezing the rest of the shared parameters of pre-trained language models (PLMs). This is especially important for large language models (e.g., GPT-3 (Brown et al., 2020) and MT-NLG (Smith et al., 2022)), as finetuning the entire model is very expensive or infeasible due to their size.
Prefix tuning (Li and Liang, 2021), one of the PERMs, draws inspiration from prompting: it introduces a small set of continuous vectors as virtual prompts that subsequent tokens can attend to, and obtains performance comparable to finetuning in the full-data setting. Prompt tuning (Lester et al., 2021) shows the power of scaling PLMs: tuning only a few extra embeddings is sufficient to achieve performance similar to finetuning the entire 11b T5-XXL (Raffel et al., 2020) model. P-tuning v2 (Liu et al., 2022a) further demonstrates that small PLMs can also achieve results comparable to finetuning with Prefix tuning. Rather than adding new parameters through prompts, Adapter (Houlsby et al., 2019) injects trainable parameters through a low-rank bottleneck structure added as a skip connection. Other PERMs include LoRA (Hu et al., 2021), Mix-And-Match adapter (He et al., 2021a), Compacter (Karimi Mahabadi et al., 2021), BitFit (Zaken et al., 2022), and diff-pruning (Guo et al., 2021).
Most conclusions about PERMs so far are drawn from in-domain evaluations over full training samples. To the best of our knowledge, it has not yet been investigated (1) how these conclusions apply to different training sizes and model sizes, and (2) how PERMs generalize to unseen domains and new datasets, both of which are important aspects for deploying PERMs in real-world applications.
In addition, faithfulness in natural language generation has become an important topic, as it is vital to real-world applications. Various efforts have been made to systematically measure and mitigate factual errors in many generation tasks, including summarization (Huang et al., 2021) and dialogue generation (Rashkin et al., 2021; Shuster et al., 2021; Dziri et al., 2021; Wu et al., 2021). However, existing work on faithfulness focuses only on the faithfulness of finetuning; the impact of PERMs on the faithfulness of generation has not yet been explored.
In this paper, we provide an in-depth study of PERMs for generation tasks along three aspects that matter when deploying PERMs in practical applications: (1) in-domain evaluation while scaling both the training dataset size and the model size of PLMs, (2) cross-domain and cross-dataset generalization, and (3) faithfulness assessment. Two generation tasks are used for evaluation: summarization and dialogue generation. We study four representative methods: P-tuning, Prompt tuning, Prefix tuning, and Adapter, but mainly focus on Prefix tuning and Adapter, as our preliminary results show that they outperform the others. Our contributions are summarized as follows: (1) To the best of our knowledge, we present the first comparison of faithfulness for PERMs. Our experimental results show that PERMs, especially Prefix tuning, can achieve better faithfulness than finetuning by up to 6%. (2) For in-domain settings, there is always a cross point of sample size below which PERMs outperform finetuning. Larger PLMs have larger cross points. Users should choose a method based on their own training sample size and model size. (3) Not all PERMs can easily achieve better cross-domain and cross-dataset scores than finetuning, even with an 8.3b PLM. Our results show that Adapter is better than Prefix tuning in 13 out of 15 comparison settings. (4) New state-of-the-art results on Xsum (Narayan et al., 2018) are obtained by applying Adapter to the MT-NLG 530b model.
2 Methodology
We compare the following four PERMs to finetuning (FT) using GPT-style models from Megatron-LM (Shoeybi et al., 2019).
(1) Adapter (AP) adds an extra layer with a bottleneck structure: it first projects the input $h$ to a low dimension using trainable weights $W_{\text{down}}$ and then projects back up to the original dimension using trainable weights $W_{\text{up}}$. It is incorporated into the backbone model via a skip connection:

$\mathrm{Adapter}(h) = h + g(hW_{\text{down}})W_{\text{up}},$

where $g$ is the activation function. In our case, we insert an Adapter layer both after the multi-head attention (MHA) and after the feedforward layer (FFD) of the Transformer (Vaswani et al., 2017).
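For concreteness, a minimal sketch of this computation in PyTorch follows (the hidden size, bottleneck dimension, and GELU activation are illustrative assumptions, not the exact configuration used in our experiments):

import torch
import torch.nn as nn

class Adapter(nn.Module):
    # Bottleneck adapter: h + g(h W_down) W_up, added as a skip connection.
    def __init__(self, hidden_size: int, bottleneck_size: int):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck_size)  # W_down
        self.up = nn.Linear(bottleneck_size, hidden_size)    # W_up
        self.act = nn.GELU()                                 # g

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h + self.up(self.act(self.down(h)))

# An Adapter layer is inserted after MHA and after the FFD of each block;
# only the adapter parameters are trained while the backbone stays frozen.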
(2) Prefix tuning (PF) adds trainable prefix tokens at the beginning of each transformer block. We follow the implementation of Li and Liang (2021) and replace the keys $K$ and values $V$ of MHA with the concatenation of the trainable prefix weights $W_K$, $W_V$ and the original $K$, $V$:

$K \leftarrow \mathrm{concat}([W_K; K]), \quad V \leftarrow \mathrm{concat}([W_V; V]).$

We also use the reparameterization trick suggested by Li and Liang (2021).
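The following sketch illustrates the key/value concatenation in PyTorch (the tensor layout and the omission of the reparameterization MLP are simplifying assumptions):

import torch
import torch.nn as nn

class PrefixKV(nn.Module):
    # Trainable prefix keys/values prepended at every attention layer.
    def __init__(self, prefix_len: int, num_heads: int, head_dim: int):
        super().__init__()
        self.w_k = nn.Parameter(torch.randn(prefix_len, num_heads, head_dim))  # W_K
        self.w_v = nn.Parameter(torch.randn(prefix_len, num_heads, head_dim))  # W_V

    def forward(self, k: torch.Tensor, v: torch.Tensor):
        # k, v: [batch, seq_len, num_heads, head_dim]
        batch = k.size(0)
        prefix_k = self.w_k.unsqueeze(0).expand(batch, -1, -1, -1)
        prefix_v = self.w_v.unsqueeze(0).expand(batch, -1, -1, -1)
        # K <- concat([W_K; K]),  V <- concat([W_V; V])
        return torch.cat([prefix_k, k], dim=1), torch.cat([prefix_v, v], dim=1)

Only the prefix weights (and the reparameterization network during training) receive gradients; the attention projections of the PLM remain frozen.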
(3) Prompt tuning (PT) adds extra parameters to the embedding layer and uses these trainable embeddings to prompt the input. (4) P-tuning (Liu et al., 2021b) adds a prompt encoder to encode pseudo prompts, and the encoded representation is used to prompt the input.
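To illustrate the difference between the two prompt-based methods, the sketch below (PyTorch; prompt length and initialization scale are arbitrary assumptions) prepends trainable embeddings to the token embeddings, which corresponds to Prompt tuning; P-tuning instead generates the prompt vectors with a small encoder:

import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    # Prepends n_prompt trainable embeddings to the input embeddings (Prompt tuning).
    def __init__(self, n_prompt: int, hidden_size: int):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(n_prompt, hidden_size) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: [batch, seq_len, hidden_size]
        batch = input_embeds.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, input_embeds], dim=1)

# For P-tuning, self.prompt would instead be produced by a prompt encoder
# (e.g., an LSTM or MLP over pseudo-prompt embeddings).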
3 Experimental Setup
3.1 Datasets
Summarization We use Xsum (Narayan et al., 2018), a widely used summarization dataset, to train and evaluate the different methods. It consists of 204,017/11,327/11,333 pairs for training/validation/test. As Xsum does not divide the dataset by topic, we follow Li and Liang (2021) and split the Xsum dataset into news articles for training and sports articles for testing. This cross-domain version has 149,115/8,263/2,823 pairs for training/validation/test. For the cross-dataset evaluation, we choose the test set from CNN/Daily Mail (Nallapati et al., 2016), which contains 11,490 samples.
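For reference, the in-domain Xsum data can be loaded as follows (a sketch assuming the Hugging Face datasets package and its xsum dataset card; the news/sports topic split of Li and Liang (2021) requires additional filtering that is not shown):

from datasets import load_dataset

xsum = load_dataset("xsum")
# Expected sizes: 204,017 / 11,327 / 11,333 examples for train/validation/test.
print({split: len(xsum[split]) for split in ("train", "validation", "test")})
# Each example has a "document" (the article) and a "summary" (a one-sentence summary).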
Dialogue We use the Wizard of Wikipedia (WoW) (Dinan et al., 2018) dataset for our dialogue generation task. Modeling the wizard response is usually composed of two steps: knowledge retrieval and response generation. To simplify the problem, following Rashkin et al. (2021), we skip the knowledge retrieval step and use the golden knowledge for response generation. The response of the wizard is then used to train the model. For the cross-dataset evaluation, we use the CMU_DoG (Zhou et al., 2018) dataset. We test our model on all test-set dialogue turns except the starting one.
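A hypothetical input-formatting sketch for this setup follows (the template, separators, and speaker labels are illustrative only; the exact prompt format is not specified in this section):

def build_dialogue_input(golden_knowledge: str, history: list) -> str:
    # Concatenate golden knowledge and dialogue history; the model is
    # trained to generate the wizard's next response as the continuation.
    turns = "\n".join(history)
    return f"Knowledge: {golden_knowledge}\nDialogue:\n{turns}\nWizard:"

example = build_dialogue_input(
    golden_knowledge="Blue is one of the three primary colours of pigments.",
    history=["Apprentice: Blue is my favourite colour."],
)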
3.2 Metrics
Quality Metrics We use ROUGE-1 (R-1), ROUGE-2 (R-2), and ROUGE-L (R-L) (Lin, 2004) scores to evaluate the generations for the summarization task, as they are well adopted across summarization work.