Evaluating Parameter Efficient Learning for Generation
Peng Xu§, Mostofa Patwary§, Shrimai Prabhumoye§, Virginia Adams§, Ryan J. Prenger§,
Wei Ping§, Nayeon Lee‡, Mohammad Shoeybi§, Bryan Catanzaro§
‡The Hong Kong University of Science and Technology, §NVIDIA
pengx@nvidia.com
Abstract
Parameter efficient learning methods (PERMs) have recently gained significant attention as they provide an efficient way for pre-trained language models (PLMs) to adapt to a downstream task. However, conclusions about their effectiveness are mostly drawn from in-domain evaluations over the full training set. In this paper, we present comparisons between PERMs and finetuning from three new perspectives: (1) the effect of sample and model size on in-domain evaluations, (2) generalization to unseen domains and new datasets, and (3) the faithfulness of generations. Our results show that for in-domain settings (a) there is a cross point of sample size below which PERMs perform better than finetuning, and (b) larger PLMs have larger cross points. For cross-domain and cross-dataset cases, we show that (a) Adapter (Houlsby et al., 2019) performs best among all the PERMs studied here, and (b) it outperforms finetuning if the task dataset is below a certain size. We also compare the faithfulness of generations and show that PERMs can achieve better faithfulness scores than finetuning, by as much as 6%, especially for small training sets. Finally, we apply Adapter to MT-NLG 530b (Smith et al., 2022) and achieve new state-of-the-art results on Xsum (Narayan et al., 2018) for all ROUGE scores (ROUGE-1 49.17, ROUGE-2 27.20, ROUGE-L 40.98).
1 Introduction
Parameter efficient learning methods (PERMs) serve as potential alternatives to finetuning for adapting and deploying language models in real-world scenarios (Ding et al., 2022). They allow users to finetune only a small number of parameters while freezing the rest of the shared parameters of pre-trained language models (PLMs). This is especially important for large language models (e.g., GPT-3 (Brown et al., 2020) and MT-NLG (Smith et al., 2022)), as finetuning the entire model is very expensive or infeasible due to their size.
Prefix tuning (Li and Liang, 2021), one of the PERMs, draws inspiration from prompting: it introduces a small set of continuous vectors as virtual prompts that subsequent tokens can attend to, and obtains performance comparable to finetuning in the full-data setting. Prompt tuning (Lester et al., 2021) shows the power of scaling PLMs: tuning only a few extra embeddings is sufficient to achieve performance similar to finetuning the entire 11b T5-XXL (Raffel et al., 2020) model. P-tuning v2 (Liu et al., 2022a) further demonstrates that small PLMs can also achieve results comparable to finetuning with Prefix tuning. Rather than adding new parameters through prompts, Adapter (Houlsby et al., 2019) injects trainable parameters through a low-rank bottleneck structure added as a skip connection. Other PERMs include LoRA (Hu et al., 2021), Mix-And-Match adapter (He et al., 2021a), Compacter (Karimi Mahabadi et al., 2021), BitFit (Zaken et al., 2022), and diff-pruning (Guo et al., 2021).
Most conclusions about PERMs so far are drawn from in-domain evaluations over full training samples. To the best of our knowledge, it has not yet been investigated (1) how these conclusions apply to different training sizes and model sizes, and (2) how PERMs generalize to unseen domains and new datasets, both of which are important aspects for deploying PERMs in real-world applications.
In addition, faithfulness in natural language generation has become an important topic, as it is vital to real-world applications. Various efforts have been made to systematically measure and mitigate factual errors in many generation tasks, including summarization (Huang et al., 2021) and dialogue generation (Rashkin et al., 2021; Shuster et al., 2021; Dziri et al., 2021; Wu et al., 2021). However, existing work on faithfulness focuses only on the faithfulness of finetuning; the impact of PERMs on the faithfulness of generation has not yet been explored.
In this paper, we provide an in-depth study of PERMs for generation tasks along three aspects that matter when deploying PERMs in practical applications: (1) in-domain evaluation while scaling both the training dataset size and the model size of PLMs, (2) cross-domain and cross-dataset generalization, and (3) faithfulness assessment. Two generation tasks are used for evaluation: summarization and dialogue generation. We study four representative methods: P-tuning, Prompt tuning, Prefix tuning, and Adapter, but mainly focus on Prefix tuning and Adapter, as our preliminary results show that they outperform the others. Our contributions are summarized as follows: (1) To the best of our knowledge, we present the first comparison of faithfulness for PERMs. Our experimental results show that PERMs, especially Prefix tuning, can achieve better faithfulness than finetuning by up to 6%. (2) For in-domain settings, there is always a cross point of sample size below which PERMs outperform finetuning. Larger PLMs have larger cross points. Users should choose a method based on their own training sample size and model size. (3) Not all PERMs can easily achieve better cross-domain and cross-dataset scores than finetuning, even with an 8.3b PLM. Our results show that Adapter is better than Prefix tuning in 13 out of 15 comparison settings. (4) New state-of-the-art results on Xsum (Narayan et al., 2018) are obtained by applying Adapter to the MT-NLG 530b model.
2 Methodology
We compare the following four PERMs to finetuning (FT) using GPT-style models from Megatron-LM (Shoeybi et al., 2019).
(1) Adapter (AP) adds an extra layer with a bottleneck structure: it first projects the input $h$ to a low dimension using trainable weights $W_{\text{down}}$ and then projects back up to the original dimension using trainable weights $W_{\text{up}}$. It is incorporated into the backbone model via a skip connection:

$\mathrm{Adapter}(h) = h + g(hW_{\text{down}})W_{\text{up}},$

where $g$ is the activation function. In our case, we insert an Adapter layer both after the multi-head attention (MHA) and after the feedforward layer (FFD) of the Transformer (Vaswani et al., 2017).
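For concreteness, a minimal sketch of this computation in PyTorch follows (the hidden size, bottleneck dimension, and GELU activation are illustrative assumptions, not the exact configuration used in our experiments):

import torch
import torch.nn as nn

class Adapter(nn.Module):
    # Bottleneck adapter: h + g(h W_down) W_up, added as a skip connection.
    def __init__(self, hidden_size: int, bottleneck_size: int):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck_size)  # W_down
        self.up = nn.Linear(bottleneck_size, hidden_size)    # W_up
        self.act = nn.GELU()                                 # g

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h + self.up(self.act(self.down(h)))

# An Adapter layer is inserted after MHA and after the FFD of each block;
# only the adapter parameters are trained while the backbone stays frozen.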
(2) Prefix tuning (PF) adds trainable prefix tokens at the beginning of each transformer block. We follow the implementation of Li and Liang (2021) and replace the keys $K$ and values $V$ of MHA with the concatenation of the trainable prefix weights $W_K$, $W_V$ and the original $K$, $V$:

$K \leftarrow \mathrm{concat}([W_K; K]), \quad V \leftarrow \mathrm{concat}([W_V; V]).$

We also use the reparameterization trick suggested by Li and Liang (2021).
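The following sketch illustrates the key/value concatenation in PyTorch (the tensor layout and the omission of the reparameterization MLP are simplifying assumptions):

import torch
import torch.nn as nn

class PrefixKV(nn.Module):
    # Trainable prefix keys/values prepended at every attention layer.
    def __init__(self, prefix_len: int, num_heads: int, head_dim: int):
        super().__init__()
        self.w_k = nn.Parameter(torch.randn(prefix_len, num_heads, head_dim))  # W_K
        self.w_v = nn.Parameter(torch.randn(prefix_len, num_heads, head_dim))  # W_V

    def forward(self, k: torch.Tensor, v: torch.Tensor):
        # k, v: [batch, seq_len, num_heads, head_dim]
        batch = k.size(0)
        prefix_k = self.w_k.unsqueeze(0).expand(batch, -1, -1, -1)
        prefix_v = self.w_v.unsqueeze(0).expand(batch, -1, -1, -1)
        # K <- concat([W_K; K]),  V <- concat([W_V; V])
        return torch.cat([prefix_k, k], dim=1), torch.cat([prefix_v, v], dim=1)

Only the prefix weights (and the reparameterization network during training) receive gradients; the attention projections of the PLM remain frozen.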
(3) Prompt tuning (PT) adds extra parameters to the embedding layer and uses these trainable embeddings to prompt the input. (4) P-tuning (Liu et al., 2021b) adds a prompt encoder to encode pseudo prompts, and the encoded representation is used to prompt the input.
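To illustrate the difference between the two prompt-based methods, the sketch below (PyTorch; prompt length and initialization scale are arbitrary assumptions) prepends trainable embeddings to the token embeddings, which corresponds to Prompt tuning; P-tuning instead generates the prompt vectors with a small encoder:

import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    # Prepends n_prompt trainable embeddings to the input embeddings (Prompt tuning).
    def __init__(self, n_prompt: int, hidden_size: int):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(n_prompt, hidden_size) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: [batch, seq_len, hidden_size]
        batch = input_embeds.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, input_embeds], dim=1)

# For P-tuning, self.prompt would instead be produced by a prompt encoder
# (e.g., an LSTM or MLP over pseudo-prompt embeddings).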
3 Experimental Setup
3.1 Datasets
Summarization We use Xsum (Narayan et al., 2018), a widely used summarization dataset, to train and evaluate the different methods. It consists of 204,017/11,327/11,333 pairs for training/validation/test. As Xsum does not divide the dataset by topic, we follow Li and Liang (2021) and split the Xsum dataset into news articles for training and sports articles for testing. This cross-domain version has 149,115/8,263/2,823 pairs for training/validation/test. For the cross-dataset evaluation, we choose the test set from CNN/Daily Mail (Nallapati et al., 2016), which contains 11,490 samples.
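For reference, the in-domain Xsum data can be loaded as follows (a sketch assuming the Hugging Face datasets package and its xsum dataset card; the news/sports topic split of Li and Liang (2021) requires additional filtering that is not shown):

from datasets import load_dataset

xsum = load_dataset("xsum")
# Expected sizes: 204,017 / 11,327 / 11,333 examples for train/validation/test.
print({split: len(xsum[split]) for split in ("train", "validation", "test")})
# Each example has a "document" (the article) and a "summary" (a one-sentence summary).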
Dialogue We use the Wizard of Wikipedia (WoW) (Dinan et al., 2018) dataset for our dialogue generation task. Modeling the wizard response is usually composed of two steps: knowledge retrieval and response generation. To simplify the problem, following Rashkin et al. (2021), we skip the knowledge retrieval step and use the golden knowledge for response generation. The response of the wizard is then used to train the model. For the cross-dataset evaluation, we use the CMU_DoG (Zhou et al., 2018) dataset. We test our model on all test-set dialogue turns except the starting one.
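A hypothetical input-formatting sketch for this setup follows (the template, separators, and speaker labels are illustrative only; the exact prompt format is not specified in this section):

def build_dialogue_input(golden_knowledge: str, history: list) -> str:
    # Concatenate golden knowledge and dialogue history; the model is
    # trained to generate the wizard's next response as the continuation.
    turns = "\n".join(history)
    return f"Knowledge: {golden_knowledge}\nDialogue:\n{turns}\nWizard:"

example = build_dialogue_input(
    golden_knowledge="Blue is one of the three primary colours of pigments.",
    history=["Apprentice: Blue is my favourite colour."],
)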
3.2 Metrics
Quality Metrics We use ROUGE-1 (R-1), ROUGE-2 (R-2), and ROUGE-L (R-L) (Lin, 2004) scores to evaluate the generations for the summarization task, as they are well adopted across summarization work.