Efficiently Tuned Parameters are Task Embeddings
Wangchunshu Zhou1, Canwen Xu2, Julian McAuley2
1ETH Zurich 2University of California, San Diego
1wangchunshu.zhou@inf.ethz.ch, 2{cxu,jmcauley}@ucsd.edu
Abstract
Intermediate-task transfer can benefit a wide range of NLP tasks with properly selected source datasets. However, it is computationally infeasible to experiment with all intermediate-transfer combinations, making the choice of a useful source task a challenging problem. In this paper, we hypothesize that the task-specific parameters updated by parameter-efficient tuning methods are likely to encode task-specific information and can therefore be predictive of inter-task transferability. We thus propose to exploit these efficiently tuned parameters as off-the-shelf task embeddings for the efficient selection of source datasets for intermediate-task transfer. We experiment with 11 text classification tasks and 11 question answering tasks. Experimental results show that our approach consistently outperforms existing inter-task transferability prediction methods while being conceptually simple and computationally efficient. Our analysis also reveals that the ability of efficiently tuned parameters to predict transferability is disentangled from their in-task performance, which allows us to use parameters from early checkpoints as task embeddings to further improve efficiency.1
1 Introduction
The pretraining then fine-tuning paradigm (Peters et al., 2018; Devlin et al., 2019; Radford et al., 2018, 2019; Brown et al., 2020; Lewis et al., 2020; Raffel et al., 2019) has substantially improved the state of the art on a wide range of natural language processing (NLP) tasks. In this paradigm, we first pretrain a large language model on large-scale corpora in a general domain, and then fine-tune the pretrained model to be a task-specific model on the target dataset.
Equal contribution.
1 Code available at https://github.com/JetRunner/TuPaTE.
Figure 1: The workflow of using efficiently tuned parameters as task embeddings. The yellow boxes represent tunable parameters in Transformer layers.
In addition to directly transferring from a general pretrained language model, prior work (Phang et al., 2018) also shows that intermediate-task transfer, i.e., fine-tuning on intermediate source tasks before the target task, can further improve target task performance. However, the success of intermediate-task transfer heavily relies on the selection of a proper source dataset, while an inappropriate source dataset often leads to performance degradation compared to plain fine-tuning. Therefore, some recent works (Vu et al., 2020; Poth et al., 2021) investigate methods to efficiently predict inter-task transferability without actually trying out all intermediate-task combinations.
The current state of the art (Vu et al., 2020) on predicting inter-task transferability is built on Task2Vec (Achille et al., 2019), which considers the Fisher information matrix of a model fine-tuned on a task as the “task embedding”, and predicts inter-task transferability by computing the cosine similarity between the task embeddings of the source and target tasks. Despite performing well empirically, this approach requires fine-tuning the full model and (inefficiently) computing the Fisher matrix of the model. Moreover, the resulting task embeddings generally have a high dimensionality, similar to the size of the underlying model. Therefore, intermediate task selection, which requires storing task embeddings for each source/target task, can be space-consuming, especially when experimenting with large language models.
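To make this baseline concrete, the following is a minimal sketch of a diagonal-Fisher task embedding in PyTorch, assuming generic `model`, `dataloader`, and `loss_fn` placeholders. It approximates the Fisher diagonal with averaged squared gradients of the task loss and is meant only to illustrate the recipe, not to reproduce the exact implementations of Achille et al. (2019) or Vu et al. (2020).

```python
import torch
import torch.nn.functional as F

def diagonal_fisher_embedding(model, dataloader, loss_fn, device="cpu"):
    """Approximate the diagonal of the Fisher information matrix of a fine-tuned
    model by averaging squared gradients of the task loss over the dataset."""
    model.to(device).train()
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters() if p.requires_grad}
    num_batches = 0
    for inputs, labels in dataloader:
        model.zero_grad()
        loss = loss_fn(model(inputs.to(device)), labels.to(device))
        loss.backward()
        for n, p in model.named_parameters():
            if n in fisher and p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
        num_batches += 1
    # Flatten the per-parameter estimates into a single task-embedding vector.
    return torch.cat([(v / max(num_batches, 1)).flatten() for v in fisher.values()])

def transferability_score(source_emb, target_emb):
    """Higher cosine similarity between task embeddings is taken as a proxy
    for better intermediate-task transfer."""
    return F.cosine_similarity(source_emb.unsqueeze(0), target_emb.unsqueeze(0)).item()
```

Note that the embedding returned here has one entry per model parameter, which is exactly why Fisher-based task embeddings scale with the size of the backbone.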
In this work, we opt for parameter-efficient tuning approaches (Houlsby et al., 2019; Li and Liang, 2021; Guo et al., 2021; Hu et al., 2022; Zaken et al., 2022) for the efficient and accurate prediction of inter-task transferability. Our key insight is that the task-specific parameters updated by parameter-efficient tuning methods are likely to encode task-specific information at high density, since they are used as a query for retrieving task-related knowledge from a frozen pretrained language model. Therefore, we propose to directly use the task-specific parameters learned via parameter-efficient tuning on the source/target datasets as task embeddings, as shown in Figure 1. Compared to task embeddings obtained by calculating the Fisher matrix of the fine-tuned model (Achille et al., 2019; Vu et al., 2020), efficiently tuned parameters have a much lower dimensionality and do not suffer from noise from uninformative weights in the model parameters, thus leading to more accurate transferability prediction. Also, our method only requires running parameter-efficient tuning on the tasks and storing the task-specific parameters, making both computing and storing task embeddings more efficient. Moreover, with the development of open-source parameter-efficient tuning platforms such as AdapterHub (Pfeiffer et al., 2020), we can easily access off-the-shelf parameters for the source and target datasets downloaded from the model zoo and then compute the similarity between the downloaded parameters.
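A minimal sketch of the resulting pipeline is shown below, assuming the tuned parameters can be identified by name in a PyTorch model; the `tuned_name_filter` predicate and `rank_source_tasks` helper are illustrative names rather than the interface of our released code. Each task's efficiently tuned parameters are flattened into one vector, and candidate source tasks are ranked by cosine similarity to the target task's vector.

```python
import torch
import torch.nn.functional as F

def task_embedding(model, tuned_name_filter):
    """Flatten the efficiently tuned parameters of `model` into one vector.
    `tuned_name_filter` selects the parameters that were actually updated,
    e.g. lambda n: "prefix" in n (prompt tuning) or n.endswith(".bias") (bias tuning)."""
    tuned = [p.detach().flatten() for n, p in model.named_parameters() if tuned_name_filter(n)]
    return torch.cat(tuned)

def rank_source_tasks(target_emb, source_embs):
    """Rank candidate source tasks by the cosine similarity of their task
    embeddings to the target task embedding.
    `source_embs` maps task name -> embedding vector of the same shape."""
    scores = {
        name: F.cosine_similarity(target_emb.unsqueeze(0), emb.unsqueeze(0)).item()
        for name, emb in source_embs.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Because only the small set of tuned parameters is stored per task, these embeddings are far smaller than Fisher-based embeddings computed over the full backbone, which is what keeps both computing and storing one embedding per source/target task cheap.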
We empirically verify the effectiveness of our approach by experimenting with 11 text classification tasks and 11 question answering tasks, following Vu et al. (2020). Our results show that our approach consistently outperforms existing inter-task transferability prediction methods while being simpler and more efficient. In addition, we find that the ability of efficiently tuned parameters to predict transferability is not strongly correlated with their in-task performance. Therefore, task-specific parameters tuned for a relatively small number of steps are already highly predictive of inter-task transferability, allowing us to further improve the efficiency of intermediate task selection.
2 Related Work
Prior work (Phang et al., 2018) shows that positive transfer can be elicited by training a model on intermediate source tasks before fine-tuning on the target task. However, the choice of an appropriate source task is crucial for effective transfer. Phang et al. (2018) show that the size of the source dataset is a good prior for source task selection. Pruksachatkun et al. (2020) propose to use tasks requiring complex reasoning and inference as source tasks. Beyond these heuristics, a number of works also focus on systematically predicting intermediate-task transferability. Vu et al. (2020) propose to use TASK2VEC to construct task embeddings based on the input text or the Fisher information matrix of a fine-tuned model. Poth et al. (2021) further extend similar ideas to adapter-based transfer learning. More recently, Vu et al. (2021) explore prompt-based transfer and propose to use prompt similarity as a predictor of prompt transferability in order to select proper soft prompts for initialization. This can be viewed as a special case of our proposed method in which the parameter-efficient tuning method is restricted to vanilla prompt tuning (Lester et al., 2021) and the transfer method is restricted to prompt transfer instead of general intermediate-task transfer.
3 Methodology
3.1 Parameter-Efficient Tuning
Parameter-efficient tuning only updates a small portion of the parameters in a large pretrained model. In this paper, we experiment with three types of parameter-efficient tuning: Prompt Tuning (Liu et al., 2021), Bias Tuning (Zaken et al., 2022), and Low-Rank Tuning (Hu et al., 2022).
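For concreteness, the hypothetical name-based filters below sketch which parameter subset each of these methods updates in a typical PyTorch implementation, and hence which parameters we later reuse as task embeddings. The exact parameter names depend on the backbone and the tuning library, so these filters are assumptions rather than fixed conventions.

```python
# Hypothetical name-based filters selecting the tuned parameter subset for
# each parameter-efficient tuning method; exact names vary by implementation.
TUNED_PARAM_FILTERS = {
    # Prompt tuning (P-Tuning v2): per-layer prefix key/value vectors.
    "prompt": lambda name: "prefix" in name,
    # Bias tuning (BitFit): only the bias terms of the frozen backbone.
    "bias": lambda name: name.endswith(".bias"),
    # Low-rank tuning (LoRA): the low-rank update matrices A and B.
    "lora": lambda name: "lora_A" in name or "lora_B" in name,
}

def tuned_parameters(model, method: str):
    """Return the (name, parameter) pairs that the given method actually updates."""
    keep = TUNED_PARAM_FILTERS[method]
    return [(n, p) for n, p in model.named_parameters() if keep(n)]
```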
Prompt Tuning
We experiment with P-Tuning v2 (Liu et al., 2021). Specifically, P-Tuning v2 implements a prompt tuning method by introducing additional attention prefix matrices $K_t = \{k_1 \ldots k_n\}$ and $V_t = \{v_1 \ldots v_n\}$ for each Transformer layer, where $n$ is a hyperparameter controlling the added prefix length; $k$ and $v$ are vectors with dimension $d_h$, the hidden size of the Transformer model.

For each Transformer layer, the added vectors are concatenated with the original key and value matrices to form $K' = [K_t; K]$ and $V' = [V_t; V]$, where $K$ and $V$ are the original key and value matrices in each layer's attention block. Then, the new scaled dot-product attention is calculated by replacing the original $K$ and $V$ with the new $K'$ and $V'$, respectively.
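As a concrete illustration of this construction, the sketch below implements a single-head attention block with learnable key/value prefixes in PyTorch. It is a simplified stand-in (one head, no projections or dropout) rather than the actual P-Tuning v2 code, which injects per-head prefixes into every layer of the frozen backbone.

```python
import math
import torch
import torch.nn as nn

class PrefixAttention(nn.Module):
    """Single-head scaled dot-product attention with learnable key/value
    prefixes, in the spirit of P-Tuning v2. Only the prefix parameters train."""

    def __init__(self, d_h: int, prefix_len: int):
        super().__init__()
        self.d_h = d_h
        # K_t and V_t: n learnable prefix vectors of dimension d_h.
        self.k_prefix = nn.Parameter(torch.randn(prefix_len, d_h) * 0.02)
        self.v_prefix = nn.Parameter(torch.randn(prefix_len, d_h) * 0.02)

    def forward(self, q, k, v):
        # q, k, v: (batch, seq_len, d_h); K and V come from the frozen model.
        batch = q.size(0)
        k_t = self.k_prefix.unsqueeze(0).expand(batch, -1, -1)
        v_t = self.v_prefix.unsqueeze(0).expand(batch, -1, -1)
        k_new = torch.cat([k_t, k], dim=1)  # K' = [K_t; K]
        v_new = torch.cat([v_t, v], dim=1)  # V' = [V_t; V]
        scores = q @ k_new.transpose(-2, -1) / math.sqrt(self.d_h)
        return torch.softmax(scores, dim=-1) @ v_new
```

For example, `PrefixAttention(d_h=768, prefix_len=16)` adds only 2 × 16 × 768 = 24,576 trainable parameters per layer, and it is these prefix parameters that our method reuses as the task embedding under prompt tuning.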