menting with large language models.
In this work, we opt for parameter-efficient tuning approaches (Houlsby et al., 2019; Li and Liang, 2021; Guo et al., 2021; Hu et al., 2022; Zaken et al., 2022) for the efficient and accurate prediction of inter-task transferability. Our key insight is that the task-specific parameters updated in parameter-efficient tuning methods are likely to encode high-density task-specific information, since
they are used as a query for retrieving task-related
knowledge in a frozen pretrained language model.
Therefore, we propose to directly use task-specific
parameters learned via parameter-efficient tuning
on source/target datasets as task embeddings, as
shown in Figure 1. Compared to task embeddings obtained by calculating the Fisher information matrix of a fine-tuned model (Achille et al., 2019; Vu et al., 2020), efficiently tuned parameters are of
much lower dimensionality and do not suffer from
noise from uninformative weights in the model
parameters, thus leading to more accurate trans-
ferability prediction. Also, our method only requires running parameter-efficient tuning on the tasks and storing the task-specific parameters, making both computing and storing task embeddings more efficient.
Moreover, with the development of open-source parameter-efficient tuning platforms like AdapterHub (Pfeiffer et al., 2020), we can easily obtain off-the-shelf parameters for the source and target datasets from the model zoo and directly compute the similarity between the downloaded parameters.
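To make this concrete, the sketch below (PyTorch; the function names are ours and hypothetical) flattens the tuned parameters of each task into a vector and ranks candidate source tasks by their similarity to the target task embedding, assuming cosine similarity as the similarity measure and that all tasks share the same parameter-efficient tuning configuration:

```python
import torch
import torch.nn.functional as F

def task_embedding(tuned_params):
    """Flatten the task-specific parameters learned by parameter-efficient
    tuning (e.g., soft prompts, bias terms, or low-rank matrices) into one
    task-embedding vector."""
    return torch.cat(
        [p.detach().flatten() for _, p in sorted(tuned_params.items())]
    )

def rank_source_tasks(target_params, source_params_by_task):
    """Rank candidate source tasks by the cosine similarity between their
    task embeddings and the target task embedding (highest first)."""
    target_emb = task_embedding(target_params)
    scores = {
        task: F.cosine_similarity(task_embedding(params), target_emb, dim=0).item()
        for task, params in source_params_by_task.items()
    }
    return sorted(scores, key=scores.get, reverse=True)
```

Since the source-task embeddings only need to be computed and stored once, selecting an intermediate task for a new target then requires only one round of parameter-efficient tuning on the target dataset.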
We empirically verify the effectiveness of our
approach by experimenting with 11 text classifi-
cation tasks and 11 question answering tasks, fol-
lowing Vu et al. (2020). Our results show that our
approach consistently outperforms existing inter-
task transferability prediction methods while being
simpler and more efficient. In addition, we find that
the ability of efficiently tuned parameters to predict transferability is not strongly correlated with their in-task performance. Therefore, task-specific parameters tuned with a relatively small number of steps are already highly predictive of inter-task
transferability, allowing us to further improve the
efficiency of intermediate task selection.
2 Related Work
Prior work (Phang et al., 2018) shows that posi-
tive transfer can be elicited by training a model
on intermediate source tasks before fine-tuning on
the target task. However, the choice of an appro-
priate source task is crucial for effective transfer.
Phang et al. (2018) show that the size of the source dataset is a good prior for source task selection. Pruksachatkun et al. (2020) propose to use tasks requiring complex reasoning and inference as source tasks. Beyond these heuristics, a line of work also focuses on systematically predicting intermediate task transferability. Vu et al. (2020) propose to use TASK2VEC to construct task embeddings
based on the input text or Fisher information ma-
trix of a fine-tuned model. Poth et al. (2021) fur-
ther extend similar ideas for adapter-based trans-
fer learning. More recently, Vu et al. (2021) explore prompt-based transfer and propose using prompt similarity to predict prompt transferability and select suitable soft prompts for initialization. This can be viewed as a special case of
our proposed method where the parameter-efficient
tuning method is restricted to vanilla prompt tun-
ing (Lester et al., 2021) and the transfer method
is restricted to prompt transfer instead of general
intermediate-task transfer.
3 Methodology
3.1 Parameter-Efficient Tuning
Parameter-efficient tuning only updates a small
portion of parameters in a large pretrained model.
In this paper, we experiment with three types of
parameter-efficient tuning: Prompt Tuning (Liu et al., 2021), Bias Tuning (Zaken et al., 2022), and Low-Rank Tuning (Hu et al., 2022).
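As a minimal sketch of how these methods confine the trainable parameters, the snippet below illustrates bias tuning in PyTorch by freezing every weight except the bias terms; the checkpoint name and the head-parameter prefix are assumptions made for illustration:

```python
from transformers import AutoModelForSequenceClassification

# Hypothetical checkpoint; any pretrained Transformer encoder works similarly.
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

# Bias tuning: freeze everything except bias terms (and the task head, whose
# parameter names start with "classifier" in this particular architecture).
for name, param in model.named_parameters():
    param.requires_grad = name.endswith(".bias") or name.startswith("classifier")

n_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
n_total = sum(p.numel() for p in model.parameters())
print(f"trainable parameters: {n_trainable} / {n_total}")
```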
Prompt Tuning
We experiment with P-Tuning v2 (Liu et al., 2021). Specifically, P-Tuning v2 implements a prompt tuning method by introducing additional attention prefix matrices $K_t = \{k_1 \ldots k_n\}$ and $V_t = \{v_1 \ldots v_n\}$ for each Transformer layer, where $n$ is a hyperparameter controlling the added prefix length; $k_*$ and $v_*$ are vectors with dimension $d_h$, where $d_h$ is the hidden size of the Transformer model.
For each Transformer layer, the added vectors are concatenated with the original key and value matrices to form $K' = K_t \oplus K$ and $V' = V_t \oplus V$, where $K$ and $V$ are the original key and value matrices in each layer's attention block. Then, the new scaled dot-product attention is calculated by replacing the original $K$ and $V$ with the new $K'$ and $V'$, respectively.
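To make the prefix mechanism concrete, the following simplified sketch (single attention head, no batching; tensor and function names are ours) prepends the learned prefix keys and values before the scaled dot-product attention:

```python
import math
import torch

def prefix_attention(Q, K, V, K_t, V_t):
    """Scaled dot-product attention with learned prefix keys/values.

    Q, K, V : (seq_len, d_h) query/key/value of one attention head.
    K_t, V_t: (n, d_h) task-specific prefix vectors learned by P-Tuning v2.
    """
    d_h = Q.size(-1)
    K_prime = torch.cat([K_t, K], dim=0)   # K' = K_t ⊕ K, shape (n + seq_len, d_h)
    V_prime = torch.cat([V_t, V], dim=0)   # V' = V_t ⊕ V
    scores = Q @ K_prime.T / math.sqrt(d_h)
    return torch.softmax(scores, dim=-1) @ V_prime  # (seq_len, d_h)

# Toy example: sequence length 8, prefix length n = 4, hidden size 16.
seq_len, n, d_h = 8, 4, 16
Q, K, V = (torch.randn(seq_len, d_h) for _ in range(3))
K_t = torch.nn.Parameter(torch.randn(n, d_h))  # trainable prefix keys
V_t = torch.nn.Parameter(torch.randn(n, d_h))  # trainable prefix values
out = prefix_attention(Q, K, V, K_t, V_t)
assert out.shape == (seq_len, d_h)
```

Only the prefix matrices $K_t$ and $V_t$ (one pair per layer) are updated during tuning, and it is these task-specific parameters that we reuse as the task embedding.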