Hidden State Variability of Pretrained Language Models
Can Guide Computation Reduction for Transfer Learning
Shuo Xie1,2  Jiahao Qiu3  Ankita Pasad2  Li Du4  Qing Qu3  Hongyuan Mei2
1University of Chicago  2Toyota Technological Institute at Chicago
3University of Michigan  4Johns Hopkins University
shuox@uchicago.edu, hongyuan@ttic.edu
(Work done during an internship at TTI-Chicago.)
Abstract

When transferring a pretrained language model to a downstream task, common approaches conventionally attach a task-specific classifier to the top layer and adapt all the pretrained layers. We investigate whether one can instead make a task-specific choice of which subset of layers to adapt and where to place the classifier. The goal is to reduce the computation cost of transfer learning methods (e.g., fine-tuning or adapter-tuning) without sacrificing performance.

We propose to select layers based on the variability of their hidden states given a task-specific corpus. We say a layer is already "well-specialized" in a task if the within-class variability of its hidden states is low relative to the between-class variability. Our variability metric is cheap to compute and requires no training or hyperparameter tuning; it is also robust to data imbalance and data scarcity. Extensive experiments on the GLUE benchmark demonstrate that selecting layers based on our metric yields significantly stronger performance than using the same number of top layers, and often matches the performance of fine-tuning or adapter-tuning the entire language model.
1 Introduction

Transfer learning from a pretrained language model (PLM) is now the de-facto paradigm in natural language processing (NLP). The conventional approaches to leveraging PLMs include fine-tuning all the parameters of the language model (LM) and lightweight alternatives that reduce the number of tuned parameters, such as adapter-tuning (Houlsby et al., 2019; Hu et al., 2022; He et al., 2022) and prefix-tuning (Li and Liang, 2021). These methods have one thing in common: they all involve the entire PLM and attach a classifier to its top layer. However, PLMs were optimized with the language modeling objective, and thus their top layers have been specialized in producing representations that facilitate optimizing that objective. This mismatch between the pretraining and fine-tuning objectives poses the following questions:
Q1: Given a pretrained language model and a downstream task, can we measure how "well-specialized" each layer already is in that task, without any task-specific tuning?

Q2: If the answer to Q1 is yes, can we use the layer-wise "task-specialty" as a guide for improving the computation efficiency of transfer learning methods such as fine-tuning and adapter-tuning?
In this paper, we take a technically principled approach to investigating research questions Q1 and Q2. First, in section 3.1 we define a metric that measures the "task-specialty" of each layer in a given PLM. Our task-specialty score is inspired by the neural collapse (NC) phenomenon that has been widely observed in the computer vision community (Papyan et al., 2020): as training converges, the top-layer representations of images with the same label form an extremely tight cluster. In our setting, we examine the variability of the representations of linguistic sequences given by each layer of the PLM, and define our layer-wise task-specialty to be the within-class variability normalized by the between-class variability. Computing our metric requires no training or hyperparameter tuning. Experiments on the GLUE benchmark demonstrate that it is highly correlated with layer-wise probing performance, thus giving a clear "yes" to question Q1 above.

We propose several layer-selecting strategies in section 3.2 based on our proposed task-specialty metric. Our strategies are complementary to all the major paradigms of transfer learning (such as fine-tuning and adapter-tuning) and can thus take advantage of the state of the art at the time:
only the selected layers will be tuned (e.g., via fine-tuning or using adapters), so that the computation cost of the tuning methods can be further reduced. Experiments on the GLUE benchmark demonstrate that our proposed strategies are highly effective: under a comparable computation budget, fine-tuning or adapter-tuning the layers selected by our strategies achieves significantly higher performance than using the layers selected by widely adopted baseline strategies; it often even matches the performance of fine-tuning or adapter-tuning the entire PLM, which takes 500% more computation. Through extensive ablation studies, we demonstrate the advantages of our proposed task-specialty metric over potential alternatives (such as CCA and mutual information) as well as its robustness to data scarcity and data imbalance.
2 Technical Background

In this paper, we focus on classification tasks. Technically, each classification task has a corpus of training data $\{(x_n, y_n)\}_{n=1}^{N}$ where each $x_n = (x_{n,1}, \ldots, x_{n,T})$ is a sequence of linguistic tokens and each $y_n \in \mathcal{Y}$ is a discrete class label. Such tasks include:

- Sentiment analysis. Each $x$ is a single sequence of words such as "This movie is fantastic" and $y$ is a sentiment label from {positive, negative}. Thus sentiment analysis can be cast as a binary classification problem.
- Natural language inference. Each $x$ is of the form "premise [SEP] hypothesis", such as "Fun for adults and children. [SEP] Fun for only children.", where "[SEP]" is a special separator token. The label $y \in$ {yes, neutral, no} indicates whether the premise entails the hypothesis. It is a three-class classification problem.
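As a concrete illustration of these two input formats, here is a minimal sketch using the HuggingFace transformers tokenizer for roberta-large (the model used in section 4). Note that RoBERTa uses its own special tokens (<s> and </s>) in the roles of the generic CLS and [SEP] described above; the snippet is illustrative rather than the paper's exact preprocessing.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-large")

# Single-sequence task (e.g., sentiment analysis).
enc_single = tokenizer("This movie is fantastic", return_tensors="pt")

# Sequence-pair task (e.g., natural language inference): the tokenizer
# joins premise and hypothesis with RoBERTa's separator tokens.
enc_pair = tokenizer("Fun for adults and children.",
                     "Fun for only children.",
                     return_tensors="pt")

print(tokenizer.convert_ids_to_tokens(enc_pair["input_ids"][0]))
```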
A PLM performs a classification task as follows:

1. It reads each given sequence $x_n$ and embeds it into a series of hidden state vectors

   layer $L$:    $h^{(L)}_{n,0}\; h^{(L)}_{n,1}\; \ldots\; h^{(L)}_{n,t}\; \ldots\; h^{(L)}_{n,T}$
   $\vdots$
   layer $\ell$:    $h^{(\ell)}_{n,0}\; h^{(\ell)}_{n,1}\; \ldots\; h^{(\ell)}_{n,t}\; \ldots\; h^{(\ell)}_{n,T}$
   $\vdots$
   layer $1$:    $h^{(1)}_{n,0}\; h^{(1)}_{n,1}\; \ldots\; h^{(1)}_{n,t}\; \ldots\; h^{(1)}_{n,T}$

   where $h^{(\ell)}_{n,t}$ denotes the hidden state of token $x_{n,t}$ given by layer $\ell$, and $x_{n,0} = \mathrm{CLS}$ is a special classification (CLS) token.
2. The top-layer hidden state $h^{(L)}_{n,0}$ of the CLS token is read by a neural network $f$ followed by a softmax layer, which gives the probability distribution over the target label $y \in \mathcal{Y}$:

   $p(y \mid x_n) = \mathrm{softmax}_y\big(f(h^{(L)}_{n,0})\big)$   (1)

   The network $f$ is also called the "classification head".

Figure 1: An illustration of our variability-based task-specialty metric with hypothetical data. Each dot denotes a two-dimensional hidden state vector and its color denotes its target label. Each colored star denotes the mean vector of its class. (a) High within-class variability and low between-class variability. (b) Low within-class variability and high between-class variability.
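The two steps above can be sketched with the HuggingFace transformers API as follows; the classification head f below is a generic two-layer network of our own choosing, not necessarily the head architecture used in the paper.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-large")
model = AutoModel.from_pretrained("roberta-large", output_hidden_states=True)

enc = tokenizer("This movie is fantastic", return_tensors="pt")
with torch.no_grad():
    out = model(**enc)

# out.hidden_states is a tuple of L + 1 tensors of shape (batch, T + 1, d):
# the embedding output followed by one tensor per encoder layer.
hidden_states = out.hidden_states
h_top_cls = hidden_states[-1][:, 0, :]   # h^(L)_{n,0}: top-layer state of the CLS token

# Classification head f followed by a softmax, as in equation (1).
num_classes = 2
f = nn.Sequential(
    nn.Linear(model.config.hidden_size, model.config.hidden_size),
    nn.Tanh(),
    nn.Linear(model.config.hidden_size, num_classes),
)
p_y_given_x = torch.softmax(f(h_top_cls), dim=-1)
```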
Transfer learning maximizes the log probability of the ground-truth label $y_n$, i.e., $\log p(y_n \mid x_n)$, by learning the parameters of the classification head $f$ as well as certain method-specific parameters:

- Fine-tuning updates all the PLM parameters (Peters et al., 2018; Devlin et al., 2018).
- Adapter-tuning inserts adapters (i.e., small neural networks) into the LM layers and updates the new adapter parameters (Houlsby et al., 2019; Hu et al., 2022; He et al., 2022).
- Prefix-tuning augments the input $x$ with trainable tokens and tunes the new token embeddings (Li and Liang, 2021; Qin and Eisner, 2021; Hambardzumyan et al., 2021).
3 The Method

Our goal is to answer the research questions Q1 and Q2 introduced in section 1. That involves finding a layer-specific metric $\nu^{(1)}, \ldots, \nu^{(\ell)}, \ldots, \nu^{(L)}$ where each $\nu^{(\ell)}$ measures the task-specialty of layer $\ell$. Suppose that we use $s^{(\ell)}$ to denote the task score that we can achieve by letting the classification head read the layer-$\ell$ hidden state $h^{(\ell)}_{n,0}$ of the CLS token. If $\nu^{(\ell)}$ is highly (positively or negatively) correlated with $s^{(\ell)}$, then the answer to question Q1 is yes. Answering question Q2 involves designing $\nu$-based strategies that select a subset of layers to use in transfer learning approaches.

In this section, we introduce our task-specialty metric $\nu^{(\ell)}$ (section 3.1) along with a few strategies for selecting layers (section 3.2). In section 4, we empirically demonstrate the effectiveness of our proposed metric and strategies.
3.1 Hidden State Variability Ratio

For a given task, we define our task-specialty metric $\nu^{(1)}, \ldots, \nu^{(L)}$ based on the variability of the hidden state vectors that the PLM produces when embedding the training input sequences $\{x_n\}_{n=1}^{N}$. We use hypothetical data to illustrate our intuition in Figure 1: after the hidden states are grouped by target label $y_n$, the variability of the hidden states within the same group (dots of the same color) measures the difficulty of separating them, while the variability of the mean states (stars of different colors) quantifies how easy it is to tell the different groups apart.
Technically, for each layer $\ell$, we first define the sequence-level hidden state $h^{(\ell)}_n$ for each input $x_n$ to be the average of the hidden states of all the (non-CLS) tokens: $h^{(\ell)}_n \stackrel{\text{def}}{=} \frac{1}{T}\sum_{t=1}^{T} h^{(\ell)}_{n,t}$. These sequence-level states correspond to the dots in Figure 1. Then we group all the $h^{(\ell)}_n$ based on the target labels $y_n$: $G^{(\ell)}_y \stackrel{\text{def}}{=} \{h^{(\ell)}_n : y_n = y\}$. The mean vector of each group is defined as $\bar{h}^{(\ell)}_y \stackrel{\text{def}}{=} \frac{1}{|G^{(\ell)}_y|}\sum_{h \in G^{(\ell)}_y} h$; these means correspond to the stars in Figure 1. The overall mean of the class means is defined as $\bar{h}^{(\ell)} \stackrel{\text{def}}{=} \frac{1}{|\mathcal{Y}|}\sum_{y \in \mathcal{Y}} \bar{h}^{(\ell)}_y$. Then the within-group variability $\Sigma^{(\ell)}_w$ and between-group variability $\Sigma^{(\ell)}_b$ are defined using the sequence-level states and mean states:
$$\Sigma^{(\ell)}_w \stackrel{\text{def}}{=} \frac{1}{|\mathcal{Y}|} \sum_{y \in \mathcal{Y}} \frac{1}{|G^{(\ell)}_y|} \sum_{h \in G^{(\ell)}_y} \big(h - \bar{h}^{(\ell)}_y\big)\big(h - \bar{h}^{(\ell)}_y\big)^{\top}$$

$$\Sigma^{(\ell)}_b \stackrel{\text{def}}{=} \frac{1}{|\mathcal{Y}|} \sum_{y \in \mathcal{Y}} \big(\bar{h}^{(\ell)}_y - \bar{h}^{(\ell)}\big)\big(\bar{h}^{(\ell)}_y - \bar{h}^{(\ell)}\big)^{\top}$$   (2)
Both $\Sigma^{(\ell)}_w$ and $\Sigma^{(\ell)}_b$ resemble covariance matrices, since they measure deviation from the mean vectors. Finally, we define our task-specialty metric to be the within-group variability $\Sigma^{(\ell)}_w$ scaled and rotated by the pseudo-inverse of the between-group variability $\Sigma^{(\ell)}_b$:

$$\nu^{(\ell)} \stackrel{\text{def}}{=} \frac{1}{|\mathcal{Y}|}\,\mathrm{trace}\Big(\Sigma^{(\ell)}_w\, \Sigma^{(\ell)\,\dagger}_b\Big)$$   (3)

The pseudo-inverse in equation (3) is why we use the average state as our sequence-level representation: averaging reduces the noise in the state vectors and thus leads to a stable computation of $\nu^{(\ell)}$.
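For concreteness, the computation of $\nu^{(\ell)}$ in equations (2) and (3) can be sketched in NumPy as below. The function name variability_ratio is ours; it assumes the per-sequence states $h^{(\ell)}_n$ (averages of the non-CLS token states) have already been collected for the layer in question.

```python
import numpy as np

def variability_ratio(H, y):
    """Hidden state variability ratio nu for one layer.

    H: array of shape (N, d) holding the sequence-level states h^(l)_n.
    y: array of shape (N,) holding the integer class labels y_n.
    """
    classes = np.unique(y)
    d = H.shape[1]
    class_means = np.stack([H[y == c].mean(axis=0) for c in classes])  # bar h^(l)_y
    global_mean = class_means.mean(axis=0)                             # bar h^(l)

    # Within-class variability Sigma_w (equation 2, first line).
    Sigma_w = np.zeros((d, d))
    for c, mu in zip(classes, class_means):
        diff = H[y == c] - mu
        Sigma_w += diff.T @ diff / len(diff)
    Sigma_w /= len(classes)

    # Between-class variability Sigma_b (equation 2, second line).
    diff_b = class_means - global_mean
    Sigma_b = diff_b.T @ diff_b / len(classes)

    # nu = (1/|Y|) trace(Sigma_w pinv(Sigma_b)), equation (3).
    return np.trace(Sigma_w @ np.linalg.pinv(Sigma_b)) / len(classes)
```

Computing $\nu$ for every layer then amounts to calling this function once per layer on the pooled hidden states of the training corpus.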
We believe that layers with small $\nu^{(\ell)}$ are likely to transfer better to the downstream task than those with large $\nu^{(\ell)}$. Our belief stems from the following key insights.

Remark-I: neural collapse. Our proposed metric is mainly inspired by the neural collapse (NC) phenomenon: when a deep neural model is trained to classify images, the top-layer representations of images with the same label form an extremely tight cluster as training converges. Extensive theoretical and empirical studies show that lower within-class variability can indicate better generalization (Papyan et al., 2020; Hui et al., 2022; Galanti et al., 2022). Thus we examine the variability of the layer-wise representations of linguistic sequences, hoping that it can measure the task-specialty of each layer of the given PLM. Our metric is slightly different from the widely accepted neural collapse metric; see Appendix A.1 for a detailed discussion.
Remark-II: signal-to-noise ratio. In multivariate statistics (Anderson, 1973), $\mathrm{trace}\big(\Sigma_w \Sigma_b^{\dagger}\big)$ measures the inverse signal-to-noise ratio for classification problems, so a lower value indicates a lower chance of misclassification. Intuitively, the between-class variability $\Sigma_b$ is the signal that one can use to tell different clusters apart, while the within-class variability $\Sigma_w$ is the noise that makes the clusters overlap and thus the separation difficult; see Figure 1 for examples.
Remark-III: linear discriminant analysis. A low $\nu$ implies that it is easy to correctly classify the data with linear discriminant analysis (LDA) (Hastie et al., 2009). Technically, LDA assumes that the data of each class is Gaussian-distributed, and it classifies a new data point $h$ by checking how close it is to each mean vector $\bar{h}_y$, scaled by the covariance matrix $\Sigma$, which is typically shared across classes. Though our metric does not make the Gaussian assumption, a low $\nu$ suggests that the class means $\bar{h}_y$ are far from each other relative to the within-class variations $\Sigma_w$, meaning that the decision boundary of LDA would tend to be sharp. In fact, our $\Sigma_w$ is an estimate of the Gaussian covariance matrix $\Sigma$ of LDA.
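The following toy experiment, which assumes scikit-learn's LDA implementation and reuses the variability_ratio sketch from section 3.1, illustrates the connection: tighter clusters (lower $\nu$) go hand in hand with higher LDA accuracy, mirroring the two panels of Figure 1.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)

def two_gaussian_classes(spread):
    # Two classes with fixed means; only the within-class spread changes.
    means = np.array([[0.0, 0.0], [3.0, 3.0]])
    X = np.concatenate([rng.normal(m, spread, size=(200, 2)) for m in means])
    y = np.repeat([0, 1], 200)
    return X, y

for spread in (0.5, 3.0):
    X, y = two_gaussian_classes(spread)
    nu = variability_ratio(X, y)   # sketch from section 3.1
    acc = LinearDiscriminantAnalysis().fit(X, y).score(X, y)
    print(f"spread={spread}: nu={nu:.2f}, LDA train accuracy={acc:.2f}")
```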
3.2 Layer-Selecting Strategies

Suppose that our metric $\nu$ can indeed measure the task-specialty of each layer. Then it is natural to investigate how the knowledge of layer-wise task-specialty can be leveraged to improve transfer learning methods; that is what question Q2 in section 1 is concerned with. Recall from section 2 that the major paradigms of transfer learning use all the layers of the given PLM by default. We propose to select a subset of the layers based on their task-specialty, which benefits all the paradigms of transfer learning: they may be able to use only the selected layers yet still achieve strong performance. Using only the selected layers results in a significant reduction of computation cost:

- In fine-tuning, only the parameters of the selected layers are updated.
- In adapter-tuning, adapters are added only to the selected layers rather than to all layers.
- In prefix-tuning, we "deep-perturb" fewer layers.

A smaller number of task-specific parameters means not only less training cost but also less storage and inference cost.

Figure 2: We present our strategies with a toy model of $L = 5$ layers and $\ell^* = 3$. The green layers are tuned (e.g., fine-tuned or adapter-tuned) during task-specific training while the grey layers are not. The white layers are dropped from the tuning and inference procedures, thus further reducing the computation and memory cost. Panels: (a) (2, 3, 5); (b) (1, 3, 3); (c) (2, 3, 3); (d) (4, 5, 5).
Strategy-I: $\ell^*$-down. We use $\ell^*$ to denote the layer that achieves the best task-specialty, i.e., $\ell^* \stackrel{\text{def}}{=} \arg\min_{\ell} \nu^{(\ell)}$. Our first strategy is motivated by the following intuition: if layer $\ell^*$ has already been well-specialized in the given task, then it may suffice to just mildly tune it along with a few layers below it. Meanwhile, we may keep the classification head on the top layer $L$ or move it to the best-specialized layer $\ell^*$: the former still utilizes the higher layers in training and inference; the latter does not, and thus results in even less computation and memory cost.

Technically, we use $(\ell_{\text{bottom}}, \ell_{\text{top}}, \ell_{\text{head}})$ to denote the strategy of selecting the layers $\ell_{\text{bottom}}, \ell_{\text{bottom}}+1, \ldots, \ell_{\text{top}}$ and connecting the classification head to layer $\ell_{\text{head}}$. All the instances of our first strategy can then be denoted as $(\ell_{\text{bottom}}, \ell^*, L)$ or $(\ell_{\text{bottom}}, \ell^*, \ell^*)$ with an appropriately chosen $\ell_{\text{bottom}}$. Figures 2a–2c illustrate a few specific instances of our $\ell^*$-down strategy; a small helper for forming such configurations is sketched below.
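The function name ell_star_down_configs and the hyperparameter k (how many consecutive layers ending at $\ell^*$ to tune) in this sketch are ours, introduced only for illustration.

```python
import numpy as np

def ell_star_down_configs(nus, k=3):
    """Given nus[l-1] = nu^(l) for layers l = 1..L, return the
    (l_bottom, l_top, l_head) instances of the l*-down strategy that
    tune the k layers ending at the best-specialized layer l*."""
    L = len(nus)
    l_star = int(np.argmin(nus)) + 1        # layers are 1-indexed
    l_bottom = max(1, l_star - k + 1)
    return [
        (l_bottom, l_star, L),              # keep the head on the top layer
        (l_bottom, l_star, l_star),         # move the head down to l*
    ]
```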
Strategy-II: $\ell^*$-up. As an alternative to the $\ell^*$-down strategy, our second strategy selects the layers above the best-specialized layer $\ell^*$; we call it the $\ell^*$-up strategy. Intuitively, if layer $\ell^*$ is already well-specialized in the given task, then what we need is perhaps just a powerful classification head. That is, we can regard the higher layers $\ell^*+1, \ldots, L$ together with the original classification head $f$ as a new "deep" classification head and then tune it to better utilize the layer-$\ell^*$ representations.

In principle, all the instances of our second strategy can be denoted as $(\ell^*+1, \ell_{\text{top}}, \ell_{\text{top}})$ or $(\ell^*+1, \ell_{\text{top}}, L)$, since we may select the layers up through $\ell_{\text{top}} \le L$ and move the classification head $f$ to layer $\ell_{\text{top}}$. Figure 2d shows an instance of our $\ell^*$-up strategy.
Note that our $(\ell_{\text{bottom}}, \ell_{\text{top}}, \ell_{\text{head}})$ notation applies to the conventional layer-selecting strategies as well. For example, $(1, L, L)$ denotes the naive option of tuning all the layers of the given PLM; $(L-2, L, L)$ denotes a baseline method of selecting only the top three layers.
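To make the $(\ell_{\text{bottom}}, \ell_{\text{top}}, \ell_{\text{head}})$ notation concrete, here is a rough sketch of how such a configuration could be applied to a HuggingFace RoBERTa encoder by truncating the layer stack at $\ell_{\text{head}}$ and freezing everything outside the selected range. The helper apply_config is our own approximation for the fine-tuning case, not the paper's implementation, and it does not cover adapter- or prefix-tuning.

```python
import torch.nn as nn
from transformers import AutoModel

def apply_config(l_bottom, l_top, l_head, model_name="roberta-large"):
    """Keep layers 1..l_head (the classification head will read layer l_head),
    freeze all parameters, then unfreeze layers l_bottom..l_top (1-indexed)."""
    model = AutoModel.from_pretrained(model_name)

    # Drop the layers above l_head so they are skipped in training and inference.
    model.encoder.layer = nn.ModuleList(list(model.encoder.layer[:l_head]))
    model.config.num_hidden_layers = l_head

    for p in model.parameters():
        p.requires_grad = False             # freeze by default
    for layer in model.encoder.layer[l_bottom - 1:l_top]:
        for p in layer.parameters():
            p.requires_grad = True          # tune only the selected layers
    return model

# Hypothetical example: an (l_bottom, l*, l*) instance with l* = 12,
# tuning layers 10-12 of roberta-large and placing the head on layer 12.
model = apply_config(l_bottom=10, l_top=12, l_head=12)
```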
4 Experiments

We evaluated the effectiveness of our task-specialty metric along with our layer-selecting strategies through extensive experiments on the six classification tasks of the GLUE benchmark (Wang et al., 2019): CoLA, MNLI, MRPC, QNLI, QQP, and SST-2. All of them are sequence-level classification tasks related to natural language understanding, and are thus very different from how language models are pretrained.

We chose the widely used RoBERTa model (Liu et al., 2019b) as our PLM and used the pretrained roberta-large instance (355M parameters) downloaded from HuggingFace (Wolf et al., 2020). Our experiments are mainly conducted with this model. We also experimented with DeBERTa (He et al., 2020) to investigate whether our methods generalize across models;[1] those results are in Appendix C.2 and are similar to the RoBERTa results. Prior work (Mosbach et al., 2020a) found that fine-tuning RoBERTa on GLUE could be unstable, so we ran each of our experiments with five random seeds and reported the means and standard errors. Experiment details (e.g.,

[1] Bowman (2022) advocates that it is important to experiment with more than one pretrained model before drawing any general conclusions about "pretrained language models".