Hidden State Variability of Pretrained Language Models
Can Guide Computation Reduction for Transfer Learning
Shuo Xie1,2  Jiahao Qiu3  Ankita Pasad2  Li Du4  Qing Qu3  Hongyuan Mei2
1University of Chicago  2Toyota Technological Institute at Chicago
3University of Michigan  4Johns Hopkins University
shuox@uchicago.edu, hongyuan@ttic.edu
(Work done during an internship at TTI-Chicago.)
Abstract

When transferring a pretrained language model to a downstream task, common approaches conventionally attach a task-specific classifier to the top layer and adapt all the pretrained layers. We investigate whether one can instead make a task-specific choice of which subset of layers to adapt and where to place the classifier. The goal is to reduce the computation cost of transfer learning methods (e.g., fine-tuning or adapter-tuning) without sacrificing performance.

We propose to select layers based on the variability of their hidden states given a task-specific corpus. We say a layer is already "well-specialized" in a task if the within-class variability of its hidden states is low relative to the between-class variability. Our variability metric is cheap to compute and requires no training or hyperparameter tuning; it is also robust to data imbalance and data scarcity. Extensive experiments on the GLUE benchmark demonstrate that selecting layers based on our metric yields significantly stronger performance than using the same number of top layers, and often matches the performance of fine-tuning or adapter-tuning the entire language model.
1 Introduction

Transfer learning from a pretrained language model (PLM) is now the de-facto paradigm in natural language processing (NLP). The conventional approaches to leveraging PLMs include fine-tuning all the parameters of the language model (LM) and lightweight alternatives that reduce the number of tuned parameters, such as adapter-tuning (Houlsby et al., 2019; Hu et al., 2022; He et al., 2022) and prefix-tuning (Li and Liang, 2021). These methods have one thing in common: they all involve the entire PLM and attach a classifier to its top layer. However, PLMs were optimized with the language modeling objective, and thus their top layers have been specialized in producing representations that facilitate optimizing that objective. This mismatch between the pretraining and fine-tuning objectives poses the following questions:
Q1: Given a pretrained language model and a downstream task, can we measure how "well-specialized" each layer already is in that task, without any task-specific tuning?

Q2: If the answer to Q1 is yes, can we use the layer-wise "task-specialty" as a guide for improving the computation efficiency of transfer learning methods such as fine-tuning and adapter-tuning?
In this paper, we take a technically principled approach to investigating research questions Q1 and Q2. First, in section 3.1 we define a metric that measures the "task-specialty" of each layer in a given PLM. Our task-specialty score is inspired by the neural collapse (NC) phenomenon that has been widely observed in the computer vision community (Papyan et al., 2020): as training converges, the top-layer representations of images with the same label form an extremely tight cluster. In our setting, we examine the variability of the representations of linguistic sequences given by each layer of the PLM, and define our layer-wise task-specialty to be the within-class variability normalized by the between-class variability. Computing our metric requires no training or hyperparameter tuning. Experiments on the GLUE benchmark demonstrate that it is highly correlated with layer-wise probing performance, thus giving a clear "yes" to question Q1 above.

We propose several layer-selecting strategies in section 3.2 based on our proposed task-specialty metric. Our strategies are complementary to all the major paradigms of transfer learning (such as fine-tuning and adapter-tuning) and can thus take advantage of the state of the art at the time:
only the selected layers will be tuned (e.g., via fine-tuning or using adapters), so that the computation cost of the tuning methods can be further reduced. Experiments on the GLUE benchmark demonstrate that our proposed strategies are highly effective: under a comparable computation budget, fine-tuning or adapter-tuning the layers selected by our strategies achieves significantly higher performance than using the layers selected by widely adopted baseline strategies; it often even matches the performance of fine-tuning or adapter-tuning the entire PLM, which takes 500% more computation. Through extensive ablation studies, we demonstrate the advantages of our proposed task-specialty metric over potential alternatives (such as CCA and mutual information) as well as its robustness to data scarcity and data imbalance.
2 Technical Background

In this paper, we focus on classification tasks. Technically, each classification task has a corpus of training data $\{(x_n, y_n)\}_{n=1}^{N}$ where each $x_n = (x_{n,1}, \ldots, x_{n,T})$ is a sequence of linguistic tokens and each $y_n \in \mathcal{Y}$ is a discrete class label. Such tasks include:

- Sentiment analysis. Each $x$ is a single sequence of words such as "This movie is fantastic" and $y$ is a sentiment label from {positive, negative}. Thus sentiment analysis can be cast as a binary classification problem.
- Natural language inference. Each $x$ is of the form "premise [SEP] hypothesis", such as "Fun for adults and children. [SEP] Fun for only children.", where "[SEP]" is a special separator token. The label $y \in$ {yes, neutral, no} indicates whether the premise entails the hypothesis. It is a three-class classification problem.
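As a concrete illustration of these two input formats, here is a minimal sketch using the HuggingFace transformers tokenizer for roberta-large (the model used in section 4). Note that RoBERTa uses its own special tokens (<s> and </s>) in the roles of the generic CLS and [SEP] described above; the snippet is illustrative rather than the paper's exact preprocessing.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-large")

# Single-sequence task (e.g., sentiment analysis).
enc_single = tokenizer("This movie is fantastic", return_tensors="pt")

# Sequence-pair task (e.g., natural language inference): the tokenizer
# joins premise and hypothesis with RoBERTa's separator tokens.
enc_pair = tokenizer("Fun for adults and children.",
                     "Fun for only children.",
                     return_tensors="pt")

print(tokenizer.convert_ids_to_tokens(enc_pair["input_ids"][0]))
```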
A PLM performs a classification task as follows:

1. It reads each given sequence $x_n$ and embeds it into a series of hidden state vectors

   layer $L$:    $h^{(L)}_{n,0}\; h^{(L)}_{n,1}\; \ldots\; h^{(L)}_{n,t}\; \ldots\; h^{(L)}_{n,T}$
   $\vdots$
   layer $\ell$:    $h^{(\ell)}_{n,0}\; h^{(\ell)}_{n,1}\; \ldots\; h^{(\ell)}_{n,t}\; \ldots\; h^{(\ell)}_{n,T}$
   $\vdots$
   layer $1$:    $h^{(1)}_{n,0}\; h^{(1)}_{n,1}\; \ldots\; h^{(1)}_{n,t}\; \ldots\; h^{(1)}_{n,T}$

   where $h^{(\ell)}_{n,t}$ denotes the hidden state of token $x_{n,t}$ given by layer $\ell$, and $x_{n,0} = \mathrm{CLS}$ is a special classification (CLS) token.
2. The top-layer hidden state $h^{(L)}_{n,0}$ of the CLS token is read by a neural network $f$ followed by a softmax layer, which gives the probability distribution over the target label $y \in \mathcal{Y}$:

   $p(y \mid x_n) = \mathrm{softmax}_y\big(f(h^{(L)}_{n,0})\big)$   (1)

   The network $f$ is also called the "classification head".

Figure 1: An illustration of our variability-based task-specialty metric with hypothetical data. Each dot denotes a two-dimensional hidden state vector and its color denotes its target label. Each colored star denotes the mean vector of its class. (a) High within-class variability and low between-class variability. (b) Low within-class variability and high between-class variability.
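The two steps above can be sketched with the HuggingFace transformers API as follows; the classification head f below is a generic two-layer network of our own choosing, not necessarily the head architecture used in the paper.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-large")
model = AutoModel.from_pretrained("roberta-large", output_hidden_states=True)

enc = tokenizer("This movie is fantastic", return_tensors="pt")
with torch.no_grad():
    out = model(**enc)

# out.hidden_states is a tuple of L + 1 tensors of shape (batch, T + 1, d):
# the embedding output followed by one tensor per encoder layer.
hidden_states = out.hidden_states
h_top_cls = hidden_states[-1][:, 0, :]   # h^(L)_{n,0}: top-layer state of the CLS token

# Classification head f followed by a softmax, as in equation (1).
num_classes = 2
f = nn.Sequential(
    nn.Linear(model.config.hidden_size, model.config.hidden_size),
    nn.Tanh(),
    nn.Linear(model.config.hidden_size, num_classes),
)
p_y_given_x = torch.softmax(f(h_top_cls), dim=-1)
```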
Transfer learning maximizes the log probability of the ground-truth label $y_n$, i.e., $\log p(y_n \mid x_n)$, by learning the parameters of the classification head $f$ as well as certain method-specific parameters:

- Fine-tuning updates all the PLM parameters (Peters et al., 2018; Devlin et al., 2018).
- Adapter-tuning inserts adapters (i.e., small neural networks) into the LM layers and updates the new adapter parameters (Houlsby et al., 2019; Hu et al., 2022; He et al., 2022).
- Prefix-tuning augments the input $x$ with trainable tokens and tunes the new token embeddings (Li and Liang, 2021; Qin and Eisner, 2021; Hambardzumyan et al., 2021).
3 The Method

Our goal is to answer the research questions Q1 and Q2 introduced in section 1. That involves finding a layer-specific metric $\nu^{(1)}, \ldots, \nu^{(\ell)}, \ldots, \nu^{(L)}$ where each $\nu^{(\ell)}$ measures the task-specialty of layer $\ell$. Suppose that we use $s^{(\ell)}$ to denote the task score that we can achieve by letting the classification head read the layer-$\ell$ hidden state $h^{(\ell)}_{n,0}$ of the CLS token. If $\nu^{(\ell)}$ is highly (positively or negatively) correlated with $s^{(\ell)}$, then the answer to question Q1 is yes. Answering question Q2 involves designing $\nu$-based strategies that select a subset of layers to use in transfer learning approaches.

In this section, we introduce our task-specialty metric $\nu^{(\ell)}$ (section 3.1) along with a few strategies for selecting layers (section 3.2). In section 4, we empirically demonstrate the effectiveness of our proposed metric and strategies.
3.1 Hidden State Variability Ratio

For a given task, we define our task-specialty metric $\nu^{(1)}, \ldots, \nu^{(L)}$ based on the variability of the hidden state vectors that the PLM produces when embedding the training input sequences $\{x_n\}_{n=1}^{N}$. We use hypothetical data to illustrate our intuition in Figure 1: after the hidden states are grouped by target label $y_n$, the variability of the hidden states within the same group (dots of the same color) measures the difficulty of separating them, while the variability of the mean states (stars of different colors) quantifies how easy it is to tell the different groups apart.
Technically, for each layer $\ell$, we first define the sequence-level hidden state $h^{(\ell)}_n$ for each input $x_n$ to be the average of the hidden states of all the (non-CLS) tokens: $h^{(\ell)}_n \stackrel{\text{def}}{=} \frac{1}{T}\sum_{t=1}^{T} h^{(\ell)}_{n,t}$. These sequence-level states correspond to the dots in Figure 1. Then we group all the $h^{(\ell)}_n$ based on the target labels $y_n$: $G^{(\ell)}_y \stackrel{\text{def}}{=} \{h^{(\ell)}_n : y_n = y\}$. The mean vector of each group is defined as $\bar{h}^{(\ell)}_y \stackrel{\text{def}}{=} \frac{1}{|G^{(\ell)}_y|}\sum_{h \in G^{(\ell)}_y} h$; these means correspond to the stars in Figure 1. The overall mean of the class means is defined as $\bar{h}^{(\ell)} \stackrel{\text{def}}{=} \frac{1}{|\mathcal{Y}|}\sum_{y \in \mathcal{Y}} \bar{h}^{(\ell)}_y$. Then the within-group variability $\Sigma^{(\ell)}_w$ and between-group variability $\Sigma^{(\ell)}_b$ are defined using the sequence-level states and mean states:
$$\Sigma^{(\ell)}_w \stackrel{\text{def}}{=} \frac{1}{|\mathcal{Y}|} \sum_{y \in \mathcal{Y}} \frac{1}{|G^{(\ell)}_y|} \sum_{h \in G^{(\ell)}_y} \big(h - \bar{h}^{(\ell)}_y\big)\big(h - \bar{h}^{(\ell)}_y\big)^{\top}$$

$$\Sigma^{(\ell)}_b \stackrel{\text{def}}{=} \frac{1}{|\mathcal{Y}|} \sum_{y \in \mathcal{Y}} \big(\bar{h}^{(\ell)}_y - \bar{h}^{(\ell)}\big)\big(\bar{h}^{(\ell)}_y - \bar{h}^{(\ell)}\big)^{\top}$$   (2)
Both $\Sigma^{(\ell)}_w$ and $\Sigma^{(\ell)}_b$ resemble covariance matrices, since they measure deviation from the mean vectors. Finally, we define our task-specialty metric to be the within-group variability $\Sigma^{(\ell)}_w$ scaled and rotated by the pseudo-inverse of the between-group variability $\Sigma^{(\ell)}_b$:

$$\nu^{(\ell)} \stackrel{\text{def}}{=} \frac{1}{|\mathcal{Y}|}\,\mathrm{trace}\Big(\Sigma^{(\ell)}_w\, \Sigma^{(\ell)\,\dagger}_b\Big)$$   (3)

The pseudo-inverse in equation (3) is why we use the average state as our sequence-level representation: averaging reduces the noise in the state vectors and thus leads to a stable computation of $\nu^{(\ell)}$.
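For concreteness, the computation of $\nu^{(\ell)}$ in equations (2) and (3) can be sketched in NumPy as below. The function name variability_ratio is ours; it assumes the per-sequence states $h^{(\ell)}_n$ (averages of the non-CLS token states) have already been collected for the layer in question.

```python
import numpy as np

def variability_ratio(H, y):
    """Hidden state variability ratio nu for one layer.

    H: array of shape (N, d) holding the sequence-level states h^(l)_n.
    y: array of shape (N,) holding the integer class labels y_n.
    """
    classes = np.unique(y)
    d = H.shape[1]
    class_means = np.stack([H[y == c].mean(axis=0) for c in classes])  # bar h^(l)_y
    global_mean = class_means.mean(axis=0)                             # bar h^(l)

    # Within-class variability Sigma_w (equation 2, first line).
    Sigma_w = np.zeros((d, d))
    for c, mu in zip(classes, class_means):
        diff = H[y == c] - mu
        Sigma_w += diff.T @ diff / len(diff)
    Sigma_w /= len(classes)

    # Between-class variability Sigma_b (equation 2, second line).
    diff_b = class_means - global_mean
    Sigma_b = diff_b.T @ diff_b / len(classes)

    # nu = (1/|Y|) trace(Sigma_w pinv(Sigma_b)), equation (3).
    return np.trace(Sigma_w @ np.linalg.pinv(Sigma_b)) / len(classes)
```

Computing $\nu$ for every layer then amounts to calling this function once per layer on the pooled hidden states of the training corpus.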
We believe that layers with small $\nu^{(\ell)}$ are likely to transfer better to the downstream task than those with large $\nu^{(\ell)}$. Our belief stems from the following key insights.

Remark-I: neural collapse. Our proposed metric is mainly inspired by the neural collapse (NC) phenomenon: when a deep neural model is trained to classify images, the top-layer representations of images with the same label form an extremely tight cluster as training converges. Extensive theoretical and empirical studies show that lower within-class variability can indicate better generalization (Papyan et al., 2020; Hui et al., 2022; Galanti et al., 2022). Thus we examine the variability of the layer-wise representations of linguistic sequences, hoping that it can measure the task-specialty of each layer of the given PLM. Our metric is slightly different from the widely accepted neural collapse metric; see Appendix A.1 for a detailed discussion.
Remark-II: signal-to-noise ratio. In multivariate statistics (Anderson, 1973), $\mathrm{trace}\big(\Sigma_w \Sigma_b^{\dagger}\big)$ measures the inverse signal-to-noise ratio for classification problems, so a lower value indicates a lower chance of misclassification. Intuitively, the between-class variability $\Sigma_b$ is the signal that one can use to tell different clusters apart, while the within-class variability $\Sigma_w$ is the noise that makes the clusters overlap and thus the separation difficult; see Figure 1 for examples.
Remark-III: linear discriminant analysis. A low $\nu$ implies that it is easy to correctly classify the data with linear discriminant analysis (LDA) (Hastie et al., 2009). Technically, LDA assumes that the data of each class is Gaussian-distributed, and it classifies a new data point $h$ by checking how close it is to each mean vector $\bar{h}_y$, scaled by the covariance matrix $\Sigma$, which is typically shared across classes. Though our metric does not make the Gaussian assumption, a low $\nu$ suggests that the class means $\bar{h}_y$ are far from each other relative to the within-class variations $\Sigma_w$, meaning that the decision boundary of LDA would tend to be sharp. In fact, our $\Sigma_w$ is an estimate of the Gaussian covariance matrix $\Sigma$ of LDA.
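The following toy experiment, which assumes scikit-learn's LDA implementation and reuses the variability_ratio sketch from section 3.1, illustrates the connection: tighter clusters (lower $\nu$) go hand in hand with higher LDA accuracy, mirroring the two panels of Figure 1.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)

def two_gaussian_classes(spread):
    # Two classes with fixed means; only the within-class spread changes.
    means = np.array([[0.0, 0.0], [3.0, 3.0]])
    X = np.concatenate([rng.normal(m, spread, size=(200, 2)) for m in means])
    y = np.repeat([0, 1], 200)
    return X, y

for spread in (0.5, 3.0):
    X, y = two_gaussian_classes(spread)
    nu = variability_ratio(X, y)   # sketch from section 3.1
    acc = LinearDiscriminantAnalysis().fit(X, y).score(X, y)
    print(f"spread={spread}: nu={nu:.2f}, LDA train accuracy={acc:.2f}")
```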
3.2 Layer-Selecting Strategies

Suppose that our metric $\nu$ can indeed measure the task-specialty of each layer. Then it is natural to investigate how the knowledge of layer-wise task-specialty can be leveraged to improve transfer learning methods; that is what question Q2 in section 1 is concerned with. Recall from section 2 that the major paradigms of transfer learning use all the layers of the given PLM by default. We propose to select a subset of the layers based on their task-specialty, which benefits all the paradigms of transfer learning: they may be able to use only the selected layers yet still achieve strong performance. Using only the selected layers results in a significant reduction of computation cost:

- In fine-tuning, only the parameters of the selected layers are updated.
- In adapter-tuning, adapters are added only to the selected layers rather than to all layers.
- In prefix-tuning, we "deep-perturb" fewer layers.

A smaller number of task-specific parameters means not only less training cost but also less storage and inference cost.

Figure 2: We present our strategies with a toy model of $L = 5$ layers and $\ell^* = 3$. The green layers are tuned (e.g., fine-tuned or adapter-tuned) during task-specific training while the grey layers are not. The white layers are dropped from the tuning and inference procedures, thus further reducing the computation and memory cost. Panels: (a) (2, 3, 5); (b) (1, 3, 3); (c) (2, 3, 3); (d) (4, 5, 5).
Strategy-I: $\ell^*$-down. We use $\ell^*$ to denote the layer that achieves the best task-specialty, i.e., $\ell^* \stackrel{\text{def}}{=} \arg\min_{\ell} \nu^{(\ell)}$. Our first strategy is motivated by the following intuition: if layer $\ell^*$ has already been well-specialized in the given task, then it may suffice to just mildly tune it along with a few layers below it. Meanwhile, we may keep the classification head on the top layer $L$ or move it to the best-specialized layer $\ell^*$: the former still utilizes the higher layers in training and inference; the latter does not, and thus results in even less computation and memory cost.

Technically, we use $(\ell_{\text{bottom}}, \ell_{\text{top}}, \ell_{\text{head}})$ to denote the strategy of selecting the layers $\ell_{\text{bottom}}, \ell_{\text{bottom}}+1, \ldots, \ell_{\text{top}}$ and connecting the classification head to layer $\ell_{\text{head}}$. All the instances of our first strategy can then be denoted as $(\ell_{\text{bottom}}, \ell^*, L)$ or $(\ell_{\text{bottom}}, \ell^*, \ell^*)$ with an appropriately chosen $\ell_{\text{bottom}}$. Figures 2a–2c illustrate a few specific instances of our $\ell^*$-down strategy; a small helper for forming such configurations is sketched below.
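The function name ell_star_down_configs and the hyperparameter k (how many consecutive layers ending at $\ell^*$ to tune) in this sketch are ours, introduced only for illustration.

```python
import numpy as np

def ell_star_down_configs(nus, k=3):
    """Given nus[l-1] = nu^(l) for layers l = 1..L, return the
    (l_bottom, l_top, l_head) instances of the l*-down strategy that
    tune the k layers ending at the best-specialized layer l*."""
    L = len(nus)
    l_star = int(np.argmin(nus)) + 1        # layers are 1-indexed
    l_bottom = max(1, l_star - k + 1)
    return [
        (l_bottom, l_star, L),              # keep the head on the top layer
        (l_bottom, l_star, l_star),         # move the head down to l*
    ]
```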
Strategy-II: $\ell^*$-up. As an alternative to the $\ell^*$-down strategy, our second strategy selects the layers above the best-specialized layer $\ell^*$; we call it the $\ell^*$-up strategy. Intuitively, if layer $\ell^*$ is already well-specialized in the given task, then what we need is perhaps just a powerful classification head. That is, we can regard the higher layers $\ell^*+1, \ldots, L$ together with the original classification head $f$ as a new "deep" classification head and then tune it to better utilize the layer-$\ell^*$ representations.

In principle, all the instances of our second strategy can be denoted as $(\ell^*+1, \ell_{\text{top}}, \ell_{\text{top}})$ or $(\ell^*+1, \ell_{\text{top}}, L)$, since we may select the layers up through $\ell_{\text{top}} \le L$ and move the classification head $f$ to layer $\ell_{\text{top}}$. Figure 2d shows an instance of our $\ell^*$-up strategy.
Note that our $(\ell_{\text{bottom}}, \ell_{\text{top}}, \ell_{\text{head}})$ notation applies to the conventional layer-selecting strategies as well. For example, $(1, L, L)$ denotes the naive option of tuning all the layers of the given PLM; $(L-2, L, L)$ denotes a baseline method of selecting only the top three layers.
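To make the $(\ell_{\text{bottom}}, \ell_{\text{top}}, \ell_{\text{head}})$ notation concrete, here is a rough sketch of how such a configuration could be applied to a HuggingFace RoBERTa encoder by truncating the layer stack at $\ell_{\text{head}}$ and freezing everything outside the selected range. The helper apply_config is our own approximation for the fine-tuning case, not the paper's implementation, and it does not cover adapter- or prefix-tuning.

```python
import torch.nn as nn
from transformers import AutoModel

def apply_config(l_bottom, l_top, l_head, model_name="roberta-large"):
    """Keep layers 1..l_head (the classification head will read layer l_head),
    freeze all parameters, then unfreeze layers l_bottom..l_top (1-indexed)."""
    model = AutoModel.from_pretrained(model_name)

    # Drop the layers above l_head so they are skipped in training and inference.
    model.encoder.layer = nn.ModuleList(list(model.encoder.layer[:l_head]))
    model.config.num_hidden_layers = l_head

    for p in model.parameters():
        p.requires_grad = False             # freeze by default
    for layer in model.encoder.layer[l_bottom - 1:l_top]:
        for p in layer.parameters():
            p.requires_grad = True          # tune only the selected layers
    return model

# Hypothetical example: an (l_bottom, l*, l*) instance with l* = 12,
# tuning layers 10-12 of roberta-large and placing the head on layer 12.
model = apply_config(l_bottom=10, l_top=12, l_head=12)
```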
4 Experiments

We evaluated the effectiveness of our task-specialty metric along with our layer-selecting strategies through extensive experiments on the six classification tasks of the GLUE benchmark (Wang et al., 2019): CoLA, MNLI, MRPC, QNLI, QQP, and SST-2. All of them are sequence-level classification tasks related to natural language understanding, and are thus very different from how language models are pretrained.

We chose the widely used RoBERTa model (Liu et al., 2019b) as our PLM and used the pretrained roberta-large instance (355M parameters) downloaded from HuggingFace (Wolf et al., 2020). Our experiments are mainly conducted with this model. We also experimented with DeBERTa (He et al., 2020) to investigate whether our methods generalize across models;[1] those results are in Appendix C.2 and are similar to the RoBERTa results. Prior work (Mosbach et al., 2020a) found that fine-tuning RoBERTa on GLUE could be unstable, so we ran each of our experiments with five random seeds and reported the means and standard errors. Experiment details (e.g.,

[1] Bowman (2022) advocates that it is important to experiment with more than one pretrained model before drawing any general conclusions about "pretrained language models".