
3.1 Hidden State Variability Ratio
For a given task, we define our task-specialty metrics $\nu^{(1)}, \ldots, \nu^{(L)}$ based on the variability of the hidden state vectors that the PLM produces when embedding the training input sequences $\{x_n\}_{n=1}^{N}$. We use hypothetical data to illustrate our intuition in Figure 1: after grouping the hidden states based on the target labels $y_n$, the variability within the same group (dots of the same color) measures the difficulty of separating them, while the variability of the mean states (stars of different colors) quantifies how easy it is to tell the different groups apart.
Technically, for each layer $\ell$, we first define the sequence-level hidden state $h_n^{(\ell)}$ for each input $x_n$ to be the average of the hidden states of all the (non-CLS) tokens: $h_n^{(\ell)} \overset{\text{def}}{=} \frac{1}{T}\sum_{t=1}^{T} h_{n,t}^{(\ell)}$. These sequence-level states correspond to the dots in Figure 1. Then we group all the $h_n^{(\ell)}$ based on the target labels $y_n$: $G_y^{(\ell)} \overset{\text{def}}{=} \{h_n^{(\ell)} : y_n = y\}$. The mean vector of each group is defined as $\bar{h}_y^{(\ell)} \overset{\text{def}}{=} \frac{1}{|G_y^{(\ell)}|}\sum_{h \in G_y^{(\ell)}} h$; these mean vectors correspond to the stars in Figure 1. The overall mean vector across classes is defined as $\bar{h}^{(\ell)} \overset{\text{def}}{=} \frac{1}{|Y|}\sum_{y \in Y} \bar{h}_y^{(\ell)}$. Then the within-group variability $\Sigma_w^{(\ell)}$ and between-group variability $\Sigma_b^{(\ell)}$ are defined using the sequence-level states and mean states:
$$
\Sigma_w^{(\ell)} \overset{\text{def}}{=} \frac{1}{|Y|} \sum_{y \in Y} \frac{1}{|G_y^{(\ell)}|} \sum_{h \in G_y^{(\ell)}} \left(h - \bar{h}_y^{(\ell)}\right)\left(h - \bar{h}_y^{(\ell)}\right)^{\top}
$$
$$
\Sigma_b^{(\ell)} \overset{\text{def}}{=} \frac{1}{|Y|} \sum_{y \in Y} \left(\bar{h}_y^{(\ell)} - \bar{h}^{(\ell)}\right)\left(\bar{h}_y^{(\ell)} - \bar{h}^{(\ell)}\right)^{\top}
$$
Both $\Sigma_w^{(\ell)}$ and $\Sigma_b^{(\ell)}$ are a lot like covariance matrices since they measure the deviation from the mean vectors. Finally, we define our task-specialty metric to be the within-group variability $\Sigma_w^{(\ell)}$ scaled and rotated by the pseudo-inverse of the between-class variability $\Sigma_b^{(\ell)}$:
$$
\nu^{(\ell)} \overset{\text{def}}{=} \frac{1}{|Y|}\,\mathrm{trace}\left(\Sigma_w^{(\ell)} \Sigma_b^{(\ell)\dagger}\right) \qquad (3)
$$
The pseudo-inverse in equation (3) is why we use the average state as our sequence-level representation: averaging reduces the noise in the state vectors and thus leads to stable computation of $\nu^{(\ell)}$.
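To make the computation concrete, below is a minimal NumPy sketch of how $\nu^{(\ell)}$ could be computed for one layer from token-level hidden states and labels; the function name and array shapes are our own illustrative choices rather than part of the method description.

```python
import numpy as np

def layer_variability_ratio(hidden_states, labels):
    """Illustrative sketch (not the authors' code): compute nu for one layer.

    hidden_states: (N, T, D) token-level hidden states of N sequences,
                   with the CLS token already removed.
    labels:        (N,) target label y_n of each sequence.
    """
    # Sequence-level state h_n: average of the (non-CLS) token states.
    seq_states = hidden_states.mean(axis=1)                      # (N, D)

    classes = np.unique(labels)
    class_means = np.stack(
        [seq_states[labels == y].mean(axis=0) for y in classes]  # bar{h}_y
    )                                                            # (|Y|, D)
    overall_mean = class_means.mean(axis=0)                      # bar{h}, (D,)

    dim = seq_states.shape[1]
    sigma_w = np.zeros((dim, dim))
    sigma_b = np.zeros((dim, dim))
    for i, y in enumerate(classes):
        group = seq_states[labels == y]                          # G_y
        centered = group - class_means[i]
        # Within-group variability: mean outer product of deviations
        # from the class mean.
        sigma_w += centered.T @ centered / len(group)
        # Between-group variability: outer product of the class-mean
        # deviation from the overall mean of the class means.
        diff = (class_means[i] - overall_mean)[:, None]
        sigma_b += diff @ diff.T
    sigma_w /= len(classes)
    sigma_b /= len(classes)

    # Equation (3): trace of Sigma_w times the pseudo-inverse of Sigma_b,
    # averaged over the number of classes.
    return np.trace(sigma_w @ np.linalg.pinv(sigma_b)) / len(classes)
```

Under this sketch, a layer that returns a smaller value would be ranked as more specialized for the task.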
We believe that the layers with small $\nu^{(\ell)}$ are likely to do better than those with large $\nu^{(\ell)}$ when transferred to the downstream task. Our belief stems from the following key insights.
Remark-I: neural collapse. Our proposed metric is mainly inspired by the neural collapse (NC) phenomenon: when training a deep neural model to classify images, one can see that the top-layer representations of the images with the same label form an extremely tight cluster as training converges. Extensive theoretical and empirical studies show that a lower within-class variability can indicate better generalization (Papyan et al., 2020; Hui et al., 2022; Galanti et al., 2022). Thus we examine the variability of the layer-wise representations of the linguistic sequences and hope that it can measure the task-specialty of each layer of the given PLM. Our metric is slightly different from the widely accepted neural collapse metric; please see Appendix A.1 for a detailed discussion.
Remark-II: signal-to-noise ratio. In multivariate statistics (Anderson, 1973), $\mathrm{trace}\,\Sigma_w\Sigma_b^{\dagger}$ measures the inverse signal-to-noise ratio for classification problems, and thus a lower value indicates a lower chance of misclassification. Intuitively, the between-class variability $\Sigma_b$ is the signal that one can use to tell different clusters apart, while the within-class variability $\Sigma_w$ is the noise that makes the clusters overlap and thus the separation difficult; see Figure 1 for examples.
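As a small illustration of this interpretation (our own toy example, not from the paper), the `layer_variability_ratio` sketch above can be applied to two synthetic label groups: keeping the group means fixed while inflating the within-group spread raises the ratio, and tightening the clusters lowers it.

```python
import numpy as np

rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 200)                       # two classes, 200 points each
centers = np.array([[0.0, 0.0], [3.0, 0.0]])          # well-separated class means

# Tight clusters: small within-group noise relative to the mean separation.
tight = centers[labels] + 0.3 * rng.standard_normal((400, 2))
# Diffuse clusters: large within-group noise, same mean separation.
diffuse = centers[labels] + 2.0 * rng.standard_normal((400, 2))

# Reuse the earlier sketch; a dummy "token" axis stands in for the sequence
# dimension because these toy points are already sequence-level vectors.
print(layer_variability_ratio(tight[:, None, :], labels))    # small ratio: easy to separate
print(layer_variability_ratio(diffuse[:, None, :], labels))  # large ratio: clusters overlap
```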
Remark-III: linear discriminant analysis. A low $\nu$ implies that it is easy to correctly classify the data with linear discriminant analysis (LDA) (Hastie et al., 2009). Technically, LDA assumes that the data of each class is Gaussian-distributed, and it classifies a new data point $h$ by checking how close it is to each mean vector $\bar{h}_y$, scaled by the covariance matrix $\Sigma$, which is typically shared across classes. Though our metric does not make the Gaussian assumption, a low $\nu$ suggests that the class means $\bar{h}_y$ are far from each other relative to the within-class variations $\Sigma_w$, meaning that the decision boundary of LDA would tend to be sharp. In fact, our $\Sigma_w$ is an estimate of the Gaussian covariance matrix $\Sigma$ of LDA.
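To make the connection concrete, here is a brief sketch (again our own illustration, with hypothetical names) of how a new sequence-level state would be classified in the LDA style described above, using the class means $\bar{h}_y$ and $\Sigma_w$ as the shared covariance estimate:

```python
import numpy as np

def lda_style_predict(h, class_means, sigma_w):
    """Assign h to the class whose mean is closest under the distance
    induced by the shared covariance estimate (equal class priors assumed).

    h:           (D,)     new sequence-level hidden state
    class_means: (|Y|, D) per-class mean vectors bar{h}_y
    sigma_w:     (D, D)   within-class variability, used as the shared covariance
    """
    precision = np.linalg.pinv(sigma_w)      # pseudo-inverse for robustness
    diffs = class_means - h                  # (|Y|, D)
    # Squared Mahalanobis distance to each class mean.
    dists = np.einsum('yd,de,ye->y', diffs, precision, diffs)
    return int(np.argmin(dists))
```

When $\nu$ is low, the correct class mean is typically much closer than the others under this distance, so the argmin is rarely ambiguous.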
3.2 Layer-Selecting Strategies
Suppose that our metric $\nu$ can indeed measure the task-specialty of each layer. Then it is natural to investigate how the knowledge of layer-wise task-specialty can be leveraged to improve transfer learning methods; that is what the question in section 1 is concerned with. Recall from section 2 that the major paradigms of transfer learning use all the layers of the given PLM by default. We propose to select a subset of the layers based on