GULP: a prediction-based metric between representations

Enric Boix-Adserà (MIT, eboix@mit.edu)
Hannah Lawrence (MIT, hanlaw@mit.edu)
George Stepaniants (MIT, gstepan@mit.edu)
Philippe Rigollet (MIT, rigollet@math.mit.edu)
Abstract
Comparing the representations learned by different neural networks has recently
emerged as a key tool to understand various architectures and ultimately optimize
them. In this work, we introduce GULP, a family of distance measures between
representations that is explicitly motivated by downstream predictive tasks. By
construction, GULP provides uniform control over the difference in prediction
performance between two representations, with respect to regularized linear prediction tasks.
Moreover, it satisfies several desirable structural properties, such
as the triangle inequality and invariance under orthogonal transformations, and
thus lends itself to data embedding and visualization. We extensively evaluate
GULP relative to other methods, and demonstrate that it correctly differentiates
between architecture families, converges over the course of training, and captures
generalization performance on downstream linear tasks.
1 Introduction
The spectacular success of deep neural networks (DNNs) witnessed over the past decade has been largely attributed to their ability to generate good representations of the data [BCV13]. But what makes a representation good? Answering this question is a necessary step towards a principled theory of DNN design. This fundamental question calls for a metric over representations as a basic primitive. Indeed, embedding representations into a metric space enables comparison, modification, and ultimately optimization of DNN architectures [LTQ+18]; see Figure 1.

In light of the practical impact of a meaningful metric over representations, this question has recently garnered significant attention, leading to a myriad of proposals such as CCA, CKA, and PROCRUSTES. Their relative pros and cons are currently the subject of a lively debate [DDS21, DHN+22] whose resolution calls for a theoretically grounded notion of metric.
Our contributions. In this work, we define a new family of metrics¹, called GULP², over the space of representations. Our construction rests on a functional notion of what makes two representations similar: namely, two representations are similar if and only if they are equally useful as inputs to downstream linear transfer learning tasks. This idea is partially inspired by feature-based transfer learning, in which simple models adapt pretrained representations, such as Inceptionv3 [SVI+16], CLIP [RKH+21], and ELMo [PNI+18], for specific tasks [RASC14]; indeed, this is a key use of pretrained representations. Moreover, our application of linear transfer learning is reminiscent of linear probes, which were introduced by [AB17] as a tool to compare internal layers of a DNN in terms of prediction accuracy. Linear probes play a central role in the literature on hidden representations: they have been used not only to study the information captured by hidden representations [RBH20], but also to define desiderata for distances between representations [DDS21]. However, previous applications of linear probing required hand-selecting the task on which prediction accuracy is measured, whereas our GULP distance provides a uniform bound over all norm-bounded tasks.

¹ More specifically, we define pseudo-metrics rather than metrics. However, these can readily be turned into a metric using metric identification, which amounts to working with equivalence classes of representations.
² GULP stands for Uniform Linear Probing.

36th Conference on Neural Information Processing Systems (NeurIPS 2022).

Figure 1: t-SNE embedding of various pretrained DNN representations of the ImageNet [KSH12] dataset under the GULP distance (λ = 10²), colored by architecture type (gray denotes architectures that do not belong to a family). The embedding shows good clustering of the various architectures (ResNets, EfficientNets, etc.), indicating that GULP captures intrinsic aspects of representations shared within an architecture family.
We establish various theoretical properties of the GULP pseudo-metric, including the triangle inequality (Thm 2), sample complexity (Thm 3), and vanishing cases. In particular, we show that, akin to the PROCRUSTES pseudo-metric, GULP is invariant under orthogonal transformations (Thm 1) and vanishes precisely when the two representations are related by an orthogonal transformation (Thm 2). In turn, we use GULP to produce low-dimensional embeddings of various DNNs that provide new insights into the relationships between architectures (Figures 1, 5, and 6). Moreover, in Figure 7, we showcase a numerical experiment demonstrating that the GULP distance between two independent networks decreases during training on the same dataset.
Related work. This contribution is part of a growing body of work that aims at providing tools to understand and quantify the metric space of representations [RGYSD17, MRB18, KNLH19, AB17, ALM17, CLR+18, LC00, LV15, LYC+15, LLL+19, Mig19, STHH17, WHG+18, DDS21, DHN+22, CKMK22]. Several of these measures, such as SVCCA [RGYSD17] and PWCCA [MRB18], are based on classical canonical correlation analysis (CCA) from multivariate analysis [And84]. More recently, centered kernel alignment (CKA) [CSTEK01, CMR12, KNLH19, DHN+22] has emerged as a popular measure; see Section 2 for more details on these methods. The orthogonal Procrustes metric (PROCRUSTES) is a classical tool of shape analysis [DM16] used to compute the distance between labelled point clouds. Though not as conspicuous as CKA-based methods in the context of DNN representations, it was recently presented in a favorable light in [DDS21].
Various desirable properties of a similarity measure between representations have been put forward. These include structural properties such as invariance or equivariance [LC00, KNLH19], as well as sanity checks such as specificity against random initialization [DDS21]. Such desiderata can serve as diagnostics for existing similarity measures, but they fall short of providing concrete design guidelines.
Outline. The rest of the paper proceeds as follows. Section 2 lays out the derivation of GULP, as well as important theoretical properties: conditions under which it is zero, and limiting cases in terms of the regularization parameter λ, demonstrating that it interpolates between CCA and a version of CKA. Section 3 establishes concentration results for the finite-sample version, justifying its use in practice. In Section 4 we validate GULP through extensive experiments³. Finally, we conclude in Section 5.
2 The GULP distance
As stated in the introduction, the goal of this paper is to develop a pseudo-metric over the space of
representations of a given dataset. Unlike previous approaches, which work with finite datasets, we
take a statistical perspective and formulate the population version of our problem. We defer statistical
questions arising from finite sample size to Section 3.
Let X ∈ ℝ^d be a random input with distribution P_X and let f : ℝ^d → ℝ^k denote a representation map, such as a trained DNN. The random vector f(X) ∈ ℝ^k is the representation of X by f. We assume throughout that a representation map is centered and normalized, so that E[f(X)] = 0 and E‖f(X)‖² = 1. In particular, this normalization allows us to identify (unnormalized) representation maps φ, ψ that are related by ψ(x) = aφ(x) + b, P_X-a.s. for some a ∈ ℝ and b ∈ ℝ^k, down to a single representation of X (after normalizing), which is a well-known requirement for distances between representations [KNLH19, Sec. 2.3].
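For concreteness, the following is a minimal numpy sketch of this centering and normalization step applied to a finite sample of representations. It is our own illustration rather than the authors' released code, and the helper name center_and_normalize is hypothetical.

```python
import numpy as np

def center_and_normalize(F: np.ndarray) -> np.ndarray:
    """Center and rescale a sample of representations so that the empirical
    analogues of E[f(X)] = 0 and E||f(X)||^2 = 1 hold.
    F has shape (n, k): one row per input x_i, containing f(x_i)."""
    F = F - F.mean(axis=0, keepdims=True)          # empirical mean -> 0
    scale = np.sqrt((F ** 2).sum(axis=1).mean())   # sqrt of empirical E||f(X)||^2
    return F / scale
```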
We are now in a position to define the GULP distance between representations; the terminology "distance" is justified in Theorem 2. To that end, let φ : ℝ^d → ℝ^k and ψ : ℝ^d → ℝ^ℓ be two representation maps, where ℓ may differ from k. Let (X, Y) ∈ ℝ^d × ℝ be a random pair and let η(x) = E[Y | X = x] denote the regression function of Y onto X. Moreover, for any λ > 0, let β_λ denote the population ridge regression solution given by

$$\beta_\lambda = \arg\min_{\beta} \; \mathbb{E}\big[(\beta^\top \varphi(X) - Y)^2\big] + \lambda \|\beta\|^2,$$

and similarly for γ_λ with respect to ψ(·). Since we use squared error, these quantities depend on the distribution of Y only through the regression function η.
Definition 1. Fix λ > 0. The GULP distance between representations φ(X) and ψ(X) is given by

$$d_\lambda(\varphi, \psi) := \sup_{\eta}\; \mathbb{E}\big[(\beta_\lambda^\top \varphi(X) - \gamma_\lambda^\top \psi(X))^2\big]^{1/2},$$

where the supremum is taken over all regression functions η such that ‖η‖_{L²(P_X)} ≤ 1.
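To unpack Definition 1, note that the ridge objective above is minimized in closed form by β_λ = (Σ_φ + λI)^{-1} E[φ(X)Y]. The sketch below, our own illustration with hypothetical helper names (not the paper's implementation), computes the empirical β_λ and γ_λ for a single task y and measures the gap between the two ridge predictions; GULP instead takes the supremum of this gap over all norm-bounded regression functions η.

```python
import numpy as np

def ridge_coefficients(F: np.ndarray, y: np.ndarray, lam: float) -> np.ndarray:
    """Empirical ridge solution beta_lambda = (Sigma_F + lam*I)^{-1} E[F(X) Y],
    with population moments replaced by sample averages.
    F: (n, k) representations, y: (n,) responses, lam: ridge parameter."""
    n, k = F.shape
    cov = F.T @ F / n                      # empirical covariance Sigma_F
    cross = F.T @ y / n                    # empirical cross-moment E[F(X) Y]
    return np.linalg.solve(cov + lam * np.eye(k), cross)

def prediction_gap(phi: np.ndarray, psi: np.ndarray, y: np.ndarray, lam: float) -> float:
    """Root-mean-square gap between the two ridge predictions for ONE task y.
    GULP (Definition 1) is the supremum of this quantity over all tasks with
    ||eta||_{L2(P_X)} <= 1, rather than its value at a single task."""
    beta = ridge_coefficients(phi, y, lam)
    gamma = ridge_coefficients(psi, y, lam)
    return float(np.sqrt(np.mean((phi @ beta - psi @ gamma) ** 2)))
```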
The GULP distance measures the discrepancy between the prediction of an optimal ridge regression estimator based on φ and its counterpart based on ψ, uniformly over all regression tasks. While this notion of distance is intuitive and motivated by a clear regression task, it is a priori unclear how to compute it. The next proposition provides an equivalent formulation of GULP that is amenable to accurate and efficient estimation; see Section 3. It is based on the following covariance matrices:

$$\Sigma_\varphi = \operatorname{cov}(\varphi(X)) = \mathbb{E}[\varphi(X)\varphi(X)^\top], \qquad \Sigma_\psi = \operatorname{cov}(\psi(X)) = \mathbb{E}[\psi(X)\psi(X)^\top]. \tag{1}$$

We implicitly used the centering assumption in the above definition, and the normalization condition implies that the covariance matrices have unit trace. Throughout, we assume these matrices are invertible, which is without loss of generality by projecting onto the image of the representation map. We also define the regularized inverses

$$\Sigma_\varphi^{\lambda} := (\Sigma_\varphi + \lambda I_k)^{-1}, \qquad \Sigma_\psi^{\lambda} := (\Sigma_\psi + \lambda I_\ell)^{-1},$$

as well as the cross-covariance matrices Σ_{φψ} and Σ_{ψφ}:

$$\Sigma_{\varphi\psi} = \mathbb{E}[\varphi(X)\psi(X)^\top] = \Sigma_{\psi\varphi}^\top. \tag{2}$$
Proposition 1. Fix λ ≥ 0. The GULP distance between representations φ(X) and ψ(X) satisfies

$$d_\lambda^2(\varphi, \psi) = \operatorname{tr}(\Sigma_\varphi^{\lambda}\Sigma_\varphi\Sigma_\varphi^{\lambda}\Sigma_\varphi) + \operatorname{tr}(\Sigma_\psi^{\lambda}\Sigma_\psi\Sigma_\psi^{\lambda}\Sigma_\psi) - 2\operatorname{tr}(\Sigma_\varphi^{\lambda}\Sigma_{\varphi\psi}\Sigma_\psi^{\lambda}\Sigma_{\varphi\psi}^\top). \tag{3}$$

Proof. See Appendix A.1.
³ Our code is available at https://github.com/sgstepaniants/GULP.
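The plug-in version of Eq. (3) is straightforward to compute from two sample matrices. The following numpy sketch is our own illustration of that plug-in formula, not the released implementation at the repository above, which may differ; the function name gulp_distance is hypothetical.

```python
import numpy as np

def gulp_distance(phi: np.ndarray, psi: np.ndarray, lam: float) -> float:
    """Plug-in estimate of the GULP distance d_lambda(phi, psi) via Eq. (3).
    phi: (n, k) and psi: (n, l) hold the two representations of the same n inputs."""
    n = phi.shape[0]
    # Enforce the paper's normalization: E[f(X)] = 0 and E||f(X)||^2 = 1.
    phi = phi - phi.mean(axis=0)
    psi = psi - psi.mean(axis=0)
    phi = phi / np.sqrt((phi ** 2).sum(axis=1).mean())
    psi = psi / np.sqrt((psi ** 2).sum(axis=1).mean())

    # Empirical (cross-)covariances, Eqs. (1)-(2).
    cov_phi, cov_psi = phi.T @ phi / n, psi.T @ psi / n
    cov_cross = phi.T @ psi / n

    # Regularized inverses Sigma^lambda = (Sigma + lam*I)^{-1}.
    inv_phi = np.linalg.inv(cov_phi + lam * np.eye(cov_phi.shape[0]))
    inv_psi = np.linalg.inv(cov_psi + lam * np.eye(cov_psi.shape[0]))

    # Squared distance, Eq. (3).
    d2 = (np.trace(inv_phi @ cov_phi @ inv_phi @ cov_phi)
          + np.trace(inv_psi @ cov_psi @ inv_psi @ cov_psi)
          - 2.0 * np.trace(inv_phi @ cov_cross @ inv_psi @ cov_cross.T))
    return float(np.sqrt(max(d2, 0.0)))
```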
2.1 Structural properties
In this section, we show that GULP is invariant under orthogonal transformations and that it is a valid
metric on the space of representations. We begin by establishing a third characterization of GULP that
is useful for the purposes of this section; the proof can be found in Appendix A.1.
Lemma 1. Fix λ ≥ 0. The GULP distance d_λ(φ, ψ) between the representations φ(X) and ψ(X) satisfies

$$d_\lambda^2(\varphi, \psi) = \mathbb{E}\big[(\varphi(X)^\top \Sigma_\varphi^{\lambda}\varphi(X') - \psi(X)^\top \Sigma_\psi^{\lambda}\psi(X'))^2\big],$$

where X' is an independent copy of X.
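With empirical covariances, the pairwise formulation of Lemma 1 coincides exactly with the trace formula of Eq. (3) once the expectation over (X, X') is replaced by an average over all ordered sample pairs. The following sketch checks this numerically on synthetic Gaussian data; it is our own illustration and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, l, lam = 500, 8, 6, 0.1
phi = rng.normal(size=(n, k))              # synthetic stand-ins for phi(x_i)
psi = rng.normal(size=(n, l))              # synthetic stand-ins for psi(x_i)

cov_phi, cov_psi = phi.T @ phi / n, psi.T @ psi / n
cov_cross = phi.T @ psi / n
inv_phi = np.linalg.inv(cov_phi + lam * np.eye(k))
inv_psi = np.linalg.inv(cov_psi + lam * np.eye(l))

# Trace formula, Eq. (3).
d2_trace = (np.trace(inv_phi @ cov_phi @ inv_phi @ cov_phi)
            + np.trace(inv_psi @ cov_psi @ inv_psi @ cov_psi)
            - 2 * np.trace(inv_phi @ cov_cross @ inv_psi @ cov_cross.T))

# Lemma 1: average over all ordered sample pairs (x_i, x_j) playing the role of (X, X').
A = phi @ inv_phi @ phi.T                  # A[i, j] = phi(x_i)^T Sigma^lam_phi phi(x_j)
B = psi @ inv_psi @ psi.T
d2_pairs = np.mean((A - B) ** 2)

assert np.isclose(d2_trace, d2_pairs)      # identical up to floating-point error
```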
We are now in a position to state our main structural results. We begin with a key invariance result.

Theorem 1. Fix λ ≥ 0. The GULP distance d_λ(φ, ψ) between the representations φ(X) ∈ ℝ^k and ψ(X) ∈ ℝ^ℓ is invariant under orthogonal transformations: for any orthogonal transformations U : ℝ^k → ℝ^k and V : ℝ^ℓ → ℝ^ℓ, it holds that

$$d_\lambda(U\varphi, V\psi) = d_\lambda(\varphi, \psi).$$

Proof. We slightly abuse notation by identifying any orthogonal transformation W with a matrix W such that W(x) = W·x. Note that for any representation map f, we have Σ_{Wf} = W Σ_f W^⊤ and

$$\Sigma_{Wf}^{\lambda} = (W\Sigma_f W^\top + \lambda WW^\top)^{-1} = W(\Sigma_f + \lambda I)^{-1}W^\top = W\Sigma_f^{\lambda}W^\top.$$

Hence, using Lemma 1, we get

$$d_\lambda^2(U\varphi, V\psi) = \mathbb{E}\big[(\varphi(X)^\top U^\top U\Sigma_\varphi^{\lambda}U^\top U\varphi(X') - \psi(X)^\top V^\top V\Sigma_\psi^{\lambda}V^\top V\psi(X'))^2\big] = \mathbb{E}\big[(\varphi(X)^\top \Sigma_\varphi^{\lambda}\varphi(X') - \psi(X)^\top \Sigma_\psi^{\lambda}\psi(X'))^2\big] = d_\lambda^2(\varphi, \psi),$$

where we used the fact that U^⊤U = I_k and V^⊤V = I_ℓ.
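Theorem 1, together with the "if" direction of Theorem 2 below, can be sanity-checked numerically: applying orthogonal maps to the samples leaves the plug-in distance of Eq. (3) unchanged, and two representations related by an orthogonal map sit at distance zero. The following self-contained numpy sketch is our own illustration, with a hypothetical helper gulp_sq.

```python
import numpy as np

def gulp_sq(phi: np.ndarray, psi: np.ndarray, lam: float) -> float:
    """Squared plug-in GULP distance, Eq. (3)."""
    n = phi.shape[0]
    cp, cq, cx = phi.T @ phi / n, psi.T @ psi / n, phi.T @ psi / n
    ip = np.linalg.inv(cp + lam * np.eye(cp.shape[0]))
    iq = np.linalg.inv(cq + lam * np.eye(cq.shape[0]))
    return (np.trace(ip @ cp @ ip @ cp) + np.trace(iq @ cq @ iq @ cq)
            - 2 * np.trace(ip @ cx @ iq @ cx.T))

rng = np.random.default_rng(1)
n, k, l, lam = 300, 5, 7, 0.5
phi, psi = rng.normal(size=(n, k)), rng.normal(size=(n, l))
U, _ = np.linalg.qr(rng.normal(size=(k, k)))   # random orthogonal k x k matrix
V, _ = np.linalg.qr(rng.normal(size=(l, l)))   # random orthogonal l x l matrix

# Theorem 1: d_lambda(U phi, V psi) = d_lambda(phi, psi).
assert np.isclose(gulp_sq(phi @ U.T, psi @ V.T, lam), gulp_sq(phi, psi, lam))

# "If" direction of Theorem 2: representations related by an orthogonal map
# are at GULP distance zero.
assert np.isclose(gulp_sq(phi, phi @ U.T, lam), 0.0, atol=1e-9)
```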
Next, we show that GULP satisfies the axioms of a metric.

Theorem 2. Fix λ > 0. The GULP distance d_λ(φ, ψ) satisfies the axioms of a pseudometric, namely, for all representation maps φ, ψ, ϕ, it holds that

$$d_\lambda(\varphi, \varphi) = 0, \qquad d_\lambda(\varphi, \psi) = d_\lambda(\psi, \varphi), \qquad \text{and} \qquad d_\lambda(\varphi, \psi) \le d_\lambda(\varphi, \phi) + d_\lambda(\phi, \psi).$$

Moreover, d_λ(φ, ψ) = 0 if and only if k = ℓ and there exists an orthogonal transformation U such that φ(X) = Uψ(X) a.s.

Proof. Lemma 1 provides an isometric embedding of representations f ↦ f(X)^⊤ Σ_f^λ f(X') into the Hilbert space L²(P_X ⊗ P_X). It readily yields that d_λ is a pseudometric. It remains to identify for which φ, ψ it holds that d_λ(φ, ψ) = 0.

The "easy" direction follows from the invariance property of Theorem 1: if φ and ψ satisfy φ(X) = Uψ(X) almost surely, then d_λ(φ, ψ) = d_λ(Uψ, ψ) = 0. We sketch the proof of the other direction and defer the full proof to Appendix A.2. Define φ̃ = (Σ_φ + λI)^{-1/2} φ and ψ̃ = (Σ_ψ + λI)^{-1/2} ψ. By Lemma 1, the condition d_λ(φ, ψ) = 0 is equivalent to φ̃(X)^⊤ φ̃(X') = ψ̃(X)^⊤ ψ̃(X') almost surely over X, X'. So if d_λ(φ, ψ) = 0, then we can leverage the classical fact that the Gram matrix of a set of vectors determines those vectors up to an isometry [HJ12] to prove that there is an orthogonal transformation U ∈ ℝ^{k×k} such that φ̃(X) = Uψ̃(X) almost surely over X. Finally, by analyzing a homogeneous Sylvester equation, this implies that φ(X) = Uψ(X) almost surely.
Note that when λ = 0, the conclusion of this theorem fails to hold: d_0 still satisfies the axioms of a pseudo-distance, but the cases for which d_0(φ, ψ) = 0 are different. This point is illustrated in the next section, where we establish that d_0 is the CCA distance commonly employed in the literature.
2.2 Comparison with CCA, ridge-CCA, CKA, and PROCRUSTES

Throughout this section, we assume that k = ℓ for simplicity.
Ridge-CCA. Our distance is most closely related to ridge-CCA, introduced by [Vin76] as a regularized version of canonical correlation analysis (CCA) for settings where the covariance matrices Σ_φ or Σ_ψ are close to singular. More specifically, for any λ ≥ 0, define the matrix C_λ := Σ_φ^λ Σ_{φψ} Σ_ψ^λ Σ_{ψφ}; the ridge-CCA similarity measure is defined as ρ_{λ-CCA} = tr(C_λ). Hence, we readily see from Proposition 1 that GULP and ridge-CCA describe the same geometry over representations. To see this, recall that Lemma 1 provides an isometric embedding f ↦ f(X)^⊤ Σ_f^λ f(X') of representation maps into L²(P_X ⊗ P_X). While GULP is the distance on this Hilbert space, ridge-CCA is the inner product.

Ridge-CCA was briefly considered in the seminal work [KNLH19] but discarded because of (i) its lack of interpretability and (ii) the absence of a rule to select λ. We argue that, in fact, our prediction-driven derivation of GULP gives a clear and compelling interpretation of this geometry (and suggests several extensions; see Section 5). Moreover, we show that the tunability of λ is in fact a desirable feature that allows the space of representations to be viewed at various resolutions, yielding different levels of information; for example, in Figure 6, higher λ leads to a coarser clustering structure.
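In other words, the cross term of Eq. (3) is exactly the ridge-CCA similarity tr(C_λ), so the squared GULP distance decomposes as ‖φ‖² + ‖ψ‖² − 2ρ_{λ-CCA} in the Hilbert-space embedding of Lemma 1. A short plug-in sketch of this decomposition (our own illustration; variable names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(5)
n, k, l, lam = 500, 6, 9, 0.2
phi, psi = rng.normal(size=(n, k)), rng.normal(size=(n, l))

cp, cq, cx = phi.T @ phi / n, psi.T @ psi / n, phi.T @ psi / n
ip = np.linalg.inv(cp + lam * np.eye(k))
iq = np.linalg.inv(cq + lam * np.eye(l))

rho_ridge_cca = np.trace(ip @ cx @ iq @ cx.T)   # tr(C_lambda): the inner product
sq_norm_phi = np.trace(ip @ cp @ ip @ cp)       # <phi, phi> in the embedding of Lemma 1
sq_norm_psi = np.trace(iq @ cq @ iq @ cq)       # <psi, psi>

# Eq. (3) rewritten: squared GULP = ||phi||^2 + ||psi||^2 - 2 * ridge-CCA.
d2_gulp = sq_norm_phi + sq_norm_psi - 2 * rho_ridge_cca
print(d2_gulp)
```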
CCA. Due to the connection with ridge-CCA, our GULP distance is related to (unregularized) CCA when λ = 0. Specifically, defining C := Σ_φ^{-1} Σ_{φψ} Σ_ψ^{-1} Σ_{ψφ}, the mean-squared-CCA similarity measure is given by (see [Eat07, Def. 10.2]):

$$\rho_{\mathrm{CCA}}(\varphi, \psi) := \frac{\operatorname{tr}(C)}{k} = 1 - \frac{1}{2k}\,\mathbb{E}\big[(\varphi(X)^\top \Sigma_\varphi^{-1}\varphi(X') - \psi(X)^\top \Sigma_\psi^{-1}\psi(X'))^2\big],$$

where X' is an independent copy of X; the last identity can be checked directly. From Lemma 1, it can be seen that our GULP distance d_0(φ, ψ) with λ = 0 is a linear transformation of ρ_CCA.
It can be checked that ρ_CCA takes values in [0, 1], which has led researchers to simply propose 1 − ρ_CCA as a dissimilarity measure. Interestingly, this choice turns out to produce a valid (squared) metric, i.e., a dissimilarity measure that satisfies the triangle inequality. Indeed, we get that

$$d_{\mathrm{CCA}}^2(\varphi, \psi) = 1 - \rho_{\mathrm{CCA}}(\varphi, \psi) = \frac{1}{2k}\,\mathbb{E}\big[(K(\tilde\varphi(X), \tilde\varphi(X')) - K(\tilde\psi(X), \tilde\psi(X')))^2\big],$$

where K(u, v) = u^⊤v is the linear kernel, and φ̃ := Σ_φ^{-1/2} φ and ψ̃ := Σ_ψ^{-1/2} ψ are the whitened versions of φ and ψ, respectively. These identities have two consequences: (i) we see from Lemma 1 that d_CCA corresponds to the GULP distance with λ = 0 up to a scaling factor, and (ii) d_CCA is a valid pseudometric on the space of representations, since we just exhibited an isometry T : f̃ ↦ K(f̃(X), f̃(X')) into L²(P_X ⊗ P_X). We show in Appendix A.2 that d_CCA(φ, ψ) = 0 iff ψ(X) = Aφ(X) a.s. for some matrix A. Note that the invariance of ρ_CCA to linear transformations was previously known and was criticized in [KNLH19] as arguably too strong.
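The scaling factor in (i) is made explicit below (d_0² = 2k · d_CCA², as also stated at the end of this section) and is easy to verify on finite samples with plug-in covariances. The following sketch is our own illustration; it assumes n is much larger than k so that the empirical covariances are invertible.

```python
import numpy as np

rng = np.random.default_rng(6)
n, k = 500, 6                            # this subsection assumes k = l
phi, psi = rng.normal(size=(n, k)), rng.normal(size=(n, k))

cp, cq, cx = phi.T @ phi / n, psi.T @ psi / n, phi.T @ psi / n
cp_inv, cq_inv = np.linalg.inv(cp), np.linalg.inv(cq)

C = cp_inv @ cx @ cq_inv @ cx.T
rho_cca = np.trace(C) / k                # mean-squared-CCA similarity
d2_cca = 1.0 - rho_cca                   # squared CCA distance

# GULP at lambda = 0 is Eq. (3) with plain inverses; the first two traces equal k.
d2_gulp0 = (np.trace(cp_inv @ cp @ cp_inv @ cp) + np.trace(cq_inv @ cq @ cq_inv @ cq)
            - 2 * np.trace(cp_inv @ cx @ cq_inv @ cx.T))
assert np.isclose(d2_gulp0, 2 * k * d2_cca)
```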
CKA. In fact, thanks to the additional structure of the Hilbert space L²(P_X ⊗ P_X), the d_CCA distance comes with an inner product

$$\langle T(\tilde\varphi), T(\tilde\psi)\rangle_{\mathrm{CCA}} := \frac{1}{2k}\,\mathbb{E}\big[K(\tilde\varphi(X), \tilde\varphi(X'))\,K(\tilde\psi(X), \tilde\psi(X'))\big].$$

This observation allows us to connect CCA with CKA, another measure of similarity between distributions that is borrowed from the classical literature on kernel methods [CSTEK01, CMR12] and was recently popularized by [KNLH19]. Under our normalization assumptions, CKA is a measure of similarity given by

$$\rho_{\mathrm{CKA}}(\varphi, \psi) = \frac{\mathbb{E}\big[K(\varphi(X), \varphi(X'))\,K(\psi(X), \psi(X'))\big]}{\sqrt{\mathbb{E}\big[K(\varphi(X), \varphi(X'))^2\big]\,\mathbb{E}\big[K(\psi(X), \psi(X'))^2\big]}} = \frac{\langle T(\varphi), T(\psi)\rangle_{\mathrm{CCA}}}{\|T(\varphi)\|_{\mathrm{CCA}}\,\|T(\psi)\|_{\mathrm{CCA}}} = \cos\big(\angle(T(\varphi), T(\psi))\big),$$

where ‖T‖²_CCA = ⟨T, T⟩_CCA and ∠ denotes the angle in the geometry induced by ⟨·,·⟩_CCA. In turn, d²_CKA is defined as d²_CKA = 1 − ρ_CKA, which does not yield a pseudometric. This observation highlights two major differences between CCA and CKA: the former measures inner products and works with whitened representations, while the latter measures angles and works with raw representations. As illustrated in the experimental Section 4, as well as in [DDS21], this additional whitening step appears to be detrimental to the overall quality of this distance measure.

The fact that GULP with λ = 0 recovers d_CCA (i.e., d²_0 = 2k·d²_CCA) is illustrated in Figure 2. As shown, although GULP has a roughly monotone relationship with CKA, the two remain quite different.
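For completeness, here is a plug-in computation of linear CKA under the paper's centering convention, with the population expectations over (X, X') replaced by averages over all sample pairs. This is our own sketch; the function name linear_cka is hypothetical.

```python
import numpy as np

def linear_cka(phi: np.ndarray, psi: np.ndarray) -> float:
    """Plug-in linear CKA: rho_CKA = E[K_phi K_psi] / sqrt(E[K_phi^2] E[K_psi^2]),
    with expectations over (X, X') replaced by averages over all sample pairs.
    Features are centered, matching the paper's normalization convention."""
    phi = phi - phi.mean(axis=0)
    psi = psi - psi.mean(axis=0)
    K_phi = phi @ phi.T                  # K_phi[i, j] = <phi(x_i), phi(x_j)>
    K_psi = psi @ psi.T
    num = (K_phi * K_psi).mean()
    return float(num / np.sqrt((K_phi ** 2).mean() * (K_psi ** 2).mean()))

# d_CKA^2 = 1 - rho_CKA is then used as a dissimilarity; unlike GULP and d_CCA,
# it does not satisfy the triangle inequality.
```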