GULP: a prediction-based metric between representations

Enric Boix-Adserà (MIT, eboix@mit.edu)
Hannah Lawrence (MIT, hanlaw@mit.edu)
George Stepaniants (MIT, gstepan@mit.edu)
Philippe Rigollet (MIT, rigollet@math.mit.edu)
Abstract
Comparing the representations learned by different neural networks has recently
emerged as a key tool to understand various architectures and ultimately optimize
them. In this work, we introduce GULP, a family of distance measures between
representations that is explicitly motivated by downstream predictive tasks. By
construction, GULP provides uniform control over the difference in prediction
performance between two representations, with respect to regularized linear prediction tasks.
Moreover, it satisfies several desirable structural properties, such
as the triangle inequality and invariance under orthogonal transformations, and
thus lends itself to data embedding and visualization. We extensively evaluate
GULP relative to other methods, and demonstrate that it correctly differentiates
between architecture families, converges over the course of training, and captures
generalization performance on downstream linear tasks.
1 Introduction
The spectacular success of deep neural networks (DNNs) witnessed over the past decade has been largely attributed to their ability to generate good representations of the data [BCV13]. But what makes a representation good? Answering this question is a necessary step towards a principled theory of DNN design. This fundamental question calls for a metric over representations as a basic primitive. Indeed, embedding representations into a metric space enables comparison, modification, and ultimately optimization of DNN architectures [LTQ+18]; see Figure 1.

In light of the practical impact of a meaningful metric over representations, this question has recently garnered significant attention, leading to a myriad of proposals such as CCA, CKA, and PROCRUSTES. Their relative pros and cons are currently the subject of a lively debate [DDS21, DHN+22] whose resolution calls for a theoretically grounded notion of metric.
Our contributions. In this work, we define a new family of metrics¹, called GULP², over the space of representations. Our construction rests on a functional notion of what makes two representations similar: namely, two representations are similar if and only if they are equally useful as inputs to downstream linear transfer learning tasks. This idea is partially inspired by feature-based transfer learning, in which simple models adapt pretrained representations, such as Inceptionv3 [SVI+16], CLIP [RKH+21], and ELMo [PNI+18], for specific tasks [RASC14]; indeed, this is a key use of pretrained representations. Moreover, our application of linear transfer learning is reminiscent of linear probes, which were introduced by [AB17] as a tool to compare internal layers of a DNN in terms of prediction accuracy. Linear probes play a central role in the literature on hidden representations: they have been used not only to study the information captured by hidden representations [RBH20], but also to define desiderata for distances between representations [DDS21]. However, previous applications of linear probing required hand-selecting the task on which prediction accuracy is measured, whereas our GULP distance provides a uniform bound over all norm-bounded tasks.

¹ More specifically, we define pseudo-metrics rather than metrics. However, these can readily be turned into a metric using metric identification, which amounts to working with equivalence classes of representations.
² GULP stands for Uniform Linear Probing.

36th Conference on Neural Information Processing Systems (NeurIPS 2022).

Figure 1: t-SNE embedding of various pretrained DNN representations of the ImageNet [KSH12] dataset under the GULP distance (λ = 10²), colored by architecture type (gray denotes architectures that do not belong to a family). The embedding shows good clustering of the various architectures (ResNets, EfficientNets, etc.), indicating that GULP captures intrinsic aspects of representations shared within an architecture family.
We establish various theoretical properties of the GULP pseudo-metric, including the triangle inequality (Thm 2), sample complexity (Thm 3), and vanishing cases. In particular, we show that, akin to the PROCRUSTES pseudo-metric, GULP is invariant under orthogonal transformations (Thm 1) and vanishes precisely when the two representations are related by an orthogonal transformation (Thm 2). In turn, we use GULP to produce low-dimensional embeddings of various DNNs that provide new insights into the relationships between architectures (Figures 1, 5, and 6). Moreover, in Figure 7, we showcase a numerical experiment demonstrating that the GULP distance between two independent networks decreases during training on the same dataset.
Related work. This contribution is part of a growing body of work that aims at providing tools to understand and quantify the metric space of representations [RGYSD17, MRB18, KNLH19, AB17, ALM17, CLR+18, LC00, LV15, LYC+15, LLL+19, Mig19, STHH17, WHG+18, DDS21, DHN+22, CKMK22]. Several of these measures, such as SVCCA [RGYSD17] and PWCCA [MRB18], are based on classical canonical correlation analysis (CCA) from multivariate analysis [And84]. More recently, centered kernel alignment (CKA) [CSTEK01, CMR12, KNLH19, DHN+22] has emerged as a popular measure; see Section 2 for more details on these methods. The orthogonal Procrustes metric (PROCRUSTES) is a classical tool of shape analysis [DM16] used to compute the distance between labelled point clouds. Though not as conspicuous as CKA-based methods in the context of DNN representations, it was recently presented in a favorable light in [DDS21].
Various desirable properties of a similarity measure between representations have been put forward. These include structural properties such as invariance or equivariance [LC00, KNLH19], as well as sanity checks such as specificity against random initialization [DDS21]. Such desiderata can serve as diagnostics for existing similarity measures, but they fall short of providing concrete design guidelines.
Outline. The rest of the paper proceeds as follows. Section 2 lays out the derivation of GULP, as well as important theoretical properties: conditions under which it is zero, and limiting cases in terms of the regularization parameter λ, demonstrating that it interpolates between CCA and a version of CKA. Section 3 establishes concentration results for the finite-sample version, justifying its use in practice. In Section 4 we validate GULP through extensive experiments³. Finally, we conclude in Section 5.
2 The GULP distance
As stated in the introduction, the goal of this paper is to develop a pseudo-metric over the space of
representations of a given dataset. Unlike previous approaches, which work with finite datasets, we
take a statistical perspective and formulate the population version of our problem. We defer statistical
questions arising from finite sample size to Section 3.
Let X ∈ ℝ^d be a random input with distribution P_X and let f : ℝ^d → ℝ^k denote a representation map, such as a trained DNN. The random vector f(X) ∈ ℝ^k is the representation of X by f. We assume throughout that a representation map is centered and normalized, so that E[f(X)] = 0 and E‖f(X)‖² = 1. In particular, this normalization allows us to identify (unnormalized) representation maps φ, ψ that are related by ψ(x) = aφ(x) + b, P_X-a.s. for some a ∈ ℝ and b ∈ ℝ^k, down to a single representation of X (after normalizing), which is a well-known requirement for distances between representations [KNLH19, Sec. 2.3].
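For concreteness, the following is a minimal numpy sketch of this centering and normalization step applied to a finite sample of representations. It is our own illustration rather than the authors' released code, and the helper name center_and_normalize is hypothetical.

```python
import numpy as np

def center_and_normalize(F: np.ndarray) -> np.ndarray:
    """Center and rescale a sample of representations so that the empirical
    analogues of E[f(X)] = 0 and E||f(X)||^2 = 1 hold.
    F has shape (n, k): one row per input x_i, containing f(x_i)."""
    F = F - F.mean(axis=0, keepdims=True)          # empirical mean -> 0
    scale = np.sqrt((F ** 2).sum(axis=1).mean())   # sqrt of empirical E||f(X)||^2
    return F / scale
```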
We are now in a position to define the GULP distance between representations; the terminology "distance" is justified in Theorem 2. To that end, let φ : ℝ^d → ℝ^k and ψ : ℝ^d → ℝ^ℓ be two representation maps, where ℓ may differ from k. Let (X, Y) ∈ ℝ^d × ℝ be a random pair and let η(x) = E[Y | X = x] denote the regression function of Y onto X. Moreover, for any λ > 0, let β_λ denote the population ridge regression solution given by

$$\beta_\lambda = \arg\min_{\beta} \; \mathbb{E}\big[(\beta^\top \varphi(X) - Y)^2\big] + \lambda \|\beta\|^2,$$

and similarly for γ_λ with respect to ψ(·). Since we use squared error, these quantities depend on the distribution of Y only through the regression function η.
Definition 1. Fix λ > 0. The GULP distance between representations φ(X) and ψ(X) is given by

$$d_\lambda(\varphi, \psi) := \sup_{\eta}\; \mathbb{E}\big[(\beta_\lambda^\top \varphi(X) - \gamma_\lambda^\top \psi(X))^2\big]^{1/2},$$

where the supremum is taken over all regression functions η such that ‖η‖_{L²(P_X)} ≤ 1.
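To unpack Definition 1, note that the ridge objective above is minimized in closed form by β_λ = (Σ_φ + λI)^{-1} E[φ(X)Y]. The sketch below, our own illustration with hypothetical helper names (not the paper's implementation), computes the empirical β_λ and γ_λ for a single task y and measures the gap between the two ridge predictions; GULP instead takes the supremum of this gap over all norm-bounded regression functions η.

```python
import numpy as np

def ridge_coefficients(F: np.ndarray, y: np.ndarray, lam: float) -> np.ndarray:
    """Empirical ridge solution beta_lambda = (Sigma_F + lam*I)^{-1} E[F(X) Y],
    with population moments replaced by sample averages.
    F: (n, k) representations, y: (n,) responses, lam: ridge parameter."""
    n, k = F.shape
    cov = F.T @ F / n                      # empirical covariance Sigma_F
    cross = F.T @ y / n                    # empirical cross-moment E[F(X) Y]
    return np.linalg.solve(cov + lam * np.eye(k), cross)

def prediction_gap(phi: np.ndarray, psi: np.ndarray, y: np.ndarray, lam: float) -> float:
    """Root-mean-square gap between the two ridge predictions for ONE task y.
    GULP (Definition 1) is the supremum of this quantity over all tasks with
    ||eta||_{L2(P_X)} <= 1, rather than its value at a single task."""
    beta = ridge_coefficients(phi, y, lam)
    gamma = ridge_coefficients(psi, y, lam)
    return float(np.sqrt(np.mean((phi @ beta - psi @ gamma) ** 2)))
```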
The GULP distance measures the discrepancy between the prediction of an optimal ridge regression estimator based on φ and its counterpart based on ψ, uniformly over all regression tasks. While this notion of distance is intuitive and motivated by a clear regression task, it is a priori unclear how to compute it. The next proposition provides an equivalent formulation of GULP that is amenable to accurate and efficient estimation; see Section 3. It is based on the following covariance matrices:

$$\Sigma_\varphi = \operatorname{cov}(\varphi(X)) = \mathbb{E}[\varphi(X)\varphi(X)^\top], \qquad \Sigma_\psi = \operatorname{cov}(\psi(X)) = \mathbb{E}[\psi(X)\psi(X)^\top]. \tag{1}$$

We implicitly used the centering assumption in the above definition, and the normalization condition implies that the covariance matrices have unit trace. Throughout, we assume these matrices are invertible, which is without loss of generality by projecting onto the image of the representation map. We also define the regularized inverses

$$\Sigma_\varphi^{\lambda} := (\Sigma_\varphi + \lambda I_k)^{-1}, \qquad \Sigma_\psi^{\lambda} := (\Sigma_\psi + \lambda I_\ell)^{-1},$$

as well as the cross-covariance matrices Σ_{φψ} and Σ_{ψφ}:

$$\Sigma_{\varphi\psi} = \mathbb{E}[\varphi(X)\psi(X)^\top] = \Sigma_{\psi\varphi}^\top. \tag{2}$$
Proposition 1. Fix λ ≥ 0. The GULP distance between representations φ(X) and ψ(X) satisfies

$$d_\lambda^2(\varphi, \psi) = \operatorname{tr}(\Sigma_\varphi^{\lambda}\Sigma_\varphi\Sigma_\varphi^{\lambda}\Sigma_\varphi) + \operatorname{tr}(\Sigma_\psi^{\lambda}\Sigma_\psi\Sigma_\psi^{\lambda}\Sigma_\psi) - 2\operatorname{tr}(\Sigma_\varphi^{\lambda}\Sigma_{\varphi\psi}\Sigma_\psi^{\lambda}\Sigma_{\varphi\psi}^\top). \tag{3}$$

Proof. See Appendix A.1.
³ Our code is available at https://github.com/sgstepaniants/GULP.
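The plug-in version of Eq. (3) is straightforward to compute from two sample matrices. The following numpy sketch is our own illustration of that plug-in formula, not the released implementation at the repository above, which may differ; the function name gulp_distance is hypothetical.

```python
import numpy as np

def gulp_distance(phi: np.ndarray, psi: np.ndarray, lam: float) -> float:
    """Plug-in estimate of the GULP distance d_lambda(phi, psi) via Eq. (3).
    phi: (n, k) and psi: (n, l) hold the two representations of the same n inputs."""
    n = phi.shape[0]
    # Enforce the paper's normalization: E[f(X)] = 0 and E||f(X)||^2 = 1.
    phi = phi - phi.mean(axis=0)
    psi = psi - psi.mean(axis=0)
    phi = phi / np.sqrt((phi ** 2).sum(axis=1).mean())
    psi = psi / np.sqrt((psi ** 2).sum(axis=1).mean())

    # Empirical (cross-)covariances, Eqs. (1)-(2).
    cov_phi, cov_psi = phi.T @ phi / n, psi.T @ psi / n
    cov_cross = phi.T @ psi / n

    # Regularized inverses Sigma^lambda = (Sigma + lam*I)^{-1}.
    inv_phi = np.linalg.inv(cov_phi + lam * np.eye(cov_phi.shape[0]))
    inv_psi = np.linalg.inv(cov_psi + lam * np.eye(cov_psi.shape[0]))

    # Squared distance, Eq. (3).
    d2 = (np.trace(inv_phi @ cov_phi @ inv_phi @ cov_phi)
          + np.trace(inv_psi @ cov_psi @ inv_psi @ cov_psi)
          - 2.0 * np.trace(inv_phi @ cov_cross @ inv_psi @ cov_cross.T))
    return float(np.sqrt(max(d2, 0.0)))
```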
2.1 Structural properties
In this section, we show that GULP is invariant under orthogonal transformations and that it is a valid
metric on the space of representations. We begin by establishing a third characterization of GULP that
is useful for the purposes of this section; the proof can be found in Appendix A.1.
Lemma 1. Fix λ ≥ 0. The GULP distance d_λ(φ, ψ) between the representations φ(X) and ψ(X) satisfies

$$d_\lambda^2(\varphi, \psi) = \mathbb{E}\big[(\varphi(X)^\top \Sigma_\varphi^{\lambda}\varphi(X') - \psi(X)^\top \Sigma_\psi^{\lambda}\psi(X'))^2\big],$$

where X' is an independent copy of X.
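With empirical covariances, the pairwise formulation of Lemma 1 coincides exactly with the trace formula of Eq. (3) once the expectation over (X, X') is replaced by an average over all ordered sample pairs. The following sketch checks this numerically on synthetic Gaussian data; it is our own illustration and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, l, lam = 500, 8, 6, 0.1
phi = rng.normal(size=(n, k))              # synthetic stand-ins for phi(x_i)
psi = rng.normal(size=(n, l))              # synthetic stand-ins for psi(x_i)

cov_phi, cov_psi = phi.T @ phi / n, psi.T @ psi / n
cov_cross = phi.T @ psi / n
inv_phi = np.linalg.inv(cov_phi + lam * np.eye(k))
inv_psi = np.linalg.inv(cov_psi + lam * np.eye(l))

# Trace formula, Eq. (3).
d2_trace = (np.trace(inv_phi @ cov_phi @ inv_phi @ cov_phi)
            + np.trace(inv_psi @ cov_psi @ inv_psi @ cov_psi)
            - 2 * np.trace(inv_phi @ cov_cross @ inv_psi @ cov_cross.T))

# Lemma 1: average over all ordered sample pairs (x_i, x_j) playing the role of (X, X').
A = phi @ inv_phi @ phi.T                  # A[i, j] = phi(x_i)^T Sigma^lam_phi phi(x_j)
B = psi @ inv_psi @ psi.T
d2_pairs = np.mean((A - B) ** 2)

assert np.isclose(d2_trace, d2_pairs)      # identical up to floating-point error
```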
We are now in a position to state our main structural results. We begin with a key invariance result.

Theorem 1. Fix λ ≥ 0. The GULP distance d_λ(φ, ψ) between the representations φ(X) ∈ ℝ^k and ψ(X) ∈ ℝ^ℓ is invariant under orthogonal transformations: for any orthogonal transformations U : ℝ^k → ℝ^k and V : ℝ^ℓ → ℝ^ℓ, it holds that

$$d_\lambda(U\varphi, V\psi) = d_\lambda(\varphi, \psi).$$

Proof. We slightly abuse notation by identifying any orthogonal transformation W with a matrix W such that W(x) = W·x. Note that for any representation map f, we have Σ_{Wf} = W Σ_f W^⊤ and

$$\Sigma_{Wf}^{\lambda} = (W\Sigma_f W^\top + \lambda WW^\top)^{-1} = W(\Sigma_f + \lambda I)^{-1}W^\top = W\Sigma_f^{\lambda}W^\top.$$

Hence, using Lemma 1, we get

$$d_\lambda^2(U\varphi, V\psi) = \mathbb{E}\big[(\varphi(X)^\top U^\top U\Sigma_\varphi^{\lambda}U^\top U\varphi(X') - \psi(X)^\top V^\top V\Sigma_\psi^{\lambda}V^\top V\psi(X'))^2\big] = \mathbb{E}\big[(\varphi(X)^\top \Sigma_\varphi^{\lambda}\varphi(X') - \psi(X)^\top \Sigma_\psi^{\lambda}\psi(X'))^2\big] = d_\lambda^2(\varphi, \psi),$$

where we used the fact that U^⊤U = I_k and V^⊤V = I_ℓ.
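Theorem 1, together with the "if" direction of Theorem 2 below, can be sanity-checked numerically: applying orthogonal maps to the samples leaves the plug-in distance of Eq. (3) unchanged, and two representations related by an orthogonal map sit at distance zero. The following self-contained numpy sketch is our own illustration, with a hypothetical helper gulp_sq.

```python
import numpy as np

def gulp_sq(phi: np.ndarray, psi: np.ndarray, lam: float) -> float:
    """Squared plug-in GULP distance, Eq. (3)."""
    n = phi.shape[0]
    cp, cq, cx = phi.T @ phi / n, psi.T @ psi / n, phi.T @ psi / n
    ip = np.linalg.inv(cp + lam * np.eye(cp.shape[0]))
    iq = np.linalg.inv(cq + lam * np.eye(cq.shape[0]))
    return (np.trace(ip @ cp @ ip @ cp) + np.trace(iq @ cq @ iq @ cq)
            - 2 * np.trace(ip @ cx @ iq @ cx.T))

rng = np.random.default_rng(1)
n, k, l, lam = 300, 5, 7, 0.5
phi, psi = rng.normal(size=(n, k)), rng.normal(size=(n, l))
U, _ = np.linalg.qr(rng.normal(size=(k, k)))   # random orthogonal k x k matrix
V, _ = np.linalg.qr(rng.normal(size=(l, l)))   # random orthogonal l x l matrix

# Theorem 1: d_lambda(U phi, V psi) = d_lambda(phi, psi).
assert np.isclose(gulp_sq(phi @ U.T, psi @ V.T, lam), gulp_sq(phi, psi, lam))

# "If" direction of Theorem 2: representations related by an orthogonal map
# are at GULP distance zero.
assert np.isclose(gulp_sq(phi, phi @ U.T, lam), 0.0, atol=1e-9)
```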
Next, we show that GULP satisfies the axioms of a metric.

Theorem 2. Fix λ > 0. The GULP distance d_λ(φ, ψ) satisfies the axioms of a pseudometric, namely, for all representation maps φ, ψ, ϕ, it holds that

$$d_\lambda(\varphi, \varphi) = 0, \qquad d_\lambda(\varphi, \psi) = d_\lambda(\psi, \varphi), \qquad \text{and} \qquad d_\lambda(\varphi, \psi) \le d_\lambda(\varphi, \phi) + d_\lambda(\phi, \psi).$$

Moreover, d_λ(φ, ψ) = 0 if and only if k = ℓ and there exists an orthogonal transformation U such that φ(X) = Uψ(X) a.s.

Proof. Lemma 1 provides an isometric embedding of representations f ↦ f(X)^⊤ Σ_f^λ f(X') into the Hilbert space L²(P_X ⊗ P_X). It readily yields that d_λ is a pseudometric. It remains to identify for which φ, ψ it holds that d_λ(φ, ψ) = 0.

The "easy" direction follows from the invariance property of Theorem 1: if φ and ψ satisfy φ(X) = Uψ(X) almost surely, then d_λ(φ, ψ) = d_λ(Uψ, ψ) = 0. We sketch the proof of the other direction and defer the full proof to Appendix A.2. Define φ̃ = (Σ_φ + λI)^{-1/2} φ and ψ̃ = (Σ_ψ + λI)^{-1/2} ψ. By Lemma 1, the condition d_λ(φ, ψ) = 0 is equivalent to φ̃(X)^⊤ φ̃(X') = ψ̃(X)^⊤ ψ̃(X') almost surely over X, X'. So if d_λ(φ, ψ) = 0, then we can leverage the classical fact that the Gram matrix of a set of vectors determines those vectors up to an isometry [HJ12] to prove that there is an orthogonal transformation U ∈ ℝ^{k×k} such that φ̃(X) = Uψ̃(X) almost surely over X. Finally, by analyzing a homogeneous Sylvester equation, this implies that φ(X) = Uψ(X) almost surely.
Note that when λ = 0, the conclusion of this theorem fails to hold: d_0 still satisfies the axioms of a pseudo-distance, but the cases for which d_0(φ, ψ) = 0 are different. This point is illustrated in the next section, where we establish that d_0 is the CCA distance commonly employed in the literature.
2.2 Comparison with CCA, ridge-CCA, CKA, and PROCRUSTES

Throughout this section, we assume that k = ℓ for simplicity.
Ridge-CCA. Our distance is most closely related to ridge-CCA, introduced by [Vin76] as a regularized version of canonical correlation analysis (CCA) for settings where the covariance matrices Σ_φ or Σ_ψ are close to singular. More specifically, for any λ ≥ 0, define the matrix C_λ := Σ_φ^λ Σ_{φψ} Σ_ψ^λ Σ_{ψφ}; the ridge-CCA similarity measure is defined as ρ_{λ-CCA} = tr(C_λ). Hence, we readily see from Proposition 1 that GULP and ridge-CCA describe the same geometry over representations. To see this, recall that Lemma 1 provides an isometric embedding f ↦ f(X)^⊤ Σ_f^λ f(X') of representation maps into L²(P_X ⊗ P_X). While GULP is the distance on this Hilbert space, ridge-CCA is the inner product.

Ridge-CCA was briefly considered in the seminal work [KNLH19] but discarded because of (i) its lack of interpretability and (ii) the absence of a rule to select λ. We argue that, in fact, our prediction-driven derivation of GULP gives a clear and compelling interpretation of this geometry (and suggests several extensions; see Section 5). Moreover, we show that the tunability of λ is in fact a desirable feature that allows the space of representations to be viewed at various resolutions, yielding different levels of information; for example, in Figure 6, higher λ leads to a coarser clustering structure.
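In other words, the cross term of Eq. (3) is exactly the ridge-CCA similarity tr(C_λ), so the squared GULP distance decomposes as ‖φ‖² + ‖ψ‖² − 2ρ_{λ-CCA} in the Hilbert-space embedding of Lemma 1. A short plug-in sketch of this decomposition (our own illustration; variable names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(5)
n, k, l, lam = 500, 6, 9, 0.2
phi, psi = rng.normal(size=(n, k)), rng.normal(size=(n, l))

cp, cq, cx = phi.T @ phi / n, psi.T @ psi / n, phi.T @ psi / n
ip = np.linalg.inv(cp + lam * np.eye(k))
iq = np.linalg.inv(cq + lam * np.eye(l))

rho_ridge_cca = np.trace(ip @ cx @ iq @ cx.T)   # tr(C_lambda): the inner product
sq_norm_phi = np.trace(ip @ cp @ ip @ cp)       # <phi, phi> in the embedding of Lemma 1
sq_norm_psi = np.trace(iq @ cq @ iq @ cq)       # <psi, psi>

# Eq. (3) rewritten: squared GULP = ||phi||^2 + ||psi||^2 - 2 * ridge-CCA.
d2_gulp = sq_norm_phi + sq_norm_psi - 2 * rho_ridge_cca
print(d2_gulp)
```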
CCA. Due to the connection with ridge-CCA, our GULP distance is related to (unregularized) CCA when λ = 0. Specifically, defining C := Σ_φ^{-1} Σ_{φψ} Σ_ψ^{-1} Σ_{ψφ}, the mean-squared-CCA similarity measure is given by (see [Eat07, Def. 10.2]):

$$\rho_{\mathrm{CCA}}(\varphi, \psi) := \frac{\operatorname{tr}(C)}{k} = 1 - \frac{1}{2k}\,\mathbb{E}\big[(\varphi(X)^\top \Sigma_\varphi^{-1}\varphi(X') - \psi(X)^\top \Sigma_\psi^{-1}\psi(X'))^2\big],$$

where X' is an independent copy of X; the last identity can be checked directly. From Lemma 1, it can be seen that our GULP distance d_0(φ, ψ) with λ = 0 is a linear transformation of ρ_CCA.
It can be checked that ρ_CCA takes values in [0, 1], which has led researchers to simply propose 1 − ρ_CCA as a dissimilarity measure. Interestingly, this choice turns out to produce a valid (squared) metric, i.e., a dissimilarity measure that satisfies the triangle inequality. Indeed, we get that

$$d_{\mathrm{CCA}}^2(\varphi, \psi) = 1 - \rho_{\mathrm{CCA}}(\varphi, \psi) = \frac{1}{2k}\,\mathbb{E}\big[(K(\tilde\varphi(X), \tilde\varphi(X')) - K(\tilde\psi(X), \tilde\psi(X')))^2\big],$$

where K(u, v) = u^⊤v is the linear kernel, and φ̃ := Σ_φ^{-1/2} φ and ψ̃ := Σ_ψ^{-1/2} ψ are the whitened versions of φ and ψ, respectively. These identities have two consequences: (i) we see from Lemma 1 that d_CCA corresponds to the GULP distance with λ = 0 up to a scaling factor, and (ii) d_CCA is a valid pseudometric on the space of representations, since we just exhibited an isometry T : f̃ ↦ K(f̃(X), f̃(X')) into L²(P_X ⊗ P_X). We show in Appendix A.2 that d_CCA(φ, ψ) = 0 iff ψ(X) = Aφ(X) a.s. for some matrix A. Note that the invariance of ρ_CCA to linear transformations was previously known and was criticized in [KNLH19] as arguably too strong.
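The scaling factor in (i) is made explicit below (d_0² = 2k · d_CCA², as also stated at the end of this section) and is easy to verify on finite samples with plug-in covariances. The following sketch is our own illustration; it assumes n is much larger than k so that the empirical covariances are invertible.

```python
import numpy as np

rng = np.random.default_rng(6)
n, k = 500, 6                            # this subsection assumes k = l
phi, psi = rng.normal(size=(n, k)), rng.normal(size=(n, k))

cp, cq, cx = phi.T @ phi / n, psi.T @ psi / n, phi.T @ psi / n
cp_inv, cq_inv = np.linalg.inv(cp), np.linalg.inv(cq)

C = cp_inv @ cx @ cq_inv @ cx.T
rho_cca = np.trace(C) / k                # mean-squared-CCA similarity
d2_cca = 1.0 - rho_cca                   # squared CCA distance

# GULP at lambda = 0 is Eq. (3) with plain inverses; the first two traces equal k.
d2_gulp0 = (np.trace(cp_inv @ cp @ cp_inv @ cp) + np.trace(cq_inv @ cq @ cq_inv @ cq)
            - 2 * np.trace(cp_inv @ cx @ cq_inv @ cx.T))
assert np.isclose(d2_gulp0, 2 * k * d2_cca)
```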
CKA. In fact, thanks to the additional structure of the Hilbert space L²(P_X ⊗ P_X), the d_CCA distance comes with an inner product

$$\langle T(\tilde\varphi), T(\tilde\psi)\rangle_{\mathrm{CCA}} := \frac{1}{2k}\,\mathbb{E}\big[K(\tilde\varphi(X), \tilde\varphi(X'))\,K(\tilde\psi(X), \tilde\psi(X'))\big].$$

This observation allows us to connect CCA with CKA, another measure of similarity between distributions that is borrowed from the classical literature on kernel methods [CSTEK01, CMR12] and was recently popularized by [KNLH19]. Under our normalization assumptions, CKA is a measure of similarity given by

$$\rho_{\mathrm{CKA}}(\varphi, \psi) = \frac{\mathbb{E}\big[K(\varphi(X), \varphi(X'))\,K(\psi(X), \psi(X'))\big]}{\sqrt{\mathbb{E}\big[K(\varphi(X), \varphi(X'))^2\big]\,\mathbb{E}\big[K(\psi(X), \psi(X'))^2\big]}} = \frac{\langle T(\varphi), T(\psi)\rangle_{\mathrm{CCA}}}{\|T(\varphi)\|_{\mathrm{CCA}}\,\|T(\psi)\|_{\mathrm{CCA}}} = \cos\big(\angle(T(\varphi), T(\psi))\big),$$

where ‖T‖²_CCA = ⟨T, T⟩_CCA and ∠ denotes the angle in the geometry induced by ⟨·,·⟩_CCA. In turn, d²_CKA is defined as d²_CKA = 1 − ρ_CKA, which does not yield a pseudometric. This observation highlights two major differences between CCA and CKA: the former measures inner products and works with whitened representations, while the latter measures angles and works with raw representations. As illustrated in the experimental Section 4, as well as in [DDS21], this additional whitening step appears to be detrimental to the overall quality of this distance measure.

The fact that GULP with λ = 0 recovers d_CCA (i.e., d²_0 = 2k·d²_CCA) is illustrated in Figure 2. As shown, although GULP has a roughly monotone relationship with CKA, the two remain quite different.
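For completeness, here is a plug-in computation of linear CKA under the paper's centering convention, with the population expectations over (X, X') replaced by averages over all sample pairs. This is our own sketch; the function name linear_cka is hypothetical.

```python
import numpy as np

def linear_cka(phi: np.ndarray, psi: np.ndarray) -> float:
    """Plug-in linear CKA: rho_CKA = E[K_phi K_psi] / sqrt(E[K_phi^2] E[K_psi^2]),
    with expectations over (X, X') replaced by averages over all sample pairs.
    Features are centered, matching the paper's normalization convention."""
    phi = phi - phi.mean(axis=0)
    psi = psi - psi.mean(axis=0)
    K_phi = phi @ phi.T                  # K_phi[i, j] = <phi(x_i), phi(x_j)>
    K_psi = psi @ psi.T
    num = (K_phi * K_psi).mean()
    return float(num / np.sqrt((K_phi ** 2).mean() * (K_psi ** 2).mean()))

# d_CKA^2 = 1 - rho_CKA is then used as a dissimilarity; unlike GULP and d_CCA,
# it does not satisfy the triangle inequality.
```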