
Learning ordered representations. As proven by Deng et al. (2022), the above objective function results in each component of $\psi$ converging to a unique eigenfunction, ordered by the corresponding eigenvalue; e.g., the first output dimension of $\psi$ aligns with the eigenfunction of the largest eigenvalue. This bears similarity to PCA, where the principal components capture most of the information in the kernel and are orthogonal to each other.
Linear probe evaluation. We can view the above optimization problem as a kind of SSL algorithm, as it learns representations from multiple views (augmentations) of the data. For SSL methods, the gold standard for quantifying the quality of the learned representations is their linear probe performance, where a linear head is employed to classify the representations into semantic categories. Yet, as also suggested by HaoChen et al. (2021), the linear probe does not take advantage of ordered representations: even if the representation is replaced by the output of an arbitrary span of eigenfunctions, the linear classifier weights can simply be adjusted to produce the same classifier. This implies that replacing $\hat{\psi}_{X_B}$ with $\psi_{X_B}$ in Equation (8) (which changes the optimal solution to an arbitrary span of eigenfunctions) does not affect the optimal classifier, and may actually ease optimization because it relaxes the ordering constraints. We therefore adapt the loss specifically for linear probe tasks as follows:
$$
\ell_{\mathrm{lp}}(X_B, X_B^+) = -\sum_{j=1}^{k} \big[\psi_{X_B}\psi_{X_B^+}^{\top}\big]_{j,j} + \alpha \sum_{j=1}^{k} \sum_{i=1}^{j-1} \big[\psi_{X_B}\psi_{X_B^+}^{\top}\big]_{i,j}^{2}. \tag{9}
$$
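To make the objective concrete, below is a minimal PyTorch-style sketch of Equation (9) (an illustration only, not the paper's released implementation). It assumes `psi_x` and `psi_x_plus` hold $\psi_{X_B}$ and $\psi_{X_B^+}$ as $k \times |B|$ matrices, and it omits any further implementation details of the full training pipeline:

```python
import torch

def lp_loss(psi_x, psi_x_plus, alpha=1.0):
    """Direct transcription of Equation (9) (sketch; shapes are assumptions).

    psi_x, psi_x_plus: (k, B) tensors holding psi evaluated on the two
    augmented views of a mini-batch.
    """
    M = psi_x @ psi_x_plus.T                      # (k, k) cross-view matrix
    diag_term = -torch.diagonal(M).sum()          # maximize diagonal entries
    # penalize off-diagonal entries with i < j (strict upper triangle)
    penalty = (torch.triu(M, diagonal=1) ** 2).sum()
    return diag_term + alpha * penalty
```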
Connection to Barlow Twins (Zbontar et al., 2021). Interestingly, the SSL objective defined in Barlow Twins can be written using $\psi_{X_B}$ and $\psi_{X_B^+}$:
$$
\ell_{\mathrm{BT}}(X_B, X_B^+) = \sum_{j=1}^{k} \Big[1 - \big[\psi_{X_B}\psi_{X_B^+}^{\top}\big]_{j,j}\Big]^{2} + \lambda \sum_{j=1}^{k} \sum_{i \neq j} \big[\psi_{X_B}\psi_{X_B^+}^{\top}\big]_{i,j}^{2}, \tag{10}
$$
where $\lambda$ denotes a trade-off coefficient. This objective is closely analogous to ours in Equation (9). For the first term, our objective directly maximizes the diagonal elements, whereas Barlow Twins pushes these elements to 1. Although the two have a similar effect, their gradients and optimal solutions can differ. For the second term, we penalize only the off-diagonal elements with $i < j$, while Barlow Twins penalizes all off-diagonal ones. In this sense, we argue that the Barlow Twins objective is an approximation of our objective function for the linear probe.
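For a side-by-side comparison, here is an equally minimal transcription of Equation (10) under the same assumed tensor shapes (again an illustration, not the official Barlow Twins code); the two sketches differ only in how the diagonal is treated and in which off-diagonal entries are penalized:

```python
import torch

def bt_loss(psi_x, psi_x_plus, lam=5e-3):
    """Transcription of Equation (10); tensor shapes as in lp_loss above.

    The default value of lam is illustrative.
    """
    M = psi_x @ psi_x_plus.T                            # (k, k) cross-view matrix
    diag_term = ((1.0 - torch.diagonal(M)) ** 2).sum()  # push diagonal entries to 1
    off_diag = M - torch.diag_embed(torch.diagonal(M))  # zero out the diagonal
    penalty = (off_diag ** 2).sum()                     # all off-diagonal entries
    return diag_term + lam * penalty
```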
This section builds upon the kernels of HaoChen et al. (2021) and Johnson et al. (2022). The spectral contrastive loss (SCL) of HaoChen et al. (2021) only recovers the subspace spanned by the eigenfunctions, so the learned representation does not exhibit an ordered structure as ours does. Moreover, as will be shown in Section 6.2, our method empirically benefits more from a large $k$ than SCL does. Concurrent to our work, the extended conference version of Johnson et al. (2022) also applied NeuralEF to the kernel in Equation (6) (Johnson et al., 2023). However, they focused on the optimality of the representation obtained by kernel PCA and only tested NeuralEF as an alternative on synthetic tasks. In contrast, our work extends NeuralEF to larger-scale problems such as ImageNet-scale SSL and graph representation learning, and discusses the benefit of ordered representations for image retrieval.
4 GRAPH REPRESENTATION LEARNING WITH NEURAL EIGENFUNCTIONS
In a variety of real-world scenarios, the observations do not exist in isolation but are related to each other. Their relations are often given as a graph. Assume we have a graph dataset $(X, A)$, where $X \triangleq \{x_i\}_{i=1}^{n}$ denotes the node set and $A$ is the graph adjacency matrix. We define $D = \mathrm{diag}(A \mathbf{1}_n)$ and the normalized adjacency matrix $\bar{A} \triangleq D^{-1/2} A D^{-1/2}$. In spectral clustering (Shi & Malik, 2000), it was shown that the eigenmaps produced by the principal eigenvectors of $\bar{A}$ are relaxed cluster assignments of nodes that minimize the graph cut. This motivates us to use them as node representations in downstream tasks. However, computing these node representations requires an eigendecomposition of the $n \times n$ matrix $\bar{A}$ and hence does not scale well. Moreover, it cannot handle out-of-sample prediction, where we need the representation of a novel test example.
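For reference, the classical pipeline described above can be sketched as follows: it forms $\bar{A}$ explicitly and takes its top-$k$ eigenvectors as node embeddings, which is exactly the cubic-cost eigendecomposition step (with no out-of-sample extension) that motivates the kernelized treatment below. The function name and dense-matrix assumption are illustrative, not from the paper.

```python
import numpy as np

def spectral_node_embeddings(A, k):
    """Baseline eigenmaps from the normalized adjacency (illustrative sketch).

    A: dense symmetric (n, n) adjacency matrix; k: embedding dimension.
    Returns the eigenvectors of A_bar = D^{-1/2} A D^{-1/2} associated with
    the k largest eigenvalues, used as relaxed cluster-assignment features.
    """
    deg = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-12))   # guard isolated nodes
    A_bar = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    evals, evecs = np.linalg.eigh(A_bar)                  # ascending eigenvalues
    top_k = np.argsort(evals)[::-1][:k]                   # indices of k largest
    return evecs[:, top_k]                                # (n, k) node embeddings
```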
We propose to treat $\bar{A}$ as the Gram matrix of a kernel $\dot{\kappa}(x, x')$ on $X$ and to apply NeuralEF to learn its $k$ principal eigenfunctions. However, unlike the augmentation kernel from the last section, the normalized adjacency matrix can be indefinite³ for an arbitrary graph. Fortunately, we have the
³One might point out that the graph Laplacian is always positive semidefinite. However, in that case, the eigenmaps should be generated by the eigenfunctions with the $k$ smallest eigenvalues.