Preprint
NEURAL EIGENFUNCTIONS ARE STRUCTURED
REPRESENTATION LEARNERS
Zhijie Deng∗
Shanghai Jiao Tong University
zhijied@sjtu.edu.cn
Jiaxin Shi∗
Stanford University
jiaxins@stanford.edu
Hao Zhang
University of California, Berkeley
sjtu.haozhang@gmail.com
Peng Cui
Tsinghua University
xpeng.cui@gmail.com
Cewu Lu
Shanghai Jiao Tong University
lucewu@sjtu.edu.cn
Jun Zhu
Tsinghua University
dcszj@tsinghua.edu.cn
ABSTRACT
This paper introduces a structured, adaptive-length deep representation called Neural Eigenmap. Unlike prior spectral methods such as Laplacian Eigenmap that operate in a nonparametric manner, Neural Eigenmap leverages NeuralEF (Deng et al., 2022) to parametrically model eigenfunctions using a neural network. We show that, when the eigenfunction is derived from positive relations in a data augmentation setup, applying NeuralEF results in an objective function that resembles those of popular self-supervised learning methods, with an additional symmetry-breaking property that leads to structured representations where features are ordered by importance. We demonstrate using such representations as adaptive-length codes in image retrieval systems. By truncation according to feature importance, our method requires up to 16× shorter representation length than leading self-supervised learning ones to achieve similar retrieval performance. We further apply our method to graph data and report strong results on a node representation learning benchmark with more than one million nodes.
1 INTRODUCTION
Automatically learning representations from unlabelled data is a long-standing challenge in machine learning. Often, the motivation is to map data to a vector space where geometric distance reflects semantic closeness. This enables, for example, retrieving semantically related information by finding nearest neighbors, or discovering concepts with clustering. One can also pass such representations as inputs to supervised learning procedures, which removes the need for feature engineering.

Traditionally, spectral methods that estimate the eigenfunctions of some integral operator (often induced by a data similarity metric) were widely used to learn representations from data (Burges et al., 2010). Examples of such methods include Multidimensional Scaling (Carroll & Arabie, 1998), Laplacian Eigenmaps (Belkin & Niyogi, 2003), and Locally Linear Embedding (Roweis & Saul, 2000). However, these approaches are less commonly employed today than deep representation learning methods that leverage deep generative models or a self-supervised training scheme (Oord et al., 2018; Radford et al., 2018; Caron et al., 2020; Chen et al., 2020a).
We believe there are two primary reasons for the lesser use of spectral methods today. First, many spectral algorithms operate in a nonparametric manner, such as computing the eigendecomposition of a full similarity matrix between all data points. This makes them difficult to scale to large datasets. Second, the performance of the learned representations is highly dependent on the similarity metric used to construct the integral operator, and picking an appropriate metric for high-dimensional data can itself be a very challenging problem.
In this work, we revisit the approach of using eigenfunctions for representation learning. Unlike past efforts that estimated eigenfunctions in a nonparametric way, we take a different path by leveraging the NeuralEF method (Deng et al., 2022) to parametrically approximate eigenfunctions. Specifically, a deep neural network is trained to approximate the dominant eigenfunctions from large-scale data. The learned representation, which we term Neural Eigenmap, inherits the principled theoretical motivation of eigenfunction-based representation learning while gaining the flexibility and scalability advantages of deep learning methods.

∗Equal contribution. Corresponding author.

arXiv:2210.12637v3 [cs.LG] 8 Dec 2023
Our contributions are three-fold:
• We uncover a formal connection between NeuralEF and self-supervised learning (SSL): applying NeuralEF with a similarity metric derived from data augmentation (Johnson et al., 2022) leads to an objective function that resembles popular SSL methods while also exhibiting an additional symmetry-breaking property. This property enables learning structured representations ordered by feature importance. This ordered structure is lost in other SSL algorithms (HaoChen et al., 2021; Balestriero & LeCun, 2022; Johnson et al., 2022) and gives Neural Eigenmap a key advantage in adaptively setting the representation length for the best quality-cost tradeoff. In image retrieval tasks, it uses up to 16× shorter code length than SSL-based representations while achieving similar retrieval precision.

• We show that, even in representation learning benchmarks where the ordering of features is ignored, our method still produces strong empirical performance: it consistently outperforms Barlow Twins (Zbontar et al., 2021), which can be seen as a less-principled approximation to our objective, and is competitive with a range of strong SSL baselines on ImageNet (Deng et al., 2009) for linear probe and transfer learning tasks.

• We establish the conditions under which NeuralEF can learn eigenfunctions of indefinite kernels, enabling a novel application of it to graph representation learning problems. On a large-scale node property prediction benchmark (Hu et al., 2020), Neural Eigenmap outperforms classic Laplacian Eigenmap and GCNs (Kipf & Welling, 2016) by decent margins, and its evaluation cost at test time is substantially lower than that of GCNs.
2 NEURAL EIGENFUNCTIONS FOR REPRESENTATION LEARNING
Eigenfunctions are the central object of interest in many scientific and engineering domains, such as solving partial differential equations (PDEs) and spectral methods in machine learning. Typically, an eigenfunction $\psi$ of a linear operator $T$ satisfies
$$T\psi = \mu\psi, \qquad (1)$$
where $\mu$ is a scalar called the eigenvalue associated with $\psi$. In this work, we focus on the kernel integral operator $T_\kappa: L^2(\mathcal{X}, p) \to L^2(\mathcal{X}, p)$,¹ defined as
$$(T_\kappa f)(x) = \int \kappa(x, x')\, f(x')\, p(x')\, dx'. \qquad (2)$$
Here the kernel $\kappa$ can be viewed as an infinite-dimensional symmetric matrix, and thereby Equation (1) for $T_\kappa$ closely resembles a matrix eigenvalue problem.
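To make Equation (2) concrete, the operator can be approximated by Monte Carlo with samples drawn from $p$. The sketch below is our illustration only; the RBF kernel, the Gaussian choice of $p$, and all names are assumptions, not choices made in the paper.

```python
# A minimal Monte Carlo sketch of the kernel integral operator in Eq. (2):
# (T_k f)(x) ~= (1/M) * sum_m kappa(x, x_m) f(x_m), with x_m ~ p(x).
import numpy as np

def rbf_kernel(x, xp, bandwidth=1.0):
    """Illustrative kernel choice (not from the paper)."""
    d2 = np.sum((x[:, None, :] - xp[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2 * bandwidth ** 2))

def apply_T(f, x_query, x_samples):
    """Estimate (T_kappa f)(x) at the query points using samples from p."""
    K = rbf_kernel(x_query, x_samples)              # (Q, M)
    return K @ f(x_samples) / x_samples.shape[0]    # (Q,)

rng = np.random.default_rng(0)
x_samples = rng.normal(size=(1024, 2))              # draws from an assumed p(x)
x_query = rng.normal(size=(5, 2))
f = lambda x: np.sin(x[:, 0])                       # any square-integrable test function
print(apply_T(f, x_query, x_samples))
```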
In machine learning, the study of eigenfunctions and their relationship with representation learning dates back to work on spectral clustering (Shi & Malik, 2000) and Laplacian Eigenmaps (Belkin & Niyogi, 2003). In these methods, the kernel $\kappa$ is derived from a graph that measures similarity between data points; usually $\kappa$ is a variant of the graph adjacency matrix. Then, for each data point, the outputs of the eigenfunctions associated with the $k$ largest eigenvalues are collected as a vector $\psi(x) \triangleq [\psi_1(x), \psi_2(x), \dots, \psi_k(x)]$. These vectors prove to be optimal embeddings that preserve local neighborhoods on data manifolds. Moreover, the feature extractor $\psi_j$ for each dimension is orthogonal to the others in function space, so redundancy is desirably minimized. Following Belkin & Niyogi (2003), we call $\psi(x)$ the eigenmap of $x$.
Our work builds upon the observation that eigenmaps can serve as good representations. But, unlike previous work that solves Equation (1) in a nonparametric way (by decomposing a Gram matrix computed on all data points), we approximate $\psi(x)$ with a neural network. Our parametric approach makes it possible to learn eigenmaps for a large dataset like ImageNet, while also enabling straightforward out-of-sample generalization. This is discussed further in the next section.

¹ $\mathcal{X}$ denotes the support of observations and $p(x)$ is a distribution over $\mathcal{X}$. $L^2(\mathcal{X}, p)$ is the set of all square-integrable functions w.r.t. $p$.
We leverage the NeuralEF algorithm, proposed by Deng et al. (2022) as a function-space generalization of EigenGame (Gemp et al., 2020), to approximate the $k$ principal eigenfunctions of a kernel using neural networks (NNs). In detail, NeuralEF introduces $k$ NNs $\psi_i$, $i = 1, \dots, k$,² which end with L2-BN layers (Deng et al., 2022), a variant of batch normalization (BN) (Ioffe & Szegedy, 2015), and optimizes them simultaneously by:
$$\max_{\psi_j} \; R_{j,j} - \alpha \sum_{i=1}^{j-1} R_{i,j}^2 \quad \text{for } j = 1, \dots, k, \qquad (3)$$
where
$$R_{i,j} \triangleq \mathbb{E}_{p(x)}\mathbb{E}_{p(x')}\big[\kappa(x, x')\,\psi_i(x)\,\psi_j(x')\big]. \qquad (4)$$
In practice, there is no real obstacle to using a single shared neural network $\psi: \mathcal{X} \to \mathbb{R}^k$ with $k$ outputs, each approximating a different eigenfunction. In this sense, we rewrite $R$ in matrix form:
$$R \triangleq \mathbb{E}_{p(x)}\mathbb{E}_{p(x')}\big[\kappa(x, x')\,\psi(x)\,\psi(x')^\top\big]. \qquad (5)$$
This work adopts this approach as it improves the scaling with $k$ and with network size.
Learning eigenfunctions provides a unifying surrogate objective for unsupervised deep representation learning. Moreover, the representation given by $\psi$ is ordered and highly structured: different components are orthogonal in function space, and those associated with large eigenvalues preserve more critical information from the kernel.
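As a minimal sketch of how Equations (3)-(5) can be turned into a single training loss, assume the kernel is available as a minibatch Gram matrix. The function name, the Monte Carlo estimator details, and the use of a stop-gradient to keep $\psi_i$ fixed inside the $j$-th problem (the same device that appears in Equation (8) later) are our illustration, not the paper's reference implementation.

```python
import torch

def neuralef_loss_from_gram(psi, gram, alpha=1.0):
    """Surrogate loss for the maximization problems in Eq. (3).

    psi:  (B, k) outputs of the shared network on a minibatch (after L2-BN).
    gram: (B, B) kernel matrix with entries kappa(x_b, x_b').
    """
    B = psi.shape[0]
    R = psi.T @ gram @ psi / (B * B)               # Monte Carlo estimate of Eq. (5), shape (k, k)
    R_hat = psi.detach().T @ gram @ psi / (B * B)  # stop-gradient on the psi_i side
    diag_term = torch.diagonal(R).sum()            # sum_j R_{j,j} (to be maximized)
    penalty = torch.triu(R_hat, diagonal=1).pow(2).sum()  # sum_j sum_{i<j} R_{i,j}^2
    return -diag_term + alpha * penalty            # minimize with any optimizer
```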
3 FROM NEURAL EIGENFUNCTIONS TO SELF-SUPERVISED LEARNING
Recent work on the theory of self-supervised learning (SSL) has noticed a strong connection between representations learned by SSL and spectral embeddings of data computed from a predefined augmentation kernel (HaoChen et al., 2021; Balestriero & LeCun, 2022; Johnson et al., 2022). In these works, a clean data point $\bar{x}$ generates random augmentations (views) according to some augmentation distribution $p(x \mid \bar{x})$. Neural networks are trained to maximize the similarity of representations across different augmentations. Johnson et al. (2022) defined the following augmentation kernel based on the augmentation graph constructed by HaoChen et al. (2021):
$$\kappa(x, x') \triangleq \frac{p(x, x')}{p(x)\,p(x')}, \qquad (6)$$
where $p(x, x') \triangleq \mathbb{E}_{p_d(\bar{x})}[p(x \mid \bar{x})\,p(x' \mid \bar{x})]$ and $p_d$ is the distribution of clean data. $p(x, x')$ characterizes the probability of generating $x$ and $x'$ from the same clean data point through augmentation, which can be seen as a measure of semantic closeness. $p(x)$ and $p(x')$ are the marginal distributions of $p(x, x')$. It is easy to show that this augmentation kernel is positive semidefinite.
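For completeness, here is one short way to see this (a derivation we spell out; it is not given in the text above): since $\kappa(x_i, x_j) = \mathbb{E}_{p_d(\bar{x})}\big[\tfrac{p(x_i\mid\bar{x})}{p(x_i)}\cdot\tfrac{p(x_j\mid\bar{x})}{p(x_j)}\big]$, for any points $x_1, \dots, x_n$ and coefficients $c_1, \dots, c_n$,
$$\sum_{i=1}^{n}\sum_{j=1}^{n} c_i c_j\, \kappa(x_i, x_j) = \mathbb{E}_{p_d(\bar{x})}\Big[\Big(\sum_{i=1}^{n} c_i\, \tfrac{p(x_i\mid\bar{x})}{p(x_i)}\Big)^{2}\Big] \;\ge\; 0,$$
i.e., $\kappa$ is an inner product of the features $\bar{x} \mapsto p(x\mid\bar{x})/p(x)$ in $L^2(p_d)$.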
Plugging the above definition of $\kappa(x, x')$ into Equation (4) yields
$$R = \mathbb{E}_{p(x, x')}\big[\psi(x)\,\psi(x')^\top\big] \approx \frac{1}{B}\sum_{b=1}^{B} \psi(x_b)\,\psi(x_b^+)^\top. \qquad (7)$$
Here, $x_b$ and $x_b^+$ are two independent samples from $p(x \mid \bar{x}_b)$, with $\bar{x}_1, \bar{x}_2, \dots, \bar{x}_B$ being a minibatch of data points. Define $\psi_{X_B} \triangleq [\psi(x_1), \psi(x_2), \cdots, \psi(x_B)] \in \mathbb{R}^{k \times B}$ and $\psi_{X_B^+}$ similarly. The optimization problems in Equation (3) for learning neural eigenfunctions can then be implemented in auto-differentiation frameworks (Baydin et al., 2018) using a single "surrogate" loss, i.e., a function that we can differentiate to obtain correct gradients for all maximization problems in Equation (3):
$$\ell(X_B, X_B^+) = -\sum_{j=1}^{k} \big(\psi_{X_B}\psi_{X_B^+}^\top\big)_{j,j} + \alpha \sum_{j=1}^{k}\sum_{i=1}^{j-1} \big(\hat{\psi}_{X_B}\psi_{X_B^+}^\top\big)_{i,j}^{2}. \qquad (8)$$
Here $\hat{\psi}_{X_B}$ denotes a constant fixed to the value of $\psi_{X_B}$ during gradient computation, corresponding to the fixed $\psi_i$ involved in the $j$-th optimization problem in Equation (3). Throughout this work, we will use the hat symbol to denote a value that is regarded as constant when we are computing gradients. In auto-differentiation libraries, this can be implemented with a stop-gradient operation.
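As a concrete sketch of Equation (8) (our code with hypothetical names; the backbone, projector, and L2-BN handling of the actual implementation are omitted), the loss can be written with a stop-gradient as follows. Setting `stop_grad=False` recovers the linear-probe variant discussed below.

```python
import torch

def neural_eigenmap_loss(psi_x, psi_xpos, alpha=1.0, stop_grad=True):
    """Surrogate loss of Eq. (8) for two batches of augmented views.

    psi_x, psi_xpos: (B, k) network outputs for x_b and x_b^+.
    With stop_grad=False this becomes the linear-probe loss of Eq. (9).
    """
    # Our tensors are (B, k), so we transpose to get the (k, k) matrix
    # (psi_{X_B} psi_{X_B^+}^T)_{i,j} of the equations.
    M = psi_x.T @ psi_xpos
    M_hat = (psi_x.detach().T @ psi_xpos) if stop_grad else M
    diag_term = torch.diagonal(M).sum()                   # sum_j (.)_{j,j}
    penalty = torch.triu(M_hat, diagonal=1).pow(2).sum()  # sum_j sum_{i<j} (.)_{i,j}^2
    return -diag_term + alpha * penalty

# Hypothetical usage with a shared encoder `psi` producing k-dimensional outputs:
# z1, z2 = psi(aug(x)), psi(aug(x))    # two views of the same minibatch
# loss = neural_eigenmap_loss(z1, z2, alpha=1.0)
```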
² We abuse $\psi_i$ to denote the NN approximating the $i$-th principal eigenfunction when there is no ambiguity.
Learning ordered representations. As proven by Deng et al. (2022), the above objective function results in each component of $\psi$ converging to a unique eigenfunction, ordered by the corresponding eigenvalue; e.g., the first dimension of the output of $\psi$ aligns with the eigenfunction of the largest eigenvalue. This bears similarity to PCA, where the principal components contain most of the kernel's information and are orthogonal to each other.
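Because the leading output dimensions carry the most important features, a retrieval index can simply keep a prefix of each code. The sketch below is our illustration of this idea; the cosine similarity, the code length, and all names are assumptions rather than the paper's exact retrieval pipeline.

```python
import torch

def truncate_codes(reps, code_len):
    """Keep only the first `code_len` dimensions of ordered representations (B, k)."""
    return torch.nn.functional.normalize(reps[:, :code_len], dim=1)

def retrieve(query_reps, gallery_reps, code_len=64, topk=10):
    """Nearest-neighbor retrieval with truncated, L2-normalized codes."""
    q = truncate_codes(query_reps, code_len)    # (Q, code_len)
    g = truncate_codes(gallery_reps, code_len)  # (N, code_len)
    sims = q @ g.T                              # (Q, N) cosine similarities
    return sims.topk(topk, dim=1).indices       # indices of the top-k gallery items
```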
Linear probe evaluation. We can view the above optimization problem as a kind of SSL algorithm, as it learns representations from multiple views (augmentations) of the data. For SSL methods, a gold standard for quantifying the quality of the learned representations is their linear probe performance, where a linear head is employed to classify the representations into semantic categories. Yet, the linear probe does not take advantage of ordered representations, as also suggested by HaoChen et al. (2021). Even if the representation is replaced by the output of an arbitrary span of eigenfunctions, the linear classifier weights can simply be adjusted to produce the same classifier. This implies that replacing $\hat{\psi}_{X_B}$ with $\psi_{X_B}$ in Equation (8) (which changes the optimal solution to an arbitrary span of eigenfunctions) does not affect the optimal classifier and may actually ease optimization because it relaxes the ordering constraints. So, we adapt the loss specifically for linear probe tasks as follows:
$$\ell_{\mathrm{lp}}(X_B, X_B^+) = -\sum_{j=1}^{k} \big(\psi_{X_B}\psi_{X_B^+}^\top\big)_{j,j} + \alpha \sum_{j=1}^{k}\sum_{i=1}^{j-1} \big(\psi_{X_B}\psi_{X_B^+}^\top\big)_{i,j}^{2}. \qquad (9)$$
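In terms of the sketch after Equation (8) (our hypothetical `neural_eigenmap_loss` helper), this variant amounts to disabling the stop-gradient:

```python
# Linear-probe variant (Eq. 9): same loss, but without detaching psi_{X_B} in the penalty.
loss_lp = neural_eigenmap_loss(z1, z2, alpha=1.0, stop_grad=False)
```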
Connection to Barlow Twins (Zbontar et al., 2021). Interestingly, the SSL objective defined in Barlow Twins can be written using $\psi_{X_B}$ and $\psi_{X_B^+}$:
$$\ell_{\mathrm{BT}}(X_B, X_B^+) = \sum_{j=1}^{k} \Big[1 - \big(\psi_{X_B}\psi_{X_B^+}^\top\big)_{j,j}\Big]^{2} + \lambda \sum_{j=1}^{k}\sum_{i \neq j} \big(\psi_{X_B}\psi_{X_B^+}^\top\big)_{i,j}^{2}, \qquad (10)$$
where $\lambda$ denotes a trade-off coefficient. This objective is closely analogous to ours defined in Equation (9). For the first term, our objective directly maximizes the diagonal elements, whereas Barlow Twins pushes these elements toward 1. Although they have a similar effect, the gradients and optimal solutions of the two problems can differ. For the second term, we penalize only the elements on one side of the diagonal (those with $i < j$), while Barlow Twins penalizes all off-diagonal ones. With this, we argue that the objective of Barlow Twins is an approximation of our objective function for the linear probe.
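For a side-by-side view, Equation (10) can be sketched in the same notation (our code; Barlow Twins additionally standardizes each feature over the batch before forming the cross-correlation matrix, which we include below, and 5e-3 is the trade-off value reported in the Barlow Twins paper):

```python
import torch

def barlow_twins_loss(z1, z2, lam=5e-3, eps=1e-6):
    """Eq. (10)-style objective for two views z1, z2 of shape (B, k)."""
    B = z1.shape[0]
    z1n = (z1 - z1.mean(0)) / (z1.std(0) + eps)   # standardize each feature over the batch
    z2n = (z2 - z2.mean(0)) / (z2.std(0) + eps)
    C = z1n.T @ z2n / B                            # (k, k) cross-correlation matrix
    on_diag = (1.0 - torch.diagonal(C)).pow(2).sum()            # sum_j (1 - C_{j,j})^2
    off_diag = C.pow(2).sum() - torch.diagonal(C).pow(2).sum()  # sum over all i != j
    return on_diag + lam * off_diag
```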
This section builds upon the kernels of HaoChen et al. (2021) and Johnson et al. (2022). The spectral contrastive loss (SCL) of HaoChen et al. (2021) only recovers the subspace spanned by eigenfunctions, so their learned representation does not exhibit an ordered structure as ours does. Moreover, as will be shown in Section 6.2, our method empirically benefits more from a large $k$ than SCL. Concurrent to our work, the extended conference version of Johnson et al. (2022) also applied NeuralEF to the kernel of Equation (6) (Johnson et al., 2023). However, they focused on the optimality of the representation obtained by kernel PCA and only tested NeuralEF as an alternative in synthetic tasks. In contrast, our work extends NeuralEF to larger-scale problems such as ImageNet-scale SSL and graph representation learning, and discusses the benefit of ordered representations for image retrieval.
4 GRAPH REPRESENTATION LEARNING WITH NEURAL EIGENFUNCTIONS
In a variety of real-world scenarios, the observations do not exist in isolation but are related to each other. Their relations are often given as a graph. Assume we have a graph dataset $(X, A)$, where $X \triangleq \{x_i\}_{i=1}^{n}$ denotes the node set and $A$ is the graph adjacency matrix. We define $D = \mathrm{diag}(A\mathbf{1}_n)$ and the normalized adjacency matrix $\bar{A} \triangleq D^{-1/2} A D^{-1/2}$. In spectral clustering (Shi & Malik, 2000), it was shown that the eigenmaps produced by the principal eigenvectors of $\bar{A}$ are relaxed cluster assignments of nodes that minimize the graph cut. This motivates us to use them as node representations in downstream tasks. However, computing these node representations requires an eigendecomposition of the $n$-by-$n$ matrix $\bar{A}$ and hence does not scale well. Moreover, it cannot handle out-of-sample predictions, where we need the representation of a novel test example.
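For reference, the nonparametric baseline described above can be sketched as follows (our code; it relies on a dense eigendecomposition, which is exactly what becomes infeasible for large $n$ and offers no out-of-sample support):

```python
import numpy as np

def normalized_adjacency(A):
    """A_bar = D^{-1/2} A D^{-1/2} for a dense adjacency matrix A of shape (n, n)."""
    d = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    return A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def spectral_node_embeddings(A, k):
    """Eigenmaps from the k principal eigenvectors of A_bar (roughly O(n^3) to compute)."""
    A_bar = normalized_adjacency(A)
    eigvals, eigvecs = np.linalg.eigh(A_bar)   # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]      # take the k largest
    return eigvecs[:, order]                   # (n, k) node representations
```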
We propose to treat $\bar{A}$ as the Gram matrix of a kernel $\dot{\kappa}(x, x')$ on $X$ and apply NeuralEF to learn its $k$ principal eigenfunctions. However, unlike the augmentation kernel from the last section, the normalized adjacency matrix can be indefinite³ for an arbitrary graph. Fortunately, we have the

³ One might point out that the graph Laplacian is always positive semidefinite. However, in this case, the eigenmaps should be generated by the eigenfunctions with the $k$ smallest eigenvalues.