
Learning ordered representations. As proven by Deng et al. (2022), the above objective function results in each component of $\psi$ converging to a unique eigenfunction, ordered by the corresponding eigenvalue; e.g., the first output dimension of $\psi$ aligns with the eigenfunction of the largest eigenvalue. This bears similarity to PCA, where the principal components capture most of the information in the kernel and are orthogonal to each other.
Linear probe evaluation. We can view the above optimization problem as a kind of SSL algorithm, as it learns representations from multiple views (augmentations) of the data. For SSL methods, the gold standard for quantifying the quality of the learned representations is their linear probe performance, where a linear head is employed to classify the representations into semantic categories. Yet, as also suggested by HaoChen et al. (2021), the linear probe does not take advantage of ordered representations: even if the representation is replaced by the output of an arbitrary span of eigenfunctions, the linear classifier weights can simply be adjusted to produce the same classifier. This implies that replacing $\hat{\psi}_{X_B}$ with $\psi_{X_B}$ in Equation (8) (which changes the optimal solution to an arbitrary span of eigenfunctions) does not affect the optimal classifier, and may actually ease optimization because it relaxes the ordering constraints. We therefore adapt the loss specifically for linear probe tasks as follows:
$$
\ell_{\mathrm{lp}}(X_B, X_B^+) = -\sum_{j=1}^{k} \big[\psi_{X_B}\psi_{X_B^+}^{\top}\big]_{j,j} + \alpha \sum_{j=1}^{k} \sum_{i=1}^{j-1} \big[\psi_{X_B}\psi_{X_B^+}^{\top}\big]_{i,j}^{2}. \tag{9}
$$
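To make the objective concrete, below is a minimal PyTorch-style sketch of Equation (9) (an illustration only, not the paper's released implementation). It assumes `psi_x` and `psi_x_plus` hold $\psi_{X_B}$ and $\psi_{X_B^+}$ as $k \times |B|$ matrices, and it omits any further implementation details of the full training pipeline:

```python
import torch

def lp_loss(psi_x, psi_x_plus, alpha=1.0):
    """Direct transcription of Equation (9) (sketch; shapes are assumptions).

    psi_x, psi_x_plus: (k, B) tensors holding psi evaluated on the two
    augmented views of a mini-batch.
    """
    M = psi_x @ psi_x_plus.T                      # (k, k) cross-view matrix
    diag_term = -torch.diagonal(M).sum()          # maximize diagonal entries
    # penalize off-diagonal entries with i < j (strict upper triangle)
    penalty = (torch.triu(M, diagonal=1) ** 2).sum()
    return diag_term + alpha * penalty
```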
Connection to Barlow Twins (Zbontar et al., 2021). Interestingly, the SSL objective defined in Barlow Twins can be written using $\psi_{X_B}$ and $\psi_{X_B^+}$:
$$
\ell_{\mathrm{BT}}(X_B, X_B^+) = \sum_{j=1}^{k} \Big[1 - \big[\psi_{X_B}\psi_{X_B^+}^{\top}\big]_{j,j}\Big]^{2} + \lambda \sum_{j=1}^{k} \sum_{i \neq j} \big[\psi_{X_B}\psi_{X_B^+}^{\top}\big]_{i,j}^{2}, \tag{10}
$$
where $\lambda$ denotes a trade-off coefficient. This objective is closely analogous to ours in Equation (9). For the first term, our objective directly maximizes the diagonal elements, whereas Barlow Twins pushes these elements to 1. Although the two have a similar effect, their gradients and optimal solutions can differ. For the second term, we penalize only the off-diagonal elements with $i < j$, while Barlow Twins penalizes all off-diagonal ones. In this sense, we argue that the Barlow Twins objective is an approximation of our objective function for the linear probe.
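For a side-by-side comparison, here is an equally minimal transcription of Equation (10) under the same assumed tensor shapes (again an illustration, not the official Barlow Twins code); the two sketches differ only in how the diagonal is treated and in which off-diagonal entries are penalized:

```python
import torch

def bt_loss(psi_x, psi_x_plus, lam=5e-3):
    """Transcription of Equation (10); tensor shapes as in lp_loss above.

    The default value of lam is illustrative.
    """
    M = psi_x @ psi_x_plus.T                            # (k, k) cross-view matrix
    diag_term = ((1.0 - torch.diagonal(M)) ** 2).sum()  # push diagonal entries to 1
    off_diag = M - torch.diag_embed(torch.diagonal(M))  # zero out the diagonal
    penalty = (off_diag ** 2).sum()                     # all off-diagonal entries
    return diag_term + lam * penalty
```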
This section builds upon the kernels of HaoChen et al. (2021) and Johnson et al. (2022). The spectral contrastive loss (SCL) of HaoChen et al. (2021) only recovers the subspace spanned by the eigenfunctions, so the learned representation does not exhibit an ordered structure as ours does. Moreover, as will be shown in Section 6.2, our method empirically benefits more from a large $k$ than SCL does. Concurrent to our work, the extended conference version of Johnson et al. (2022) also applied NeuralEF to the kernel in Equation (6) (Johnson et al., 2023). However, they focused on the optimality of the representation obtained by kernel PCA and only tested NeuralEF as an alternative on synthetic tasks. In contrast, our work extends NeuralEF to larger-scale problems such as ImageNet-scale SSL and graph representation learning, and discusses the benefit of ordered representations for image retrieval.
4 GRAPH REPRESENTATION LEARNING WITH NEURAL EIGENFUNCTIONS
In a variety of real-world scenarios, the observations do not exist in isolation but are related to each other. Their relations are often given as a graph. Assume we have a graph dataset $(X, A)$, where $X \triangleq \{x_i\}_{i=1}^{n}$ denotes the node set and $A$ is the graph adjacency matrix. We define $D = \mathrm{diag}(A \mathbf{1}_n)$ and the normalized adjacency matrix $\bar{A} \triangleq D^{-1/2} A D^{-1/2}$. In spectral clustering (Shi & Malik, 2000), it was shown that the eigenmaps produced by the principal eigenvectors of $\bar{A}$ are relaxed cluster assignments of nodes that minimize the graph cut. This motivates us to use them as node representations in downstream tasks. However, computing these node representations requires an eigendecomposition of the $n \times n$ matrix $\bar{A}$ and hence does not scale well. Moreover, it cannot handle out-of-sample prediction, where we need the representation of a novel test example.
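For reference, the classical pipeline described above can be sketched as follows: it forms $\bar{A}$ explicitly and takes its top-$k$ eigenvectors as node embeddings, which is exactly the cubic-cost eigendecomposition step (with no out-of-sample extension) that motivates the kernelized treatment below. The function name and dense-matrix assumption are illustrative, not from the paper.

```python
import numpy as np

def spectral_node_embeddings(A, k):
    """Baseline eigenmaps from the normalized adjacency (illustrative sketch).

    A: dense symmetric (n, n) adjacency matrix; k: embedding dimension.
    Returns the eigenvectors of A_bar = D^{-1/2} A D^{-1/2} associated with
    the k largest eigenvalues, used as relaxed cluster-assignment features.
    """
    deg = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-12))   # guard isolated nodes
    A_bar = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    evals, evecs = np.linalg.eigh(A_bar)                  # ascending eigenvalues
    top_k = np.argsort(evals)[::-1][:k]                   # indices of k largest
    return evecs[:, top_k]                                # (n, k) node embeddings
```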
We propose to treat $\bar{A}$ as the Gram matrix of a kernel $\dot{\kappa}(x, x')$ on $X$ and to apply NeuralEF to learn its $k$ principal eigenfunctions. However, unlike the augmentation kernel from the last section, the normalized adjacency matrix can be indefinite³ for an arbitrary graph. Fortunately, we have the
³One might point out that the graph Laplacian is always positive semidefinite. However, in that case, the eigenmaps should be generated by the eigenfunctions with the $k$ smallest eigenvalues.