Spectral Regularization Allows Data-frugal Learning over
Combinatorial Spaces
Amirali Aghazadeh1, Nived Rajaraman2, Tony Tu1, and Kannan Ramchandran2
1School of Electrical and Computer Engineering, Georgia Institute of Technology,
2Department of Electrical Engineering and Computer Sciences,
University of California, Berkeley
Abstract
Data-driven machine learning models are being increasingly employed in several im-
portant inference problems in biology, chemistry, and physics which require learning over
combinatorial spaces. Recent empirical evidence (see, e.g., [1, 2, 3]) suggests that regu-
larizing the spectral representation of such models improves their generalization power
when labeled data is scarce. However, despite these empirical studies, the theoretical
underpinning of when and how spectral regularization enables improved generalization
is poorly understood. In this paper, we focus on learning pseudo-Boolean functions and
demonstrate that regularizing the empirical mean squared error by the $L_1$ norm of the
spectral transform of the learned function reshapes the loss landscape and allows for
data-frugal learning, under a restricted secant condition on the learner’s empirical error
measured against the ground truth function. Under a weaker quadratic growth condi-
tion, we show that stationary points which also approximately interpolate the training
data points achieve statistically optimal generalization performance. Complementing
our theory, we empirically demonstrate that running gradient descent on the regularized
loss results in a better generalization performance compared to baseline algorithms in
several data-scarce real-world problems.
1 Introduction
Machine learning (ML) models have become increasingly commonplace in learning pseudo-Boolean functions $f(\cdot)$, which map a $d$-dimensional binary vector $x \in \{\pm 1\}^d$ to a real number $f(x) \in \mathbb{R}$. In biology, ML models are being used to infer the functional properties of macro-molecules (e.g., proteins) from a small set of mutations [4, 5]. In physics, ML models are being used to infer the thermodynamic properties of combinatorial systems defined over a set of binary (Ising) state variables [6, 7]. These highly nonlinear and complex ML models are trained using modern techniques in continuous optimization, yet they inherit the elegant properties of pseudo-Boolean functions over the discrete binary hypercube $\{\pm 1\}^d$. In particular, it is known that the spectral representation of a pseudo-Boolean function is defined as the Walsh-Hadamard transform (WHT) of the resulting vector of combinatorial
function evaluations, that is, $Hf(X)$, where $H$ is a $2^d \times 2^d$ Walsh matrix and the function evaluation vector $f(X) = [f(x) : x \in X]^T$ is sorted in the lexicographic ordering of all the length-$d$ binary strings in $X$. The WHT enables writing the pseudo-Boolean function $f(x)$ as a multi-linear polynomial $f(x) = \sum_{S \subseteq [d]} \alpha_S \prod_{i \in S} x_i$, where $[d] = \{1, 2, \ldots, d\}$ and $\alpha_S \in \mathbb{R}$ is the WHT coefficient corresponding to the monomial $\prod_{i \in S} x_i$ [8].$^1$
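For concreteness, the $d = 2$ instance of this expansion reads (an illustrative special case, not an equation taken from the paper):

$$f(x_1, x_2) = \alpha_{\emptyset} + \alpha_{\{1\}}\, x_1 + \alpha_{\{2\}}\, x_2 + \alpha_{\{1,2\}}\, x_1 x_2,$$

and the four coefficients can be read off from the WHT of the four evaluations $f(+1,+1)$, $f(+1,-1)$, $f(-1,+1)$, $f(-1,-1)$, up to a $1/2^d$ normalization and the chosen indexing convention.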
Recent studies in biology (e.g., protein function prediction) have measured the output of several of these real-world functions $f(x)$ on all the $2^d$ enumerations of the input $x$ and analyzed their spectral representation. These costly high-throughput experiments on combinatorially complete datasets demonstrate a curious phenomenon: such pseudo-Boolean functions often have low-dimensional structures that manifest in the form of an approximately-sparse polynomial representation (i.e., approximately-sparse WHT), with the top-$k$ coefficients corresponding to physically-meaningful interactions [9, 10, 11] (see Appendix A for empirical evidence on protein functions).
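As a concrete illustration of what "approximately sparse in the WHT" means, the following is a minimal sketch that measures the fraction of spectral energy captured by the top-$k$ coefficients of a combinatorially complete evaluation vector; the synthetic landscape, $d$, and $k$ below are assumptions for illustration only (the real protein landscapes are discussed in Appendix A).

    # Illustrative sketch: quantify approximate spectral sparsity as the fraction of
    # squared WHT mass in the top-k coefficients of an evaluation vector f(X).
    # The synthetic landscape, d, and k below are assumptions.
    import numpy as np

    def hadamard(d: int) -> np.ndarray:
        """Sylvester Hadamard matrix H_{2^d}."""
        H = np.array([[1.0]])
        for _ in range(d):
            H = np.block([[H, H], [H, -H]])
        return H

    rng = np.random.default_rng(1)
    d, k = 6, 10
    H = hadamard(d)

    # Stand-in landscape: a few large polynomial coefficients plus a small dense tail.
    alpha = 0.02 * rng.normal(size=2**d)
    alpha[rng.choice(2**d, size=k, replace=False)] += rng.normal(size=k)
    f_X = H @ alpha                      # all 2^d function evaluations

    wht = H @ f_X / 2**d                 # spectral representation of f(X)
    energy = np.sort(wht**2)[::-1]
    print(f"top-{k} coefficients capture {energy[:k].sum() / energy.sum():.1%} of the spectral energy")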
The problem of learning pseudo-Boolean functions with sparse polynomial representations has a rich history from both theoretical and empirical viewpoints. In particular, since the pseudo-Boolean functions being learned are linear functions in the coefficients of their polynomial representations, one may think about this problem as a special instance of sparse linear regression in dimension $2^d$. There are a number of algorithms for sparse linear regression such as LASSO [12], OMP [13], AMP [14], and FISTA [15]. In particular, one may apply LASSO to get statistical guarantees for this problem which scale only linearly in the sparsity $k$ of the underlying ground truth polynomial and logarithmically in the problem dimension, $\log(2^d)$ [16]. Likewise, from another perspective, one may view the polynomial instead as a vector of $2^d$ outcomes $f(X)$; the objective of the learner is to approximate $f(X)$, when the learner can observe any chosen set of a few entries of this vector (corrupted by noise), under the assumption that $f(X)$ itself has a sparse WHT. In the literature, this problem has been referred to by a number of names, and we refer to it as the sparse WHT problem. There are yet again a number of computationally and statistically efficient algorithms for solving this problem, such as SPRIGHT [17] and Sparse-FFT [18], among others.
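To make the sparse-linear-regression view concrete, the sketch below expands each input into its $2^d$ Walsh monomial features and fits an $L_1$-penalized linear model from randomly sampled, noisy evaluations. The dimensions, noise level, regularization weight, and the use of scikit-learn's Lasso solver are illustrative assumptions, not the paper's method.

    # Sparse-linear-regression view: expand each x in {+1,-1}^d into its 2^d Walsh
    # monomial features and fit an L1-penalized linear model (LASSO).
    # d, n, k, the noise level, and the solver settings are illustrative assumptions.
    import numpy as np
    from sklearn.linear_model import Lasso

    def walsh_features(X: np.ndarray) -> np.ndarray:
        """Map each row x in {+1,-1}^d to the 2^d monomials prod_{i in S} x_i."""
        n, d = X.shape
        feats = np.ones((n, 2**d))
        for s in range(2**d):              # s encodes the subset S as a bitmask
            for i in range(d):
                if (s >> i) & 1:
                    feats[:, s] *= X[:, i]
        return feats

    rng = np.random.default_rng(0)
    d, n, k, sigma = 8, 150, 5, 0.1

    # Ground truth: a k-sparse coefficient vector over the 2^d monomials.
    beta = np.zeros(2**d)
    beta[rng.choice(2**d, size=k, replace=False)] = rng.normal(size=k)

    X = rng.choice([-1.0, 1.0], size=(n, d))   # random (mutagenesis-like) samples
    Phi = walsh_features(X)
    y = Phi @ beta + sigma * rng.normal(size=n)

    lasso = Lasso(alpha=0.01, fit_intercept=False, max_iter=10000).fit(Phi, y)
    print("true support:     ", np.flatnonzero(beta))
    print("recovered support:", np.flatnonzero(np.abs(lasso.coef_) > 1e-3))

Note that the learner here commits to the explicit $2^d$-dimensional monomial parameterization, which is exactly the restriction discussed next.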
A common issue with these alternative views of the problem, namely sparse linear regression and sparse WHT recovery, is that the resulting algorithms are not well suited for function approximation.
In particular, real-world physical and biological functions often have additional structure
which is either unknown a priori or cannot be described succinctly, and which nonlinear
function classes are often implicitly biased towards learning. Indeed, several deep neural
networks (DNNs) have been shown to exhibit strong generalization performance in biolog-
ical and physical problems [5, 19, 20]. This leads to a fundamental disconnect between
algorithms for which strong theoretical guarantees are known (e.g., LASSO) and practi-
cal ML models based on nonlinear and complex function classes which are well-suited for
function approximation.
$^1$We use these terms interchangeably for pseudo-Boolean functions: spectral, Fourier, and Walsh-Hadamard.
Another issue is specific to algorithms for solving the sparse WHT problem: some of
these approaches require observing f(·) at any chosen input [17, 21]. In many practical
applications, where data is prohibitively expensive to collect, one is often forced to work
with a handful of random samples. For example, in the case of proteins, a common approach
to measure biological functions is through a procedure called random mutagenesis which
allows only a random sampling of the combinatorial sequence space [20]. This constraint
renders several algorithms for the sparse WHT problem ill-suited for learning, even though
they admit near-optimal computational and sample complexity guarantees.
From a more practical point of view, several of the recent approaches for learning over
combinatorial spaces (see, e.g., [1, 2, 3, 22]) follow similar ideas for improving empiri-
cal generalization. Instead of directly learning the $2^d$-dimensional polynomial, they have
converged on explicitly promoting sparsity in the spectral representation of the learned
compactly-represented nonlinear function—also known as spectral regularization. These
empirical studies motivate several important theoretical questions regarding the underlying
mechanism of spectral regularization in improving generalization in practice. In this paper,
we focus on the question: When and how does spectral regularization improve gen-
eralization? The answer to this question is important from both theoretical and practical
perspectives. Theoretically, it helps us understand how the loss landscape is reshaped as
the result of spectral regularization in favor of data-frugal learning. Practically, it tells us
when and how to use spectral regularization in learning real-world combinatorial functions.
Contributions. In this paper, we theoretically analyze spectral regularization of the empirical risk from the perspective of learning pseudo-Boolean functions. We demonstrate that regularizing with the $L_1$ norm of the spectrum of the learned function improves the generalization performance of the learner. In particular, under a Restricted Secant Inequality (RSI) condition on the learner's empirical error, measured against the ground truth, stationary points of the regularized loss admit optimal generalization performance. We relax this to a weaker quadratic growth (QG) condition and show that, under an additional data-interpolation condition, stationary points of the regularized loss also admit statistical optimality. We empirically demonstrate that these conditions are satisfied for a wide class of nonlinear functions. Our analysis provides new generalization bounds when the underlying learned functions are sparse in the spectral domain. Empirical demonstrations on several real-world datasets complement our theoretical findings.
2 Learning Pseudo-Boolean Functions
Problem statement. We consider the problem of learning structured pseudo-Boolean functions on the binary hypercube $\{\pm 1\}^d$. In the finite sample setting, we assume that the learner has access to a dataset $D_n$ comprised of $n$ labeled data points $(x_i, y_i)_{i=1}^n$, where $x_i \in \{\pm 1\}^d$ is drawn from an input distribution $\mathcal{D}$ and the real-valued output $y_i \in \mathbb{R}$ is a noisy realization of an unknown pseudo-Boolean function of the input.$^2$ Namely, we assume that there is a rich nonlinear function class $\mathcal{F} = \{f_{\boldsymbol{\theta}} : \boldsymbol{\theta} \in \mathbb{R}^m\}$ parameterized by $\boldsymbol{\theta}$, where $f_{\boldsymbol{\theta}} : \{\pm 1\}^d \to \mathbb{R}$. The labels are generated as $y_i = f_{\boldsymbol{\theta}}(x_i) + Z_i$, where $Z_i$ is the noise in the measurement for input $x_i$, assumed to be independent and normally distributed $\sim \mathcal{N}(0, \sigma^2)$.$^3$ The objective of the learner is to learn $\boldsymbol{\theta}$.

$^2$We use the subscript $x_i$ to refer to the $i$th digit in the binary string $x = (x_1, x_2, \cdots, x_d)$.

$^3$In fact, our results only require the noise to be independent subgaussian random variables with variance parameter $\sigma^2$, but for ease of exposition, we impose the Gaussian noise condition.
Multi-linear polynomial representation. Any pseudo-Boolean function $f(\cdot)$ can be uniquely represented as a multi-linear polynomial, $f(x) = \sum_{z \in \{\pm 1\}^d} \alpha_z \prod_{i : z_i = +1} x_i$, where the scalar $\alpha_z \in \mathbb{R}$ is the coefficient corresponding to the monomial $\prod_{i : z_i = +1} x_i$, with order $\sum_{i=1}^{d} \mathbb{1}(z_i = +1)$ [8]. The problem of learning a pseudo-Boolean function $f(\cdot)$ is equivalent to finding the $2^d$ unknown coefficients $\alpha_z$. In particular, the evaluations of $f$ on all vertices of the binary hypercube $\{\pm 1\}^d$ result in a linear system of equations over the $\alpha_z$ variables,

$$[f(x) : x \in \{\pm 1\}^d]^T = H\, [\alpha_z : z \in \{\pm 1\}^d]^T, \qquad (1)$$

where the variables $x$ and $z$ enumerate the vertices of the binary hypercube $\{\pm 1\}^d$ in lexicographic order, and $H \equiv H_{2^d}$ denotes the $2^d \times 2^d$ Hadamard matrix defined recursively as

$$H_{2^{d+1}} = \begin{bmatrix} H_{2^d} & H_{2^d} \\ H_{2^d} & -H_{2^d} \end{bmatrix}, \qquad (2)$$

where $H_1 = +1$. Thus the vector of function evaluations can be obtained by taking the WHT of the vector of coefficients in the polynomial representation of $f$. Furthermore, by inverting the above linear system (note that $H/\sqrt{2^d}$ is an orthonormal matrix, so $H^{-1} = H/2^d$), the vector of polynomial coefficients $[\alpha_z : z \in \{\pm 1\}^d]$ can, in turn, be obtained by taking the WHT of the vector $[f(x) : x \in \{\pm 1\}^d]^T$. For succinctness of notation, for a function $f$, we define $f(X)$ as $[f(x) : x \in \{\pm 1\}^d]^T$, where the binary strings are enumerated in lexicographic order. The coefficients in the polynomial representation of $f(\cdot)$ can be collected in the vector $Hf(X)$. For functions, we use "sparsity in WHT" and "sparse polynomial representation" interchangeably.
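As a sanity check of the relations (1)-(2), the following is a minimal NumPy sketch, assuming a small $d$ and a randomly chosen $k$-sparse coefficient vector (both illustrative choices): it builds the Sylvester Hadamard matrix, forms the evaluation vector $f(X) = H\alpha$, and recovers $\alpha$ using $H^{-1} = H/2^d$.

    # Minimal sketch of Eq. (1)-(2): build H_{2^d}, form f(X) = H alpha for a
    # k-sparse alpha, and recover alpha via the inverse WHT (H^{-1} = H / 2^d).
    # The values of d and k and the random sparse alpha are illustrative assumptions.
    import numpy as np

    def hadamard(d: int) -> np.ndarray:
        """Sylvester Hadamard matrix H_{2^d}, built by the recursion in Eq. (2)."""
        H = np.array([[1.0]])
        for _ in range(d):
            H = np.block([[H, H], [H, -H]])
        return H

    d, k = 4, 3
    rng = np.random.default_rng(0)

    # A k-sparse polynomial: k nonzero coefficients alpha_z at random positions.
    alpha = np.zeros(2**d)
    alpha[rng.choice(2**d, size=k, replace=False)] = rng.normal(size=k)

    H = hadamard(d)
    f_X = H @ alpha               # all 2^d function evaluations, Eq. (1)
    alpha_hat = H @ f_X / 2**d    # inverse WHT: H is symmetric and H H = 2^d I

    assert np.allclose(alpha_hat, alpha)
    print("recovered support:", np.flatnonzero(np.abs(alpha_hat) > 1e-8))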
Sparsity in real-world functions. Even in the noiseless setting, in order to perfectly learn a general pseudo-Boolean function $f(\cdot)$, its evaluations on all input binary vectors $x \in \{\pm 1\}^d$ are required, which may be very expensive. In the presence of additional structures, this significant statistical and computational cost can be mitigated. One such structure that has been observed in real-world biological and physical functions is sparsity in the WHT [9, 10, 11]. In particular, in the theoretical analysis of this paper, we assume that the ground-truth function $f_{\boldsymbol{\theta}}$ has a sparse polynomial representation composed of at most $k$ monomials. Namely, $\|H f_{\boldsymbol{\theta}}(X)\|_0 \le k$, where $k \ll 2^d$ is typically unknown.
Spectral regularization. In attempting to learn pseudo-Boolean functions with a sparse polynomial representation, a natural question to ask is: how can one encourage the learned functions to also have sparse polynomial representations? One solution is to regularize the training loss with an additional functional which promotes sparsity in the polynomial representation of the learned function $\hat{f}$. Denoting the polynomial representation of $\hat{f}$ by $\sum_{z \in \{\pm 1\}^d} \hat{\alpha}_z \prod_{i : z_i = +1} x_i$, a natural regularization functional would be $\|\hat{\alpha}\|_0 \equiv \|H \hat{f}(X)\|_0$, which is also the sparsity of $\hat{f}$ in the WHT.
However, since $\|\cdot\|_0$ is not a continuous function, we replace $\|\cdot\|_0$ with the $L_1$ norm as a proxy. The resulting regularization functional is $\|H \hat{f}(X)\|_1$. When learning over a parameterized function family, $\hat{f} = f_{\boldsymbol{\theta}}$, the resulting regularized ERM we consider in this paper is,

$$\min_{\boldsymbol{\theta}} \; L_n(\boldsymbol{\theta}) + R(\boldsymbol{\theta}), \quad \text{where} \quad R(\boldsymbol{\theta}) \triangleq \frac{\lambda}{2^d} \|H f_{\boldsymbol{\theta}}(X)\|_1, \qquad \text{(OBJ)}$$

where $L_n(\boldsymbol{\theta}) \triangleq \frac{1}{n}\sum_{i=1}^{n} \big[f_{\boldsymbol{\theta}}(x_i) - y_i\big]^2$ is the empirical mean squared error (MSE). Regularizing by $R(\boldsymbol{\theta})$ of the above form is known as spectral regularization (SP).
Remark 1. The scaling of the regularization parameter, $\lambda/2^d$, is motivated from the linearly parameterized setting of $\mathcal{F} = \{\langle \boldsymbol{\theta}, x \rangle : \boldsymbol{\theta} \in \mathbb{R}^d\}$. An explicit computation gives $\frac{\lambda}{2^d}\|H f_{\boldsymbol{\theta}}(X)\|_1 = \lambda\|\boldsymbol{\theta}\|_1$, which is the scaling of the regularization parameter as used in LASSO.
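One way to verify this identity, using that the Sylvester Hadamard matrix is symmetric and satisfies $HH = 2^d I$ (a short derivation, not reproduced from the paper):

$$f_{\boldsymbol{\theta}}(X) = H\alpha \;\Longrightarrow\; H f_{\boldsymbol{\theta}}(X) = HH\alpha = 2^d \alpha \;\Longrightarrow\; \frac{\lambda}{2^d}\,\|H f_{\boldsymbol{\theta}}(X)\|_1 = \lambda\,\|\alpha\|_1 = \lambda\,\|\boldsymbol{\theta}\|_1,$$

where $\alpha$ is the coefficient vector in (1), which for $f_{\boldsymbol{\theta}}(x) = \langle \boldsymbol{\theta}, x \rangle$ equals $\theta_i$ on the order-1 monomials $x_i$ and zero elsewhere.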
The regularization weight $\lambda > 0$ strikes a balance between the empirical MSE and the spectral regularization, and is set empirically using cross validation. Since the aggregate loss $L_n(\boldsymbol{\theta}) + R(\boldsymbol{\theta})$ is subdifferentiable with respect to the model parameters $\boldsymbol{\theta}$, we can apply stochastic (sub-)gradient methods to the aggregate loss. Note, however, that in general, as a function of the parameter $\boldsymbol{\theta}$, both the MSE and the SP regularization are non-convex functions.
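To make the optimization of (OBJ) concrete, here is a minimal sketch, assuming a small MLP for $f_{\boldsymbol{\theta}}$, a $d$ small enough that the full hypercube can be enumerated to evaluate $\|H f_{\boldsymbol{\theta}}(X)\|_1$ exactly, synthetic data, and full-batch Adam in place of a stochastic method; none of these choices are prescribed by the paper.

    # Minimal sketch of minimizing (OBJ): a small MLP f_theta, empirical MSE on n
    # labeled points, plus (lambda / 2^d) * ||H f_theta(X)||_1 computed by evaluating
    # f_theta on the full hypercube (feasible only for small d).
    # The architecture, optimizer, and hyperparameters are illustrative assumptions.
    import itertools
    import torch
    import torch.nn as nn

    d, n, lam, sigma = 6, 40, 0.05, 0.1
    torch.manual_seed(0)

    def hadamard(d: int) -> torch.Tensor:
        """Sylvester Hadamard matrix H_{2^d} as a float tensor."""
        H = torch.tensor([[1.0]])
        for _ in range(d):
            H = torch.cat([torch.cat([H, H], dim=1), torch.cat([H, -H], dim=1)], dim=0)
        return H

    H = hadamard(d)                                                        # 2^d x 2^d
    X_full = torch.tensor(list(itertools.product([1.0, -1.0], repeat=d)))  # all 2^d vertices

    # Synthetic labels from a sparse ground-truth polynomial (an assumption for the demo).
    alpha_true = torch.zeros(2**d)
    alpha_true[torch.randperm(2**d)[:4]] = torch.randn(4)
    y_full = H @ alpha_true                                  # f(X), as in Eq. (1)
    idx = torch.randperm(2**d)[:n]                           # n random training vertices
    X_train, y_train = X_full[idx], y_full[idx] + sigma * torch.randn(n)

    model = nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, 1))
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)

    for step in range(2000):
        opt.zero_grad()
        mse = ((model(X_train).squeeze(-1) - y_train) ** 2).mean()        # L_n(theta)
        spec = lam / 2**d * (H @ model(X_full).squeeze(-1)).abs().sum()   # R(theta)
        (mse + spec).backward()                              # subgradient step on (OBJ)
        opt.step()

    print(f"final train MSE: {mse.item():.4f}, spectral penalty: {spec.item():.4f}")

For larger $d$, forming $H f_{\boldsymbol{\theta}}(X)$ exactly is intractable, so this sketch only illustrates how the regularized loss in (OBJ) is assembled and differentiated.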
3 Theoretical Analysis
To develop some intuition about the general problem, consider the special case of learning
over the set of linear functions. Namely, $\mathcal{F} = \{\langle \boldsymbol{\theta}, \cdot \rangle : \boldsymbol{\theta} \in \mathbb{R}^d\}$. The polynomial representation of any linear function $f_{\boldsymbol{\theta}}(x) = \langle \boldsymbol{\theta}, x \rangle$ is simply $\sum_{i \in [d]} \theta_i x_i$, where only the order-1 coefficients are non-zero. Thus, the assumption that the polynomial representation of $f_{\boldsymbol{\theta}}$ is composed of at most $k$ monomials is equivalent to saying that $\boldsymbol{\theta}$ is $k$-sparse. Thus, in the linear setting, the problem boils down to sparse linear regression [16]. Furthermore, in the linear setting, spectral regularization has an explicit representation in terms of its parameter $\boldsymbol{\theta}$. In particular, here $R(\boldsymbol{\theta}) = \frac{\lambda}{2^d}\|H f_{\boldsymbol{\theta}}(X)\|_1 = \lambda\|\boldsymbol{\theta}\|_1$. Thus, the proposed objective function, (OBJ), when specialized to the linear setting, boils down to the LASSO objective, $\frac{1}{n}\sum_{i=1}^{n} (\langle x_i, \boldsymbol{\theta} \rangle - y_i)^2 + \lambda\|\boldsymbol{\theta}\|_1$, which can be efficiently solved by gradient descent, and is known to be statistically optimal in the finite sample regime [23].
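Since the linear specialization of (OBJ) is exactly the LASSO objective above, the following is a minimal proximal-gradient (ISTA) sketch for that special case; the problem sizes, step size, and the soft-thresholding update are illustrative assumptions rather than the paper's algorithm (the text above refers to gradient descent).

    # Proximal-gradient (ISTA) sketch for the linear specialization of (OBJ):
    # minimize (1/n) * ||X theta - y||^2 + lam * ||theta||_1.
    # Problem sizes, step size, and iteration count are illustrative assumptions.
    import numpy as np

    rng = np.random.default_rng(0)
    d, n, k, lam, sigma = 50, 30, 4, 0.05, 0.05

    theta_true = np.zeros(d)
    theta_true[rng.choice(d, size=k, replace=False)] = rng.normal(size=k)
    X = rng.choice([-1.0, 1.0], size=(n, d))
    y = X @ theta_true + sigma * rng.normal(size=n)

    def soft_threshold(v, t):
        return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

    step = 1.0 / (2 * np.linalg.norm(X, 2) ** 2 / n)   # 1 / Lipschitz constant of the MSE gradient
    theta = np.zeros(d)
    for _ in range(2000):
        grad = 2 * X.T @ (X @ theta - y) / n           # gradient of the empirical MSE
        theta = soft_threshold(theta - step * grad, step * lam)

    print("true support:     ", np.flatnonzero(theta_true))
    print("recovered support:", np.flatnonzero(np.abs(theta) > 1e-3))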