Spectral Regularization Allows Data-frugal Learning over
Combinatorial Spaces
Amirali Aghazadeh1, Nived Rajaraman2, Tony Tu1, and Kannan Ramchandran2
1School of Electrical and Computer Engineering, Georgia Institute of Technology,
2Department of Electrical Engineering and Computer Sciences,
University of California, Berkeley
Abstract
Data-driven machine learning models are being increasingly employed in several im-
portant inference problems in biology, chemistry, and physics which require learning over
combinatorial spaces. Recent empirical evidence (see, e.g., [1, 2, 3]) suggests that regu-
larizing the spectral representation of such models improves their generalization power
when labeled data is scarce. However, despite these empirical studies, the theoretical
underpinning of when and how spectral regularization enables improved generalization
is poorly understood. In this paper, we focus on learning pseudo-Boolean functions and
demonstrate that regularizing the empirical mean squared error by the $L_1$ norm of the
spectral transform of the learned function reshapes the loss landscape and allows for
data-frugal learning, under a restricted secant condition on the learner’s empirical error
measured against the ground truth function. Under a weaker quadratic growth condi-
tion, we show that stationary points which also approximately interpolate the training
data points achieve statistically optimal generalization performance. Complementing
our theory, we empirically demonstrate that running gradient descent on the regularized
loss results in a better generalization performance compared to baseline algorithms in
several data-scarce real-world problems.
1 Introduction
Machine learning (ML) models have become increasingly commonplace in learning pseudo-Boolean functions $f(\cdot)$, which map a $d$-dimensional binary vector $x \in \{\pm 1\}^d$ to a real number $f(x) \in \mathbb{R}$. In biology, ML models are being used to infer the functional properties of macro-molecules (e.g., proteins) from a small set of mutations [4, 5]. In physics, ML models are being used to infer the thermodynamic properties of combinatorial systems defined over a set of binary (Ising) state variables [6, 7]. These highly nonlinear and complex ML models are trained using modern techniques in continuous optimization, yet they inherit the elegant properties of pseudo-Boolean functions over the discrete binary hypercube $\{\pm 1\}^d$. In particular, it is known that the spectral representation of a pseudo-Boolean function is defined as the Walsh-Hadamard transform (WHT) of the resulting vector of combinatorial
function evaluations, that is, $Hf(X)$, where $H$ is a $2^d \times 2^d$ Walsh matrix and the function evaluation vector $f(X) = [f(x) : x \in X]^T$ is sorted in the lexicographic ordering of all the length-$d$ binary strings in $X$. The WHT enables writing the pseudo-Boolean function $f(x)$ as a multi-linear polynomial $f(x) = \sum_{S \subseteq [d]} \alpha_S \prod_{i \in S} x_i$, where $[d] = \{1, 2, \ldots, d\}$ and $\alpha_S \in \mathbb{R}$ is the WHT coefficient corresponding to the monomial $\prod_{i \in S} x_i$ [8].$^1$
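For concreteness, the $d = 2$ instance of this expansion reads (an illustrative special case, not an equation taken from the paper):

$$f(x_1, x_2) = \alpha_{\emptyset} + \alpha_{\{1\}}\, x_1 + \alpha_{\{2\}}\, x_2 + \alpha_{\{1,2\}}\, x_1 x_2,$$

and the four coefficients can be read off from the WHT of the four evaluations $f(+1,+1)$, $f(+1,-1)$, $f(-1,+1)$, $f(-1,-1)$, up to a $1/2^d$ normalization and the chosen indexing convention.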
Recent studies in biology (e.g., protein function prediction) have measured the output of several of these real-world functions $f(x)$ on all the $2^d$ enumerations of the input $x$ and analyzed their spectral representation. These costly high-throughput experiments on combinatorially complete datasets demonstrate a curious phenomenon: such pseudo-Boolean functions often have low-dimensional structures that manifest in the form of an approximately-sparse polynomial representation (i.e., approximately-sparse WHT), with the top-$k$ coefficients corresponding to physically-meaningful interactions [9, 10, 11] (see Appendix A for empirical evidence on protein functions).
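As a concrete illustration of what "approximately sparse in the WHT" means, the following is a minimal sketch that measures the fraction of spectral energy captured by the top-$k$ coefficients of a combinatorially complete evaluation vector; the synthetic landscape, $d$, and $k$ below are assumptions for illustration only (the real protein landscapes are discussed in Appendix A).

    # Illustrative sketch: quantify approximate spectral sparsity as the fraction of
    # squared WHT mass in the top-k coefficients of an evaluation vector f(X).
    # The synthetic landscape, d, and k below are assumptions.
    import numpy as np

    def hadamard(d: int) -> np.ndarray:
        """Sylvester Hadamard matrix H_{2^d}."""
        H = np.array([[1.0]])
        for _ in range(d):
            H = np.block([[H, H], [H, -H]])
        return H

    rng = np.random.default_rng(1)
    d, k = 6, 10
    H = hadamard(d)

    # Stand-in landscape: a few large polynomial coefficients plus a small dense tail.
    alpha = 0.02 * rng.normal(size=2**d)
    alpha[rng.choice(2**d, size=k, replace=False)] += rng.normal(size=k)
    f_X = H @ alpha                      # all 2^d function evaluations

    wht = H @ f_X / 2**d                 # spectral representation of f(X)
    energy = np.sort(wht**2)[::-1]
    print(f"top-{k} coefficients capture {energy[:k].sum() / energy.sum():.1%} of the spectral energy")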
The problem of learning pseudo-Boolean functions with sparse polynomial representations has a rich history from both theoretical and empirical viewpoints. In particular, since the pseudo-Boolean functions being learned are linear functions in the coefficients of their polynomial representations, one may think about this problem as a special instance of sparse linear regression in dimension $2^d$. There are a number of algorithms for sparse linear regression such as LASSO [12], OMP [13], AMP [14], and FISTA [15]. In particular, one may apply LASSO to get statistical guarantees for this problem which scale only linearly in the sparsity $k$ of the underlying ground truth polynomial and logarithmically in the problem dimension, $\log(2^d)$ [16]. Likewise, from another perspective, one may view the polynomial instead as a vector of $2^d$ outcomes $f(X)$; the objective of the learner is to approximate $f(X)$, when the learner can observe any chosen set of a few entries of this vector (corrupted by noise), under the assumption that $f(X)$ itself has a sparse WHT. In the literature, this problem has been referred to by a number of names, and we refer to it as the sparse WHT problem. There are yet again a number of computationally and statistically efficient algorithms for solving this problem, such as SPRIGHT [17] and Sparse-FFT [18], among others.
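To make the sparse-linear-regression view concrete, the sketch below expands each input into its $2^d$ Walsh monomial features and fits an $L_1$-penalized linear model from randomly sampled, noisy evaluations. The dimensions, noise level, regularization weight, and the use of scikit-learn's Lasso solver are illustrative assumptions, not the paper's method.

    # Sparse-linear-regression view: expand each x in {+1,-1}^d into its 2^d Walsh
    # monomial features and fit an L1-penalized linear model (LASSO).
    # d, n, k, the noise level, and the solver settings are illustrative assumptions.
    import numpy as np
    from sklearn.linear_model import Lasso

    def walsh_features(X: np.ndarray) -> np.ndarray:
        """Map each row x in {+1,-1}^d to the 2^d monomials prod_{i in S} x_i."""
        n, d = X.shape
        feats = np.ones((n, 2**d))
        for s in range(2**d):              # s encodes the subset S as a bitmask
            for i in range(d):
                if (s >> i) & 1:
                    feats[:, s] *= X[:, i]
        return feats

    rng = np.random.default_rng(0)
    d, n, k, sigma = 8, 150, 5, 0.1

    # Ground truth: a k-sparse coefficient vector over the 2^d monomials.
    beta = np.zeros(2**d)
    beta[rng.choice(2**d, size=k, replace=False)] = rng.normal(size=k)

    X = rng.choice([-1.0, 1.0], size=(n, d))   # random (mutagenesis-like) samples
    Phi = walsh_features(X)
    y = Phi @ beta + sigma * rng.normal(size=n)

    lasso = Lasso(alpha=0.01, fit_intercept=False, max_iter=10000).fit(Phi, y)
    print("true support:     ", np.flatnonzero(beta))
    print("recovered support:", np.flatnonzero(np.abs(lasso.coef_) > 1e-3))

Note that the learner here commits to the explicit $2^d$-dimensional monomial parameterization, which is exactly the restriction discussed next.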
A common issue with these alternative views of the problem, namely sparse linear regression and sparse WHT recovery, is that the resulting algorithms are not well suited for function approximation.
In particular, real-world physical and biological functions often have additional structure
which is either unknown a priori or cannot be described succinctly, and which nonlinear
function classes are often implicitly biased towards learning. Indeed, several deep neural
networks (DNNs) have been shown to exhibit strong generalization performance in biolog-
ical and physical problems [5, 19, 20]. This leads to a fundamental disconnect between
algorithms for which strong theoretical guarantees are known (e.g., LASSO) and practi-
cal ML models based on nonlinear and complex function classes which are well-suited for
function approximation.
$^1$We use these terms interchangeably for pseudo-Boolean functions: spectral, Fourier, and Walsh-Hadamard.
Another issue is specific to algorithms for solving the sparse WHT problem: some of
these approaches require observing f(·) at any chosen input [17, 21]. In many practical
applications, where data is prohibitively expensive to collect, one is often forced to work
with a handful of random samples. For example, in the case of proteins, a common approach
to measure biological functions is through a procedure called random mutagenesis which
allows only a random sampling of the combinatorial sequence space [20]. This constraint
renders several algorithms for the sparse WHT problem ill-suited for learning, even though
they admit near-optimal computational and sample complexity guarantees.
From a more practical point of view, several of the recent approaches for learning over
combinatorial spaces (see, e.g., [1, 2, 3, 22]) follow similar ideas for improving empiri-
cal generalization. Instead of directly learning the $2^d$-dimensional polynomial, they have
converged on explicitly promoting sparsity in the spectral representation of the learned
compactly-represented nonlinear function—also known as spectral regularization. These
empirical studies motivate several important theoretical questions regarding the underlying
mechanism of spectral regularization in improving generalization in practice. In this paper,
we focus on the question: When and how does spectral regularization improve gen-
eralization? The answer to this question is important from both theoretical and practical
perspectives. Theoretically, it helps us understand how the loss landscape is reshaped as
the result of spectral regularization in favor of data-frugal learning. Practically, it tells us
when and how to use spectral regularization in learning real-world combinatorial functions.
Contributions. In this paper, we theoretically analyze spectral regularization of the empirical risk from the perspective of learning pseudo-Boolean functions. We demonstrate that regularizing with the $L_1$ norm of the spectrum of the learned function improves the generalization performance of the learner. In particular, under a Restricted Secant Inequality (RSI) condition on the learner's empirical error, measured against the ground truth, stationary points of the regularized loss admit optimal generalization performance. We relax this to a weaker quadratic growth (QG) condition and show that, under an additional data-interpolation condition, stationary points of the regularized loss also admit statistical optimality. We empirically demonstrate that these conditions are satisfied for a wide class of nonlinear functions. Our analysis provides new generalization bounds when the underlying learned functions are sparse in the spectral domain. Empirical demonstrations on several real-world datasets complement our theoretical findings.
2 Learning Pseudo-Boolean Functions
Problem statement. We consider the problem of learning structured pseudo-Boolean functions on the binary hypercube $\{\pm 1\}^d$. In the finite sample setting, we assume that the learner has access to a dataset $D_n$ comprised of $n$ labeled data points $(x_i, y_i)_{i=1}^n$, where $x_i \in \{\pm 1\}^d$ is drawn from an input distribution $\mathcal{D}$ and the real-valued output $y_i \in \mathbb{R}$ is a noisy realization of an unknown pseudo-Boolean function of the input.$^2$ Namely, we assume that there is a rich nonlinear function class $\mathcal{F} = \{f_{\boldsymbol{\theta}} : \boldsymbol{\theta} \in \mathbb{R}^m\}$ parameterized by $\boldsymbol{\theta}$, where $f_{\boldsymbol{\theta}} : \{\pm 1\}^d \to \mathbb{R}$. The labels are generated as $y_i = f_{\boldsymbol{\theta}}(x_i) + Z_i$, where $Z_i$ is the noise in the measurement for input $x_i$, assumed to be independent and normally distributed $\sim \mathcal{N}(0, \sigma^2)$.$^3$ The objective of the learner is to learn $\boldsymbol{\theta}$.

$^2$We use the subscript $x_i$ to refer to the $i$th digit in the binary string $x = (x_1, x_2, \cdots, x_d)$.

$^3$In fact, our results only require the noise to be independent subgaussian random variables with variance parameter $\sigma^2$, but for ease of exposition, we impose the Gaussian noise condition.
Multi-linear polynomial representation. Any pseudo-Boolean function $f(\cdot)$ can be uniquely represented as a multi-linear polynomial, $f(x) = \sum_{z \in \{\pm 1\}^d} \alpha_z \prod_{i : z_i = +1} x_i$, where the scalar $\alpha_z \in \mathbb{R}$ is the coefficient corresponding to the monomial $\prod_{i : z_i = +1} x_i$, with order $\sum_{i=1}^{d} \mathbb{1}(z_i = +1)$ [8]. The problem of learning a pseudo-Boolean function $f(\cdot)$ is equivalent to finding the $2^d$ unknown coefficients $\alpha_z$. In particular, the evaluations of $f$ on all vertices of the binary hypercube $\{\pm 1\}^d$ result in a linear system of equations over the $\alpha_z$ variables,

$$[f(x) : x \in \{\pm 1\}^d]^T = H\, [\alpha_z : z \in \{\pm 1\}^d]^T, \qquad (1)$$

where the variables $x$ and $z$ enumerate the vertices of the binary hypercube $\{\pm 1\}^d$ in lexicographic order, and $H \equiv H_{2^d}$ denotes the $2^d \times 2^d$ Hadamard matrix defined recursively as

$$H_{2^{d+1}} = \begin{bmatrix} H_{2^d} & H_{2^d} \\ H_{2^d} & -H_{2^d} \end{bmatrix}, \qquad (2)$$

where $H_1 = +1$. Thus the vector of function evaluations can be obtained by taking the WHT of the vector of coefficients in the polynomial representation of $f$. Furthermore, by inverting the above linear system (note that $H/\sqrt{2^d}$ is an orthonormal matrix, so $H^{-1} = H/2^d$), the vector of polynomial coefficients $[\alpha_z : z \in \{\pm 1\}^d]$ can, in turn, be obtained by taking the WHT of the vector $[f(x) : x \in \{\pm 1\}^d]^T$. For succinctness of notation, for a function $f$, we define $f(X)$ as $[f(x) : x \in \{\pm 1\}^d]^T$, where the binary strings are enumerated in lexicographic order. The coefficients in the polynomial representation of $f(\cdot)$ can be collected in the vector $Hf(X)$. For functions, we use "sparsity in WHT" and "sparse polynomial representation" interchangeably.
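As a sanity check of the relations (1)-(2), the following is a minimal NumPy sketch, assuming a small $d$ and a randomly chosen $k$-sparse coefficient vector (both illustrative choices): it builds the Sylvester Hadamard matrix, forms the evaluation vector $f(X) = H\alpha$, and recovers $\alpha$ using $H^{-1} = H/2^d$.

    # Minimal sketch of Eq. (1)-(2): build H_{2^d}, form f(X) = H alpha for a
    # k-sparse alpha, and recover alpha via the inverse WHT (H^{-1} = H / 2^d).
    # The values of d and k and the random sparse alpha are illustrative assumptions.
    import numpy as np

    def hadamard(d: int) -> np.ndarray:
        """Sylvester Hadamard matrix H_{2^d}, built by the recursion in Eq. (2)."""
        H = np.array([[1.0]])
        for _ in range(d):
            H = np.block([[H, H], [H, -H]])
        return H

    d, k = 4, 3
    rng = np.random.default_rng(0)

    # A k-sparse polynomial: k nonzero coefficients alpha_z at random positions.
    alpha = np.zeros(2**d)
    alpha[rng.choice(2**d, size=k, replace=False)] = rng.normal(size=k)

    H = hadamard(d)
    f_X = H @ alpha               # all 2^d function evaluations, Eq. (1)
    alpha_hat = H @ f_X / 2**d    # inverse WHT: H is symmetric and H H = 2^d I

    assert np.allclose(alpha_hat, alpha)
    print("recovered support:", np.flatnonzero(np.abs(alpha_hat) > 1e-8))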
Sparsity in real-world functions. Even in the noiseless setting, in order to perfectly learn a general pseudo-Boolean function $f(\cdot)$, its evaluations on all input binary vectors $x \in \{\pm 1\}^d$ are required, which may be very expensive. In the presence of additional structures, this significant statistical and computational cost can be mitigated. One such structure that has been observed in real-world biological and physical functions is sparsity in the WHT [9, 10, 11]. In particular, in the theoretical analysis of this paper, we assume that the ground-truth function $f_{\boldsymbol{\theta}}$ has a sparse polynomial representation composed of at most $k$ monomials. Namely, $\|H f_{\boldsymbol{\theta}}(X)\|_0 \le k$, where $k \ll 2^d$ is typically unknown.
Spectral regularization. In attempting to learn pseudo-Boolean functions with a sparse polynomial representation, a natural question to ask is: how can one encourage the learned functions to also have sparse polynomial representations? One solution is to regularize the training loss with an additional functional which promotes sparsity in the polynomial representation of the learned function $\hat{f}$. Denoting the polynomial representation of $\hat{f}$ by $\sum_{z \in \{\pm 1\}^d} \hat{\alpha}_z \prod_{i : z_i = +1} x_i$, a natural regularization functional would be $\|\hat{\alpha}\|_0 \equiv \|H \hat{f}(X)\|_0$, which is also the sparsity of $\hat{f}$ in the WHT.
However, since $\|\cdot\|_0$ is not a continuous function, we replace $\|\cdot\|_0$ with the $L_1$ norm as a proxy. The resulting regularization functional is $\|H \hat{f}(X)\|_1$. When learning over a parameterized function family, $\hat{f} = f_{\boldsymbol{\theta}}$, the resulting regularized ERM we consider in this paper is,

$$\min_{\boldsymbol{\theta}} \; L_n(\boldsymbol{\theta}) + R(\boldsymbol{\theta}), \quad \text{where} \quad R(\boldsymbol{\theta}) \triangleq \frac{\lambda}{2^d} \|H f_{\boldsymbol{\theta}}(X)\|_1, \qquad \text{(OBJ)}$$

where $L_n(\boldsymbol{\theta}) \triangleq \frac{1}{n}\sum_{i=1}^{n} \big[f_{\boldsymbol{\theta}}(x_i) - y_i\big]^2$ is the empirical mean squared error (MSE). Regularizing by $R(\boldsymbol{\theta})$ of the above form is known as spectral regularization (SP).
Remark 1. The scaling of the regularization parameter, $\lambda/2^d$, is motivated from the linearly parameterized setting of $\mathcal{F} = \{\langle \boldsymbol{\theta}, x \rangle : \boldsymbol{\theta} \in \mathbb{R}^d\}$. An explicit computation gives $\frac{\lambda}{2^d}\|H f_{\boldsymbol{\theta}}(X)\|_1 = \lambda\|\boldsymbol{\theta}\|_1$, which is the scaling of the regularization parameter as used in LASSO.
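One way to verify this identity, using that the Sylvester Hadamard matrix is symmetric and satisfies $HH = 2^d I$ (a short derivation, not reproduced from the paper):

$$f_{\boldsymbol{\theta}}(X) = H\alpha \;\Longrightarrow\; H f_{\boldsymbol{\theta}}(X) = HH\alpha = 2^d \alpha \;\Longrightarrow\; \frac{\lambda}{2^d}\,\|H f_{\boldsymbol{\theta}}(X)\|_1 = \lambda\,\|\alpha\|_1 = \lambda\,\|\boldsymbol{\theta}\|_1,$$

where $\alpha$ is the coefficient vector in (1), which for $f_{\boldsymbol{\theta}}(x) = \langle \boldsymbol{\theta}, x \rangle$ equals $\theta_i$ on the order-1 monomials $x_i$ and zero elsewhere.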
The regularization weight $\lambda > 0$ strikes a balance between the empirical MSE and the spectral regularization, and is set empirically using cross validation. Since the aggregate loss $L_n(\boldsymbol{\theta}) + R(\boldsymbol{\theta})$ is subdifferentiable with respect to the model parameters $\boldsymbol{\theta}$, we can apply stochastic (sub-)gradient methods to the aggregate loss. Note, however, that in general, as a function of the parameter $\boldsymbol{\theta}$, both the MSE and the SP regularization are non-convex functions.
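To make the optimization of (OBJ) concrete, here is a minimal sketch, assuming a small MLP for $f_{\boldsymbol{\theta}}$, a $d$ small enough that the full hypercube can be enumerated to evaluate $\|H f_{\boldsymbol{\theta}}(X)\|_1$ exactly, synthetic data, and full-batch Adam in place of a stochastic method; none of these choices are prescribed by the paper.

    # Minimal sketch of minimizing (OBJ): a small MLP f_theta, empirical MSE on n
    # labeled points, plus (lambda / 2^d) * ||H f_theta(X)||_1 computed by evaluating
    # f_theta on the full hypercube (feasible only for small d).
    # The architecture, optimizer, and hyperparameters are illustrative assumptions.
    import itertools
    import torch
    import torch.nn as nn

    d, n, lam, sigma = 6, 40, 0.05, 0.1
    torch.manual_seed(0)

    def hadamard(d: int) -> torch.Tensor:
        """Sylvester Hadamard matrix H_{2^d} as a float tensor."""
        H = torch.tensor([[1.0]])
        for _ in range(d):
            H = torch.cat([torch.cat([H, H], dim=1), torch.cat([H, -H], dim=1)], dim=0)
        return H

    H = hadamard(d)                                                        # 2^d x 2^d
    X_full = torch.tensor(list(itertools.product([1.0, -1.0], repeat=d)))  # all 2^d vertices

    # Synthetic labels from a sparse ground-truth polynomial (an assumption for the demo).
    alpha_true = torch.zeros(2**d)
    alpha_true[torch.randperm(2**d)[:4]] = torch.randn(4)
    y_full = H @ alpha_true                                  # f(X), as in Eq. (1)
    idx = torch.randperm(2**d)[:n]                           # n random training vertices
    X_train, y_train = X_full[idx], y_full[idx] + sigma * torch.randn(n)

    model = nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, 1))
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)

    for step in range(2000):
        opt.zero_grad()
        mse = ((model(X_train).squeeze(-1) - y_train) ** 2).mean()        # L_n(theta)
        spec = lam / 2**d * (H @ model(X_full).squeeze(-1)).abs().sum()   # R(theta)
        (mse + spec).backward()                              # subgradient step on (OBJ)
        opt.step()

    print(f"final train MSE: {mse.item():.4f}, spectral penalty: {spec.item():.4f}")

For larger $d$, forming $H f_{\boldsymbol{\theta}}(X)$ exactly is intractable, so this sketch only illustrates how the regularized loss in (OBJ) is assembled and differentiated.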
3 Theoretical Analysis
To develop some intuition about the general problem, consider the special case of learning
over the set of linear functions. Namely, $\mathcal{F} = \{\langle \boldsymbol{\theta}, \cdot \rangle : \boldsymbol{\theta} \in \mathbb{R}^d\}$. The polynomial representation of any linear function $f_{\boldsymbol{\theta}}(x) = \langle \boldsymbol{\theta}, x \rangle$ is simply $\sum_{i \in [d]} \theta_i x_i$, where only the order-1 coefficients are non-zero. Thus, the assumption that the polynomial representation of $f_{\boldsymbol{\theta}}$ is composed of at most $k$ monomials is equivalent to saying that $\boldsymbol{\theta}$ is $k$-sparse. Thus, in the linear setting, the problem boils down to sparse linear regression [16]. Furthermore, in the linear setting, spectral regularization has an explicit representation in terms of its parameter $\boldsymbol{\theta}$. In particular, here $R(\boldsymbol{\theta}) = \frac{\lambda}{2^d}\|H f_{\boldsymbol{\theta}}(X)\|_1 = \lambda\|\boldsymbol{\theta}\|_1$. Thus, the proposed objective function, (OBJ), when specialized to the linear setting, boils down to the LASSO objective, $\frac{1}{n}\sum_{i=1}^{n} (\langle x_i, \boldsymbol{\theta} \rangle - y_i)^2 + \lambda\|\boldsymbol{\theta}\|_1$, which can be efficiently solved by gradient descent, and is known to be statistically optimal in the finite sample regime [23].
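Since the linear specialization of (OBJ) is exactly the LASSO objective above, the following is a minimal proximal-gradient (ISTA) sketch for that special case; the problem sizes, step size, and the soft-thresholding update are illustrative assumptions rather than the paper's algorithm (the text above refers to gradient descent).

    # Proximal-gradient (ISTA) sketch for the linear specialization of (OBJ):
    # minimize (1/n) * ||X theta - y||^2 + lam * ||theta||_1.
    # Problem sizes, step size, and iteration count are illustrative assumptions.
    import numpy as np

    rng = np.random.default_rng(0)
    d, n, k, lam, sigma = 50, 30, 4, 0.05, 0.05

    theta_true = np.zeros(d)
    theta_true[rng.choice(d, size=k, replace=False)] = rng.normal(size=k)
    X = rng.choice([-1.0, 1.0], size=(n, d))
    y = X @ theta_true + sigma * rng.normal(size=n)

    def soft_threshold(v, t):
        return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

    step = 1.0 / (2 * np.linalg.norm(X, 2) ** 2 / n)   # 1 / Lipschitz constant of the MSE gradient
    theta = np.zeros(d)
    for _ in range(2000):
        grad = 2 * X.T @ (X @ theta - y) / n           # gradient of the empirical MSE
        theta = soft_threshold(theta - step * grad, step * lam)

    print("true support:     ", np.flatnonzero(theta_true))
    print("recovered support:", np.flatnonzero(np.abs(theta) > 1e-3))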