
function evaluations, that is, $Hf(X)$, where $H$ is a $2^d \times 2^d$ Walsh matrix and the function
evaluation vector $f(X) = [f(x) : x \in X]^T$ is sorted in the lexicographic ordering of all
the length-$d$ binary strings in $X$. The WHT enables writing the pseudo-Boolean function
$f(x)$ as a multi-linear polynomial $f(x) = \sum_{S \subseteq [d]} \alpha_S \prod_{i \in S} x_i$, where $[d] = \{1, 2, \ldots, d\}$ and
$\alpha_S \in \mathbb{R}$ is the WHT coefficient corresponding to the monomial $\prod_{i \in S} x_i$ [8].¹
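As a concrete illustration of this representation, the following is a minimal Python sketch, assuming inputs in $\{-1,+1\}^d$ and a toy sparse function chosen purely for illustration. It evaluates $f$ on all $2^d$ inputs, recovers the WHT coefficients by taking inner products with the monomials $\prod_{i \in S} x_i$, and checks that the multi-linear polynomial reproduces $f(X)$ exactly:

import itertools
import numpy as np

d = 4
# All 2^d inputs in {-1, +1}^d; the +1/-1 ordering below stands in for the
# lexicographic ordering of the corresponding binary strings (0 -> +1, 1 -> -1).
X = np.array(list(itertools.product([1.0, -1.0], repeat=d)))

# A toy sparse pseudo-Boolean function with three monomials (illustrative choice).
def f(x):
    return 0.5 - x[1] + 2.0 * x[0] * x[2]

fX = np.array([f(x) for x in X])

# One column per subset S of [d]: the monomial prod_{i in S} x_i evaluated on X.
subsets = [S for r in range(d + 1) for S in itertools.combinations(range(d), r)]
Psi = np.array([[np.prod(x[list(S)]) if S else 1.0 for S in subsets] for x in X])

# The monomials are orthogonal over {-1,+1}^d, so the WHT coefficients are just
# rescaled inner products: alpha_S = 2^{-d} <f(X), prod_{i in S} x_i>.
alpha = Psi.T @ fX / 2**d

for S, a in zip(subsets, alpha):
    if abs(a) > 1e-9:
        print(S, a)          # the three true monomials and their coefficients

# f(X) is reconstructed exactly from its multi-linear polynomial.
assert np.allclose(Psi @ alpha, fX)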
Recent studies in biology (e.g., protein function prediction) have measured the output of several
such real-world functions $f(x)$ over all $2^d$ enumerations of the input $x$ and analyzed their
spectral representation. These costly high-throughput experiments on combinatorially complete
datasets demonstrate a curious phenomenon: such pseudo-Boolean functions often have low-dimensional
structure that manifests in the form of an approximately-sparse polynomial representation (i.e., an
approximately-sparse WHT), with the top-$k$ coefficients corresponding to physically meaningful
interactions [9, 10, 11] (see Appendix A for empirical evidence on protein functions).
The problem of learning pseudo-Boolean functions with sparse polynomial representations
has a rich history from both theoretical and empirical viewpoints. In particular, since
pseudo-Boolean functions in their polynomial representations are linear in their coefficients,
one may view this problem as a special instance of sparse linear regression in dimension $2^d$.
There are a number of algorithms for sparse linear regression, such as LASSO [12], OMP [13],
AMP [14], and FISTA [15]. In particular, one may apply LASSO to obtain statistical guarantees
for this problem that scale only linearly in the sparsity $k$ of the underlying ground-truth
polynomial and logarithmically in the problem dimension, i.e., as $\log(2^d)$ [16]. Likewise,
from another perspective, one may view the polynomial instead as a vector of $2^d$ outcomes
$f(X)$; the objective of the learner is to approximate $f(X)$ when it can observe any chosen
set of a few entries of this vector (corrupted by noise), under the assumption that $f(X)$
itself has a sparse WHT. In the literature, this problem has been referred to by a number of
names, and we refer to it as the sparse WHT problem. There are yet again a number of
computationally and statistically efficient algorithms for solving this problem, such as
SPRIGHT [17] and Sparse-FFT [18], among others.
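To make the sparse-regression view concrete, here is a minimal sketch (not an implementation of [16], [17], or [18]; the toy function, sample size, and regularization strength are assumptions chosen for illustration) that draws a few noisy evaluations of a $k$-sparse pseudo-Boolean function and fits its WHT coefficients with LASSO in the Walsh (monomial) basis:

import itertools
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
d, sigma, n = 8, 0.1, 120          # illustrative: 2^d = 256 coefficients, n << 2^d samples

# Ground-truth k-sparse polynomial (k = 3), chosen here purely for illustration.
def f(x):
    return 1.5 * x[0] * x[4] - 2.0 * x[2] + 0.7 * x[1] * x[3] * x[6]

# n random inputs in {-1,+1}^d, expanded in the Walsh (monomial) basis.
subsets = [S for r in range(d + 1) for S in itertools.combinations(range(d), r)]
X = rng.choice([-1.0, 1.0], size=(n, d))
Phi = np.array([[np.prod(x[list(S)]) if S else 1.0 for S in subsets] for x in X])
y = np.array([f(x) for x in X]) + sigma * rng.standard_normal(n)

# Sparse linear regression in dimension 2^d over the WHT coefficients.
model = Lasso(alpha=0.05, fit_intercept=False, max_iter=10_000).fit(Phi, y)
support = [subsets[j] for j in np.flatnonzero(np.abs(model.coef_) > 0.1)]
print(support)                      # ideally the true monomials {0,4}, {2}, and {1,3,6}

In this toy setting the support is typically recovered from far fewer samples than the $2^d$ basis functions, consistent with guarantees that scale with $k$ and $\log(2^d)$.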
A common issue with these alternative views of the problem, such as sparse linear regression
or sparse WHT, is that the resulting algorithms are not well suited for function approximation.
In particular, real-world physical and biological functions often have additional structure
which is either unknown a priori or cannot be described succinctly, and toward which nonlinear
function classes are often implicitly biased. Indeed, several deep neural
networks (DNNs) have been shown to exhibit strong generalization performance in biolog-
ical and physical problems [5, 19, 20]. This leads to a fundamental disconnect between
algorithms for which strong theoretical guarantees are known (e.g., LASSO) and practi-
cal ML models based on nonlinear and complex function classes which are well-suited for
function approximation.
¹We use these terms interchangeably for pseudo-Boolean functions: spectral, Fourier, and Walsh-Hadamard.