
Theorem 1 (Main Result). Let $X \in \mathbb{R}^{n \times d}$ be a data matrix and $y \in \mathbb{R}^n$ be a target vector. Let $f$ be an $L$-Lipschitz non-linearity with $f(0) = 0$ and let $\mathrm{OPT} = \min_w \|f(Xw) - y\|_2^2$, attained by $w^*$. There is an algorithm (Algorithm 1) that, based on the leverage scores of $X$, observes $m = O\!\left(\frac{d^2 \log(1/\varepsilon)}{\varepsilon^4}\right)$ random entries from $y$ and returns with probability $> 9/10$ a vector $\hat{w}$ satisfying:
$$\|f(X\hat{w}) - y\|_2^2 \le C \cdot \mathrm{OPT} + \varepsilon L^2 \|Xw^*\|_2^2.$$
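To make the sampling-and-fitting pipeline behind Theorem 1 concrete, the following Python sketch illustrates the general approach: compute the leverage scores of $X$, sample $m$ entries of $y$ with probability proportional to those scores, and fit $\hat{w}$ on the importance-weighted subsample. This is only an illustrative sketch, not the paper's Algorithm 1; in particular, the fitting step (reweighted squared loss minimized by gradient descent with a finite-difference estimate of $f'$), the function names, and the default parameters are assumptions made for the example.

```python
import numpy as np

def leverage_scores(X):
    # Leverage score of row i equals the squared norm of the i-th row of Q,
    # where X = QR is a reduced QR decomposition.
    Q, _ = np.linalg.qr(X, mode="reduced")
    return np.sum(Q ** 2, axis=1)

def active_single_neuron_fit(X, observe_label, m, f, n_iters=500, lr=1e-3, seed=0):
    # observe_label(i) returns y_i; only the m sampled indices are ever queried.
    # f is assumed to be vectorized (applied entrywise to a numpy array).
    rng = np.random.default_rng(seed)
    n, d = X.shape
    tau = leverage_scores(X)
    p = tau / tau.sum()                  # leverage-score sampling distribution
    idx = rng.choice(n, size=m, p=p)     # sample m row indices (with replacement)
    y_obs = np.array([observe_label(i) for i in idx])
    weights = 1.0 / (m * p[idx])         # standard importance weights

    # Placeholder fitting step: minimize the reweighted empirical loss
    #   sum_j weights_j * (f(x_j^T w) - y_j)^2
    # by gradient descent, estimating f' with a finite difference.
    w = np.zeros(d)
    eps = 1e-6
    for _ in range(n_iters):
        z = X[idx] @ w
        r = f(z) - y_obs
        fprime = (f(z + eps) - f(z)) / eps
        w -= lr * 2.0 * X[idx].T @ (weights * r * fprime)
    return w
```

In an experiment, the natural baseline is the same pipeline with `p` replaced by the uniform distribution over rows, which is the comparison discussed in Section 4.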
The assumption $f(0) = 0$ in Theorem 1 is without loss of generality. If $f(0)$ is non-zero, we can simply solve a transformed problem with $y' = y - f(0)$ and $f'(x) = f(x) - f(0)$; note that $f'(0) = 0$, $f'$ is still $L$-Lipschitz, and $\|f'(Xw) - y'\|_2^2 = \|f(Xw) - y\|_2^2$ for every $w$, so the two objectives coincide. The theorem mirrors previous results in the linear setting, and in contrast to prior work on agnostically learning single neuron models, does not require any assumptions on $X$ [20, 47]. In addition to the multiplicative error $C$, the theorem has an additive error term of $C\varepsilon L^2 \|Xw^*\|_2^2$. An additive error term is necessary; as we will show in Section 5, it is provably impossible to achieve purely relative error with a number of samples polynomial in $d$. Similar additive error terms arise in related work on leverage score sampling for problems like logistic regression [34, 36]. On the other hand, we believe the $d^2$ dependence in our bound is not necessary, and should be improvable to linear in $d$. The dependence on $\varepsilon$ is also likely improvable.
In Section 4 we support our main theoretical result with an empirical evaluation on both synthetic data and several test problems that require approximating PDE quantity-of-interest surfaces. In all settings,
leverage score sampling outperforms ordinary uniform sampling, often improving on the error obtained
using a fixed number of samples by an order of magnitude or more.
1.2 Related Work
Single neuron models have been widely studied in a number of communities, including machine learning, computational science, and approximation theory. These models can be generalized to the "multi-index" case, where $g(x) = f_1(\langle x, w_1\rangle) + \ldots + f_q(\langle x, w_q\rangle)$ [5, 31, 47], or to the case when $f$ is not known in advance (but might be from a parameterized function family, such as low-degree polynomials) [11, 15, 30]. While we do not address these generalizations in this paper, we hope that our work can provide a foundation for further work in this direction.
Beyond sample complexity, there has also been significant interest in understanding the computational complexity of fitting single neuron models, including in the agnostic setting [50]. There are both known hardness results for general data distributions [18, 19, 27], as well as positive results on efficient algorithms under additional assumptions [17, 20, 26]. Our work differs from this setting in three ways: 1) we focus on sample complexity; 2) we make no assumptions on the data distribution $\mathcal{D}$ (i.e., no assumptions on the data matrix $X$); 3) we allow for active sampling strategies. Obtaining results under i.i.d. sampling, even for well-behaved distributions, inherently requires additional assumptions, like $w^* = \arg\min_w \|f(Xw) - y\|_2$ being bounded in norm or $X$ having bounded condition number.
In computational science, there has been significant work on active learning and experimental design for fitting other classes of functions adept at modeling high-dimensional physical phenomena [2, 8, 9, 29]. Most such results focus on minimizing squared prediction error. For situations involving model mismatch, the agnostic (adversarial noise) setting is more appropriate than assuming i.i.d. zero-mean noise, which is more typical in classical statistical results on experimental design [42]. To obtain results in the agnostic setting for linear function families, much of the prior work in computational science uses "coherence motivated" sampling techniques [10, 28, 44]. Such methods are equivalent to leverage score sampling [3].
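For reference, we recall the standard definition of leverage scores (stated here for the reader's convenience, not taken from any of the cited works): the leverage score of the $i$-th row $x_i$ of $X$ is
$$\tau_i(X) \;=\; x_i^\top (X^\top X)^{+} x_i,$$
where $(\cdot)^{+}$ denotes the Moore-Penrose pseudoinverse; equivalently, $\tau_i(X)$ is the squared norm of the $i$-th row of any matrix $Q$ whose columns form an orthonormal basis for the column span of $X$.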
Leverage score sampling and related methods have found widespread applications in the design of efficient algorithms, including in the construction of graph sparsifiers, coresets, randomized low-rank approximations, and in solving over-constrained linear regression problems [12, 13, 16, 21, 22, 24, 37, 46]. Recently, there has been renewed attention on using leverage score sampling for solving the active linear regression problem for various loss functions [6, 23, 35, 38]. While all of these results are heavily tailored to linear models, a small body of work addresses the problem of active learning for non-linear problems. This includes problems of the form $\|f(Xw - y)\|_2^2$, where $f$ is Lipschitz [34, 36]. While not equivalent to our problem, this formulation captures important tasks like logistic regression. Our work can be viewed as