Active Learning for Single Neuron Models with Lipschitz Non-Linearities
Aarshvi Gajjar, Chinmay Hegde, and Christopher Musco
New York University
Abstract
We consider the problem of active learning for single neuron models, also sometimes called “ridge
functions”, in the agnostic setting (under adversarial label noise). Such models have been shown to be
broadly effective in modeling physical phenomena, and for constructing surrogate data-driven models for
partial differential equations.
Surprisingly, we show that for a single neuron model with any Lipschitz non-linearity (such as the
ReLU, sigmoid, absolute value, low-degree polynomial, among others), strong provable approximation
guarantees can be obtained using a well-known active learning strategy for fitting linear functions in
the agnostic setting. Namely, we can collect samples via statistical leverage score sampling, which has
been shown to be near-optimal in other active learning scenarios. We support our theoretical results
with empirical simulations showing that our proposed active learning strategy based on leverage score
sampling outperforms (ordinary) uniform sampling when fitting single neuron models.
1 Introduction
This paper considers active learning methods for functions of the form $g(x) = f(\langle w, x \rangle)$, where $w$ is a weight vector and $f$ is a non-linearity. For a given distribution $\mathcal{D}$ on $\mathbb{R}^d \times \mathbb{R}$, a random vector $(x, y)$ sampled from $\mathcal{D}$, and scalar function $f : \mathbb{R} \to \mathbb{R}$, our goal is to find $w$ which minimizes the expected squared error:
$$\mathbb{E}_{x, y \sim \mathcal{D}}\left[\left(f(\langle w, x \rangle) - y\right)^2\right].$$
Functions of the form $f(\langle w, x \rangle)$ find applications in a variety of settings under various names: they are called "single neuron" or "single index" models, "ridge functions", and "plane waves" in different communities [5, 40, 41, 43, 50]. Single neuron models are studied in machine learning theory as tractable examples of simple neural networks [17, 26]. Moreover, these models are known to be adept at modeling a variety of physical phenomena [14] and for that reason are effective, e.g., in building surrogate models for efficiently solving parametric partial differential equations (PDEs) and for approximating quantity of interest (QoI) surfaces for uncertainty quantification, model-driven design, and data assimilation [4, 11, 15, 32, 33, 39].
In these applications, single neuron models are used to fit complex functions over $\mathbb{R}^d$ based on queries from those functions. Often, the cost of obtaining a query $(x, y)$ from the target function dominates the computational cost of fitting the model: each training point collected requires numerically solving the PDE under consideration with parameters given by $x$ [1, 9]. At the same time, we often have freedom in exactly how the query is obtained; we are not restricted to simply sampling from $\mathcal{D}$, but rather can specify a target location $x$ and sample $y \sim \mathcal{D} \mid x$ (or compute $y$ deterministically, since in most applications it is a deterministic function of $x$). Given these considerations, the focus of our work is on developing efficient active learning and experimental design methods¹ for fitting single neuron models using as few carefully chosen $(x, y)$ observations as possible.
¹ We use "experimental design" to refer to methods that collect samples in a non-adaptive way. In other words, a set of points $x_1, \ldots, x_s$ are specified upfront and the corresponding $y$ values are observed all at once. In contrast, in standard active learning methods, the choice of $x_j$ can depend on the response values of all prior points $x_1, \ldots, x_{j-1}$.
We study this active learning problem in the challenging agnostic learning or adversarial noise setting. Again, this is motivated by applications of single neuron models in computational science. Typically, while it can be approximated by a single neuron model, the QoI or surrogate function under consideration is not itself of the form $f(\langle w, x \rangle)$. For this reason, the agnostic setting has become the standard in work on PDE models involving other common function families, like structured polynomials [2, 8, 9, 29]. In the agnostic setting, for a constant $C$ (or more stringently, $C = 1 + \epsilon$), our goal is always to return with high probability some $\tilde{w}$ such that:
$$\mathbb{E}_{x, y \sim \mathcal{D}}\left[\left(f(\langle \tilde{w}, x \rangle) - y\right)^2\right] \le C \cdot \min_{w} \; \mathbb{E}_{x, y \sim \mathcal{D}}\left[\left(f(\langle w, x \rangle) - y\right)^2\right].$$
1.1 Our Contributions
For ease of exposition, we consider the case when $\mathcal{D}$ is a uniform distribution over $n$ points in $\mathbb{R}^d$. This is without loss of generality, since any continuous distribution can be approximated by the uniform distribution over a sufficiently large, but finite, subset of points in $\mathbb{R}^d$.² In this case, we have the following problem statement.
Problem 1 (Single Neuron Regression). Given a matrix $X \in \mathbb{R}^{n \times d}$ and query access to a target vector $y \in \mathbb{R}^n$, for a given function $f : \mathbb{R} \to \mathbb{R}$, find $w \in \mathbb{R}^d$ to minimize $\|f(Xw) - y\|_2^2$ using as few queries from $y$ as possible.
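To make the optimization in Problem 1 concrete, below is a minimal sketch of fitting $w$ to minimize $\|f(Xw) - y\|_2^2$ with plain (sub)gradient descent on fully observed data. This is not the paper's Algorithm 1; the ReLU non-linearity, step size, initialization, and iteration count are illustrative assumptions.

```python
import numpy as np

def fit_single_neuron(X, y, f, f_grad, steps=2000, lr=0.1, seed=0):
    """Minimize ||f(Xw) - y||_2^2 by (sub)gradient descent.
    Illustrative only; no approximation guarantee is claimed."""
    n, d = X.shape
    # Small random init (a zero init would give a zero gradient for ReLU).
    w = 0.1 * np.random.default_rng(seed).standard_normal(d)
    for _ in range(steps):
        z = X @ w                          # pre-activations Xw
        r = f(z) - y                       # residual f(Xw) - y
        grad = 2 * X.T @ (f_grad(z) * r)   # chain rule for the squared loss
        w -= lr * grad / n
    return w

# Example with a ReLU non-linearity (1-Lipschitz, f(0) = 0).
relu = lambda t: np.maximum(t, 0.0)
relu_grad = lambda t: (t > 0).astype(float)

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 10))
w_true = rng.standard_normal(10)
y = relu(X @ w_true) + 0.1 * rng.standard_normal(500)   # labels with model mismatch

w_hat = fit_single_neuron(X, y, relu, relu_grad)
print(np.sum((relu(X @ w_hat) - y) ** 2))                # fitted squared error
```

The point of Problem 1 is that each entry of $y$ is expensive to obtain, so the active learning question is which entries to query before running any such fitting procedure.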
When $f$ is the identity function, Problem 1 reduces to active least squares regression, which has received a lot of recent attention in computer science and machine learning. In the agnostic setting, state-of-the-art results can be obtained via "leverage score" sampling, also known as "coherence motivated" or "effective resistance" sampling [3, 10, 28, 44]. The idea behind leverage score sampling methods is to collect samples from $y$ randomly but non-uniformly, using an importance sampling distribution based on the rows of $X$. More "unique" rows are selected with higher probability. Formally, rows are selected with probability proportional to their statistical leverage scores:
Definition 1 (Statistical Leverage Score). The leverage score $\tau_i(X)$ of the $i^{th}$ row $x_i$ of a matrix $X \in \mathbb{R}^{n \times d}$ is equal to:
$$\tau_i(X) = x_i^T (X^T X)^{-1} x_i = \max_{w \in \mathbb{R}^d} \frac{[Xw]_i^2}{\|Xw\|_2^2}.$$
Here, $[Xw]_i$ denotes the $i^{th}$ entry of the vector $Xw$.
We always have that $0 \le \tau_i(X) \le 1$, and a well-known property of the statistical leverage scores is that $\sum_{i=1}^n \tau_i(X) = \operatorname{rank}(X) \le d$. The leverage score of a row is large (close to $1$) if that row has a large inner product with some vector in $\mathbb{R}^d$, as compared to that vector's inner product with all other rows in the matrix $X$. This means that the particular row plays a significant role in forming the row space of $X$. For linear regression, it can be shown that when $X$ has $d$ columns, leverage score sampling yields a sample complexity of $O(d \log d + d/\epsilon)$ to find $\hat{w}$ satisfying $\|X\hat{w} - y\|_2^2 \le (1 + \epsilon) \min_w \|Xw - y\|_2^2$; moreover, this is optimal up to the log factor [7].
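As a concrete illustration of Definition 1 and the properties above, the leverage scores can be computed as the squared row norms of an orthonormal basis for the column span of $X$ (e.g., from a thin QR factorization). The sketch below, written with NumPy purely for illustration, also checks that the scores lie in $[0, 1]$ and sum to $\operatorname{rank}(X)$.

```python
import numpy as np

def leverage_scores(X):
    """Return tau_1(X), ..., tau_n(X) as squared row norms of an orthonormal
    basis Q for the column span of X; for full-column-rank X this matches
    x_i^T (X^T X)^{-1} x_i."""
    Q, _ = np.linalg.qr(X)            # thin QR: Q has orthonormal columns
    return np.sum(Q ** 2, axis=1)

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 20))
tau = leverage_scores(X)

print(tau.min() >= 0, tau.max() <= 1 + 1e-12)            # 0 <= tau_i <= 1
print(np.isclose(tau.sum(), np.linalg.matrix_rank(X)))   # sum equals rank(X) <= d
```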
Our main contribution is to establish that, surprisingly, when combined with a novel regularization strategy, leverage score sampling also yields theoretical guarantees for the more general case (Problem 1) for a broad class of non-linearities $f$. In fact, we only require that $f$ is $L$-Lipschitz for constant $L$, a property that holds for most non-linearities used in practice (ReLU, sigmoid, absolute value, low-degree polynomials, etc.). We prove the following main result, which shows that $\tilde{O}(d^2/\epsilon^4)$ samples, collected via leverage score sampling, suffice for provably learning a single neuron model with a Lipschitz non-linearity.

² For other function families (e.g., polynomials or sparse Fourier functions) there has been recent work on active learning algorithms based on leverage score sampling that skip the discrete approximation step by developing algorithms directly tailored to common continuous distributions, like uniform or Gaussian [23]. We believe our techniques should be directly extendable to give comparable results for single neuron models.
Theorem 1 (Main Result). Let $X \in \mathbb{R}^{n \times d}$ be a data matrix and $y \in \mathbb{R}^n$ be a target vector. Let $f$ be an $L$-Lipschitz non-linearity with $f(0) = 0$, and let $w^* \in \arg\min_w \|f(Xw) - y\|_2^2$ and $\mathrm{OPT} = \|f(Xw^*) - y\|_2^2$. There is an algorithm (Algorithm 1) that, based on the leverage scores of $X$, observes $m = O\!\left(\frac{d^2 \log(1/\epsilon)}{\epsilon^4}\right)$ random entries from $y$ and returns with probability $> 9/10$ a vector $\hat{w}$ satisfying:
$$\|f(X\hat{w}) - y\|_2^2 \le C \cdot \mathrm{OPT} + \epsilon L^2 \|Xw^*\|_2^2.$$
The assumption $f(0) = 0$ in Theorem 1 is without loss of generality. If $f(0)$ is non-zero, we can simply solve a transformed problem with $y' = y - f(0)$ and $f'(x) = f(x) - f(0)$. The theorem mirrors previous results in the linear setting and, in contrast to prior work on agnostically learning single neuron models, does not require any assumptions on $X$ [20, 47]. In addition to the multiplicative error $C$, the theorem has an additive error term of $C\epsilon L^2 \|Xw^*\|_2^2$. An additive error term is necessary; as we will show in Section 5, it is provably impossible to achieve purely relative error with a number of samples polynomial in $d$. Similar additive error terms arise in related work on leverage score sampling for problems like logistic regression [34, 36]. On the other hand, we believe the $d^2$ dependence in our bound is not necessary, and should be improvable to linear in $d$. The dependence on $\epsilon$ is also likely improvable.
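For intuition about what such an algorithm looks like, the following is a minimal sketch of the sample-then-fit pipeline analyzed in Theorem 1: query only $m$ entries of $y$, chosen with probability proportional to the leverage scores of $X$, and fit $w$ on the reweighted subsample. It deliberately omits the regularization strategy used by the paper's Algorithm 1, and the helper names, solver, and hyperparameters are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def leverage_scores(X):
    Q, _ = np.linalg.qr(X)
    return np.sum(Q ** 2, axis=1)

def active_fit(X, query_y, f, f_grad, m, steps=2000, lr=0.1, seed=0):
    """Observe only m labels, sampled proportionally to leverage scores,
    then minimize the reweighted subsampled objective ||S f(Xw) - S y||_2^2.
    (Omits the regularization of Algorithm 1; illustrative only.)"""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    tau = leverage_scores(X)
    p = tau / tau.sum()
    idx = rng.choice(n, size=m, p=p)       # rows whose labels we query
    weights = 1.0 / (m * p[idx])           # importance weights as in Definition 2
    Xs, ys = X[idx], query_y(idx)          # only these m entries of y are observed
    w = 0.1 * rng.standard_normal(d)       # small random init
    for _ in range(steps):
        z = Xs @ w
        r = f(z) - ys
        grad = 2 * Xs.T @ (weights * f_grad(z) * r)
        w -= lr * grad / weights.sum()
    return w

# Usage sketch: labels come from an expensive oracle, queried only at idx.
# w_hat = active_fit(X, lambda idx: y_full[idx], relu, relu_grad, m=200)
```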
In Section 4 we support our main theoretical result with an empirical evaluation on both synthetic data
and several test problems that require approximating PDE quantity of interest surfaces. In all settings,
leverage score sampling outperforms ordinary uniform sampling, often improving on the error obtained
using a fixed number of samples by an order of magnitude or more.
1.2 Related Work
Single neuron models have been widely studied in a number of communities, including machine learning, computational science, and approximation theory. These models can be generalized to the "multi-index" case, where $g(x) = f_1(\langle x, w_1 \rangle) + \ldots + f_q(\langle x, w_q \rangle)$ [5, 31, 47], or to the case when $f$ is not known in advance (but might be from a parameterized function family, such as low-degree polynomials) [11, 15, 30]. While we do not address these generalizations in this paper, we hope that our work can provide a foundation for further work in this direction.
Beyond sample complexity, there has also been a lot of interest in understanding the computational complexity of fitting single neuron models, including in the agnostic setting [50]. There are both known hardness results for general data distributions [18, 19, 27], as well as positive results on efficient algorithms under additional assumptions [17, 20, 26]. Our work differs from this setting in three ways: 1) we focus on sample complexity; 2) we make no assumptions on the data distribution $\mathcal{D}$ (i.e., no assumptions on the data matrix $X$); 3) we allow for active sampling strategies. Obtaining results under i.i.d. sampling, even for well-behaved distributions, inherently requires additional assumptions, like $w^* = \arg\min_w \|f(Xw) - y\|_2^2$ being bounded in norm or $X$ having bounded condition number.
In computational science, there has been significant work on active learning and experimental design for fitting other classes of functions adept at modeling high dimensional physical phenomena [2, 8, 9, 29]. Most such results focus on minimizing squared prediction error. For situations involving model mismatch, the agnostic (adversarial noise) setting is more appropriate than assuming i.i.d. zero mean noise, which is more typical in classical statistical results on experimental design [42]. To obtain results in the agnostic setting for linear function families, much of the prior work in computational science uses "coherence motivated" sampling techniques [10, 28, 44]. Such methods are equivalent to leverage score sampling [3].
Leverage score sampling and related methods have found widespread applications in the design of efficient algorithms, including in the construction of graph sparsifiers, coresets, and randomized low-rank approximations, and in solving over-constrained linear regression problems [12, 13, 16, 21, 22, 24, 37, 46]. Recently, there has been renewed attention on using leverage score sampling for solving the active linear regression problem for various loss functions [6, 23, 35, 38]. While all of these results are heavily tailored to linear models, a small body of work addresses the problem of active learning for non-linear problems. This includes problems of the form $\|f(Xw - y)\|_2^2$, where $f$ is Lipschitz [34, 36]. While not equivalent to our problem, this formulation captures important tasks like logistic regression. Our work can be viewed as broadening the range of non-linear problems for which leverage score sampling yields natural worst-case guarantees.
Finally, we mention a few papers that directly address the active learning problem for functions of the form $f(Xw)$. [11] studies adaptive query algorithms, but in a different setting than our paper. Specifically, they address the easier (noiseless) realizable setting, where samples are obtained from a function of the form $f(Xw)$ for a ground truth $w$. They also make stronger smoothness assumptions on $f$, although their algorithm can handle the case when $f$ is not known in advance. Follow-up work also addresses the multi-index problem in the same setting [25]. Also motivated by applications in efficient PDE surrogate modeling, [47] studies the multi-index problem, but again in the realizable setting. Their techniques can handle mean-centered i.i.d. noise, but not adversarial noise or model mismatch.
2 Preliminaries
Notation. Throughout the paper, we use bold lower-case letters for vectors and bold upper-case letters for matrices. We let $e_i$ denote the $i^{th}$ standard basis vector (all zeros, but with a $1$ in position $i$). The dimension of $e_i$ will be clear from context. For a vector $x \in \mathbb{R}^n$ with entries $x_1, \ldots, x_n$, $\|x\|_2 = \left(\sum_{i=1}^n x_i^2\right)^{1/2}$ denotes the Euclidean norm. $\mathcal{B}_d(r)$ denotes a ball of radius $r$ centered at $0$, i.e., $\mathcal{B}_d(r) = \{x \in \mathbb{R}^d : \|x\| \le r\}$. For a scalar function $f : \mathbb{R} \to \mathbb{R}$ and vector $y$, we use $f(y)$ to denote the vector obtained by applying $f$ to each entry of $y$. For a fixed matrix $X$, unobserved target vector $y$, and non-linearity $f$, we denote $\|f(Xw^*) - y\|_2^2$ by $\mathrm{OPT}$, where $w^* \in \arg\min_w \|f(Xw) - y\|_2^2$.
Importance Sampling. As discussed, our approach is based on importance sampling according to statistical leverage scores: we fix a set of probabilities $p_1, \ldots, p_n$, use them to sample $m$ rows from the regression problem $\|f(Xw) - y\|_2^2$, and solve a reweighted least squares problem to find a near-optimal choice for $w$. Formally, this can be implemented by defining a sampling matrix of the following form:
Definition 2 (Importance Sampling Matrix). Let $p_1, \ldots, p_n \in [0, 1]$ be a given set of probabilities (so that $\sum_i p_i = 1$). A matrix $S$ is an $m \times n$ importance sampling matrix if each of its rows is chosen to equal $\frac{1}{\sqrt{m \cdot p_i}} \cdot e_i$ with probability proportional to $p_i$.
To compute an approximation to $\min_w \|f(Xw) - y\|_2^2$, we will solve an optimization problem involving the sub-sampled objective $\|Sf(Xw) - Sy\|_2^2$. It is easily verified that for any choice of $p_1, \ldots, p_n$ and any vector $z$, $\mathbb{E}\left[\|Sz\|_2^2\right] = \|z\|_2^2$. We will use this fact repeatedly.
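The sketch below constructs an explicit importance sampling matrix as in Definition 2 and empirically checks the unbiasedness fact $\mathbb{E}\left[\|Sz\|_2^2\right] = \|z\|_2^2$ by averaging over many independent draws of $S$; the probabilities and test vector are arbitrary choices made only for illustration.

```python
import numpy as np

def sampling_matrix(p, m, rng):
    """m x n importance sampling matrix: each row equals e_i / sqrt(m * p_i),
    with the index i drawn according to the probabilities p."""
    n = len(p)
    idx = rng.choice(n, size=m, p=p)
    S = np.zeros((m, n))
    S[np.arange(m), idx] = 1.0 / np.sqrt(m * p[idx])
    return S

rng = np.random.default_rng(0)
n, m = 1000, 50
p = rng.random(n) + 0.5                # arbitrary positive weights ...
p /= p.sum()                           # ... normalized to sum to 1
z = rng.standard_normal(n)

# Average ||Sz||_2^2 over independent draws of S; it concentrates around ||z||_2^2.
est = np.mean([np.sum((sampling_matrix(p, m, rng) @ z) ** 2) for _ in range(2000)])
print(est, np.sum(z ** 2))
```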
Properties of Leverage Scores. Our importance sampling mechanism is based on sampling by the leverage scores $\tau_1(X), \ldots, \tau_n(X)$ of the design matrix $X \in \mathbb{R}^{n \times d}$. For any full-rank $d \times d$ matrix $R$, we have that $\tau_i(XR) = \tau_i(X)$. This is clear from Definition 1 and implies that $\tau_i(X)$ only depends on the column span of $X$. In our proofs, this property will allow us to easily reduce to the setting where $X$ is assumed to be a matrix with orthonormal columns.

We will also use the following well-known fact about using leverage score sampling to construct a "subspace embedding" for a matrix $X$.
Lemma 1 (Subspace Embedding; see e.g. Theorem 17 in Woodruff [49]). Given $X \in \mathbb{R}^{n \times d}$ with leverage scores $\tau_1(X), \ldots, \tau_n(X)$, let $p_i = \tau_i(X) / \sum_i \tau_i(X)$. Let $S \in \mathbb{R}^{m \times n}$ be a sampling matrix constructed as in Definition 2 using the probabilities $p_1, \ldots, p_n$. For any $0 < \gamma < 1$, as long as $m \ge c \cdot d \log(d/\delta)/\gamma^2$ for some fixed constant $c$, then with probability $1 - \delta$ we have that for all $w \in \mathbb{R}^d$,
$$(1 - \gamma)\|Xw\|_2^2 \le \|SXw\|_2^2 \le (1 + \gamma)\|Xw\|_2^2.$$
Lemma 1 establishes that, with high probability, leverage score sampling preserves the norm of any vector $Xw$ in the column span of $X$. This guarantee can be proven using an argument that reduces to a matrix Chernoff bound [46]. It is a critical component of previously known active learning guarantees for fitting linear functions using leverage score sampling [45].
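The snippet below gives a rough empirical illustration of Lemma 1, reusing the hypothetical leverage_scores and sampling_matrix helpers sketched earlier: after sampling rows with leverage-score probabilities, the ratio $\|SXw\|_2^2 / \|Xw\|_2^2$ stays close to $1$ over many random directions $w$. Checking random directions is of course weaker than the "for all $w$" guarantee of the lemma; this is a sanity check, not a proof.

```python
import numpy as np

def leverage_scores(X):
    Q, _ = np.linalg.qr(X)
    return np.sum(Q ** 2, axis=1)

def sampling_matrix(p, m, rng):
    n = len(p)
    idx = rng.choice(n, size=m, p=p)
    S = np.zeros((m, n))
    S[np.arange(m), idx] = 1.0 / np.sqrt(m * p[idx])
    return S

rng = np.random.default_rng(0)
X = rng.standard_normal((5000, 10))
tau = leverage_scores(X)
p = tau / tau.sum()

m = 1000                               # comfortably above d log d for this toy example
SX = sampling_matrix(p, m, rng) @ X

# Distortion of ||Xw||^2 over random directions w; values should cluster near 1.
W = rng.standard_normal((10, 200))
ratios = np.sum((SX @ W) ** 2, axis=0) / np.sum((X @ W) ** 2, axis=0)
print(ratios.min(), ratios.max())
```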