Active Learning for Single Neuron Models with Lipschitz Non-Linearities
Aarshvi Gajjar, Chinmay Hegde, and Christopher Musco
New York University
Abstract
We consider the problem of active learning for single neuron models, also sometimes called “ridge
functions”, in the agnostic setting (under adversarial label noise). Such models have been shown to be
broadly effective in modeling physical phenomena, and for constructing surrogate data-driven models for
partial differential equations.
Surprisingly, we show that for a single neuron model with any Lipschitz non-linearity (such as the
ReLU, sigmoid, absolute value, low-degree polynomial, among others), strong provable approximation
guarantees can be obtained using a well-known active learning strategy for fitting linear functions in
the agnostic setting. Namely, we can collect samples via statistical leverage score sampling, which has
been shown to be near-optimal in other active learning scenarios. We support our theoretical results
with empirical simulations showing that our proposed active learning strategy based on leverage score
sampling outperforms (ordinary) uniform sampling when fitting single neuron models.
1 Introduction
This paper considers active learning methods for functions of the form $g(x) = f(\langle w, x \rangle)$, where $w$ is a weight vector and $f$ is a non-linearity. For a given distribution $\mathcal{D}$ on $\mathbb{R}^d \times \mathbb{R}$, a random vector $(x, y)$ sampled from $\mathcal{D}$, and scalar function $f : \mathbb{R} \to \mathbb{R}$, our goal is to find $w$ which minimizes the expected squared error:
$$\mathbb{E}_{x, y \sim \mathcal{D}}\left[\left(f(\langle w, x \rangle) - y\right)^2\right].$$
Functions of the form $f(\langle w, x \rangle)$ find applications in a variety of settings under various names: they are called "single neuron" or "single index" models, "ridge functions", and "plane waves" in different communities [5, 40, 41, 43, 50]. Single neuron models are studied in machine learning theory as tractable examples of simple neural networks [17, 26]. Moreover, these models are known to be adept at modeling a variety of physical phenomena [14] and for that reason are effective, e.g., in building surrogate models for efficiently solving parametric partial differential equations (PDEs) and for approximating quantity of interest (QoI) surfaces for uncertainty quantification, model-driven design, and data assimilation [4, 11, 15, 32, 33, 39].
In these applications, single neuron models are used to fit complex functions over $\mathbb{R}^d$ based on queries from those functions. Often, the cost of obtaining a query $(x, y)$ from the target function dominates the computational cost of fitting the model: each training point collected requires numerically solving the PDE under consideration with parameters given by $x$ [1, 9]. At the same time, we often have freedom in exactly how the query is obtained; we are not restricted to simply sampling from $\mathcal{D}$, but rather can specify a target location $x$ and sample $y \sim \mathcal{D} \mid x$ (or compute $y$ deterministically, since in most applications it is a deterministic function of $x$). Given these considerations, the focus of our work is on developing efficient active learning and experimental design methods¹ for fitting single neuron models using as few carefully chosen $(x, y)$ observations as possible.
¹ We use "experimental design" to refer to methods that collect samples in a non-adaptive way. In other words, a set of points $x_1, \ldots, x_s$ are specified upfront and the corresponding $y$ values are observed all at once. In contrast, in standard active learning methods, the choice of $x_j$ can depend on the response values of all prior points $x_1, \ldots, x_{j-1}$.
We study this active learning problem in the challenging agnostic learning or adversarial noise setting. Again, this is motivated by applications of single neuron models in computational science. Typically, while it can be approximated by a single neuron model, the QoI or surrogate function under consideration is not itself of the form $f(\langle w, x \rangle)$. For this reason, the agnostic setting has become the standard in work on PDE models involving other common function families, like structured polynomials [2, 8, 9, 29]. In the agnostic setting, for a constant $C$ (or more stringently, $C = 1 + \epsilon$), our goal is always to return with high probability some $\tilde{w}$ such that:
$$\mathbb{E}_{x, y \sim \mathcal{D}}\left[\left(f(\langle \tilde{w}, x \rangle) - y\right)^2\right] \le C \cdot \min_{w} \; \mathbb{E}_{x, y \sim \mathcal{D}}\left[\left(f(\langle w, x \rangle) - y\right)^2\right].$$
1.1 Our Contributions
For ease of exposition, we consider the case when $\mathcal{D}$ is a uniform distribution over $n$ points in $\mathbb{R}^d$. This is without loss of generality, since any continuous distribution can be approximated by the uniform distribution over a sufficiently large, but finite, subset of points in $\mathbb{R}^d$.² In this case, we have the following problem statement.
Problem 1 (Single Neuron Regression). Given a matrix $X \in \mathbb{R}^{n \times d}$ and query access to a target vector $y \in \mathbb{R}^n$, for a given function $f : \mathbb{R} \to \mathbb{R}$, find $w \in \mathbb{R}^d$ to minimize $\|f(Xw) - y\|_2^2$ using as few queries from $y$ as possible.
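To make the optimization in Problem 1 concrete, below is a minimal sketch of fitting $w$ to minimize $\|f(Xw) - y\|_2^2$ with plain (sub)gradient descent on fully observed data. This is not the paper's Algorithm 1; the ReLU non-linearity, step size, initialization, and iteration count are illustrative assumptions.

```python
import numpy as np

def fit_single_neuron(X, y, f, f_grad, steps=2000, lr=0.1, seed=0):
    """Minimize ||f(Xw) - y||_2^2 by (sub)gradient descent.
    Illustrative only; no approximation guarantee is claimed."""
    n, d = X.shape
    # Small random init (a zero init would give a zero gradient for ReLU).
    w = 0.1 * np.random.default_rng(seed).standard_normal(d)
    for _ in range(steps):
        z = X @ w                          # pre-activations Xw
        r = f(z) - y                       # residual f(Xw) - y
        grad = 2 * X.T @ (f_grad(z) * r)   # chain rule for the squared loss
        w -= lr * grad / n
    return w

# Example with a ReLU non-linearity (1-Lipschitz, f(0) = 0).
relu = lambda t: np.maximum(t, 0.0)
relu_grad = lambda t: (t > 0).astype(float)

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 10))
w_true = rng.standard_normal(10)
y = relu(X @ w_true) + 0.1 * rng.standard_normal(500)   # labels with model mismatch

w_hat = fit_single_neuron(X, y, relu, relu_grad)
print(np.sum((relu(X @ w_hat) - y) ** 2))                # fitted squared error
```

The point of Problem 1 is that each entry of $y$ is expensive to obtain, so the active learning question is which entries to query before running any such fitting procedure.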
When $f$ is the identity function, Problem 1 reduces to active least squares regression, which has received a lot of recent attention in computer science and machine learning. In the agnostic setting, state-of-the-art results can be obtained via "leverage score" sampling, also known as "coherence motivated" or "effective resistance" sampling [3, 10, 28, 44]. The idea behind leverage score sampling methods is to collect samples from $y$ randomly but non-uniformly, using an importance sampling distribution based on the rows of $X$. More "unique" rows are selected with higher probability. Formally, rows are selected with probability proportional to their statistical leverage scores:
Definition 1 (Statistical Leverage Score). The leverage score $\tau_i(X)$ of the $i^{th}$ row $x_i$ of a matrix $X \in \mathbb{R}^{n \times d}$ is equal to:
$$\tau_i(X) = x_i^T (X^T X)^{-1} x_i = \max_{w \in \mathbb{R}^d} \frac{[Xw]_i^2}{\|Xw\|_2^2}.$$
Here, $[Xw]_i$ denotes the $i^{th}$ entry of the vector $Xw$.
We always have that $0 \le \tau_i(X) \le 1$, and a well-known property of the statistical leverage scores is that $\sum_{i=1}^n \tau_i(X) = \operatorname{rank}(X) \le d$. The leverage score of a row is large (close to $1$) if that row has a large inner product with some vector in $\mathbb{R}^d$, as compared to that vector's inner product with all other rows in the matrix $X$. This means that the particular row plays a significant role in forming the row space of $X$. For linear regression, it can be shown that when $X$ has $d$ columns, leverage score sampling yields a sample complexity of $O(d \log d + d/\epsilon)$ to find $\hat{w}$ satisfying $\|X\hat{w} - y\|_2^2 \le (1 + \epsilon) \min_w \|Xw - y\|_2^2$; moreover, this is optimal up to the log factor [7].
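As a concrete illustration of Definition 1 and the properties above, the leverage scores can be computed as the squared row norms of an orthonormal basis for the column span of $X$ (e.g., from a thin QR factorization). The sketch below, written with NumPy purely for illustration, also checks that the scores lie in $[0, 1]$ and sum to $\operatorname{rank}(X)$.

```python
import numpy as np

def leverage_scores(X):
    """Return tau_1(X), ..., tau_n(X) as squared row norms of an orthonormal
    basis Q for the column span of X; for full-column-rank X this matches
    x_i^T (X^T X)^{-1} x_i."""
    Q, _ = np.linalg.qr(X)            # thin QR: Q has orthonormal columns
    return np.sum(Q ** 2, axis=1)

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 20))
tau = leverage_scores(X)

print(tau.min() >= 0, tau.max() <= 1 + 1e-12)            # 0 <= tau_i <= 1
print(np.isclose(tau.sum(), np.linalg.matrix_rank(X)))   # sum equals rank(X) <= d
```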
Our main contribution is to establish that, surprisingly, when combined with a novel regularization strategy, leverage score sampling also yields theoretical guarantees for the more general case (Problem 1) for a broad class of non-linearities $f$. In fact, we only require that $f$ is $L$-Lipschitz for constant $L$, a property that holds for most non-linearities used in practice (ReLU, sigmoid, absolute value, low-degree polynomials, etc.). We prove the following main result, which shows that $\tilde{O}(d^2/\epsilon^4)$ samples, collected via leverage score sampling, suffice for provably learning a single neuron model with a Lipschitz non-linearity.

² For other function families (e.g., polynomials or sparse Fourier functions) there has been recent work on active learning algorithms based on leverage score sampling that skip the discrete approximation step by developing algorithms directly tailored to common continuous distributions, like uniform or Gaussian [23]. We believe our techniques should be directly extendable to give comparable results for single neuron models.
Theorem 1 (Main Result). Let $X \in \mathbb{R}^{n \times d}$ be a data matrix and $y \in \mathbb{R}^n$ be a target vector. Let $f$ be an $L$-Lipschitz non-linearity with $f(0) = 0$, and let $w^* \in \arg\min_w \|f(Xw) - y\|_2^2$ and $\mathrm{OPT} = \|f(Xw^*) - y\|_2^2$. There is an algorithm (Algorithm 1) that, based on the leverage scores of $X$, observes $m = O\!\left(\frac{d^2 \log(1/\epsilon)}{\epsilon^4}\right)$ random entries from $y$ and returns with probability $> 9/10$ a vector $\hat{w}$ satisfying:
$$\|f(X\hat{w}) - y\|_2^2 \le C \cdot \mathrm{OPT} + \epsilon L^2 \|Xw^*\|_2^2.$$
The assumption $f(0) = 0$ in Theorem 1 is without loss of generality. If $f(0)$ is non-zero, we can simply solve a transformed problem with $y' = y - f(0)$ and $f'(x) = f(x) - f(0)$. The theorem mirrors previous results in the linear setting and, in contrast to prior work on agnostically learning single neuron models, does not require any assumptions on $X$ [20, 47]. In addition to the multiplicative error $C$, the theorem has an additive error term of $C\epsilon L^2 \|Xw^*\|_2^2$. An additive error term is necessary; as we will show in Section 5, it is provably impossible to achieve purely relative error with a number of samples polynomial in $d$. Similar additive error terms arise in related work on leverage score sampling for problems like logistic regression [34, 36]. On the other hand, we believe the $d^2$ dependence in our bound is not necessary, and should be improvable to linear in $d$. The dependence on $\epsilon$ is also likely improvable.
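For intuition about what such an algorithm looks like, the following is a minimal sketch of the sample-then-fit pipeline analyzed in Theorem 1: query only $m$ entries of $y$, chosen with probability proportional to the leverage scores of $X$, and fit $w$ on the reweighted subsample. It deliberately omits the regularization strategy used by the paper's Algorithm 1, and the helper names, solver, and hyperparameters are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def leverage_scores(X):
    Q, _ = np.linalg.qr(X)
    return np.sum(Q ** 2, axis=1)

def active_fit(X, query_y, f, f_grad, m, steps=2000, lr=0.1, seed=0):
    """Observe only m labels, sampled proportionally to leverage scores,
    then minimize the reweighted subsampled objective ||S f(Xw) - S y||_2^2.
    (Omits the regularization of Algorithm 1; illustrative only.)"""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    tau = leverage_scores(X)
    p = tau / tau.sum()
    idx = rng.choice(n, size=m, p=p)       # rows whose labels we query
    weights = 1.0 / (m * p[idx])           # importance weights as in Definition 2
    Xs, ys = X[idx], query_y(idx)          # only these m entries of y are observed
    w = 0.1 * rng.standard_normal(d)       # small random init
    for _ in range(steps):
        z = Xs @ w
        r = f(z) - ys
        grad = 2 * Xs.T @ (weights * f_grad(z) * r)
        w -= lr * grad / weights.sum()
    return w

# Usage sketch: labels come from an expensive oracle, queried only at idx.
# w_hat = active_fit(X, lambda idx: y_full[idx], relu, relu_grad, m=200)
```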
In Section 4 we support our main theoretical result with an empirical evaluation on both synthetic data
and several test problems that require approximating PDE quantity of interest surfaces. In all settings,
leverage score sampling outperforms ordinary uniform sampling, often improving on the error obtained
using a fixed number of samples by an order of magnitude or more.
1.2 Related Work
Single neuron models have been widely studied in a number of communities, including machine learning, computational science, and approximation theory. These models can be generalized to the "multi-index" case, where $g(x) = f_1(\langle x, w_1 \rangle) + \ldots + f_q(\langle x, w_q \rangle)$ [5, 31, 47], or to the case when $f$ is not known in advance (but might be from a parameterized function family, such as low-degree polynomials) [11, 15, 30]. While we do not address these generalizations in this paper, we hope that our work can provide a foundation for further work in this direction.
Beyond sample complexity, there has also been a lot of interest in understanding the computational complexity of fitting single neuron models, including in the agnostic setting [50]. There are both known hardness results for general data distributions [18, 19, 27], as well as positive results on efficient algorithms under additional assumptions [17, 20, 26]. Our work differs from this setting in three ways: 1) we focus on sample complexity; 2) we make no assumptions on the data distribution $\mathcal{D}$ (i.e., no assumptions on the data matrix $X$); 3) we allow for active sampling strategies. Obtaining results under i.i.d. sampling, even for well-behaved distributions, inherently requires additional assumptions, like $w^* = \arg\min_w \|f(Xw) - y\|_2^2$ being bounded in norm or $X$ having bounded condition number.
In computational science, there has been significant work on active learning and experimental design for fitting other classes of functions adept at modeling high dimensional physical phenomena [2, 8, 9, 29]. Most such results focus on minimizing squared prediction error. For situations involving model mismatch, the agnostic (adversarial noise) setting is more appropriate than assuming i.i.d. zero mean noise, which is more typical in classical statistical results on experimental design [42]. To obtain results in the agnostic setting for linear function families, much of the prior work in computational science uses "coherence motivated" sampling techniques [10, 28, 44]. Such methods are equivalent to leverage score sampling [3].
Leverage score sampling and related methods have found widespread applications in the design of efficient algorithms, including in the construction of graph sparsifiers, coresets, and randomized low-rank approximations, and in solving over-constrained linear regression problems [12, 13, 16, 21, 22, 24, 37, 46]. Recently, there has been renewed attention on using leverage score sampling for solving the active linear regression problem for various loss functions [6, 23, 35, 38]. While all of these results are heavily tailored to linear models, a small body of work addresses the problem of active learning for non-linear problems. This includes problems of the form $\|f(Xw - y)\|_2^2$, where $f$ is Lipschitz [34, 36]. While not equivalent to our problem, this formulation captures important tasks like logistic regression. Our work can be viewed as broadening the range of non-linear problems for which leverage score sampling yields natural worst-case guarantees.
Finally, we mention a few papers that directly address the active learning problem for functions of the form $f(Xw)$. [11] studies adaptive query algorithms, but in a different setting than our paper. Specifically, they address the easier (noiseless) realizable setting, where samples are obtained from a function of the form $f(Xw)$ for a ground truth $w$. They also make stronger smoothness assumptions on $f$, although their algorithm can handle the case when $f$ is not known in advance. Follow-up work also addresses the multi-index problem in the same setting [25]. Also motivated by applications in efficient PDE surrogate modeling, [47] studies the multi-index problem, but again in the realizable setting. Their techniques can handle mean-centered i.i.d. noise, but not adversarial noise or model mismatch.
2 Preliminaries
Notation. Throughout the paper, we use bold lower-case letters for vectors and bold upper-case letters for matrices. We let $e_i$ denote the $i^{th}$ standard basis vector (all zeros, but with a $1$ in position $i$). The dimension of $e_i$ will be clear from context. For a vector $x \in \mathbb{R}^n$ with entries $x_1, \ldots, x_n$, $\|x\|_2 = \left(\sum_{i=1}^n x_i^2\right)^{1/2}$ denotes the Euclidean norm. $\mathcal{B}_d(r)$ denotes a ball of radius $r$ centered at $0$, i.e., $\mathcal{B}_d(r) = \{x \in \mathbb{R}^d : \|x\| \le r\}$. For a scalar function $f : \mathbb{R} \to \mathbb{R}$ and vector $y$, we use $f(y)$ to denote the vector obtained by applying $f$ to each entry of $y$. For a fixed matrix $X$, unobserved target vector $y$, and non-linearity $f$, we denote $\|f(Xw^*) - y\|_2^2$ by $\mathrm{OPT}$, where $w^* \in \arg\min_w \|f(Xw) - y\|_2^2$.
Importance Sampling. As discussed, our approach is based on importance sampling according to statistical leverage scores: we fix a set of probabilities $p_1, \ldots, p_n$, use them to sample $m$ rows from the regression problem $\|f(Xw) - y\|_2^2$, and solve a reweighted least squares problem to find a near-optimal choice for $w$. Formally, this can be implemented by defining a sampling matrix of the following form:
Definition 2 (Importance Sampling Matrix). Let $p_1, \ldots, p_n \in [0, 1]$ be a given set of probabilities (so that $\sum_i p_i = 1$). A matrix $S$ is an $m \times n$ importance sampling matrix if each of its rows is chosen to equal $\frac{1}{\sqrt{m \cdot p_i}} \cdot e_i$ with probability proportional to $p_i$.
To compute an approximation to $\min_w \|f(Xw) - y\|_2^2$, we will solve an optimization problem involving the sub-sampled objective $\|Sf(Xw) - Sy\|_2^2$. It is easily verified that for any choice of $p_1, \ldots, p_n$ and any vector $z$, $\mathbb{E}\left[\|Sz\|_2^2\right] = \|z\|_2^2$. We will use this fact repeatedly.
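The sketch below constructs an explicit importance sampling matrix as in Definition 2 and empirically checks the unbiasedness fact $\mathbb{E}\left[\|Sz\|_2^2\right] = \|z\|_2^2$ by averaging over many independent draws of $S$; the probabilities and test vector are arbitrary choices made only for illustration.

```python
import numpy as np

def sampling_matrix(p, m, rng):
    """m x n importance sampling matrix: each row equals e_i / sqrt(m * p_i),
    with the index i drawn according to the probabilities p."""
    n = len(p)
    idx = rng.choice(n, size=m, p=p)
    S = np.zeros((m, n))
    S[np.arange(m), idx] = 1.0 / np.sqrt(m * p[idx])
    return S

rng = np.random.default_rng(0)
n, m = 1000, 50
p = rng.random(n) + 0.5                # arbitrary positive weights ...
p /= p.sum()                           # ... normalized to sum to 1
z = rng.standard_normal(n)

# Average ||Sz||_2^2 over independent draws of S; it concentrates around ||z||_2^2.
est = np.mean([np.sum((sampling_matrix(p, m, rng) @ z) ** 2) for _ in range(2000)])
print(est, np.sum(z ** 2))
```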
Properties of Leverage Scores. Our importance sampling mechanism is based on sampling by the leverage scores $\tau_1(X), \ldots, \tau_n(X)$ of the design matrix $X \in \mathbb{R}^{n \times d}$. For any full-rank $d \times d$ matrix $R$, we have that $\tau_i(XR) = \tau_i(X)$. This is clear from Definition 1 and implies that $\tau_i(X)$ only depends on the column span of $X$. In our proofs, this property will allow us to easily reduce to the setting where $X$ is assumed to be a matrix with orthonormal columns.

We will also use the following well-known fact about using leverage score sampling to construct a "subspace embedding" for a matrix $X$.
Lemma 1 (Subspace Embedding; see e.g. Theorem 17 in Woodruff [49]). Given $X \in \mathbb{R}^{n \times d}$ with leverage scores $\tau_1(X), \ldots, \tau_n(X)$, let $p_i = \tau_i(X) / \sum_i \tau_i(X)$. Let $S \in \mathbb{R}^{m \times n}$ be a sampling matrix constructed as in Definition 2 using the probabilities $p_1, \ldots, p_n$. For any $0 < \gamma < 1$, as long as $m \ge c \cdot d \log(d/\delta)/\gamma^2$ for some fixed constant $c$, then with probability $1 - \delta$ we have that for all $w \in \mathbb{R}^d$,
$$(1 - \gamma)\|Xw\|_2^2 \le \|SXw\|_2^2 \le (1 + \gamma)\|Xw\|_2^2.$$
Lemma 1 establishes that, with high probability, leverage score sampling preserves the norm of any vector $Xw$ in the column span of $X$. This guarantee can be proven using an argument that reduces to a matrix Chernoff bound [46]. It is a critical component of previously known active learning guarantees for fitting linear functions using leverage score sampling [45].
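The snippet below gives a rough empirical illustration of Lemma 1, reusing the hypothetical leverage_scores and sampling_matrix helpers sketched earlier: after sampling rows with leverage-score probabilities, the ratio $\|SXw\|_2^2 / \|Xw\|_2^2$ stays close to $1$ over many random directions $w$. Checking random directions is of course weaker than the "for all $w$" guarantee of the lemma; this is a sanity check, not a proof.

```python
import numpy as np

def leverage_scores(X):
    Q, _ = np.linalg.qr(X)
    return np.sum(Q ** 2, axis=1)

def sampling_matrix(p, m, rng):
    n = len(p)
    idx = rng.choice(n, size=m, p=p)
    S = np.zeros((m, n))
    S[np.arange(m), idx] = 1.0 / np.sqrt(m * p[idx])
    return S

rng = np.random.default_rng(0)
X = rng.standard_normal((5000, 10))
tau = leverage_scores(X)
p = tau / tau.sum()

m = 1000                               # comfortably above d log d for this toy example
SX = sampling_matrix(p, m, rng) @ X

# Distortion of ||Xw||^2 over random directions w; values should cluster near 1.
W = rng.standard_normal((10, 200))
ratios = np.sum((SX @ W) ** 2, axis=0) / np.sum((X @ W) ** 2, axis=0)
print(ratios.min(), ratios.max())
```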