Improved Algorithms for Neural Active Learning

Yikun Ban∗, Yuheng Zhang∗, Hanghang Tong, Arindam Banerjee, and Jingrui He

University of Illinois Urbana-Champaign

{yikunb2, yuhengz2, htong, arindamb, jingrui}@illinois.edu

Abstract

We improve the theoretical and empirical performance of neural-network (NN)-based active learning algorithms for the non-parametric streaming setting. In particular, we introduce two regret metrics, based on minimizing the population loss, that are more suitable for active learning than the one used in state-of-the-art (SOTA) related work. Then, the proposed algorithm leverages the powerful representation of NNs for both exploitation and exploration, has a query decision-maker tailored for $k$-class classification problems with a performance guarantee, utilizes the full feedback, and updates parameters in a more practical and efficient manner. These careful designs lead to an instance-dependent regret upper bound, roughly improving by a multiplicative factor $O(\log T)$ and removing the curse of input dimensionality. Furthermore, we show that the algorithm can achieve the same performance as the Bayes-optimal classifier in the long run under the hard-margin setting in classification problems. In the end, we use extensive experiments to evaluate the proposed algorithm and SOTA baselines, to show the improved empirical performance.

1 Introduction

The Neural Network (NN) is one of the indispensable paradigms in machine learning and is widely used in a variety of supervised-learning tasks [ ]. As more and more complicated NNs are developed, the amount of labeled data required for training grows, incurring a significant cost of label annotation. Active learning investigates effective techniques that train on a much smaller labeled data set while attaining generalization performance comparable to passive learning [ ]. In this paper, we focus on the classification problem in the streaming setting of active learning with NN models. At every round, the learner receives an instance and must decide on-the-fly whether or not to observe the label associated with this instance. The goal is to maximize the generalization capability of the learned NNs over a sequence of rounds, such that the model has robust performance on unseen data from the same distribution [40].

In active learning, given access to i.i.d. generated instances from a distribution $\mathcal{D}$, suppose there exists a class of functions $\mathcal{F}$ that formulates the mapping from instances to their labels. In the parametric setting, i.e., when $\mathcal{F}$ has finite VC-dimension [ ], existing works [ ] have shown that active learning algorithms can achieve a convergence rate of $O(1/\sqrt{N})$ to the best population loss in $\mathcal{F}$, where $N$ is the number of label queries. In the non-parametric setting, recent works [ ] provide similar convergence results while suffering from the curse of input dimensionality. Unfortunately, most NN-based approaches to active learning do not come with a performance guarantee, despite having powerful empirical results.

The first performance guarantee for neural active learning was established in a recent work [48], whose analysis is for over-parameterized neural networks with the assistance of the Neural Tangent Kernel (NTK). We carefully investigate the limitations of [48], which turn into the main motivations of our paper. First, [48] transforms the classification problem into a multi-armed bandit problem [ ] in order to minimize a pseudo regret metric. Yet, because they seek to minimize the conditional population loss on a sequence of given data, it is doubtful that the pseudo regret used in [48] can explicitly measure the generalization capability of given algorithms (see Remark 2.1). Second, the training process for NN models is not efficient, as [48] uses vanilla gradient descent and starts from randomly initialized parameters in every round. Third, although [48] removes the curse of input dimensionality $d$, the performance guarantee strongly suffers from another introduced term, the effective dimensionality, which can be thought of as the non-linear dimensionality of the Hilbert space spanned by the NTK. In the worst case, the effective dimensionality can be unacceptably large, and thus the performance guarantee collapses.

∗Both authors contributed equally.

36th Conference on Neural Information Processing Systems (NeurIPS 2022).

arXiv:2210.00423v3 [cs.LG] 16 Jan 2023

1.1 Main contributions

In this paper, we propose a novel algorithm, I-NeurAL (Improved Algorithms for Neural Active Learning), to tackle the above limitations. Our contributions can be summarized as follows: (1) We consider the $k$-class classification problem and introduce two new regret metrics to minimize the population loss, which can directly reflect the generalization capability of NN-based algorithms. (2) I-NeurAL has a neural exploration strategy with a novel component to decide whether or not to query the label, coming with a performance guarantee. I-NeurAL also exploits the full feedback in active learning, which is a subtle but effective idea. (3) I-NeurAL is designed to support mini-batch Stochastic Gradient Descent (SGD). In particular, at every round, I-NeurAL runs mini-batch SGD starting from the parameters of the last round, i.e., with a warm start, which is more efficient and practical compared to [48]. (4) Without any noise assumption on the data distribution, we provide an instance-dependent performance guarantee of I-NeurAL for over-parameterized neural networks. Compared to [48], we remove the curse of both the input dimensionality $d$ and the effective dimensionality; moreover, we roughly improve the regret by a multiplicative factor $\log(T)$, where $T$ is the number of rounds. (5) Under a hard-margin assumption on the data distribution, we show that NN models can achieve the same generalization capability as the Bayes-optimal classifier after $O(\log T)$ label queries. (6) We conduct extensive experiments on real-world data sets to demonstrate the improved performance of I-NeurAL over state-of-the-art baselines, including the closest work [48], which did not provide empirical validation of its proposed algorithms.

1.2 Related Work

Active learning has been extensively studied and applied to many essential applications [ ]. Bayesian active learning methods typically use a probabilistic regression model to estimate the improvement of each query [ ]. Although effective on small or moderate data sets, Bayesian approaches are difficult to scale to large data sets because of the batch sampling [ ]. Another important class, margin algorithms or uncertainty sampling [ ], obtains considerable performance improvement over passive learning and has been further developed by many practitioners [ ]. Margin algorithms are flexible and can be adapted to both streaming and pool settings. In the pool setting, a line of works utilizes neural networks in active learning to improve the empirical performance [ ]. However, they do not provide a performance guarantee for NN-based active learning algorithms. From the theoretical perspective, [ ] provide performance guarantees for specific classes of functions, and [26, 22] present a theoretical analysis of active learning algorithms with surrogate loss functions for binary classification. However, their performance guarantees are restricted to hypothesis classes, i.e., the parametric setting. In contrast, our goal is to derive an NN-based algorithm in the non-parametric setting that performs well both empirically and theoretically. Neural contextual bandits [ ] provide a principled method to balance exploitation and exploration [ ]. [48] transforms active learning into a neural contextual bandit problem and obtains a performance guarantee, whose limitations are discussed above.

As [48] is the closest related work to our paper, we emphasize the differences between our techniques and those of [48] throughout the paper. We introduce the problem definition and the proposed algorithms in Section 2 and Section 3, respectively. Then, we provide performance guarantees in Section 4 and empirical results in Section 5, ending with the conclusion in Section 6.

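2 Problem Definition

In this paper, we study the streaming setting of active learning in the $k$-class classification problem. Let $\mathcal{X}$ denote the input space over $\mathbb{R}^d$, $\mathcal{Y} = \{1, 2, \dots, k\}$ represent the label space, and $\mathcal{D}$ be some unknown distribution over $\mathcal{X} \times \mathcal{Y}$. At round $t \in [T] = \{1, 2, \dots, T\}$, an instance $\mathbf{x}_t$ is drawn from the marginal distribution $\mathcal{D}_{\mathcal{X}}$ and accordingly $y_t$ is drawn from the conditional distribution $\mathcal{D}_{\mathcal{Y}|\mathbf{x}_t}$. Here, $y_t$ can be thought of as the index of the class that $\mathbf{x}_t$ belongs to. Inspired by [ ], we first transform $\mathbf{x}_t$ into $k$ context vectors representing the $k$ classes respectively:

$$\mathbf{x}_{t,1} = (\mathbf{x}_t^\top, \mathbf{0}^\top, \dots, \mathbf{0}^\top)^\top, \quad \mathbf{x}_{t,2} = (\mathbf{0}^\top, \mathbf{x}_t^\top, \dots, \mathbf{0}^\top)^\top, \quad \dots, \quad \mathbf{x}_{t,k} = (\mathbf{0}^\top, \mathbf{0}^\top, \dots, \mathbf{x}_t^\top)^\top,$$

with $\mathbf{x}_{t,i} \in \mathbb{R}^{dk}, \forall i \in [k]$. In accordance with the context vectors, we construct the $k$ label vectors representing the $k$ possible predictions:

$$\mathbf{y}_{t,1} = (1, 0, \dots, 0)^\top, \quad \mathbf{y}_{t,2} = (0, 1, \dots, 0)^\top, \quad \dots, \quad \mathbf{y}_{t,k} = (0, 0, \dots, 1)^\top,$$

with $\mathbf{y}_{t,i} \in \mathbb{R}^{k}, \forall i \in [k]$. Thus, $\mathbf{y}_{t,y_t}$ is the ground-truth label vector for $\mathbf{x}_t$.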
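The construction above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the helper names `build_context_vectors` and `build_label_vectors` are hypothetical, and classes are indexed from 0 rather than from 1 as in the text.

```python
import numpy as np

def build_context_vectors(x_t: np.ndarray, k: int) -> np.ndarray:
    """Return a (k, d*k) array whose i-th row is the context vector x_{t,i}:
    x_t copied into the i-th block of length d, zeros everywhere else."""
    d = x_t.shape[0]
    contexts = np.zeros((k, d * k))
    for i in range(k):
        contexts[i, i * d:(i + 1) * d] = x_t
    return contexts

def build_label_vectors(k: int) -> np.ndarray:
    """Return a (k, k) array whose i-th row is the one-hot label vector y_{t,i}."""
    return np.eye(k)

# Toy instance with d = 2 features and k = 3 classes (values are illustrative).
x_t = np.array([0.6, 0.8])
contexts = build_context_vectors(x_t, k=3)
labels = build_label_vectors(3)
```

Each row of `contexts` has the same norm as `x_t`, so a unit-norm instance yields unit-norm context vectors.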

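Under the non-parametric setting of active learning, we define an unknown function $h: \mathcal{X}^k \to [0,1]$ to formulate the conditional distribution $\mathcal{D}_{\mathcal{Y}|\mathbf{x}_t}$, such that

$$\forall i \in [k], \quad \mathbb{P}(\mathbf{y}_{t,y_t} = \mathbf{y}_{t,i} \mid \mathbf{x}_t) = h(\mathbf{x}_{t,i}), \tag{2.1}$$

which is subject to $\sum_{i=1}^{k} h(\mathbf{x}_{t,i}) = 1$. For simplicity, we consider the $k$-class classification problem with 0-1 loss. Given $\mathbf{x}_t$, i.e., $\mathbf{x}_{t,i}, i \in [k]$, let $\hat{i}$ be the index of the class predicted by some hypothesis $f$, so that $\mathbf{y}_{t,\hat{i}}$ is the prediction. Then, we have the following loss:

$$L(\mathbf{y}_{t,\hat{i}}, \mathbf{y}_{t,y_t}) = \mathbb{1}\{\mathbf{y}_{t,\hat{i}} \neq \mathbf{y}_{t,y_t}\} \in \{0, 1\}, \tag{2.2}$$

where $\mathbb{1}$ is the indicator function.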

Given the number of rounds $T$, at each round $t \in [T]$, the learner receives an instance $\mathbf{x}_t$ drawn i.i.d. from $\mathcal{D}_{\mathcal{X}}$. Then, the learner needs to make a prediction $\mathbf{y}_{t,\hat{i}}$ and, at the same time, decide on-the-fly whether or not to query the label $\mathbf{y}_{t,y_t}$, where $y_t$ is drawn i.i.d. from $\mathcal{D}_{\mathcal{Y}|\mathbf{x}_t}$. As the goal of active learning tasks is often to minimize the population loss [ ], we introduce the following two regret metrics.

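Definition 2.1 (Latest Population Regret). Given the data distribution $\mathcal{D}$ and the number of rounds $T$, the Latest Population Regret is defined as

$$R_T = \mathbb{E}_{\mathbf{x}_T \sim \mathcal{D}_{\mathcal{X}}} \mathbb{E}_{y_T \sim \mathcal{D}_{\mathcal{Y}|\mathbf{x}_T}} \left[ L(\mathbf{y}_{T,\hat{i}}, \mathbf{y}_{T,y_T}) \mid \mathbf{x}_T \right] - \mathbb{E}_{\mathbf{x}_T \sim \mathcal{D}_{\mathcal{X}}} \mathbb{E}_{y_T \sim \mathcal{D}_{\mathcal{Y}|\mathbf{x}_T}} \left[ L(\mathbf{y}_{T,i^*}, \mathbf{y}_{T,y_T}) \mid \mathbf{x}_T \right], \tag{2.3}$$

where $\mathbf{y}_{T,i^*}$ is the prediction the Bayes-optimal classifier would make on instance $\mathbf{x}_T$, i.e., $i^* = \arg\max_{i \in [k]} h(\mathbf{x}_{T,i})$ for $\mathbf{y}_{T,i^*}$.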

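Definition 2.2 (Cumulative Population Regret). Given the data distribution $\mathcal{D}$ and the number of rounds $T$, the Cumulative Population Regret is defined as

$$\mathbf{R}_T = \sum_{t=1}^{T} \Big( \mathbb{E}_{\mathbf{x}_t \sim \mathcal{D}_{\mathcal{X}}} \mathbb{E}_{y_t \sim \mathcal{D}_{\mathcal{Y}|\mathbf{x}_t}} \left[ L(\mathbf{y}_{t,\hat{i}}, \mathbf{y}_{t,y_t}) \mid \mathbf{x}_t \right] - \mathbb{E}_{\mathbf{x}_t \sim \mathcal{D}_{\mathcal{X}}} \mathbb{E}_{y_t \sim \mathcal{D}_{\mathcal{Y}|\mathbf{x}_t}} \left[ L(\mathbf{y}_{t,i^*}, \mathbf{y}_{t,y_t}) \mid \mathbf{x}_t \right] \Big), \tag{2.4}$$

where $\mathbf{y}_{t,i^*}$ is the prediction the Bayes-optimal classifier would make on instance $\mathbf{x}_t$, i.e., $i^* = \arg\max_{i \in [k]} h(\mathbf{x}_{t,i})$ for $\mathbf{y}_{t,i^*}$.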

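The Latest Population Regret measures the performance at the last round $T$ only, and the Cumulative Population Regret measures the overall performance in $T$ rounds combined. Therefore, the goal of this problem is to minimize either one of them, or both. At the same time, we also aim to minimize the following expected query cost:

$$N_T = \sum_{t=1}^{T} \mathbb{E}_{\mathbf{x}_t \sim \mathcal{D}_{\mathcal{X}}} \left[ I_t \mid \mathbf{x}_t \right], \tag{2.5}$$

where $I_t$ is the indicator of the query decision in round $t$, such that $I_t = 1$ if $y_t$ is observed and $I_t = 0$ otherwise.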
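To make the population regret concrete, note that under Eq. (2.1) the conditional expected 0-1 loss of predicting class $i$ is $1 - h(\mathbf{x}_{t,i})$, so the per-instance gap to the Bayes-optimal classifier is $h(\mathbf{x}_{t,i^*}) - h(\mathbf{x}_{t,\hat{i}})$. The sketch below estimates the population regret of a fixed hypothesis by averaging this gap over fresh draws from the marginal distribution. It is a Monte Carlo illustration only: `h_fn`, `predict_fn`, the toy binary distribution, and the 0-based class indices are assumptions, and in practice $h$ is unknown.

```python
import numpy as np

def expected_loss(h_values: np.ndarray, predicted: int) -> float:
    """Conditional expected 0-1 loss of predicting class `predicted`: 1 - h(x_{t,predicted})."""
    return 1.0 - float(h_values[predicted])

def population_regret(h_fn, predict_fn, fresh_instances) -> float:
    """Monte Carlo estimate of E_x[loss(prediction) - loss(Bayes-optimal)] on fresh draws of x."""
    gaps = []
    for x in fresh_instances:
        h_values = h_fn(x)
        i_star = int(np.argmax(h_values))  # class chosen by the Bayes-optimal classifier
        gaps.append(expected_loss(h_values, predict_fn(x)) - expected_loss(h_values, i_star))
    return float(np.mean(gaps))

def query_cost(query_flags) -> int:
    """Realized number of label queries, i.e. the sum of the indicators I_t."""
    return int(np.sum(query_flags))

# Hypothetical binary problem: h(x) = (x, 1 - x) for a scalar x in [0, 1].
h_fn = lambda x: np.array([x, 1.0 - x])
always_class_0 = lambda x: 0  # a deliberately poor hypothesis
fresh = np.array([0.1, 0.4, 0.7, 0.9])
regret_estimate = population_regret(h_fn, always_class_0, fresh)
queries = query_cost([1, 0, 1, 1])
```

Averaging over fresh instances, rather than over the instances the learner has already seen, is what separates a population regret from a regret conditioned on the collected data.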

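Remark 2.1. Minimizing the population regrets defined above shows the generalization capability of the learned hypothesis on the distribution $\mathcal{D}$. However, the problem defined in [48] is to minimize the cumulative conditional population regret as follows:

$$\widetilde{R}_T = \sum_{t=1}^{T} \Big( \mathbb{E}_{y_t \sim \mathcal{D}_{\mathcal{Y}|\mathbf{x}_t}} \left[ L(\mathbf{y}_{t,\hat{i}}, \mathbf{y}_{t,y_t}) \mid \mathbf{x}_t \right] - \mathbb{E}_{y_t \sim \mathcal{D}_{\mathcal{Y}|\mathbf{x}_t}} \left[ L(\mathbf{y}_{t,i^*}, \mathbf{y}_{t,y_t}) \mid \mathbf{x}_t \right] \Big). \tag{2.6}$$

Here, $\mathbb{E}_{y_t \sim \mathcal{D}_{\mathcal{Y}|\mathbf{x}_t}} [ L(\mathbf{y}_{t,\hat{i}}, \mathbf{y}_{t,y_t}) \mid \mathbf{x}_t ]$ is the population loss conditioned on $\mathbf{x}_t$. Unfortunately, $\widetilde{R}_T$ only measures the performance of the learned hypothesis on the collected data $\{\mathbf{x}_t\}_{t=1}^{T}$ and cannot directly measure the accuracy of the hypothesis on unseen data instances. Although $\widetilde{R}_T$ follows the regret definition in multi-armed bandits [ ], it is fair to say that $\widetilde{R}_T$ may not be a good metric in active learning.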

3 Proposed Algorithms

In this section, we elaborate on the proposed algorithm I-NeurAL (Algorithm 1). In contrast to the directly comparable work [48], I-NeurAL has the following novel and advantageous aspects: (1) I-NeurAL incorporates a neural-based exploration strategy (Line 6) inspired by recent advances in bandits [ ] to solve the exploitation-exploration dilemma in the decision of whether or not to query labels; (2) I-NeurAL includes a novel component (Line 11) to decide whether or not to query labels in the $k$-class classification problem; (3) I-NeurAL infers and exploits the feedback of all the contexts (Lines 12-17), instead of only utilizing the feedback of the chosen context as in [48]; (4) I-NeurAL conducts mini-batch SGD based on the parameters of the last round (Algorithm 2), which is more practical, as opposed to conducting vanilla gradient descent from the initialization at every round as in [48]. Next, we present the details of I-NeurAL.

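Exploitation Network. Given $\mathbf{x}_{t,i}, i \in [k]$, to learn the unknown function $h$ (Eq. (2.1)), we use a fully-connected neural network $f_1$ with depth $L$ and width $m$:

$$f_1(\mathbf{x}_{t,i}; \theta^1) = \mathbf{W}^1_L \sigma(\mathbf{W}^1_{L-1} \sigma(\mathbf{W}^1_{L-2} \dots \sigma(\mathbf{W}^1_1 \mathbf{x}_{t,i}))), \tag{3.1}$$

where $\mathbf{W}^1_1 \in \mathbb{R}^{m \times kd}$, $\mathbf{W}^1_l \in \mathbb{R}^{m \times m}$ for $2 \le l \le L-1$, $\mathbf{W}^1_L \in \mathbb{R}^{1 \times m}$, $\theta^1 = [\mathrm{vec}(\mathbf{W}^1_1)^\top, \dots, \mathrm{vec}(\mathbf{W}^1_L)^\top]^\top \in \mathbb{R}^{p_1}$, and $\sigma$ is the ReLU activation function $\sigma(\mathbf{x}) = \max\{0, \mathbf{x}\}$. In round $t$, given $\mathbf{x}_{t,i}, i \in [k]$, $f_1(\mathbf{x}_{t,i}; \theta^1_{t-1})$ is assigned to learn $h(\mathbf{x}_{t,i})$. Based on the fact that $h(\mathbf{x}_{t,i}) = \mathbb{E}_{y_t \sim \mathcal{D}_{\mathcal{Y}|\mathbf{x}_t}} [1 - L(\mathbf{y}_{t,i}, \mathbf{y}_{t,y_t})]$, it is natural to regard $1 - L(\mathbf{y}_{t,i}, \mathbf{y}_{t,y_t})$ as the label for training $f_1$. Note that we take the basic fully-connected network as an example for the sake of analysis in over-parameterized networks, and $f_1$ can be easily replaced with more complicated models depending on the tasks.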
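A minimal NumPy sketch of the forward pass in Eq. (3.1) follows. The weight shapes match the text; the Gaussian initialization scale, the helper names `init_params` and `forward`, and the toy sizes are assumptions made for illustration, since this section does not specify the initialization.

```python
import numpy as np

def init_params(input_dim: int, width: int, depth: int, rng: np.random.Generator):
    """Weights of a bias-free depth-L network: W_1 is (m, input_dim),
    W_2..W_{L-1} are (m, m), and W_L is (1, m). The init scale is an assumption."""
    dims = [input_dim] + [width] * (depth - 1) + [1]
    return [rng.normal(0.0, np.sqrt(2.0 / dims[l]), size=(dims[l + 1], dims[l]))
            for l in range(depth)]

def forward(params, x: np.ndarray) -> float:
    """f(x; theta) = W_L relu(W_{L-1} ... relu(W_1 x)); returns a scalar score."""
    a = x
    for W in params[:-1]:
        a = np.maximum(0.0, W @ a)  # ReLU hidden layer
    return float((params[-1] @ a)[0])  # linear output layer

# Toy sizes: d = 2, k = 3 (so the input has kd = 6 entries), m = 16, L = 3.
rng = np.random.default_rng(0)
theta1 = init_params(input_dim=6, width=16, depth=3, rng=rng)
x_ctx = np.array([0.0, 0.0, 0.6, 0.8, 0.0, 0.0])  # context vector for the second class
score = forward(theta1, x_ctx)
```

Because the network has no bias terms and uses ReLU, its output scales linearly with any positive rescaling of the input.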

Exploration Network. In addition to the network $f_1$, we assign another network $f_2$ to explore uncertain information contained in incoming instances. First, we carefully design the input of $f_2$ to incorporate the context vectors of the instance and the discrimination ability of $f_1$, in order to learn the error between the Bayes-optimal probability $h(\mathbf{x}_{t,i})$ and the prediction $f_1(\mathbf{x}_{t,i}; \theta^1)$.

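Definition 3.1 (Derivative-Context (DC) Embedding). Given the exploitation network $f_1(\cdot; \theta^1_{t-1})$ and an input context $\mathbf{x}_{t,i}$, its DC embedding is defined as

$$\phi(\mathbf{x}_{t,i}) = \left( \frac{\mathrm{vec}\left(\nabla_{\mathbf{x}_{t,i}} f_1(\mathbf{x}_{t,i}; \theta^1_{t-1})\right)^\top}{\sqrt{2}\, \|\nabla_{\mathbf{x}_{t,i}} f_1(\mathbf{x}_{t,i}; \theta^1_{t-1})\|_2}, \; \frac{\mathbf{x}_{t,i}^\top}{\sqrt{2}} \right) \in \mathbb{R}^{2dk}, \tag{3.2}$$

where $\nabla_{\mathbf{x}_{t,i}} f_1$ is the partial derivative of $f_1(\mathbf{x}_{t,i}; \theta^1_{t-1})$ with respect to $\mathbf{x}_{t,i}$.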

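$\phi(\mathbf{x}_{t,i})$ is normalized so that $\|\phi(\mathbf{x}_{t,i})\|_2 = 1$. Note that the input for $f_2$ in [ ] is the gradient with respect to $\theta^1$, denoted by $\nabla_{\theta^1} f_1(\mathbf{x}_{t,i}; \theta^1_{t-1}) \in \mathbb{R}^{p_1}$. Its dimensionality is much larger than that of $\nabla_{\mathbf{x}_{t,i}} f_1(\mathbf{x}_{t,i}; \theta^1_{t-1})$ in Definition 3.1, which may cause significant computation cost.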
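The DC embedding of Definition 3.1 can be sketched with a hand-written backward pass through a bias-free ReLU network. This is an illustrative sketch, not the authors' implementation: `input_gradient` and `dc_embedding` are hypothetical names, the `eps` guard against a zero gradient is an added assumption, and the unit-norm property relies on the context vector itself having unit norm.

```python
import numpy as np

def input_gradient(params, x: np.ndarray) -> np.ndarray:
    """Gradient of the scalar output of a bias-free ReLU network with respect to its input x."""
    a = x
    masks = []
    for W in params[:-1]:
        z = W @ a
        masks.append(z > 0)  # remember which ReLU units are active
        a = np.maximum(0.0, z)
    g = params[-1][0].copy()  # d(output) / d(last hidden activation) = W_L
    for W, mask in zip(reversed(params[:-1]), reversed(masks)):
        g = W.T @ (g * mask)  # back through the ReLU, then through the linear map
    return g

def dc_embedding(params, x: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """phi(x) = (grad / (sqrt(2) * ||grad||_2), x / sqrt(2)), a vector of length 2 * len(x)."""
    grad = input_gradient(params, x)
    grad_norm = max(float(np.linalg.norm(grad)), eps)  # eps guard is an assumption
    return np.concatenate([grad / (np.sqrt(2.0) * grad_norm), x / np.sqrt(2.0)])

# Tiny hand-built network (L = 2, m = 2) so that the gradient is easy to check by hand.
toy_params = [np.array([[1.0, 0.0], [0.0, 1.0]]), np.array([[1.0, 2.0]])]
x_unit = np.array([0.6, 0.8])  # unit-norm input
grad = input_gradient(toy_params, x_unit)
phi = dc_embedding(toy_params, x_unit)
```

The embedding has twice the input dimension, which is typically far smaller than the number of network parameters $p_1$.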

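Given the input $\phi(\mathbf{x}_{t,i})$, we similarly choose a fully-connected network to build $f_2$:

$$f_2(\phi(\mathbf{x}_{t,i}); \theta^2) = \mathbf{W}^2_L \sigma(\mathbf{W}^2_{L-1} \sigma(\mathbf{W}^2_{L-2} \dots \sigma(\mathbf{W}^2_1 \phi(\mathbf{x}_{t,i})))), \tag{3.3}$$

where $\mathbf{W}^2_1 \in \mathbb{R}^{m \times 2kd}$, $\mathbf{W}^2_l \in \mathbb{R}^{m \times m}$ for $2 \le l \le L-1$, $\mathbf{W}^2_L \in \mathbb{R}^{1 \times m}$, and $\theta^2 = [\mathrm{vec}(\mathbf{W}^2_1)^\top, \dots, \mathrm{vec}(\mathbf{W}^2_L)^\top]^\top \in \mathbb{R}^{p_2}$. In round $t$, given $\mathbf{x}_{t,i}, \forall i \in [k]$, $f_2$ is to predict $h(\mathbf{x}_{t,i}) - f_1(\mathbf{x}_{t,i}; \theta^1_{t-1})$ for exploration. Because $h(\mathbf{x}_{t,i}) - f_1(\mathbf{x}_{t,i}; \theta^1_{t-1}) = \mathbb{E}_{y_t \sim \mathcal{D}_{\mathcal{Y}|\mathbf{x}_t}} [1 - L(\mathbf{y}_{t,i}, \mathbf{y}_{t,y_t}) - f_1(\mathbf{x}_{t,i}; \theta^1_{t-1})]$, we regard $1 - L(\mathbf{y}_{t,i}, \mathbf{y}_{t,y_t}) - f_1(\mathbf{x}_{t,i}; \theta^1_{t-1})$ as the label for training $f_2$.

Algorithm 1 I-NeurAL
Input: $T$ (number of rounds), $f_1, f_2$ (neural networks), $\eta_1, \eta_2$ (learning rates), $\gamma$ (exploration parameter), $b$ (batch size), $\delta$ (confidence level)
1: Initialize $\theta^1_0, \theta^2_0$; $\hat{\theta}^1_0 = \theta^1_0$; $\hat{\theta}^2_0 = \theta^2_0$
2: $\mathcal{H}^1_0 = \emptyset$; $\mathcal{H}^2_0 = \emptyset$
3: for $t = 1, 2, \dots, T$ do
4:     Observe instance $\mathbf{x}_t \in \mathbb{R}^d$ and build $\mathbf{x}_{t,i}, \forall i \in [k]$
5:     for each $i \in [k]$ do
6:         $f(\mathbf{x}_{t,i}; \theta_{t-1}) = \underbrace{f_1(\mathbf{x}_{t,i}; \theta^1_{t-1})}_{\text{Exploitation Score}} + \underbrace{f_2(\phi(\mathbf{x}_{t,i}); \theta^2_{t-1})}_{\text{Exploration Score}}$
7:     end for
8:     $\hat{i} = \arg\max_{i \in [k]} f(\mathbf{x}_{t,i}; \theta_{t-1})$
9:     $i^\circ = \arg\max_{i \in ([k] \setminus \{\hat{i}\})} f(\mathbf{x}_{t,i}; \theta_{t-1})$
10:    Predict $\mathbf{y}_{t,\hat{i}}$
11:    $I_t = \mathbb{1}\{|f(\mathbf{x}_{t,\hat{i}}; \theta_{t-1}) - f(\mathbf{x}_{t,i^\circ}; \theta_{t-1})| < 2\gamma\beta_t\} \in \{0, 1\}$; $\beta_t = \sqrt{\frac{2c_1}{t}} + c_2 \frac{3L}{\sqrt{2t}} + \sqrt{\frac{2\log(c_3 Tk/\delta)}{t}}$
12:    if $I_t = 1$ then
13:        Query $\mathbf{x}_t$ and observe $y_t$
14:        for $i \in [k]$ do
15:            $r^1_{t,i} = 1 - L(\mathbf{y}_{t,i}, \mathbf{y}_{t,y_t})$ (defined in Eq. (2.2))
16:            $r^2_{t,i} = r^1_{t,i} - f_1(\mathbf{x}_{t,i}; \theta^1_{t-1})$
17:        end for
18:    else
19:        for $i \in [k]$ do
20:            $r^1_{t,i} = 1 - L(\mathbf{y}_{t,i}, \mathbf{y}_{t,\hat{i}})$
21:            $r^2_{t,i} = r^1_{t,i} - f_1(\mathbf{x}_{t,i}; \theta^1_{t-1})$
22:        end for
23:    end if
24:    $\mathcal{H}^1_t = \mathcal{H}^1_{t-1} \cup \{(\mathbf{x}_{t,i}, r^1_{t,i}), i \in [k]\}$
25:    $\mathcal{H}^2_t = \mathcal{H}^2_{t-1} \cup \{(\mathbf{x}_{t,i}, r^2_{t,i}), i \in [k]\}$
26:    $\theta^1_t, \theta^2_t$ = Mini-Batch-SGD-Warm-Start($f_1, f_2, \mathcal{H}^1_t, \mathcal{H}^2_t, b$)
27: end for
28: Return $(\theta^1, \theta^2)$ drawn uniformly from $((\theta^1_0, \theta^2_0), \dots, (\theta^1_{T-1}, \theta^2_{T-1}))$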

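To sum up, in round $t$, given $\mathbf{x}_{t,i}, \forall i \in [k]$, the prediction ($\mathbf{y}_{t,\hat{i}}$) is made based on the sum of the exploitation and exploration scores, i.e., $f_1(\mathbf{x}_{t,i}; \theta^1_{t-1}) + f_2(\phi(\mathbf{x}_{t,i}); \theta^2_{t-1})$ (Lines 5-10).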

Query Decision-maker (Line 11). A label query is made when I-NeurAL is not confident enough to discriminate the Bayes-optimal class from other classes. $2\gamma\beta_t$ ($\beta_t$ is also defined in Lemma 7.3) can be thought of as a confidence interval for the distance between the optimal class and the second optimal class, where $\gamma$ is the hyper-parameter that tunes the sensitivity of the decision-maker in practice. Given any $\gamma \ge 1$, $\delta \in (0, 1)$, with probability at least $1 - \delta$, based on our analysis (Lemma 7.5), $\mathbb{E}_{(\mathbf{x}_t, y_t) \sim \mathcal{D}}[L(\mathbf{y}_{t,\hat{i}}, \mathbf{y}_{t,y_t})] = \mathbb{E}_{(\mathbf{x}_t, y_t) \sim \mathcal{D}}[L(\mathbf{y}_{t,i^*}, \mathbf{y}_{t,y_t})]$ when $I_t = 0$, i.e., I-NeurAL suffers no regret. Thus, we use $\mathbf{y}_{t,\hat{i}}$ as the pseudo-label in this case, and we have the following update rules.
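The query rule and the resulting training targets (Lines 8-22 of Algorithm 1) can be sketched as follows. This is an illustration under stated assumptions: `query_decision` and `training_targets` are hypothetical names, classes are 0-indexed, and `beta_t` is passed in by the caller because the constants in its definition come from the analysis and are not specified in this section.

```python
import numpy as np

def query_decision(scores, gamma: float, beta_t: float):
    """Return (i_hat, i_circ, I_t): the top class, the runner-up, and the query indicator."""
    scores = np.asarray(scores, dtype=float)
    i_hat = int(np.argmax(scores))
    masked = scores.copy()
    masked[i_hat] = -np.inf  # exclude the top class when searching for the runner-up
    i_circ = int(np.argmax(masked))
    gap = abs(scores[i_hat] - scores[i_circ])
    return i_hat, i_circ, int(gap < 2.0 * gamma * beta_t)

def training_targets(f1_scores, i_hat: int, queried: bool, true_class=None):
    """Rewards r1 (targets for f1) and residuals r2 (targets for f2) for all k contexts."""
    if queried and true_class is None:
        raise ValueError("true_class is required when the label was queried")
    reference = true_class if queried else i_hat  # pseudo-label when no query was made
    f1_scores = np.asarray(f1_scores, dtype=float)
    r1 = np.zeros_like(f1_scores)
    r1[reference] = 1.0  # 1 - (0-1 loss) equals 1 only for the reference class
    return r1, r1 - f1_scores

# Illustrative numbers only.
combined = [0.2, 0.55, 0.5]  # f1 + f2 for each of k = 3 classes
i_hat, i_circ, ask = query_decision(combined, gamma=1.0, beta_t=0.1)
_, _, ask_late = query_decision(combined, gamma=1.0, beta_t=0.01)
r1, r2 = training_targets([0.1, 0.6, 0.3], i_hat, queried=True, true_class=2)
```

With a large threshold the small gap between the two best classes triggers a query, while a smaller threshold leaves the same gap outside the confidence interval and the prediction is used as a pseudo-label instead.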

Utilize Full Feedback (Lines 14-25). Different from the bandit setting, where the learner can only observe the reward of the selected context, in active learning we can infer the rewards of all contexts, as we know the specific class of the current instance. Thus, for each $\mathbf{x}_{t,i}, i \in [k]$, $r^1_{t,i} = 1 - L(\mathbf{y}_{t,i}, \mathbf{y}_{t,y_t})$ is regarded as the "reward" of $\mathbf{x}_{t,i}$, predicted by $f_1$, and $r^2_{t,i} = r^1_{t,i} - f_1(\mathbf{x}_{t,i}; \theta^1_{t-1})$ is regarded as the "residual reward" of $\mathbf{x}_{t,i}$, predicted by $f_2$. In summary, in round $t$, when $I_t = 1$, $\mathbf{y}_{t,y_t}$ is observed to
