Improved Algorithms for Neural Active Learning Yikun Ban Yuheng Zhang Hanghang Tong Arindam Banerjee and Jingrui He University of Illinois Urbana-Champaign

2025-05-08 0 0 1.19MB 34 页 10玖币
侵权投诉
Improved Algorithms for Neural Active Learning
Yikun Ban, Yuheng Zhang, Hanghang Tong, Arindam Banerjee, and Jingrui He
University of Illinois Urbana-Champaign
{yikunb2, yuhengz2, htong, arindamb, jingrui}@illinois.edu
Abstract
We improve the theoretical and empirical performance of neural-network(NN)-
based active learning algorithms for the non-parametric streaming setting. In
particular, we introduce two regret metrics by minimizing the population loss that
are more suitable in active learning than the one used in state-of-the-art (SOTA)
related work. Then, the proposed algorithm leverages the powerful representation
of NNs for both exploitation and exploration, has the query decision-maker tailored
for
k
-class classification problems with the performance guarantee, utilizes the
full feedback, and updates parameters in a more practical and efficient manner.
These careful designs lead to an instance-dependent regret upper bound, roughly
improving by a multiplicative factor
O(log T)
and removing the curse of input
dimensionality. Furthermore, we show that the algorithm can achieve the same
performance as the Bayes-optimal classifier in the long run under the hard-margin
setting in classification problems. In the end, we use extensive experiments to eval-
uate the proposed algorithm and SOTA baselines, to show the improved empirical
performance.
1 Introduction
The Neural Network (NN) is one of the indispensable paradigms in machine learning and is widely
used in multifarious supervised-learning tasks [
23
]. As more and more complicated NNs are devel-
oped, the requirement of the training procedure on the labeled data grows, incurring significant cost
of label annotation. Active learning investigates effective techniques on a much smaller labeled data
set while attaining the comparable generalization performance to passive learning [
19
]. In this paper,
we focus on the classification problem in the streaming setting of active learning with NN models. At
every round, the learner receives an instance and is compelled to decide on-the-fly whether or not to
observe the label associated with this instance. This problem seeks to maximize the generalization
capability of learned NNs in a sequence of rounds, such that the model has robust performance on the
unseen data from the same distribution [40].
In active learning, given access to the i.i.d. generated instances from a distribution
D
, suppose
there exist a class of functions
F
that formulate the mapping from instances to theirs labels. In the
parametric setting, i.e.,
F
has finite VC-dimension [
25
], existing works [
24
,
14
,
7
] have shown that the
active learning algorithms can achieve the convergence rate of
e
O(1/N)
to the best population loss
in
F
, where
N
is the number of label queries. In the non-parametric setting, recent works [
34
,
35
]
provide the similar convergence results while suffering from the curse of input dimensionality.
Unfortunately, most of NN-based approaches to active learning do not come with the performance
guarantee, despite having powerful empirical results.
The first performance guarantee for neural active learning has been established in a recent work
by [
48
], and the analysis is for over-parameterized neural networks with the assistance of Neural
Tangent Kernel (NTK). We carefully investigate the limitations of [
48
], which turn into the main
Both authors contribute equally.
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
arXiv:2210.00423v3 [cs.LG] 16 Jan 2023
motivations of our paper. First, [
48
] transforms the classification problem into a multi-armed bandit
problem [
55
], to minimize a pseudo regret metric. Yet, on the grounds that they seek to minimize
the conditional population loss on a sequence of given data, it is dubious that the pseudo regret used
in [
48
] can explicitly measure the generalization capability of given algorithms (see Remark 2.1).
Second, the training process for NN models is not efficient, as [
48
] uses vanilla gradient descent and
starts from randomly initialized parameters in every round. Third, although [
48
] removes the curse of
input dimensionality
d
, the performance guarantee strongly suffers from another introduced term,
the effective dimensionality
e
d
, which can be thought of as the non-linear dimensionalities of Hilbert
space spanned by NTK. In the worse case, the magnitude of
e
d
can be an unacceptably large number
and thus the performance guarantee collapses.
1.1 Main contributions
In this paper, we propose a novel algorithm,
I-NeurAL
(
I
mproved Algorithms for
Neur
al
A
ctive
Learning), to tackle the above limitations. Our contributions can be summarized as follows: (1) We
consider the
k
-class classification problem, and we introduce two new regret metrics to minimize
the population loss, which can directly reflect the generalization capability of NN-based algorithms.
(2)
I-NeurAL
has a neural exploration strategy with a novel component to decide whether or not
to query the label, coming with the performance guarantee.
I-NeurAL
exploits the full feedback
in active learning which is a subtle but effective idea. (3)
I-NeurAL
is designed to support mini-
batch Stochastic Gradient Descent (SGD). In particular, at every round,
I-NeurAL
does mini-batch
SGD starting with the parameters of the last round, i.e., with warm start, which is more efficient
and practical compared to [
48
]. (4) Without any noise assumption on the data distribution, we
provide an instance-dependent performance guarantee of
I-NeurAL
for over-parameterized neural
networks. Compared to [
48
], we remove the curse of both the input dimensionality
d
and the effective
dimensionality
e
d
; Moreover, we roughly improve the regret by a multiplicative factor
log(T)
, where
T
is the number of rounds. (5) under a hard-margin assumption on the data distribution, we provide
that NN models can achieve the same generalization capability as Bayes-optimal classifier after
O(log T)
number of label queries; (6) we conduct extensive experiments on real-world data sets to
demonstrate the improved performance of
I-NeurAL
over state-of-the-art baselines including the
closest work [48] which has not provided empirical validation of their proposed algorithms.
1.2 Related Work
Active learning has been extensively studied and applied to many essential applications [
44
]. Bayesian
active learning methods typically use a probabilistic regression model to estimate the improvement of
each query [
29
,
41
]. In spite of effectiveness on the small or moderate data sets, the Bayesian-based
approaches are difficult to scale to large-scale data sets because of the batch sampling [
43
]. Another
important class, margin algorithms or uncertainty sampling [
33
], obtains considerate performance
improvement over passive learning and is further developed by many practitioners [
20
,
28
,
37
,
15
].
Margin algorithms are flexible and can be adapted to both streaming and pool settings. In the
pool setting, a line of works utilize the neural networks in active learning to improve the empirical
performance [
36
,
42
,
5
,
17
,
30
,
45
,
47
,
54
,
4
]. However, they do not provide performance guarantee
for NN-based active learning algorithms. From the theoretical perspective, [
51
,
21
,
6
,
8
,
52
] provide
the performance guarantee with the specific classes of functions and [26, 22] present the theoretical
analysis of active learning algorithms with the surrogate loss functions for binary classification.
However, their performance guarantee is restricted within hypothesis classes, i.e, the parametric
setting. In contrast, our goal is to derive an NN-based algorithm in the non-parametric setting that
performs well both empirically and theoretically. Neural contextual bandits[
55
,
53
,
13
,
11
,
12
,
39
]
provide the principled method to balance between the exploitation and exploration [
9
,
10
]. [
48
]
transforms active learning into neural contextual bandit problem and obtains a performance guarantee,
of which limitations are discussed above.
As [
48
] is the closest related work to our paper, we emphasize the differences of our techniques from
[
48
] throughout the paper. We introduce the problem definition and proposed algorithms in Section
2 and Section 3 respectively. Then, we provide performance guarantees in Section 4 and empirical
results in Section 5, ending with the conclusion in Section 6.
2
2 Problem Definition
In this paper, we study the streaming setting of active learning in the
k
-class classification problem.
Let
X
denote the input space over
Rd
,
Y={1,2, . . . , k}
represent the label space, and
D
be some
unknown distribution over
X×Y
. At round
t[T] = {1,2, . . . , T }
, an instance
xt
is drawn from the
marginal distribution
DX
and accordingly
yt
is drawn from the conditional distribution
DY|xt
. Here,
yt
can be thought of as the index of the class that
xt
belongs to. Inspired by [
48
], we first transform
xt
into
k
context vectors representing the
k
classes respectively:
xt,1= (x>
t,0>,...,0>)>,xt,2=
(0>,x>
t,...,0>)>,...,xt,k = (0>,0>,...,x>
t)>
and
xt,i Rdk,i[k]
. In accordance with
context vectors, we construct the
k
label vectors representing the
k
possible prediction:
yt,1=
(1,0,...,0)>,yt,2= (0,1,...,0)>,...,yt,k = (0,0,...,1)>
and
yt,i Rk,i[k]
. Thus,
yt,yt
is the ground-truth label vector for xt.
Under the non-parametric setting of active learning, we define an unknown function
h
to formulate
the conditional distribution DY|xt:Xk[0,1], such that
i[k],P(yt,yt=yt,i|xt) = h(xt,i),(2.1)
which is subject to
Pk
i=1 h(xt,i)=1
. For simplicity, we consider the
k
-class classification problem
with 0-1 loss. Given
xt
, i.e.,
xt,i, i [k]
, let
bi
be the index of the class predicted by some hypothesis
fand thus yt,
b
iis the prediction. Then, we have the following loss:
L(yt,
b
i,yt,yt) = {yt,
b
i6=yt,yt} ∈ {0,1}.(2.2)
where is the indicator function.
Given the number of rounds
T
, at each round
t[T]
, the learner receives an instance
xt
drawn i.i.d.
from DX. Then, the learner needs to make a prediction yt,
b
i, and at the same time, decide on-the-fly
whether or not to query the label
yt,yt
where
yt
is drawn i.i.d. from
DY|xt
. As the goal of active
learning tasks is often to minimize the population loss [
40
], we introduce the following two regret
metrics.
Definition 2.1
(Latest Population Regret)
.
Given the data distribution
D
, the number of rounds
T
,
the Latest Population Regret is defined as
RT=E
xTDXE
yT∼DY|xT
[L(yT,
b
i,yT,yT)|xT]E
xTDXE
yT∼DY|xT
[L(yT,i,yT,yT)|xT]
(2.3)
where
yT,i
is the prediction the Bayes-optimal classifier would make on instance
xT
, i.e.,
i=
arg maxi[k]h(xT,i)for yT,i.
Definition 2.2
(Cumulative Population Regret)
.
Given the data distribution
D
, the number of rounds
T, the Cumulative Population Regret is defined as:
RT=
T
X
t=1 E
xtDXE
yt∼DY|xt
[L(yt,
b
i,yt,yt)|xt]E
xtDXE
yt∼DY|xt
[L(yt,i,yt,yt)|xt]
(2.4)
where
yt,i
is the prediction the Bayes-optimal classifier would make on instance
xt
, i.e.,
i=
arg maxi[k]h(xt,i)for yt,i.
RT
measures the performance at the last round
T
only, and
RT
measures the overall performance in
T
rounds combined. Therefore, the goal of this problem is to minimize
RT
or
RT
, or both. At the
same time, we also aim to minimize the following expected query cost:
NT=
T
X
t=1
E
xt∼DX
[It|xt],(2.5)
where
It
is the indicator of the query decision in round
t
such that
It= 1
if
yt
is observed;
It= 0
,
otherwise.
Remark 2.1.
Minimizing
RT
or
RT
shows the generalization capability of the learned hypothesis
on the distribution
D
. However, the problem defined in [
48
] is to minimize the cumulative conditional
3
population regret as follows:
e
RT=
T
X
t=1 E
yt∼DY|xt
[L(yt,
b
i,yt,yt)|xt]E
yt∼DY|xt
[L(yt,i,yt,yt)|xt].(2.6)
As
Eyt∼DY|xt[L(yt,
b
i,yt,yt)|xt]
is the population loss conditioned on
xt
, unfortunately,
e
RT
only
measures the performance of the learned hypothesis on the collected data
{xt}T
t=1
, and
e
RT
cannot
directly measure the accuracy of the hypothesis on unseen data instances. Although
e
RT
follows the
regret definition in multi-armed bandits [
55
], it is fair to say that
e
RT
may not be a good metric in
active learning.
3 Proposed Algorithms
In this section, we elaborate on the proposed algorithm
I-NeurAL
(Algorithm 1). In contrast to the
directly comparable work [
48
],
I-NeurAL
has the following novel and advantageous aspects: (1)
I-NeurAL
incorporates a neural-based exploration strategy (Line 6) inspired by recent advances in
bandits [
13
] to solve the exploitation-exploration dilemma in the decision for whether or not to query
labels; (2)
I-NeurAL
includes a novel component (Line 11) to decide whether or not to query labels
in the
k
-class classification problem; (3)
I-NeurAL
infers and exploits the feedback of all the contexts
(Lines 12-17), instead of only utilizing the feedback of the chosen context in [
48
]; (4)
I-NeurAL
conducts mini-batch SGD based on the parameters of the last round (Algorithm 2), which is more
practical, as opposed to conducting vanilla gradient descent from the initialization at every round in
[48]. Next, we will present the details of I-NeurAL.
Exploitation Network
f1
.Given
xt,i, i [k]
, to learn the unknown function
h
(Eq. (2.1)), we use a
fully-connected neural network f1with L-depth and m-width:
f1(xt,i;θ1) = W1
Lσ(W1
L1σ(W1
L2. . . σ(W1
1xt,i))),(3.1)
where
W1
1Rm×kd,W1
lRm×m
, for
2lL1
,
W1
LR1×m
,
θ1=
[vec(W1
1)>,...,vec(W1
L)>]>Rp1
, and
σ
is the ReLU activation function
σ(x) = max{0,x}
.
In round
t
, given
xt,i, i [k]
,
f1(xt,i;θ1
t1)
is assigned to learn
h(xt,i)
. Based on the fact
h(xt,i) = E
yt∼DY|xt
[1 L(yt,i,yt,yt)]
, it is natural to regard
1L(yt,i,yt,yt)
as the label for
training
f1
. Note that we take the basic fully-connected network as an example for the sake of
analysis in over-parameterized networks and
f1
can be easily replaced with more complicated models
depending on the tasks.
Exploration Network
f2
.In addition to the network
f1
, we assign another network
f2
to explore
uncertain information contained in incoming instances. First, we carefully design the input of
f2
to
incorporate the context vectors of the instance and the discrimination-ability of
f1
, to learn the error
between the Bayes-optimal probability h(xt,i)and the prediction f1(xt,i;θ1).
Definition 3.1
(Derivative-Context (DC) Embedding)
.
Given the exploitation network
f1(·;θ1
t1)
and an input context xt,i, its DC embeding is defined as
φ(xt,i) = vec Oxt,i f1(xt,i;θ1
t1)>
2kOxt,i f1(xt,i;θ1
t1)k2
,xt,i>
2!R2dk,(3.2)
where Oxt,i f1is the partial derivative of f1(xt,i;θ1
t1)with respect to xt,i.
φ(xt,i)
is normalized so that
kφ(xt,i)k2= 1
. Note that the input for
f2
in [
13
] is the gradient
with respect to
θ1
, denoted by
Oθ1f1(xt,i;θ1
t1)Rp1
. Its dimensionality is much larger than
Oxt,i f1(xt,i;θ1
t1)in Definition 3.1, may causing significant computation cost.
Given the input φ(xt,i), similarly, we choose the fully-connected network to build f2:
f2(φ(xt,i); θ2) = W2
Lσ(W2
L1σ(W2
L2. . . σ(W2
1φ(xt,i)))),(3.3)
where
W2
1Rm×2kd,W2
lRm×m
, for
2lL1
,
W2
L=R1×m
and
θ2=
[vec(W2
1)>,...,vec(W2
L)>]>Rp2
. In round
t
, given
xt,i,i[k]
,
f2
is to predict
h(xt,i)
4
Algorithm 1 I-NeurAL
Input: T
(number of rounds)
f1, f2
(neural networks),
η1, η2
(learning rate),
γ
(exploration parame-
ter), b(batch size), δ(confidence level)
1: Initialize θ1
0,θ2
0;b
θ1
0=θ1
0;b
θ2
0=θ2
0
2: H1
0=;H2
0=
3: for t= 1,2, . . . , T do
4: Observe instance xtRdand build xt,i,i[k]
5: for each i[k]do
6: f(xt,i;θt1) = f1(xt,i;θ1
t1)
Exploitation Score
+f2(φ(xt,i); θ2
t1)
Exploration Score
7: end for
8: bi= arg maxi[k]f(xt,i;θt1)
9: i= arg maxi([k]\{
b
i})f(xt,i;θt1)
10: Predict yt,
b
i
11: It={|f(xt,
b
i;θt1)f(xt,i;θt1)|<2γβt} ∈ {0,1}
;
βt=q2c1
t+c23L
2t+
q2 log(c3T k))
t
12: if It= 1 then
13: Query xtand observe yt
14: for i[k]do
15: r1
t,i = 1 L(yt,i,yt,yt)(defined in E.q. (2.2))
16: r2
t,i =r1
t,i f1(xt,i;θ1
t1)
17: end for
18: else
19: for i[k]do
20: r1
t,i = 1 L(yt,i,yt,
b
i)
21: r2
t,i =r1
t,i f1(xt,i;θ1
t1)
22: end for
23: end if
24: H1
t=H1
t1∪ {(xt,i, r1
t,i), i [k]}
25: H2
t=H2
t1∪ {(xt,i, r2
t,i), i [k]}
26: θ1
t,θ2
t= Mini-Batch-SGD-Warm-Start ( f1,f2,H1
t,H2
t,b)
27: end for
28: Return (θ1,θ2)uniformly from ((θ1
0,θ2
0),...,θ1
T1,θ2
T1)
f1(xt,i;θ1
t1)
for exploration. Because
h(xt,i)f1(xt,i;θ1
t1) = E
yt∼DY|xt
[1 L(yt,i,yt,yt)
f1(xt,i;θ1
t1)], we regard 1L(yt,i,yt,yt)f1(xt,i;θ1
t1)as the label for training f2.
To sum up, in round
t
, given
xt,i,i[k]
, the prediction
bi
(
yt,
b
i
) is made based on the sum of
exploitation and exploration scores, i.e., f1(xt,i;θ1
t1) + f2(φ(xt,i); θ2
t1)(Lines 5-10).
Query Decision-maker (Line 11). A label query is made when
I-NeurAL
is not confident enough to
discriminate the Bayes-optimal class from other classes.
2γβt
(
βt
is also defined in Lemma 7.3)
can be thought of as a confidence interval for the distance between the optimal class and second
optimal class, where
γ
is the hyper-parameter to tune the sensitivity of the decision-maker in practice.
Given any
γ1, δ (0,1)
, with probability at least
1δ
, based on our analysis (Lemma 7.5),
E(xt,yt)∼D[L(yt,
b
i,yt,yt)] = E(xt,yt)∼D[L(yt,i,yt,yt)]
when
It= 0
, i.e.,
I-NeurAL
suffers no
regret. Thus, we use yt,
b
ias the pseudo-label in this case and we have the following update rules.
Utilize Full Feedback (Lines 14-25). Different from the bandit setting where the learner can only
observe the reward of the selected context, we can infer the rewards of all contexts in active learning, as
we know the specific class of the current instance. Thus, for each
xt,i, i [k]
,
r1
t,i = 1L(yt,i,yt,yt)
is regarded as the "reward" of
xt,i
, predicted by
f1
, and
r2
t,i =r1
t,i f1(xt,i;θ1)
is regarded as the
"residual reward" of
xt,i
, predicted by
f2
. In summary, in round
t
, when
It= 1
,
yt,yt
is observed to
5
摘要:

ImprovedAlgorithmsforNeuralActiveLearningYikunBan,YuhengZhang,HanghangTong,ArindamBanerjee,andJingruiHeUniversityofIllinoisUrbana-Champaign{yikunb2,yuhengz2,htong,arindamb,jingrui}@illinois.eduAbstractWeimprovethetheoreticalandempiricalperformanceofneural-network(NN)-basedactivelearningalgorithmsf...

展开>> 收起<<
Improved Algorithms for Neural Active Learning Yikun Ban Yuheng Zhang Hanghang Tong Arindam Banerjee and Jingrui He University of Illinois Urbana-Champaign.pdf

共34页,预览5页

还剩页未读, 继续阅读

声明:本站为文档C2C交易模式,即用户上传的文档直接被用户下载,本站只是中间服务平台,本站所有文档下载所得的收益归上传人(含作者)所有。玖贝云文库仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对上载内容本身不做任何修改或编辑。若文档所含内容侵犯了您的版权或隐私,请立即通知玖贝云文库,我们立即给予删除!
分类:图书资源 价格:10玖币 属性:34 页 大小:1.19MB 格式:PDF 时间:2025-05-08

开通VIP享超值会员特权

  • 多端同步记录
  • 高速下载文档
  • 免费文档工具
  • 分享文档赚钱
  • 每日登录抽奖
  • 优质衍生服务
/ 34
客服
关注