motivations of our paper. First, [48] transforms the classification problem into a multi-armed bandit problem [55] in order to minimize a pseudo regret metric. Yet, because their goal is to minimize the conditional population loss on a sequence of given data, it is questionable whether the pseudo regret used in [48] can explicitly measure the generalization capability of a given algorithm (see Remark 2.1).
Second, the training process for NN models is not efficient, as [48] uses vanilla gradient descent and starts from randomly initialized parameters in every round. Third, although [48] removes the curse of the input dimensionality $d$, the performance guarantee strongly suffers from another introduced term, the effective dimensionality $\tilde{d}$, which can be thought of as the non-linear dimensionality of the Hilbert space spanned by the NTK. In the worst case, the magnitude of $\tilde{d}$ can be unacceptably large, and the performance guarantee then collapses.
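For concreteness, one common definition of the effective dimension in the neural contextual bandit literature (stated here only to illustrate the failure mode; the precise form used in [48] may differ) is
$$\tilde{d} = \frac{\log\det\left(\mathbf{I} + \mathbf{H}/\lambda\right)}{\log\left(1 + T/\lambda\right)},$$
where $\mathbf{H}$ is the NTK Gram matrix over the observed contexts and $\lambda > 0$ is a regularization parameter. When the eigenvalues of $\mathbf{H}$ decay slowly, $\tilde{d}$ can grow with the number of rounds $T$, which is exactly the regime in which the guarantee degrades.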
1.1 Main contributions
In this paper, we propose a novel algorithm, I-NeurAL (Improved Algorithms for Neural Active Learning), to tackle the above limitations. Our contributions can be summarized as follows: (1) We consider the $k$-class classification problem, and we introduce two new regret metrics to minimize the population loss, which can directly reflect the generalization capability of NN-based algorithms.
(2) I-NeurAL has a neural exploration strategy with a novel component that decides whether or not to query the label, and it comes with a performance guarantee. I-NeurAL exploits the full feedback in active learning, which is a subtle but effective idea. (3) I-NeurAL is designed to support mini-batch Stochastic Gradient Descent (SGD). In particular, at every round, I-NeurAL performs mini-batch SGD starting from the parameters of the previous round, i.e., with warm start (see the illustrative sketch at the end of this subsection), which is more efficient and practical than [48]. (4) Without any noise assumption on the data distribution, we provide an instance-dependent performance guarantee of I-NeurAL for over-parameterized neural networks. Compared to [48], we remove the curse of both the input dimensionality $d$ and the effective dimensionality $\tilde{d}$; moreover, we roughly improve the regret by a multiplicative factor of $\log(T)$, where $T$ is the number of rounds. (5) Under a hard-margin assumption on the data distribution, we show that NN models can achieve the same generalization capability as the Bayes-optimal classifier after $O(\log T)$ label queries. (6) We conduct extensive experiments on real-world data sets to demonstrate the improved performance of I-NeurAL over state-of-the-art baselines, including the closest work [48], which did not provide empirical validation of its proposed algorithms.
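To make the warm-start training of contribution (3) concrete, below is a minimal sketch (not the authors' implementation; the network size, learning rate, and label oracle are illustrative placeholders) of mini-batch SGD where the parameters and optimizer state persist across rounds, in contrast to the cold-start alternative of re-initializing the network and re-running gradient descent from scratch at every round, as in [48].

import torch
import torch.nn as nn

D_IN, WIDTH, K, LR, BATCH = 10, 64, 2, 1e-2, 16   # illustrative sizes, not taken from the paper

def make_net():
    return nn.Sequential(nn.Linear(D_IN, WIDTH), nn.ReLU(), nn.Linear(WIDTH, K))

def train_round(net, opt, xs, ys, num_steps=10):
    # A few mini-batch SGD steps over the labels queried so far, continuing
    # from the parameters produced in the previous round (warm start).
    loss_fn = nn.CrossEntropyLoss()
    n = xs.shape[0]
    for _ in range(num_steps):
        idx = torch.randint(0, n, (min(BATCH, n),))
        opt.zero_grad()
        loss_fn(net(xs[idx]), ys[idx]).backward()
        opt.step()

# Warm start: the network and optimizer are created once and persist across rounds.
# A cold-start scheme would instead call make_net() inside the loop and retrain
# from random initialization at every round.
net = make_net()
opt = torch.optim.SGD(net.parameters(), lr=LR)
xs, ys = [], []
for t in range(100):
    x_t = torch.randn(1, D_IN)               # incoming instance
    y_t = (x_t.sum(dim=1) > 0).long()        # placeholder label oracle
    xs.append(x_t)                           # assume the label is queried this round
    ys.append(y_t)
    train_round(net, opt, torch.cat(xs), torch.cat(ys))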
1.2 Related Work
Active learning has been extensively studied and applied to many essential applications [44]. Bayesian active learning methods typically use a probabilistic regression model to estimate the improvement brought by each query [29, 41]. Despite their effectiveness on small or moderate data sets, Bayesian approaches are difficult to scale to large data sets because of the batch sampling [43]. Another important class, margin algorithms or uncertainty sampling [33], obtains considerable performance improvement over passive learning and has been further developed by many practitioners [20, 28, 37, 15].
Margin algorithms are flexible and can be adapted to both the streaming and pool settings. In the pool setting, a line of works utilizes neural networks in active learning to improve the empirical performance [36, 42, 5, 17, 30, 45, 47, 54, 4]. However, these works do not provide performance guarantees
for NN-based active learning algorithms. From the theoretical perspective, [51, 21, 6, 8, 52] provide performance guarantees for specific classes of functions, and [26, 22] present theoretical analyses of active learning algorithms with surrogate loss functions for binary classification. However, their performance guarantees are restricted to hypothesis classes, i.e., the parametric setting. In contrast, our goal is to derive an NN-based algorithm in the non-parametric setting that
performs well both empirically and theoretically. Neural contextual bandits [55, 53, 13, 11, 12, 39] provide a principled way to balance exploitation and exploration [9, 10]. [48] transforms active learning into a neural contextual bandit problem and obtains a performance guarantee, whose limitations are discussed above.
As [48] is the closest related work to ours, we emphasize the differences between our techniques and those of [48] throughout the paper. We introduce the problem definition and the proposed algorithms in Section 2 and Section 3, respectively. We then provide performance guarantees in Section 4 and empirical results in Section 5, ending with the conclusion in Section 6.