two tasks, namely sentiment analysis and topic
classification. Baselines cover uncertainty sampling, diversity sampling, and two state-of-the-art hybrid active learning methods (BADGE, ALPS).
Our main contributions are as follows:
•
We use Virtual Adversarial Perturbation to measure model uncertainty in sentence understanding tasks for the first time. Local smoothness is treated as model uncertainty, which relies less on poorly calibrated model confidence scores.
•
We present VAPAL (Virtual Adversarial Perturbation for Active Learning), which combines uncertainty and diversity within a single framework.
•
We show that VAPAL performs on par with or better than the baselines on four tasks. Our method can successfully replace the gradient-based representation used in BADGE. Furthermore, it does not rely on a specific self-supervised loss, unlike the Masked Language Model loss used in ALPS.
2 Related Work
To reduce the cost of labeling, active learning seeks to select the most informative data points from the unlabeled pool for human annotation. The learner model is then trained on the newly labeled data, and the process repeats. Prior active learning sampling methods primarily focus on uncertainty or diversity. Uncertainty sampling methods are the most popular and widely used strategies; they select difficult examples to label (Lewis and Gale, 1994; Joshi et al., 2009; Houlsby et al., 2011). Diversity sampling selects a subset of data points that effectively represents the distribution of the whole pool (Geifman and El-Yaniv, 2017; Sener and Savarese, 2017; Gissin and Shalev-Shwartz, 2019).
A successful active learning method requires the
incorporation of both aspects, but the exact imple-
mentation is still open for discussion.
Recently, hybrid approaches that combine uncertainty and diversity sampling have also been proposed. Naive combination frameworks have been shown to hurt test accuracy and to depend on hyperparameters (Hsu and Lin, 2015). Aiming for more sophisticated combination frameworks, Ash et al. (2019) propose Batch Active Learning By Diverse Gradient Embeddings (BADGE), and Yuan et al. (2020) propose Active Learning by Processing Surprisal (ALPS). Both follow the same framework: first build an uncertainty representation for each unlabeled example, then cluster these representations for diversity. BADGE measures data uncertainty as the gradient magnitude with respect to the parameters of the final (output) layer and forms a gradient-embedding-based data representation (sketched below). However, according to Yuan et al. (2020), BADGE has two main issues: reliance on warm-starting and computational inefficiency.
ALPS builds data embeddings from a self-supervised loss, the Masked Language Model (MLM) loss (Yuan et al., 2020). Nevertheless, the MLM loss is only an indirect proxy for model uncertainty on downstream classification tasks, and ALPS may only work with language models pre-trained with an MLM objective.
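To make the gradient-embedding idea concrete, the following is a minimal sketch of a BADGE-style embedding as we read Ash et al. (2019); it is not their released implementation, and the function and variable names are hypothetical. For a linear output layer, the gradient of the cross-entropy loss at the model's own predicted (pseudo) label with respect to the output-layer weights reduces to the outer product of (softmax probabilities minus the one-hot pseudo label) and the penultimate-layer features.

import torch
import torch.nn.functional as F

def badge_gradient_embedding(penultimate, logits):
    # penultimate: [batch, hidden] features feeding the final linear layer.
    # logits:      [batch, num_classes] outputs of that layer.
    probs = F.softmax(logits, dim=-1)
    pseudo = probs.argmax(dim=-1)  # the model's own predicted (pseudo) label
    one_hot = F.one_hot(pseudo, num_classes=logits.size(-1)).float()
    # Cross-entropy gradient w.r.t. the output-layer weights:
    # outer product of (probs - one_hot) and the penultimate features.
    grad = (probs - one_hot).unsqueeze(-1) * penultimate.unsqueeze(1)
    return grad.flatten(start_dim=1)  # [batch, num_classes * hidden]

# Toy usage with random features and 3-class logits.
h = torch.randn(4, 768)
logits = torch.randn(4, 3)
print(badge_gradient_embedding(h, logits).shape)  # torch.Size([4, 2304])

A diverse batch is then chosen by running k-MEANS++ seeding over these embeddings, so that selected examples have both large gradient norms (uncertainty) and dissimilar gradient directions (diversity).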
What else can serve as a model uncertainty representation and be combined efficiently with diversity sampling? Virtual adversarial perturbation from virtual adversarial training (Miyato et al., 2019) is a promising option. Deep learning methods are prone to over-fitting, especially when the training set is relatively small. In adversarial training, adversarial attacks are used to approximate the smallest perturbation of a given latent state that crosses the decision boundary (Goodfellow et al., 2014; Kurakin et al., 2016); such perturbations have proven to be an important proxy for assessing model robustness. Moreover, in active learning the labeled data is scarce, and virtual adversarial training (VAT) does not require true label information, thus making full use of the unlabeled data. VAT can be seen as a regularization method based on a Local Distributional Smoothness (LDS) loss, where LDS is defined as a negative measure of the smoothness of the model's output distribution under local perturbations around input data points, in the sense of KL-divergence (Miyato et al., 2019).
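Concretely, writing p(· | x, θ) for the model's output distribution, the objective can be written as follows; this is our paraphrase of the definition in Miyato et al. (2019), with θ̂ denoting the current parameter estimate treated as a constant and ε bounding the perturbation norm:

\mathrm{LDS}(x;\theta) = -\,D_{\mathrm{KL}}\big(p(\cdot \mid x, \hat{\theta}) \,\|\, p(\cdot \mid x + r_{\mathrm{vadv}}, \theta)\big),
\qquad
r_{\mathrm{vadv}} = \operatorname*{arg\,max}_{\|r\|_2 \le \epsilon} D_{\mathrm{KL}}\big(p(\cdot \mid x, \hat{\theta}) \,\|\, p(\cdot \mid x + r, \theta)\big).

In practice, Miyato et al. (2019) approximate r_vadv without any label information by a single power-iteration step on the gradient of this KL-divergence, starting from a random direction.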
The virtual adversarial perturbation can be crafted without label information, which can help alleviate the warm-starting issue that BADGE faces. Yu and Pao (2020) coarsely rank grouped examples by the LDS scores of their model predictions. Our method is inspired by the same line of research (Miyato et al., 2019) but differs from it in several respects. Our method aims
to project data into a model smoothness representation space rather than reducing it to a single rough scalar score, so it is more effective. We introduce the virtual adversarial
perturbation as sentence representations by which
model uncertainty is inherently expressed. Further-
more, we consider both uncertainty and diversity