TiDAL: Learning Training Dynamics for Active Learning
Seong Min Kye1,† Kwanghee Choi2,† Hyeongmin Byun1 Buru Chang3,∗
1Hyperconnect 2Carnegie Mellon University 3Sogang University
{harris,hyeongmin.byun}@hpcnt.com kwanghec@andrew.cmu.edu buru@sogang.ac.kr
arXiv:2210.06788v3 [cs.LG] 28 Sep 2023
Abstract
Active learning (AL) aims to select the most useful data
samples from an unlabeled data pool and annotate them to
expand the labeled dataset under a limited budget. Espe-
cially, uncertainty-based methods choose the most uncer-
tain samples, which are known to be effective in improving
model performance. However, previous methods often over-
look training dynamics (TD), defined as the ever-changing
model behavior during optimization via stochastic gradi-
ent descent, even though other research areas have empir-
ically shown that TD provides important clues for measur-
ing the data uncertainty. In this paper, we first provide the-
oretical and empirical evidence to argue the usefulness of
utilizing the ever-changing model behavior rather than the
fully trained model snapshot. We then propose a novel AL
method, Training Dynamics for Active Learning (TiDAL),
which efficiently predicts the training dynamics of unla-
beled data to estimate their uncertainty. Experimental re-
sults show that our TiDAL achieves better or comparable
performance on both balanced and imbalanced benchmark
datasets compared to state-of-the-art AL methods, which es-
timate data uncertainty using only static information after
model training.
1. Introduction
“There is a tide in the affairs of men. Which taken at the
flood, leads on to fortune.” — William Shakespeare
Active learning (AL) [5,31] aims to solve the real-
world problem of selecting the most useful data samples
from large-scale unlabeled data pools and annotating them
to expand labeled data under a limited budget. Since the
current deep neural networks are data-hungry, AL has in-
creasingly gained attention in recent years. Existing AL
methods can be divided into two mainstream categories:
diversity- and uncertainty-based methods. Diversity-based
methods [42,14] focus on constructing a subset that fol-
†Equal contribution.
∗Corresponding author.
This work was done while all authors were affiliated with Hyperconnect.
Figure 1: Our proposed TiDAL. The TD of training samples
x may differ even if they converge to the same final predicted
probability p(y|x) (upper row). Hence, we are motivated
to utilize the readily available rich information generated
during training, i.e., to leverage TD. We estimate the TD
of large-scale unlabeled data using a prediction module
instead of tracking the actual TD of all the unlabeled samples,
avoiding the computational overhead (lower row).
lows the target data distribution. Uncertainty-based meth-
ods [13,6,52] choose the most uncertain samples, which
are known to be effective in improving model performance.
Hence, the most critical question for the latter becomes,
“How can we quantify the data uncertainty?”
In this paper, we leverage training dynamics (TD) to
quantify data uncertainty. TD is defined as the ever-
changing model behavior on each data sample during op-
timization via stochastic gradient descent. Recent stud-
ies [9,29,48,47] have provided empirical evidence that
TD provides important clues for measuring the contribution
of each data sample to model performance improvement.
Inspired by these studies, we argue that the data uncertainty
of unlabeled data can be estimated with TD. However, most
uncertainty-based methods quantify data uncertainty based
on static information (e.g., loss [52] or predicted probabil-
ity [45]) from a fully-trained model “snapshot,” neglecting
the valuable information generated during training. We fur-
ther argue that TD is more effective in separating uncertain
and certain data than static information from a model snap-
shot captured after model training. In §3, we provide both
theoretical and empirical evidence to support our argument
that TD is a valuable tool for quantifying data uncertainty.
Despite its huge potential, TD is not yet actively ex-
plored in the domain of AL. This is because AL assumes
a massive unlabeled data pool. Previous studies track TD
only for the training data every epoch as it can be recorded
easily during model optimization. On the other hand, AL
targets a large number of unlabeled data, where tracking
the TD of each unlabeled sample requires an impractical
amount of computation (e.g., running inference on all the
unlabeled samples at every training epoch).
Therefore, we propose TiDAL (Training Dynamics for
Active Learning), a novel AL method that efficiently quan-
tifies the uncertainty of unlabeled data by estimating their
TD. We avoid tracking the TD of large-scale unlabeled data
every epoch by predicting the TD of unlabeled samples with
a TD prediction module. The module is trained with the TD
of labeled data, which is readily available during model op-
timization. During the data selection phase, we predict the
TD of unlabeled data with the trained module to quantify
their uncertainties. The module lets us obtain TD efficiently,
without running inference on all the unlabeled samples at
every epoch. Experimental results demonstrate that our TiDAL
achieves better or comparable performance to existing AL
methods on both balanced and imbalanced datasets. Ad-
ditional analyses show that our prediction module success-
fully predicts TD, and the predicted TD is useful in estimat-
ing uncertainties of unlabeled data. Our proposed method
is illustrated in Figure 1.
Contributions of our study: (1) We bridge the concept
of training dynamics and active learning with the theoretical
and experimental evidence that training dynamics is effec-
tive in estimating data uncertainty. (2) We propose a new
method that efficiently predicts the training dynamics of un-
labeled data to estimate their uncertainty. (3) Our proposed
method achieves better or comparable performance on both
balanced and imbalanced benchmark datasets compared to
existing active learning methods. For reproducibility, we
release the source code1.
2. Preliminaries
To better understand our proposed method, we first sum-
marize key concepts, including uncertainty-based active
1https://github.com/hyperconnect/TiDAL
learning, quantification of uncertainty, and training dynam-
ics.
Uncertainty-based active learning. In this work, we fo-
cus on uncertainty-based AL for multi-class classification
problems. We define the predicted probabilities of the given
sample xfor Cclasses as:
p= [p(1|x), p(2|x),··· , p(C|x)]T[0,1]C,(1)
where we denote the true label of x as y and the classifier
as f. D and D_u denote a labeled dataset and an unlabeled
data pool, respectively. The general cycle of uncertainty-
based AL has two steps: (1) train the target classifier f
on the labeled dataset D and (2) select the top-k uncertain
data samples from the unlabeled data pool D_u. Selected
samples are then given to human annotators to expand the
labeled dataset D, cycling back to the first step.
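The two-step cycle above can be sketched as follows. This is a minimal sketch, not the paper's released code: `f_train`, `f_predict`, and `uncertainty` are hypothetical stand-ins for the target classifier's training/inference routines and an uncertainty estimator such as entropy.

```python
import numpy as np

def active_learning_cycle(f_train, f_predict, uncertainty, D, D_u, k):
    """One uncertainty-based AL cycle: train on the labeled set D, then
    pick the top-k most uncertain samples from the unlabeled pool D_u
    to hand over to human annotators."""
    model = f_train(D)                                # step (1): fit classifier on D
    scores = np.array([uncertainty(f_predict(model, x)) for x in D_u])
    chosen = np.argsort(-scores)[:k]                  # step (2): top-k most uncertain
    return chosen                                     # indices to annotate
```

After annotation, the chosen samples move from D_u into D and the cycle repeats until the labeling budget is exhausted.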
Quantifying uncertainty. The objective of this study is to
establish a connection between the concept of TD and the
field of AL. To clearly demonstrate the effectiveness of
utilizing TD to quantify data uncertainty, we employ two of
the most prevalent and straightforward estimators, entropy
[43] and margin [41]. Entropy H is defined as follows:

H(p) = −Σ_{c=1}^{C} p(c|x) log p(c|x),   (2)
where the sample x is from the unlabeled data pool D_u.
Entropy focuses on the model's confidence in the given
sample x and grows as the prediction across the classes
becomes uniform (i.e., uncertain). Margin M measures the
difference between the probability of the true label and the
maximum of the others:

M(p) = p(y|x) − max_{c≠y} p(c|x),   (3)

where y denotes the true label. The smaller the margin, the
lower the model's confidence in the sample, so it can be
considered uncertain. Both entropy and margin are computed
with the predicted probabilities p of the fully trained
classifier f, taking only the snapshot of f into account.
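As a minimal NumPy sketch (not the authors' implementation), the two snapshot-based estimators of Equations 2 and 3 can be written as:

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Eq. (2): H(p) = -sum_c p(c|x) log p(c|x).
    Larger values mean a more uniform (uncertain) prediction;
    eps guards against log(0)."""
    return -float(np.sum(p * np.log(p + eps)))

def margin(p, y):
    """Eq. (3): gap between the true-class probability p[y] and the
    largest competing class. Smaller values mean more uncertainty."""
    others = np.delete(p, y)
    return float(p[y] - np.max(others))
```

For a uniform two-class prediction [0.5, 0.5], entropy reaches its maximum log 2, while the margin of a confident prediction like [0.7, 0.2, 0.1] with true label 0 is 0.7 − 0.2 = 0.5.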
Defining training dynamics. TiDAL aims to leverage the
TD of unlabeled data to estimate their uncertainties. TD
can be defined as any model behavior during optimization,
such as the area under the margin between the logit values
of the target class and the largest other class [39], or the
variance of the predicted probabilities generated at each
epoch [47]. In this work, we define the TD p̄^(t) as the area
under the predicted probabilities of each data sample x obtained
Figure 2: Score distributions after long-tailed training,
shown for (a) entropy and (b) margin. We plot the marginal
distributions using kernel density estimation (KDE). It is
difficult to separate major (certain) and minor (uncertain)
class samples with the model snapshot-based scores
(horizontal), unlike the TD-driven scores (vertical), which
clearly separate the certain and uncertain samples.
during the t time steps of optimizing the target classifier f:

p^(i) = [p^(i)(1|x), p^(i)(2|x), ..., p^(i)(C|x)]^T,   (4)

p̄^(t) = [p̄^(t)(1|x), p̄^(t)(2|x), ..., p̄^(t)(C|x)]^T
      = Σ_{i=1}^{t} τ p^(i) = Σ_{i=1}^{t} p^(i) / t,   (5)

where p^(i) is the predicted probabilities of the target
classifier f at the i-th time step, and τ is the unit time step
used to normalize the predicted probabilities. For simplicity,
we record p^(i) every epoch and choose τ = 1/t, namely,
averaging the predicted probabilities over t epochs [47,46].
The TD p̄^(t) takes all the predicted probabilities during
model optimization into account. Hence, it encapsulates the
overall tendency of the model during t epochs of optimization,
avoiding being biased solely towards the snapshot p^(t) at
the final epoch t.
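Since τ = 1/t, the TD of Equation 5 is simply a running mean of the per-epoch predicted probabilities. A minimal NumPy sketch (the class name and interface are illustrative, not from the paper):

```python
import numpy as np

class TDTracker:
    """Maintains the TD of Eq. (5) for one sample as a running mean of
    the per-epoch predicted probabilities p^(i), with tau = 1/t."""
    def __init__(self, num_classes):
        self.sum = np.zeros(num_classes)
        self.t = 0

    def update(self, p_i):
        """Record the predicted probabilities p^(i) at the current epoch."""
        self.sum += p_i
        self.t += 1

    @property
    def td(self):
        """p_bar^(t) = (1/t) * sum_{i=1}^t p^(i)."""
        return self.sum / self.t
```

One tracker per labeled sample suffices, since the TD of labeled data is recorded for free during training; the point of TiDAL is precisely to avoid maintaining such trackers for the unlabeled pool.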
3. Is TD Useful for Quantifying Uncertainty?
In this section, we provide empirical and theoretical ev-
idence to support our argument: TD is more effective in
separating uncertain data from certain data than the model
snapshot, where the latter is often utilized to quantify data
uncertainty in previous works [52,45].
3.1. Motivating Observation
Settings. We aim to observe and compare the behavior of
TD and the model snapshot for different sample difficulties.
However, it is nontrivial to directly measure sample-wise
difficulty, inhibiting the quantitative analysis of data uncer-
tainty. To avoid this, we borrow the theoretical and empirical
results from long-tailed visual recognition [33,8,19]: it is
hard for deep neural network-based models to learn from
classes with fewer samples. Hence, we regard major- and
minor-class samples as containing many certain and uncertain
samples for the model, respectively. We train the target
classifier f on the long-tailed dataset for T epochs to obtain
the TD and the model snapshot. We apply both approaches
to the common estimators, entropy and margin. We denote
the entropy and margin scores from the model snapshot as
H and M. In contrast, we denote the TD-driven scores as
H̄ and M̄. More details and discussions are described in
Appendix B.
Results. Figure 2 shows the distributions of the scores
calculated with TD and with the model snapshot. We observe
that the TD-driven scores (H̄, M̄) successfully separate
the major- and minor-class samples, whereas the snapshot-based
scores (H, M) fail to do so. We conclude that, compared
to model snapshots, TD is more helpful in separating
uncertain samples from certain ones.
3.2. Theoretical Evidence
Theorem 1. (Informal) Under the LE-SDE framework
[54], with the assumption of local elasticity [17], certain
samples and uncertain samples exhibit different TD; in
particular, certain samples converge more quickly than
uncertain samples.
The above theorem discusses different model behaviors
depending on the difficulty of the sample. Compared to
an uncertain sample, a certain sample has same-class
samples nearby, which is the fundamental idea behind the
level set estimation [22] and nearest neighbor [36] literature.
We suspect that, due to the local elasticity of deep nets,
nearby samples have a bigger impact on the certain sample,
hence changing its predicted probability more rapidly. As
the certain sample is quicker to converge, its TD is larger
than that of the uncertain sample. Intuitively, the slower a
sample is to train, the more the classifier struggles to learn
it; hence, TD captures the uncertainty from the classifier's
perspective.
Theorem 2. (Informal) Estimators such as entropy (Equation
10) and margin (Equation 11) successfully capture the
difference in TD between easy and hard samples, even in
cases where the samples cannot be distinguished via the
predicted probabilities of the model snapshot.
The above theorem discusses the validity of entropy and
margin on whether they can successfully differentiate be-
tween two samples of different TD but with the same final
prediction. With Theorem 1, one can conclude that the com-
mon estimators’ scores calculated with TD are effective in
capturing the data uncertainty. Due to the space constraints,
we provide the details of the above results in Appendix A.
4. Utilizing TD for Active Learning
As tracking the TD of all the unlabeled data is compu-
tationally infeasible, we devise an efficient method to es-
timate the TD of unlabeled samples. We train a module
that directly predicts the TD of each sample by feeding it
the training samples, whose TD is freely available during
training. Then, based on the predicted TD of each unlabeled
sample, we use the common estimators, entropy or margin,
to determine which samples are the most uncertain so that
human annotators can label them. Hence, in this section,
we describe the details of the module that estimates
TD (§4.1) and how to train it (§4.2). Finally, we describe
how uncertainties for active learning are calculated using
the module's predictions (§4.3).
4.1. Training Dynamics Prediction Module
As mentioned, it is not computationally feasible to track
TD for large-scale unlabeled data, as doing so requires model
inference on all the unlabeled data at every training epoch.
Thus, we propose the TD prediction module m to efficiently
predict the TD of unlabeled data at the t-th epoch. Influenced
by previous studies [11,52,45,25] that use additional
modules to predict useful values such as loss or confidence
from the target model outputs, we aggregate multi-scale
feature maps and pass them into our TD prediction module.
The module produces the C-dimensional prediction:

p̃_m^(t) = [p̃_m^(t)(1|x), ..., p̃_m^(t)(C|x)]^T ∈ [0,1]^C,   (6)

estimating the actual TD p̄^(t) of the given sample x in
Equation 5. The TD prediction module is jointly trained with
the target classifier using a handful of parameters, incurring
a negligible computational cost during training. The detailed
architecture of the module is described in Appendix C.
Even though the architecture is similar to previous works
[52,45,25], we observed that our module is much more stable
during optimization and easier to train. We suspect that
this is due to the difference in target tasks: previous works
trained modules that output only a single value via regression,
whereas our module outputs a C-dimensional probability
distribution, which is similar to the main task of classifying
images.
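To make the idea concrete, here is a hypothetical NumPy sketch of such a module. The concatenation of already-pooled features and the single linear-plus-softmax head are illustrative assumptions; the actual architecture is given in Appendix C of the paper.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

class TDPredictionModule:
    """Illustrative sketch of the TD prediction module m: concatenates
    pooled multi-scale features from the target classifier and maps
    them to a C-dimensional probability vector approximating the
    predicted TD p_tilde_m^(t) of Eq. (6)."""
    def __init__(self, feat_dims, num_classes, rng=None):
        rng = rng or np.random.default_rng(0)
        d = sum(feat_dims)                       # total aggregated feature size
        self.W = rng.normal(scale=0.01, size=(d, num_classes))
        self.b = np.zeros(num_classes)

    def __call__(self, feats):
        """feats: list of pooled feature vectors, one per scale."""
        h = np.concatenate(feats)                # multi-scale aggregation
        return softmax(h @ self.W + self.b)      # C-dim prediction in [0,1]^C
```

Because the output is a probability distribution rather than a single regressed scalar, training it alongside the classifier resembles the main classification task, which may explain the stability noted above.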
4.2. Training Objectives
To train the target classifier f at the t-th epoch, we
use the cross-entropy loss function L_target on the predicted
probability p^(t) and a one-hot encoded vector y ∈ {0,1}^C
of the true label y:

L_target = L_CE(p^(t), y) = −log p^(t)(y|x).   (7)

Meanwhile, the prediction module m learns the TD of a
sample x by minimizing the Kullback–Leibler (KL) divergence
between the predicted TD p̃_m^(t) and the actual TD p̄^(t):

L_module = L_KL(p̄^(t) || p̃_m^(t))
         = Σ_{c=1}^{C} p̄^(t)(c|x) log( p̄^(t)(c|x) / p̃_m^(t)(c|x) ).   (8)

The final objective function of our proposed method is
defined as follows:

L = L_target + λ L_module,   (9)

where λ is a balancing factor to control the effect of L_module
during model training.
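Equations 7–9 combine into a single scalar objective per sample. A minimal NumPy sketch (assuming dense probability vectors and a scalar class index; not the authors' training code):

```python
import numpy as np

def total_loss(p_t, y, p_bar, p_tilde, lam=1.0, eps=1e-12):
    """Per-sample objective of Eqs. (7)-(9).
    p_t:     classifier's predicted probabilities p^(t)  (C,)
    y:       true class index
    p_bar:   actual TD p_bar^(t) of the sample          (C,)
    p_tilde: module's predicted TD p_tilde_m^(t)        (C,)
    lam:     balancing factor lambda."""
    l_target = -np.log(p_t[y] + eps)                                    # Eq. (7)
    l_module = np.sum(p_bar * np.log((p_bar + eps) / (p_tilde + eps)))  # Eq. (8)
    return float(l_target + lam * l_module)                             # Eq. (9)
```

When the module's prediction matches the actual TD exactly, the KL term vanishes and only the cross-entropy of the target classifier remains.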
4.3. Quantifying Uncertainty with TD
We argue that uncertain samples can be effectively
distinguished from unlabeled data using the predicted TD. To
verify the effectiveness of leveraging TD, we feed the
predicted TD into entropy and margin (§2) by replacing the
snapshot probability p with the TD p̄. We choose these
estimators as they are widely used for quantifying uncertainty.
Feeding p̄ in place of p into the entropy gives H̄:

H̄(p̄) = −Σ_{c=1}^{C} p̄(c|x) log p̄(c|x).   (10)
Entropy H̄ is maximized when p̄ is uniform, i.e., when the
sample is uncertain for the target classifier. Margin M̄ is
employed similarly:

M̄(p̄) = p̄(ŷ|x) − max_{c≠ŷ} p̄(c|x).   (11)

Since we do not have the true labels of unlabeled samples,
we use the labels ŷ predicted by the target classifier instead
of the true labels. There are several possible variants of M̄
depending on the definition of ŷ. We conduct experiments
to compare M̄ with its variants. The experimental details
and results are in Appendix D.4.
At the data selection phase, we use the predicted TD
p̃_m^(T) instead of the actual TD p̄^(T) in Equations 10 & 11
to estimate the TD-driven uncertainties of each unlabeled
sample x at the final epoch T. Using the uncertainty
estimated with the predicted TD, we select the most informative
samples for model training.
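The selection phase can be sketched as follows, scoring each unlabeled sample's predicted TD with Equation 10 or 11 and returning the k most uncertain indices (function name and array layout are illustrative assumptions):

```python
import numpy as np

def select_with_predicted_td(td_pred, k, score="entropy", eps=1e-12):
    """td_pred: (N, C) array of predicted TD p_tilde_m^(T), one row per
    unlabeled sample. Returns the indices of the k most uncertain samples."""
    if score == "entropy":
        # Eq. (10): higher entropy = more uncertain
        s = -np.sum(td_pred * np.log(td_pred + eps), axis=1)
    else:
        # Eq. (11) with the predicted label y_hat = argmax; the margin is
        # top-1 minus top-2 probability, and a small margin = uncertain,
        # so we negate it to rank by uncertainty.
        part = np.sort(td_pred, axis=1)
        s = -(part[:, -1] - part[:, -2])
    return np.argsort(-s)[:k]
```

Both scores rank a near-uniform predicted TD as most uncertain, so for a pool of confident and uniform rows the same sample is chosen under either estimator.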
5. Experiments
In this section, we experimentally verify the effective-
ness of our method, TiDAL, which utilizes the estimated
training dynamics from the prediction module to discern
uncertain samples from unlabeled data. We describe the de-
tailed settings and the baseline methods for our experiments
(§5.1) and show the results on both balanced (§5.2) and im-
balanced datasets (§5.3). We further analyze whether the
TD prediction module is effective for AL performance and
can successfully estimate the TD (§5.4). We end the section
by discussing the potential limitations of our method (§5.5).
5.1. Experimental Setup
Datasets. To assess the performance of our proposed
method and baseline methods, we conduct experiments on